
Thursday, July 18, 2013

Looking for something on transfer learning (a.k.a. cross-company learning)?

For the past few days I have been trying to design a new set of experiments aimed at helping prediction when there is little or no local data. I know it sounds kind of weird to talk about prediction when we have no local data, yet there is a tremendous amount of research invested in this problem. If you want to read more about the topic, look for the term "transfer learning" in the machine learning community. The basic idea of transfer learning is being able to learn when the source and target domains are different, or when the source and target tasks are different.

In the software engineering (SE) domain, transfer learning experiments have often been referred to as cross-company learning. So, in case you want to see similar studies from SE, another term to search for would be "cross-company". The problem of transfer learning in SE can be defined as having the same source and target task (i.e. prediction), yet different source and target domains. To put things into context, assume that your organization has either no training data at all or some training data without label information; then your organization would be the target domain. Also assume that there is another organization with labeled training data; that organization would be the source domain. In such a scenario it would be nice, and also cost effective, if we could make use of the information in the source domain, because both collecting our own local training data from scratch and finding the label information (a.k.a. the dependent variable) for existing local data can be very time consuming and costly.

Until now, the focus of the transfer learning studies in SE has been using labeled source domain data to predict the target domain data. However, merely handing predicted numbers to a local expert does not bring much local information that can be interpreted.

So, we started thinking about how we could bring more local domain information into transfer learning. The initial study in this direction was a position paper, published at the RAISE workshop at ICSE 2013. It essentially details how we could create synergies between different learning tasks so as to improve transfer learning. We then followed the position paper with an experimental study on software effort estimation, where we provided some evidence that it is possible to introduce interpretable local information into transfer learning studies. The paper containing the follow-up experiments was recently accepted to the PROMISE 2013 conference.

If you have read this far into the post, I can safely assume that you would be interested in seeing the abstracts, as well as the links, of the two papers mentioned above. So here you go:

The position paper:

ABSTRACT: Thanks to the ever increasing importance of project data, its collection has been one of the primary focuses of software organizations. Data collection activities have resulted in the availability of massive amounts of data through software data repositories. This is great news for the predictive modeling research in software engineering. However, widely used supervised methods for predictive modeling require labeled data that is relevant to the local context of a project. This requirement cannot be met by many of the available data sets, introducing new challenges for software engineering research. How to transfer data between different contexts? How to handle insufficient number of labeled instances? In this position paper, we investigate synergies between different learning methods (transfer, semi-supervised and active learning) which may overcome these challenges. 
Link: http://bit.ly/posPaper 

The follow up:
ABSTRACT: Background: Developing and maintaining a software effort estimation (SEE) data set within a company (within data) is costly. Often times parts of data may be missing or too difficult to collect, e.g. effort values. However, information about the past projects -although incomplete- may be helpful, when incorporated with the SEE data sets from other companies (cross data).
Aim: Utilizing cross data to aid within company estimates and local experts; Proposing a synergy between semi-supervised, active and cross company learning for software effort estimation. 
Method: The proposed method: 1) Summarizes existing unlabeled within data; 2) Uses cross data to provide pseudo-labels for the summarized within data; 3) Uses steps 1 and 2 to provide an estimate for the within test data as an input for the local company experts. We use 21 data sets and compare the proposed method to existing state-of-the-art within and cross company effort estimation methods subject to evaluation by 7 different error measures. 
Results: In 132 out of 147 settings (21 data sets X 7 error measures = 147 settings), the proposed method performs as well as the state-of-the-art methods. Also, the proposed method summarizes the past within data down to at most 15% of the original data. 
Conclusion: It is important to look for synergies amongst cross company and within-company effort estimation data, even when the latter is imperfect or sparse. In this research, we provide the experts with a method that: 1) is competent (performs as well as prior within and cross data estimation methods); 2) reflects on local data (estimates come from the within data); 3) is succinct (summarizes within data down to 15% or less); 4) cheap (easy to build). 





Monday, July 15, 2013

Distributed Development Considered Harmful?

Distributed development is a good use of contemporary collaboration technology and the global talent pool of software engineers. However, it also comes with various challenges.

Imagine reducing costs dramatically by having multiple teams that are thousands of miles apart, living in completely different time zones. Saving money is good, but also imagine what such a distributed development scenario may lead to: difficulty in inter-team communication, lack of coordination due to those communication difficulties and, ultimately, more post-release bugs. Therefore, it is important to take a look at the effects of distributed development. In other words, do we benefit from distributed development or does it come with unwanted side effects?

During my 2012 summer internship at Microsoft Research (MSR) in Redmond, I worked on distributed development and was lucky to collaborate with the incredible researchers of the Empirical Software Engineering (ESE) group. It was a great learning experience and I am particularly thankful to Thomas Zimmermann, Christian Bird, Nachiappan Nagappan and Tim Menzies for their mentorship and support. A summary of the research we conducted on distributed development was recently published in the "Software Engineering in Practice" track of the International Conference on Software Engineering (ICSE). Below are the abstract and the link to the paper. I hope you enjoy it.


ABSTRACT: 
We offer a case study illustrating three rules for reporting research to industrial practitioners. Firstly, report “relevant” results; e.g. this paper explores the effects of distributed development on software products.
Second: “recheck” old results if new results call them into question. Many papers say distributed development can be harmful to software quality. Previous work by Bird et al. allayed that concern but a recent paper by Posnett et al. suggests that the Bird result was biased by the kinds of files it explored. Hence, this paper rechecks that result and finds significant differences in Microsoft products (Office 2010) between software built by distributed or collocated teams. At first glance, this recheck calls into question the widespread practice of distributed development.
Our third rule is to “reflect” on results to avoid confusing practitioners with an arcane mathematical analysis. For example, on reflection, we found that the effect size of the differences seen in the collocated and distributed software was so small that it need not concern industrial practitioners.
Our conclusion is that at least for Microsoft products, distributed development is not considered harmful.
The pdf of the article can be found here: http://dl.acm.org/citation.cfm?id=2486908



Thursday, January 24, 2013

How big is your data?

It has been a while since my last post and today seems like a good day to write something new. In this post I want to talk about a concept called "popularity" and how it can help us reduce the amount of data we are looking at. I know the name "popularity" may sound a bit non-sciency, but bear with me; you will see that it is a pretty interesting idea. In fact, it is the idea we used to develop an active learning solution called QUICK, which was accepted to IEEE Transactions on Software Engineering (link).

Popularity
I think one of the best ways to describe the popularity concept is to start with a toy example. Assume that we have a data set of N instances (rows) and F features (columns). Say we find the k nearest neighbors of every instance; for the sake of this example let's fix k=1, in other words we find the 1st nearest neighbor of every instance. Every time an instance is picked as another instance's 1st nearest neighbor, we mark the picked instance. At the end we count how many marks each instance received, and that is our so-called "popularity" (or "popularity count"). For example, if instance A was marked as the nearest neighbor of 5 other instances, then its popularity is 5.
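
To make the idea concrete, here is a minimal sketch of how such popularity counts could be computed. This is my own illustration on made-up data, not the code used in the paper:

```python
# A minimal sketch of the "popularity" count with k=1 (hypothetical toy data).
import numpy as np

def popularity_counts(X):
    """For each row, find its single (k=1) nearest neighbor by Euclidean
    distance and count how often each row is picked as someone's neighbor."""
    n = len(X)
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance to every row
        dists[i] = np.inf                          # exclude the row itself
        counts[np.argmin(dists)] += 1              # mark the nearest neighbor
    return counts

X = np.random.rand(10, 4)    # 10 instances (rows), 4 features (columns)
print(popularity_counts(X))  # e.g. an instance with 5 marks has popularity 5
```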

Of course, reading the above informal description may raise various questions: how do we calculate the distance between instances to find nearest neighbors, how do we use the popularity concept, what is its relation to active learning, and so on. The next paragraphs briefly answer these questions.

How QUICK works
To begin with, the concept of popularity and the QUICK algorithm (built on top of the popularity concept) are really simple. Popularity uses the plain Euclidean distance function to calculate the distances between instances and marks each instance's k=1 nearest neighbor. The more important question is how to use the popularity counts of the instances (i.e. how many marks each instance received as the 1st nearest neighbor). We use the popularity concept for two purposes: selecting features and selecting instances, which are also the two steps of the QUICK algorithm.

Selecting Features: Remember that we defined our data set at the beginning of this post so that the instances correspond to rows and the features correspond to columns. In the feature selection phase, we transpose the data set. This gives us a new space where the features are treated like instances and the instances are treated like features. Once we calculate the popularity counts in this transposed space, we keep only the non-popular instances (popularity counts of zero), which are the non-popular features of the original data set. The idea behind selecting the non-popular instances of the transposed data set is to select the features of the original data set that are very different from all the other features; being unlike the other features, they are expected to carry different information.
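
A short sketch of this step, reusing popularity_counts() from the snippet above (again my own illustration of the idea, not the authors' implementation):

```python
def select_features(X):
    """Treat features as rows by transposing, compute their popularity, and
    keep only the features that no other feature picks as its nearest
    neighbor (popularity count of zero)."""
    counts = popularity_counts(X.T)      # features become the instances
    keep = np.where(counts == 0)[0]      # non-popular == unlike the rest
    return keep, X[:, keep]

cols, X_reduced = select_features(X)     # X from the previous snippet
print("selected feature columns:", cols)
```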

Selecting Instances: This phase is where active learning comes in. To put it in very simple terms, the idea behind active learning is the following: the dependent variable (a.k.a. the label) is unknown and an oracle can be queried for it. However, each query comes with a cost, so we want to ask for as few labels as possible.

In the instance selection phase we use only the features selected in the previous phase and calculate the popularity counts of the instances. The instances are then sorted from the most popular to the least popular, and we ask for their labels starting from the most popular instance and working down to the least popular. As soon as asking for more labels yields no performance gain, we stop.

QUICK's estimate for a test instance is simply the label of the test instance's nearest neighbor among the labeled popular instances.
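
The sketch below illustrates both the active-learning query loop and the final 1-NN estimate. It reuses popularity_counts() from the first snippet; ask_oracle is a stand-in for the expert who supplies a label at some cost, and the leave-one-out stopping rule is my own simplification of "no more performance gain", not the paper's exact procedure:

```python
def loo_error(X, labeled, labels):
    """Leave-one-out 1-NN error over the labeled pool: a simple proxy for
    'performance gain' used to decide when to stop querying."""
    if len(labeled) < 2:
        return np.inf
    errs = []
    for i in range(len(labeled)):
        rest = labeled[:i] + labeled[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        d = np.linalg.norm(X[rest] - X[labeled[i]], axis=1)
        errs.append(abs(rest_labels[int(np.argmin(d))] - labels[i]))
    return float(np.mean(errs))

def select_instances(X, ask_oracle, tol=1e-6):
    """Query labels from the most popular instance downwards; stop as soon
    as the leave-one-out error no longer improves."""
    order = np.argsort(-popularity_counts(X))   # most popular first
    labeled, labels = [], []
    best_err = np.inf
    for idx in order:
        labeled.append(int(idx))
        labels.append(ask_oracle(int(idx)))     # each query costs the expert time
        err = loo_error(X, labeled, labels)
        if err > best_err - tol:                # no more gain: stop asking
            break
        best_err = err
    return labeled, labels

def quick_estimate(x_test, X, labeled, labels):
    """QUICK's estimate: the label of the test instance's nearest neighbor
    among the labeled popular instances."""
    d = np.linalg.norm(X[labeled] - x_test, axis=1)
    return labels[int(np.argmin(d))]
```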

Underlying Dimensionality
At the end of the feature and instance selection phases, we end up with a small subset of the original features and instances. In a data set of N instances and F features, the selected instances and features can be denoted N' and F', respectively. N' and F' define the underlying dimensionality of the data set in the context of the QUICK algorithm. An interesting thing to check is the ratio (N'*F')*100/(N*F), which tells us what percent of the cells in the original data set is identified as the essential content. For the effort data sets we experimented with, the essential content can be as low as 4% of the original data set; over the 18 data sets we used, the median essential content was 11% of the original data set.
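
As a quick worked example of that ratio (the numbers below are made up for illustration, not results from the paper):

```python
# Hypothetical numbers just to show how the essential-content percentage works out.
N, F = 100, 20            # original data: 100 instances, 20 features
N_sel, F_sel = 30, 3      # instances and features kept after selection
percent = (N_sel * F_sel) * 100.0 / (N * F)
print(f"essential content = {percent:.1f}% of the original cells")  # 4.5%
```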

The next question is the performance of the QUICK method: by using only a small percentage of the original data, can we still do as well as standard learners like k-NN and CART? Our experiments showed that we can, and that is the beauty of it. We can find a really small subset of the data that contains the essential content and still perform as well as standard learners that use all the data. To summarize, our data sets may not be as big as we think they are!

If you have read until this point, I believe I have your attention and may be able to convince you to take a look at our paper for many more details. In case I have convinced you, below is the abstract and here is the link to the paper.


Active Learning and Effort Estimation: Finding the Essential Content of Software Effort Estimation Data
Ekrem Kocaguneli, Student Member, IEEE, Tim Menzies, Member, IEEE, Jacky Keung, Member, IEEE, David Cok, Member, IEEE, Ray Madachy, Member, IEEE
Background: Do we always need complex methods for software effort estimation (SEE)? 
Aim: To characterize the essential content of SEE data; i.e. the least number of features and instances required to capture the information within SEE data. If the essential content is very small then (1) the contained information must be very brief and (2) the value-added of complex learning schemes must be minimal. 
Method: Our QUICK method computes the Euclidean distance between rows (instances) and columns (features) of SEE data; then prunes synonyms (similar features) and outliers (distant instances); then assesses the reduced data by comparing predictions from (1) a simple learner using the reduced data and (2) a state-of-the-art learner (CART) using all data. Performance is measured using hold-out experiments and expressed in terms of mean and median MRE, MAR, PRED(25), MBRE, MIBRE, or MMER. 
Results: For 18 data sets, QUICK pruned 69% to 96% of the training data (median=89%). K=1 nearest neighbor predictions (in the reduced data) performed as well as CART’s predictions (using all data). 
Conclusion: The essential content of some SEE data sets is very small. Complex estimation methods may be over-elaborate for such data sets and can be simplified. We offer QUICK as an example of such a simpler SEE method.

Wednesday, April 4, 2012

New Paper Accepted to Automated Software Engineering Journal

I am very glad to learn today that our paper has been accepted to the Automated Software Engineering journal. This paper proposes a nice ordering scheme for various machine learning algorithms used in software effort estimation. Furthermore, it includes a sanity-check section, in which we propose an evaluation scheme that uses two statistical tests to cluster algorithms whose performances are not statistically significantly different. I am very lucky to work with my supervisor Dr. Tim Menzies and our brilliant collaborator Dr. Jacky Keung on this project. Below is the abstract of the paper and here is the link to the latest version (not yet the camera-ready version).
Abstract
Background: Conclusion instability in software effort estimation (SEE) refers to the inconsistent results produced by a diversity of predictors using different datasets. This is largely due to the “ranking instability” problem, which is highly related to the evaluation criteria and the subset of the data being used.
Aim: To determine stable rankings of different predictors. 
Method: 90 predictors are used with 20 datasets and evaluated using 7 performance measures, whose results are subject to Wilcoxon rank test (95%). These results are called the “aggregate results”. The aggregate results are challenged by a sanity check, which focuses on a single error measure (MRE) and uses a newly developed evaluation algorithm called CLUSTER. These results are called the “specific results.” 
Results: Aggregate results show that: (1) It is now possible to draw stable conclusions about the relative performance of SEE predictors; (2) Regression trees or analogy-based methods are the best performers. The aggregate results are also confirmed by the specific results of the sanity check. 
Conclusion: This study offers means to address the conclusion instability issue in SEE, which is an important finding for empirical software engineering.