Thursday, July 18, 2013

Looking for something on transfer learning (a.k.a. cross-company learning)?

For the past few days I have been trying to design a new set of experiments that aim to help prediction when there is little or no local data. I know it sounds kind of weird to talk about prediction when we have no local data, yet a tremendous amount of research has been invested in this problem. If you want to read more about the topic, look for the term "transfer learning" in the machine learning community. The basic idea of transfer learning is to be able to learn when the source and target domains are different, or when the source and target tasks are different.

In the software engineering (SE) domain, transfer learning experiments have often been referred to as cross-company learning. So, in case you want to see similar studies from SE, another term to search for would be "cross-company". The problem of transfer learning in SE could be defined as having the same source and target task (i.e. prediction), yet different source and target domains. To put things into context, let's assume that your organization has either no training data at all, or has some training data but no label information; your organization would then be the target domain. Let's also assume that there is another organization whose training data does have labels; that organization would be the source domain. In such a scenario, it would be nice, and also cost effective, if we could make use of the information in the source domain, because both collecting our own local training data from scratch and finding the label information (a.k.a. dependent variable information) for existing local data may be very time consuming and costly.
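
To make the source/target setup concrete, here is a minimal sketch of a cross-company prediction: labeled projects from a source organization are used to estimate the missing label of a target project. The nearest-neighbor estimator, the feature choice (size and team size) and all the numbers are my own illustrative assumptions, not data or methods from any of the studies mentioned in this post.

```python
import math

def knn_estimate(source_rows, source_labels, target_row, k=2):
    """Estimate a label for target_row as the mean label of its k nearest source rows."""
    ranked = sorted(
        zip(source_rows, source_labels),
        key=lambda pair: math.dist(pair[0], target_row),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return sum(nearest_labels) / len(nearest_labels)

# Hypothetical source-domain projects: (size in KLOC, team size), with effort labels
source_rows = [(10.0, 3.0), (12.0, 4.0), (50.0, 10.0), (55.0, 12.0)]
source_labels = [20.0, 24.0, 120.0, 130.0]  # hypothetical effort in person-months

# Hypothetical target-domain project: features are known, the effort label is not
target_row = (11.0, 3.0)

estimate = knn_estimate(source_rows, source_labels, target_row, k=2)
print(estimate)
```

Note that the output is just a number for the local expert, which is exactly the limitation discussed next: nothing in it reflects the target organization's own data.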

Until now, the focus of transfer learning studies in SE has been on using labeled source domain data to predict the target domain data. However, merely handing predicted numbers to a local expert does not give that expert much local information to interpret.

So, we started thinking about how we could bring more local domain information into transfer learning. The initial study in this direction was a position paper of sorts, published at the RAISE workshop at ICSE 2013. This position paper essentially describes how we could create synergies between different learning tasks so as to improve transfer learning. We then followed it up with experiments on software effort estimation, where we provided some evidence that it is possible to introduce interpretable local information into transfer learning studies. The paper containing the follow-up experiments was recently accepted to the PROMISE 2013 conference.

If you have read this far, then I can safely assume that you would be interested in seeing the abstracts, as well as having the links, of the two papers that I mentioned in the previous paragraphs. So here you go:

The position paper:

ABSTRACT: Thanks to the ever increasing importance of project data, its collection has been one of the primary focuses of software organizations. Data collection activities have resulted in the availability of massive amounts of data through software data repositories. This is great news for the predictive modeling research in software engineering. However, widely used supervised methods for predictive modeling require labeled data that is relevant to the local context of a project. This requirement cannot be met by many of the available data sets, introducing new challenges for software engineering research. How to transfer data between different contexts? How to handle insufficient number of labeled instances? In this position paper, we investigate synergies between different learning methods (transfer, semi-supervised and active learning) which may overcome these challenges. 

The follow up:

ABSTRACT: Background: Developing and maintaining a software effort estimation (SEE) data set within a company (within data) is costly. Often times parts of data may be missing or too difficult to collect, e.g. effort values. However, information about the past projects -although incomplete- may be helpful, when incorporated with the SEE data sets from other companies (cross data).
Aim: Utilizing cross data to aid within company estimates and local experts; Proposing a synergy between semi-supervised, active and cross company learning for software effort estimation. 
Method: The proposed method: 1) Summarizes existing unlabeled within data; 2) Uses cross data to provide pseudo-labels for the summarized within data; 3) Uses steps 1 and 2 to provide an estimate for the within test data as an input for the local company experts. We use 21 data sets and compare the proposed method to existing state-of-the-art within and cross company effort estimation methods subject to evaluation by 7 different error measures. 
Results: In 132 out of 147 settings (21 data sets X 7 error measures = 147 settings), the proposed method performs as well as the state-of-the-art methods. Also, the proposed method summarizes the past within data down to at most 15% of the original data. 
Conclusion: It is important to look for synergies amongst cross company and within-company effort estimation data, even when the latter is imperfect or sparse. In this research, we provide the experts with a method that: 1) is competent (performs as well as prior within and cross data estimation methods); 2) reflects on local data (estimates come from the within data); 3) is succinct (summarizes within data down to 15% or less); 4) cheap (easy to build).
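
The three steps listed under "Method" in the abstract above can be sketched roughly as follows. This is only an illustration of the pipeline's shape under my own assumptions: the greedy distance-based summarizer and the 1-nearest-neighbor labeling are simple stand-ins, not the techniques used in the paper, and every function name and data value below is hypothetical.

```python
import math

def summarize(rows, radius):
    """Step 1 (stand-in): keep a row only if no already-kept row lies within radius."""
    kept = []
    for row in rows:
        if all(math.dist(row, kept_row) > radius for kept_row in kept):
            kept.append(row)
    return kept

def nearest_label(labeled, query):
    """Return the label of the labeled row closest to query (1-nearest neighbor)."""
    _, label = min(labeled, key=lambda pair: math.dist(pair[0], query))
    return label

def estimate(within_unlabeled, cross_labeled, test_row, radius):
    # Step 1: summarize the unlabeled within-company data
    summary = summarize(within_unlabeled, radius)
    # Step 2: borrow pseudo-labels for the summary from the labeled cross-company data
    pseudo = [(row, nearest_label(cross_labeled, row)) for row in summary]
    # Step 3: estimate the within-company test project from the pseudo-labeled summary
    return nearest_label(pseudo, test_row), summary

# Hypothetical data: (size in KLOC, team size); effort labels exist only for cross data
within_unlabeled = [(10.0, 3.0), (10.5, 3.0), (11.0, 3.5), (50.0, 10.0), (51.0, 10.0)]
cross_labeled = [((12.0, 4.0), 24.0), ((48.0, 9.0), 115.0)]
test_row = (49.0, 11.0)

value, summary = estimate(within_unlabeled, cross_labeled, test_row, radius=5.0)
print(value, summary)
```

The point of the design is that the final estimate is tied to a specific within-company project in the summary, so the local expert gets a small set of familiar projects to inspect rather than a bare number.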