Wednesday, June 10, 2015

It has been some time... Or what I have been up to...

Since my last post, I think I have given up on posting blog entries for a while. It was a pretty busy time frame though: Moving to Seattle for Microsoft, finishing up a couple of final papers and writing for 2 book projects. Plus I think I did more paper reviews than I had done in my entire grad school time, pretty time consuming stuff those reviews... And of course these are somewhat side projects, for the past 2 years I have been working as a software engineer under Bing Ads providing metrics and data insight for core Bing Ads teams. It has been an awesome experience on its own right so far, but it deserves another post to share some ideas about academia and industry.

Outside of work, most of my time was dedicated to book projects that I was lucky enough to be involved in. The first book that I had the chance to co-author, with amazing collaborators Tim Menzies, Leandro Minku, Fayola Peters and Burak Turhan, is called "Sharing Data and Models in Software Engineering" (here is the Amazon link: http://bit.ly/shrngDtMdls). I would like to think that different parts of this book speak to different levels of comfort with various data mining and machine learning techniques. The book is mainly divided into 4 parts. The first two parts start with high level discussions of how to approach a data science project, what factors to consider in terms of user engagement and data collection, then it continues with a brief tutorial on common concepts such as instance and feature selection, algorithms for prediction and clustering and so on. The other two parts deal with more advanced material such as transfer learning, active learning and data privacy. The whole book is a nice overhaul of the practical research we have been doing in the past couple of years. Particularly for data scientists and researchers with a focus on software engineering data, the book can provide a nice revisit of fundamental concepts as well as some more advanced ideas to try out on their data sets.

Machine Learning and Data Science Conference
at Microsoft, explaining the steps, ideas and pitfalls
 that led to our chapter  in
"The Art and Science of Analyzing Software Data"
Another project that I was lucky to take part is somewhat a sequel to the first book, which is called "The Art and Science of Analyzing Software Data: Analysis Patterns" (the Amazon link: http://bit.ly/analysisPatterns). This book is the result of a joint effort from various prominent research groups in the junction of data science and software engineering and it was edited by Christian Bird, Tim Menzies and Thomas Zimmermann. Besides awesome folks involved in this project, what makes this book really cool is that it is a distillation of various research groups' approaches to data science. In other words, it was a reflection time for the researchers to step back and ask: "Hey, what were we doing all along in all these projects? What are the patterns that we instinctively follow, can we distill this information into a formal structure and share?" My contribution to this book is being a co-author of a chapter with the amazing co-authors: Ayse Basar Bener, Ayse Tosun Misirli, Bora Caglayan and Gul Calikli. We have worked together in multiple data science projects in a number of different companies and successfully delivered tangible business outcomes for these companies. In our chapter entitled "Lessons Learned from Software Analytics in Practice", we share the patterns that worked for us in multiple data science projects. Of course there were a ton of pitfalls that we learned along the way, which are probably as important as the patterns to be aware of. In this chapter we also share these pitfalls and our recommendations as to how we have avoided them in the past.

In case you are embarking on a career in data science or you want to see how other research groups in software engineering have approached data science projects, you may also want to take a look at these books.



Thursday, July 18, 2013

Looking for something on transfer learning (a.k.a. cross-company learning) ?

For the past few days I have been trying to design a new set of experiments aiming to help prediction when there is no or very little local data. I know it sounds kind of weird to talk about prediction, when we have no local data. Yet, there is a tremendous amount of research invested in this problem. If you want to read more about the topic, you may look for the term "transfer learning" in the machine learning community. The basic idea of transfer learning is being able to learn when the source and the target domains are different or when the source and the target tasks are different. 

In software engineering (SE) domain, the transfer learning experiments have often been referred to as cross-company learning. So, in case you want to see similar studies from SE, another term to search for would be "cross-company". The problem of transfer learning in SE could be defined as having the same source and target task (i.e. prediction), yet having different source and target domains. To put things into context, let's assume that your organization has either no training data at all or has some training data, but no label information; then, your organization would be the target domain. Also, let's assume that there is another organization with training data that also has labels, then this organization would be the source domain. In such a scenario, it would be nice and also cost effective, if we could make use of the information of the source domain. Because, both collecting our own local training data from scratch or finding the label information (a.k.a. dependent variable information) for existing local data may be very time consuming and costly.

Until now, the focus of the transfer learning studies in SE have been using labeled source domain data and being able to predict the target domain data. However, merely giving predicted numbers to a local expert does not bring much local information that can be interpreted. In other words, merely predicting a local test instance does not provide any information to be interpreted.

So, we started thinking how we could bring more local domain information into transfer learning. The initial study in this direction was some sort of a position paper, which was published at the RAISE workshop in ICSE 2013. This position paper essentially provides details regarding how we could provide synergies between different learning tasks so as to improve transfer learning. Then we followed this position paper with an experimentation on software effort estimation, where we provided some evidence that it is possible to introduce interpretable local information into transfer learning studies. The paper containing the follow up experiments is recently accepted to PROMISE 2013 conference. 

If you have read this far of this post, then I can safely assume that you would be interested in seeing the abstracts as well as having the links of the two papers that I mentioned in the previous paragraphs. So here you go:

The position paper:

ABSTRACT: Thanks to the ever increasing importance of project data, its collection has been one of the primary focuses of software organizations. Data collection activities have resulted in the availability of massive amounts of data through software data repositories. This is great news for the predictive modeling research in software engineering. However, widely used supervised methods for predictive modeling require labeled data that is relevant to the local context of a project. This requirement cannot be met by many of the available data sets, introducing new challenges for software engineering research. How to transfer data between different contexts? How to handle insufficient number of labeled instances? In this position paper, we investigate synergies between different learning methods (transfer, semi-supervised and active learning) which may overcome these challenges. 
Link: http://bit.ly/posPaper 

The follow up:
ABSTRACT: Background: Developing and maintaining a software effort estimation (SEE) data set within a company (within data) is costly. Often times parts of data may be missing or too difficult to collect, e.g. effort values. However, information about the past projects -although incomplete- may be helpful, when incorporated with the SEE data sets from other companies (cross data).
Aim: Utilizing cross data to aid within company estimates and local experts; Proposing a synergy between semi-supervised, active and cross company learning for software effort estimation. 
Method: The proposed method: 1) Summarizes existing unlabeled within data; 2) Uses cross data to provide pseudo-labels for the summarized within data; 3) Uses steps 1 and 2 to provide an estimate for the within test data as an input for the local company experts. We use 21 data sets and compare the proposed method to existing state-of-the-art within and cross company effort estimation methods subject to evaluation by 7 different error measures. 
Results: In 132 out of 147 settings (21 data sets X 7 error measures = 147 settings), the proposed method performs as well as the state-of-the-art methods. Also, the proposed method summarizes the past within data down to at most 15% of the original data. 
Conclusion: It is important to look for synergies amongst cross company and within-company effort estimation data, even when the latter is imperfect or sparse. In this research, we provide the experts with a method that: 1) is competent (performs as well as prior within and cross data estimation methods); 2) reflects on local data (estimates come from the within data); 3) is succinct (summarizes within data down to 15% or less); 4) cheap (easy to build). 





Monday, July 15, 2013

Distributed Development Considered Harmful?

Distributed development is a good use of contemporary collaboration technology and the universal talent pool of software engineers. However, it  also comes with various possible challenges.

Imagine reducing costs dramatically by having multiple teams that are thousands of miles away from each other, living in completely different time zones. Saving money is good, but also imagine what such a distributed development scenario may lead to: Difficulty in inter-team communication, lack of coordination due to communication difficulties and ultimately higher post-release bugs. Therefore, it is important to take a look at the effects of distributed development. In other words, do we benefit from distributed development or does it come with unwanted side effects?

During my 2012 summer internship at Microsoft Research (MSR) at Redmond, I researched on distributed development and I was lucky to work with the incredible researchers of the Empirical Software Engineering (ESE) Group. It was a great learning experience and I am particularly thankful to Thomas Zimmermann, Christian Bird, Nachiappan Nagappan and Tim Menzies for their mentorship and support. A summary of the research that we have conducted regarding distributed development is recently published in the "Software Engineering in Practice" track of "International Conference on Software Engineering". Below is the abstract and the link of this paper. I hope you enjoy it.


ABSTRACT: 
We offer a case study illustrating three rules for reporting research to industrial practitioners. Firstly, report “relevant” results; e.g. this paper explores the effects of distributed development on software products.
Second: “recheck” old results if new results call them into question. Many papers say distributed development can be harmful to software quality. Previous work by Bird et al. allayed that concern but a recent paper by Posnett et al. suggests that the Bird result was biased by the kinds of files it explored. Hence, this paper rechecks that result and finds significant differences in Microsoft products (Office 2010) between software built by distributed or collocated teams. At first glance, this recheck calls into question the widespread practice of distributed development.
Our third rule is to “reflect” on results to avoid confusing practitioners with an arcane mathematical analysis. For example, on reflection, we found that the effect size of the differences seen in the collocated and distributed software was so small that it need not concern industrial practitioners.
Our conclusion is that at least for Microsoft products, distributed development is not considered harmful.
The pdf of the article can be found here: http://dl.acm.org/citation.cfm?id=2486908



Tuesday, May 7, 2013

In the market for a non-academic job after a CS (or possibly an EE) PhD?

Anyone reading this title may immediately say: "Why on earth are you looking for a non-academic job, if you enrolled to a PhD program?" That is an entirely valid question and I have a couple of arguments as a reply, one of which starts like this: "Some of the current problems in the industry often requires computer scientists with a PhD level knowledge." Another one starts with: "In the industry, product groups are likely to face problems that can benefit from the research background of a PhD graduate." One may also ask: "Is an advanced graduate degree a must to tackle difficult problems?" My answer is simply no, it really is "not" a must, but it doesn't hurt.

If you made it past the previous paragraph, you probably expect this paragraph to also talk about the pro's and con's of a graduate degree, its relation to industry etc. I am sorry to disappoint you, but this paragraph will talk about something else. Although I believe that getting your PhD will help you in the industry, when you start applying to industrial positions, you will quickly realize (as I did) that the application as well as the evaluation process is much different than the academic processes. The difference is completely fine, yet after finishing my grad school, I needed some time to brush up on my undergraduate CS fundamentals. Also, I experienced that solving as many coding questions as possible is really really helpful. There are a ton of online sources, so I am not sure how helpful this post will be. However, for what its worth, I wanted to share some of the sources that I found particularly helpful.

It is probably a good idea to have an easy start, so the following link with easy to medium questions was a good start point for me: http://goo.gl/34lpH

Once you are through these easy to medium questions, you may want to take a look at more challenging examples that repeatedly appear as interview questions. A good friend of mine, Arden Dertat, has a really cool website that I benefited a lot. Here is the link: http://goo.gl/s7NpG

Also, if you want even more examples and challenges, here is a link to a set of 50 questions prepared by Xiu Zichuan: http://goo.gl/yrKVd These questions come with answers, the code of the answers, suggested time to solve the questions and even the difficulty level of each question!

In case you want something more organized and something that talks about the entire industry job search process, then I can recommend the following books: "Programming Interviews Exposed"
and "Cracking the Coding Interview". Here are the Amazon links: http://goo.gl/oYAGI and http://goo.gl/fF3SJ

I am not sure how helpful it is to solve each and every one of these questions, so I only solved the ones that I thought were possible candidates during the interviews. So, in case you want to get a feel of the latest questions that appeared in the interviews, www.glassdoor.com and www.careercup.com are quite good and up-to-date sources.

I hope this post can help a little bit and good luck with the job search!


Thursday, January 24, 2013

How big is your data?

It has been a while since my last post and today seems like a good day to write something new. In this post I want to talk about a concept called "popularity" and how it can aid us reduce the amount of data we are looking at. I know the name "popularity" may sound a bit non-sciency, but bear with me, you will see that it is a pretty interesting idea. In fact, it is the idea that we used to develop an active learning solution called QUICK, which is accepted to IEEE Transactions on Software Engineering (link). 

Popularity
I think one of the best ways to describe the popularity concept is to start with a toy example. Assume that we have a data set of N instances (rows) and D features (columns). Let's say that we found out the k-many nearest neighbors of every instance. For the sake of this example let's fix k=1, in other words we find the 1st nearest neighbor of every instance. Every time we pick an instance as another instance's 1st nearest neighbor, we mark the picked-up instance. At the end we count how many marks each instance received and that is our so called "popularity" or "popularity count". For example, if instance A was marked as the nearest neighbor of 5 other instances, then its popularity is 5. 

Of course reading the above non-formal discussion may result in various questions such as: How to calculate the distance between instances to find nearest neighbors, how to use the popularity concept, what is its relation to active learning and so on. Next paragraphs are the brief answers to these questions.

How to of QUICK
To begin with, the concept of popularity and the QUICK algorithm (developed on top of the popularity concept) are really simple. Popularity uses simple Euclidean distance function to calculate the distance of the instances and it marks k=1-many nearest neighbors. The more important question is how to use the popularity counts of instances (i.e. how many marks each instance received as the 1st nearest neighbor). We opted to use the popularity concept for 2 purposes: Selecting features and selecting instances, which are also the two steps of the QUICK algorithm.

Selecting Features: Remember that we defined our data set at the beginning of this post in such a way that the instances correspond to rows and the features correspond to columns. In the feature selection phase, we transpose the data set. This gives us a data set where the features will be treated like instances and the instances will be treated like the features of the new space. Once we calculate the popularity counts, we only keep the non-popular instances (popularity counts of zero), which are the non-popular features of the original data set. The idea behind selecting non-popular instances of the transposed data set is to select the features of the original data set that are very different to all the other features. In other words in this phase, we want to select the features that are non-popular, unlike other features; hence, that are expected to have different information.

Selecting Instances: This phase is where the active learning comes in. To put in very very simple terms, the idea behind active learning is the following: The dependent variable information (a.k.a. label) is unknown and an oracle can be queried for that information. However, each query comes with a cost, so we want to ask as few labels as possible. 

In the instance selection phase we use only the features that were selected in the previous phase and calculate the popularity counts of the instances. Then instances are sorted from the most popular to the least popular. We start asking the labels for the instances starting from the most popular instance down to the least popular instance. As soon as there is no performance gain in asking for more labels, we stop asking.

Estimation of QUICK for a test instance is basically the label of the test instance's nearest neighbor among the labeled popular instances.

Underlying Dimensionality
At the end of the feature and instance selection phases, we end up with a small subset of the original features and instances. In a data set of N instances and F features, the selected instances and the features can be represented with N' and F', respectively. N' and F' define the underlying dimensionality of our data set in the context of QUICK algorithm. An interesting thing to check is the ratio of (N'*F')*100/(N*F), which basically tells us what percent of the cells in the original data set is identified to be the essential content. For the effort data sets we experimented with the essential content can be as low as the 4% of the original data set. On the median of 18 data sets we experimented with, the essential content was 11% of the original data set.

The next question to ask is the performance of the QUICK method. By using only a small percent of the original data, can we still do as good as standard learners like k-NN and CART? Our experiments showed that "we can" and that is the beauty of it. We can find a really small subset of the data that contains the essential content and still perform as good as standard learners that use all the data. To summarize, our data sets may be not as big as we think they are!

If you have read until this point, I believe I have your attention and I may convince you to take a look at our paper for much more details. In case I convinced you, below is the abstract and here is the link of the paper.


Active Learning and Effort Estimation: Finding the Essential Content
of Software Effort Estimation Data 
Ekrem Kocaguneli, Student Member, IEEE, Tim Menzies, Member, IEEE, Jacky Keung, Member, IEEE, David Cok, Member, IEEE, Ray Madachy, Member, IEEE
Background: Do we always need complex methods for software effort estimation (SEE)? 
Aim: To characterize the essential content of SEE data; i.e. the least number of features and instances required to capture the information within SEE data. If the essential content is very small then (1) the contained information must be very brief and (2) the value-added of complex learning schemes must be minimal. 
Method: Our QUICK method computes the Euclidean distance between rows (instances) and columns (features) of SEE data; then prunes synonyms (similar features) and outliers (distant instances); then assesses the reduced data by comparing predictions from (1) a simple learner using the reduced data and (2) a state-of-the-art learner (CART) using all data. Performance is measured using hold-out experiments and expressed in terms of mean and median MRE, MAR, PRED(25), MBRE, MIBRE, or MMER. 
Results: For 18 data sets, QUICK pruned 69% to 96% of the training data (median=89%). K=1 nearest neighbor predictions (in the reduced data) performed as well as CART’s predictions (using all data). 
Conclusion: The essential content of some SEE data sets is very small. Complex estimation methods may be over-elaborate for such data sets and can be simplified. We offer QUICK as an example of such a simpler SEE method.

Tuesday, November 20, 2012

About PhD Defense

Two weeks ago, I successfully defended my PhD thesis: "A Principled Methodology: A Dozen Principles of Software Effort Estimation." I have been willing to write this post right after my PhD defense, but maybe it is better that some time has passed and I had more time to think about it.

I will not  necessarily write about the details of my PhD thesis in this post (but in case you want to read the details, here is my thesis in pdf format), I rather want to discuss some of the defense presentation tips, which were particularly helpful for me. In case you are nearing your PhD defense or in case you are just starting your PhD program, I think these tips may also be helpful to you. And if they don't... Well it is not too bad either, you just lose a couple of minutes.

Trade-off
The first thing that is strange about the PhD defense presentation is that you are expected to present years of work amounting to a number of papers, in less than an hour of time. That is really an issue of trade-off. If you go deep into every chapter, every research idea and every algorithm, the time will most likely not be enough. On the other hand, only talking about the high level ideas and not providing details may make your research look as if it lacks technical depth. Of course there may be different blends of high-level discussions and low-level technical details that works best for different people, but the one that works best for me is to keep the discussion high-level for most of the chapters and go into complex technical details in 1 or 2 of points. That was in fact a recommendation from my advisor and it works like a charm. However, I recommend to be cautious about the choice of where you will go into these details. For example, a bad choice would be an algorithm or a derivation, which is pretty complex and fancy to discuss, but which you are not really a comfortable with. The point of the detail slides is to show that you are knowledgeable about the technical depths of your thesis. A good choice is an algorithm that you have already presented before, possibly at a conference, where you have also received questions about this technical topic.

Steering the Discussion
After the presentation, there will obviously be a Q&A session, which was in fact longer than the presentation itself for me. It is difficult to explicitly steer that discussion, since the type of questions depends on your committee members. However, some implicit details may help.

To begin with, -although you may not want to admit this to yourself- it is highly likely that none of your committee members will read the entirety of your thesis in detail. So, the material you put into the slides and the way you present it will make up a giant chunk of the evaluation of your work. It is crucially important to make sure that your slides correctly present your experiments and evaluations. One type of questions is about the technical and experimental side of your research, which are dangerous questions really. Because, in case your committee starts to think that your experiments are flawed or that there is a big need for extra experimentation, you may be looking at an extension or extra experiments.

Another type of questions will be more philosophical type of questions about what the impact of your research is, how it contributes to your domain and how future work can be built on your research. These are somewhat safer questions and it is better to steer the discussion into these questions, because after all it is your dissertation, so you are likely to be the most knowledgable person in the room about the philosophical aspects of your research. I believe the best way to get more of these questions is to provide a good motivation of your research in the presentation, a good summary at the end (possibly revisiting the motivation) and a visionary future work slide(s).

Temporal Memory
I am not really sure of the temporal memory discussion, so I will keep it brief. However, my experience from my defense presentation was that the slides towards the end of the presentation are remembered somewhat more and I have received a lot questions about the topics in the last slides of my presentation. That was also the observation of my advisor. It may not really be universal, but I wanted to mention it. A good place to test whether this is the case with your committee may be to observe it during your proposal presentation.

Blood Sugar!!
I know this sounds funny, but I it has a truth to it. People are usually crankier and are more easily agitated when they have low blood sugar, and the last thing you want is a cranky committee. It is customary to bring refreshments to your PhD presentation anyway. I recommend a lot of carbohydrates, possibly cookies (or some other desert-like thing) and/or pizza etc.

Well, the stuff that I described are my observations and experiences, which means they may not be applicable to someone else. But for what it's worth, I am also sharing my presentation below.. I hope it helps and good luck to all the fellow grad students!!




Monday, October 1, 2012

Ground Zero: When do I have perfect data?

Probably anyone who has dealt with real-world data has a grin on his face after reading this title. Most likely, because of the fact that they know the answer is... well, never.

Currently I am writing my PhD thesis and I am forming the chapters in the form of principles. The title is most likely going to be the first chapter. And it is going to be the chapter with the least amount of numbers, where I will be sharing my ideas regarding "how to get the perfect data".

Unfortunately there is no formula or silver-bullet answer to perfect data. This is mainly due to the fact that reaching the right data is an interplay of various different factors such as:
  • domain knowledge,
  • the people generating, using and collecting the data,
  • the particular characteristics of the data

Hence, I believe that the principles (of the first chapter) to focus are along these lines:

Principle #1, Know your domain: Each domain or product has a way of reflecting on the data. Often times, what seems to be a recurring anomaly has a reasonable explanation. However, this knowledge is usually hidden to an outsider data analyst. For example, if you are analyzing the commit information of a particular company, it is best to spend some time understanding the generation and progression of their code in their particular branching structure.

Principle #2, Let the Experts Talk: The very initial results have a higher chance of being blown by the domain experts, this is only natural. Because, the very initial results are based on the data with the most limited domain knowledge of the analyst and with limited expert input. The success of the very initial results are better to be measured by the amount of discussion they stimulate among the experts or attendees of that meeting. If it creates a discussion and you are bombarded with questions and suggestions regarding how to improve your analysis, then the analysis is on a good track and this is the time to listen to experts regarding what false assumptions were made during data collection, what other considerations should be included and so on. Furthermore, it may be a good time to know the right people to ask domain specific questions.

Principle #3, Suspect your data: Anything too good or too surprising to be true has a high chance of being not true indeed. Once an analyst accumulates enough domain knowledge and enough feedback from the domain experts, it is time for the more mechanical part of the data analysis phase. In this phase, one can inspect the data manually by looking at e.g. the min-max values of each feature or by plotting the values of each individual feature value through different methods, e.g. a box-plot. This sometimes works like a charm to identify any errors in the data collection. It also can hint which instances are likely to be outliers.

Principle #4, Data collection is spiral: Note the principles associated with data collection are not necessarily sequential. In other words, they are intermingled, which requires one to get more domain feedback or expert input in the case of data anomalies, then updating the collected data as well as the analysis, then getting expert feedback again and so on.