Monday, March 12, 2012

What is the least you can do?

Last week my supervisor and I were discussing what the essential summary is of everything I have done so far in my PhD. This is a difficult question, and I will put together my answer(s) to it in my thesis. However, the question also made me think about something else: What would be my essential recommendation for a manager? In other words, what is the least a manager could do if s/he wanted to use some basic machine learning tricks for software effort estimation?

Compared to the first question, the second one seems easier to me. The following sentences are my personal recipe for a project manager willing to use a bit of machine learning in her/his estimation process. Before the recipe comes a caution: A least-you-can-do type of recommendation will most probably not be the best performing method you can imagine. Also, there are tons of other machine learning methods you could add to this recipe, but my concern is to keep it simple and understandable for someone with no machine learning background. So here it comes:

  1. Get a piece of paper and write down a couple of properties that are important effort/cost factors for your projects: e.g. the number of databases planned to be created, the number of classes planned to be written, the seniority of the people involved (one entry for each group such as programmers, analysts etc.). Those are just examples, and you can add more such properties.
  2. Go through the list from the previous step and eliminate the properties that are difficult to measure or that are not really relevant to your projects. Ultimately, I would say the number of final features should not exceed 10. Regardless of which features you choose, your last feature should be the amount of time (effort) that the project took. To make this value comparable across all your projects, also take into account how many people worked on the project, i.e. if the project took 5 months with 10 programmers fully dedicated to it, then the value you are looking for is 5x10 = 50 man-months.
  3. Now we start the more technical part. Open up an Excel table and write your final features as column headings. Then start filling in the values of these features for your past projects. Note that some of the features are easier to collect than others. You probably know the exact number of classes/functions written for a past project, but putting a number on the capability of the analysts who worked on that project is more difficult. I suggest using a 1-to-5 scale for such cases, where 1 means terrible and 5 means perfect. How many past projects you should have in your Excel file is an open question, but the right answer is: As many as possible.
  4. Once you have a decent table with features represented by columns and past projects represented by rows, you can make an estimate for your upcoming project (yay!!). Here is how it works: Add another row to your table for your new project and fill in all the features except the last one, the effort value. That is what we are trying to estimate anyway.
  5. Once you have filled in the features for your new project, you will need to use a bit of your math knowledge from high school. First, normalize every column (except the last one, the effort column) to a 0-1 interval. Normalization is pretty straightforward: Take the max and min values of a column. Subtract the min value from every value in that column, then divide the results by (max minus min). If everything went smoothly, your column should now only contain values between 0 and 1. If so, do this for all the columns. Remember to take a backup; you do not want to lose the original values.
  6. This step is trickier! We will need to calculate the Euclidean distance between the last row (your new project) and each of the past projects. For the Euclidean distance calculation, we will only use the normalized features. For the sake of demonstration, let's say we have only 2 features: f1 and f2. The Euclidean distance between two projects (p1 and p2) is: square-root((p1.f1 - p2.f1)^2 + (p1.f2 - p2.f2)^2).
  7. Once you ace the distance calculation, you know which past projects are closest to your next project: the ones with the smallest distances. Take a look at the effort value of the closest project; this should give you an idea. If the closest project's effort value looks far off the ballpark, then also look at the second (or maybe even the third and the fourth) closest projects and take the mean of their effort values. Using the closest project, or the mean of the few closest ones, is the practice recommended by many software effort estimation papers.
  8. Another thing to look at is the difference between the features. The closest past project will rarely be exactly the same as your next project. See the differences and try to interpret how they would affect the effort/cost.
That is all! There you have yourself a k-nearest-neighbor (a.k.a. k-NN) algorithm and a nice data set that you can analyze. I know it sounds like a lot of hassle, but once you do it you will realize that it is really not a big deal. If you want to dig further into the details of the above method, the buzzwords you should be searching for in Google Scholar are: "analogy-based estimation", "estimation by analogy", "software effort estimation".
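For readers who prefer a script to a spreadsheet, steps 5 to 7 of the recipe can be sketched in a few lines of Python. This is only a minimal illustration, not a polished tool: the function names (`normalize_columns`, `euclidean`, `estimate_effort`), the three example features and all the numbers are made up for demonstration. It assumes the new project is normalized together with the past projects, just as it would be when it sits in the last row of the table.

```python
import math

def normalize_columns(rows):
    """Step 5: min-max scale every feature column to the 0-1 interval."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        # Assumption: a column where all values are equal is mapped to 0
        # to avoid dividing by zero (it cannot separate projects anyway).
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]

def euclidean(a, b):
    """Step 6: Euclidean distance between two normalized feature rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimate_effort(past_features, past_efforts, new_features, k=1):
    """Step 7: mean effort of the k closest past projects."""
    # The new project is treated as the last row of the table.
    scaled = normalize_columns(past_features + [new_features])
    past_scaled, new_scaled = scaled[:-1], scaled[-1]
    distances = [euclidean(p, new_scaled) for p in past_scaled]
    # Indices of past projects, ordered from closest to farthest.
    ranked = sorted(range(len(past_scaled)), key=lambda i: distances[i])
    nearest = ranked[:k]
    estimate = sum(past_efforts[i] for i in nearest) / len(nearest)
    return estimate, nearest

# Hypothetical past projects: [databases, classes, analyst capability (1-5)]
past_features = [
    [2, 40, 3],
    [5, 120, 4],
    [1, 25, 2],
    [4, 90, 5],
]
past_efforts = [20, 50, 12, 40]  # effort in man-months (made-up values)
new_project = [3, 60, 3]         # effort unknown, this is what we estimate

for k in (1, 3):
    estimate, nearest = estimate_effort(past_features, past_efforts, new_project, k)
    print(f"k={k}: estimate {estimate} man-months, closest past projects {nearest}")
```

The effort column is deliberately kept out of the normalization and the distance calculation; it is only read once the closest projects are known. Trying a few values of `k` and comparing the results corresponds to looking at the second, third and fourth closest projects in step 7.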