I am currently preparing the camera-ready version of our Promise 2012 paper, and I thought it would be a good idea to write about it, because it tackles an interesting and important issue: the need for size attributes like lines of code (LOC) in software effort models. There are many skeptics about the use of LOC (and I happen to be one of them). Yet, particularly in the software effort estimation domain, LOC sits at the very heart of most models. One could object that, from a theoretical point of view, this is not a big problem: after all, LOC is just another independent feature. I have nothing to say against such an objection. But if you consider the practical implications of models requiring size attributes, the picture changes drastically. Practitioners may put very little trust in, say, LOC, or they may lack the resources to collect size information correctly. They may also be unable to predict the size of their project in the initial stages of development. Such conditions may hinder the transfer of knowledge from research to practice. In the worst case, this scenario could confine software effort estimation to being a mostly theoretical field.
In our paper at the Promise 2012 conference, we propose a method that can compensate for the lack of size attributes like LOC. If you would like to take a look at the paper, it is right here. If you are in a hurry, no problem: the abstract is right below this paragraph. As a final remark, I should probably say that I am very excited about both the idea and the model of this paper, but a little more excited about the idea: next-generation models that do not require size attributes and can even compensate for missing size attributes in retrospective data sets.
Background: Size features such as lines of code and function points are deemed essential. Are such size features a “must” for software effort estimation (SEE)?
Aim: To question the need for size features and to propose a method that compensates for their absence.
Method: A baseline analogy-based estimation method (1NN) and a state-of-the-art learner (CART) are run on reduced (no size features) and full (all features) versions of 13 SEE data sets. 1NN is augmented with a popularity-based pre-processor to create “pop1NN”. The performance of pop1NN is compared to that of 1NN and CART using 10-way cross validation with respect to MMRE, MdMRE, MAR, PRED(25), MBRE, MIBRE, and MMER.
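To make the method concrete, here is a minimal sketch of analogy-based 1NN effort estimation evaluated with two of the metrics named above, MMRE and PRED(25). Everything here is an assumption for illustration: the toy project data, the Euclidean distance, the lack of feature scaling, and the leave-one-out loop (the paper uses 10-way cross validation and several additional metrics); the popularity-based pre-processor of pop1NN is not shown.

```python
import math

def euclidean(a, b):
    # Plain Euclidean distance between two feature vectors (an assumption;
    # the paper's exact distance and normalization may differ).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train, features):
    # Analogy-based estimation: return the effort of the most similar
    # (nearest) past project.
    nearest = min(train, key=lambda p: euclidean(p["features"], features))
    return nearest["effort"]

def mmre(actuals, predictions):
    # Mean Magnitude of Relative Error: mean of |actual - predicted| / actual.
    return sum(abs(a - p) / a for a, p in zip(actuals, predictions)) / len(actuals)

def pred25(actuals, predictions):
    # PRED(25): fraction of estimates within 25% of the actual effort.
    hits = sum(1 for a, p in zip(actuals, predictions) if abs(a - p) / a <= 0.25)
    return hits / len(actuals)

# Toy projects: non-size feature vectors plus actual effort (hypothetical data).
projects = [
    {"features": [3.0, 1.0], "effort": 120.0},
    {"features": [3.1, 1.1], "effort": 130.0},
    {"features": [8.0, 4.0], "effort": 600.0},
    {"features": [7.9, 4.2], "effort": 580.0},
]

# Leave-one-out evaluation: estimate each project from all the others.
actuals, preds = [], []
for i, p in enumerate(projects):
    train = projects[:i] + projects[i + 1:]
    actuals.append(p["effort"])
    preds.append(predict_1nn(train, p["features"]))

print(round(mmre(actuals, preds), 3))  # lower is better
print(pred25(actuals, preds))          # higher is better
```

Dropping a column from each feature vector would mimic the "reduced" (no size features) condition, letting you compare the two settings with the same metrics.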
Results: Without any pre-processor, removing size features decreases the performance of both 1NN and CART. For 11 out of 13 data sets, pop1NN removes the necessity of size features: pop1NN (using reduced data) performs comparably to CART (using full data).
Conclusion: Size features are important, and their use is endorsed provided that there are enough resources/means to collect them correctly. When that is not the case, methods like pop1NN may compensate for size features and remove their necessity.