Monday, January 9, 2012

How many features do we need and how many can we use?

Obviously we need to define some features when forming a data set. In the initial phases of the data set formation, the tendency is to include as many features as possible: i.e. the attitude of "let's make sure we cover everything".

This attitude helps us see what sort of data is available to be collected. A pilot collection activity on a few projects can probably tell us whether the desired features (e.g. analyst capability, programmer experience, lines of code or function points and so on) can be collected at all. After this phase, it is important to know that the more features we add, the more instances we will need, and the more likely we are to end up with correlated features. Therefore, at this point it is important to step back and ask:
  • Does each feature really convey important information related to the dependent variable? For example, in the software effort estimation (SEE) domain, does each feature tell us something about the effort/cost of a project?
  • Are we able to predict these features for a future project? Even if lines of code is a good indicator of effort/cost, we should also ask whether we would be able to predict how many lines of code we will develop for an upcoming project. If not, then it does not make much sense to include such a feature in the first place.
Currently we are using an active-learning solution to reduce the number of features (and also the number of instances) in standard SEE data sets. The main idea in this active-learning solution is to identify popular instances and popular features with a nearest-neighbor algorithm, then get rid of the non-popular instances and the popular features. Without going into too much technical detail, I can summarize the results by saying that we are able to get rid of around 50% of the instances and features without sacrificing estimation accuracy. In other words, a substantial portion of the features and instances is not necessary for good estimation performance.
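
To make the idea of "popularity" a bit more concrete, here is a minimal sketch in Python. It is not our actual implementation, and every detail in it is an assumption made for illustration: popularity is taken to be the number of times a row is the single nearest neighbor (Euclidean distance) of some other row, instances are kept if they receive at least one vote, features are kept only if they receive none, and feature popularity is computed by running the same procedure on the transposed table. The toy numbers and the function names are made up, and the values are assumed to be on comparable scales already.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def popularity(rows):
    """For each row, count how many other rows have it as their nearest neighbor."""
    votes = [0] * len(rows)
    for i, row in enumerate(rows):
        others = (j for j in range(len(rows)) if j != i)
        nearest = min(others, key=lambda j: euclidean(row, rows[j]))
        votes[nearest] += 1
    return votes


def transpose(rows):
    """Turn a list of instances into a list of feature columns."""
    return [list(col) for col in zip(*rows)]


def popular_instances(rows):
    """Indices of instances that are the nearest neighbor of at least one other instance."""
    return [i for i, v in enumerate(popularity(rows)) if v >= 1]


def non_popular_features(rows):
    """Indices of features that are nobody's nearest neighbor (assumed keep rule)."""
    return [i for i, v in enumerate(popularity(transpose(rows))) if v == 0]


# Hypothetical toy data: 5 projects (rows) described by 3 features (columns).
projects = [
    [1.0, 1.1, 5.0],
    [1.2, 1.3, 4.0],
    [3.0, 3.1, 1.0],
    [3.1, 3.2, 0.5],
    [9.0, 9.1, 2.0],
]

kept_instances = popular_instances(projects)
kept_features = non_popular_features(projects)
print("instances kept:", kept_instances)
print("features kept:", kept_features)
```

In this toy table the last project is nobody's nearest neighbor, so it is dropped, and the first two columns are each other's nearest neighbors, so only the third column survives. A real data set would of course need normalization and a more careful choice of thresholds.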

