This attitude helps us see what sort of data is available to be collected. A pilot collection activity on a few projects will probably tell us whether the desired features (e.g. analyst capability, programmer experience, lines of code or function points) can be collected at all. After this phase, it is important to remember that the more features we add, the more instances we will need, and the more likely it becomes that some features are correlated with one another. Therefore, at this point it is important to step back and ask:
- Does each feature really convey important information related to the dependent variable? For example, for the SEE domain, does each feature tell us something about the effort/cost of a project?
- Are we able to predict these features for a future project? Even if lines of code is a good indicator of effort/cost, we should also ask whether we would be able to predict how many lines of code we will develop for an upcoming project. If not, then it does not make much sense to include such a feature in the first place.
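
The first question, together with the earlier concern about correlated features, can be checked roughly on pilot data. The following is a minimal sketch, not a prescribed method: it assumes Pearson correlation is an acceptable first screen, and the function name `screen_features`, the thresholds `min_relevance` and `max_redundancy`, the feature names and all numeric values are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)


def screen_features(features, effort, min_relevance=0.3, max_redundancy=0.9):
    """Flag features weakly related to effort and pairs that look redundant.

    Both thresholds are arbitrary illustrative defaults, not recommendations.
    """
    relevance = {name: pearson(vals, effort) for name, vals in features.items()}
    weak = sorted(n for n, r in relevance.items() if abs(r) < min_relevance)
    redundant = sorted(
        (a, b)
        for a, b in combinations(sorted(features), 2)
        if abs(pearson(features[a], features[b])) > max_redundancy
    )
    return relevance, weak, redundant


# Toy pilot data for five made-up projects (illustrative values only).
pilot = {
    "kloc": [10, 20, 30, 40, 50],
    "function_points": [100, 210, 290, 405, 500],
    "office_floor": [2, 5, 1, 5, 2],
}
effort = [12, 25, 33, 48, 55]  # e.g. person-months, also made up

relevance, weak, redundant = screen_features(pilot, effort)
print("weakly related to effort:", weak)
print("possibly redundant pairs:", redundant)
```

On these toy values the screen flags `office_floor` as weakly related to effort and reports `function_points` and `kloc` as a possibly redundant pair. A linear correlation can miss nonlinear relationships, so a flag here is a prompt for closer inspection rather than a verdict. The second question, whether a feature can be predicted for a future project, cannot be answered from the data alone and still needs the judgment of the people collecting it.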