Monday, October 1, 2012

Ground Zero: When do I have perfect data?

Anyone who has dealt with real-world data probably has a grin on their face after reading this title, most likely because they know the answer is... well, never.

I am currently writing my PhD thesis and structuring its chapters as principles. The title of this post will most likely be the title of the first chapter. It is going to be the chapter with the fewest numbers, where I share my ideas on "how to get the perfect data".

Unfortunately, there is no formula or silver bullet for perfect data. This is mainly because arriving at the right data is an interplay of several factors, such as:
  • domain knowledge,
  • the people generating, using and collecting the data,
  • the particular characteristics of the data

Hence, I believe the principles of the first chapter should run along these lines:

Principle #1, Know your domain: Each domain or product leaves its own mark on the data. Often, what seems to be a recurring anomaly has a reasonable explanation. However, this knowledge is usually hidden from an outside data analyst. For example, if you are analyzing the commit information of a particular company, it is best to spend some time understanding how their code is created and how it progresses through their particular branching structure.

Principle #2, Let the Experts Talk: The very first results have a high chance of being shot down by the domain experts, and this is only natural: they are based on data collected with the analyst's most limited domain knowledge and with little expert input. The success of these first results is better measured by the amount of discussion they stimulate among the experts or the attendees of the meeting where they are presented. If they spark a discussion and you are bombarded with questions and suggestions on how to improve your analysis, then the analysis is on a good track. This is the time to listen to the experts about which false assumptions were made during data collection, which other considerations should be included, and so on. It may also be a good time to find out who the right people are for domain-specific questions.

Principle #3, Suspect your data: Anything too good or too surprising to be true has a high chance of indeed not being true. Once an analyst has accumulated enough domain knowledge and enough feedback from the domain experts, it is time for the more mechanical part of the data analysis phase. In this phase, one can inspect the data manually, e.g. by looking at the min-max values of each feature or by plotting the values of each individual feature with different methods, e.g. a box-plot. This sometimes works like a charm for identifying errors in the data collection. It can also hint at which instances are likely to be outliers.
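
As a rough illustration of this mechanical check, here is a minimal Python sketch. It is not taken from the thesis: the function name summarize_feature, the feature name lines_changed and the sample values are all made up for the example. It reports the min-max values of one feature and flags the points that a box-plot would draw as outliers, assuming the usual 1.5 x IQR whisker rule.

```python
from statistics import quantiles


def summarize_feature(values):
    """Return min, max, quartiles and box-plot outliers for one feature.

    Outliers are the points outside the 1.5 * IQR fences, i.e. the
    points a standard box-plot would draw beyond its whiskers.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "outliers": outliers,
    }


# Hypothetical feature: lines changed per commit (made-up values).
lines_changed = [12, 7, 30, 18, 9, 25, 14, -4, 21, 5000]

summary = summarize_feature(lines_changed)
for name, value in summary.items():
    print(f"{name}: {value}")
```

On this made-up sample, the two checks catch different problems. The minimum is negative, which cannot be a real count of changed lines and points to a collection error, even though the value sits inside the box-plot fences. The value 5000 lies far beyond the upper fence, so it is flagged as an outlier and is worth raising with a domain expert before deciding whether it is an error or, say, a large merge.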

Principle #4, Data collection is a spiral: Note that the principles associated with data collection are not necessarily sequential. In other words, they are intermingled: when anomalies show up in the data, one has to get more domain feedback or expert input, then update the collected data as well as the analysis, then get expert feedback again, and so on.