I am currently writing my PhD thesis, and I am organizing its chapters as principles. The title of this post will most likely become the first chapter. It will be the chapter with the fewest numbers, where I share my ideas on "how to get the perfect data".
Unfortunately, there is no formula or silver-bullet answer to perfect data. This is mainly because reaching the right data is an interplay of several factors, such as:
- domain knowledge,
- the people generating, using and collecting the data,
- the particular characteristics of the data
Hence, I believe that the principles (of the first chapter) to focus on are along these lines:
Principle #1, Know your domain: Each domain or product has its own
way of being reflected in the data. Often, what seems to be a recurring
anomaly has a reasonable explanation. However, this knowledge is usually hidden
from an outside data analyst. For example, if you are analyzing the commit
information of a particular company, it is best to spend some time understanding
how their code is generated and how it progresses through their particular
branching structure.
Principle #2, Let the Experts Talk: The very first results
have a higher chance of being shot down by the domain experts, and this is only
natural, because they are based on data gathered with the analyst's most limited
domain knowledge and with limited expert input. The success of the very first
results is better measured by the amount of discussion they stimulate
among the experts or attendees of the meeting where they are presented. If they
create a discussion and you are bombarded with questions and suggestions on how
to improve your analysis, then the analysis is on a good track, and this is the
time to listen to the experts about which false assumptions were made during
data collection, which other considerations should be included, and so on.
Furthermore, it may be a good time to get to know the right people to ask
domain-specific questions.
Principle #3, Suspect your data: Anything too good or too
surprising to be true has a high chance of indeed not being true. Once an analyst
has accumulated enough domain knowledge and enough feedback from the domain
experts, it is time for the more mechanical part of the data analysis phase. In
this phase, one can inspect the data manually, e.g. by looking at the minimum and
maximum values of each feature or by plotting the values of each individual
feature with different methods, e.g. a box plot. This sometimes works like a
charm for identifying errors in the data collection. It can also hint at which
instances are likely to be outliers.
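The min-max and box-plot checks described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the commit records, the feature names (`lines_changed`, `files_touched`) and the helper functions are hypothetical, and the 1.5 x IQR fences are the common box-plot whisker convention, not a rule specific to any particular data set.

```python
import statistics


def summarize_feature(values):
    """Return the min, max and box-plot fences (1.5 * IQR) of one feature."""
    # Quartiles with the "inclusive" method; the median itself is not needed.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "min": min(values),
        "max": max(values),
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


def suspicious_rows(rows, feature):
    """Return the indices of rows whose value falls outside the fences."""
    values = [row[feature] for row in rows]
    summary = summarize_feature(values)
    return [
        index
        for index, value in enumerate(values)
        if value < summary["lower_fence"] or value > summary["upper_fence"]
    ]


# Hypothetical commit records, made up purely for illustration.
commits = [
    {"lines_changed": 12, "files_touched": 1},
    {"lines_changed": 40, "files_touched": 3},
    {"lines_changed": 25, "files_touched": 2},
    {"lines_changed": 8, "files_touched": 1},
    {"lines_changed": 31, "files_touched": 2},
    {"lines_changed": 18, "files_touched": 2},
    {"lines_changed": 5000, "files_touched": 4},
    {"lines_changed": 22, "files_touched": -1},
]

for feature in ("lines_changed", "files_touched"):
    values = [row[feature] for row in commits]
    print(feature, summarize_feature(values), suspicious_rows(commits, feature))
```

In this made-up sample, the minimum of `files_touched` is negative, which already points to a collection error, and the fences flag the 5000-line commit as a candidate outlier. Whether such a commit is an error or a legitimate event (for example, a large merge) is exactly the kind of question to take back to the domain experts.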
Principle #4, Data collection is a spiral: Note that the principles
associated with data collection are not necessarily sequential. In other words,
they are intermingled: data anomalies require one to get more domain feedback
or expert input, then update the collected data as well as the analysis, then
get expert feedback again, and so on.