Thursday, January 24, 2013

How big is your data?

It has been a while since my last post, and today seems like a good day to write something new. In this post I want to talk about a concept called "popularity" and how it can help us reduce the amount of data we are looking at. I know the name "popularity" may sound a bit unscientific, but bear with me; you will see that it is a pretty interesting idea. In fact, it is the idea we used to develop an active learning solution called QUICK, which has been accepted by IEEE Transactions on Software Engineering (link).

I think one of the best ways to describe the popularity concept is to start with a toy example. Assume that we have a data set of N instances (rows) and F features (columns). Let's say that we find the k nearest neighbors of every instance. For the sake of this example let's fix k=1; in other words, we find the 1st nearest neighbor of every instance. Every time an instance is picked as another instance's 1st nearest neighbor, we mark the picked instance. At the end we count how many marks each instance received, and that is our so-called "popularity" or "popularity count". For example, if instance A was marked as the nearest neighbor of 5 other instances, then its popularity is 5.
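The toy example above can be sketched in a few lines of Python. This is only an illustration: the function names, the five made-up points, and the tie-breaking rule (lowest index wins) are my assumptions, not something taken from the paper.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def popularity_counts(rows):
    """Count how often each row is picked as another row's 1st nearest neighbor.

    Needs at least two rows. Ties go to the lowest index (an assumption).
    """
    counts = [0] * len(rows)
    for i, row in enumerate(rows):
        others = (j for j in range(len(rows)) if j != i)
        nearest = min(others, key=lambda j: euclidean(row, rows[j]))
        counts[nearest] += 1  # mark the picked instance
    return counts

# Five made-up instances with two features each.
toy = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [5.0, 5.0], [9.0, 0.5]]
print(popularity_counts(toy))  # [2, 1, 1, 1, 0]
```

Every instance hands out exactly one mark, so the counts always add up to N. Here the first instance is the most popular one and the last instance is nobody's nearest neighbor.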

Of course, reading the informal discussion above may raise various questions: how do we calculate the distance between instances to find nearest neighbors, how do we use the popularity concept, what is its relation to active learning, and so on. The next paragraphs give brief answers to these questions.

How QUICK Works
To begin with, the concept of popularity and the QUICK algorithm (developed on top of the popularity concept) are really simple. Popularity uses a simple Euclidean distance function to calculate the distance between instances, and it marks only the k=1 nearest neighbor. The more important question is how to use the popularity counts of instances (i.e. how many marks each instance received as a 1st nearest neighbor). We opted to use the popularity concept for 2 purposes: selecting features and selecting instances, which are also the two steps of the QUICK algorithm.

Selecting Features: Remember that we defined our data set at the beginning of this post so that the instances correspond to rows and the features correspond to columns. In the feature selection phase, we transpose the data set. This gives us a data set where the original features are treated like instances and the original instances are treated like features. Once we calculate the popularity counts in this transposed space, we keep only the non-popular instances (popularity counts of zero), which are the non-popular features of the original data set. The idea is to select the features of the original data set that are very different from all the other features: a feature that is nobody's nearest neighbor is unlike the others and is therefore expected to carry different information.
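Here is a minimal sketch of the feature selection phase, assuming the data set is a list of rows with numeric values. The helper names and the tiny 4x4 data set are made up for illustration, and the sketch applies no normalization, which a real implementation may well need.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def popularity_counts(rows):
    """Count how often each row is some other row's 1st nearest neighbor."""
    counts = [0] * len(rows)
    for i, row in enumerate(rows):
        others = (j for j in range(len(rows)) if j != i)
        counts[min(others, key=lambda j: euclidean(row, rows[j]))] += 1
    return counts

def select_features(data):
    """Return the column indices whose popularity is zero after transposing."""
    columns = [list(col) for col in zip(*data)]  # features become "instances"
    counts = popularity_counts(columns)
    return [j for j, count in enumerate(counts) if count == 0]

# Made-up data: columns 0 and 1 are near-duplicates, columns 2 and 3 are not.
data = [
    [1.0, 1.1, 5.0, 0.2],
    [2.0, 2.1, 1.0, 0.9],
    [3.0, 2.9, 4.0, 0.1],
    [4.0, 4.2, 2.0, 0.8],
]
print(select_features(data))  # [2, 3]
```

Columns 0 and 1 pick each other as nearest neighbors, so both receive marks and are dropped as synonyms. Columns 2 and 3 are nobody's nearest neighbor and are kept.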

Selecting Instances: This phase is where active learning comes in. To put it in very simple terms, the idea behind active learning is the following: the dependent variable information (a.k.a. the label) is unknown, and an oracle can be queried for that information. However, each query comes with a cost, so we want to ask for as few labels as possible.

In the instance selection phase we use only the features that were selected in the previous phase and calculate the popularity counts of the instances. The instances are then sorted from the most popular to the least popular, and we ask for their labels in that order. As soon as there is no performance gain in asking for more labels, we stop asking.

QUICK's estimate for a test instance is simply the label of the test instance's nearest neighbor among the labeled popular instances.
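The instance selection and estimation steps might look roughly like the sketch below. Note one big simplification: this post does not spell out the exact stopping rule, so a fixed, hypothetical `budget` of queries stands in for "stop when there is no performance gain". The `oracle` callable, the function names, and the effort values are also made up.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def popularity_counts(rows):
    """Count how often each row is some other row's 1st nearest neighbor."""
    counts = [0] * len(rows)
    for i, row in enumerate(rows):
        others = (j for j in range(len(rows)) if j != i)
        counts[min(others, key=lambda j: euclidean(row, rows[j]))] += 1
    return counts

def label_most_popular(rows, oracle, budget):
    """Query the oracle for the `budget` most popular rows (stand-in stop rule)."""
    counts = popularity_counts(rows)
    # Most popular first; the stable sort keeps original order on ties.
    order = sorted(range(len(rows)), key=lambda i: -counts[i])
    return {i: oracle(i) for i in order[:budget]}

def estimate(test_row, rows, labeled):
    """Return the label of the nearest labeled row."""
    nearest = min(labeled, key=lambda i: euclidean(test_row, rows[i]))
    return labeled[nearest]

rows = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [5.0, 5.0], [9.0, 0.5]]
efforts = [10.0, 12.0, 11.0, 40.0, 75.0]  # hidden labels, made up
labeled = label_most_popular(rows, oracle=lambda i: efforts[i], budget=2)
print(sorted(labeled))                      # [0, 1]
print(estimate([1.1, 1.1], rows, labeled))  # 10.0
```

Only two of the five labels were ever requested, and the test instance gets the label of the closest of those two.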

Underlying Dimensionality
At the end of the feature and instance selection phases, we end up with a small subset of the original features and instances. In a data set of N instances and F features, the numbers of selected instances and features can be denoted N' and F', respectively. N' and F' define the underlying dimensionality of our data set in the context of the QUICK algorithm. An interesting thing to check is the ratio (N'*F')*100/(N*F), which tells us what percent of the cells in the original data set is identified as the essential content. For the effort data sets we experimented with, the essential content can be as low as 4% of the original data set. Across the 18 data sets we experimented with, the median essential content was 11% of the original data set.
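To make the ratio concrete, here is a tiny worked example. The function name and the sizes (100 instances, 20 features, of which 16 and 5 survive) are hypothetical numbers chosen for easy arithmetic, not figures from our experiments.

```python
def essential_content_percent(n, f, n_selected, f_selected):
    """Percent of the original cells kept: (N'*F')*100/(N*F)."""
    return (n_selected * f_selected) * 100 / (n * f)

# 16 * 5 = 80 cells kept out of 100 * 20 = 2000 cells.
print(essential_content_percent(100, 20, 16, 5))  # 4.0
```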

The next question to ask is how well the QUICK method performs. By using only a small percent of the original data, can we still do as well as standard learners like k-NN and CART? Our experiments showed that we can, and that is the beauty of it. We can find a really small subset of the data that contains the essential content and still perform as well as standard learners that use all the data. To summarize, our data sets may not be as big as we think they are!

If you have read this far, I believe I have your attention and may be able to convince you to take a look at our paper for many more details. In case I have convinced you, below is the abstract and here is the link to the paper.

Active Learning and Effort Estimation: Finding the Essential Content of Software Effort Estimation Data
Ekrem Kocaguneli, Student Member, IEEE, Tim Menzies, Member, IEEE, Jacky Keung, Member, IEEE, David Cok, Member, IEEE, Ray Madachy, Member, IEEE
Background: Do we always need complex methods for software effort estimation (SEE)? 
Aim: To characterize the essential content of SEE data; i.e. the least number of features and instances required to capture the information within SEE data. If the essential content is very small then (1) the contained information must be very brief and (2) the value-added of complex learning schemes must be minimal. 
Method: Our QUICK method computes the Euclidean distance between rows (instances) and columns (features) of SEE data; then prunes synonyms (similar features) and outliers (distant instances); then assesses the reduced data by comparing predictions from (1) a simple learner using the reduced data and (2) a state-of-the-art learner (CART) using all data. Performance is measured using hold-out experiments and expressed in terms of mean and median MRE, MAR, PRED(25), MBRE, MIBRE, or MMER. 
Results: For 18 data sets, QUICK pruned 69% to 96% of the training data (median=89%). K=1 nearest neighbor predictions (in the reduced data) performed as well as CART’s predictions (using all data). 
Conclusion: The essential content of some SEE data sets is very small. Complex estimation methods may be over-elaborate for such data sets and can be simplified. We offer QUICK as an example of such a simpler SEE method.