Sunday, September 27, 2020

Multi-task Learning and Calibration for Utility-based Home Feed Ranking

Here is the blogpost that follows up on the video from the last post. Click on the image to go to Pinterest's engineering blog. In this blogpost you will find how we utilized multi-task learning (MTL) and calibration to build a utility-based ranking experience for Pinners. Using such an MTL-based ranking framework, we were able to 1) provide Pinners with more relevant and engaging Pins, 2) increase engineering velocity, and 3) quickly update the ranking surface based on current business needs.




Thursday, March 5, 2020

Multi Task Learning in Homefeed Ranking

Recently I have been asked a few times how we do ranking on the Pinterest homefeed, what types of models we use, and what we optimize for.

I should probably prepare a new blog post covering all the updates we have made to the ranking model over the past year, and there have been quite a few: the introduction of MTL to an AutoML-based NN, a transfer layer on top for calibrating the predictions, and the combination of the calibrated scores into a final utility value for ranking.
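To make the utility idea concrete, here is a minimal, purely illustrative Python sketch of combining calibrated per-action probabilities into a single ranking score. The action names and weights below are hypothetical, not Pinterest's actual heads or production utility formula.

```python
# Illustrative only: hypothetical engagement actions and weights,
# not Pinterest's actual MTL heads or production utility function.
ACTION_WEIGHTS = {"repin": 1.0, "click": 0.5, "long_click": 0.8, "hide": -2.0}

def utility_score(calibrated_probs: dict) -> float:
    """Combine calibrated per-action probabilities into one ranking utility."""
    return sum(ACTION_WEIGHTS[action] * p for action, p in calibrated_probs.items())

# One candidate Pin with calibrated predictions from the multi-task heads.
pin = {"repin": 0.12, "click": 0.30, "long_click": 0.18, "hide": 0.01}
print(utility_score(pin))  # higher utility means ranked higher in the feed
```

Because the scores are calibrated, the weights in such a scheme can be adjusted to reflect current business needs without retraining the model, which is part of the appeal of a utility-based setup.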

But until that blogpost materializes, I thought it would make sense to share a video recording from the latest Mad Science Fair at Pinterest. It is a bit dated, but if you assume that most of what is described as in the works in the video below is now in production, it should provide a good overview of our current system. Hope you enjoy...





Wednesday, June 10, 2015

It has been some time... Or what I have been up to...

Since my last post, I seem to have given up on blogging for a while. It was a pretty busy stretch though: moving to Seattle for Microsoft, finishing up a couple of final papers, and writing for two book projects. Plus I think I did more paper reviews than I had done during my entire time in grad school; reviews are pretty time-consuming. And of course these were side projects: for the past two years I have been working as a software engineer in Bing Ads, providing metrics and data insights for core Bing Ads teams. It has been an awesome experience in its own right so far, but it deserves another post to share some thoughts about academia and industry.

Outside of work, most of my time was dedicated to the book projects that I was lucky enough to be involved in. The first book that I had the chance to co-author, with my amazing collaborators Tim Menzies, Leandro Minku, Fayola Peters and Burak Turhan, is called "Sharing Data and Models in Software Engineering" (here is the Amazon link: http://bit.ly/shrngDtMdls). I would like to think that different parts of this book speak to different levels of comfort with various data mining and machine learning techniques. The book is divided into four main parts. The first two parts start with high-level discussions of how to approach a data science project and what factors to consider in terms of user engagement and data collection, then continue with a brief tutorial on common concepts such as instance and feature selection, algorithms for prediction and clustering, and so on. The other two parts deal with more advanced material such as transfer learning, active learning and data privacy. The whole book is a nice consolidation of the practical research we have been doing over the past couple of years. Particularly for data scientists and researchers with a focus on software engineering data, the book provides a nice revisit of fundamental concepts as well as some more advanced ideas to try out on their own data sets.

[Image caption: At the Machine Learning and Data Science Conference at Microsoft, explaining the steps, ideas, and pitfalls that led to our chapter in "The Art and Science of Analyzing Software Data".]

Another project that I was lucky to take part in is something of a sequel to the first book, called "The Art and Science of Analyzing Software Data: Analysis Patterns" (the Amazon link: http://bit.ly/analysisPatterns). This book is the result of a joint effort by various prominent research groups at the junction of data science and software engineering, and it was edited by Christian Bird, Tim Menzies and Thomas Zimmermann. Besides the awesome folks involved in this project, what makes this book really cool is that it is a distillation of various research groups' approaches to data science. In other words, it gave the researchers a chance to step back and ask: "Hey, what were we doing all along in all these projects? What are the patterns that we instinctively follow, and can we distill this information into a formal structure and share it?" My contribution to this book is a chapter co-authored with the amazing Ayse Basar Bener, Ayse Tosun Misirli, Bora Caglayan and Gul Calikli. We have worked together on multiple data science projects at a number of different companies and successfully delivered tangible business outcomes for them. In our chapter, entitled "Lessons Learned from Software Analytics in Practice", we share the patterns that worked for us across those projects. Of course, there were also a ton of pitfalls we learned about along the way, which are probably as important to be aware of as the patterns. In this chapter we share those pitfalls as well, along with our recommendations on how we have avoided them in the past.

In case you are embarking on a career in data science or you want to see how other research groups in software engineering have approached data science projects, you may also want to take a look at these books.



Thursday, July 18, 2013

Looking for something on transfer learning (a.k.a. cross-company learning)?

For the past few days I have been trying to design a new set of experiments aimed at helping prediction when there is no or very little local data. I know it sounds kind of weird to talk about prediction when we have no local data, yet there is a tremendous amount of research invested in this problem. If you want to read more about the topic, you may look for the term "transfer learning" in the machine learning community. The basic idea of transfer learning is being able to learn when the source and target domains are different, or when the source and target tasks are different.

In the software engineering (SE) domain, transfer learning experiments have often been referred to as cross-company learning. So, in case you want to see similar studies from SE, another term to search for would be "cross-company". The problem of transfer learning in SE can be defined as having the same source and target task (i.e. prediction), yet different source and target domains. To put things into context, let's assume that your organization either has no training data at all or has some training data but no label information; then your organization would be the target domain. Let's also assume that there is another organization with labeled training data; that organization would be the source domain. In such a scenario, it would be nice, and also cost effective, if we could make use of the information from the source domain, because both collecting our own local training data from scratch and finding the label information (a.k.a. dependent variable information) for existing local data can be very time consuming and costly.
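To illustrate the basic cross-company setup (just the naive baseline, not the method from the papers below), here is a small Python sketch with synthetic data: a source company with labeled projects and a target company with unlabeled ones.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: the source company has labels (e.g. effort values),
# the target company only has feature vectors for its own projects.
X_source = rng.random((50, 6))        # labeled cross-company projects
y_source = rng.random(50) * 100.0     # known effort values (labels)
X_target = rng.random((10, 6))        # local projects, labels unknown

# Naive cross-company learning: train on the source domain, predict the target.
model = KNeighborsRegressor(n_neighbors=3).fit(X_source, y_source)
print(model.predict(X_target))        # estimates handed to the local experts
```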

Until now, the focus of transfer learning studies in SE has been on using labeled source domain data to predict the target domain data. However, merely handing predicted numbers to a local expert does not provide much local information that can be interpreted.

So we started thinking about how we could bring more local domain information into transfer learning. The initial study in this direction was a position paper published at the RAISE workshop at ICSE 2013, which essentially discusses how synergies between different learning tasks could improve transfer learning. We then followed the position paper with experiments on software effort estimation, where we provided some evidence that it is possible to introduce interpretable local information into transfer learning studies. The paper containing the follow-up experiments was recently accepted to the PROMISE 2013 conference.

If you have read this far into the post, I can safely assume you would be interested in the abstracts and links of the two papers mentioned in the previous paragraphs. So here you go:

The position paper:

ABSTRACT: Thanks to the ever increasing importance of project data, its collection has been one of the primary focuses of software organizations. Data collection activities have resulted in the availability of massive amounts of data through software data repositories. This is great news for the predictive modeling research in software engineering. However, widely used supervised methods for predictive modeling require labeled data that is relevant to the local context of a project. This requirement cannot be met by many of the available data sets, introducing new challenges for software engineering research. How to transfer data between different contexts? How to handle insufficient number of labeled instances? In this position paper, we investigate synergies between different learning methods (transfer, semi-supervised and active learning) which may overcome these challenges. 
Link: http://bit.ly/posPaper 

The follow up:
ABSTRACT: Background: Developing and maintaining a software effort estimation (SEE) data set within a company (within data) is costly. Often times parts of data may be missing or too difficult to collect, e.g. effort values. However, information about the past projects -although incomplete- may be helpful, when incorporated with the SEE data sets from other companies (cross data).
Aim: Utilizing cross data to aid within company estimates and local experts; Proposing a synergy between semi-supervised, active and cross company learning for software effort estimation. 
Method: The proposed method: 1) Summarizes existing unlabeled within data; 2) Uses cross data to provide pseudo-labels for the summarized within data; 3) Uses steps 1 and 2 to provide an estimate for the within test data as an input for the local company experts. We use 21 data sets and compare the proposed method to existing state-of-the-art within and cross company effort estimation methods subject to evaluation by 7 different error measures. 
Results: In 132 out of 147 settings (21 data sets X 7 error measures = 147 settings), the proposed method performs as well as the state-of-the-art methods. Also, the proposed method summarizes the past within data down to at most 15% of the original data. 
Conclusion: It is important to look for synergies amongst cross company and within-company effort estimation data, even when the latter is imperfect or sparse. In this research, we provide the experts with a method that: 1) is competent (performs as well as prior within and cross data estimation methods); 2) reflects on local data (estimates come from the within data); 3) is succinct (summarizes within data down to 15% or less); 4) cheap (easy to build). 





Monday, July 15, 2013

Distributed Development Considered Harmful?

Distributed development makes good use of contemporary collaboration technology and the global talent pool of software engineers. However, it also comes with various challenges.

Imagine reducing costs dramatically by having multiple teams thousands of miles apart, living in completely different time zones. Saving money is good, but also imagine what such a distributed development scenario may lead to: difficulty in inter-team communication, lack of coordination due to those communication difficulties, and ultimately more post-release bugs. Therefore, it is important to take a look at the effects of distributed development. In other words, do we benefit from distributed development, or does it come with unwanted side effects?

During my 2012 summer internship at Microsoft Research (MSR) in Redmond, I did research on distributed development and was lucky to work with the incredible researchers of the Empirical Software Engineering (ESE) group. It was a great learning experience, and I am particularly thankful to Thomas Zimmermann, Christian Bird, Nachiappan Nagappan and Tim Menzies for their mentorship and support. A summary of the research we conducted on distributed development was recently published in the "Software Engineering in Practice" track of the International Conference on Software Engineering. Below are the abstract and the link to this paper. I hope you enjoy it.


ABSTRACT: 
We offer a case study illustrating three rules for reporting research to industrial practitioners. Firstly, report “relevant” results; e.g. this paper explores the effects of distributed development on software products.
Second: “recheck” old results if new results call them into question. Many papers say distributed development can be harmful to software quality. Previous work by Bird et al. allayed that concern but a recent paper by Posnett et al. suggests that the Bird result was biased by the kinds of files it explored. Hence, this paper rechecks that result and finds significant differences in Microsoft products (Office 2010) between software built by distributed or collocated teams. At first glance, this recheck calls into question the widespread practice of distributed development.
Our third rule is to “reflect” on results to avoid confusing practitioners with an arcane mathematical analysis. For example, on reflection, we found that the effect size of the differences seen in the collocated and distributed software was so small that it need not concern industrial practitioners.
Our conclusion is that at least for Microsoft products, distributed development is not considered harmful.
The pdf of the article can be found here: http://dl.acm.org/citation.cfm?id=2486908



Tuesday, May 7, 2013

In the market for a non-academic job after a CS (or possibly an EE) PhD?

Anyone reading this title may immediately say: "Why on earth are you looking for a non-academic job if you enrolled in a PhD program?" That is an entirely valid question, and I have a couple of arguments as a reply, one of which starts like this: "Some of the current problems in industry require computer scientists with PhD-level knowledge." Another one starts with: "In industry, product groups are likely to face problems that can benefit from the research background of a PhD graduate." One may also ask: "Is an advanced graduate degree a must for tackling difficult problems?" My answer is simply no, it really is not a must, but it doesn't hurt.

If you made it past the previous paragraph, you probably expect this one to also talk about the pros and cons of a graduate degree, its relation to industry, etc. I am sorry to disappoint you, but this paragraph will talk about something else. Although I believe that getting your PhD will help you in industry, when you start applying to industrial positions you will quickly realize (as I did) that the application and evaluation processes are quite different from academic ones. That difference is completely fine, yet after finishing grad school I needed some time to brush up on my undergraduate CS fundamentals. I also found that solving as many coding questions as possible is really, really helpful. There are a ton of online resources, so I am not sure how helpful this post will be. However, for what it's worth, I wanted to share some of the sources I found particularly helpful.

It is probably a good idea to have an easy start, so the following link with easy-to-medium questions was a good starting point for me: http://goo.gl/34lpH

Once you are through those easy-to-medium questions, you may want to take a look at more challenging examples that repeatedly appear as interview questions. A good friend of mine, Arden Dertat, has a really cool website that I benefited from a lot. Here is the link: http://goo.gl/s7NpG

Also, if you want even more examples and challenges, here is a link to a set of 50 questions prepared by Xiu Zichuan (http://goo.gl/yrKVd). These questions come with answers, the code for the answers, suggested times to solve them, and even a difficulty level for each question!

In case you want something more organized that covers the entire industry job search process, I can recommend the following books: "Programming Interviews Exposed" and "Cracking the Coding Interview". Here are the Amazon links: http://goo.gl/oYAGI and http://goo.gl/fF3SJ

I am not sure how helpful it is to solve each and every one of these questions; I only solved the ones I thought were likely candidates for the interviews. In case you want a feel for the latest questions that have come up in interviews, www.glassdoor.com and www.careercup.com are quite good and up-to-date sources.

I hope this post can help a little bit and good luck with the job search!


Thursday, January 24, 2013

How big is your data?

It has been a while since my last post and today seems like a good day to write something new. In this post I want to talk about a concept called "popularity" and how it can help us reduce the amount of data we are looking at. I know the name "popularity" may sound a bit non-sciency, but bear with me; you will see that it is a pretty interesting idea. In fact, it is the idea we used to develop an active learning solution called QUICK, which was accepted to IEEE Transactions on Software Engineering (link).

Popularity
I think one of the best ways to describe the popularity concept is to start with a toy example. Assume that we have a data set of N instances (rows) and F features (columns). Let's say that we find the k nearest neighbors of every instance. For the sake of this example let's fix k=1; in other words, we find the 1st nearest neighbor of every instance. Every time an instance is picked as another instance's 1st nearest neighbor, we mark the picked instance. At the end we count how many marks each instance received, and that is its so-called "popularity" or "popularity count". For example, if instance A was marked as the nearest neighbor of 5 other instances, then its popularity is 5.
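For readers who prefer code, here is a minimal NumPy sketch of the popularity count as just described; it is an illustration of the idea, not the exact implementation used in the paper.

```python
import numpy as np

def popularity_counts(X: np.ndarray, k: int = 1) -> np.ndarray:
    """Count how often each row of X is picked as one of another row's k nearest neighbors."""
    n = X.shape[0]
    # Pairwise Euclidean distances between all rows.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # an instance cannot pick itself
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        counts[np.argsort(dists[i])[:k]] += 1  # mark the picked-up instance(s)
    return counts

# Toy data: 6 instances (rows), 3 features (columns).
X = np.array([[1, 0, 0], [1, 0, 1], [0, 2, 0], [0, 2, 1], [5, 5, 5], [1, 0, 0.5]])
print(popularity_counts(X, k=1))
```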

Of course, reading the above informal discussion may raise various questions: how do we calculate the distance between instances to find nearest neighbors, how do we use the popularity concept, what is its relation to active learning, and so on. The next paragraphs give brief answers to these questions.

The How-To of QUICK
To begin with, the popularity concept and the QUICK algorithm (developed on top of it) are really simple. Popularity uses a simple Euclidean distance function to calculate the distances between instances and marks the k=1 nearest neighbor of each one. The more important question is how to use the popularity counts of the instances (i.e. how many marks each instance received as a 1st nearest neighbor). We use the popularity concept for two purposes: selecting features and selecting instances, which are the two steps of the QUICK algorithm.

Selecting Features: Remember that we defined our data set at the beginning of this post so that instances correspond to rows and features correspond to columns. In the feature selection phase, we transpose the data set. This gives us a data set where the features are treated like instances and the instances are treated like the features of the new space. Once we calculate the popularity counts, we keep only the non-popular instances (popularity counts of zero), which are the non-popular features of the original data set. The idea behind selecting the non-popular instances of the transposed data set is to select the features of the original data set that are very different from all the other features; in other words, features that are unlike the others and hence expected to carry different information.
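Continuing the sketch from above (and reusing its popularity_counts function and toy matrix X), feature selection then amounts to transposing the data and keeping the zero-popularity columns.

```python
import numpy as np

def select_features(X: np.ndarray) -> np.ndarray:
    """Return indices of non-popular features: columns with zero popularity
    when the transposed data set is fed to popularity_counts (defined earlier)."""
    feature_popularity = popularity_counts(X.T, k=1)  # features act as instances
    return np.flatnonzero(feature_popularity == 0)

selected = select_features(X)  # X from the previous snippet
print(selected)                # indices of the features kept by this step
```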

Selecting Instances: This phase is where the active learning comes in. To put it in very simple terms, the idea behind active learning is the following: the dependent variable information (a.k.a. the label) is unknown, and an oracle can be queried for it. However, each query comes with a cost, so we want to ask for as few labels as possible.

In the instance selection phase we use only the features selected in the previous phase and calculate the popularity counts of the instances. The instances are then sorted from most popular to least popular, and we start asking for labels, beginning with the most popular instance and working down toward the least popular one. As soon as asking for more labels brings no performance gain, we stop.

QUICK's estimate for a test instance is basically the label of the test instance's nearest neighbor among the labeled popular instances.
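Putting the two steps together, here is a simplified sketch of instance selection and estimation. For brevity it uses a fixed labeling budget instead of QUICK's actual stopping rule (stop when extra labels bring no performance gain), and it reuses popularity_counts from the earlier snippet.

```python
import numpy as np

def quick_estimate(X, selected_features, oracle_label, x_test, budget=5):
    """Label instances in popularity order (up to a fixed budget, a simplification
    of QUICK's stopping rule), then return the 1-NN label for the test instance."""
    Xs = X[:, selected_features]
    order = np.argsort(-popularity_counts(Xs, k=1))  # most popular first
    labeled = order[:budget]                         # instances we pay to label
    labels = np.array([oracle_label(i) for i in labeled])
    # Estimate = label of the test instance's nearest labeled popular instance.
    dists = np.linalg.norm(Xs[labeled] - x_test[selected_features], axis=1)
    return labels[int(np.argmin(dists))]

# Hypothetical usage, with labels looked up from a ground-truth vector y:
# estimate = quick_estimate(X, selected, lambda i: y[i], x_test=X[0], budget=3)
```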

Underlying Dimensionality
At the end of the feature and instance selection phases, we end up with a small subset of the original features and instances. In a data set of N instances and F features, the selected instances and features can be denoted N' and F', respectively. N' and F' define the underlying dimensionality of the data set in the context of the QUICK algorithm. An interesting quantity to check is the ratio (N'*F')*100/(N*F), which tells us what percentage of the cells in the original data set is identified as the essential content. For the effort data sets we experimented with, the essential content can be as low as 4% of the original data set; over the 18 data sets, the median was 11%.
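As a quick worked example of that ratio, with made-up numbers rather than ones from the paper:

```python
# Made-up numbers for illustration only (not from the paper):
N, F = 100, 10          # original data set: 100 instances, 10 features
N_sel, F_sel = 22, 5    # after QUICK's selection: 22 instances, 5 features
essential_pct = (N_sel * F_sel) * 100 / (N * F)
print(f"essential content = {essential_pct:.0f}% of the original cells")  # 11%
```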

The next question to ask is about the performance of the QUICK method. Using only a small percentage of the original data, can we still do as well as standard learners like k-NN and CART? Our experiments showed that we can, and that is the beauty of it: we can find a really small subset of the data that contains the essential content and still perform as well as standard learners that use all the data. To summarize, our data sets may not be as big as we think they are!

If you have read until this point, I believe I have your attention and may be able to convince you to take a look at our paper for many more details. In case I have convinced you, below is the abstract and here is the link to the paper.


Active Learning and Effort Estimation: Finding the Essential Content of Software Effort Estimation Data
Ekrem Kocaguneli, Student Member, IEEE, Tim Menzies, Member, IEEE, Jacky Keung, Member, IEEE, David Cok, Member, IEEE, Ray Madachy, Member, IEEE
Background: Do we always need complex methods for software effort estimation (SEE)? 
Aim: To characterize the essential content of SEE data; i.e. the least number of features and instances required to capture the information within SEE data. If the essential content is very small then (1) the contained information must be very brief and (2) the value-added of complex learning schemes must be minimal. 
Method: Our QUICK method computes the Euclidean distance between rows (instances) and columns (features) of SEE data; then prunes synonyms (similar features) and outliers (distant instances); then assesses the reduced data by comparing predictions from (1) a simple learner using the reduced data and (2) a state-of-the-art learner (CART) using all data. Performance is measured using hold-out experiments and expressed in terms of mean and median MRE, MAR, PRED(25), MBRE, MIBRE, or MMER. 
Results: For 18 data sets, QUICK pruned 69% to 96% of the training data (median=89%). K=1 nearest neighbor predictions (in the reduced data) performed as well as CART’s predictions (using all data). 
Conclusion: The essential content of some SEE data sets is very small. Complex estimation methods may be over-elaborate for such data sets and can be simplified. We offer QUICK as an example of such a simpler SEE method.