Criteo’s business model is based on performance advertising: we optimize sales on behalf of our clients, while charging for clicks only. To achieve this, we need accurate predictive models based on advanced algorithms and data.
We data scientists are at the core of both engineering and business. Our mission is to use the platform and tools developed by the engineers, together with the advertiser’s data, to build the best models possible. As the lead of the Engine Data Analytics team, I manage five data scientists working specifically on bidding models. We strive to improve prediction quality so we can make the best decision about whether to display an ad or not, and for which client.
Big Data is a buzzword, but it’s also one of Criteo’s most important assets. We gather data from several sources, which can be divided into two broad categories:
- advertiser events, which take place on our clients’ websites (product views, basket events, sales, …)
- publisher events, which occur on our providers’ websites (displays, clicks, …)
The size of the data is a tremendous opportunity, but also a serious challenge (several billion log lines per day!).
Leveraging this data requires thorough analytical processing, e.g. Hive and/or Spark queries to analyze the distribution and correlation of features. As data scientists, we look for patterns and signal in the data which we can capture to improve the prediction models. We also look for irregularities to diagnose issues.
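In production this kind of analysis runs as Hive or Spark queries over the cluster; as a purely illustrative stand-in (hypothetical log fields, an in-memory list instead of billions of lines), the same distribution analysis might look like:

```python
from collections import Counter

# Hypothetical log records; in production this would be a Hive/Spark
# query over billions of lines rather than an in-memory list.
events = [
    {"event": "product_view", "device": "mobile"},
    {"event": "sale", "device": "desktop"},
    {"event": "product_view", "device": "mobile"},
    {"event": "basket", "device": "tablet"},
]

# Distribution of a feature: how often does each value occur?
device_counts = Counter(e["device"] for e in events)
total = sum(device_counts.values())
distribution = {d: n / total for d, n in device_counts.items()}
print(distribution)  # e.g. {'mobile': 0.5, 'desktop': 0.25, 'tablet': 0.25}
```

A skewed distribution (or one that suddenly changes shape) is often the first hint of either a useful signal or a logging issue to diagnose.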
Key observation: we don’t have to use the data directly, we can transform it! There are two main ways of using the data in our models: bucketizing continuous variables, and crossing features together to capture co-occurrence effects.
These techniques (among others) are widely known as feature engineering. Looking at the distribution of features, we can define relevant buckets for continuous variables, such as counters or timestamps. For instance, consider the number of products a user has seen. Seeing four products definitely reflects more engagement than seeing only one product. However, seeing a hundred and four products is not that different from seeing a hundred products, hence the need to bucketize continuous variables.
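A minimal sketch of bucketization, with hypothetical boundaries one might pick after inspecting the distribution of the “products seen” counter:

```python
import bisect

# Hypothetical bucket boundaries chosen by looking at the feature's
# distribution: a single view, a few views, tens of views, 100+ views.
BOUNDARIES = [2, 5, 20, 100]

def bucketize(n_products_seen: int) -> int:
    """Map a raw counter to a small bucket id."""
    return bisect.bisect_right(BOUNDARIES, n_products_seen)

# 1 and 4 products land in different buckets (different engagement)...
assert bucketize(1) != bucketize(4)
# ...while 100 and 104 products land in the same bucket.
assert bucketize(100) == bucketize(104)
```

The model then learns one weight per bucket instead of trying to fit a smooth function of the raw counter.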
Moreover, some variables make more sense when crossed together to capture the co-occurrence effect, such as the number of banners a user has seen on a particular device. The “fatigue” effect is heavily dependent on the screen used, and hence on the device type the user is on.
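A cross-feature can be as simple as concatenating the two values into one categorical key, so the model learns a separate weight per combination (a toy sketch, not our production encoding):

```python
def cross_feature(device_type: str, banner_count_bucket: int) -> str:
    """Combine two features into a single categorical cross-feature,
    so the model can learn one weight per (device, bucket) pair."""
    return f"{device_type}x{banner_count_bucket}"

# The same banner-count bucket gets a distinct weight per device,
# which is how a device-dependent "fatigue" effect can be captured.
print(cross_feature("mobile", 3))   # mobilex3
print(cross_feature("desktop", 3))  # desktopx3
```

In practice the cross-feature is usually hashed into a fixed-size space to keep the cardinality manageable.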
Recently, we began looking into more advanced features, such as recurrent neural networks to build embeddings of product-related information. We also started using gradient-boosted decision trees (GBDTs) via the excellent XGBoost library. We can leverage GBDTs to improve our feature engineering by computing bucketization and cross-features in a more automated way (by looking at splits and co-occurrences, for instance).
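To see why tree splits can suggest bucket boundaries, here is a hand-rolled sketch of what a single regression-tree stump does (toy data and criterion; a real pipeline would inspect the trees XGBoost has learned rather than reimplement them):

```python
def best_split(xs: list[float], ys: list[float]) -> float:
    """Find the threshold on one feature that most reduces squared
    error -- the criterion a regression-tree stump optimizes.  The
    chosen threshold is a natural candidate bucket boundary."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    pairs = sorted(zip(xs, ys))
    best_thr, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < thr]
        right = [y for x, y in pairs if x >= thr]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_thr, best_cost = thr, cost
    return best_thr

# Toy data: engagement jumps once a user has seen 5+ products,
# so the stump places its split right there.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.0, 0.0, 0.1, 0.0, 1.0, 1.1, 0.9, 1.0]
print(best_split(xs, ys))  # 4.5
```

A full GBDT repeats this greedily across many trees and features, which is what makes the discovered splits and co-occurring features useful for automated feature engineering.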
In order to use these carefully crafted features, we need to think about which models are the best fit. Here again, the sheer size of the data imposes hard constraints on what models we can use. As data scientists we do not actually develop algorithm libraries; rather, we use what is built by other engineering teams. However, we need a thorough understanding of how these algorithms work behind the scenes to tune them and investigate issues when they arise. For instance, being able to detect overfitting is important for tuning the regularization parameter.
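As a toy illustration of what a regularization parameter trades off (not our actual models, just a smoothed click-through-rate estimate with made-up numbers):

```python
def smoothed_ctr(clicks: int, displays: int,
                 prior_ctr: float = 0.03, strength: float = 100.0) -> float:
    """Empirical CTR shrunk toward a global prior.  `strength` plays
    the role of a regularization parameter: larger values trust the
    prior more and the raw counts less."""
    return (clicks + strength * prior_ctr) / (displays + strength)

# With only 2 displays and 1 click, the raw CTR (0.5) would wildly
# overfit the noise; the regularized estimate stays near the prior.
print(smoothed_ctr(1, 2))         # ~0.039
# With plenty of data, the estimate follows the observed rate.
print(smoothed_ctr(500, 10_000))  # ~0.050
```

Too little regularization and the model chases noise (overfitting); too much and it ignores the data, which is exactly the trade-off we tune.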
We constantly try to improve our prediction models. But there’s a difficult question to answer: how can you compare two models to know which one is better? A key objective is to design metrics that do just that. A good metric should accurately mirror Criteo’s business objectives, so we work hand in hand with the Product team to design and maintain relevant metrics. We validate our models using offline metrics first, because we don’t want to put a bad model in production. However, offline behavior is not the same as online: in part because we replay past traffic, we can’t know the effect of the new traffic a candidate model would acquire. In the end, we constantly run A/B tests to validate our models against actual production traffic. A/B test slots are a scarce resource, so we are careful not to waste them, by carefully selecting our candidates offline in the first place.
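One standard offline metric for click-probability models is the log loss (negative log-likelihood) computed on replayed traffic. A minimal sketch with made-up predictions:

```python
import math

def log_loss(y_true, y_pred):
    """Average negative log-likelihood of observed clicks under the
    model's predicted click probabilities (lower is better)."""
    eps = 1e-15  # clamp to avoid log(0)
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, y_pred)
    ) / len(y_true)

# Clicks observed on replayed traffic (hypothetical).
clicks = [1, 0, 0, 1, 0]
model_a = [0.8, 0.2, 0.1, 0.7, 0.3]  # candidate model's predictions
model_b = [0.6, 0.4, 0.3, 0.5, 0.5]  # production model's predictions

# The candidate with the better offline score earns an A/B test slot.
print(log_loss(clicks, model_a) < log_loss(clicks, model_b))  # True
```

Offline wins like this are necessary but not sufficient: only the A/B test tells us how the model behaves on traffic it actually shapes.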
With the data we collect, the features we engineer, the models and the metrics, how do we go from the original idea to the production model? Enter custom logs and the TestFramework. When a data scientist has a new idea for a feature, the first step is usually to build a “custom log”: a production log (on which production models are learned) enriched with the new feature(s) to test. Once these logs are ready, we can proceed with offline experiments to benchmark the candidate model against the production one. This happens on the “TestFramework”, a Hadoop-based environment enabling lots of concurrent offline tests without risking any impact on production. And with the quantity of data we have, we definitely need something like Hadoop. Needless to say, we data scientists are among the biggest users of Criteo’s TestFramework (and hence of the Hadoop cluster).
Enough with machines and algorithms… on to the people! We form a very friendly and cohesive group across the whole Engine department. Moreover, the data scientist group is very diverse in gender, nationality and academic background: data scientists come from very different paths, from engineers to PhDs in astrophysics!
Furthermore, we work in close cooperation with the Research team, in order to stay as close as possible to state-of-the-art algorithms. There are lots of machine learning topics to be addressed, and we’re constantly looking for top talent to help us grow. Oh, and I almost forgot: we have cookies!