September 10th, 2014

PoH – Part 3 – Distributed optimization

In web advertising, predictions must be made extremely quickly, because very little time is available to send a bid to the ad exchange. On average, Criteo predicts the click probability of a user in less than 100 microseconds, as opposed to the 50 milliseconds required by deep models, and does so up to 500,000 times per second. This is the main reason why simple generalized linear models such as logistic regression are still widely used in our industry. Because such models are also fast to train, the move to distributed learning was not as much of a priority for us as it might have been for other companies.
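To illustrate why a logistic regression is so cheap at prediction time, here is a minimal sketch in Python. It is not Criteo's production code: the function name, the dictionary-based weight storage, and all numeric values are assumptions made for illustration. The point it shows is that, with sparse binary features, a prediction costs one weight lookup per active feature plus a single sigmoid.

```python
import math


def predict_click_probability(weights, active_features, bias=0.0):
    """Logistic regression prediction over sparse binary features.

    weights: dict mapping a feature index to its learned weight
             (hypothetical values, for illustration only)
    active_features: indices of the features that are "on" for this request
    """
    # Only active features contribute, so the cost is proportional to the
    # number of active features, not to the total size of the model.
    score = bias + sum(weights.get(i, 0.0) for i in active_features)

    # Numerically stable sigmoid: never exponentiate a large positive number.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


# Hypothetical model with three known weights; feature 99 is unseen and
# therefore contributes nothing to the score.
weights = {3: 0.8, 17: -1.2, 42: 0.4}
p = predict_click_probability(weights, [3, 42, 99], bias=-2.0)
```

Here the linear score is -2.0 + 0.8 + 0.4 = -0.8, so `p` is the sigmoid of -0.8, roughly 0.31.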

September 1st, 2014

PoH – Part 2 – Running C# on a Linux Hadoop cluster

Assume you have a code base in C# and you want to run it in a distributed way on Hadoop. What do you do: rewrite your legacy code base in Java? Or try to forget the lesson you learnt when you were 3 years old, that a square piece cannot fit into a triangle hole?

Taking advantage of the lessons learnt the hard way by others, we chose to give the second approach a try. Here is how we managed to give a triangle shape to our square piece, and to run C# code on Hadoop in production.

August 19th, 2014

Criteo releases its first public dataset: Conversion logs

At Criteo, we are committed to scientific excellence, and one of the cornerstones of scientific progress is the reproducibility of experimental results. We have thus decided to publicly release the datasets used in our forthcoming papers.

And the first release is here! Olivier Chapelle will be presenting his paper on conversion modeling at KDD next week, and the associated dataset can be downloaded here.

Enjoy these gigabytes of conversion logs!

July 18th, 2014

How to win a free trip to San Francisco

One possible way was to qualify for the TopCoder Open onsite final, held in San Francisco in November 2014. Let me explain how I managed to take one of the last four places in the last Marathon Match qualification round. TopCoder Marathon Matches are complex algorithmic challenges to be solved in one or two weeks (fortunately, the onsite final is only 12 hours long).