The best of Criteo Labs – a drive through 2017

By: CriteoLabs / 12 Jan 2018

BIGGER, BETTER, STRONGER

At Criteo, our best perk is our amazing people. To the existing teams in Paris, Palo Alto and Grenoble, we have added a new office in Ann Arbor (Michigan, US). Our R&D is truly global, and our engineering teams are united by their energy and hunger for performance. Our strength comes from our commitment to challenge and be challenged. We believe 2017 was an amazing year for the Criteo Labs teams, and the best indicator of this is the number of feature roll-outs that shipped during the year. Some of the subjects are confidential, of course, but we are proud to share some of them below.

Our Ann Arbor team in Michigan: several of our Ann Arbor engineers

CONTRIBUTIONS TO THE ENGINEERING COMMUNITY

Criteo processes more than two hundred billion requests per day, using several thousand servers. Under such stress, services sometimes don’t behave as predicted and can fail in spectacular ways. Kévin Gosse and Christophe Nasarre published an amazing series of 9 articles describing in full detail how they diagnosed the complex C# issues they ran into. Dealing with memory dumps of tens of gigabytes and exposing complex interactions between components, they had to come up with exotic investigation methods and develop extensions to WinDbg using CLR MD and other .NET tools and libraries.

In addition, we have deployed Zipkin as the distributed tracing solution in our datacenters. Unfortunately, Zipkin had no fully featured C# client that we could use, so we developed our own and open-sourced it. We are very proud that our client has been adopted as the standard Zipkin library for C#.

 

OPEN SOURCE CONTRIBUTIONS

Criteo is committed to contributing to the open source community. We use a wide range of open-source software internally (Cassandra, Chef, Gerrit, GitLab, Graphite, Hadoop, Kafka, …). We publish the tools we believe to be of general interest, and contribute bug fixes and improvements to the open-source software we use.

To start with, our Hadoop ecosystem operates at a large scale, as we produce tons of data each day. We run 300k jobs every day, processing around 7 PB of logs to produce trillions of new records. We do that using several frameworks such as Hive, raw Map/Reduce, Scalding and Spark. A few years ago, we migrated from a centralized job scheduler based on Ruby to a distributed system based on Scala. “Cuttle” is now an OSS project that everyone can use to schedule data production at Criteo scale.

Next, for our business analysis, we mainly use Vertica as a datastore. Some of our datasets contain billions of records per day, and some analyses, such as distinct counts, become very difficult to compute at this scale. HyperLogLog is an approximation algorithm for this problem that allows us to roll up tremendous datasets in real time. We developed high-performance HyperLogLog support for Vertica and open-sourced it.
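To make the rollup idea concrete, here is a minimal HyperLogLog sketch in Python. It illustrates the general algorithm under stated assumptions and is not the code of the Vertica extension: the class name, the SHA-1 hash and the 12-bit precision are choices made for this example only. The key property is that two sketches merge by taking the register-wise maximum, so per-day sketches can be rolled up into a longer period without rescanning the raw records.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical class, not the Vertica code)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                       # number of hash bits used as register index
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Assumption for this example: take 64 bits of a SHA-1 digest as the hash.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # Rollup: the sketch of a union is the register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


# Usage: sketch two overlapping days of (made-up) user ids, then roll them up.
day1, day2 = HyperLogLog(), HyperLogLog()
for i in range(30_000):
    day1.add(f"user-{i}")
for i in range(20_000, 50_000):
    day2.add(f"user-{i}")
both_days = day1.merge(day2)
print(round(both_days.count()))  # an estimate of the 50,000 distinct ids
```

With 4096 registers, the typical relative error of the estimate is around 1.6% (1.04 divided by the square root of the register count), while each sketch stays at a fixed size no matter how many records it has seen.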

The thousands of datasets we have in production and the hugely distributed nature of operations at Criteo, both geographically and from a systems point of view, made it very difficult to track the quality and availability of our data. We moved to a “guerrilla” SLA mode where everyone can track their own SLI and their dependencies using “SLAB”, an extensible Scala framework for creating monitoring dashboards.

Finally, many of our projects need to use HTTP, either as a server or to communicate with existing Web services. For that we use “lolhttp”, a small HTTP server and client library for Scala. It is built on top of functional streaming concepts and has built-in support for HTTP/2, so it is well suited to building robust real-time web services.

Picture of our Ashburn datacenter

At Criteo, our Graphite cluster is now running on more than 100 nodes (R=6) and we have completely retired the Whisper backend. The DevOps engineers introduced Prometheus at Criteo and are building a bridge between both systems. Prometheus is an open-source monitoring system with a dimensional data model, a flexible query language, an efficient time series database and a modern alerting approach. They created a fully featured Graphite remote adapter for Prometheus.
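As an illustration of what bridging the two systems involves, the sketch below flattens one Prometheus sample (a metric name plus labels) into a line of Graphite's plaintext protocol, which has the form "path value timestamp". The function name, the "prometheus" prefix and the label-to-path mapping are assumptions made for this example; they do not describe how the actual remote adapter behaves.

```python
import re
from typing import Mapping

# Characters outside this set are replaced, since "." separates Graphite path nodes.
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]")


def to_graphite_line(name: str, labels: Mapping[str, str], value: float,
                     timestamp: float, prefix: str = "prometheus") -> str:
    """Render one labelled sample as a Graphite plaintext line (hypothetical mapping)."""

    def clean(part: str) -> str:
        return _UNSAFE.sub("_", part)

    # Assumed mapping: prefix.metric.label1.value1.label2.value2, labels sorted by name
    # so that the same series always maps to the same path.
    parts = [prefix, clean(name)]
    for key in sorted(labels):
        parts += [clean(key), clean(labels[key])]
    return f"{'.'.join(parts)} {value} {int(timestamp)}\n"


# Usage with a made-up sample.
print(to_graphite_line("http_requests_total", {"job": "api", "code": "200"},
                       1027.0, 1514764800), end="")
```

Sorting the labels is a design choice for this sketch: Prometheus labels are unordered, whereas a Graphite path is positional, so some fixed ordering is needed to keep paths stable.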

They also contributed heavily to Graphite. Graphite 1.1.1 was released with the help of several Criteo engineers, who also released new compatible plugins for “BigGraphite” (a simple, scalable time series database). They contributed new features to Grafana and the Prometheus Alertmanager as well. If you are attending or following the talks at FOSDEM in 2018, you can learn more there.

None of the achievements outlined above would have been possible without our Site Reliability Engineering and Infrastructure teams. First of all, in Europe, they launched a new datacenter in Amsterdam, which required racking 1200 machines at full speed in 3 months. On a worldwide scale, we set up 8800 additional machines, which brought us to a total of 25K machines. In the US, we also launched a new datacenter in Ashburn. This led to the creation of a network team of 7 people in our Palo Alto office to support the growth of our infrastructure.

For several years, we have been operating the biggest Hadoop cluster in Europe. Within a year, it more than doubled in size, from 1200 to 2900 machines, to reach 160 PB of HDFS storage and 400 TB of memory. From a network standpoint, we completed the transition from 10 Gbit/s to 100 Gbit/s on the WAN.

Here are some examples of DevOps topics if you want to go further:

Anna Savarin shares some insights on scaling for DevOps at the NCrafts conference

… and Corentin Chary presents BigGraphite

 

RESEARCH AND MACHINE LEARNING

In collaboration with the SRE and Infrastructure teams, we installed our first batch of GPUs in our datacenters, allowing us to evaluate new kinds of algorithms at scale. Needless to say, our Researchers and Machine Learning Engineers are having fun with them!

The Research team has been very busy: it launched the first Criteo Research Faculty Award, a grant that facilitates academic research in Machine Learning. To keep feeding the Machine Learning community, the team also released the 6th public Criteo dataset, on attribution modeling, including conversion data.

In addition to publishing at all the leading conferences, the Research team organized the 3rd edition of the Machine Learning in the Real-World workshop with a stellar cast of speakers. The team was also a finalist for Best Paper at the AdKDD Workshop for its work on improving bidding efficiency through attribution modeling.

Several of our Machine Learning Research Scientists

SHARING KNOWLEDGE

If you are still reading, you must have understood that giving back to the tech community is part of our DNA. In 2017, we participated in 78 events in 9 countries and sponsored 24 of them, welcomed 2000 people to our offices during Meetups, successfully launched our Criteo-branded NABD Conference in Palo Alto, and our talented R&D teams gave 21 Criteo talks all over the world! As a result, 70% of our engineers attended conferences worldwide!

Also, 2017 was the year of many firsts for the Palo Alto office. We hosted our first-ever Meetup in April, where Karthik R discussed Data Stream Processing @ Scale. Both Ann Arbor and Palo Alto took part in building the Criteo brand in 2017 and showcased an ever-growing team, with many amazing speakers discussing topics ranging from Machine Learning and Computer Vision to Amazon Aurora and Bias on the Web.

We held the first-ever Lean In Group of Palo Alto and hosted a movie night featuring CODE: Debugging the Gender Gap.

The Palo Alto office hosted the first-ever NABD Conference, a huge success with over 10 top-notch speakers from all over the technical community and over 75 attendees.

Justin Coffey, Senior Staff Development Lead, during the first NABD Conference – Palo Alto

Both the US and French recruiting teams continued to build the University relations program and participated in career fairs with over 12 schools in total, building a solid foundation of technical knowledge to expand within the college communities and raising on-campus awareness of the Criteo brand. The intern program also played a big part in that success, reflecting Criteo’s continued commitment to attracting smart, free thinkers from the college community.

 

EMPOWERMENT AND INNOVATIVE CULTURE

Given our challenges, Criteo has developed an engineering-driven culture that encourages creative problem solving, cross-team collaboration and learning. Our engineers, no matter where they are located, are proactive and always keen to challenge themselves to learn!

For instance, once again, our Paris engineers took part in the annual French coding contest “Meilleur Dev de France”! The Criteo Labs engineers achieved podium success and Stéphane Le Roy won the competition. 4 of the top 10 engineers in 2017 came from our teams!

During 2017, we broke our record for the number of Voyagers completed within the R&D department: 26% of our engineers joined another team for a few weeks and experienced new technologies and projects, including a massive exchange between our French and US offices.

We created the Machine Learning Bootcamp, a three-month program that immerses selected volunteer engineers in Criteo’s Machine Learning research projects by combining theoretical and practical training with the opportunity to contribute to a live project. Members of the research team offer a set of classes that focus on the theoretical aspects of machine learning algorithms, from simple categorization to reinforcement learning and, of course, deep learning.

This year, our internal hackathon was again run in all the Criteo offices in the world and spanned all teams, from R&D to S&O. Those events are very dear to us because they reflect our bottom-up culture and the true go-go-go spirit of all Criteos! And each of them can have an impact! For example, two products we launched this year were bootstrapped from previous hackathons (Criteo Customer Acquisition & Criteo Audience Match).

 

WHAT’S NEXT FOR 2018?

Performance and scalability are at the heart of every department of Criteo Labs, whether in Europe or the US: Site Reliability Engineering (SRE), Infrastructure, Platforms, Engine and Research. We make 80 million predictions per second, have 13 milliseconds to choose which ad to display and ship 150+ releases per week. Follow us in 2018: even more exciting challenges are coming!
