Building the Cross Device Graph at Criteo

By: CriteoLabs / 07 Jun 2016

Universal Match

The purpose of Universal Match, Criteo’s cross-device solution, is to accurately and consistently identify users across multiple environments and devices along with the associated shopping intent. In the past year, we have been changing the whole Criteo platform to leverage all the shopping history across all devices for a single user, leading to solid improvements to advertising campaign performance and better sales attribution.

Identification: single environment to cross device

In the beginning, we were only using an http cookie containing an internal identifier. This works pretty well on desktop computers and allows us to do retargeting browser by browser.

In mobile and tablets, identification gets more complicated as there are two environments: mobile web and applications. We can still use an http cookie on mobile web. Inside applications, it is possible to retrieve an advertising id, called IDFA on iOS and Android id on Android. It enables tracking as it is consistent across all applications, even though both environments (browser and in app) are independent.

 Browser

 In App

 Android

 Http cookie

 AAID

 iOS

 Http cookie

 IDFA

Things are even more complicated in mobile as many publishers are using WebViews and opening internal browsers inside their applications with their own set of cookies. As a result, a single device like a smartphone can have many ids according to environment, leading to performance loss on single device.

Cross Device

Of course, once we have identified ids on a single device, we want to match devices from the same user as described in the graph below.

Building the graph

Criteo integration to thousands of advertisers is a unique asset in the industry. For cross device, we can leverage customer ids and emails in order to match all these ids together. As privacy is a major concern, we hash all these ids before sending them to our messaging infrastructure.

We gather all the messages into hadoop file system logs and we do all the graph computation in hadoop. In order to make the cross device graph available online, we use an in-house application that is able to read the hadoop file system and write to couchbase.

archi

Graph computation infrastructure

Computing the graph offline is acceptable as it is not critical to take an email into account as soon as we discover it. The benefits is that we can recompute the whole graph every day.

Filtering

Since data is coming from thousands of specific integrations, it is critical to be able to filter bad data that can easily pollute the graph by matching all devices together. We use simple metrics at the hashed email or advertiser level in order to detect statistical outliers. For instance, hashed logins used on shared devices or too common hashed emails will be filtered.

Connected Components

A natural way to build the graph is to compute connected components, which is the set of neighbors at any given distance. Given good properties (small size, diameter), a connected component can represent a single user.

We have implemented efficient algorithms for MapReduce as described in Finding connected components in map-reduce in logarithmic rounds. As iterative algorithms tend to be very slow in MapReduce, these methods are minimizing the number of rounds to convergence while shuffling acceptable amount of data between mappers and reducers.

There is an ongoing study to use Spark and GraphX instead of MapReduce in order to avoid unnecessary I/O and shuffling at each iteration. Results are very promising so far as iterations last from 30 seconds to 1 minute using vertex distributed dataset provided by Spark.

Graph Study

We leveraged the igraph package to run network analysis. Because of the large size of the graph, we worked on a sampled version based on the connected components. This ensures that we get a fair representation of the cross device graph.

Results: 

  • 80% of connected components in graph have 2, 3 or 4 devices
  • In average, connected component are composed of 3.2 devices
  • Worldwide, users matched through our Universal Match solution generated 40% of our Revenue excluding Traffic Acquisition Costs, so that the full graph represents a large amount of our coverage
  • Our graph has been built with a sales focus which may have created some interesting characteristics which happen to work particularly well in the performance arena.
Data visualization of Cross device Graph sample

Data visualization of Cross device Graph sample

As some connected components are very large, we cannot always consider them as users. They represent in fact a group of users with shared devices. We used igraph algorithms to find clusters within big connected components. This allows us to identify users by splitting the large connected components on potential shared devices.

Looking Forward

We now have found many interesting trends within our internal graph. We are able to industrialize our workflow at large scale and make it available for both online usage and analytics studies.

Of course, this is just the beginning, cross device has lot of uses cases on the whole platform and opens lots of doors for the future, like improving prediction or handling shared computers. If you are excited by the challenges we tackle in this article, please have a look at our job offers for top engineers and scientists!

Post written by:


joel-181x200

Joel Trigalo

Senior Software Engineer R&D

Engine – Cross Device

https://github.com/jtrigalo

Zach-200x200

Zacharie Delpierre Coudert

Product Manager R&D

Product Engine – Creative

 

 

  • CriteoLabs

    Our lovely Community Manager / Event Manager is updating you about what's happening at Criteo Labs.