February 18th, 2016

Real-time metrics on tracker calls

Several months ago, the Integrate team in our R&D built a solution allowing live feedback and metrics on tracker calls. This solution, based on Kafka, Druid and Storm, was built with scaling in mind.


Tracker integration was the least easy step of advertiser enrollment at Criteo. Engineers help clients achieve this step, but even with their precious help it took a long time for an advertiser to get all of its trackers up and ready. There were several reasons for this delay, but the most annoying one was that there was no immediate feedback on our clients’ trackers: our technical account managers had to wait for business metrics to know whether the tags were well implemented or not, making the iterations painful.

To address this, the Integrate team built a solution to help clients by giving them live feedback on their integration: the Tag Validation Dashboard.

What we wanted to achieve

  • We wanted to show metrics on format issues advertisers could have with their JavaScript trackers, but also on mismatches between their trackers and their product feed.
  • We wanted these metrics to be real-time, so that when our clients correct an issue, they see an improvement within a few seconds.
  • We also wanted very fine-grained data for each of our clients, over a period of at least 24h, in order to be able to provide them with a complete event report.
  • And finally, we wanted it to be available worldwide.

Given the situation, we had a few technical challenges to overcome, so we needed to choose our architecture and technology stack carefully to fit our needs.

How we did it

The solution is based on the existing process handling all user events. Each time a user sees a product or puts a product in their basket, Criteo receives an event and stores it in order to display a relevant ad later on. We needed to plug our solution in just after Criteo receives the event.

The first solution we iterated on followed this simple model:

  1. Audit the tracker events: check the event for mandatory parameters and parameter formats, check whether the event relates to one or several products, check whether these products actually exist in our system (they can be missing due to an incomplete product feed, or simply because the advertiser passed us a wrong product id)… and compute an audit event from all of this.
  2. Send this audit event to Kafka, the famous Apache messaging system, where a global-scale mirroring system has been set up, allowing us to aggregate data from all around the world.
  3. Consume Kafka from Druid, a really nice column-oriented distributed data store built on a delta architecture, allowing us to run sub-second queries on the huge number of metrics we needed to compute.
  4. Finally, query Druid from the Integrate website to display the live metrics.
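The audit in step 1 can be sketched in a few lines. This is a minimal Python sketch of the idea only: the field names (`advertiser_id`, `event_type`, `product_ids`) and the specific checks are illustrative assumptions, not Criteo’s actual event schema.

```python
# Sketch of step 1: auditing a tracker event before sending it to Kafka.
# Field names and checks are illustrative, not Criteo's real schema.

MANDATORY_FIELDS = ("advertiser_id", "event_type")

def audit_event(event, known_products):
    """Check a raw tracker event and return an audit event summarising issues."""
    issues = []

    # Check that mandatory parameters are present.
    for field in MANDATORY_FIELDS:
        if field not in event:
            issues.append("missing_parameter:%s" % field)

    # Check parameter format (here: advertiser_id must be a positive integer).
    adv = event.get("advertiser_id")
    if adv is not None and (not isinstance(adv, int) or adv <= 0):
        issues.append("bad_format:advertiser_id")

    # Check whether the referenced products exist in our system.
    for pid in event.get("product_ids", []):
        if pid not in known_products:
            issues.append("unknown_product:%s" % pid)

    return {"event": event, "issues": issues, "valid": not issues}
```

An event referencing an unknown product id would thus produce an audit event flagged with `unknown_product:<id>`, which is exactly the kind of signal the dashboard surfaces.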

This model was pretty satisfying except for one problem: we didn’t have the right information on the products when we queried for them.

During event tracking, Criteo needs to retrieve the products concerned by the call and see whether they exist in its system. For performance reasons, in the normal process Criteo calls a “lazy refreshing” memcache, and not directly the real product store, which is a Couchbase cluster. A “lazy refreshing” memcache answers “not here” if the entry is not in cache, while asking for an asynchronous reload of the entry. This means that when you call the memcache for a product and receive a response telling you it does not exist, the product either really does not exist, or exists but is not in the memcache at that moment. Why a “lazy refreshing” cache, you may ask? Because we don’t want to keep in memory all 4 billion products we have in Couchbase. This is perfectly fine for already-integrated advertisers, since their most common products are already in cache, so cache misses due to cache laziness are few. But in Integrate’s case the advertisers are new, so none of their products are expected to be in cache.
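The “lazy refreshing” behaviour described above can be simulated in a few lines. This is a sketch of the concept only, with illustrative class and method names; the real cache is a distributed memcache backed by Couchbase, not an in-process dictionary.

```python
# Sketch of a "lazy refreshing" cache: a miss answers "not here" immediately
# and schedules an asynchronous reload, so a *later* call may see the value.
# Names are illustrative, not Criteo's actual code.

class LazyRefreshingCache:
    def __init__(self, backing_store):
        self._cache = {}
        self._backing_store = backing_store  # stands in for the Couchbase cluster
        self._pending = set()

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        # Cache miss: answer "not here" and request an asynchronous reload.
        self._pending.add(key)
        return None

    def run_pending_reloads(self):
        # In the real system this happens asynchronously in the background.
        for key in self._pending:
            if key in self._backing_store:
                self._cache[key] = self._backing_store[key]
        self._pending.clear()
```

Note that the first `get` of an existing product returns `None` even though the product exists, and only a later call (after the reload) finds it: this is exactly the ambiguity that made the cache unusable for newly integrating advertisers.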

To fix this issue for the Tag Validation Dashboard, we added a Storm topology that queries the real products in Couchbase and updates the audit data with the actual status of each product. Here is how the final architecture looks:

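The enrichment the Storm topology performs can be sketched as a single bolt-like function: for each audit event, re-check the “unknown product” issues against the real product store and drop the ones that were only cache misses. This is a Python sketch of the logic only (the real version is a Storm bolt querying Couchbase), with illustrative names and an assumed audit-event shape.

```python
# Sketch of the Storm enrichment step: re-check "unknown_product" issues
# against the real product store and keep only the confirmed ones.
# Names and the audit-event shape are illustrative assumptions.

def enrich_audit_event(audit_event, real_product_store):
    """Return the audit event with cache-miss-only product issues removed."""
    confirmed = []
    for issue in audit_event["issues"]:
        if issue.startswith("unknown_product:"):
            pid = issue.split(":", 1)[1]
            if pid in real_product_store:
                # The product really exists; the memcache was just lazy.
                continue
        confirmed.append(issue)
    return {"event": audit_event["event"],
            "issues": confirmed,
            "valid": not confirmed}
```

Only after this pass does an `unknown_product` issue shown on the dashboard mean the product is genuinely absent from the catalog.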

Tag Validation Dashboard Results

On top of this technology stack we built a user-friendly UI letting our clients easily understand what they needed to improve on their tags.


Thanks to this feature, our first clients were integrated 4 times faster, which is a great achievement, but we don’t want to stop here. Currently only a few clients are integrated via Integrate, as the product is in beta, but we built our architecture with scaling in mind, since Criteo has more than 10,000 clients generating 35 million calls per minute on an average day.

Looking forward

The Tag Validation Dashboard is a first step in helping advertisers integrate more easily and faster with Criteo, but we don’t plan to stop here. On the tracker side, we will soon release a feature allowing users to do live debugging on their tracker pages, which will complete the set of tracking tools an advertiser can use to integrate easily.

We are also working on metrics for the feed import, allowing users to get feedback when they upload their feed, even before the feed is fully downloaded!

Photo credit: Our technology stack


Resources:

https://storm.apache.org/, http://druid.io/, http://kafka.apache.org/

Post written by

Benoit Jehanno – Software Engineer, R&D Platform

Camille Coueslant – Senior Software Engineer, R&D Platform
