How to find the root cause of latency in a distributed environment

By: CriteoLabs / 11 Sep 2017

If you’re already running a horizontally-scaled set of services and having trouble pinpointing outlier high-latency cross-service calls, skip ahead to “Introducing Zipkin”.

Creating a tiny web application is not a complicated task nowadays. You have plenty of frameworks that help you abstract your logic. Once you choose the framework that suits you the best, you begin to code the logic behind your idea. You write the web application and put it online.

Users begin to report some bugs. Hopefully, you are able to reproduce them and for the trickiest ones, you can even attach a debugger and understand where the problem comes from.

You have created a great web application and it begins to be successful! That’s really wonderful news, but people are starting to experience some lag. It’s probably due to the load, but as a professional you want to check this assumption, so you add metrics. You finally realize that the CPU usage and the memory consumption are sky high and so… you get a bigger server.

The bigger server has been handling the extra load for a month now but more and more people are really enjoying your web application and you are getting more and more users. The problem now is that you can’t afford a bigger server: you now need to split your application into services.

You discover the problems around splitting a large application into services (service discovery, load balancers, distributed caches…) but you end up having a pretty solid architecture that scales horizontally.

That’s wonderful: now only a few users experience latency. They report that they sometimes see some lag, but you don’t really know how to reproduce their issue.

Thankfully, you introduced a metric of the time spent processing a request. You take a look at the mean value and you see absolutely nothing. However, you are now trained to understand metrics and you know that the mean doesn’t really show you the worst cases. So you take a look at the 99th percentile, and it is skyrocketing!
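To see why the mean hides such problems, here is a small illustration in Python. The latency samples are made up for the sake of the example, not real measurements:

```python
import statistics

# Hypothetical latency samples in milliseconds: 99% of requests are fast,
# but 1% of them are two orders of magnitude slower.
latencies_ms = [10] * 990 + [1500] * 10

mean = statistics.mean(latencies_ms)   # 24.9 ms -- looks perfectly healthy

# Simple nearest-rank estimate of the 99th percentile.
ranked = sorted(latencies_ms)
p99 = ranked[int(len(ranked) * 0.99)]  # 1500 ms -- the outliers show up here

print(f"mean = {mean:.1f} ms, p99 = {p99} ms")
```

The mean barely moves while the 99th percentile exposes the slow requests, which is exactly why you watch percentiles, not averages.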

At least you now know that something is wrong with the latency, so you take a look at the 99th percentile of each of your services. However, these values don’t show anything wrong and stay within correct bounds.

You have no idea what could explain the problem. You are becoming nervous because users don’t like applications that lag and you need to find a solution, quick!

You’re desperate and are looking for any clue in the logs but you get so many requests per second that it is kind of hard to correlate log lines to a particular request.

When it comes to distributed services, you need distributed tracing to understand which one is responsible for the latency problems. This is why we integrated Zipkin at Criteo.

Introducing Zipkin

Zipkin is defined as a distributed tracing system. From http://zipkin.io, “it helps gather timing data needed to troubleshoot latency problems in microservice architectures. It manages both the collection and lookup of this data. Zipkin’s design is based on the Google Dapper paper.”

It works by assigning every request a unique id, which we call a “correlation ID”. This correlation id remains the same as the request is passed across multiple services.
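As a sketch of the idea, here is a simplified illustration (not Zipkin’s actual instrumentation; real Zipkin libraries propagate ids through B3 headers such as X-B3-TraceId, along with span and sampling information):

```python
import uuid

TRACE_HEADER = "X-B3-TraceId"  # Zipkin's B3 trace-id propagation header

def handle_incoming(headers):
    """Reuse the caller's correlation id, or start a new trace at the edge."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex[:16]

def outgoing_headers(trace_id):
    """Attach the same id to every downstream call."""
    return {TRACE_HEADER: trace_id}

# Service A receives an external request with no id and generates one...
trace_id = handle_incoming({})
# ...then forwards that same id to Services B and C, which reuse it,
# so every log line and timing record can be tied back to one request.
assert handle_incoming(outgoing_headers(trace_id)) == trace_id
```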

Let’s take an example: An incoming request goes to Service A which calls Service B and Service C to compute the result.

When a request is sent to Service A, you’ll see in the Zipkin UI that it was also sent to Service B and Service C.

As shown in the picture above, the Zipkin UI shows you the path the request took through your architecture and how much time was spent in each service.

In this particular example, you can see the time spent in each service:

Service name     Time spent (milliseconds)
Service A        11.959
Service B        6.126
Service C        1.948

But this is not the only information we can extract. We can also see that Service C is called only after getting the response from Service B. Is that what you expected when you wrote the code? Could it be optimized by making the calls to B and C in parallel?

But 12 milliseconds should be unnoticeable to a user: we need to find the requests that took much longer to be processed.

Luckily, the client bug report is precise: the lag occurred between 2:55pm and 3:10pm on the 29th of August. Now you simply ask Zipkin for the longest requests in that timeframe and… bingo!

A request that took more than one second! If you dig into the trace, you get a full view of the details:

It seems that service C has been called eight times sequentially!

The next step is to take a look at what each service does, and you realize (or remember) how your code works:

  • Service A calls Service B to get a list of items
  • Service A then calls Service C once per item in the list

If the list is small, the users won’t notice any lag; otherwise, response time will grow linearly with the number of items.
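The pattern above can be sketched like this, using hypothetical stand-ins for Services B and C with a made-up cost of about 2 ms per call:

```python
import time

def call_service_b():
    """Hypothetical stand-in for Service B: returns a list of item ids."""
    return list(range(8))

def call_service_c(item):
    """Hypothetical stand-in for Service C: each call costs ~2 ms."""
    time.sleep(0.002)
    return item * 2

def handle_request():
    items = call_service_b()
    # One sequential call to C per item: latency grows with len(items).
    return [call_service_c(item) for item in items]

start = time.perf_counter()
results = handle_request()
elapsed_ms = (time.perf_counter() - start) * 1000
# Eight sequential ~2 ms calls add up to at least ~16 ms end to end,
# and the trace shows exactly this staircase of calls to Service C.
```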

You then realize that you could take two distinct actions here: 

  • Parallelize calls to Service C
  • Send only one request to Service C with the list returned from B
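Both options can be sketched in the same toy setup. The batch endpoint here is hypothetical; in practice Service C would have to expose one:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def call_service_c(item):
    """Hypothetical per-item call to Service C (~2 ms each)."""
    time.sleep(0.002)
    return item * 2

def call_service_c_batch(items):
    """Hypothetical batch endpoint: one round trip for the whole list."""
    time.sleep(0.002)
    return [item * 2 for item in items]

items = list(range(8))

# Option 1: keep the per-item calls, but issue them in parallel,
# so total latency is roughly that of the slowest single call.
with ThreadPoolExecutor(max_workers=len(items)) as pool:
    parallel_results = list(pool.map(call_service_c, items))

# Option 2: a single batched request, one round trip for everything.
batched_results = call_service_c_batch(items)
```

Either way, the trace in Zipkin would collapse from eight sequential spans into one short span (or eight overlapping ones).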

We have only scratched the surface of what Zipkin can bring to your distributed services investigations. For example, you could add the correlation id to your logs or add contextual information about the request.

At Criteo, we use Zipkin on 20K machines in 7 different datacenters all around the world. We handle a lot of requests (up to 2.5M per second for one service) and tracing is enabled on some requests (with a sampling rate) on all applications without any impact on application performance. 

Feel free to go to http://zipkin.io for more details and join the community!

Post written by:


Pierre-Hugues Jeanneret

Staff Software Engineer, R&D

Twitter: jeanneretph

 
