Monitoring your web services' latency on the load-balancer side

By: CriteoLabs / 06 May 2013

As a technology company we care a lot about Quality of Service and Experience:

  • No impact on our publishers from ads served too slowly
  • No impact on the latency of our advertisers' and customers' websites from slow tags

One of the key techniques we use to handle this problem is real-time latency monitoring of our HTTP services, so that we can react as soon as possible to any unexpected event.

The load-balancer, at the gates between your upstream network and your system infrastructure, is likely the best place to measure the service latency of your platform. Of course you must trust your network’s quality and capacity, which we do because we own and operate it.

After a failed test with a (now bankrupt) company that was developing FPGA-based load-balancers, we tested various solutions and found a vendor feature called AVR (Application Visibility and Reporting), which is part of the latest firmware release of a top player in this industry.

This solution outputs a large number of logs, which allow monitoring and graphing metrics like Server Latency, HitCount, Throughput and HTTP Statuses (amongst others that we don’t use) for any combination of virtual servers, pools, servers, URLs and more.

In our case, the metric that matters most is Server Latency. Here is the full list of metrics we gather:

  • HitCount, TPS (Average/Max) and Latency (Average/Max/Total) by backend server
  • HitCount, TPS (Average/Max) and Latency (Average/Max/Total) by backend server and per URL
  • HitCount, TPS (Average/Max) and Latency (Average/Max/Total) by virtual server (VIP)
  • HTTP Status codes by backend server and per URL

This is starting to look like serious HTTP latency monitoring…

Now for the gory details, here is how we do it:

1 – On the load-balancers we whitelist the URLs for which we want to collect AVR stats by pattern matching (we don’t want stats about the Internet noise).

when HTTP_REQUEST {
    # Collect AVR stats only for whitelisted paths; ignore everything else.
    switch -glob -- [HTTP::path] {
        "/bla/blabla.php" -
        "/nnn/john.aspx" -
        "/doe/*.jsp" { AVR::enable }
        default { AVR::disable }
    }
}
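To check a whitelist before pushing it to a load-balancer, the matching logic can be approximated offline. The sketch below uses Python's `fnmatch` as a stand-in for Tcl's `switch -glob`; the two behave alike for plain `*` patterns but are not guaranteed identical for every special character. The pattern list mirrors the iRule, assuming the third entry is meant to be `/doe/*.jsp`, and the helper name `avr_enabled` is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns mirrored from the iRule; the third one is assumed to be "/doe/*.jsp".
WHITELIST = ["/bla/blabla.php", "/nnn/john.aspx", "/doe/*.jsp"]


def avr_enabled(path: str) -> bool:
    """Return True when the path matches one of the whitelisted glob patterns."""
    return any(fnmatchcase(path, pattern) for pattern in WHITELIST)


for sample in ["/bla/blabla.php", "/doe/report.jsp", "/favicon.ico"]:
    print(sample, avr_enabled(sample))
```

Note that `*` also matches `/` here, so `/doe/a/b.jsp` would be whitelisted too.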

2 – The AVR logs are sent to a syslog-ng service, which in turn writes them to a syslog-ng unix-stream destination (a Unix socket).

destination avr_stream { unix-stream("/usr/local/var/run/avr.sid"); };
filter AVR { facility(local0) and match("AVR" value("MESSAGE")); };
log { source(inet); filter(AVR); destination(avr_stream); };

3 – On this same syslog server, we run a custom program called avr_feeder (code available upon request); a bash script reading a log file would do if you are running a small service. This program opens the unix stream, does a bit of parsing and sends the resulting messages to a graphite server.
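For readers who want a starting point, here is a minimal sketch of what such a feeder could look like. It is not the actual avr_feeder. The `key=value` payload format, the field names (`server`, `hitcount`, `latency_avg`, `latency_max`), the `avr.<server>.<metric>` naming and the graphite host are all assumptions made for illustration; the real AVR log format is different and needs its own parser. The sketch relies only on two known facts: a syslog-ng unix-stream destination connects to a socket that another process listens on, and carbon accepts plaintext lines of the form `path value timestamp` (port 2003 by default).

```python
import os
import socket
import time

SOCKET_PATH = "/usr/local/var/run/avr.sid"      # path used by the syslog-ng destination
GRAPHITE_ADDR = ("graphite.example.com", 2003)  # hypothetical host, default plaintext port


def parse_avr_line(line: str) -> dict:
    """Parse a HYPOTHETICAL 'key=value key=value' payload into a dict."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def to_graphite(fields: dict, now: int) -> list:
    """Build carbon plaintext lines: '<path> <value> <timestamp>' plus newline."""
    # Dots separate path components in graphite, so escape them in the server name.
    server = fields.get("server", "unknown").replace(".", "_")
    lines = []
    for metric in ("hitcount", "latency_avg", "latency_max"):
        if metric in fields:
            lines.append(f"avr.{server}.{metric} {float(fields[metric])} {now}\n")
    return lines


def serve() -> None:
    """Listen on the unix socket syslog-ng writes to and forward metrics to graphite."""
    if os.path.exists(SOCKET_PATH):
        os.unlink(SOCKET_PATH)
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(SOCKET_PATH)
    listener.listen(1)
    graphite = socket.create_connection(GRAPHITE_ADDR)
    while True:
        conn, _ = listener.accept()
        with conn, conn.makefile("r", encoding="utf-8", errors="replace") as stream:
            for line in stream:
                for out in to_graphite(parse_avr_line(line), int(time.time())):
                    graphite.sendall(out.encode("ascii"))
```

Calling `serve()` starts the loop. A production feeder would also need reconnection handling, input validation and aggregation, none of which is shown here.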

4 – On the graphite server we collect the data at a 5-minute resolution and keep 30 days of history (storage-schemas.conf).

# AVR / 5 minutes - 30 days of history
[avr]
pattern = ^avr.*
retentions = 5m:30d
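As a quick sanity check on what this retention costs, the small sketch below computes how many datapoints one metric holds under a `precision:duration` rule. The helper names are hypothetical; the unit letters follow graphite's retention syntax.

```python
import re

# Seconds per unit letter, as used in graphite retention strings.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec: str) -> int:
    """Convert a spec such as '5m' or '30d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", spec)
    if match is None:
        raise ValueError(f"unsupported retention spec: {spec!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def points_per_metric(retention: str) -> int:
    """Number of datapoints kept for one 'precision:duration' archive."""
    precision, duration = retention.split(":")
    return to_seconds(duration) // to_seconds(precision)


print(points_per_metric("5m:30d"))  # 8640
```

That is 12 points per hour over 720 hours. With Whisper storing 12 bytes per datapoint, each metric takes roughly 100 KB on disk, excluding the small file header, so the total footprint is driven mainly by how many server and URL combinations are tracked.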

This is a simplified diagram of the system in place:

[Diagram: avr-schema]

As a result, we are able to monitor latency and HTTP statuses for all of our services and compare them over the last 30 days.

Here are a few examples of things we can see thanks to this solution:

[Graph: g1]

A service that runs smoothly, except for a few peaks…

[Graph: g2]

Fixing a memory leak…

[Graph: g3]

Adding more servers in a pool in order to decrease peak latency…

[Graph: g4]

You know you’re doing it right when, for the same URL, your HitCount and http_200 graphs overlap.
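The same check can be done numerically instead of by eye. The sketch below is a hypothetical helper, not part of the setup described here: it takes two aligned series of per-interval HitCount and http_200 values and reports the intervals where the share of 200 responses falls below a threshold. The 0.99 default is an arbitrary value chosen for illustration.

```python
def success_ratio(hit_count, http_200):
    """Per-interval share of hits answered with HTTP 200 (None when there were no hits)."""
    return [ok / hits if hits else None for hits, ok in zip(hit_count, http_200)]


def diverging(hit_count, http_200, threshold=0.99):
    """Indexes of intervals where the two series stop overlapping."""
    ratios = success_ratio(hit_count, http_200)
    return [i for i, ratio in enumerate(ratios) if ratio is not None and ratio < threshold]


hits = [100, 100, 0, 200]
oks = [100, 90, 0, 199]
print(diverging(hits, oks))  # [1]
```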

[Graph: g5]

Somebody went too far with code cleanup…

[Graph: g6]

One of the URLs (red vs pink) is faster than the other on the same machine even when the traffic is low, interesting…

Thanks to this solution, which we were one of the first companies to deploy at scale (500K+ QPS, no sampling), and to our “quick and working” system for feeding these statistics into graphite, we now have a very good view of the health of our servers, platform and services. The 30-day window is comfortable enough to detect many issues that might otherwise have gone unseen.

Of course when you have such powerful and detailed statistics, you want to plug your alerting system into it… this will be discussed in a future article, so stay tuned!

Philippe Bourcier
