Better ask for forgiveness than for permission: meet Criteo Labs’ Matthieu Blumberg

By: CriteoLabs / 26 Oct 2017

Matthieu Blumberg, Engineering Director R&D. Photo: CTO Pizza

This article originally appeared on CTO Pizza on September 21st, as part of the series of interviews conducted by Alban Dumouilla to unearth some of the daily challenges of CTOs and tech leads around a pizza.

Who’s the guy managing Criteo’s huge infrastructure? 

It’s very important to give people meaning in their work, so that they can see how what they are doing is contributing to the big picture of the company

Let’s talk about your background first. Who are you?

I started playing around with computers at the age of 6 or 7, at the Beaubourg Library when it started getting some digital equipment. I got my first computer, an Atari ST, in the ’80s. During my studies, I quickly became a systems administrator and ended up working for a company that provided backup services and helped prevent the ‘Year 2000 problem’.

I was right in the middle of the first Internet bubble, in a company that was managing online documentation. It got bought by an American company and shut down right after. I then moved to the INA (note: France’s public institution managing audiovisual assets) and worked on a platform for real-time restoration of video archives. It was really interesting, and I was really passionate about this job. It was a European project and we had to deep dive into RTLinux patches to manage the video streams live, which was not easy at the time.

But in 2002, following a change of government, public research budgets were cut and I lost my job. I joined a hosting company (LinkByNet), and then had my first kid. I have a tendency to switch jobs for every child I have!

I met Olivier Poitrey (founder of Dailymotion), who had big optimization issues. In 2006, the site’s traffic was already hitting several gigabits per second, which was absolutely tremendous at the time. He wanted to optimize delivery of the videos over the web. While I was there, the traffic grew to 120 Gbps, which at the time was in the world of hyperscale. We were a team of about 10 people, and the company had just raised $7M. The key success factor was the 2007 presidential election: thanks to Dailymotion, it was the first time such a democratic event could be supported by online video.

We built a platform capable of hosting 2 petabytes (2 million gigabytes) of data, with a very small tech team and a flat hierarchy in the company.

It was a lot of fun for almost 4 years, and in 2009 I finally moved to Pixmania, which wanted to build a white-label e-commerce platform. That didn’t work out too well, and I joined Criteo in 2011.

Can you describe your job at Criteo as of today?

I’m a manager of managers. It’s very different from being an individual contributor, as my goal is to help the teams give the best they can and to give them all the tools they need to be as autonomous and efficient as they can be.

It’s very important to give people meaning in their work, so that they can see how what they are doing is contributing to the big picture of the company.

It’s a very competitive market in terms of talent: it’s hard to hire, and we need to differentiate our core values from the rest to be able to attract the best. Tech is sexy, but it’s not enough to attract the best engineers.

Criteo is a tech company. There’s a big tech culture, but as the company grows, we’re doing everything we can to avoid the inertia that follows growth. For example, to be able to reform some parts of our architecture, there are more and more people who need to be convinced. So one of my jobs is to keep in sync with everyone concerned in the company to keep moving.

Let’s talk about this infrastructure

We serve about 4 million requests per second.

Can you first give us some numbers, so we know what we’ll be talking about? (August 2017)

  • We serve 5 billion banners a day
  • That’s about 4 million requests per second
  • We manage about 25,000 servers (up from 18,000 in January)
  • We add about 150 to 300 Terabytes of data per day in our Hadoop cluster
  • We operate 16 datacenters all over the world
  • We do machine learning on $550B of e-commerce sales
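
As a rough sanity check on how these figures relate to one another, the arithmetic can be sketched as follows. This is a back-of-the-envelope calculation, not something stated in the interview: it assumes the per-second rate is a flat average over 24 hours, and the variable names are made up for illustration.

```python
# Back-of-the-envelope arithmetic on the figures quoted above (August 2017).
# Assumption: rates are treated as flat averages over a full day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

banners_per_day = 5_000_000_000
requests_per_second = 4_000_000

banners_per_second = banners_per_day / SECONDS_PER_DAY
requests_per_day = requests_per_second * SECONDS_PER_DAY
requests_per_banner = requests_per_day / banners_per_day

print(f"{banners_per_second:,.0f} banners per second on average")
print(f"{requests_per_day:,} requests per day")
print(f"{requests_per_banner:.1f} requests per banner served")
```

Under these assumptions the request rate is far higher than the banner rate, which suggests that the 4 million requests per second cover more than banner displays alone. The interview does not break this down.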

Can you describe the infrastructure?

It can’t really be described as a whole, as it is always in an ‘in-between’ state. We’ve had some serious growth (100x in 6 years) and just added 8,000 machines in the last 8 months. We’re basically torn between the need for high growth and the need to make things better, so there’s no real way to describe the infrastructure, as it won’t be the same tomorrow. When we decide on a new standard, by the time we have actually rolled it out across the entire infrastructure, we might have switched to something else.
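
To put those growth figures in perspective, here is a small illustrative calculation. It rests on assumptions that are not in the interview: growth is treated as smooth compound growth, and a month is approximated as 30 days.

```python
# Illustrative arithmetic only; the inputs are the figures quoted above.
# Assumptions: smooth compound growth, and 30-day months.
growth_factor = 100   # "100x"
years = 6
annual_factor = growth_factor ** (1 / years)

machines_added = 8_000
months = 8
machines_per_month = machines_added / months
machines_per_day = machines_per_month / 30

print(f"about {annual_factor:.2f}x per year")
print(f"about {machines_per_month:,.0f} machines per month, "
      f"or {machines_per_day:.0f} per day")
```

In other words, under these assumptions the infrastructure would have slightly more than doubled every year, with machines arriving at a rate of roughly a rack's worth per day.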

We are working very closely with the Site Reliability Engineering team (the people installing the systems running on our hardware) so that we no longer have to think in terms of machines, but in terms of racks. A single machine does not exist for us. We can definitely lose an entire rack (25 to 45 machines) without losing any data or impacting any service. Being resilient enough to survive a datacenter outage is also a strong concern.
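
One generic way to tolerate the loss of a whole rack is to keep every copy of a piece of data on a different rack. The sketch below illustrates that idea only. The interview does not describe how Criteo actually places data, and the function names, rack names and replica count of 3 are all hypothetical.

```python
# Hypothetical sketch of rack-aware replica placement.
# Not Criteo's implementation; names and the default of 3 replicas are assumptions.

def place_replicas(item_id, racks, replicas=3):
    """Return `replicas` (rack, machine) pairs, each on a different rack.

    `racks` maps a rack name to its non-empty list of machine names.
    """
    rack_names = sorted(racks)
    if replicas > len(rack_names):
        raise ValueError("need at least as many racks as replicas")
    start = item_id % len(rack_names)
    placement = []
    for offset in range(replicas):
        # Walk consecutive racks so no rack is picked twice.
        rack = rack_names[(start + offset) % len(rack_names)]
        machines = racks[rack]
        placement.append((rack, machines[item_id % len(machines)]))
    return placement


def survives_rack_loss(placement, lost_rack):
    """True if at least one replica sits outside the lost rack."""
    return any(rack != lost_rack for rack, _ in placement)


racks = {
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2", "b3"],
    "rack-c": ["c1", "c2", "c3"],
    "rack-d": ["d1", "d2", "d3"],
}
placement = place_replicas(item_id=7, racks=racks)
print(placement)
print(all(survives_rack_loss(placement, rack) for rack in racks))
```

Because each replica lands on a distinct rack, losing any single rack removes at most one copy, which is what makes it possible to reason about racks rather than individual machines.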

Are you planning any big change within your team or infrastructure?

We just bought Hooklogic, which allows us to enter the brands’ market. We used to target only re-sellers and now we can target brands directly, which changes the business quite a bit.

It’s a challenge because Hooklogic’s infrastructure was entirely in Amazon’s cloud, and we need to integrate it into ours. There are also 150 people joining Criteo, so it’s as much a social challenge as it is a technical one.

The acquisition also means that we have a new R&D office in Michigan (other offices: Palo Alto, Grenoble, Paris).

Can you describe a crisis you went through and how you solved it?

We have what we call “code reds” from time to time. The last big one was in June, two months ago, when our Sunnyvale datacenter dropped off the face of the Internet. It went dark because of an electrical outage. On a Sunday!

About 50 people mobilized very quickly, and that’s what I really love about the team. Everybody is super involved and will jump in on a Sunday when we’re in trouble. For 45 minutes of electrical outage, it took us about 5 hours to get everything back online.

We had set up some failover logic between datacenters and everything worked out just fine, but we got lucky, as the outage was at 5 in the morning on a Sunday. I don’t think things would have been as smooth if it had happened during a peak hour.

The life of an engineering director

It’s better to ask for forgiveness than for permission

What’s your hardest challenge?

Hiring, and finding new ways to save money while scaling the infrastructure. For a tech company like ours, the infrastructure is the cost of revenue, whereas in most companies it’s a cost of sales. So the more we can save on the infra, the more profit we can make. My team spends the most money in the company!

I’m eyeing some great hardware and infrastructure initiatives that help cut costs while scaling, like the Open Compute Project led by Facebook. The hardest part is to do all that without reducing our efforts on resiliency!

Read the full article here

 
