Every company has that one ultimate specialist who knows every tricky detail and who gets called as a last resort.
At Criteo, we pushed the concept further and have a dedicated team of engineers for lost causes and emergencies. Meet our firemen, the Escalation team!
WHERE DO WE STAND IN CRITEO?
Escalation is an SRE (Site Reliability Engineering) team, and our main roles in the SRE department are:
- Releasing Criteo code (160+ applications as of 2016) and configuration
- Minimizing the business impact of every escalated incident
RELEASING CRITEO CODE?
Since the number of applications to release is huge, we hold a twice-weekly meeting called Release Sync, aka “Rsync”, to arbitrate and coordinate this mess. In this meeting, we assess the changes to the production environment as a whole: application releases, configuration changes, network changes, database changes or any other tricky changes. Based on our experience, we decide what will be released or deferred, and we veto anything that seems too risky.
Those meetings give the R&D teams a complete view of how the production environment will change during the week.
We also process day-to-day application releases and configuration changes, with two members of the team dedicated to those tasks from Monday to Thursday, because, of course.
We handle incidents from many different sources. Our monitoring raises both technical and business alerts, and Local or Global Operations teams can report an issue at any time.
One way or another, a ticket is always created in our issue tracking system, and our role as the Escalation team is to:
- Do preliminary investigations and qualify the incident
- Contact the right team(s) to tackle the problem
- Follow up until the resolution of the incident
Depending on the business impact, we may also need to:
- Coordinate several teams in R&D (50+ teams as of 2016) to mitigate the issue
- Communicate with Business teams and other stakeholders on any progress
At the end of any critical incident, we write a postmortem document whose purpose is to:
- Describe the timeline of the incident
- Describe what we did to mitigate it
- Identify the root cause of the issue
- List what will be implemented to prevent it from happening again (non-recurrence actions)
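For the technically minded, the four parts of a postmortem listed above could be captured in a small record like the one below. This is only an illustrative sketch, not our actual tooling: the class names, the field names, the `missing_sections` helper and the placeholder ticket ID are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One timestamped event in the incident timeline (hypothetical shape)."""
    at: datetime
    event: str


@dataclass
class Postmortem:
    """Hypothetical record mirroring the four postmortem goals."""
    ticket_id: str  # placeholder identifier, not a real ticket format
    timeline: list[TimelineEntry] = field(default_factory=list)
    mitigation: str = ""
    root_cause: str = ""
    non_recurrence_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        missing = []
        if not self.timeline:
            missing.append("timeline")
        if not self.mitigation.strip():
            missing.append("mitigation")
        if not self.root_cause.strip():
            missing.append("root cause")
        if not self.non_recurrence_actions:
            missing.append("non-recurrence actions")
        return missing
```

A check such as `missing_sections()` would make it easy to see at a glance whether a postmortem still lacks, say, its root cause before it is circulated.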
Our team is available 24/7: on-call shifts cover out-of-office hours, holidays and weekends.
WHERE IS THE FUN IN ALL THIS?
Where is the fun in handling incidents, releases or configuration?
- You are the master of ceremonies and also the guardian of production. You validate a lot of changes (4,450+ releases and 3,500+ changes as of 2016)
- You interact daily with an insane number of people, whether to arbitrate changes to the production environment or to solve issues
- As a result, you get to know a lot of people in a short amount of time
- Every day is a new day, depending on the releases, changes, incidents
The best way is to hear it from the team members:
- Didier: “I learned the value of problem-solving from my Mom”
- Emmanuel: “I get so much satisfaction in solving incidents. The times I click ‘Resolve’ on a critical incident really make my days”
- Simon: “The fear in their eyes when I roam between the desks is somewhat funny. But the relief they express when I say I just come to say hello is funnier.”
- TingTing: “Here I succeed in maintaining fit by going up and down stairs to catch those culprit teams while harvesting some free foods when returning.”
- Mehdi: “Coordinating human skills and resources to solve an issue is like fighting on a man vs machine war, winning means that the day machines will overpower humans is delayed”
- Michel: “Seeing the blurry picture of how the humongous number of components of the Criteo architecture fit together becoming clearer and clearer weeks after weeks”
- Gilles: “Lot of satisfaction on getting more and more knowledge in the Criteo infra and on solving issues”
- Ruben: “As Escalation’s Engineering Program Manager, every day is full of challenges. Everything is going so fast, everything is possible #loveit”
WHAT CAN YOU LEARN DURING ONE YEAR?
You can get a good view of the Criteo architecture and of how its different components interact. What is really thrilling is how fast Criteo evolves! Something you learn about today can be obsolete in a couple of months, so our team has to keep pace. This is not an easy task, but we organize internal training sessions every month so that the team's knowledge stays up to date.
Lastly, you cannot get bored, and this is what makes this job awesome: you will always face new challenges as Criteo evolves. What is also fun is that every major incident we have faced is the basis for a wonderful story, like “this one time a new hire deleted every single application on Marathon” or “that other time I helped to fix internal tools’ access at 4 am for the Japan team” (but that might be material for another blog post).
Anyway, if you liked what you have read so far, don’t hesitate any longer and apply for this job. Go, go, GO!!!
Post written by:
DevOps Engineer – SRE, R&D
Operations Lead – SRE, R&D