Sixish years ago I was asked to take a couple of engineers and try to solve the “BI” problem at Criteo. While that in and of itself might merit a book, here I’ll spare you the twists and turns as we grew from a team of 4 to 40.
What I will do is:
– Provide *an* origin story for data engineering at Criteo for context (there are others since many orgs started producing data at similar times)
– Review the core competencies of a data engineer
– Discuss the principles of SRE as I see it
– Tie the last two points together in the presentation of the discipline of Data Reliability Engineering
Criteo is an organization that fosters innovation, both technically and organizationally, creating a safe space for engineers to think about how to make things better across the board. This is my attempt to use this safe (public) space to formalize my thoughts on data engineering. It is an imperfect document and I would love to hear your feedback.
Also, I am competent in the analytic side of data workloads and so this is clearly biased towards that. I don’t think this precludes the more transactional folks from projecting themselves into this analysis–I’m just not the person who can make those arguments.
We Have No Idea What We’re Doing
Recall that while the MapReduce paper was written in 2004, in 2012 Hadoop was still quite immature (some might claim that it still is). Cloudera’s first release was in 2009. Hortonworks formed in 2011 and Apache Hadoop 1.0 was released in December of 2011.
Back in these heady days of Hadoop circa 2012, we had a small cluster (60 measly nodes) running Hadoop, Hive and Cascading, a SQL Server, and hundreds of TBs of delimited text data, and we were charged with decommissioning gigantic sharded SQL Server clusters, fed by highly optimized bespoke C# tools, by building “something”.
So we did what any self-respecting engineering team does when it doesn’t know its tools and has no spec: just start banging away at the problem and see what works (if you squint).
Early Successes Lead to Mid-Life Crises
We ended up building lots of “somethings” and by and large these somethings worked pretty damned well, allowing us to integrate new and fancy components into the system. A PB-scale data warehouse was spawned in Hive. SQL Server was replaced with Vertica (for analytics–we still use SQL Server extensively elsewhere). SSRS with Tableau. We had the biggest Cascading jobs in production and we were heroes.
At some point things started to look a little wobbly, and the duct tape and wire holding everything together started coming undone.
Our Vertica cluster was crumbling under load (and losing its disks at an alarming rate). Tableau was such a success and there were so many dashboards published that not only was it crumbling as well, but no one knew which reports were right. Our biggest job was breaking the namenode and backfilling was a recurrent pain.
We were now no longer heroes.
Not all of the above happened at the exact same point in time (though more than I would like to admit did). Nor did we have a big stop-the-world moment where we stepped back and thought academically about the problem. As it turned out, we had been accumulating nuggets of wisdom along the way and just hadn’t really realized it until much later.
The funny thing about distributed systems is that they are a little bit all-or-nothing, and in some sense the formal proof of correctness is that the thing you put in production actually works and doesn’t break anything (there is also the well-known it-works-too-well-and-breaks-everything problem). Indeed, most of the insidious problems we’ve seen are of the garbage-in-garbage-out variety. This feature of the system makes it fairly easy, over time, to identify the patterns that work and distill a short list of best practices.
The Discipline of Data Engineering
Probably the best summary I’ve seen on the discipline of building data production workflows (which is just fancy speak for ETL) is from Maxime Beauchemin’s post on Functional Data Engineering. He touches on all the critical concepts, which I’ll review in my own way and in much less detail here.
The first thing we all learn is that data production should be data-driven and has nothing to do with the time of day. This simply means that the optimal moment for a job to start is not midnight, or maybe 1 o’clock in the morning because data seems to arrive then, but rather whenever it happens that all of its dependencies have been updated. These dependencies end up forming your data production DAG, or Directed Acyclic Graph.
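The idea can be sketched in a few lines. This is not Criteo’s scheduler, just a minimal illustration of data-driven triggering: a job becomes runnable the moment all of its upstream partitions exist, regardless of the clock. The dataset and job names are hypothetical.

```python
# A hypothetical two-job DAG: each job lists the datasets it depends on.
DAG = {
    "clicks_enriched": {"clicks_raw"},
    "daily_report": {"clicks_enriched", "displays_raw"},
}

def runnable_jobs(hour, available):
    """Jobs whose every dependency has a partition present for the given hour.

    `available` is a set of (dataset, hour) pairs, standing in for a real
    check such as a _SUCCESS marker in HDFS.
    """
    return [
        job
        for job, deps in DAG.items()
        if all((dep, hour) in available for dep in deps)
    ]

# Raw data for midnight has landed; only the first transformation can start.
available = {("clicks_raw", "2019-01-01T00"), ("displays_raw", "2019-01-01T00")}
print(runnable_jobs("2019-01-01T00", available))  # ['clicks_enriched']
```

Once `clicks_enriched` publishes its own partition, `daily_report` becomes runnable in turn; no cron expression is ever consulted.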
The next thing to learn is that the output data is one big immutable data structure to which (one hopes) you are only appending new partitions. Updates are very rare and usually caused by one of two things: data quality errors or functional changes to the data transformation logic. In either case they are of the delete-and-replace variety, and preferably atomic. Note that fancier deployments use append-only file systems under the hood, but to the end user this still appears as an atomic delete-and-replace. Regardless of their semantics, updates are inherently unsafe, as they present a splendid opportunity for clients of the data being updated to obtain incorrect results. To see this, just think of the poor job that requires a full day’s worth of an hourly data set that is in the process of being updated (deleted and replaced) one hour at a time.
Going forward when you read “update” think “delete and replace”.
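As a sketch of what delete-and-replace can look like, here is a toy implementation for a single partition, assuming a filesystem with atomic directory rename (which HDFS provides). The table layout and the `write_fn` callback are hypothetical, not Criteo’s actual tooling.

```python
import os
import shutil
import tempfile

def replace_partition(table_dir, partition, write_fn):
    """Write the replacement data to a staging directory, then publish it
    with a single rename, so readers never observe a half-written partition."""
    final = os.path.join(table_dir, partition)
    staging = tempfile.mkdtemp(dir=table_dir, prefix=".staging_")
    try:
        write_fn(staging)                 # produce the complete new partition
        backup = final + ".old"
        if os.path.exists(final):
            os.rename(final, backup)      # retire the old data
        os.rename(staging, final)         # one atomic publish step
        shutil.rmtree(backup, ignore_errors=True)
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

The point is that readers see either the old partition or the new one, never a mix; the file-by-file delete-then-rewrite alternative is exactly the hourly-update hazard described above.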
The need for updates also leads us to the notions of idempotence and at-least-once scheduling semantics. We all know that large distributed systems fail frequently. And since failures can happen at any time, our data production graph needs to be able to pick up where it left off after a crash, which is almost always at some unsafe point, such as after data has been produced but before the scheduler has recorded that fact. The simple solution to this problem is to ensure that all jobs are idempotent, so that the scheduler can replay from some known safe point without worry. The bonus is that this makes it easier to purposefully backfill for either of the previously mentioned reasons.
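A toy illustration of an idempotent unit of work under at-least-once execution: the success marker is written only after the output, so a replay after a crash either skips cleanly or overwrites its own partial output. The in-memory store and marker set here stand in for real table partitions and _SUCCESS files.

```python
# Hypothetical stand-ins for a partitioned table and its success markers.
store, markers = {}, set()

def compute(partition):
    # Pure function of the input partition: same input, same output.
    return f"rows for {partition}"

def run_job(partition):
    if partition in markers:                # previous run completed fully
        return                              # safe to skip: output is final
    store[partition] = compute(partition)   # delete-and-replace the output
    markers.add(partition)                  # recorded last, so a crash before
                                            # this line just means redoing work

run_job("2019-01-01")
run_job("2019-01-01")  # an at-least-once replay is a harmless no-op
```

Because `run_job` converges to the same end state however many times it runs, the scheduler is free to retry aggressively, and a backfill is just a deliberate replay over old partitions.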
Congratulations, with all of the above knowledge you now know how to build a replayable data production DAG that can be safely fixed or modified in the past and you are now a first class data engineer!
Data Engineering is Not Just Scheduling Semantics
While Maxime (and yours truly, I hope!) make a compelling case for how to schedule your ETL workflows, I did fib a bit when I suggested that we had covered the basics of data engineering. Unfortunately we’ve really only discussed how data transformation units of work should be stitched together. We’ve said nothing of their content–and funny thing, lots of people tend to care mostly about that!
Content is of course why we’re doing all of this. Folks want to be able to extract knowledge from raw data, either for use by humans or other systems (but still, ultimately humans). And in decades past, this was considered the hard part, since we had pretty comprehensive data platforms called databases at the ready. Many of you have probably heard of Stonebraker and DeWitt’s famous piece, “MapReduce: A major step backwards,” and while a bit disingenuous when taken at face value, it is an entirely accurate remark in the sense that, amongst other critiques, MapReduce is just a small part of a data platform and it came with very little of the rest. For a more optimistic perspective, I highly recommend Julien Le Dem’s talk, “From flat files to deconstructed database.”
So back in the olden days when we didn’t need to worry about the data platform, people worried about the data and the logic needed to get it from representation A to A’ and this was an entire discipline by itself. Think Ralph Kimball and data warehousing and then think of those old-school BI folks who were trying to translate a back office user’s data needs into correct normal forms via tools like SQL, ETL platforms (SSIS, Kettle, et al) or just simple scripts.
I know, I know, you’re reading this and thinking that the fun stuff is all about scale and functional correctness and can’t we just gloss over user needs and well understood things like SQL? Emphatically, no!
Believe me (and any BI data engineer): translating an analytic or reporting need into the appropriate functional units of work requires not only an understanding of the business needs (and preferably the business itself), but also of the constraints of the platform. Squaring these two things is essential to creating a reliable data platform that has content, which I hope we can agree is the only sort worth anything!
For the record, at Criteo we make use of Hive, Scalding and Spark for our data transformation units of work, and we have a small army of engineers worried precisely about how to implement, in the most efficient fashion, the ever-changing data needs of a fast-moving Internet company.
Many folks might actually consider that last statement to be the beginning and the end of the definition of a data engineer. I consider it the critical point of entry into the discipline, without which we don’t have to worry about messy things like scheduling semantics across our distributed data platform.
And What About That Platform?
To recap, we’ve discussed ETL scheduling semantics and the importance of the discipline of creating content, but we haven’t said anything about the underlying systems. It’s actually useful to consider this part near the end since the architecture of any system should follow the needs of the business (more on that later).
Nevertheless, in this day and age of “big” data, the systems are at the forefront of everyone’s mind, and so any self-respecting data engineer should have a solid understanding of them. This is where things like MapReduce (and its friends like Spark and Tez), Hadoop, columnar storage formats, RDBMS architecture and distributed systems in general are all important to understand. Also, know (of) those numbers that every engineer should know; that helps filter out the snake oil.
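For the curious, here is a small sample of those numbers (the approximate latency figures popularized by Jeff Dean; treat them as orders of magnitude, not benchmarks) and the kind of back-of-the-envelope reasoning they enable:

```python
# Approximate latencies in nanoseconds (classic "numbers every engineer
# should know"; orders of magnitude only).
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Disk seek": 10_000_000,
    "Read 1 MB sequentially from disk": 20_000_000,
}

# Back-of-the-envelope: scanning 1 TB sequentially from one spinning disk.
seconds = 1_000_000 * LATENCY_NS["Read 1 MB sequentially from disk"] / 1e9
print(f"~{seconds / 3600:.1f} hours")  # hence the need for parallel scans
```

Five-plus hours for a single-disk terabyte scan is the whole argument for distributing I/O across a cluster, and for columnar formats that avoid reading most of it in the first place.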
I’d like to point out that the fact that solid data engineers need to know these concepts to make even the slightest contribution at scale speaks directly to Stonebraker and DeWitt’s critique of MapReduce. Everyone is dissatisfied with much if not most of what is widely available, and is constantly looking for that magic component that will finally solve all their problems.
This might be the part where you as the reader, looking for insight into your platform’s issues, are disappointed to find out that this isn’t a detailed overview of Criteo’s architecture for data processing. Our architecture is precisely that, ours. It has come to be via the slow and imperfect optimization of the function parameterized by our start date, our (changing) scale, our (changing) competencies, and our (changing) budget. While you might be able to learn from our implementation, it’s most certainly not something you should model yours on.
Data Visualization is Part of the Platform
As a data engineer with all of this knowledge you might think, well, I can’t do everything and so I’m going to let someone else handle the data visualization stuff (note that Maxime Beauchemin created not only Airflow, but also Superset!).
The rub here is that data visualization is just another view of the data. The fact that select * from foo returns a tabular view in your SQL client is because that’s the most obvious way to render the data. Another way, of course, is to make a bunch of plots. Deciding which plots is difficult, but there’s research there, too.
Still, why should you as a data engineer worry about visualization? Well, primarily because visualization tools tend to generate queries that require huge amounts of memory and I/O from your data storage systems, and when done poorly they mean slow queries, which in turn mean both unhappy data systems administrators and unhappy data consumers.
Therefore, to get the most value out of a contentful data platform, you as a responsible data engineer should be thinking about how visualizations are being produced–at least for some critical subset of the platform.
What is SRE?
Site Reliability Engineering itself seems to be a discipline with many interpretations. Google themselves have written a book and blog posts that don’t seem to intersect perfectly. My take is that an SRE is an expert coder automating the deployment and operations of her system, evolving it based on client needs via discussions around SLIs, SLOs and SLAs.
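To make that vocabulary concrete for a data platform, here is a toy example (the figures are invented): the SLI is what you measure, the SLO is the target you agree on with clients, and the SLA is the contract built on top of it.

```python
# A toy SLI/SLO check for a data platform: freshness of hourly partitions.
# All numbers are invented for illustration.

def freshness_sli(on_time, expected):
    """SLI: measured fraction of partitions available within their deadline."""
    return on_time / expected

SLO = 0.99                      # target agreed with clients: 99% on time
sli = freshness_sli(712, 720)   # say 712 of 720 hourly partitions this month
print(f"SLI={sli:.3f}, SLO met: {sli >= SLO}")  # SLI=0.989, SLO met: False
```

A month like this one burns the error budget and is exactly the kind of signal that turns a vague “the data feels late” complaint into a concrete discussion with clients.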
A Quick Recap of Data Engineering as I See It
Recall that this is just one individual’s view on what data engineering consists of and I suppose as with anything it reflects my tendencies as an engineer generally. I am without a doubt a jack-of-all-trades-master-of-none and so my bias is for something along the lines of a Full Stack Data Engineer. That job description includes:
- Business understanding to accompany clients in the use of your system
- Expertise across a number of data transformation frameworks
- Knowledge of how to make data production resilient in the face of failure or changes
- Understanding of the systems underlying your platform (storage, processing, visualization)
What is Data Reliability Engineering?
A serious data engineer is an expert coder, automating the deployment and operations of her data system(s), evolving it based on client needs via discussions around SLIs, SLOs, and SLAs.
Because the data storage, processing and visualization world is multidisciplinary by nature, she is also a multidisciplinary expert across more than one of those domains and is able to accompany her clients in the best use of her platform.
At Criteo, Data Reliability Engineers work on everything from making systems like Hive, Presto, Vertica and Tableau function well at ever-increasing scale, to building tools like Cuttle to make workflows reliable, vizsql to get intimate with our favorite data transformation language, SQL, and SLAB, which gives us a better understanding of the state of our systems. We do all of those things, which is what made it obvious that we should be part of the SRE organization (which we are), all while continuing to contribute substantially to the volume of critical data production, feeding a fairly large Vertica deployment and Criteo’s predictive engine, amongst other things. We’ve also built outright data products that help the business operate on a day-to-day basis, and we send our engineers out to other teams as temporary voyagers when they can use a helping hand.
One thing I’ve purposefully avoided is talking too much about scale (which we have, what with 100PBs of data, hundreds of thousands of jobs executed daily, ~1000 data producers and ~10000 data consumers), because I find that scale causes us to worry about the implementation and not the principles, which if they’re any good should be applicable at any scale.
In a final act of hubris, I will suggest to you that all of the above applies to a shop of any scale–from the single DB deployments all the way up to Google.