About a year ago I attended the Paris Hadoop Users Group (HUG) to listen to Marcel Kornacker discuss Impala, Cloudera's take on Dremel, and with it a new HDFS file format, Parquet. The talk was great: Impala promised significant improvements in query times over HDFS data, and Parquet was to be its native storage format while remaining execution-engine agnostic, meaning that in theory it would be easy to query Parquet from any of the popular high-level MapReduce frameworks (Cascading, Scalding, Pig, Hive, et al.).
Parquet, for those who don't know, is a columnar storage format for HDFS, similar in spirit to the likes of RCFile and ORC: data is laid out on disk column by column rather than row by row, so a query that touches only a few columns can skip the rest entirely.
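To make the columnar idea concrete, here is a minimal sketch using the pyarrow library (which is not part of this post and postdates it; the file name and column names are made up for the example). The point is the last call: the reader pulls back only the column it asks for, without scanning whole rows.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table; schema and values are purely illustrative.
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["fr", "us", "de"],
    "clicks":  [10, 0, 5],
})
pq.write_table(table, "events.parquet")

# Columnar win: read back only the column the query needs,
# skipping the bytes of every other column on disk.
clicks = pq.read_table("events.parquet", columns=["clicks"])
print(clicks.to_pydict())  # {'clicks': [10, 0, 5]}
```

This column pruning is the main reason columnar formats pay off for analytic workloads, where queries typically aggregate a handful of columns out of a very wide schema.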
RCFile (still in major use here at Criteo) and ORC are actually great storage formats. To give you an idea of RCFile's contribution at Criteo, consider that it gave us up to 20x improvements in query performance over our application-produced logs for typical analytic work. The only trouble is that we use lots of other frameworks to access and manipulate HDFS data, and writing and maintaining an RCFile compatibility layer for each is onerous. Also, RCFile has some shortcomings and has been superseded by ORC in Hive, meaning that if we want to stay ahead of the game in Hive we would need to migrate lots of data to ORC and then look at how to make ORC available in the different execution engines.
Or we could consider the execution-engine-independent Parquet format instead. Of course, at the time (jumping back to last spring), Parquet didn't have a Hive compatibility layer. After speaking a bit more with Marcel, and considering our relatively massive use of Hive, it seemed obvious to commit to contributing the Hive code, which we did!
For those interested in some of the gory details, you can find the initial patch to the Hive trunk here.
You can also find Cloudera's announcement here.
Though you’ll see my name on the Jira ticket, most of the heavy lifting was done by Mickaël Lacour and Remy Pecqueur.
We've already started migrating data to Parquet, and we gave a talk on the subject at Twitter's Open Source Conference back in April; you can find an earlier blog post about it here.
Author: Justin Coffey, Senior Dev Lead at Criteo Paris

Justin Coffey is a senior staff dev lead at Criteo in charge of the Analytics Infrastructure team. He oversees (and even manages the occasional contribution to) the development of better tools to manage the petabytes of analytic data employed by hundreds of Criteo analysts and engineers across the world. With over 15 years of experience working on the Internet, Justin has worked with web technologies since their inception. Prior to Criteo, Justin worked at a number of Internet startups as a hands-on engineering manager, helping drive explosive growth at the early stages.