Last week, Mickaël Lacour and Justin Coffey went to Twitter’s OSS #conf to talk about our work migrating over 1PB of data from RCFile to Parquet.
We import over 30B records per day into our Hadoop cluster, where a large user base runs thousands of queries and many hundreds of aggregations execute throughout the day. The benefits of columnar storage engines for analytic workloads are numerous (http://en.wikipedia.org/wiki/Column-oriented_DBMS and http://research.google.com/pubs/pub36632.html are good places to start), and given our heavy usage of Hive, we quickly opted for the RCFile format.
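For readers new to the idea, here is a toy sketch of why a columnar layout suits analytic queries. It is not how RCFile or Parquet are implemented; the records, field names, and helper functions are all made up for illustration. It only shows that an aggregation over one field touches far fewer values when each field is stored contiguously.

```python
# Toy illustration only: three hypothetical records with made-up fields.
rows = [
    {"user_id": 1, "country": "FR", "clicks": 3},
    {"user_id": 2, "country": "US", "clicks": 5},
    {"user_id": 3, "country": "FR", "clicks": 2},
]

# Row-oriented layout: all fields of a record are stored together.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: one contiguous list per field.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def total_clicks_row(store):
    # Walks whole records even though only the third field is needed.
    return sum(record[2] for record in store)


def total_clicks_column(store):
    # Reads only the "clicks" column; the other columns are never touched.
    return sum(store["clicks"])


def values_touched_row(store):
    # A row scan has to pass over every field of every record.
    return sum(len(record) for record in store)


def values_touched_column(store):
    # A column scan passes over a single field per record.
    return len(store["clicks"])
```

Both functions return the same total, but the row scan passes over every field of every record while the column scan reads a single list. On wide tables with billions of records, that difference is a large part of the appeal of columnar formats.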
RCFile has been great to us, but we wanted to move away from a Hive-only solution and towards a more open format that would be easy to use across any hadoopian execution engine (think Scalding, Pig, Spark, Impala, etc.). Parquet (http://parquet.io/) looked like the perfect fit, but it lacked Hive support. We went ahead and contributed that layer, then got to work putting it into production.
It is currently live on a few of our largest datasets. We are working on moving over the rest and expect to complete the job in the next month or so.
We are super excited about this. We look forward to taking advantage of all the work being done in the Parquet world (new encodings, indexes, and more), and to having the flexibility to start looking at alternatives to Hive for analytics work.
finally at #CONF we have @jqcoffey wrapping things up by talking about @ParquetFormat usage at @CriteoEng pic.twitter.com/PZDZzU1aRb
— Twitter Open Source (@TwitterOSS) April 3, 2014