At Criteo, we do a lot of machine learning to predict quantities critical to our daily business. In this article, we will show you how we use Hive, Mahout and Hadoop to quickly build and evaluate machine learning models during initial experimental phase and how these tools allow us to efficiently try new ideas at scale.
At the core of machine learning is data, and often a lot of it. We acquire the data which we are allowed to collect and match through various logs like previous advertisement displays, browsing history on our customers websites or clicks on product in our banners. These logs are stored on a Hadoop cluster and we use several technologies to facilitate map/reduce jobs writing. Among them, Hive appears to be very efficient at quickly writing a job that will crunch tons of data. Though the resulting job is not as efficient as a finely tuned map reduce job, Hive’s agility helps us to test new ideas and to iterate quickly.
Mahout is a framework for machine learning over Hadoop which includes implementation of many algorithms for classification, clustering and recommendation. We rely on the Naive Bayes algorithm for our classification needs. Up to date documentation is rather difficult to find and we advise the interested reader to invest in Mahout in action, Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman, ISBN 9781935182689.
The Mahout Naive Bayes input format is rather simple. The input is an HDFS directory containing text files. Each line of the text file is an example Mahout will learn from. The target is at the beginning of the line, followed by a tabulation and then a list of words corresponding to the features. For instance:
EN\tI am an example of English text FR\tJe suis un exemple de texte en français DE\tIch bin ein Textbeispiel auf deutsch ES\tSoy un ejemplo de texto en español
is a Mahout input file with four text snippets and four targets: the text language. Our data are in general not formatted like that. In practice, they come from various logs, are not well formatted or require some transformations. Tweaking the train set generation is a great way to improve results and often an iterative process: you have several ideas you want to test quickly without the pain of writing a full Hadoop job.
This is where Hive shines as a great tool to easily manipulate large data store. Generating a new training set only requires a few requests. Let us assume for instance that we have two tables sharing one specific key: one with several lines of “key event” (called event_table which corresponds to a server log for instance) and one with lines of “key target” (called target_table which corresponds to the ground truth). We want to train a model to predict the target given the events. The following queries will prepare a train and a test set:
First create the external table
CREATE EXTERNAL TABLE data_table(target STRING, features STRING) PARTITIONED BY ( type string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION 'TRAIN_AND_TEST_SET_LOCATION';
The idea here is to use Hive partition to separate between the train and the test set and to add several words separated by a space in the features column. Hive takes care of the Mahout format and the scalability.
Then populate the external table with data:
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; INSERT OVERWRITE TABLE data_table PARTITION(type) SELECT t2.target, concat_ws(" ", collect_set(t1.event)), IF(rand() < 0.1, 'test', 'train') FROM (SELECT key, target FROM target_table) t2 JOIN (SELECT key, event FROM event_table) t1 ON (t1.key = t2.key) GROUP BY t1.key, t2.target;
There are now two directories at the location of your table: one with a train set and one with a test set. They are separated using the 90/10 rule.
The following command trains a classifier:
mahout trainclassifier –i TRAIN_AND_TEST_SET_LOCATION/type=train/ -o MODEL_LOCATION
Which may now be tested using:
mahout testclassifier –d TRAIN_AND_TEST_SET_LOCATION/type=test/ -m MODEL_LOCATION
The end result of the test is a confusion matrix which looks like this:
Confusion Matrix ------------------------------------------------------- a b 10 2 | 12 a = Category1 2 11 | 13 b = Category
In this example, among the 12 test examples which were in Category1, 10 were correctly classified as Category1 and 2 where incorrectly put in Category2.
This example does not involve a lot of transformations to our data but Hive is able to do many more.The collect_set function groups all the events in one set and concat_ws builds the input string (it is similar to a join in other languages). These techniques allow us to build a model joining several tables, using filtering on event days, dropping some blacklisted events, etc.The confusion matrix is a great tool to quickly see if the prediction is improved. If your final objective requires the need of other metrics, you will have to implement them, but confusion matrices are a great way to quickly evaluate the quality of a classification.
Production models are not created using Hive but rather ad-hoc jobs and thus we will not discuss that here. Several Hive/ Mahout tools are also defined in elephant bird https://github.com/kevinweil/elephant-bird.
That’s it for today, if you have any Hive tricks for efficiently building a new dataset, do not hesitate to share them with us!
Our lovely Community Manager / Event Manager is updating you about what's happening at Criteo Labs.See DevOps Engineer roles