This dataset contains feature values and click feedback for millions of display
ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction.
It is similar to, but larger than, the dataset released for the Display Advertising
Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge
This dataset contains 24 files, each one corresponding to one day of data.
The training dataset consists of a portion of Criteo’s traffic over a period
of 24 days. Each row corresponds to a display ad served by Criteo and the first
column indicates whether this ad has been clicked or not.
The positive (clicked) and negative (non-clicked) examples have both been
subsampled (but at different rates) in order to reduce the dataset size.
There are 13 features taking integer values (mostly count features) and 26
categorical features. The values of the categorical features have been hashed
onto 32 bits for anonymization purposes.
The semantics of these features are undisclosed. Some features may have missing values.
The rows are chronologically ordered.
The columns are tab separated with the following schema:
<label> <integer feature 1> … <integer feature 13> <categorical feature 1> … <categorical feature 26>
When a value is missing, the field is just empty.
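As an illustration of the schema above, here is a minimal Python sketch that splits one row into its label, 13 integer features, and 26 categorical features, mapping empty fields to None. The names parse_row and iter_rows are hypothetical helpers, not part of any official tooling. The sketch assumes that each day file is gzip-compressed text (as the .gz extension suggests) and keeps categorical values as raw strings, since their textual encoding is not specified here.

```python
import gzip
from typing import Iterator, List, Optional, Tuple

NUM_INT = 13  # integer features per row, per the schema
NUM_CAT = 26  # categorical features per row, per the schema

Row = Tuple[int, List[Optional[int]], List[Optional[str]]]


def parse_row(line: str) -> Row:
    """Split one tab-separated row; empty fields become None."""
    # Strip only the line terminator: stripping all whitespace would
    # silently drop trailing tabs, i.e. trailing missing values.
    fields = line.rstrip("\r\n").split("\t")
    expected = 1 + NUM_INT + NUM_CAT
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    label = int(fields[0])
    ints = [int(v) if v else None for v in fields[1:1 + NUM_INT]]
    # Assumption: categorical values are kept as opaque strings.
    cats = [v if v else None for v in fields[1 + NUM_INT:]]
    return label, ints, cats


def iter_rows(path: str) -> Iterator[Row]:
    """Stream rows from one day file without loading it into memory."""
    # Assumption: the file is gzip-compressed text.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        for line in f:
            yield parse_row(line)
```

Because the rows are chronologically ordered, streaming a file with iter_rows("day_0.gz") yields examples in the order in which they were served.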
Differences from the Kaggle challenge dataset:
– The dataset is not over the same time period;
– The subsampling ratios are different;
– The ordering of the features is not the same and the computation of some of them has changed;
– The hash function for categorical features is different.
The dataset is hosted on WeTransfer.
The whole dataset (342 GB) can be downloaded by following this link: https://we.tl/t-TbdTn2os31 . (Tip: click on ‘Preview’ if you are interested in only some specific files.)
The integrity of each file can be checked by computing its MD5 hash. The expected hashes are:
- day_0.gz : 6cef23542552c3195e9e6e2bdbc4c235
- day_1.gz : 94b73908ee8f912c175420376c8952db
- day_2.gz : c3c0272c26cfaa03d932b2856a751ff5
- day_3.gz : b727ecfaaf482507bb998248833aa4c2
- day_4.gz : b99eaa6e324e49d9df4e6f840e76b6d9
- day_5.gz : 1294d0a56a90294aebf80133078d9879
- day_6.gz : 68586521483313e13aefb29e7b767fdb
- day_7.gz : a2c1c4bfec20fc88b0b504175a16b644
- day_8.gz : faabf247fd56140a76effa6a3da63253
- day_9.gz : ee3347a28c1dd2fb2c92094e643c132b
- day_10.gz : d043c2ec0346eb4c71aaae935416688e
- day_11.gz : 8d4ba32f0c4f654a3860b6f2ae1a8ea7
- day_12.gz : 908480917ed39be2a2ad2e1c339c40b4
- day_13.gz : 567d6bfa672dd10a0cf76feaec0cf92b
- day_14.gz : ed377357aecaccc5f93c754c4819fd8d
- day_15.gz : 8e91f2a8d3d95202dfc3b22b88064c12
- day_16.gz : 387269870bf8ec7d285cf0e8ce82e92e
- day_17.gz : 48d3538fcf04807e0be4d72072dbda0b
- day_18.gz : f26e23b6ef242f40b0e3fd92c986170c
- day_19.gz : 3f6f36657b0ff1258428356451eea6c8
- day_20.gz : db7ff2b830817d3b10960f02bfb68547
- day_21.gz : f1a4ba7f7a555cb4a7e724a082479f4a
- day_22.gz : 848ae20c4eab730ae487acc8ddaf52ba
- day_23.gz : a2748bdbc67dd544b3ac470c4f1a52df
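One possible way to run the integrity check described above is the following Python sketch, which uses the standard hashlib module and reads each file in chunks so that large day files do not need to fit in memory. The function names md5_of_file and verify, and the 1 MiB chunk size, are illustrative choices.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the expected hexadecimal hash."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, verify("day_0.gz", "6cef23542552c3195e9e6e2bdbc4c235") should return True for an intact copy of the first file, assuming it was saved under that name in the current directory.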
NB: if you downloaded the whole archive between 22/4/2023 and 10/5/2023, attempting to decompress day_18.gz will report a corruption error. Using the link above, you can download each file individually and recover the correct version of day_18.gz. We apologize for this issue.