There has been a lot of effort by technology companies in the last few years to harness the power of clusters to process their large training sets. This move was made particularly important with the success of deep learning on tasks such as image recognition, speech recognition or text understanding since models used for these tasks tend to be highly complex and can be incredibly slow to train on non-distributed architectures.
In the context of web advertising, it is crucial to be able to make predictions extremely quickly as little time is given to send a bid to the ad exchange. On average, Criteo is able to predict the click probability of a user in less than 50 microseconds, as opposed to the 50 milliseconds required by deep models, and does so up to 500 000 times per second. This is the main reason why generalized linear models like the logistic regression, which are simple, are still widely used in our industry. As such models are faster to train, the move to distributed learning was therefore not as much a priority as it might have been for other companies.
Our datasets, however, keep getting bigger. We now serve over 2 billion ads per day, each of which providing valuable information which we can use to train our prediction models. In addition to this vertical scaling, we also keep adding features to our model to better represent the user preferences. Finally, as we become more efficient in the way we deal with ad exchanges, the time allowed to make a prediction increases and gradually opens the way to using more complex models which in turn will be longer to train. All these reasons convinced us to move to a distributed learning framework. Besides the technical challenges, described here, was the question of the distributed optimization method to use.
There has been a recent surge in the academic literature of papers proposing new distributed learning algorithms. They mostly focus on the distribution of stochastic methods, that is methods that update the parameters of the prediction model after every example or set of examples, rather than batch methods, which only update the parameters of the model after having seen all the examples in the dataset. The latter category is less interesting from a scientific point of view as it is usually straightforward to distribute a batch algorithm since data need only be transmitted between nodes once per pass over the entire dataset.
An ongoing topic in the optimization community is which of stochastic or batch optimization methods are the best. The answer, as usual, depends on the use case. There are several advantages to using batch methods:
– Their convergence speed is faster (linear or even super-linear for second-order methods as opposed to sub-linear for stochastic ones), which is especially noticeable close to the optimum.
– They are easier to tune. Most of them rely on line searches to find the optimal step size, leaving few, if any, parameters to tune for the practitioner.
On the other hand, there are several reasons why one might be interested in stochastic methods:
– Their convergence speed is faster than those of batch methods when far away from the optimum. This is even more true when the dataset is large as they have no dependency on the number of datapoints.
– They are amenable to learning on a infinite stream of data.
– For non-convex problems, such as those encountered when using neural networks, one cannot hope to reach better than a local minimum. As the difference in training objective between different local minima is higher than the difference between stochastic and batch methods, there is no need to resort to the latter, which are usually more expensive.
– Finally, even for convex problems, though the convergence rate of stochastic methods on the training objective is slower than that of batch methods, the fastest convergence rate one can obtain on the test objective cannot be faster than sub-linear (see for instance this paper by Bottou and Bousquet). This is important as only the test objective relates to real-world performance and is thus the metric we really care about.
For complex models which are only trained once, for instance when one wishes to build an optimal image classifier, stochastic methods are the method of choice. In these cases, the practitioner can spend time tuning the parameters of its optimizer and does not wish to be limited in the size of the model because of constraints in the optimizer. In our setting, however, we found batch learning methods to be better suited:
– The automatic tuning of the parameters is an appealing feature when tens of models are trained several times per day.
– As our models slowly evolve over time, we do not train them from scratch but rather initialize them to the solution of the previous ones. Hence, we are consistently close to the optimum, a regime where stochastic methods are knowingly bad and second-order batch methods very efficient.
– Due to the fast prediction time, the bottleneck lies in the streaming of examples rather than the computation of the update. As a consequence, contrary to the findings of Bottou and Bousquet, we found that, within a given CPU budget, batch methods were performing better on the test set than stochastic ones.
For these reasons, we rely on an L-BFGS solver, initialized with stochastic gradient descent when starting far away from the solution. As an added gain, L-BFGS can be trivially parallelized: we distribute the computation of the gradients before computing the parameter update on a single node. This approach can scale to a very large number of nodes and the cost of the parameter update is independent on the size of the training set.
In summary, simplicity plays a huge role in the choice of an optimization algorithm for production models and we are happy to trade a little bit of convergence speed for safety. After all, the cost of a few extra nodes on a cluster will always be much lower than that of a failure of our prediction models due to a poorly converging optimization run. That is not to say that we stop here. There have been a few recent advances in distributed methods which are promising, such as this paper, and might lead to even faster optimization without loss in ease of implementation and use. Also, as we move to more complex models, we might find that stochastic method will take the crown.
N. Le Roux
Our lovely Community Manager / Event Manager is updating you about what's happening at Criteo Labs.See DevOps Engineer roles