ML model deployment

By: CriteoLabs / 18 May 2018

Web-scale ML: learning is not the (only) point


Machine Learning (ML) is all over the news these days, to the extent that even government agencies are taking note. Behind the fanfare, ML has been playing a key role in many major businesses for years (e.g. search engines, online advertising, recommendation systems, machine translation, etc.). At Criteo, we rely on ML to optimize advertisement campaigns for the value they bring to our partners.

The widespread adoption and reach of ML is tightly linked to the recent availability of comprehensive, efficient and scalable ML frameworks. While these frameworks allow the training of ML models from terabytes of data, they are not yet as mature when it comes to deploying the learned models in a production environment.

In this article, we outline the need for deploying large-scale ML models and the requirements that this places on the ML framework. We also introduce some current tools one could use to address those production needs.

Great platforms for learning web-scale ML models!

Today, there are several products that support ML training, ranging from libraries from research labs focusing on particular algorithms (like LibFM) to distributed frameworks that provide ML features as an application layer (Spark ML). These products target different user profiles such as data engineers, data scientists and ML researchers (you might want to check the nuances). They also target heterogeneous execution platforms: JVM-centric Hadoop projects focused on very Big Data (Mahout), wonderfully interactive single-machine Python (scikit-learn), generic numerical computation graphs implemented in efficient C++ (TensorFlow), or deep neural networks running on GPUs (Torch, Caffe, DeepLearning4J).

While some platforms are quite specialized, a few have developed a global representation of a trained pipeline, which makes it possible to reuse not just the learned model but the entire training pipeline, including feature engineering. As an example, the Pipeline model proposed by scikit-learn has become a reference in this field and has been adopted by Spark ML. Such representations might appear to be the foundation for deploying trained pipelines to production, but the solution to deployment is more nuanced.
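
To make the idea concrete, here is a minimal sketch in plain Python of a pipeline that bundles feature engineering and a model behind a single fit/predict interface. It imitates the fit/transform convention popularized by scikit-learn but does not use that library; the class names (`MeanCenter`, `ThresholdModel`, `Pipeline`), the toy learning rule and the data are all assumptions made for illustration.

```python
from statistics import mean


class MeanCenter:
    """Hypothetical feature-engineering stage: subtract the training mean."""

    def fit(self, xs, ys=None):
        self.mean_ = mean(xs)
        return self

    def transform(self, xs):
        return [x - self.mean_ for x in xs]


class ThresholdModel:
    """Hypothetical model: predict 1 when the feature exceeds a learned threshold."""

    def fit(self, xs, ys):
        positives = [x for x, y in zip(xs, ys) if y == 1]
        negatives = [x for x, y in zip(xs, ys) if y == 0]
        # Toy learning rule: midpoint between the two class means.
        self.threshold_ = (mean(positives) + mean(negatives)) / 2
        return self

    def predict(self, xs):
        return [1 if x > self.threshold_ else 0 for x in xs]


class Pipeline:
    """Chains transformers and a final model so they act as one estimator."""

    def __init__(self, transformers, model):
        self.transformers = transformers
        self.model = model

    def fit(self, xs, ys):
        # Each transformer is fitted, then applied, before the next stage sees the data.
        for t in self.transformers:
            xs = t.fit(xs, ys).transform(xs)
        self.model.fit(xs, ys)
        return self

    def predict(self, xs):
        # Prediction replays exactly the same feature engineering as training.
        for t in self.transformers:
            xs = t.transform(xs)
        return self.model.predict(xs)


pipeline = Pipeline([MeanCenter()], ThresholdModel()).fit(
    [1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1]
)
predictions = pipeline.predict([0.5, 9.5])
```

The point of the pattern is that the caller of `predict` passes raw features and never needs to know which transformations were learned during training.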

What do we (want to) do with the ML-model we just learned?

Learning a model per se can be the main point in a few cases:

  • In the Data Analytics use case, one aims at producing actionable (and thus human-understandable) insight from data on an ad hoc basis; there is no “production environment” in which the models are used. Instead, human-actionable reports are generated, leading to changes in parts of the business.
  • In the Research use case, the aim is to devise better training algorithms, and the learned models are mostly used to validate performance on representative data. This is usually done in the same environment as training, typically a big-data cluster or cloud.

But in the ML Engineering use case, the learned models are deployed and used in production environments to make the predictions on which the business operates. For example, at Criteo we predict in real time the potential interest of the user in hundreds of products for a single ad display, and we also compute off-line, on a daily basis, the most relevant products in catalogs with millions of items.

The two kinds of production environment are very different: the off-line one is a JVM-centric Hadoop cluster, while the on-line one is a pool of servers for user-facing applications. Latency can be tens of minutes or tens of microseconds, and the volume of data kilobytes or terabytes! Thus, we need to be able to deploy and use our learned models in widely varying environments. This requirement applies not only to the type of ML model (GBDTs versus Deep Nets) but also, and more importantly, to the whole feature engineering pipeline.

Porting ML models to production

The ad hoc solution to the above requirements is straightforward, but costly. One can either maintain different implementations of the models for the different environments and convert between them, or embed or emulate the production environment within the learning one. At Criteo, we actually do both: we use Spark MLlib for training and run our in-house .Net feature engineering pipeline as a Spark pipe transform (production embedding). In some cases, we also export the MLlib model to a .Net implementation deployed on production servers. This is a source not only of significant computational overhead but also of increased complexity that we would like to reduce.

The notion of pipeline is instrumental: it is a computation graph that can represent complex chains of feature extractions and model predictions as a single entity that behaves like a “simple” function. More importantly, a pipeline can be serialized (for scikit-learn, for Spark ML, for TensorFlow) and thus deployed into production. Well, almost: reloading a Spark ML pipeline requires a Spark context, which does not make any sense on a web server, especially a .Net one! What these frameworks do not address with respect to web-scale production deployment is a cross-environment external representation of trained ML pipelines. We are aware of three projects that aim to fill this gap: PMML, ONNX and MLeap.
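
A cross-environment external representation can be pictured as follows: the training side writes the fitted pipeline out as plain data, and a small runtime on the serving side rebuilds the scoring function from that data alone, without the training framework. The sketch below uses JSON and plain Python; the stage names (`scale`, `linear`), the field names and the parameter values are assumptions for illustration and do not follow the PMML, ONNX or MLeap formats.

```python
import json

# Training side: a fitted pipeline described as data only (values are made up).
exported = json.dumps({
    "stages": [
        {"op": "scale", "mean": [2.0, 10.0], "std": [1.0, 5.0]},
        {"op": "linear", "weights": [0.5, -1.0], "bias": 0.25},
    ]
})


def scale(stage, features):
    """Feature-engineering stage: standardize each feature."""
    return [(x - m) / s for x, m, s in zip(features, stage["mean"], stage["std"])]


def linear(stage, features):
    """Model stage: weighted sum plus bias, returned as a one-element list."""
    return [sum(w * x for w, x in zip(stage["weights"], features)) + stage["bias"]]


# Serving side: the only code the production runtime needs to ship.
OPS = {"scale": scale, "linear": linear}


def load_pipeline(serialized):
    """Rebuild a single scoring function from the serialized stage list."""
    stages = json.loads(serialized)["stages"]

    def score(features):
        for stage in stages:
            features = OPS[stage["op"]](stage, features)
        return features[0]

    return score


score = load_pipeline(exported)
result = score([4.0, 5.0])
```

Because the exported string contains no code, the same document could in principle be read by a runtime written in another language, as long as that runtime implements the same set of operations.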

PMML is an XML-centric proposal which seems to mainly target the use case where the learned model provides insights to other parts of the business. There is a prominent PMML evaluator for Java; we could not find one for a .Net environment, although there is a RESTful service for model evaluation. There are bridges for Spark and for scikit-learn, all of them being pushed by a single contributor!

ONNX (Open Neural Network eXchange) is a joint project of Microsoft, Facebook and Amazon. As its name suggests, the format is tailored to represent neural networks, but it is not limited to them. There is no native runtime for inference on the exported models, but there are a TensorFlow-ONNX exporter and importer. The promising NNVM project aims to compile models so that they can run on the TVM runtime.

MLeap takes a lighter approach that explicitly targets the issue of model deployment, through the Combust ML solution. It provides a protobuf- and JSON-based external representation as well as integration with TensorFlow, Spark ML and, soon, scikit-learn. A native implementation of the MLeap runtime that is easy to wrap in most languages is under development, and it is expected to provide execution latencies in a .Net environment compatible with our online requirements.

In 2017, Apple released the Core ML framework, which includes a model representation and deployment format. It thus becomes easy to export a model trained using some standard library (scikit-learn, XGBoost, LibSVM, Keras, …) to phones. Google has done the same with TensorFlow Mobile. At the same time, Databricks mentioned a similar initiative. Google also released XLA, an optimized intermediate representation (with a compiler) which should enable TensorFlow models to run inference efficiently in different environments. Likewise, Intel released their own intermediate representation.


While the requirements for effectively training web-scale ML models are now quite well understood and addressed by major frameworks, especially through the notion of Pipeline, these frameworks do not currently address the use case of deploying the trained models in very different production environments. There is, however, a strong need to be able to port models efficiently from training to deployment environments.

As we are moving from an in-house, .Net-centric ML system to Spark-based training while still leveraging our high-performance, .Net-based on-line system, we make the case for a cross-environment external representation of trained ML pipelines that would efficiently bridge ML frameworks such as Spark ML, scikit-learn or TensorFlow with production environments.

We identified three main projects addressing this requirement: PMML, ONNX and MLeap.

While PMML seems to focus on Java-centric, enterprise data analytics use cases, MLeap explicitly addresses the use case of ML model deployment into production with very low latencies. Moreover, MLeap supports serializing, deploying and scoring with full pipelines, including feature engineering, which is definitely part of model deployment. ONNX is a much younger but promising project. For now its stakeholders are mostly focused on Deep Learning (and more specifically on applying it to images), but it has support for RNNs and other general ML models as well.

We have already validated that interfacing a Rust implementation of the MLeap engine with our .Net environment is doable in two days (Hackathon, anyone?)! We are working on a more in-depth evaluation of performance, feature coverage and reliability. Stay tuned for more details.

Post written by:

Anthony Truchet
Staff Dev Lead, Palo Alto, CA

Oleksandr Pryimak
Senior Software Engineer, Palo Alto, CA

