What is Mesos?
Mesos is a resource scheduler that pools the available resources of a datacenter (CPUs, memory, storage, GPUs) and sends resource offers to task schedulers. This is why the Mesos community likes to call Mesos a “two-level scheduler for your datacenter”.
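To make the “two-level” idea concrete, here is a toy model in plain Python. This is our own illustration, not the actual Mesos API: level one (Mesos) pools agent resources and sends offers; level two (a framework scheduler) decides which offers to use for its tasks and declines the rest.

```python
# Toy model of two-level scheduling; an illustration, not the real Mesos API.

def make_offer(agent, cpus, mem_mb):
    """Level 1: Mesos pools agent resources and offers them to frameworks."""
    return {"agent": agent, "cpus": cpus, "mem_mb": mem_mb}

def framework_scheduler(offers, task_cpus, task_mem_mb):
    """Level 2: a framework decides which offers fit its tasks."""
    launched, declined = [], []
    for offer in offers:
        if offer["cpus"] >= task_cpus and offer["mem_mb"] >= task_mem_mb:
            launched.append({"agent": offer["agent"],
                             "cpus": task_cpus, "mem_mb": task_mem_mb})
        else:
            declined.append(offer)  # returned to the pool for other frameworks
    return launched, declined

offers = [make_offer("agent-1", cpus=4, mem_mb=8192),
          make_offer("agent-2", cpus=1, mem_mb=512)]
launched, declined = framework_scheduler(offers, task_cpus=2, task_mem_mb=1024)
```

Here the framework launches one task on `agent-1` and declines the offer from `agent-2`, which Mesos can then offer to another framework.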
MesosCon 2017
MesosCon was hosted in Prague from the 25th to the 27th of October. We had the opportunity to meet a lot of people from a wide range of companies, from small ones presenting their migration from legacy non-scalable systems to DC/OS, to the biggest, like Microsoft, which provides DC/OS on Azure for its customers.
Mesosphere, the company behind the Marathon framework, and one of the biggest Mesos contributors, was the main organizer of the event. Its main product is a Mesos distribution called DC/OS (Data Center/Operating System), built on their vision of Mesos as a datacenter level operating system.
At Criteo we’ve been running several Apache Mesos clusters for over two years now. We started with a small POC in 2015, at a time when Mesos was possibly a more obvious choice than it is today. That said, Mesos has a number of unique features that make its value proposition still relevant compared with today’s more popular schedulers:
- Simple architecture: with only ZooKeeper, Mesos masters, and Mesos agents, you have a full cluster up and running
- Extensible design: you can tune it to your needs using hooks, isolators, or custom executors
- Focus on very few features (resource management and isolation), delegating the high-level scheduling work to application schedulers called frameworks (generic ones like Twitter’s Aurora or Mesosphere’s Marathon, or Big Data frameworks like Spark or Flink)
- Support for industry standards such as CNI, CSI, and OCI images
We now have about 750 agents in 7 datacenters and more than 150 applications running in production serving 250k QPS, and we intend to migrate all of our online and offline services to Marathon. We also have the ambition to run a significant portion of our ML workloads on those clusters.
Interesting talks & tendencies
Security & Secret Management
Since its inception, security has not been a significant focus in the Mesos world. Most recommendations for securing Mesos boiled down to: block access to the Mesos agents’ ports so that nobody can perform operations on them. The number of talks focused on security this year indicates that this is about to change.
During this talk, Vinod presented the new mechanism for retrieving secrets at launch time on Mesos agents.
At Criteo, we currently use asymmetric encryption to send secrets to our applications. For instance, in Marathon or Aurora, application descriptors include secrets encoded with the public key of the service account running the application. When the application launches, the Mesos executor running with the right service account can access the private key and decrypt the secret. This solution works well, but it has a few drawbacks:
- it is not possible to change a secret without stopping and re-deploying an application
- it is not possible to share secrets across applications
- knowing which secrets are in use can be tricky
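For reference, the per-application scheme described above can be sketched with the third-party `cryptography` package. Names and key handling here are illustrative only; the real implementation lives in our deployment tooling and executors.

```python
# Sketch of the asymmetric secret scheme described above, using the
# third-party `cryptography` package. Names are illustrative only.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Key pair of the service account that will run the application.
service_account_key = rsa.generate_private_key(public_exponent=65537,
                                               key_size=2048)

# Deploy time: the secret is encrypted with the service account's public
# key and embedded in the Marathon/Aurora application descriptor.
encrypted = service_account_key.public_key().encrypt(b"db-password", OAEP)

# Launch time: the executor, running as the service account, holds the
# private key and can decrypt the secret.
plaintext = service_account_key.decrypt(encrypted, OAEP)
assert plaintext == b"db-password"
```

The drawbacks listed above follow directly from this design: the ciphertext is baked into the application descriptor, so changing it means re-deploying.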
The proposed solution includes a new module type called Secret Resolver that resolves a secret according to a key at launch time. As with any other module, Secret Resolvers can be implemented by vendors for custom resolution or resolution based on third party components like Hashicorp’s Vault.
While the idea is good, the talk did not answer a few of our concerns, chiefly how authorization checks could be implemented on secret resolution (to prevent, say, Alice’s app from using Bob’s app’s secret simply by providing the right secret reference).
The presentation was interesting despite those concerns. We were perhaps a bit unsettled by the impression that Mesos is only intended for companies whose networks, and the components used by their clusters, are assumed to be trustworthy.
The presenters started by showing the various ways to secure traffic between all Mesos Endpoints using TLS.
Following this, the talk focused on how executors and frameworks authenticate against the agent HTTP API. Today, authorization has to be implemented by users; otherwise the whole system is left open, and a rogue application can connect to the agent and impersonate an executor. To fix this, an environment-variable mechanism was added so that the agent can hand the executor a JWT (JSON Web Token) proving it really is the executor the agent launched, making it much harder for an attacker to hijack the connection.
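The mechanism rests on standard JWTs. Here is a minimal HS256 sketch using only the Python standard library; the claim names are hypothetical, not the actual claims Mesos puts in its tokens.

```python
# Minimal HS256 JWT sketch (stdlib only). Claim names are hypothetical,
# not the actual claims used by the Mesos agent.
import base64, hashlib, hmac, json

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_executor_token(secret: bytes, claims: dict) -> str:
    """Agent side: sign the executor's identity claims."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_executor_token(secret: bytes, token: str) -> dict:
    """Agent side: check the signature before trusting the claims."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid signature")
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

secret = b"agent-signing-key"  # hypothetical agent-held key
token = make_executor_token(secret, {"executor_id": "executor-1"})
```

An impersonator without the agent’s key cannot forge a valid signature, which is exactly what closes the hijacking hole described above.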
The talk was interesting, but it addressed neither secret revocation and updates nor rotation of the keys used to sign the JWT tokens. Since many systems now rely on such key rotation, this was a bit disappointing, but it is definitely a step in the right direction.
Our opinion on those talks
Security in the Mesos ecosystem has definitely started to improve, and the community is taking steps in the right direction. However, we were a bit disappointed by how much trust the community places in users: as soon as a user is able to launch an app in Mesos, that app can access a lot of data, and the user’s identity is lost for good. This is probably due to various shortcuts taken along the way, for example Mesosphere’s DC/OS admin router handling all security and then discarding the context, whereas we think this kind of security information should traverse the whole system so that every component can benefit from it.
Containerization & Standards
As cluster orchestration solutions are becoming mainstream and many container technologies mature in the market, the industry figured out that it would be helpful to standardize the way the components of a cluster interact together to facilitate collaboration and interoperability. Not only does it make it easier to build a cluster from scratch, it also opens the door to a whole range of products where vendors can promote their technologies enabling choice and simplifying adoption for their customers.
During this presentation, Jie Yu gave an overview of the progress made in Mesos on the standardization of networking, storage, and container images brought by the Open Container Initiative (a.k.a. OCI). Our team was mainly interested in container image format standardization since we are about to move from fetched build artifacts to Docker images for spawning our containers, simply to better manage application dependencies. We were glad to see the Mesos community’s strong commitment to standardization; it really helped us understand what lies ahead as we prepare our transition to Docker images in combination with the Universal Container Runtime offered by Mesos.
Jie Yu also presented the need for a standardized way for containers to talk to each other, known as the Container Network Interface (CNI). CNI abstracts the way orchestrators interact with networking solutions so that anyone can plug in whatever solution is available on the market, like Weave or Calico. As for networking at Criteo, we currently use the host network with the net_cls isolator to control the bandwidth used by containers, because it is sufficient for our needs at this point in time.
The topics were really interesting because we now have a better idea of what the Mesos community’s vision is and it shows that standardization is a hot topic for the community at the moment. Most of the subjects are relatively new though so we’ll keep a close eye on them until we are confident enough in their maturity to try to use such abstractions in our infrastructure.
This talk focused specifically on the Container Storage Interface (CSI), which abstracts the use of volumes by orchestrators. The speakers showed a long list of existing vendor plugins for managing volumes; before standardization, this was a nightmare, since every orchestrator had to integrate every plugin itself. Standardization solves this by having Mesos use an abstract interface and letting vendors implement it, thereby making their plugins available to any cluster orchestrator, DC/OS included. For example, two vendors promoted their plugin implementations, Portworx and RexRay, and presented CSI as a selling point of their products, since it really helps their users integrate the solution into their clusters.
The vision on storage standardization demonstrates once again that the Mesos community is advancing in the direction of better composability where one can transparently replace any component by another. Composability is really positive for Criteo since we are building our own custom clusters based on Mesos, integrating only best-of-breed open-source technologies ourselves to make sure we get the best from the market.
Machine learning on Mesos
Machine learning was one of the big focuses of this edition. Mesos and ML have a long history together: Mesos was conceived to run the complex workloads typical of Hadoop, and Spark was developed in the very same AMPLab at Berkeley as a POC framework for Mesos. However, the current situation of ML frameworks is mixed: HDFS on top of Mesos is still in beta (even if Mesosphere considers it production-ready), and Spark’s framework implementation has not aged well, perhaps lacking too many features to make the Mesos-Spark combo really attractive to real users.
A lot of talks focused on ML products within Mesos, possibly echoing Mesosphere’s SMACK stack marketing effort. Here is a selection of them.
Although originally closely related to Mesos, Spark is nowadays a very popular independent ML framework, usually used in tandem with Hadoop to run certain categories of ML jobs in a scheduled or ad-hoc fashion.
Stratio presented its fork of Spark that adds support for Kerberos (details here). The large-scale adoption of Kerberos and its support across the Hadoop stack make it an obvious candidate for securing ML jobs (and, to some extent, for identification in resource and quota management), especially for individual users running ad-hoc jobs, thanks to the Kerberos delegation mechanism. This feature is a good initiative, but sadly the Spark community does not seem interested in merging it into the product.
Mesosphere presented Flink running as a Mesos framework. Flink is a popular new toolbox for handling large amounts of streamed data in a distributed and resilient fashion. We got an overview of the architecture of Flink’s Mesos framework scheduler and its components. Two notable points were the use of Netflix’s Fenzo library to handle resource offer exchange with Mesos masters, and the ongoing rework of the architecture which will allow handling multiple jobs from one framework instance in Mesos (using the same Dispatcher pattern as Spark).
This is another Mesos framework developed by Mesosphere, leveraging the famous ML library by Google. It was introduced in two sessions: a presentation talk and a workshop that let attendees play with both the framework and GPU resources. The framework is implemented using the dcos-commons library. There is one drawback though: the scheduler can only handle one job definition at a time. Fortunately, work is underway to let it handle multiple jobs, a common pattern for ML frameworks.
The topic was a bit more technical compared to other presentations but we were very curious about what could be done with custom executors. Indeed, at Criteo we have some use cases like service discovery where we thought a custom executor could fit. The presentation helped us understand the internals of schedulers and executors and gave us the incentive to try to write one ourselves.
This talk given by Mesos contributor Vinod Kone described a new mechanism introduced as an experimental feature in Mesos 1.4, called Fault-Domains, which mimics the regions and availability zones of Cloud providers. These first class primitives will be available natively in Mesos and allow the handling of hybrid deployments (Mesos bare metal + Mesos/Cloud).
The talk was quite interesting, particularly regarding how migrations will be performed, how Mesos masters will be aware of their own fault domain, and how this will let them launch tasks outside their own region. It also requires a few changes to schedulers (the applications responsible for scheduling apps in Mesos) so that they can work with remote regions without launching lots of tasks in another datacenter or in expensive regions.
Availability zones will be able to represent physical differences (i.e. a rack or a datacenter room) and enable interesting availability use cases, such as “spread this application on at least 3 different racks”.
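As a toy illustration of such a spread constraint, here is our own simplified model of a Marathon-style GROUP_BY rule; this is not Marathon’s code, just a sketch of the placement logic zones make possible.

```python
# Toy model of a "spread across at least N racks" placement constraint
# (Marathon-style GROUP_BY); our own sketch, not Marathon's implementation.
from collections import Counter

def group_by_allows(placements, candidate, min_values=3):
    """Accept a candidate rack only if placing a task there keeps counts
    balanced across at least `min_values` distinct racks."""
    counts = Counter(placements)
    if len(counts) < min_values and candidate in counts:
        return False  # racks still missing: force placement on a new one
    return counts[candidate] <= min(counts.values(), default=0)
```

With fault domains exposing the rack as a zone attribute, a scheduler can apply this kind of check to every offer it receives.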
Masters won’t be able to spread across multiple regions however.
The talk was very informative and interesting. Enforcing only two levels of separation (regions and zones) should help quick adoption, because a good implementation is much easier at two levels than at “n” levels. Only time will tell whether two levels are enough.
This edition introduced Town Hall events, 1.5-hour sessions dedicated to exchange and open discussion with a project’s community. There were four of them: Mesos, DC/OS, Kubernetes, and Marathon. Most of us chose the Mesos one, and we had a very good time chatting and sharing user experiences with both Mesos contributors and users.
This year’s event was mainly focused on security, standardization, and big data. Those topics are very important to us, and they clearly indicate the direction the Mesos ecosystem is taking. The range of topics also gives us some insight into how much we will be able to leverage these developments to serve our e-commerce advertising business in the future.
Post written by:
Julien Pepy – Snr DevOps Engineer, Dino Lukman – DevOps Engineer, Pierre Cheynier – Snr DevOps Engineer, Pierre Souchay – Snr Staff DevOps Engineer, Clement Michaud – DevOps Engineer.