Windows Chef Cooking at Criteo

By: Baptiste Courtois / 04 Jul 2016

When I joined Criteo 2 years ago, I was asked to work on the automation of our infrastructure, using Chef.

More precisely, I was asked to work on the Chef automation of our Windows servers… I had neither Chef nor Ruby experience, and I was no Windows expert, but I was to take care of the more than 5000 Windows servers running our real-time processing platform.

Quite a challenge, don’t you think?

Criteo & Chef

Our infrastructure

Criteo is one of the main players in the banner-advertising world, and some people know us for our pretty big Hadoop cluster. We continuously process tons of data to accurately predict website visitors’ interests in real time.

All our products and services run on 17000 bare-metal servers, distributed in seven strategic locations across the world:

  • Europe – 2 datacenters
  • North America – 2 datacenters + a new one soon
  • Asia & China – 3 datacenters

Around 80 SREs work on this platform every day to provide and improve reliable, efficient services, not only for our R&D but also for our external clients.

From Criteo’s very beginning, significant effort has been invested in automating everything as much as possible:

  • Server provisioning
  • Service and application deployment
  • Software and firmware upgrades
  • Operational and maintenance tasks

Our tools

In order to work efficiently and deliver quickly using automation, we use a lot of common open source technologies, such as:

  • Git for versioning
  • Gerrit for code reviews
  • Jenkins for Continuous Integration
  • Rundeck as a general purpose operations tool

But the most important thing, in the context of this blog post, is that Chef is our main configuration system. Chef allows us to describe the final state we want for a specific machine, and continually ensure that everything is up-to-date and still in the desired state.

Thanks to all the tooling and ecosystems described above, we now have around 200 internal cookbooks and around 20 open source cookbooks published in the Chef Supermarket, with more to come.

Two thirds of our 17000 servers run CentOS; the rest run Windows Server.

Fortunately, Microsoft (now) loves open source technologies 🙂

Chef & Windows

Since 2012, Microsoft has been increasing its efforts related to Open Technologies, and it recently even open sourced the .NET Framework.

Chef and Microsoft have also started a partnership to bring the power of Chef to Azure.

The partnership with Microsoft only started in May 2014, and many things should develop from this. But what is the current state of Chef automation for Windows?

The Windows cookbook

As explained by Adam Edwards in his talk, cooking on Windows is a little different compared to other platforms. Many technologies have their cookbooks like Apache, Nginx or even IIS, and you’ll also find a Windows cookbook… but there is no Linux cookbook!

Since configuration in Linux is mostly file based, having file primitives in Chef is enough to do almost anything and start building specific cookbooks. Conversely, Windows uses different types of configuration like the registry, service APIs and files; so in 2011, Chef created the Windows cookbook to provide its own primitives independent of Chef’s Core releases.

This was a great improvement at first, but it quickly became annoying: every cookbook that also had to manage Windows had to depend on the `Windows` cookbook. So Chef decided to migrate the stable content of this cookbook to the Core project step by step:

  • Chef 11 (2012) was released with primitives for registry keys, PowerShell and batch scripts
  • Chef 12.0 (2014) was released with primitives for packages, DSC, reboots and logging to the Event Log
  • Primitives for scheduled tasks and optional components (aka Windows features) are still missing

Windows related cookbooks

The `Windows` cookbook was and remains the most important cookbook when you have to deal with Windows. When I started my mission to improve Criteo’s Windows automation, cookbooks supporting Windows in the community were very rare.

Hard to find the unicorn

Luckily for me, since I had to deal with web servers, the `IIS` cookbook was quite powerful and complete. For the rest, it was a bit more complicated.

In most cases either:

  • the cookbook you need did not exist
  • or it was poorly written, so it was difficult to understand how it actually worked
  • or there was no active support for the cookbook

Sometimes we needed a cookbook for a specific feature and we found multiple existing cookbooks with the same adoption/support ratio… Which one were we supposed to take?

For the .NET Framework, there are cookbooks like dotnetframework and different cookbooks for each version – ms_dotnet2, ms_dotnet35, ms_dotnet4, ms_dotnet45. To add to the mess, we created our own ms_dotnet cookbook, which was intended to properly factorize all the logic to set up the Framework on any recent Windows machine.

Nowadays, with around 3000 cookbooks in the supermarket, you should be able to find what you need; but sometimes you don’t 🙂

Write your own cookbook (Cookin’ up a storm?)

When we started to automate the way we update our Windows servers, we tried to reuse community cookbooks. We found none, so we had to write our own Windows open source cookbook from scratch. Because we were fearless, we wrote two cookbooks, one for the client part and one for the server part:

  • wsus-client – Configures your Windows machine to connect to a WSUS server or the official Windows Update Service
  • wsus-server – Configures your Windows server as a WSUS server

Because these are Windows cookbooks, we started by adding the `Windows` cookbook as a dependency! Then we used a lot of Windows primitives: windows_feature, windows_package, registry_key and powershell_script. We also created our own LWRPs, primitives that people can reuse in their cookbooks.
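To give an idea of what such primitives look like in a recipe, here is a minimal sketch that points a machine at a WSUS server with registry_key and restarts the update service with powershell_script. It is not the actual content of wsus-client: the server URL and the resource names are assumptions made for the illustration, and only the registry policy keys are the standard Windows Update ones.

```ruby
# Illustrative recipe fragment (e.g. recipes/default.rb); it needs a Chef run
# on a Windows node and is not the real wsus-client implementation.

# Assumption: a hypothetical WSUS endpoint.
wsus_url = 'http://wsus.example.com:8530'

# Point the Windows Update agent at the WSUS server.
registry_key 'HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate' do
  values [
    { name: 'WUServer',       type: :string, data: wsus_url },
    { name: 'WUStatusServer', type: :string, data: wsus_url }
  ]
  recursive true
  action :create
  notifies :run, 'powershell_script[restart-wuauserv]', :delayed
end

# Tell the agent to actually use the server configured above.
registry_key 'HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU' do
  values [{ name: 'UseWUServer', type: :dword, data: 1 }]
  recursive true
  action :create
  notifies :run, 'powershell_script[restart-wuauserv]', :delayed
end

# Only runs when one of the registry keys above was changed.
powershell_script 'restart-wuauserv' do
  code 'Restart-Service -Name wuauserv'
  action :nothing
end
```

Because each resource describes a desired state, a second Chef run on an already configured machine changes nothing and does not restart the service.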

Obviously, when we tested them in production it worked! Confident, we continued to create open source cookbooks supporting Windows. We planned to set up Rundeck on all our servers to perform operational tasks using WinRM, and came up with the following cookbooks:

  • rundeck-node – Configures a node to be used with Rundeck server
  • winrm-config – Configures WinRM service and client

Again, we successfully went into production and managed to perform the tasks we wanted. But we quickly faced weird problems with a few misconfigured servers, and people started to create issues on GitHub.

We had to improve our tests to replicate the issues and to ensure that our cookbooks worked on common configurations and environments, not only on our own.

Testing framework & a few encountered issues

Test-Kitchen is one of the official ways to test your Chef cookbooks. It allows you to run your code on various cloud providers and virtualization technologies, and then performs some checks to ensure that everything went OK.

At Criteo we have chosen Vagrant and VirtualBox as virtualization layers, so we use the kitchen-vagrant plugin to spawn new VMs and run our recipes on top of them.

I can assure you, when I started it was really fun to test on Windows VMs :).

SSH vs WinRM

Although Windows has its own remote protocol – WinRM – we had to use SSH to communicate with our VMs because it was the only possible way at the time. Soon enough we wrote a Vagrant plugin and Test-Kitchen drivers to use the native protocol. Then the Chef teams provided official Windows support in Test-Kitchen 1.4.
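As a rough illustration of the setup described above, the following Ruby sketch builds a Test-Kitchen configuration that combines the Vagrant driver with the WinRM transport and prints it as YAML, ready to be saved as a `.kitchen.yml`. The platform name, the Vagrant box and the run list are assumptions chosen for the example, not our actual configuration.

```ruby
require 'yaml'

# Hypothetical Test-Kitchen configuration: Vagrant/VirtualBox driver,
# WinRM transport (available since Test-Kitchen 1.4).
kitchen_config = {
  'driver'      => { 'name' => 'vagrant', 'provider' => 'virtualbox' },
  'transport'   => { 'name' => 'winrm' },
  'provisioner' => { 'name' => 'chef_zero' },
  'platforms'   => [
    {
      'name'   => 'windows-2012r2',
      # Assumption: a placeholder box name, replace it with a real Windows box.
      'driver' => { 'box' => 'example/windows-2012r2' }
    }
  ],
  'suites' => [
    # Assumption: the cookbook under test exposes a default recipe.
    { 'name' => 'default', 'run_list' => ['recipe[wsus-client::default]'] }
  ]
}

# Serialize the configuration; the output can be written to .kitchen.yml.
kitchen_yaml = kitchen_config.to_yaml
puts kitchen_yaml
```

With such a file in the cookbook directory, `kitchen test` creates the VM, converges it over WinRM, runs the checks and destroys the VM.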

The Scheduled Task workaround

Working with our own Test-Kitchen plugin, we have been able to work around some WinRM limitations. You first have to know that on Windows, a remote session has fewer permissions than a local one, and some features are simply not available! A remote session also has tighter limits on the number of processes it can run and the memory it can use. You either have to tune everything or try to emulate a local session.

The easiest way is to run as a Scheduled Task. That’s the trick we used to communicate with the VM, but it has been deprecated with Test-Kitchen 1.4. So we needed a replacement.
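The trick can be sketched with the standard `schtasks.exe` tool: create a one-shot task that runs the command as SYSTEM, trigger it, then delete it. The Ruby helper below only builds the three command lines, it does not execute them; the helper name, the task name and the wrapped command are assumptions for the illustration, not the code of our plugin.

```ruby
# Build the schtasks.exe invocations needed to run `command` in a local
# session as SYSTEM. Nothing is executed here: each array can be passed to
# Kernel#system (system(*argv)) on a Windows host.
def scheduled_task_commands(task_name, command)
  [
    # Create (or overwrite, /F) a one-shot task running as SYSTEM with the
    # highest privileges. /SC ONCE requires a start time, which is irrelevant
    # because the task is started manually below.
    ['schtasks', '/Create', '/TN', task_name, '/TR', command,
     '/SC', 'ONCE', '/ST', '00:00', '/RU', 'SYSTEM', '/RL', 'HIGHEST', '/F'],
    # Start the task right away. This call returns immediately, so a real
    # implementation has to wait for the task to finish before cleaning up.
    ['schtasks', '/Run', '/TN', task_name],
    # Remove the task once it is done.
    ['schtasks', '/Delete', '/TN', task_name, '/F']
  ]
end

# Hypothetical usage: wrap a single Chef run.
commands = scheduled_task_commands('chef-run', 'chef-client --once')
commands.each { |argv| puts argv.join(' ') } # display only, arguments are not quoted
```

Keeping the arguments as arrays rather than a single string avoids quoting problems when the wrapped command contains spaces.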

Steven Murawski, a Windows Chef guy, first worked on a small Gem using the same trick. It has been recently deprecated in favor of the official scheduled task support built into the WinRM transport in Test-Kitchen 1.8.

Thanks to this workaround, we fixed the following issues we encountered:

  • “Access is denied” error when trying to install ms_dotnet or sql_server
  • “Unable to create IUpdateSession::CreateUpdateDownloader” error on wsus-client

We are now also able to run Chef as SYSTEM via a Scheduled Task in Test-Kitchen!

Root equivalent: Administrator or SYSTEM

Another difference between Windows and Linux that introduced some issues during our Chef experiments is the fact that Windows does not have a `root` user. In fact, there are two users that people tend to qualify as equivalent:

  • The Local SYSTEM user
  • The Built-in Administrator user

Both are useful, but not for the same purpose. Unfortunately, most cookbooks use `Administrator` as the root equivalent but Chef runs as SYSTEM when installed as a Windows service. Because they are two different users, they can have different permissions, so one needs to be consistent!

By default the Local SYSTEM account has the same rights as the built-in Administrator user, plus some interesting privileges like running as a Service or opening the Security Registry!

However, Local SYSTEM is not perfect: it behaves differently inside a Workgroup and inside a Domain. Even so, there are good reasons to prefer it over the Built-in Administrator account. For instance, that account does not exist on a Domain Controller, which makes Chef fail when you promote your node to a Domain Controller.

That’s why at Criteo we chose the Local SYSTEM account as the equivalent to the root user, and we try to promote this choice in the community.
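One concrete way to see the difference between the two accounts is to look at their security identifiers (SIDs). Local SYSTEM has the same well-known SID, S-1-5-18, on every Windows machine, while the SID of the Built-in Administrator depends on the machine or domain and is only recognizable by its final part, 500. The small Ruby helper below, whose name is an assumption made for the illustration, classifies a SID accordingly.

```ruby
# Well-known SID of the Local SYSTEM account, identical on every machine.
LOCAL_SYSTEM_SID = 'S-1-5-18'.freeze

# The Built-in Administrator has a machine- or domain-specific SID of the form
# S-1-5-21-<three numbers>-500, where 500 is its fixed relative identifier.
BUILTIN_ADMINISTRATOR_SID = /\AS-1-5-21-\d+-\d+-\d+-500\z/

# Hypothetical helper: tell which "root equivalent" a SID refers to.
def classify_account(sid)
  return :local_system if sid == LOCAL_SYSTEM_SID
  return :builtin_administrator if sid.match?(BUILTIN_ADMINISTRATOR_SID)

  :other
end

puts classify_account('S-1-5-18')
puts classify_account('S-1-5-21-1004336348-1177238915-682003330-500')
```

Referring to SYSTEM therefore works identically on a standalone server, a domain member or a Domain Controller, which is part of what makes it a convenient root equivalent.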

Conclusion of a Journey

Throughout our experience with Chef we discovered that it’s generally easier to use “the” native way of doing something. For instance, we use the SYSTEM user because it seems to be the best root equivalent. We also prefer native technologies over ported ones, like WinRM over SSH or W32Time over Meinberg NTP.

This choice allowed us to spend less time on bugs due to cross-platform or portability issues, and to focus on our main task. Following this simple guideline we managed to reinstall all our 5000 Windows servers in an automated manner within 3 months.

We have continued our efforts to deploy Windows updates in a safe and automated way, dealing with the orchestration of multiple reboots per server. This topic is so broad and interesting that we could give a whole lecture about it. It laid the basis of our orchestration system, which allowed us to reinstall our Windows platform once again and migrate it from Windows Server 2008 R2 to Windows Server 2012 R2 in less than 2 months!

Maintaining and improving the automation of this platform has been both challenging and exciting. I was afraid I had discovered all the secrets Chef has to offer, but I have barely scratched the surface!