PoH – Part 2 – Running C# on a Linux Hadoop cluster

By: CriteoLabs / 01 Sep 2014

Assume you have a code base in C# and you want to run it in a distributed way on Hadoop. What do you do: rewrite your historic code base in Java? Or try to forget the lesson you learnt when you were 3 years old, that a square piece cannot fit into a triangular hole?

Taking advantage of the lessons others learnt the hard way, we chose to give the second approach a try. Here is how we managed to give a triangular shape to our square piece, and to run C# code on Hadoop in production.

Hadoop Streaming

Hadoop Streaming, a tool from the Hadoop project, “allows you to create and run Map/Reduce jobs with any executable”.

On paper, it is rather easy to use:

hadoop jar hadoop-streaming.jar \
-input file1.txt \
-input file2.txt \
-output output \
-mapper myMapper.exe \
-reducer myReducer.exe

where myMapper.exe and myReducer.exe can be arbitrary executables. Both should read their input from the standard input stream and write their output to the standard output stream. The only constraint on the mapper is that it writes its output in the format “key\tvalue”; the jar does the rest, connecting the processes through UNIX pipes.
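As an illustration of this contract, a mapper could look like the sketch below. It is a hypothetical word-count example, not the production code described in this post: the class name MyMapper and the methods Map and Run are assumptions made for the example. It reads lines from the standard input stream and writes one “key\tvalue” record per word to the standard output stream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical word-count mapper for Hadoop Streaming (illustrative only).
public static class MyMapper
{
    // Turns one input line into zero or more "key\tvalue" records.
    public static IEnumerable<string> Map(string line)
    {
        var words = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            // Key is the lower-cased word, value is the count "1".
            yield return word.ToLowerInvariant() + "\t1";
        }
    }

    // Reads every line from `input` and writes the mapped records to `output`.
    public static void Run(TextReader input, TextWriter output)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            foreach (var record in Map(line))
            {
                output.WriteLine(record);
            }
        }
    }

    public static void Main()
    {
        // Hadoop Streaming feeds the input split on stdin and collects stdout.
        Run(Console.In, Console.Out);
    }
}
```

Keeping the logic in Run, which takes a TextReader and a TextWriter, makes it possible to exercise the mapper locally with in-memory strings before submitting a job.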

Nothing seems very tricky here: we simply needed to configure our logging framework to send our logs to the standard error stream, so that the standard output stream is reserved for hadoop-streaming.
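The idea can be sketched as follows. The post does not name the logging framework involved, so the tiny Log class below is a hypothetical stand-in rather than a real configuration: it only shows diagnostics going to the error stream while records go to the output stream.

```csharp
using System;
using System.IO;

// Hypothetical minimal logger (a stand-in for a real logging framework).
public static class Log
{
    // Diagnostics go to stderr by default, so stdout carries records only.
    public static TextWriter Target = Console.Error;

    public static void Info(string message)
    {
        Target.WriteLine("INFO " + message);
    }
}

public static class Program
{
    public static void Main()
    {
        Log.Info("mapper started");           // written to stderr
        Console.Out.WriteLine("someKey\t1");  // written to stdout, read by hadoop-streaming
    }
}
```

With a real framework, the equivalent step would presumably be to point its console appender or listener at the error stream instead of the output stream.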

On second thoughts, it can actually be a bit tricky when the legacy code base uses third-party dependencies that may also write to the standard output stream. To avoid this, we had our mappers prefix their lines with a given arbitrary string, and our reducers filter the lines they read to keep only the prefixed ones.
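A minimal sketch of this prefix-and-filter trick is shown below. The prefix value "##REC##" and the names PrefixedRecords, Tag, TryUntag and Filter are assumptions chosen for the example; the post does not say which string or helpers were actually used.

```csharp
using System;
using System.IO;

// Hypothetical helpers for marking genuine records and dropping noise.
public static class PrefixedRecords
{
    // Arbitrary marker; any string unlikely to appear in third-party output works.
    public const string Prefix = "##REC##";

    // Mapper side: mark a genuine "key\tvalue" record.
    public static string Tag(string key, string value)
    {
        return Prefix + key + "\t" + value;
    }

    // Reducer side: true (and the record without its prefix) for tagged lines,
    // false for anything else, such as stray output from a dependency.
    public static bool TryUntag(string line, out string record)
    {
        if (line != null && line.StartsWith(Prefix, StringComparison.Ordinal))
        {
            record = line.Substring(Prefix.Length);
            return true;
        }
        record = null;
        return false;
    }

    // Reducer side: copies only the genuine records from `input` to `output`.
    public static void Filter(TextReader input, TextWriter output)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            string record;
            if (TryUntag(line, out record))
            {
                output.WriteLine(record);
            }
        }
    }

    public static void Main()
    {
        Filter(Console.In, Console.Out);
    }
}
```

Since every genuine record carries the same prefix, the prefix becomes part of the key during the shuffle without changing how identical keys are grouped; the reducer strips it before doing its real work.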

Running jobs without Java will not be an issue. Next step: running C# on Linux!

Meeting Mono

C# is a language developed by Microsoft and designed to run primarily within Microsoft’s .NET software environment. After C# became an ECMA standard, some engineers developed a free, open source, cross-platform port of .NET: Mono. It seems widely used in the industry, so why not give it a try?

Not so fast! Our code base has been developed over the years with only .Net in mind. Let’s ensure it will run fine on this different environment!

Mono has some documentation regarding its compatibility with .Net, covering both the runtime and the framework. That documentation is useful when writing new code, to make sure we only use compatible features. With an existing code base, however, checking everything manually would be daunting.

Fortunately, the Mono project is not just a compiler and a runtime. It also brings a few tools, and in particular MoMA, the Mono Migration Analyzer. Let’s feed it with our assemblies!

Hmmm, the game does not look won yet:

Results of the MoMA run

Let us hope these are only methods used by our internal SDK that our particular application never calls.

With all this pessimistic information in mind, what about giving it a try in practice?

Expected complications

After this preliminary study, we cannot say we were much surprised when our first run failed because the following code threw a NotImplementedException:

DataContext db = BuildDataContext();
var loadOptions = new DataLoadOptions();
loadOptions.LoadWith<ABTest>(t => t.Populations);
db.LoadOptions = loadOptions;

Looking under the hood, it might indeed have been one of the issues MoMA tried to warn us about. Here, rewriting our code without using DataLoadOptions was enough.

A bit trickier:

reader = context.StoredProcedure(@"FeatureLogApplications_Metrics_GetMetric")
.Set<int?>("@MetricID", null)
.Set<string>("@MetricName", variableName)

In this case, the method is implemented, but passing null yields strange results. Rewriting our stored procedure did the trick.

For other issues we found, we could fix Mono and send our patches upstream.

All is well that ends well

Despite these glitches, the port was actually fairly straightforward: it did not completely work out of the box but, as far as Mono is concerned, it only took a couple of days to make the needed changes.

We learnt a few things from this project:

  • Using Mono on our historic code base works fairly well, even though this code had never been written with Mono in mind. We stumbled upon a few differences with .Net in the SQL API, but nothing we could not work around.
  • The Mono project provides a tool and some documentation to check the compatibility of one’s code base. It can give some insights, but it seems easier to send the code to the front line directly and to count the bodies afterwards.
  • Using Hadoop Streaming and Mono, we are able to successfully run Hadoop jobs in C#. Although integration is certainly less convenient than using Java, we are nevertheless able to use it in production.

Author: Guillaume Turri
