Data Serialization Comparison

By: CriteoLabs / 17 May 2017

At Criteo, Performance is everything.

The serialization formats considered:

  • Protocol buffers
  • Thrift
  • Avro
  • Json
  • XML

We did the benchmarking using a specialized library: http://benchmarkdotnet.org/ , and C# .net 4.5.

The data model

Before digging into the implementation details, let us see the Data Model. Our data is similar to an Excel workbook. It has got many pages, and each page is a table. In our case each table has got some keys indexing the rows. One difference with Excel though is that each cell may contains more than just a single item. It could be a list of floating point values, a dictionary of floating point values, a dictionary of dictionaries, and so on.

We represent data in a table like structure. Each column may have a different data type and every cell in a column has the same data type as shown in table below. The cell’s value are implemented as subclasses of a base class called IData. We have one implementation of IData for each type of data structure we want to put in cells.

Key
Column1
Column2
Column3
Column4
...
String
double
Double[]
Dictionary<int, double>
Dictionary<int,double[]>
String
double
Double[]
Dictionary<int, double>
Dictionary<int,double[]>
String
double
Double[]
Dictionary<int, double>
Dictionary<int,double[]>
String
double
Double[]
Dictionary<int, double>
Dictionary<int,double[]>

Table 1 Example of the table like structure.

In order to have fixed sample of data to serialize, We wrote a data generator that randomly generates the different possible values for each type of columns.

The XML Story

The original implementation was in XML so it became the reference benchmark for the other formats. Implementation was easy using the standard .net runtime serialization, simply decorate the classes with the correct attributes and voila.

Figure 1 DataContract annotations for serialization

The interesting part is the “[DataContract]” and “[DataMember]” attributes which indicates to the serializer what members to serialize.

The JSON Story

Json is supposed to be faster and light-weight than XML. The serialization is handled by the Newtonsoft library, easily available in C#. There is just one small glitch here, in order to be able to correctly serialize and deserialize such dynamic data, we had to set the type name handling to automatic. This resulted in json text with a type field.

For example.

The Protocol Buffer story

This has a lot of hype, which kind of makes sense because binary formats are most of the time faster than text formats, also the data model for the messages could be generated in many languages from the protobuf IDL file. The drill here is to create the IDL, generate C# objects, write convertors and serialize!

But wait, which library should we use ? We found at least three different nugets, two of them claimed to implement the same version of Protobuf V3.

After much investigation, we realized that Google.Protobuf is provided by Google and had the best performance. Protobuf3 is compiled by an individual from the same source code but it is slower.

There is more than one way to solve the problem with protobuf and we decided to try three different implementations and compare the performance.

First implementation

This implementation is referenced as protobuf-1 in our benchmarks. The design had to solve the problem of storing a polymorphic list. This had to be done using inheritance, and this blog post explores different methods of implementing it. We compared them and chose to use the type identification field  approach as it had a better performance.

Let’s see the example.

Here, each cell of the table would contain one object of DataColumnMessage, which would have one field filled with values and the rest of them are null values. Protobuf does not store null values for optional fields, so the file size should not change a lot. But still this meant 4 null values and if the number of fields increase, that would mean even higher number of null values. Does that effect the performance ? Keep reading for the comparison of results.

Second Implementation

This implementation is referenced as protobuf-2 in our benchmarks. We knew that each column has the same data type, so we tried a column based design. Instead of creating one value for each cell, we decided to create one object per column. This object will store one field for the type of the objects stored, and a repeated value for each cell. Therefore drastically decreasing the number of null values and the number of field “type”.

Let’s look at how the IDL would look like,

We believed that this should improve the performance by quite a lot in comparison to the previous version.

Third Implementation

This implementation is referenced as protobuf-3 in our benchmarks. We wanted to leverage the new “map” keyword introduced in the protobuf version3 and benchmark its performance. This is the new specialized way of defining dictionaries so we were hoping for performance improvements. Our hypothesis was that we don’t need to allocate and copy data while converting to our business objects. Are we right ? We’ll see in the comparisons.

The dictionary object description changes from a list of key/value pair to a map.

Protobuf V2

Protobuf V3

This generates code that directly uses the C# dictionary implementation.

The Thrift Story

You know the drill, create the IDL first, then generate the message objects, write the needed conversions and serialize. Thrift has a much richer IDL, with a lot of things that do not exist in protobuf. In the test example, one big advantage we had was the availability of the “list” keyword. This meant that we can now specify r

The rest of the IDL is not that different from protobuf.

Let’s look at the example.

In our simple case, the Thrift IDL allows us to specify our map of list in a single line :

Protobuf IDL

Thrift IDL.

We found Thrift mandates a stricter format where you declare classes before using them. It also does not support nested classes. However it natively supports list and maps, which simplified the IDL file.

Thrift syntax is far more expressive and clear. But is it the same for performance ? To know more keep reading.

The Avro story

Apache Avro is a serialization format whose support in C# is officially provided by Microsoft. As with the other serialization systems, one can create a schema (in JSON) and generate C# classes from the schema.

The example of Avro JSON Schema (excerpt):

We found the JSON schema very verbose and redundant in comparison to the other serialization formats. It was a bit difficult to actually write and generate the classes. So we took a shortcut and generated the schema by using the DataContract annotations. All we want is performance and no difficult Schema language could stop us from evaluating all the options.

Note that there exists an IDL for Avro with a Java tool that generates the JSON schema from the IDL file, see: We didn’t evaluate it because we preferred to let the C# Avro library generate the schema from the annotated code.

But, there was another blocker. It was the inconsistent support for dictionary-like structures. For instance, we found that the C# version correctly serializes a dictionary whose value type is a list. But it refuses to serialize a dictionary of dictionaries. In this case, it doesn’t throw any exception. Instead it will just silently fail to serialize anything.

We also came across a bug opened at Microsoft’s github two years ago (still open), where a dictionary with more than 1024 keys throws an exception when serialized. (https://github.com/Azure/azure-sdk-for-net/issues/1487)

Given these constraints we had to serialize dictionaries as list of key and value pairs and create all intermediary classes with annotations. It contributed to make the scheme more complex. Is it now going to beat the other formats ? Let’s find out.

Results

We split our benchmarks into two configurations:

  • Small objects, where the lists and dictionaries contains few items (less than 10). This configuration is used usually by a web-service exchanging small payloads.
  • Big objects, where we accept lists with many hundreds items. This configuration is representative of our Map/Reduce jobs.

We measured the following metrics:

  • Serialization time
  • Deserialization time
  • Serialized file size

In our case the deserialization is more important than serialization because a single reducer deserializes data coming in from all mappers. This creates a bottle neck in the reducer.

Small Objects

File Sizes

Table 2 Small objects serialized file sizes in Bytes

  1. All binary formats have similar sizes except Thrift which is larger.
  2. 2nd implementation of protobuf is the smallest among other protobuf implementations due to the optimization achieved with column based design.
  3. 3rd implementation of protobuf is a bit bigger which implies that the map keyword in protobuf increases the file size by a little.
  4. Json is off course better than XML
  5. XML is the most verbose so the file size is comparatively the biggest.

Performance

Table 3 Small objects serialization time in micro-seconds

Thrift and protobuf are on par. Avro is a clear loser.

  1. The different implementations of protobuf
    1. 3rd implementation of protobuf that uses the map is around 60% slower than the other protobuf implementations. Not recommended if you’re looking for performance but it inherently provides the uniqueness of the keys which is a tradeoff.
    2. 2nd implementation is not optimal. The column based design doesn’t show its full effect for small objects.
    3. Serialization is generally quicker than deserialization which makes sense when we consider the object allocation necessary.
  2. The numbers confirm that text formats (xml, json) are slower than binary formats.
  3. We would never recommend using Avro for handling small objects in C# for small objects. Maybe in other languages the performance would be different. But, if you’re considering to develop micro services in C#, this would not be a wise choice.

Large Objects

File Size

Table 5 Large objects serialized file size in MegaBytes

  1. Avro is the most compact but protobuf is just 4% bigger.
  2. Thrift is no longer an outlier for the file size in the binary formats.
  3. All implementations of protobuf have similar sizes.
  4. Json is still better than XML
  5. XML is still the most verbose so the file size is comparatively the biggest.

Table 6 Large objects serialization time in milli-seconds

  1. This time, Thrift is a clear winner in terms of performance with a serialization 2.5 times faster than the second best performing format and a deserialization more than 1.3 times faster.
  2. Avro, that was a clear disappointment for small objects, is quite fast. This version is not column based, and we can hope it would make a little faster.
  3. The different implementations of protobuf
    1. Column based 2nd implementation of protobuf is the winner. The improvement is not huge, but the impact of this design kicks in when the number of columns starts to be very high.
    2. Serialization is generally quicker than deserialization which makes sense when we consider the object allocation necessary.
  4. Serializing XML is faster than Json. Json on the other hand is way faster.

Final conclusion

Protobuf and Thrift have similar performances, in terms of file sizes and serialization/deserialization time. The slightly better performances of Thrift did not outweigh the easier and less risky integration of Protobuf as it was already in use in our systems, thus the final choice.

Protobuf also has a better documentation, whereas Thrift lacks it. Luckily there was the missing guide that helped us implement Thrift quickly for benchmarking. https://diwakergupta.github.io/thrift-missing-guide/#_types

Avro should not be used if your objects are small. But it looks interesting for its speed if you have very big objects and don’t have complex data structures as they are difficult to express. Avro tools also look more targeted at the Java world than cross-language development. The C# implementation’s bugs and limitations are quite frustrating.

The data model we wanted to serialize was a bit peculiar and complex and then investigation is done using C# technologies. It could be quite interesting to do the same investigation in other programming languages. The performance could be different for different data models and technologies.

We also tried different implementations of protobuf to demonstrate that the performance can be improved by changing the design of the data model. The column based design to solve the real problem had a very high impact on the performance.

 

Post written by:


Afaque Khan

Software Engineer, R&D – Engine – Data Science Lab

Frederic Jardon

Software Developer, R&D – Engine – Data Science Lab

 

  • CriteoLabs

    Our lovely Community Manager / Event Manager is updating you about what's happening at Criteo Labs.