A few months ago, we tried to move our real-time prediction component to dedicated GPU servers.
The basic operations are simple enough: hashes, scalar products, plus a few operations specific to our prediction algorithm.
Some of our coworkers were quite skeptical, because these operations do not fit the GPU paradigm well: notably, the prediction step requires random access to large chunks of RAM, which maps poorly onto the GPU memory model.
The algorithm, as run on the GPU, comprises five steps:
1. Data transfer to the GPU
2. Data formatting within the GPU: some byte-array manipulation to adapt our data structures to the GPU
3. Hash computation
4. Prediction: applying the prediction model
5. Data transfer back to the main CPU
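The five steps can be sketched as a minimal host-side mock (our own simplification, not Criteo's actual code): `memcpy` stands in for the PCIe transfers, plain loops stand in for kernel launches, and the formatting, hashing and model logic are toy placeholders.

```c
#include <stdint.h>
#include <string.h>

#define N 4  /* toy batch size */

/* Step 2: data formatting -- placeholder byte swap standing in for the
   real byte-array manipulation. */
static void format_kernel(uint32_t *d, int n) {
    for (int i = 0; i < n; i++)
        d[i] = (d[i] >> 24) | ((d[i] >> 8) & 0xFF00u)
             | ((d[i] << 8) & 0xFF0000u) | (d[i] << 24);
}

/* Step 3: hashing -- a toy integer mix, not the real MurmurHash3. */
static void hash_kernel(uint32_t *d, int n) {
    for (int i = 0; i < n; i++) { d[i] *= 2654435761u; d[i] ^= d[i] >> 16; }
}

/* Step 4: prediction -- a toy model table indexed by the hash value.
   This lookup is the random global-memory access discussed below. */
static void predict_kernel(const uint32_t *d, float *out,
                           const float *model, int model_size, int n) {
    for (int i = 0; i < n; i++)
        out[i] = model[d[i] % model_size];
}

void predict_batch(const uint32_t *host_in, float *host_out,
                   const float *model, int model_size, int n) {
    uint32_t dev_in[N];  /* "device" buffers */
    float    dev_out[N];
    memcpy(dev_in, host_in, (size_t)n * sizeof *dev_in);    /* step 1: H->D  */
    format_kernel(dev_in, n);                               /* step 2        */
    hash_kernel(dev_in, n);                                 /* step 3        */
    predict_kernel(dev_in, dev_out, model, model_size, n);  /* step 4        */
    memcpy(host_out, dev_out, (size_t)n * sizeof *dev_out); /* step 5: D->H  */
}
```

On a real GPU, steps 1 and 5 would be `cudaMemcpy` calls and steps 2–4 kernel launches, with each thread handling one prediction.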
About 20% of the time is overhead due to GPU technical constraints (steps 1, 2 and 5).
Comparing the remaining 80% with the CPU implementation: we increased the sequential prediction rate from 140,000 predictions/s to 430,000 predictions/s on our test server. The main improvement is in the MurmurHash3 computation step, where we get a nearly 10x speed-up.
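For reference, here is a straightforward C port of the hash in question, MurmurHash3 in its 32-bit x86 variant (following Austin Appleby's public-domain reference implementation). Each key is hashed independently with no shared state, which is what makes this step parallelize so well across GPU threads.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static inline uint32_t rotl32(uint32_t x, int r) {
    return (x << r) | (x >> (32 - r));
}

uint32_t murmur3_32(const uint8_t *key, size_t len, uint32_t seed) {
    const uint32_t c1 = 0xcc9e2d51u, c2 = 0x1b873593u;
    uint32_t h = seed;
    size_t nblocks = len / 4;

    /* Body: process 4-byte blocks. */
    for (size_t i = 0; i < nblocks; i++) {
        uint32_t k;
        memcpy(&k, key + i * 4, 4);  /* assumes little-endian input */
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64u;
    }

    /* Tail: remaining 0-3 bytes. */
    const uint8_t *tail = key + nblocks * 4;
    uint32_t k = 0;
    switch (len & 3) {
        case 3: k ^= (uint32_t)tail[2] << 16; /* fallthrough */
        case 2: k ^= (uint32_t)tail[1] << 8;  /* fallthrough */
        case 1: k ^= tail[0];
                k *= c1; k = rotl32(k, 15); k *= c2; h ^= k;
    }

    /* Finalization mix: force all bits to avalanche. */
    h ^= (uint32_t)len;
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}
```

The hot loop is just multiplies, rotates and XORs on registers, with strictly sequential reads of the key: exactly the compute-dense, cache-friendly profile GPUs like.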
It is also interesting that the prediction step itself is much slower on the GPU than on the CPU.
We think these results can be explained by memory access patterns:
- The hash computation has good locality.
- The kernel used for prediction performs many accesses to global memory, where our prediction models are stored. These accesses follow a random pattern that cannot take advantage of memory coalescing and is therefore very inefficient bandwidth-wise.
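To make the coalescing point concrete, here is a small host-side model of it (our own simplification, not GPU code): it counts how many 128-byte memory transactions a 32-thread warp needs when its threads read 4-byte words at the given indices. The 128-byte granularity is a typical GPU memory-transaction size.

```c
#include <stdbool.h>
#include <stddef.h>

#define WARP_SIZE  32
#define LINE_BYTES 128  /* typical GPU memory-transaction granularity */

/* Count the distinct 128-byte segments touched by one warp in which
   thread t reads the 4-byte word at element index idx[t]. */
int transactions_per_warp(const size_t *idx) {
    size_t lines[WARP_SIZE];
    int n = 0;
    for (int t = 0; t < WARP_SIZE; t++) {
        size_t line = idx[t] * 4 / LINE_BYTES;
        bool seen = false;
        for (int j = 0; j < n; j++)
            if (lines[j] == line) seen = true;
        if (!seen) lines[n++] = line;
    }
    return n;
}
```

When the 32 threads read consecutive words (`idx[t] = t`), the warp is served by a single 128-byte transaction; when they read scattered locations landing in different segments, it needs up to 32 transactions, fetching as much as 4 KB to deliver 128 useful bytes. That waste is consistent with the poor bandwidth we observe in the prediction kernel.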
A topic we chose to ignore is the concurrency model: this experiment was run in a single-threaded scenario, whereas we obviously run multithreaded in production.
We are still looking for tricks to avoid global-memory accesses and improve our prototype, and we plan to try it out on an APU.
By the way, if you want to play with these technologies at scale, we are hiring! 😉
Authors: Laurent Vion & Vincent Perez