Hunting a bug in one of our metrics, we recently discovered a very unexpected correlation in our data, between two actions that we make using pseudo random numbers. Isolating the code responsible for this, we discovered that we had indeed a very tricky bug in our use of pseudo random numbers.
After removing all the irrelevant code, it finally looked like this:
const int SeedShift = 655575;
static void GetRandomNumbers(int seed, out double x , out double y)
{
x = new Random(seed).NextDouble();
y = new Random(seed + SeedShift).NextDouble();
}
Variables x and y were used to take two distinct random actions., and the seed here was built from a hash of the id of the impression. Note that seeding the random like this is a good practice, mandatory to reproduce what happens in a complex prod environment.
We were expecting here that x and y behaved as two (almost) independent random numbers. In other words, we expected the graph of x versus y to look like this: But let’s see what actually happened:
public static void Main(string[] args)
{
Random seedGenerator = new Random();
for (int i = 0; i < 1000; i++)
{
int seed = seedGenerator.Next();
double x,y;
GetRandomNumbers(seed, out x, out y);
Console.WriteLine("" + x + ";" + y);
}
}
Plotting the resulting (x,y) pairs, we got this:
As you can see, x and y were definitely NOT independent. What it meant for us at criteo, was that some metrics we computed and which relied on the independence of the actions were just wrong. Fortunately, these were not critical metrics, and we did not observe a significant difference in business metrics (clicks, sales,…) after fixing this bug.
How could a largely used object such as C# Random exhibit such a bad behavior ?
The answer is simple: Random object was not designed to be used this way. The rule of thumb to remember here is: your random numbers are just as random as your seeds. Here the seeds were correlated, and so were the random numbers.
For more details, you can have a look at : http://stackoverflow.com/questions/12282628/why-are-initial-random-numbers-similar-when-using-similar-seeds
How to fix this ?
One obvious solution is to use only one random generator:
const int SeedShift = 655575;
static void GetRandomNumbers(int seed, out double x, out double y)
{
var random = new Random(seed);
x = random.NextDouble();
y = random.NextDouble();
}
However, we did not use that. Remember indeed that this is a very simplified version of the code: actually both random objects were called in different threads. Using the same random in different thread would lose the reproducibility of our system. Note also that C# random object is not thread safe. Instead we finally used something like this:
static void GetRandomNumbers(int seed, out double x, out double y)
{
int seedX = Hash( seed ,"uniqueSaltForX");
int seedY = Hash( seed ,"uniqueSaltForY");
x = new Random(seedX).NextDouble();
y = new Random(seedY).NextDouble();
}
Here the ‘Hash’ function is a hash strong enough to guarantee that seedX and seedY can be considered as independent random numbers. Another possibility would be to use seedX and seedY directly as your random numbers. (But in our case, one of the actions actually required several random numbers.)
And by the way, I mentioned that C# Random object is not thread safe. If you think it is not a big deal, you may have a look at the following program:
static void CallRandom(Random random, int threadId)
{
for (int i = 0; i < 100000; i++) // A lot of calls to be sure the race conditions in random happened …
{
random.NextDouble();
}
for (int i = 0 ; i < 10 ; i++)
{
Console.WriteLine("thread=" + threadId + " randomDouble=" + random.NextDouble());
}
}
static void Main(string[] args)
{
Random random = new Random();
int nbThread = 2;
Parallel.ForEach(Enumerable.Range(1, nbThread), (threadId) => CallRandom(random, threadId));
Console.ReadLine();
}
Here is the output we got :
Probably not as random as you could have expected !
-
Software Developer R&D Engine
See Dev Lead roles