A few months ago, we constructed our own map of the web. We wanted to add new variables to our prediction engine. This engine optimizes, in real time, many components of our banners: it selects the best campaign, calculates the business value of the banner and chooses content such as graphics to maximize the probability of a user clicking and then purchasing. The precision of our engine depends on the variables, also called features, that it uses, so finding the right variables is a key part of our business.

Analyzing the web has helped other companies. For instance, the relevance of search engine results improves vastly when they take into account the links between pages; PageRank is an example of a feature calculated from the web graph. We thought that having a map of the web might be useful to compute relevant features: we could imagine that ads should be different on pages with high PageRank, and that pages close to a shop's site in the graph should carry different ads than the others. We describe here how we built our web graph and some of its properties.
Methodology
In general, when you want to build a web graph you use a crawler: a piece of software that retrieves a page, extracts its links and any other required data, and then moves on to another page. Performing a full crawl is a very complicated task, demanding a lot of computing power, bandwidth and monitoring. We therefore decided on a very different approach and used only the data we already had. When we display an ad to a given user, we log this information. Each browser is given an anonymous UserID, so we have huge logs with lines like "this UserID saw an ad on this page at this time". Using this information, we can build the list of pages seen by a specific UserID (the data remains completely anonymous). If we do that for each UserID and take the union of all these lists, we obtain a web graph which describes the trajectories of the various UserIDs on the web.
For instance, take two UserIDs: UserID 1 visits pages A, B, C and D, while UserID 2 visits pages E, F, B, C and G. Their trails share the link between B and C, and their union forms a single small graph.
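To make the construction concrete, here is a minimal sketch in Python. It assumes the log lines have already been parsed into (UserID, page, timestamp) tuples; the names and structure are illustrative, not our production code.

```python
from collections import defaultdict

def build_web_graph(log_records):
    """Union of the per-UserID browsing trails into one set of links."""
    views_by_user = defaultdict(list)          # UserID -> [(timestamp, page), ...]
    for user_id, page, timestamp in log_records:
        views_by_user[user_id].append((timestamp, page))

    edges = set()
    for views in views_by_user.values():
        views.sort()                           # order each trail chronologically
        for (_, page_a), (_, page_b) in zip(views, views[1:]):
            if page_a != page_b:               # ignore repeated hits on the same page
                edges.add((page_a, page_b))
    return edges

# The two-UserID example from the text: A->B->C->D and E->F->B->C->G.
logs  = [(1, page, t) for t, page in enumerate("ABCD")]
logs += [(2, page, t) for t, page in enumerate("EFBCG")]
print(build_web_graph(logs))
# the six links A-B, B-C, C-D, E-F, F-B, C-G (set order may vary)
```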
This methodology has several advantages. It is easier to perform than a crawl since we already have the information: we do not have to visit every page, and we only consider the pages relevant to us. It may also provide more pertinent information than a crawl since we only consider links actually followed by UserIDs. The principle of PageRank is to give each page a score representing the likelihood that a person randomly clicking on links will arrive at that particular page; in contrast, with our methodology we have real browsing trails.
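As an illustration of that random-surfer principle only, and not a description of our feature pipeline, PageRank can be computed on a toy directed graph with networkx:

```python
import networkx as nx

# Toy directed graph reusing the links from the example above.
G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "D"),
                ("E", "F"), ("F", "B"), ("C", "G")])
scores = nx.pagerank(G, alpha=0.85)   # alpha is the damping factor of the random surfer
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```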
The methodology is biased, however. We do not display ads on every page, so we miss some intermediate pages. For instance, if a UserID is on page A where we display ads, follows a link to page B where we do not, and then follows a link to page C where we do, we will incorrectly conclude that there is a direct link between pages A and C.
Only a full crawl would allow us to measure this bias, but we assume that our approximation is satisfactory for the pages relevant to us.
Another bias comes from how people use their browsers. When a user stops browsing, there is a long delay before the next page view, and no meaningful link between that page and the previous one. There are other cases: when a user has several simultaneous sessions in different tabs, we see successive views of unrelated pages, and when there are several ads on a single page, we see a UserID with multiple hits. To curb these effects we modified our methodology in two ways. First, we changed the definition of "a page": instead of looking at relations between full URLs, we look at relations between domains. With this rule, each link is followed by many more UserIDs, which gives us better statistics, and the effect of unrelated page views, from tabbed browsing for instance, is reduced because existing hyperlinks are followed far more often. Secondly, we introduced a notion of session and only keep relations between domains visited during the same session. To delimit sessions, we look at the time difference between two page views. The distribution of the time between two consecutive page views, in seconds, is shown on the following graph:
There are spikes at common delays such as one minute, 100 seconds and 5 minutes, corresponding mostly to auto-refresh. However, this kind of distribution does not suggest a clear cut-off value, so we made an arbitrary choice: a session ends after ten minutes of inactivity.
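The two adjustments can be sketched as follows, again on assumed (UserID, URL, timestamp) records; the ten-minute cut-off and the collapse to domains are the only parts taken from the methodology above, the rest is illustrative.

```python
from collections import defaultdict
from urllib.parse import urlparse

SESSION_TIMEOUT = 10 * 60   # ten minutes of inactivity end a session

def domain_of(url):
    return urlparse(url).netloc

def session_edges(log_records):
    """Links between domains seen consecutively by a UserID within one session."""
    views_by_user = defaultdict(list)            # UserID -> [(timestamp, domain), ...]
    for user_id, url, timestamp in log_records:
        views_by_user[user_id].append((timestamp, domain_of(url)))

    edges = set()
    for views in views_by_user.values():
        views.sort()
        for (t_prev, d_prev), (t_next, d_next) in zip(views, views[1:]):
            # Keep a link only if both views fall in the same session
            # and point to different domains.
            if t_next - t_prev <= SESSION_TIMEOUT and d_prev != d_next:
                edges.add((d_prev, d_next))
    return edges
```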
General facts and statistics
At this point, we have a clean graph structure representing the French web as seen by Criteo. We checked whether this graph has the properties usually found in complex networks. The graph has around 600 000 nodes (each node corresponds to a domain) and 16 billion links. It has more than 1 000 connected components, but the biggest one contains more than 95% of the nodes. The rest of this article focuses on this giant component. The existence of the smaller connected components is interesting in itself: they might be small click farms or foreign sites identified as being French. We will analyze them later.
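As a sketch only, and not the tooling that produced the numbers above, here is how the components could be counted and the giant one extracted with networkx, given an iterable of domain pairs such as the one built in the previous sketch:

```python
import networkx as nx

def giant_component(edges):
    """Return the largest connected component of the undirected domain graph."""
    g = nx.Graph(edges)                      # edges: iterable of (domain_a, domain_b)
    components = sorted(nx.connected_components(g), key=len, reverse=True)
    print("number of connected components:", len(components))
    giant = g.subgraph(components[0]).copy()
    print("share of nodes in the giant component:",
          giant.number_of_nodes() / g.number_of_nodes())
    return giant
```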
In the giant component, the density, which is the probability that two random nodes are connected, is very low. One interesting point is that, in contrast, the clustering coefficient is high (around 0.5). This coefficient is the probability that any two nodes connected to a third one are themselves connected; in effect, it counts triangles in the graph. An explanation for this high clustering coefficient is that the graph is organized into groups of nodes tightly connected to each other but only loosely connected to the rest of the graph. These groups are often called communities, and they gather sites dealing with the same subject.
Even though the graph contains around 600 000 nodes, they are often very close to each other. The average distance between any two nodes in the giant component is only 3, and the diameter, the greatest distance between any two nodes, is only 11. These values are very small considering the number of nodes. This is a characteristic of small-world networks and is usually caused by the presence of nodes with a very high number of connections, called hubs, which create short paths between the others.
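These statistics correspond to standard graph measures; the sketch below shows them with networkx for a connected graph such as the giant component. Note that on 600 000 nodes the exact average distance and diameter are expensive to compute, so in practice they would typically be estimated on a sample of node pairs.

```python
import networkx as nx

def small_world_stats(g):
    """Statistics quoted above, for a connected graph such as the giant component."""
    return {
        "density": nx.density(g),                      # chance that two random nodes are linked
        "clustering": nx.average_clustering(g),        # share of closed triangles
        "average_distance": nx.average_shortest_path_length(g),
        "diameter": nx.diameter(g),                    # greatest distance between two nodes
    }
```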
As we expected, the graph also appears to be scale-free. The degree distribution is very heterogeneous and roughly follows a power law: although they are rare, there are sites with a very large number of incoming or outgoing links. This is very common in graphs built from measurements of real-world objects and is generally explained by a "winner takes all" rule: when you are an important site, other sites tend to link to you and you become even more important.
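A common visual check, sketched below with networkx and matplotlib, is to plot the degree distribution on a log-log scale: a power law then appears as a roughly straight line.

```python
import collections
import matplotlib.pyplot as plt
import networkx as nx

def plot_degree_distribution(g):
    """Degree distribution on a log-log scale; a power law looks roughly linear."""
    degree_counts = collections.Counter(d for _, d in g.degree())
    degrees, counts = zip(*sorted(degree_counts.items()))
    plt.loglog(degrees, counts, marker=".", linestyle="none")
    plt.xlabel("degree (number of linked domains)")
    plt.ylabel("number of domains")
    plt.show()
```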
Even if we do not draw deep conclusions from our web graph, it is always impressive to be able to visualize the web we visit every day. Finally, we cannot resist sharing with you a small, useless picture. This is the internet:
T. Aynaud and P. Roseau