Wikipedia Clickstream - Getting Started
This post gives an introduction to working with the new February release of the Wikipedia Clickstream dataset. The data shows how people get to a Wikipedia article and what articles they click on next. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. To give an example, consider the figure below, which shows incoming and outgoing traffic to the “London” article.
The example shows that most people found the “London” page through Google Search and that only a small fraction of readers went on to another article. Before diving into some examples of working with the data, let me give a more detailed explanation of how the data was collected.
Data Preparation
The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the “referer”. This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015.
The dataset only includes requests for articles in the main namespace of the desktop version of English Wikipedia.
Referers were mapped to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources to English Wikipedia, based on the following scheme (a rough sketch of this mapping in code follows the list):
- an article in the main namespace of English Wikipedia -> the article title
- any Wikipedia page that is not in the main namespace of English Wikipedia -> other-wikipedia
- an empty referer -> other-empty
- a page from any other Wikimedia project -> other-internal
- Google -> other-google
- Yahoo -> other-yahoo
- Bing -> other-bing
- Facebook -> other-facebook
- Twitter -> other-twitter
- anything else -> other-other
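To make the scheme concrete, here is a minimal sketch of what such a mapping could look like in Python. This is illustrative only: classify_referer, the main_ns_title helper, and the host checks are assumptions, not the production code.
# illustrative sketch of the referer mapping described above -- not the production code
from urlparse import urlparse  # use urllib.parse on Python 3
def classify_referer(referer, main_ns_title):
    # main_ns_title(url) -> article title if the url is a main-namespace
    # English Wikipedia article, else None (hypothetical helper)
    if not referer:
        return 'other-empty'
    host = urlparse(referer).netloc.lower()
    if host == 'en.wikipedia.org':
        title = main_ns_title(referer)
        return title if title else 'other-wikipedia'
    if host.endswith('wikipedia.org'):
        return 'other-wikipedia'   # any Wikipedia page outside enwiki's main namespace
    if host.endswith(('wikimedia.org', 'wiktionary.org', 'wikibooks.org')):
        return 'other-internal'    # other Wikimedia projects
    for source in ('google', 'yahoo', 'bing', 'facebook', 'twitter'):
        if source in host:
            return 'other-' + source
    return 'other-other'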
MediaWiki redirects are used to forward clients from one page name to another. They are useful when a particular article is referred to by multiple names, or has alternative punctuation, capitalization, or spellings. Requests for pages that get redirected were mapped to the page they redirect to. For example, requests for ‘Obama’ redirect to the ‘Barack_Obama’ page. Redirects were resolved using a snapshot of the production redirects table from February 28, 2015.
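Conceptually, resolving a redirect is just a lookup against that snapshot. A minimal sketch (the redirects dict and the resolve helper are illustrative, not the production code):
# redirects: assumed dict built from the redirect table snapshot,
# mapping a redirect title to its target, e.g. {'Obama': 'Barack_Obama'}
def resolve(title, redirects):
    return redirects.get(title, title)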
Redlinks are links to an article that does not exist. Either the article was deleted after the creation of the link or the author intended to signal the need for such an article. Requests for redlinks are included in the data.
We attempt to exclude spider traffic by classifying user agents with the ua-parser library and a few additional Wikipedia-specific filters. Furthermore, we attempt to filter out traffic from bots that request a page and then request all or most of the links on that page (BFS traversal) by setting a threshold on the rate at which a client can request articles with the same referer. Requests made at too high a rate get discarded. For the exact details, see here and here. The threshold is quite high to avoid excluding human readers who open tabs as they read. As a result, requests from slow-moving bots are likely to remain in the data. More sophisticated bot detection that evaluates a client's entire set of requests is an avenue for future work.
Finally, any (referer, resource) pair with 10 or fewer observations was removed from the dataset.
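As a rough sketch (not the production code), this last step amounts to a simple threshold on the aggregated counts; pair_counts and its column n are assumptions here:
# illustrative only: drop (referer, resource) pairs seen 10 or fewer times
pair_counts = pair_counts[pair_counts['n'] > 10]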
Format
The data includes the following 6 fields (an illustrative example row follows the list):
- prev_id: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer i.e. the previous article the client was on
- curr_id: the unique MediaWiki page ID of the article the client requested
- n: the number of occurrences of the (referer, resource) pair
- prev_title: the result of mapping the referer URL to the fixed set of values described above
- curr_title: the title of the article the client requested
- type
- “link” if the referer and request are both articles and the referer links to the request
- “redlink” if the referer is an article and links to the request, but the request is not in the production enwiki.page table
- “other” if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer
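For illustration, a single row of the TSV would look something like the following (the ids and the count here are made up):
prev_id	curr_id	n	prev_title	curr_title	type
1234	5678	42	London	River_Thames	link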
Getting to know the Data
There are various quirks in the data due to the dynamic nature of the network of articles in English Wikipedia and the prevalence of requests from automata. The following section gives a brief overview of the data fields and caveats that need to be kept in mind.
Loading the Data
First let’s load the data into a pandas DataFrame.
import pandas as pd
df = pd.read_csv("2015_02_clickstream.tsv", sep='\t', header=0)
# we won't use the ids here, so let's discard them
df = df[['prev_title', 'curr_title', 'n', 'type']]
df.columns = ['prev', 'curr', 'n', 'type']
Top articles
The public pageview dumps that WMF releases have long made it possible to estimate which pages get the most pageviews per month. Unfortunately, there is no attempt to remove spiders and bots from those dumps. This month the “Layer 2 Tunneling Protocol” article was the 3rd most requested article. The logs show that this article was requested by a small number of clients hundreds of times per minute within a 4-day window. This kind of request pattern is removed from the clickstream data, which gives the following as the top 10 pages:
df.groupby('curr').sum().sort('n', ascending=False)[:10]
curr | n |
---|---|
Main_Page | 127500620 |
87th_Academy_Awards | 2559794 |
Fifty_Shades_of_Grey | 2326175 |
Alive | 2244781 |
Chris_Kyle | 1709341 |
Fifty_Shades_of_Grey_(film) | 1683892 |
Deaths_in_2015 | 1614577 |
Birdman_(film) | 1545842 |
Islamic_State_of_Iraq_and_the_Levant | 1406530 |
Stephen_Hawking | 1384193 |
The most requested pages tend to be about media that was popular in February. The exceptions are the “Deaths_in_2015” article and the “Alive” disambiguation article. The “Main_Page” links to “Deaths_in_2015” and is the top referer to this article, which would explain the high number of requests. The fact that the “Alive” disambiguation page gets so many hits seems suspect and is likely to be a fruitful case to investigate to improve the bot filtering.
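One quick way to start digging (a sketch, reusing the DataFrame loaded above) is to look at where the traffic to “Alive” comes from:
# top referers to the suspicious "Alive" disambiguation page
df_alive = df[df['curr'] == 'Alive']
df_alive.sort('n', ascending=False)[['prev', 'n', 'type']][:5]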
Top Referers
The clickstream data also lets us investigate who the top referers to Wikipedia are:
df.groupby('prev').sum().sort('n', ascending=False)[:10]
prev | n |
---|---|
other-google | 1494662520 |
other-empty | 347424627 |
other-wikipedia | 129619543 |
other-other | 77496915 |
other-bing | 65895496 |
other-yahoo | 48445941 |
Main_Page | 29897807 |
other-twitter | 19222486 |
other-facebook | 2312328 |
87th_Academy_Awards | 1680559 |
The top referer by a large margin is Google. Next comes refererless traffic (usually clients using HTTPS). Then come other language Wikipedias and pages in English Wikipedia that are not in the main (i.e. article) namespace. Bing directs significantly more traffic to Wikipedia than Yahoo does. Social media referrals are tiny compared to Google, with Twitter sending nearly 10 times more requests to Wikipedia than Facebook.
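To put these counts in perspective, we can also look at each referer's share of all requests in the data (a quick sketch using the same DataFrame):
# share of all requests coming from each referer
totals = df.groupby('prev').sum().sort('n', ascending=False)
totals['share'] = totals['n'] / float(df.n.sum())
totals[:10]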
Trending on Social Media
Let's look at which articles were trending on Twitter:
df_twitter = df[df['prev'] == 'other-twitter']
df_twitter.groupby('curr').sum().sort('n', ascending=False)[:5]
curr | n |
---|---|
Johnny_Knoxville | 198908 |
Peter_Woodcock | 126259 |
2002_Tampa_plane_crash | 119906 |
Sơn_Đoòng_Cave | 116012 |
The_boy_Jones | 114401 |
I have no explanation for these, but if you find any of the tweets linking to these articles, I would be curious to see why they got so many click-throughs.
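For comparison, the same query can be run for Facebook traffic:
# articles receiving the most traffic from Facebook
df_facebook = df[df['prev'] == 'other-facebook']
df_facebook.groupby('curr').sum().sort('n', ascending=False)[:5]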
Most Requested Missing Pages
Next, let's look at the most popular redlinks. Redlinks are links to a Wikipedia page that does not exist, either because it has been deleted or because the author is anticipating the creation of the page. Seeing which redlinks are the most viewed is interesting because it gives some indication of demand for missing content. Since the set of pages and links is constantly changing, labeling redlinks is not an exact science. In this case, I used a static snapshot of the page and pagelinks production tables from February 28th to mark a page as a redlink.
df_redlinks = df[df['type'] == 'redlink']
df_redlinks.groupby('curr').sum().sort('n', ascending=False)[:5]
curr | n |
---|---|
2027_Cricket_World_Cup | 6782 |
Rethinking | 5279 |
Chris_Soules | 5229 |
Anna_Lezhneva | 3764 |
Jillie_Mack | 3685 |
Searching Within Wikipedia
Usually, clients navigate from one article to another by following a link. The other prominent case is search: the article from which the user searched is passed as the referer to the found article. Hence, you will find a high count of (Wikipedia, Chris_Kyle) tuples even though there is no link to the “Chris_Kyle” article from the “Wikipedia” article; people went to the “Wikipedia” article and searched for “Chris_Kyle”. Finally, it is possible that the client messed with their referer header. Still, the vast majority of requests with an internal referer correspond to a true link.
df_search = df[df['type'] == 'other']
df_search = df_search[df_search.prev.str.match("^other.*").apply(bool) == False]
print "Number of searches/ incorrect referers: %d" % df_search.n.sum()
Number of searches/ incorrect referers: 106772349
df_link = df[df['type'] == 'link']
df_link = df_link[df_link.prev.str.match("^other.*").apply(bool) == False]
print "Number of links followed: %d" % df_link.n.sum()
Number of links followed: 983436029
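Dividing the two shows roughly what fraction of internal navigation is a search or a spoofed referer rather than a followed link:
search_fraction = float(df_search.n.sum()) / (df_search.n.sum() + df_link.n.sum())
print "Fraction of internal navigation from search/incorrect referers: %.3f" % search_fraction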
Inflow vs Outflow
You might be tempted to think that there can't be more traffic coming out of a node (i.e. article) than going into it. This is not true, for two reasons. First, people open links in multiple tabs as they read an article, so a single pageview can lead to multiple records with that page as the referer. Second, the data is certain to include requests from bots that we did not correctly filter out; bots will often follow most, if not all, of the links in an article. Let's look at the ratio of outgoing to incoming traffic for the most requested pages.
df_in = df.groupby('curr').sum() # pageviews per article
df_in.columns = ['in_count',]
df_out = df.groupby('prev').sum() # link clicks per article
df_out.columns = ['out_count',]
df_in_out = df_in.join(df_out)
df_in_out['ratio'] = df_in_out['out_count']/df_in_out['in_count'] # compute ratio of outflow to inflow
df_in_out.sort('in_count', ascending = False)[:3]
curr | in_count | out_count | ratio |
---|---|---|---|
Main_Page | 127500620 | 29897807 | 0.234491 |
87th_Academy_Awards | 2559794 | 1680559 | 0.656521 |
Fifty_Shades_of_Grey | 2326175 | 1146354 | 0.492806 |
Looking at the pages with the highest ratio of outgoing to incoming traffic reveals how messy the data is, even after the careful data preparation described above.
df_in_out.sort('ratio', ascending = False)[:3]
curr | in_count | out_count | ratio |
---|---|---|---|
List_of_Major_League_Baseball_players_(H) | 57 | 1323 | 23.210526 |
2001–02_Slovak_Superliga | 22 | 472 | 21.454545 |
Principle_of_good_enough | 23 | 374 | 16.260870 |
All of these pages have more traversals of a single link than they have requests for the page to begin with. As a post-processing step, we might enforce that there can't be more traversals of a link than there were requests to the page it starts from. Better bot filtering should help reduce this issue in the future.
df_post = pd.merge(df, df_in, how='left', left_on='prev', right_index=True)
df_post['n'] = df_post[['n', 'in_count']].min(axis=1)
del df_post['in_count']
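As a quick sanity check of the clipping (a sketch, re-joining the clipped counts against the inflow computed above), no pair should exceed its referer page's inflow any more:
# verify that no (prev, curr) pair now exceeds its referer page's inflow
df_check = pd.merge(df_post, df_in, how='left', left_on='prev', right_index=True)
print "Pairs exceeding inflow after clipping: %d" % (df_check['n'] > df_check['in_count']).sum()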
Network Analysis
We can think of Wikipedia as a network with articles as nodes and links between articles as edges. This network has been available for analysis via the production pagelinks table. But what does it mean if there is a link between articles that never gets traversed? What is it about the pages that send their readers to other pages with high probability? What makes a link enticing to follow? What are the cliques of articles that send lots of traffic to each other? These are just some of the questions this dataset allows us to investigate. I’m sure you will come up with many more.
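As a starting point for this kind of analysis, the internal link traffic can be loaded into a graph library. The sketch below assumes the networkx package is available; it builds a directed graph, weighted by click counts, from the link-type rows computed earlier:
import networkx as nx
# directed graph of article-to-article traffic, weighted by click counts
G = nx.DiGraph()
for prev, curr, n in zip(df_link['prev'], df_link['curr'], df_link['n']):
    G.add_edge(prev, curr, weight=n)
# e.g. total outgoing clicks per article (weighted out-degree)
out_clicks = G.out_degree(weight='weight')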
Getting the Data
The dataset is released under CC0. The canonical citation and most up-to-date version of the data can be found at:
Ellery Wulczyn, Dario Taraborelli (2015). Wikipedia Clickstream. figshare. doi:10.6084/m9.figshare.1305770
An IPython Notebook version of this post can be found here.