A collection of machine learning models for detecting personal attacks and aggression in discussion comments. We are using the models to study the prevalence and impact of toxic comments on Wikipedia, and to encourage the development of new moderation tools. Built in collaboration with Google Jigsaw. This is a work in progress.

[homepage] [paper] [code] [app]

Wikipedia Navigation Vectors

Embeddings of Wikipedia articles and Wikidata items, obtained by applying Word2vec models to a corpus of billions of reading sessions. The embeddings have the property that articles that tend to be read in close succession have similar vector representations. A demo app showing how these vectors can be used to generate reading recommendations is linked below.
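
To illustrate how vectors like these could drive recommendations, here is a minimal sketch that ranks articles by cosine similarity to a query article. The function names and the three-dimensional toy vectors are made up for illustration; the real vectors are much higher-dimensional, and the demo app may use a different ranking method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recommend(title, vectors, k=2):
    """Return the k articles whose vectors are closest to `title`."""
    query = vectors[title]
    scored = [
        (cosine(query, vec), other)
        for other, vec in vectors.items()
        if other != title
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


# Hypothetical toy vectors, not taken from the released data.
toy_vectors = {
    "Paris": [0.9, 0.1, 0.0],
    "France": [0.8, 0.2, 0.1],
    "Eiffel Tower": [0.85, 0.05, 0.05],
    "Photosynthesis": [0.0, 0.1, 0.9],
}

print(recommend("Paris", toy_vectors))
```

With the toy data, the articles pointing in nearly the same direction as "Paris" rank first, while the unrelated article is left out.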

[homepage] [code] [data] [app]


System for recommending articles for translation between Wikipedias. Nominated for best paper at WWW 2016.

[paper] [video] [code] [app]

Wikipedia Clickstream

Dataset containing counts of billions of (referer, resource) pairs extracted from the request logs of Wikipedia. The data shows how people get to a Wikipedia article and what links they click on. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another.
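
As a rough sketch of the weighted-network view, the snippet below builds an adjacency map from (referer, resource, count) rows and looks up the most-followed links from a page. The row layout, function names, and sample counts are assumptions for illustration and do not reflect the exact schema of the released files.

```python
from collections import defaultdict


def build_graph(rows):
    """Build {referer: {resource: total_count}} from (referer, resource, count) rows."""
    graph = defaultdict(dict)
    for referer, resource, count in rows:
        graph[referer][resource] = graph[referer].get(resource, 0) + count
    return graph


def top_outgoing(graph, page, k=3):
    """Return the k most-followed links from `page`, heaviest edge first."""
    edges = graph.get(page, {})
    return sorted(edges.items(), key=lambda kv: kv[1], reverse=True)[:k]


def incoming_total(graph, page):
    """Total number of recorded arrivals at `page` across all referers."""
    return sum(targets.get(page, 0) for targets in graph.values())


# Hypothetical sample rows, not real clickstream counts.
rows = [
    ("other-search", "London", 500),
    ("London", "River_Thames", 120),
    ("London", "England", 80),
    ("England", "London", 60),
    ("London", "River_Thames", 30),
]

graph = build_graph(rows)
print(top_outgoing(graph, "London", k=2))
print(incoming_total(graph, "London"))
```

Duplicate (referer, resource) rows are summed, so each edge weight is the total number of times that navigation was observed.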

[homepage] [code] [data]

Practical A/B testing

A collection of blog posts containing practical advice on A/B testing, along with some useful extensions to Bayesian hypothesis testing methods, based on lessons learned from hundreds of tests at the Wikimedia Foundation (WMF).
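
For readers unfamiliar with Bayesian A/B testing, here is a minimal sketch of the standard Beta-Binomial approach: each variant's conversion rate gets a Beta posterior, and the probability that B beats A is estimated by sampling. This is a textbook baseline under a uniform Beta(1, 1) prior, not necessarily the method or the extensions described in the posts; the function name and the example counts are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    wins = 0
    for _ in range(samples):
        # Posterior for each variant: Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / samples


# Hypothetical test: 100/1000 conversions for A, 150/1000 for B.
print(prob_b_beats_a(100, 1000, 150, 1000))
```

The returned value can be read directly as "the probability that B is the better variant given the data and the prior", which is often easier to act on than a p-value.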

Mining for Earmarks

Built a machine learning system for automatically extracting earmarks from congressional bills and reports. The system was used to construct the first publicly available database of earmarks dating back to 1995. Won the runner-up prize at the 2015 Bloomberg Data for Good Exchange. Accepted at KDD 2016.

[homepage] [paper] [code] [data]

Deep Learning for Text Classification

A recursive neural network for short text classification. The model jointly learns vector representations of words, a method for merging word vectors into document vectors, and a classifier over the document vectors. Implemented from scratch, back in the day before we had deep learning frameworks.


Trust in the CouchSurfing Network

Inference of trust ratings between strangers from trust ratings between acquaintances and the structure of the network that connects them. My CS229 project turned ICWSM paper.