GitHub Contest RecapPosted: 03 Sep 2009
Recently I decided to throw my hat into the GitHub contest ring, the goal of which was to predict repositories that a GitHub user should watch. It had been some years since I had really done anything intensive with artificial intelligence (AI) and I thought it would be fun. I was also attracted to the bounty offered: an aged bottle of bourbon and a lifetime large GitHub account. This post serves as my high-level recap of the contest.
While I had never worked on a recommendation system per se, I had done work with classifiers before. Looking at the problem, my gut reaction was that an instance-based learning algorithm, such as k-nearest neighbors, would be the most effective approach. I also anticipated the top three winning entries to end up somewhere in the 70 - 85% accuracy range. As it turns out, the best accuracy at the end of the contest was 56%. While it is possible that this prediction accuracy is the best that could be achieved with the data, my guess is the relatively short time allowed for the contest (roughly a month) made it difficult for a very strong entry in the contest. Additionally, the prize offered likely only attracted amateur entries. This is not to say the contestants were unskilled, just that I doubt the dedication shown for something such as the Netflix contest was exerted for the GitHub contest.
I decided to use the weighted k-nearest-neighbors algorithm as the basis for my submission. In order to measure progress and avoid overfitting results, I used a stratified n-fold cross validation evaluation strategy. The idea behind n-fold cross validation is to partition the dataset into proper subsets (the "fold") that each maintain the underlying distribution of the full dataset. This consisted of taking a stratified sample (watcher ID, repository ID) pairings, using the repository ID as the class value. Once the n folds are created, one is used as a test set while the other n - 1 are used for training; the process is repeated until every fold has the opportunity to be the test set (the cross validation). The average of the observed accuracies is used to mitigate the effect potential sampling bias.
My first approach was not terribly efficient, but I had taken the engineering approach of making it correct and then making it fast. I was quite surprised when my first pass wouldn't even run because it broke the MRI Ruby garbage collector. This forced me to reduce the search space earlier than I had intended. While I could have taken a stratified sample of the training data, I wanted to minimize data loss, so I opted to restructure the instance space into regions grouped by repository ancestry. This reduced the instance space down from about 113,000 points to roughly 77,000 using 10 folds. While an improvement, this reduced space still proved to be too large to perform full evaluations over doing pairwise comparisons (a O(n2) approach). Other pruning methods, such as removing regions with single repositories or single watchers, further reduced the search space, but at the cost of accuracy.
The next step, thus, was to devise a search strategy that attempted to find the regions that the test instance belonged to, while avoiding those that the instance did not. I tried several heuristics with varying degrees of success. Empirical evidence suggested that if a test instance either watched 25% or more of an owner's repositories or 25% of the test instance's repositories were owned by the same person, that the test instance would be likely to watch other repositories owned by that person. For my final submission the search heuristic I used considered all the regions that the test instance was already known to belong to and regions that contained repositories that were owned by any repository owner that the test instance was known to watch a substantial portion of.
Simply finding the correct regions wasn't sufficient, however. Once the regions were chosen there was still the matter of choosing the correct repositories from those regions. I found that using the most forked repository per region was generally the correct one. A minor accuracy boost was achieved by evaluating the parent of any repository that a test instance was known to be watching, working under the observation that watchers tended to watch parent repositories when watching one of that repository's children.
Evaluations were performed by calculating the Euclidian distance between two repositories. My first approaches were pure distances and yielded symmetric results. As the project evolved, however, I found myself augmenting the definition of distance to be a route taken between two repositories by a given training instance and even adopted some aspects of Hamming distance. This meant that for two repositories r1 and r2, given training instances t1 and t2, the distance between r1 and r2 could vary depending on characteristics of t1 and t2. While this may not be a strict Euclidian distance calculation, I likened it to modes of transportation. E.g., the distance between two city centers may be fixed for a straight line, but can vary considerably if one chooses to walk versus take a train.
The distance calculation weighted different attributes based upon observations of the shape of the training data. The attributes I ended up using were: parent-child relationships, general common ancestry, common owner, and common watchers. I had planned on using a genetic algorithm to optimize the weighting system, but ran out of time before the end of the contest.
I had two goals when starting this project: 1) win the contest; and 2) learn more about Ruby. Unfortunately, achieving the latter may have come at the expense of the former. In the end, I achieved slightly over 40% accuracy, which was a far cry from where I had expected to end up and wasn't enough to win the contest. My original choice of technology stack was not appropriate. In most of my previous AI projects I used Java with a smattering Python and to a lesser extent, ML and LISP. Having been using Ruby as my primary programming language for the past year and a half, I thought I would use it for my contest entry. I spent roughly three weeks trying to optimize a Ruby solution that could only achieve 31% accuracy after a seven hour run on a quad core machine with 4 GB RAM. During the last week of the contest, I rewrote the project in Java and achieved 40% accuracy with a 10 minute long run on a dual core MacBook Pro. I plan on elaborating on that a bit in a future post.
[Edit: (Sept. 17, 2009) I've now published the post comparing my Ruby and Java entries]
The code for my entry can be found on GitHub.