The Baconator: Finding Erdős Numbers on Wikipedia
The Erdős number describes the collaborative distance between an author of mathematical papers and Paul Erdős. The Bacon number, which measures an actor’s degress of separation from Kevin Bacon in cinema, is a well known varition on this theme. I recently came across another interesting variation – what is the shortest possible path between any two articles on Wikipedia?
Finding the shortest path between two nodes suggests a breadthfirst search. Given any starting page and a target page, our search should look something like this:
 Extract all links on the page
 Check if any of these are the target page. If so, return the path from the starting page to the target page
 Return to step 1, visiting all of the links found on the page (yay recursion!)
This gives us a few problems to solve:
 Extracting all of the links from a Wikipedia page that point to other Wikipedia pages
 Returning the sequence of articles that makes up a shortest path. Many paths could tie for shortest; we are satisfied with returning one of them
 Keep track of which links we have visited to avoid visiting them a second time
The first problem is readily solved with yokogiri (Nokogiri for Clojure), a lovely HTML/XML parser library. A solution might look something like this, with the addition of error handling:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 

The second problem is more interesting – how do we ensure the shortest path is returned? Our goal is to visit all of the distance1 links first, then all of the distance2 links, etc. to ensure the shortest possible path is found. The distance2 links are all created from distance1 links, hence the only challenge here is to ensure we visit links in order of their depth. We can accomplish this with a queue, taking the next link to test from the front of the queue and adding new links discovered to the back of the queue. While you won’t find it on the cheatsheet, Clojure does include a queue that can be created with clojure.lang.PersistentQueue/EMPTY
.
As links are extracted from the page, they are compared against the visited
set to ensure we do not visit the same link twice. New links are added to the visited set and the queue as they are discovered.
1 2 3 4 5 6 7 8 9 10 

Instead of enqueueing links, we enqueue lists of links to maintain knowledge of the path that got us to the next link to visit, which is the first link in each list. Hence, running (search "/wiki/Secondhand_Lions:_A_New_Musical" "/wiki/Kevin_Bacon")
ultimately returns ("/wiki/Robert_Duvall" "/wiki/Secondhand_Lions" "/wiki/Secondhand_Lions:_A_New_Musical")
giving Secondhand Lions: A New Musical a WikiBacon number of 3.
If Clojure isn’t quite your speed and you’d prefer to see a Ruby implementation you can find it here.