The Baconator: Finding Erdős Numbers on Wikipedia

The Erdős number describes the collaborative distance between an author of mathematical papers and Paul Erdős. The Bacon number, which measures an actor’s degrees of separation from Kevin Bacon in cinema, is a well-known variation on this theme. I recently came across another interesting variation - what is the shortest possible path between any two articles on Wikipedia?

Finding the shortest path between two nodes suggests a breadth-first search. Given any starting page and a target page, our search should look something like this:

  1. Extract all links on the page
  2. Check if any of these are the target page. If so, return the path from the starting page to the target page
  3. If the target was not found, return to step 1, visiting each of the links found on the page (yay recursion!)
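As a sketch, the steps above amount to a breadth-first search. Here is the same loop run over a small hardcoded link graph standing in for live Wikipedia pages (the graph and node names are invented for illustration):

```clojure
;; Toy breadth-first search over a hardcoded link graph.
;; Each queue entry is the full path taken so far.
(def graph
  {"A" ["B" "C"]
   "B" ["D"]
   "C" ["D" "E"]
   "D" ["E"]})

(defn bfs-path [graph start target]
  (loop [queue (conj clojure.lang.PersistentQueue/EMPTY [start])
         visited #{start}]
    (when-let [path (peek queue)]
      (let [node      (last path)
            neighbors (remove visited (get graph node []))]
        (if (some #{target} neighbors)
          (conj path target)
          (recur (into (pop queue) (map #(conj path %) neighbors))
                 (into visited neighbors)))))))

(bfs-path graph "A" "E")
;; => ["A" "C" "E"]
```

Because whole levels of the graph are exhausted before the next level is touched, the first path to reach the target is guaranteed to be a shortest one.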

This gives us a few problems to solve:

  1. Extracting all of the links from a Wikipedia page that point to other Wikipedia pages
  2. Returning the sequence of articles that makes up a shortest path. Many paths could tie for shortest; we are satisfied with returning one of them
  3. Keeping track of which links we have visited to avoid visiting them a second time

The first problem is readily solved with yokogiri (Nokogiri for Clojure), a lovely HTML/XML parser library. A solution might look something like this (you would want to add error handling):

(ns erdos-c.core
  (:require [yokogiri.core :refer :all]))

(def client (make-client :javascript false))

(defn page [uri]
  ;; relative /wiki/ links are resolved against English Wikipedia
  (get-page client (str "https://en.wikipedia.org" uri)))

(defn link-snippets [uri]
  (xpath (page uri) "//a"))

(defn links [uri]
  (->> (link-snippets uri)
    (map attrs)
    (map :href)
    (remove nil?)))

(defn wikifilter [uri] 
  (re-find #"\A\/wiki\/" uri))

(defn wiki-links [uri]
  (filter wikifilter (links uri)))
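At the REPL, the regex behaves like this (sample hrefs are made up; the pattern itself is the one used in wikifilter above):

```clojure
;; \A anchors the match at the start of the string, so only
;; relative links into the /wiki/ namespace pass the filter.
(re-find #"\A\/wiki\/" "/wiki/Paul_Erdos")    ;; => "/wiki/"
(re-find #"\A\/wiki\/" "https://example.com") ;; => nil
```

One caveat: non-article pages such as /wiki/Special:Random also begin with /wiki/ and would pass this filter unchanged.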

The second problem is more interesting - how do we ensure the shortest path is returned? Our goal is to visit all of the distance-1 links first, then all of the distance-2 links, and so on, so that the first path to reach the target is a shortest one. Since every distance-2 link is discovered from a distance-1 page, the only challenge is to visit links in order of their depth. We can accomplish this with a queue, taking the next link to test from the front of the queue and adding newly discovered links to the back. While you won’t find it on the cheatsheet, Clojure does include a persistent queue that can be created with clojure.lang.PersistentQueue/EMPTY.
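A quick demonstration of the queue's FIFO behaviour:

```clojure
;; conj adds to the back, peek reads the front, pop drops the front.
(let [q (-> clojure.lang.PersistentQueue/EMPTY
            (conj :first)
            (conj :second)
            (conj :third))]
  [(peek q) (vec (pop q))])
;; => [:first [:second :third]]
```

Note that conj on a list or vector would add to the front or back respectively; the queue is the only built-in collection with first-in, first-out ordering.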

As links are extracted from the page, they are compared against the visited set to ensure we do not visit the same link twice. New links are added to the visited set and the queue as they are discovered.
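Clojure sets act as membership predicates, which makes the visited check a one-liner (the link names here are made up for illustration):

```clojure
;; A set called as a function returns its argument when it is a
;; member and nil otherwise, so remove drops already-seen links.
(let [visited #{"/wiki/A" "/wiki/B"}
      found   ["/wiki/B" "/wiki/C"]]
  (remove visited found))
;; => ("/wiki/C")
```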

(defn search [start target]
  (loop [queue (conj clojure.lang.PersistentQueue/EMPTY (list start))
         visited #{start}]
    (when-let [path (peek queue)]
      (let [test-link (first path)
            page-links (remove visited (wiki-links test-link))]
        (if (some #(= target %) page-links)
          path
          (recur (apply conj (pop queue) (map #(cons % path) page-links))
                 (into visited page-links)))))))

Instead of enqueueing links, we enqueue lists of links to maintain knowledge of the path that got us to the next link to visit, which is the first link in each list. Hence, running (search "/wiki/Secondhand_Lions:_A_New_Musical" "/wiki/Kevin_Bacon") ultimately returns ("/wiki/Robert_Duvall" "/wiki/Secondhand_Lions" "/wiki/Secondhand_Lions:_A_New_Musical") giving Secondhand Lions: A New Musical a Wiki-Bacon number of 3.

If Clojure isn’t quite your speed and you’d prefer to see a Ruby implementation you can find it here.