Finding Similarity

Finding similarity in Strings using Jaro distance

I recently started to develop a strange interest in mechanisms for detecting similarity and difference in collections of stuff. Somebody once told me that everything in life is about search and sort. Perhaps that's the reason these techniques for detecting similarity and difference resonate so deeply.

I had always known about Levenshtein distance, but never cared to see what it actually did. But when I look into it the other day, I found out there are quite a few other techniques as well: Jaro distance is one of them, and it's not included in StringUtils from commons-lang.

Reading up on these kind of things is dangerous. Don't do it! The Wikipedia pages explain it in just a few lines with a simple formula, suggesting that implementing it is ridiculously simple.

And I have to admit, it isn't all that hard at all. But then, when you try to do it, it appears that a few fundamental parts have been left out of the description. It feels a bit like starting to work on a construction kit, and then - when you're almost done - finding that the fundamental piece holding everything together is not included in the box.

Implementing it

Unfortunately, nobody warned me, so I ended up thinking I could implement it in Scala in five minutes. And then found myself struggling for more time than I'm willing to admit searching the Internet for the details they left out, or my brain refused to register.

Here's my version. It needs a bit of work, but it does the job:

def distance(first: String, second: String) = {
  val maxLength = max(first.length, second.length)
  val maxDistance = (maxLength / 2) - 1
  var matches = 0
  var halfTranspositions = 0
  for (i <- 0 until first.length) {
    val c = first.charAt(i)
    if (second.length > i && c == second.charAt(i)) matches += 1
    else {
      var found = false
      for {
        j <- max(0, i - maxDistance) until min(second.length, i + maxDistance)
        if !found
        if (i != j)
        if (c == second.charAt(j))
      } {
        matches += 1
        halfTranspositions += 1
        found = true
      }
    }
  }
  (1.0 / 3.0) * (
    (matches / first.length.toDouble) +
      (matches / second.length.toDouble) +
      ((matches - (halfTranspositions / 2)) / matches.toDouble)
    )
}

With this, let's assume somebody is searching for work of Ruisdaal. That is, he assumes that's the way you write his name. In reality, it's most often spelled as Ruysdael, so you can easily get it wrong.

With Jaro, we can now easily find the best match given a list of potential candidates:

// The candidates from which we need to pick a name
val candidates = 
  List("roghman", "ruysdael", "rembrandt", "vermeer", "mondriaan")            

// Sort the names by calculating the Jaro distance between these
// names and the String "ruisdaal". Then reverse the order - we
// want the name with highest score to come first - and pick the
// first name from the list.
candidates.sortBy(distance(_, "ruisdaal")).reverse.head 

// gives ruysdael

(Came across this afterward; resisting the urge to read it, otherwise I know what I will be doing for the next couple of hours.)