Friday, December 04, 2009

Implementing the Levenshtein distance algorithm

In a previous post I mentioned there are many ways to perform a test on a string to see how closely it can match another string. One of the more interesting algorithms for this is the Levenshtein Distance Algorithm.

This algorithm is named after Vladimir Levenshtein in 1965. It is often used in applications that need to determine how similar or indeed how different two string values are.

As you can guess this is very useful if you every want to implement a basic fuzzy matching algorithm for example. It is also used a lot in spell checking applications to suggest potential word matches.

The Code sample below is in C# but can be translated to any language. It is not optimized at all and is only here to serve as an illustration. It should be noted that there are better and more sophisticated Distance algorithms that you can find via Google or a good book on string comparison algorithms. (there are a lot of ways to re-factor and optimize the following code.

class LevenshteinDistance
{

     public int Execute(String valueOne, String valueTwo)
    {
        int valueOneLength = valueOne.Length;
        int valueTwoLength = valueTwo.Length;

        int[,] DistanceMatrix = new int[valueOneLength, valueTwoLength];

        for(int i=0; i<valueOneLength; i++)
             DistanceMatrix[i, 0] = i;
       
        for(int j=0; j<valueTwoLength; j++)
            DistanceMatrix[0, j] = j;

        for(int j=1;j<valueTwoLength;j++)
            for (int i = 1; i < valueOneLength; i++)
            {
                if (valueOne[i] == valueTwo[j])
                    DistanceMatrix[i, j] = DistanceMatrix[i - 1, j - 1];
                else
                {
                    int a = DistanceMatrix[i - 1, j] + 1;
                    int b = DistanceMatrix[i, j - 1] + 1;
                    int c = DistanceMatrix[i - 1, j - 1] + 1;
                    DistanceMatrix[i, j] = Math. Min(Math.Min(a,b),c);
                }
            }

        return DistanceMatrix[valueOneLength-1, valueTwoLength-1];
    }   

}

and the test

[Fact]
public void TestDistanceAlgorithm()
{
    LevenshteinDistance x = new LevenshteinDistance();

    Xunit.Assert.Equal(3, x.Execute("Sunday", "Saturday"));
  
}

Tomorrow, I will post an blog entry explaining how you can use this algorithm to match exact or near exact data.

No comments: