Levenshtein Distance .*;

Levenshtein Distance

The Levenshtein distance (LD), sometimes called "edit distance", is a measure of the similarity between two strings, referred to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t.

For example,

If s = "sink" and t = "sink", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.

If s = "sink" and t = "sank", then LD(s,t) = 1, because one substitution (change "i" to "a") is sufficient to transform s into t.

If s = "sink" and t = "tank", then LD(s,t) = 2, because two substitutions (change "i" to "a", and "s" into "t") are needed to transform s into t.

If s = "sink" and t = "thank", then LD(s,t) = 3, because two substitutions (change "i" to "a", and "s" into "t") and one insertion (adding "h") are needed to transform s into t.


The greater the Levenshtein distance, the more different the strings are.

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965.

 

Example By Hand

Let s = "SNAKE"

Let t = "STAKES"

Step 1

  • Set n to be the length of s (n = 5).
  • Set m to be the length of t (m = 6).
  • Construct a matrix containing 0..m rows and 0..n columns.

Step 2

  • Initialize the first row to 0..n.
  • Initialize the first column to 0..m.

    S N A K E
  0 1 2 3 4 5
S 1          
T 2          
A 3          
K 4          
E 5          
S 6          

Steps 3 through 6

  • i = 1

    S N A K E
  0 1 2 3 4 5
S 1 0        
T 2 1        
A 3 2        
K 4 3        
E 5 4        
S 6 5        

 

  • i = 2

    S N A K E
  0 1 2 3 4 5
S 1 0 1      
T 2 1 1      
A 3 2 2      
K 4 3 3      
E 5 4 4      
S 6 5 5      

 

  • i = 3

    S N A K E
  0 1 2 3 4 5
S 1 0 1 2    
T 2 1 1 2    
A 3 2 2 1    
K 4 3 3 2    
E 5 4 4 3    
S 6 5 5 4    

 

  • i = 4

    S N A K E
  0 1 2 3 4 5
S 1 0 1 2 3  
T 2 1 1 2 3  
A 3 2 2 1 2  
K 4 3 3 2 1  
E 5 4 4 3 2  
S 6 5 5 4 3  

 

  • i = 5

    S N A K E
  0 1 2 3 4 5
S 1 0 1 2 3 4
T 2 1 1 2 3 4
A 3 2 2 1 2 3
K 4 3 3 2 1 2
E 5 4 4 3 2 1
S 6 5 5 4 3 2

Step 7

  • The distance is in the lower right hand corner of the matrix, i.e. LD(s,t) = 2. This corresponds to our intuitive realization that "SNAKE" can be transformed into "STAKES" by substituting "N" for "T" and adding "S" (one substitution and 1 insertion = 2 changes).

 

 

public void footer() {
About | Contact | Privacy Policy | Terms of Service | Site Map
Copyright© 2009-2012 John McCullock. All Rights Reserved.
}