Edit Distance

Question (LC.72)

Given two strings, find the minimum number of edits to transform string1 to string2.

Example

I: "Saturday", "Sunday"
O: 4
I: "More", "Less"

The most brute force search is just to enumerate all the possible matches for #edits = 1, 2, 3, ..., n until we find a match. Or we can do a smarter search that starts either from the beginning or the end, comparing the corresponding characters.

If we start from the beginning, then the minimum number of edits for s1 to match s2

if s1[m] == s2[n]
    return editDistance(s1, s2, m+1, n+1)
else
    return 1 + Math.min(
        editDistance(s1, s2, m, n + 1), // insertion
        editDistance(s1, s2, m + 1, n), // deletion
        editDistance(s1, s2, m + 1, n + 1)  // substitution
    ) ;

The recurrence above should capture all the possibilities. Inside the else statement, the presumed equivalent cost of edit 1 + the min of the rest is guaranteed to be optimal. What if spending one edit is actually more cost efficient then having a match in the if statement? We can get a formal proof by reordering steps in an arbitrary optimum solution. But this example should give some good intuition. Consider SabSunday and Sunday. If we match the S in the beginning, we just have to delete the second S when we get there. The minimum number of edits still remain as 3.

Brute Force Code

Search works. The exponential time complexity doesn't. A relative short example like dinitrophenylhydrazine and acetylphenylhydrazine will give Time Limit Exceeded. See the induction proof in Rod Cutting.

Bottom Up DP

Observe the recursion for the following example

If we define a subproblem as ED(i,j), then ED(i,j) represents the optimal solution at using the suffix s1[i:m] to fully match the suffix s2[j:n]. From the recursion tree, we can find quite a few subtrees can be reused if they are memoized. Dynamic programming is the way to go.

Solve by Suffix

Solve by Prefix

Defining the subproblem by prefix might be easier to write sometimes.

Prefix vs Suffix

The tabulation process for prefix is going down (removing from the end so tabulate down) and for suffix is going up (removing from the beginning so tabulate up).

By Prefix

By Suffix

Time & Space Complexity

Time is O(m*n) but space is trickier. We are using O(m*n) space right now for better understanding. But we can get to constant space with careful implementation since the dependencies are only two layers.

Last updated