Edit Distance

Question (LC.72)

Given two strings, find the minimum number of edits to transform string1 to string2.

Example

I: "Saturday", "Sunday"
O: 4
I: "More", "Less"

Brute Force Search

The most brute force search is just to enumerate all the possible matches for #edits = 1, 2, 3, ..., n until we find a match. Or we can do a smarter search that starts either from the beginning or the end, comparing the corresponding characters.

If we start from the beginning, then the minimum number of edits for s1 to match s2

if s1[m] == s2[n]
    return editDistance(s1, s2, m+1, n+1)
else
    return 1 + Math.min(
        editDistance(s1, s2, m, n + 1), // insertion
        editDistance(s1, s2, m + 1, n), // deletion
        editDistance(s1, s2, m + 1, n + 1)  // substitution
    ) ;

The recurrence above should capture all the possibilities. Inside the else statement, the presumed equivalent cost of edit 1 + the min of the rest is guaranteed to be optimal. What if spending one edit is actually more cost efficient then having a match in the if statement? We can get a formal proof by reordering steps in an arbitrary optimum solution. But this example should give some good intuition. Consider SabSunday and Sunday. If we match the S in the beginning, we just have to delete the second S when we get there. The minimum number of edits still remain as 3.

Brute Force Code

public int minDistance(String word1, String word2) {
    return editDistance(word1, word2, 0, 0);
}

private int editDistance(String s1, String s2, int m, int n) {
    // base case reach the end
    if (m == s1.length()) {
        // insert the rest of s2
        return s2.length() - n;
    }
    if (n == s2.length()) {
        // delete the rest of s1
        return s1.length() - m;
    }
    // search and return
    if (s1.charAt(m) == s2.charAt(n)) {
        return editDistance(s1, s2, m + 1, n + 1);
    } else {
        return 1 + minAdv(
            editDistance(s1, s2, m, n + 1), // insertion
            editDistance(s1, s2, m + 1, n), // deletion
            editDistance(s1, s2, m + 1, n + 1) // substitution
        );
    }
}

private int minAdv(int num1, int num2, int num3) {
    return Math.min(Math.min(num1, num2), num3);
}

Search works. The exponential time complexity doesn't. A relative short example like dinitrophenylhydrazine and acetylphenylhydrazine will give Time Limit Exceeded. See the induction proof in Rod Cutting.

Bottom Up DP

Observe the recursion for the following example

If we define a subproblem as ED(i,j), then ED(i,j) represents the optimal solution at using the suffix s1[i:m] to fully match the suffix s2[j:n]. From the recursion tree, we can find quite a few subtrees can be reused if they are memoized. Dynamic programming is the way to go.

Solve by Suffix

1. Define optimal subproblems
ED(i,j) = minimum number of edits for A[i:m] to fully match B[j:n]

2. Solve by suffix
if (A[i] == B[j])
    ED(i,j) = ED(i+1,j+1)
else
    ED(i,j) = 1 + Math.min(
        ED(i,j+1),
        ED(i+1,j),
        ED(i+1,j+1)
    )

3. Topo order
from small suffixes to the beginning
fill in col by col from right to left
for j from n-1 to 0
    for i from m-1 to 0

4. Init base
last row and last col
for i from m to 0
    ED(i,n) = m - i
for j from n to 0
    ED(m,j) = n - j

5. Final answer
ED(0,0)

public int minDistance(String s1, String s2) {
    int m = s1.length(), n = s2.length();
    char[] A = s1.toCharArray(), B = s2.toCharArray();
    // create memo table
    int[][] ED = new int[m + 1][n + 1];
    // init base case
    for (int i = m; i >= 0; i--) {
        ED[i][n] = m - i; // last col
    }
    for (int j = n; j >= 0; j--) {
        ED[m][j] = n - j; // last row
    }
    // topo order
    for (int j = n - 1; j >= 0; j--) {
        for (int i = m - 1; i >= 0; i--) {
            if (A[i] == B[j]) {
                ED[i][j] = ED[i+1][j+1];
            } else {
                ED[i][j] = 1 + minAdv(
                    ED[i][j+1], // insert
                    ED[i+1][j], // delete
                    ED[i+1][j+1] // substitute
                );
            }
        }
    }
    // final answer
    return ED[0][0];
}

Solve by Prefix

Defining the subproblem by prefix might be easier to write sometimes.

1. Define optimal subproblem
ED(i,j) = minimum number of edits for A[0:i] to fully match B[0:j]

2. Solve by prefix
if A[i] == B[j]
    ED(i,j) = ED(i-1,j-1)
else
    ED(i,j) = 1 + advMin(
        ED(i,j-1), // insert
        ED(i-1, j), // delete
        ED(i-1,j-1) // substitute
    )

3. Init base case
for i from 0 to m
    ED(i,0) = i // first col
for j from 0 to n
    ED(0,j) = j // first row

4. Topo Order
row by row from left to right
for i from 1 to m
    for j from 1 to n

5. Final answer
ED(m,n)

public int minDistance(String s1, String s2) {
    int m = s1.length(), n = s2.length();
    // create memo table
    int[][] memo = new int[m+1][n+1];
    // init base case
    for (int i = 0; i <= m; i++) {
        memo[i][0] = i; // first col
    }
    for (int j = 0; j <= n; j++) {
        memo[0][j] = j; // first row
    }
    // topo order
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            if (s1.charAt(i-1) == s2.charAt(j-1)) {
                memo[i][j] = memo[i-1][j-1];
            } else {
                memo[i][j] = 1 + minAdv(
                    memo[i][j-1], // insert
                    memo[i-1][j], // delete
                    memo[i-1][j-1] // substitute
                );
            }
        }
    }
    // final answer
    return memo[m][n];
}

Prefix vs Suffix

The tabulation process for prefix is going down (removing from the end so tabulate down) and for suffix is going up (removing from the beginning so tabulate up).

By Prefix

By Suffix

Time & Space Complexity

Time is O(m*n) but space is trickier. We are using O(m*n) space right now for better understanding. But we can get to constant space with careful implementation since the dependencies are only two layers.

PreviousLongest Common Subsequence NextRegex Matching

Last updated 5 years ago