Given two strings, find the minimum number of edits to transform string1 to string2.
Example
I: "Saturday", "Sunday"
O: 4
I: "More", "Less"
Brute Force Search
The most brute force search is just to enumerate all the possible matches for #edits = 1, 2, 3, ..., n until we find a match. Or we can do a smarter search that starts either from the beginning or the end, comparing the corresponding characters.
If we start from the beginning, then the minimum number of edits for s1 to match s2
if s1[m] == s2[n]
return editDistance(s1, s2, m+1, n+1)
else
return 1 + Math.min(
editDistance(s1, s2, m, n + 1), // insertion
editDistance(s1, s2, m + 1, n), // deletion
editDistance(s1, s2, m + 1, n + 1) // substitution
) ;
The recurrence above should capture all the possibilities. Inside the else statement, the presumed equivalent cost of edit 1 + the min of the rest is guaranteed to be optimal. What if spending one edit is actually more cost efficient then having a match in the if statement? We can get a formal proof by reordering steps in an arbitrary optimum solution. But this example should give some good intuition. Consider SabSunday and Sunday. If we match the S in the beginning, we just have to delete the second S when we get there. The minimum number of edits still remain as 3.
Brute Force Code
public int minDistance(String word1, String word2) {
return editDistance(word1, word2, 0, 0);
}
private int editDistance(String s1, String s2, int m, int n) {
// base case reach the end
if (m == s1.length()) {
// insert the rest of s2
return s2.length() - n;
}
if (n == s2.length()) {
// delete the rest of s1
return s1.length() - m;
}
// search and return
if (s1.charAt(m) == s2.charAt(n)) {
return editDistance(s1, s2, m + 1, n + 1);
} else {
return 1 + minAdv(
editDistance(s1, s2, m, n + 1), // insertion
editDistance(s1, s2, m + 1, n), // deletion
editDistance(s1, s2, m + 1, n + 1) // substitution
);
}
}
private int minAdv(int num1, int num2, int num3) {
return Math.min(Math.min(num1, num2), num3);
}
Search works. The exponential time complexity doesn't. A relative short example like dinitrophenylhydrazine and acetylphenylhydrazine will give Time Limit Exceeded. See the induction proof in Rod Cutting.
Bottom Up DP
Observe the recursion for the following example
If we define a subproblem as ED(i,j), then ED(i,j) represents the optimal solution at using the suffix s1[i:m] to fully match the suffix s2[j:n]. From the recursion tree, we can find quite a few subtrees can be reused if they are memoized. Dynamic programming is the way to go.
Solve by Suffix
1. Define optimal subproblems
ED(i,j) = minimum number of edits for A[i:m] to fully match B[j:n]
2. Solve by suffix
if (A[i] == B[j])
ED(i,j) = ED(i+1,j+1)
else
ED(i,j) = 1 + Math.min(
ED(i,j+1),
ED(i+1,j),
ED(i+1,j+1)
)
3. Topo order
from small suffixes to the beginning
fill in col by col from right to left
for j from n-1 to 0
for i from m-1 to 0
4. Init base
last row and last col
for i from m to 0
ED(i,n) = m - i
for j from n to 0
ED(m,j) = n - j
5. Final answer
ED(0,0)
public int minDistance(String s1, String s2) {
int m = s1.length(), n = s2.length();
char[] A = s1.toCharArray(), B = s2.toCharArray();
// create memo table
int[][] ED = new int[m + 1][n + 1];
// init base case
for (int i = m; i >= 0; i--) {
ED[i][n] = m - i; // last col
}
for (int j = n; j >= 0; j--) {
ED[m][j] = n - j; // last row
}
// topo order
for (int j = n - 1; j >= 0; j--) {
for (int i = m - 1; i >= 0; i--) {
if (A[i] == B[j]) {
ED[i][j] = ED[i+1][j+1];
} else {
ED[i][j] = 1 + minAdv(
ED[i][j+1], // insert
ED[i+1][j], // delete
ED[i+1][j+1] // substitute
);
}
}
}
// final answer
return ED[0][0];
}
Solve by Prefix
Defining the subproblem by prefix might be easier to write sometimes.
1. Define optimal subproblem
ED(i,j) = minimum number of edits for A[0:i] to fully match B[0:j]
2. Solve by prefix
if A[i] == B[j]
ED(i,j) = ED(i-1,j-1)
else
ED(i,j) = 1 + advMin(
ED(i,j-1), // insert
ED(i-1, j), // delete
ED(i-1,j-1) // substitute
)
3. Init base case
for i from 0 to m
ED(i,0) = i // first col
for j from 0 to n
ED(0,j) = j // first row
4. Topo Order
row by row from left to right
for i from 1 to m
for j from 1 to n
5. Final answer
ED(m,n)
public int minDistance(String s1, String s2) {
int m = s1.length(), n = s2.length();
// create memo table
int[][] memo = new int[m+1][n+1];
// init base case
for (int i = 0; i <= m; i++) {
memo[i][0] = i; // first col
}
for (int j = 0; j <= n; j++) {
memo[0][j] = j; // first row
}
// topo order
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (s1.charAt(i-1) == s2.charAt(j-1)) {
memo[i][j] = memo[i-1][j-1];
} else {
memo[i][j] = 1 + minAdv(
memo[i][j-1], // insert
memo[i-1][j], // delete
memo[i-1][j-1] // substitute
);
}
}
}
// final answer
return memo[m][n];
}
Prefix vs Suffix
The tabulation process for prefix is going down (removing from the end so tabulate down) and for suffix is going up (removing from the beginning so tabulate up).
By Prefix
By Suffix
Time & Space Complexity
Time is O(m*n) but space is trickier. We are using O(m*n) space right now for better understanding. But we can get to constant space with careful implementation since the dependencies are only two layers.