Edit Distance and its Applications | Assignments Computer Programming

Programming Assignment: Edit Distance

Target: Many word processors and keyword search engines have a spelling correction feature. If you type in a misspelled word x, the word processor or search engine can suggest a correction y. The correction y should be a word that is close to x. One way to measure the similarity in spelling between two text strings is by “edit distance”. The notion of edit distance is useful in other fields as well. For example, biologists use edit distance to characterize the similarity of DNA or protein sequences.

Background: The edit distance d(x, y) of two strings of text, x[1..m] and y[1..n], is defined to be the minimum possible cost of a sequence of “transformation operations” (defined below) that transforms string x[1..m] into string y[1..n]. To define the effect of the transformation operations, we use an auxiliary string z[1..s] that holds the intermediate results. At the beginning of the transformation sequence, s = m and z[1..s] = x[1..m] (i.e., we start with string x[1..m]). At the end of the transformation sequence, we should have s = n and z[1..s] = y[1..n] (i.e., our goal is to transform z into string y[1..n]). Throughout the transformation, we maintain the current length s of string z, as well as a cursor position i, i.e., an index into string z. The invariant 1 ≤ i ≤ s + 1 holds at all times during the transformation. (Notice that the cursor can move one space beyond the end of the string z in order to allow insertion at the end of the string.)

Each transformation operation may alter the string z, the size s, and the cursor position i. Each transformation operation also has an associated cost. The cost of a sequence of transformation operations is the sum of the costs of the individual operations in the sequence. The goal of the edit-distance problem is to find a sequence of transformation operations of minimum cost that transforms x[1..m] into y[1..n].

There are five transformation operations:

Operation  Cost  Effect
left       0     If i = 1 then do nothing. Otherwise, set i ← i - 1.
right      0     If i = s + 1 then do nothing. Otherwise, set i ← i + 1.
replace    4     If i = s + 1 then do nothing. Otherwise, replace the character under the cursor by another character c by setting z[i] ← c, and then incrementing i.
delete     2     If i = s + 1 then do nothing. Otherwise, delete the character c under the cursor by setting z[i..s] ← z[i+1..s+1] and decrementing s. The cursor position i does not change.
insert     3     Insert the character c into string z by incrementing s, setting z[i+1..s] ← z[i..s-1], setting z[i] ← c, and then incrementing index i.

As an example, one way to transform the source string algorithm to the target string
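The effect of the five operations listed above can be made concrete with a small simulator. The following is a minimal Python sketch, not part of the assignment: the function name apply_ops, the (name, character) tuple encoding, the starting cursor position i = 1, and the choice to charge an operation's cost even when it does nothing are all assumptions made for illustration. It also reads "right" as moving the cursor forward (i ← i + 1).

```python
# Costs taken from the table of operations.
COST = {"left": 0, "right": 0, "replace": 4, "delete": 2, "insert": 3}


def apply_ops(x, ops):
    """Apply a list of (name, char) operations to x.

    Returns (resulting string, total cost). `char` is ignored for
    left, right, and delete, so None can be passed for those.
    """
    z = list(x)   # z[1..s] in the text is stored as z[0..s-1] here
    i = 1         # 1-based cursor; assumed to start at the first character
    total = 0
    for name, c in ops:
        s = len(z)
        if name not in COST:
            raise ValueError(f"unknown operation: {name}")
        total += COST[name]  # assumption: cost is charged even for a no-op
        if name == "left":
            if i > 1:
                i -= 1
        elif name == "right":
            if i <= s:
                i += 1
        elif name == "replace":
            if i <= s:
                z[i - 1] = c
                i += 1
        elif name == "delete":
            if i <= s:
                del z[i - 1]  # cursor position does not change
        else:  # insert
            z.insert(i - 1, c)
            i += 1
        # The invariant 1 <= i <= s + 1 must hold after every operation.
        assert 1 <= i <= len(z) + 1
    return "".join(z), total
```

For instance, apply_ops("cat", [("right", None), ("replace", "u")]) returns ("cut", 4): the cursor skips "c" for free and then replaces "a" by "u" at cost 4. Because z is a Python list here, insert and delete shift elements and take time linear in the length of z.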

exhibits overlapping subproblems.

(e) Describe a dynamic-programming algorithm that computes the edit distance from x[1..m] to y[1..n]. (Do not use a memoized recursive algorithm. Your algorithm should be a classical, bottom-up, tabular algorithm.) Analyze the running time and space requirements of your algorithm.

(f) Implement your algorithm as a computer program in any language you wish. Your program should calculate the edit distance d(x, y) between two strings x and y using dynamic programming and print out the corresponding sequence of transformation operations in the style of Table 1. Run your program on the strings x = “electrical engineering”, y = “computer science”. Sample input and output text is provided in the data set to help you debug your program. These solutions are not necessarily unique: there may be other sequences of transformation operations that achieve the same cost. As usual, you may collaborate to solve this problem, but you must write the program by yourself.

(g) Run your program on the three input files provided. Each input file contains the following four lines:

The number of characters m in the string x.
The string x.
The number of characters n in the string y.
The string y.

Compute the edit distance d(x, y) for each input. Do not hand in a printout of the transformation operations for this problem part. (Extra bonus kudos if you can identify the source of all the texts, without searching the web.)

(h) If z is implemented using an array, then the “insert” and “delete” operations require Θ(n) time. Design a suitable data structure that allows each of the five transformation operations to be implemented in O(1) time.
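As a rough illustration of the bottom-up, tabular style that part (e) asks for, here is a minimal Python sketch. It is not the assignment's reference solution. It assumes that, because cursor moves cost 0, the problem reduces to the standard weighted edit-distance recurrence with replace = 4, delete = 2, insert = 3, and a free "right" over matching characters. The function name edit_distance, the table name d, and the (name, character) encoding of operations are hypothetical choices made here.

```python
REPLACE, DELETE, INSERT = 4, 2, 3  # costs from the table of operations


def edit_distance(x, y):
    """Return (cost, ops) for turning x into y, filling the table bottom-up.

    d[j][k] is the minimum cost of turning the prefix x[:j] into y[:k].
    ops is one optimal left-to-right operation sequence (not necessarily
    the only one), with the cursor assumed to start at position 1.
    """
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, m + 1):
        d[j][0] = j * DELETE          # delete every character of x[:j]
    for k in range(1, n + 1):
        d[0][k] = k * INSERT          # insert every character of y[:k]

    def diag(j, k):
        # Cost of aligning x[j-1] with y[k-1]: free if equal, else replace.
        return 0 if x[j - 1] == y[k - 1] else REPLACE

    for j in range(1, m + 1):
        for k in range(1, n + 1):
            d[j][k] = min(
                d[j - 1][k - 1] + diag(j, k),
                d[j - 1][k] + DELETE,
                d[j][k - 1] + INSERT,
            )

    # Walk back from d[m][n] to recover one optimal operation sequence.
    ops = []
    j, k = m, n
    while j > 0 or k > 0:
        if j > 0 and k > 0 and d[j][k] == d[j - 1][k - 1] + diag(j, k):
            if x[j - 1] == y[k - 1]:
                ops.append(("right", None))
            else:
                ops.append(("replace", y[k - 1]))
            j, k = j - 1, k - 1
        elif j > 0 and d[j][k] == d[j - 1][k] + DELETE:
            ops.append(("delete", None))
            j -= 1
        else:
            ops.append(("insert", y[k - 1]))
            k -= 1
    ops.reverse()
    return d[m][n], ops
```

Under these assumptions, edit_distance("cat", "cut") returns a cost of 4 with the sequence right, replace u, right. The table has (m + 1)(n + 1) entries and each is filled in constant time, so this sketch uses Θ(mn) time and Θ(mn) space; keeping the full table is what makes the traceback possible.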
