Docsity

Edit Distance and its Applications, Assignments of Computer Programming

Introduces the concept of edit distance and its use in measuring the similarity between two text strings. It also describes the five transformation operations and their associated costs, which are used to find the minimum-cost sequence of operations that transforms one string into another. Applications of edit distance in spelling correction and in characterizing the similarity of DNA or protein sequences are also discussed.

Typology: Assignments

2022/2023

Available from 07/27/2023

dennis-durfort 🇺🇸

Programming Assignment: Edit Distance
Target: Many word processors and keyword search engines have a
spelling correction feature. If you type in a misspelled word x, the
word processor or search engine can suggest a correction y. The
correction y should be a word that is close to x. One way to measure
the similarity in spelling between two text strings is by “edit
distance”. The notion of edit distance is useful in other fields as well.
For example, biologists use edit distance to characterize the
similarity of DNA or protein sequences.
Background: The edit distance d(x, y) of two strings of text, x[1..m]
and y[1..n], is defined to be the minimum possible cost of a
sequence of “transformation operations” (defined below) that
transforms string x[1..m] into string y[1..n]. To define the effect of
the transformation operations, we use an auxiliary string z[1..s] that
holds the intermediate results. At the beginning of the
transformation sequence, s = m and z[1..s] = x[1..m] (i.e., we start
with string x[1..m]). At the end of the transformation sequence, we
should have s = n and z[1..s] = y[1..n] (i.e., our goal is to transform
z into string y[1..n]). Throughout the transformation, we maintain the
current length s of string z, as well as a cursor position i, i.e., an
index into string z. The invariant 1 ≤ i ≤ s + 1 holds at all times
during the transformation. (Notice that the cursor can move one
space beyond the end of the string z in order to allow insertion
at the end of the string.)
Each transformation operation may alter the string z, the size s, and
the cursor position i. Each transformation operation also has an
associated cost. The cost of a sequence of transformation operations
is the sum of the costs of the individual operations in the sequence.
The goal of the edit-distance problem is to find a sequence of
transformation operations of minimum cost that transforms x[1..m]
into y[1..n].
There are five transformation operations:

Operation  Cost  Effect
left       0     If i = 1 then do nothing. Otherwise, set i ← i − 1.
right      0     If i = s + 1 then do nothing. Otherwise, set i ← i + 1.
replace    4     If i = s + 1 then do nothing. Otherwise, replace the character under the cursor by another character c by setting z[i] ← c, and then incrementing i.
delete     2     If i = s + 1 then do nothing. Otherwise, delete the character c under the cursor by setting z[i..s] ← z[i+1..s+1] and decrementing s. The cursor position i does not change.
insert     3     Insert the character c into string z by incrementing s, setting z[i+1..s] ← z[i..s−1], setting z[i] ← c, and then incrementing index i.

As an example, one way to transform the source string algorithm to the target string
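The five operations defined above can be made concrete with a small simulator. The following is a minimal Python sketch, not part of the assignment: the names `COSTS` and `apply_ops` are hypothetical, and it assumes that an operation is charged its cost even when it does nothing (for example, a replace issued at i = s + 1).

```python
# Costs of the five operations as given in the assignment text.
COSTS = {"left": 0, "right": 0, "replace": 4, "delete": 2, "insert": 3}


def apply_ops(x, ops):
    """Apply a list of (name, char) operations to x; return (result, total cost).

    `char` is only used by replace and insert; pass None otherwise.
    Assumption: a no-op (e.g. replace at i = s + 1) is still charged.
    """
    z = list(x)   # working string; its length plays the role of s
    i = 1         # 1-based cursor, as in the assignment
    total = 0
    for name, c in ops:
        s = len(z)
        total += COSTS[name]          # KeyError on an unknown operation name
        if name == "left":
            if i > 1:
                i -= 1
        elif name == "right":
            if i <= s:
                i += 1
        elif name == "replace":
            if i <= s:
                z[i - 1] = c
                i += 1
        elif name == "delete":
            if i <= s:
                del z[i - 1]          # cursor position does not change
        elif name == "insert":
            z.insert(i - 1, c)        # shifts z[i..s] one place to the right
            i += 1
        # The invariant 1 <= i <= s + 1 must hold after every operation.
        assert 1 <= i <= len(z) + 1
    return "".join(z), total


result, cost = apply_ops(
    "cat",
    [("replace", "b"), ("right", None), ("right", None), ("insert", "s")],
)
print(result, cost)  # bats 7
```

The list-based `z` mirrors the array representation, so `insert` and `delete` here take time linear in the string length; the sketch is meant for checking operation sequences, not for efficiency.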

exhibits overlapping subproblems.
(e) Describe a dynamic-programming algorithm that computes the edit distance from x[1..m] to y[1..n]. (Do not use a memoized recursive algorithm. Your algorithm should be a classical, bottom-up, tabular algorithm.) Analyze the running time and space requirements of your algorithm.
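One possible shape for such a bottom-up table is sketched below in Python. It is an illustration under stated assumptions, not the assignment's reference solution: because left and right cost 0, the sketch assumes the cursor can always be positioned for free, so the problem reduces to a weighted edit distance with replace = 4, delete = 2, insert = 3, and a free "right" over matching characters. The function name `edit_distance` and the cost constants are hypothetical.

```python
# Operation costs from the assignment; cursor moves (left/right) are free.
COST_REPLACE, COST_DELETE, COST_INSERT = 4, 2, 3


def edit_distance(x, y):
    """Return (distance, ops) where ops is one minimum-cost operation list."""
    m, n = len(x), len(y)
    # d[i][j] = minimum cost of transforming x[:i] into y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * COST_DELETE          # delete every character of x[:i]
    for j in range(1, n + 1):
        d[0][j] = j * COST_INSERT          # insert every character of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = d[i - 1][j - 1] + (
                0 if x[i - 1] == y[j - 1] else COST_REPLACE
            )
            d[i][j] = min(
                diagonal,
                d[i - 1][j] + COST_DELETE,
                d[i][j - 1] + COST_INSERT,
            )

    # Walk back from d[m][n] to recover one optimal sequence.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1] and d[i][j] == d[i - 1][j - 1]:
            ops.append(("right", None))    # keep the character, move past it
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + COST_REPLACE:
            ops.append(("replace", y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + COST_DELETE:
            ops.append(("delete", None))
            i -= 1
        else:
            ops.append(("insert", y[j - 1]))
            j -= 1
    ops.reverse()                          # left-to-right order, cursor starts at 1
    return d[m][n], ops


distance, ops = edit_distance("ab", "ba")
print(distance)  # 5
```

Filling the table takes Θ(mn) time and Θ(mn) space in this form; the walk back adds O(m + n). For "ab" to "ba" the sketch prefers one insert plus one delete (cost 5) over two replaces (cost 8), which shows why all three choices have to be compared in every cell.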

(f) Implement your algorithm as a computer program in any language you wish. Your program should calculate the edit distance d(x, y) between two strings x and y using dynamic programming and print out the corresponding sequence of transformation operations in the style of Table 1. Run your program on the strings x = “electrical engineering”, y = “computer science”. Sample input and output text is provided in the data set to help you debug your program. These solutions are not necessarily unique: there may be other sequences of transformation operations that achieve the same cost. As usual, you may collaborate to solve this problem, but you must write the program by yourself.
(g) Run your program on the three input files provided. Each input file contains the following four lines:

  1. The number of characters m in the string x.
  2. The string x.
  3. The number of characters n in the string y.
  4. The string y.
Compute the edit distance d(x, y) for each input. Do not hand in a printout of the transformation operations for this problem part. (Extra bonus kudos if you can identify the source of all the texts, without searching the web.)
(h) If z is implemented using an array, then the “insert” and “delete” operations require Θ(n) time. Design a suitable data structure that allows each of the five transformation operations to be implemented in O(1) time.
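One candidate structure for part (h) is a pair of stacks split at the cursor: one stack holds the characters before the cursor, the other holds the characters from the cursor onward with z[i] on top. The Python sketch below is an illustration only; the class name `CursorString` and its method names are hypothetical, and it relies on Python's `list.append` and `list.pop` being amortized O(1) (a doubly linked list with a cursor pointer would give worst-case O(1) instead).

```python
class CursorString:
    """String with a cursor, stored as two stacks split at the cursor."""

    def __init__(self, x):
        self.before = []                  # z[1..i-1], in order
        self.after = list(reversed(x))    # z[i..s], with z[i] on top of the stack

    @property
    def cursor(self):
        return len(self.before) + 1       # 1-based cursor position i

    def left(self):
        if self.before:                   # no-op when i = 1
            self.after.append(self.before.pop())

    def right(self):
        if self.after:                    # no-op when i = s + 1
            self.before.append(self.after.pop())

    def replace(self, c):
        if self.after:                    # no-op when i = s + 1
            self.after.pop()
            self.before.append(c)         # cursor moves past the new character

    def delete(self):
        if self.after:                    # no-op when i = s + 1
            self.after.pop()              # cursor position does not change

    def insert(self, c):
        self.before.append(c)             # new character lands just before the cursor

    def text(self):
        return "".join(self.before) + "".join(reversed(self.after))


z = CursorString("cat")
z.replace("b")
z.right()
z.right()
z.insert("s")
print(z.text(), z.cursor)  # bats 5
```

Each operation touches only the tops of the two stacks, so no characters are shifted; `text()` is linear, but it is only needed for printing the final result.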