Math/Stats/BI 548, Fall 2005:

Computations in Biological Sequence Analysis

D. Burns and J. DeWet

Laboratory Worksheet, Monday, Oct. 10.

I. HMM Viterbi Algorithm. This is just to finish off what was started in class. In the coursetools you should find a file of Matlab scripts and data. First, log in to the Mac computers in the plaza-level lab (which has access to Matlab) and open Matlab from the Applications directory. Then download the Matlab scripts onto your desktop, and download the Kevin Murphy toolbox file from the course Web Resources page (it is the last entry); you will have to follow some links there. Then, in Matlab, open Set Path from the File menu. I will explain this in class; there is a subtlety in that you cannot save the path to the Matlab directory, but you can use it for this session on your desktop.

When this is sorted out, load dicedata.mat into the Matlab workspace. I will show you how to do this. Locate the variables in the workspace. We will first use the command dataOL.m to convert the data string from dicedata into a 2 x 300 matrix of observed likelihoods. Then use this as part of the input to viterbi_path to learn the Viterbi decoding of the HMM.
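The course's dataOL.m and Murphy's viterbi_path do the real work here, but the algorithm itself is short. The following sketch (in Python rather than Matlab, and using the standard textbook dishonest-casino parameters, which may differ from those behind dicedata.mat) shows what a Viterbi decoder computes from a matrix of observed likelihoods:

```python
import numpy as np

def viterbi(obs_lik, A, pi):
    """Most probable state path from per-step observation likelihoods.
    obs_lik[i, t] = P(observation at time t | state i) -- the same shape
    as the 2 x 300 matrix that dataOL.m produces."""
    n, T = obs_lik.shape
    with np.errstate(divide="ignore"):          # log(0) -> -inf is fine here
        logA, logB, logpi = np.log(A), np.log(obs_lik), np.log(pi)
    delta = np.zeros((T, n))                    # best log-prob of a path ending in each state
    back = np.zeros((T, n), dtype=int)          # argmax predecessors for traceback
    delta[0] = logpi + logB[:, 0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j]: arrive in state j from state i
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):              # trace the best path backwards
        path[t] = back[t + 1, path[t + 1]]
    return path

# Textbook dishonest-casino parameters (an assumption; dicedata.mat may use others).
A = np.array([[0.95, 0.05], [0.10, 0.90]])      # state 0 = fair die, 1 = loaded die
B = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])    # emission probs for faces 1..6
pi = np.array([0.5, 0.5])
rolls = np.array([1, 6, 6, 6, 2, 6, 6, 6]) - 1  # observed faces, 0-based
obs_lik = B[:, rolls]                           # 2 x T observed-likelihood matrix
states = viterbi(obs_lik, A, pi)
```

A run of sixes decodes as the loaded state, which is exactly the behavior you should see on the 2 x 300 matrix in the lab.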

II. Training Exercise. This time let us assume we do not know the parameters of the HMM. We want to create data to train the HMM, i.e., to find the HMM's probability parameters from data. This is done by the script casinorandomizer.m. Open this function file and read what the inputs are. Now create a 10 x 300 matrix of random data with the Markov parameters we knew from the original dishonest-casino problem. Yes, this is a bit circular, strictly speaking, but the idea is to rediscover these parameters with the Baum-Welch (expectation-maximization) method. We will use the function dhmm_em.m from the HMM toolbox.
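I am not reproducing casinorandomizer.m's exact interface here, but a data generator of the kind it implements can be sketched as follows (Python rather than Matlab; the transition and emission values are the standard textbook ones and are assumptions):

```python
import numpy as np

def simulate_casino(n_seqs=10, seq_len=300, seed=0):
    """Generate dishonest-casino roll sequences -- a stand-in for
    casinorandomizer.m, whose actual inputs and outputs may differ."""
    rng = np.random.default_rng(seed)
    A = np.array([[0.95, 0.05], [0.10, 0.90]])   # fair <-> loaded transitions (textbook values)
    B = np.array([[1/6] * 6, [0.1] * 5 + [0.5]]) # emission probs for faces 1..6
    pi = np.array([0.5, 0.5])
    rolls = np.zeros((n_seqs, seq_len), dtype=int)
    for s in range(n_seqs):
        state = rng.choice(2, p=pi)
        for t in range(seq_len):
            rolls[s, t] = rng.choice(6, p=B[state])  # 0-based face index
            state = rng.choice(2, p=A[state])
    return rolls

rolls = simulate_casino()   # a 10 x 300 matrix, as asked for above
```

Because the loaded die favors sixes, the overall fraction of sixes in the output is noticeably above the fair-die value of 1/6, which is the signal the training algorithm has to rediscover.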

As a write-up for this week, please copy from the screen your best approximation to the parameters we used to generate the data, as learned by the training algorithm dhmm_em. What adjustments seemed to help or harm your getting this result? That is, did changing the maximum number of iterations help? Did generating more data help? Did insisting on a more stringent threshold for the change in log-likelihood (LL) from one iteration to the next help?
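To make the knobs in these questions concrete, here is a minimal Baum-Welch loop (a Python sketch of the algorithm, not Murphy's dhmm_em implementation) with both stopping criteria exposed: a cap on the number of iterations and a threshold on the change in LL between iterations:

```python
import numpy as np

def baum_welch(seqs, n_states, n_syms, max_iter=50, tol=1e-4, seed=0):
    """Minimal Baum-Welch (EM) for a discrete HMM.  `tol` is the threshold
    on the LL change between iterations; `max_iter` caps the EM iterations."""
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)
    B = rng.random((n_states, n_syms));   B /= B.sum(1, keepdims=True)
    pi = np.full(n_states, 1 / n_states)
    prev_ll = -np.inf
    for _ in range(max_iter):
        ll = 0.0
        A_num = np.zeros_like(A); B_num = np.zeros_like(B)
        pi_num = np.zeros(n_states)
        for seq in seqs:
            T = len(seq)
            # scaled forward pass; c[t] = P(obs_t | obs_1..t-1)
            alpha = np.zeros((T, n_states)); c = np.zeros(T)
            alpha[0] = pi * B[:, seq[0]]
            c[0] = alpha[0].sum(); alpha[0] /= c[0]
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * B[:, seq[t]]
                c[t] = alpha[t].sum(); alpha[t] /= c[t]
            ll += np.log(c).sum()
            # scaled backward pass
            beta = np.ones((T, n_states))
            for t in range(T - 2, -1, -1):
                beta[t] = (A @ (B[:, seq[t + 1]] * beta[t + 1])) / c[t + 1]
            gamma = alpha * beta                 # posterior state probabilities
            pi_num += gamma[0]
            for t in range(T - 1):               # expected transition counts
                A_num += np.outer(alpha[t], B[:, seq[t + 1]] * beta[t + 1]) * A / c[t + 1]
            for t in range(T):                   # expected emission counts
                B_num[:, seq[t]] += gamma[t]
        # M-step: renormalize the expected counts
        pi = pi_num / pi_num.sum()
        A = A_num / A_num.sum(1, keepdims=True)
        B = B_num / B_num.sum(1, keepdims=True)
        if ll - prev_ll < tol:                   # the LL-change stopping rule
            break
        prev_ll = ll
    return A, B, pi, ll
```

EM guarantees the LL never decreases between iterations, so a tighter `tol` trades extra iterations for a better local optimum; more training data tightens the expected counts themselves.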

III. p-values and Pairwise Sequence Alignment. We have to transfer back to the 2036 PC lab for this one, because the USC alignment package is mounted in “our” laboratory (and not in the UM IT lab on the 3rd floor). Go back to the exercise to compare E. coli tRNAs against the 16S subunit of the ribosome. From the 548 Resources page, you can download the data files ECORRD and EctRNAdata. You will have to use the function pvlocal from the command line on the Linux-based lab computers. I will hopefully be able to mount the results of this comparison from an older paper of Waterman’s. Be sure to do the comparison involving the tRNA for cysteine.

Since we have a lab day knocked out by Fall Break this year, we will probably try to do this example in class before the (distant!) next lab day. I have attached two pages from the paper “Hearing Distant Echoes” by Michael Waterman, from Calculating the Secrets of Life, E. Lander and M. Waterman, eds., NAS Press, 1995. It shows an analysis (just the data) of a pairwise comparison between E. coli 16S ribosomal RNA and the various tRNAs for the bug. The point is the significance column. The second figure uses a more accurate estimation of the significance. Unfortunately, it is given in standard deviations and not as a straight p-value. The p-value for cysteine’s σ = 6.2 is about 10^-3.
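The reason this conversion is not the familiar normal-table one is that optimal local alignment scores follow an extreme-value (Gumbel) distribution, whose right tail is much heavier than a normal's. A rough illustration (Python; this uses a standard Gumbel fit purely for comparison and is not the calibration used in Waterman's paper):

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def gumbel_tail_at_sd(k):
    """P(X > mean + k*sd) for a Gumbel(0, 1) variable.
    Gumbel(0, 1) has mean = Euler-Mascheroni gamma and sd = pi/sqrt(6)."""
    mean = 0.5772156649
    sd = math.pi / math.sqrt(6)
    x = mean + k * sd
    return 1.0 - math.exp(-math.exp(-x))    # Gumbel survival function
```

Here gumbel_tail_at_sd(6.2) comes out on the order of 10^-4, several orders of magnitude larger than normal_tail(6.2), which is below 10^-9 -- so reading σ = 6.2 off a normal table would badly overstate the significance.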