Math/Stats/BI 548, Fall 2005:

Computations in Biological Sequence Analysis

D. Burns and J. DeWet

Laboratory Worksheet, Monday, Oct. 10.

I. HMM Viterbi Algorithm. This is just to finish off what was started in class. In the coursetools you should find a file of Matlab scripts and data. First, log in to the Mac computers in the plaza-level lab (which has access to Matlab) and open Matlab from the Applications directory. Then download the Matlab scripts onto your desktop, and download the Kevin Murphy toolbox file from the course Web Resources page (it is the last entry); you will have to follow some links there. Then, in Matlab, open Set Path from the File menu. I will explain this in class; there is a subtlety in that you cannot save the path to the Matlab directory, but you can use it for this session on your desktop.

When this is sorted out, load dicedata.mat into the Matlab workspace. I will show you how to do this. Locate the variables in the workspace. We will first use the command dataOL.m to convert the data string from dicedata into a 2 x 300 matrix of observed likelihoods. Then use this as part of the input to viterbi_path to learn the Viterbi decoding of the HMM.
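The course's dataOL.m and Murphy's viterbi_path do the real work here, but the algorithm itself is short. The following sketch (in Python rather than Matlab, and using the standard textbook dishonest-casino parameters, which may differ from those behind dicedata.mat) shows what a Viterbi decoder computes from a matrix of observed likelihoods:

```python
import numpy as np

def viterbi(obs_lik, A, pi):
    """Most probable state path from per-step observation likelihoods.
    obs_lik[i, t] = P(observation at time t | state i) -- the same shape
    as the 2 x 300 matrix that dataOL.m produces."""
    n, T = obs_lik.shape
    with np.errstate(divide="ignore"):          # log(0) -> -inf is fine here
        logA, logB, logpi = np.log(A), np.log(obs_lik), np.log(pi)
    delta = np.zeros((T, n))                    # best log-prob of a path ending in each state
    back = np.zeros((T, n), dtype=int)          # argmax predecessors for traceback
    delta[0] = logpi + logB[:, 0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j]: arrive in state j from state i
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):              # trace the best path backwards
        path[t] = back[t + 1, path[t + 1]]
    return path

# Textbook dishonest-casino parameters (an assumption; dicedata.mat may use others).
A = np.array([[0.95, 0.05], [0.10, 0.90]])      # state 0 = fair die, 1 = loaded die
B = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])    # emission probs for faces 1..6
pi = np.array([0.5, 0.5])
rolls = np.array([1, 6, 6, 6, 2, 6, 6, 6]) - 1  # observed faces, 0-based
obs_lik = B[:, rolls]                           # 2 x T observed-likelihood matrix
states = viterbi(obs_lik, A, pi)
```

A run of sixes decodes as the loaded state, which is exactly the behavior you should see on the 2 x 300 matrix in the lab.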

II. Training Exercise. This time let us assume we do not know the parameters of the HMM. We want to create data to train the HMM, i.e., to find the HMM's probability parameters from data. This is done by the script casinorandomizer.m. Open this function file and read what the inputs are. Now create a 10 x 300 matrix of random data with the Markov parameters we knew from the original dishonest-casino problem. Yes, this is a bit circular, strictly speaking, but the idea is to rediscover these parameters with the Baum-Welch (expectation-maximization) method. We will use the function dhmm_em.m from the HMM toolbox.
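I am not reproducing casinorandomizer.m's exact interface here, but a data generator of the kind it implements can be sketched as follows (Python rather than Matlab; the transition and emission values are the standard textbook ones and are assumptions):

```python
import numpy as np

def simulate_casino(n_seqs=10, seq_len=300, seed=0):
    """Generate dishonest-casino roll sequences -- a stand-in for
    casinorandomizer.m, whose actual inputs and outputs may differ."""
    rng = np.random.default_rng(seed)
    A = np.array([[0.95, 0.05], [0.10, 0.90]])   # fair <-> loaded transitions (textbook values)
    B = np.array([[1/6] * 6, [0.1] * 5 + [0.5]]) # emission probs for faces 1..6
    pi = np.array([0.5, 0.5])
    rolls = np.zeros((n_seqs, seq_len), dtype=int)
    for s in range(n_seqs):
        state = rng.choice(2, p=pi)
        for t in range(seq_len):
            rolls[s, t] = rng.choice(6, p=B[state])  # 0-based face index
            state = rng.choice(2, p=A[state])
    return rolls

rolls = simulate_casino()   # a 10 x 300 matrix, as asked for above
```

Because the loaded die favors sixes, the overall fraction of sixes in the output is noticeably above the fair-die value of 1/6, which is the signal the training algorithm has to rediscover.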

As a write-up for this week, please copy from the screen your best approximation to the parameters we used to generate the data, as learned by the training algorithm dhmm_em. What adjustments seemed to help or harm your getting this result? That is, did changing the maximum number of iterations help? Did generating more data help? Did insisting on a more stringent threshold for the change in log-likelihood (LL) from one iteration to the next help?
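To make the knobs in these questions concrete, here is a minimal Baum-Welch loop (a Python sketch of the algorithm, not Murphy's dhmm_em implementation) with both stopping criteria exposed: a cap on the number of iterations and a threshold on the change in LL between iterations:

```python
import numpy as np

def baum_welch(seqs, n_states, n_syms, max_iter=50, tol=1e-4, seed=0):
    """Minimal Baum-Welch (EM) for a discrete HMM.  `tol` is the threshold
    on the LL change between iterations; `max_iter` caps the EM iterations."""
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)
    B = rng.random((n_states, n_syms));   B /= B.sum(1, keepdims=True)
    pi = np.full(n_states, 1 / n_states)
    prev_ll = -np.inf
    for _ in range(max_iter):
        ll = 0.0
        A_num = np.zeros_like(A); B_num = np.zeros_like(B)
        pi_num = np.zeros(n_states)
        for seq in seqs:
            T = len(seq)
            # scaled forward pass; c[t] = P(obs_t | obs_1..t-1)
            alpha = np.zeros((T, n_states)); c = np.zeros(T)
            alpha[0] = pi * B[:, seq[0]]
            c[0] = alpha[0].sum(); alpha[0] /= c[0]
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * B[:, seq[t]]
                c[t] = alpha[t].sum(); alpha[t] /= c[t]
            ll += np.log(c).sum()
            # scaled backward pass
            beta = np.ones((T, n_states))
            for t in range(T - 2, -1, -1):
                beta[t] = (A @ (B[:, seq[t + 1]] * beta[t + 1])) / c[t + 1]
            gamma = alpha * beta                 # posterior state probabilities
            pi_num += gamma[0]
            for t in range(T - 1):               # expected transition counts
                A_num += np.outer(alpha[t], B[:, seq[t + 1]] * beta[t + 1]) * A / c[t + 1]
            for t in range(T):                   # expected emission counts
                B_num[:, seq[t]] += gamma[t]
        # M-step: renormalize the expected counts
        pi = pi_num / pi_num.sum()
        A = A_num / A_num.sum(1, keepdims=True)
        B = B_num / B_num.sum(1, keepdims=True)
        if ll - prev_ll < tol:                   # the LL-change stopping rule
            break
        prev_ll = ll
    return A, B, pi, ll
```

EM guarantees the LL never decreases between iterations, so a tighter `tol` trades extra iterations for a better local optimum; more training data tightens the expected counts themselves.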

III. p-values and Pairwise Sequence Alignment. We have to transfer back to the 2036 PC lab for this one, because the USC alignment package is mounted in “our” laboratory (and not in the UM IT lab on the 3rd floor). Go back to the exercise to compare E. coli tRNAs against the 16S subunit of the ribosome. From the 548 Resources page, you can download the data files ECORRD and EctRNAdata. You will have to use the function pvlocal from the command line on the Linux-based lab computers. I will hopefully be able to mount the results of this comparison from an older paper of Waterman’s. Be sure to do the comparison involving the tRNA for cysteine.

Since we have a lab day knocked out by Fall Break this year, we will probably try to do this example in class before the (distant!) next lab day. I have attached two pages from the paper “Hearing Distant Echoes” by Michael Waterman, from Calculating the Secrets of Life, E. Lander and M. Waterman, eds., NAS Press, 1995. It shows an analysis (just the data) of a pairwise comparison between E. coli 16S ribosomal RNA and the various tRNAs for the bug. The point is the significance column. The second figure uses a more accurate estimation of the significance. Unfortunately, it is given in standard deviations and not as a straight p-value. The p-value for cysteine’s σ = 6.2 is about 10^-3.
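The reason this conversion is not the familiar normal-table one is that optimal local alignment scores follow an extreme-value (Gumbel) distribution, whose right tail is much heavier than a normal's. A rough illustration (Python; this uses a standard Gumbel fit purely for comparison and is not the calibration used in Waterman's paper):

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def gumbel_tail_at_sd(k):
    """P(X > mean + k*sd) for a Gumbel(0, 1) variable.
    Gumbel(0, 1) has mean = Euler-Mascheroni gamma and sd = pi/sqrt(6)."""
    mean = 0.5772156649
    sd = math.pi / math.sqrt(6)
    x = mean + k * sd
    return 1.0 - math.exp(-math.exp(-x))    # Gumbel survival function
```

Here gumbel_tail_at_sd(6.2) comes out on the order of 10^-4, several orders of magnitude larger than normal_tail(6.2), which is below 10^-9 -- so reading σ = 6.2 off a normal table would badly overstate the significance.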