




Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Community
Ask the community for help and clear up your study doubts
Discover the best universities in your country according to Docsity users
Free resources
Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors
Sample Files,Pit Fall,Simulation Script With Excercises and Solutions.
Typology: Exercises
1 / 8
This page cannot be seen from the preview
Don't miss anything!
with Case Studies in Deep Sequencing Data Analysis
$ /tmp/bootcamp/exercise1.sh
You have 30 samples and you treated each of them in three di erent conditions: untreated, overexpressed, underexpressed. After your initial analysis, you collected all your les in one folder. You named the les as condition.$i.bam and condition.$i.bed where condition 2 funtreated; overexpressed; underexpressedg and i 2 f 1 ; 2 ; : : : ; 30 g. Now you want to or- ganize this folder according to condition name. You want to put all les that belong to the condition untreated in the folder untreated, all les that belong to the condition overex- pressed in the folder overexpressed and so on. How can you do this without typing the move command for each le one-by-one? Hint: Take a look at shell wild cards. For this case "*" will be useful for you.
Solution: We assume that you are in the exercise 1 files directory. Then, rst, create the sub- directories for the conditions.
$ mkdir untreated overexpressed underexpressed
Now move the les that belong to the untreated condition. The wildcard "" stands for any number of any characters. To match all bam and bed les coming from the untreated condition, we can use untreated.. All le names in that condiotion start with untreated and have a ".". Hence untreated.* matches untreated2.bam, untreated5.bed and etc. So
$ mv untreated. untreated/
will move the les that belong to the condition untreated. For the remaining libraries, using the same method, we can move the les using the commands
$ mv overexpressed. overexpressed/
and
$ mv underexpressed. underexpressed/
Pitfall: A command like
$ mv untreated* untreated/
doesn't work in this case. The pattern untreated* matches "untreated" as well. So the above command will try to move the directory untreated* on itself. This will cause an error.
$ bkill JOB ID 1 JOB ID 2 ... JOB ID n
where JOB ID i is the identi cation number of the corresponding job to be cancelled. How can you cancel all of your tophat jobs without typing the job identi cation number of each individual job? For this solution, assume that
$ bkill JOB ID 1 JOB ID 2 ... JOB ID n
is the only way to cancel jobs though there are alternative and easier ways. Simulation Script: There is a simulation script for this question. Run the script /tmp/bootcamp/exercise4.sh to start a set of tophat, bwa and fastqc jobs.
$ /tmp/bootcamp/exercise4.sh
These jobs will be alive for 10 minutes. If you are not done after ten minutes, run the script again. They will actually be dummy jobs but their names will be relevant to this exercise. Hint 1: Use pipes "j" and awk. Hint 2: Once you gure out how to get the job id's of tophat jobs, give the job id list to bkill.
Solution: We can list all our tophat jobs using grep
$ bjobs | grep tophat
Note that we can make sure that we get the correct job ids by
$ bjobs | grep tophat | less
In the resulting list, we need the rst column. We get this by
$ bjobs | grep tophat | awk 'fprint $ 1 g'
Now, we are ready to cancel our tophat jobs.
$ bkill bjobs | grep tophat | awk 'fprint $ 1 g'
Note that we need to put `around our previous command to provide it as input to bkill.
Solution: If you use grep with the -v option, you will get all non-matching entries. So
$ bkill bjobs | grep -v tophat | grep -v fastqc
will cancel all jobs that are neither tophat jobs nor fastqc jobs.
Solution: The command
$ grep chr
will not work because it will also bring all entries from chr11. The reason is that the string chr1 is a substring of chr11. To get the entries that are from chr1 only, we use awk as follows. Compare the rst column of the bed le to the string "chr1". Print only the rows for which the equality holds.
$ awk 'f if($1 == "chr1")print($0) g' exercise6.bed
Note that, in awk, by default, the nth^ column is accessed by $n. The whole line is accessed by $ 0.
Solution: Note that the nucleotide sequence of the 22nd^ read is in the line 21 4 + 2 = 86. So, the problem is getting the 86th^ line in a fastq le. Say, our fastq le is file.fq. We can get the rst 86 lines by head -n 86 file.fq. We need the last line among these 86 lines. For this, tail -n 1 does the job. So, we pipe the output of head to tail as follows.
$ head -n 86 file.fq | tail -n 1
Solution: Say, our fastq le is file.fq. First we need all the nucleotide sequences in this le. This can be done by
$ awk 'f if( NR % 2 == 4 ) print($0) g' file.fq
We can get the rst two nucleotides using the substr function. Namely, substr($1,1,2) gives us the two characters of $1 starting at the rst position. We take only the reads whose rst two characters are TT. Hence
$ awk 'f if( NR % 2 == 4 ) print($0) g' file.fq | awk 'f if(substr($0,1,2) == "TT")print($0)g'
gives all entries that belong to the sample of interest. The function substr can be used to remove the barcodes. For this we take all nucleotides that come on and after the third position. This is done obtained by substr($0,3). So, to get the reads with barcodes removed is done by
$ awk 'f if( NR % 2 == 4 ) print($0) g' file.fq | awk 'f if(substr($0,1,2) == "TT")print(substr($0,3))g'
Solution: The strand information is found on the sixth column of a bed le. So, using awk, we can pick all reads coming from the negative strand.
$ awk 'fif($6 == "-")print($0)g' exercise6.bed
To get the entries from chr11, we can use grep. Note that grep works as there is no chromosome name such as chr110 or so.
$ awk 'fif($6 == "-")print($0)g' exercise6.bed | grep chr
We can save the output in a le named chr11 negative.bed by
$ awk 'fif($6 == "-")print($0)g' exercise6.bed | grep chr11 > chr11 negative.bed