Prepare for your exams
Get points
Guidelines and tips

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search Store documents

The best documents sold by students who completed their studies

Search through all study resources

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

University Rankings

Discover the best universities in your country according to Docsity users

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

From our blog

Exams and Study

Go to the blog

Unix Excercises with case Study in Deep Sequencing Data Analysis., Exercises of Operating Systems

United Theological Seminary (UTS)Operating Systems

Sample Files,Pit Fall,Simulation Script With Excercises and Solutions.

Typology: Exercises

2021/2022

Uploaded on 02/11/2022

geek45 🇺🇸

4.4

(10)

275 documents

1 / 8

This page cannot be seen from the preview

Don't miss anything!

Unix Exercises

with Case Studies in Deep Sequencing Data Analysis

Partial preview of the text

Download Unix Excercises with case Study in Deep Sequencing Data Analysis. and more Exercises Operating Systems in PDF only on Docsity!

Unix Exercises

with Case Studies in Deep Sequencing Data Analysis

Exercises

Sample Files: For this exercise, please run the script /tmp/bootcamp/exercise1.sh to create a sample set of les in your home folder. This will create a directory containing a set of sample les for this question.

$ /tmp/bootcamp/exercise1.sh

You have 30 samples and you treated each of them in three di erent conditions: untreated, overexpressed, underexpressed. After your initial analysis, you collected all your les in one folder. You named the les as condition.$i.bam and condition.$i.bed where condition 2 funtreated; overexpressed; underexpressedg and i 2 f 1 ; 2 ; : : : ; 30 g. Now you want to or- ganize this folder according to condition name. You want to put all les that belong to the condition untreated in the folder untreated, all les that belong to the condition overex- pressed in the folder overexpressed and so on. How can you do this without typing the move command for each le one-by-one? Hint: Take a look at shell wild cards. For this case "*" will be useful for you.

Solution: We assume that you are in the exercise 1 files directory. Then, rst, create the sub- directories for the conditions.

$ mkdir untreated overexpressed underexpressed

Now move the les that belong to the untreated condition. The wildcard "" stands for any number of any characters. To match all bam and bed les coming from the untreated condition, we can use untreated.. All le names in that condiotion start with untreated and have a ".". Hence untreated.* matches untreated2.bam, untreated5.bed and etc. So

$ mv untreated. untreated/

will move the les that belong to the condition untreated. For the remaining libraries, using the same method, we can move the les using the commands

$ mv overexpressed. overexpressed/

and

$ mv underexpressed. underexpressed/

Pitfall: A command like

$ mv untreated* untreated/

doesn't work in this case. The pattern untreated* matches "untreated" as well. So the above command will try to move the directory untreated* on itself. This will cause an error.

You have submitted a bunch of jobs to the cluster. Afterwards you accidentally submitted many tophat jobs. If the tophat jobs nish and produce output, you will run out of disk space and your previous jobs won't be able to nish successfully. So you want to cancel your tophat jobs. Each job submitted to the HPC system has a unique job identi cation number. This number is displayed on the rst column when you type bjobs. Each job has a name (not necessarily unique) and it is displayed on the seventh column of this list. Luckily you named all your tophat jobs in the form tophat sample number where sample number is an identi er of the RNASeq library being aligned. You can cancel a set of running or scheduled jobs using the bkill command followed by the job id numbers separated by white space, i.e.,

$ bkill JOB ID 1 JOB ID 2 ... JOB ID n

where JOB ID i is the identi cation number of the corresponding job to be cancelled. How can you cancel all of your tophat jobs without typing the job identi cation number of each individual job? For this solution, assume that

$ bkill JOB ID 1 JOB ID 2 ... JOB ID n

is the only way to cancel jobs though there are alternative and easier ways. Simulation Script: There is a simulation script for this question. Run the script /tmp/bootcamp/exercise4.sh to start a set of tophat, bwa and fastqc jobs.

$ /tmp/bootcamp/exercise4.sh

These jobs will be alive for 10 minutes. If you are not done after ten minutes, run the script again. They will actually be dummy jobs but their names will be relevant to this exercise. Hint 1: Use pipes "j" and awk. Hint 2: Once you gure out how to get the job id's of tophat jobs, give the job id list to bkill.

Solution: We can list all our tophat jobs using grep

$ bjobs | grep tophat

Note that we can make sure that we get the correct job ids by

$ bjobs | grep tophat | less

In the resulting list, we need the rst column. We get this by

$ bjobs | grep tophat | awk 'fprint $ 1 g'

Now, we are ready to cancel our tophat jobs.

$ bkill bjobs | grep tophat | awk 'fprint $ 1 g'

Note that we need to put `around our previous command to provide it as input to bkill.

In the previous exercise, suppose that you have some other jobs having the name fastqc i, where i is the sample identi er. (You will see this if you run the simulation script mentioned in the rst exercise.) How can you cancel all running or scheduled jobs except for the tophat jobs and the fastqc jobs?

Solution: If you use grep with the -v option, you will get all non-matching entries. So

$ bkill bjobs | grep -v tophat | grep -v fastqc

will cancel all jobs that are neither tophat jobs nor fastqc jobs.

You have a bed le where the chromosome the entry is coming from, is given in the rst column. The chromosomes are labeled as chr1, chr2, ..., chr22, chrX, chrY. How can you extract the entries that only come from chromosome 1? Warning: A simple grep chr1 statement will not work. Can you tell why? Hint: Use awk. Example File: /tmp/bootcamp/exercise6.bed Note: Detailed information on bed le format can be found in http://genome.ucsc.edu/FAQ/FAQformat.html#format

Solution: The command

$ grep chr

will not work because it will also bring all entries from chr11. The reason is that the string chr1 is a substring of chr11. To get the entries that are from chr1 only, we use awk as follows. Compare the rst column of the bed le to the string "chr1". Print only the rows for which the equality holds.

$ awk 'f if($1 == "chr1")print($0) g' exercise6.bed

Note that, in awk, by default, the nth^ column is accessed by $n. The whole line is accessed by $ 0.

Recall that each read in a fastq le spans four consecutive lines. The nucleotide sequence of the read is given in the second line (out of these four lines). How can you obtain the nucleotide sequence of the 22nd^ read in a given fastq le? Hint: You need head and tail programs with the -n option. A quick internet search will provide docuimentation and examples. Note that there are sample fastq le in the transcrip- tomics folder in ~/bootcamp directory.

Solution: Note that the nucleotide sequence of the 22nd^ read is in the line 21 4 + 2 = 86. So, the problem is getting the 86th^ line in a fastq le. Say, our fastq le is file.fq. We can get the rst 86 lines by head -n 86 file.fq. We need the last line among these 86 lines. For this, tail -n 1 does the job. So, we pipe the output of head to tail as follows.

$ head -n 86 file.fq | tail -n 1

Coming from your deep sequencing experiment, you have a fastq le consisting of the reads of one lane. You mixed several libraries in that lane and barcoded the molecules to de- termine the sample of each read. In the nucleotide sequence of each individual read, the rst two nucleotides form the barcode of the read. The barcode of the sample you are interested in is TT. How can you get all nucleotide sequences that come from this sample? Once you nd these nucleotide sequences, how can you remove the barcodes from the nucleotide sequences? Hint: Use awk. There is a useful function of awk called substr. An internet search will provide you examples and documentation. Note: In practice there are software tools, such as cutadapt, that are used to remove barcodes in fastq les.

Solution: Say, our fastq le is file.fq. First we need all the nucleotide sequences in this le. This can be done by

$ awk 'f if( NR % 2 == 4 ) print($0) g' file.fq

We can get the rst two nucleotides using the substr function. Namely, substr($1,1,2) gives us the two characters of $1 starting at the rst position. We take only the reads whose rst two characters are TT. Hence

$ awk 'f if( NR % 2 == 4 ) print($0) g' file.fq | awk 'f if(substr($0,1,2) == "TT")print($0)g'

gives all entries that belong to the sample of interest. The function substr can be used to remove the barcodes. For this we take all nucleotides that come on and after the third position. This is done obtained by substr($0,3). So, to get the reads with barcodes removed is done by

$ awk 'f if( NR % 2 == 4 ) print($0) g' file.fq | awk 'f if(substr($0,1,2) == "TT")print(substr($0,3))g'

You have a bed le containing reads from both + and - strands. You want to save all reads in this bed le that come from the negative strand and chr11. How can you create another bed le that contains only reads from the - strand and chr11? Sample File: You can use the sample bed le that comes with a previous exercise. Hint: Use awk and the redirection operator ">".

Solution: The strand information is found on the sixth column of a bed le. So, using awk, we can pick all reads coming from the negative strand.

$ awk 'fif($6 == "-")print($0)g' exercise6.bed

To get the entries from chr11, we can use grep. Note that grep works as there is no chromosome name such as chr110 or so.

$ awk 'fif($6 == "-")print($0)g' exercise6.bed | grep chr

We can save the output in a le named chr11 negative.bed by

$ awk 'fif($6 == "-")print($0)g' exercise6.bed | grep chr11 > chr11 negative.bed

Unix Excercises with case Study in Deep Sequencing Data Analysis., Exercises of Operating Systems

Related documents

Partial preview of the text

Download Unix Excercises with case Study in Deep Sequencing Data Analysis. and more Exercises Operating Systems in PDF only on Docsity!

Unix Exercises

Exercises