Structural and Functional Atlas of Frameshift Variations in Human Genome, Slides of Architecture

This thesis explores the potential functional proteins produced by frameshift mutations in the human genome using protein blast tool and InterProScan. It identifies 11,313 and 10,278 polypeptide sequences resulting from +1 and +2 frameshift mutations, respectively, that are homologous to existing proteins and could potentially carry out a function. The research also discusses the challenges faced in simulating frameshift mutations and detecting their domain structures.

What you will learn

  • What are the challenges faced in simulating frameshift mutations and detecting their domain structures?
  • What is the significance of frameshift mutations in molecular biology?
  • How can frameshift mutations contribute to genetic disease?
  • How were the potential functional proteins identified?
  • What are the potential functional proteins produced by frameshift mutations in the human genome?

Typology: Slides

2021/2022

Uploaded on 09/12/2022

leonpan
Structural and functional atlas of frameshift
variation capacity in human genome
by
Nan Hu
A Thesis
Submitted to the Faculty
of the
WORCESTER POLYTECHNIC INSTITUTE
in partial fulfillment of the requirements for the
Degree of Master of Science
in
Bioinformatics and computational biology
May 2018
APPROVED:
Professor Dmitry Korkin, Thesis Advisor
Professor Elizabeth F. Ryder, Thesis Reader
Professor Dmitry Korkin, Head of Department

Abstract

It is currently widely accepted that frameshift mutations yield truncated and dysfunctional proteins. Because frameshift mutation products are mainly non-functional, abnormal polypeptides, they have received little attention from the standpoint of structural and functional analysis. However, recent studies have shown that frameshift proteins do have structures and can be functional. While most studies of frameshift mutation focus on the nucleotide sequence level, here we simulate and directly analyze frameshift mutation at the protein domain level. We focus on the protein domain because it is the smallest structural and functional unit of a protein. By using the protein BLAST tool to analyze the protein domains yielded by all coding gene sequences in the human genome (45,139 mRNAs), we found that 11,313 polypeptide sequences resulting from +1 frameshift mutation and 10,278 sequences resulting from +2 frameshift mutation are homologous to existing proteins and could potentially carry out a function. Moreover, for 464 and 448 frameshift products of each type, respectively, we detected at least one protein domain using the InterProScan tool. We also compared the genes in which we found frameshift-produced protein domains with the genes associated with frameshift mutations reported in the ClinVar database. The result shows that 47 genes from our set were also found to carry clinically relevant frameshift mutations. This work provides the first whole-genome view of frameshift effects on protein domain structure and function, which should offer new insights into this variation mechanism, with applications in a wide range of areas from evolutionary biology to precision medicine.

Table of Contents

  • Abstract
  • Acknowledgements
  • Table of Contents
  • List of Figures
  • List of Tables
  • 1 Background
    • 1.1 Introduction
    • 1.2 Frameshift mutation
    • 1.3 Evolution factors
    • 1.4 Genetic disorder factors
  • 2 Methods
    • 2.1 Data collection
    • 2.2 Frameshift simulation
    • 2.3 Protein Blast
    • 2.4 Domain detection by Interproscan
    • 2.5 Domain analysis
    • 2.6 Homologue superfamily analysis
    • 2.7 Gene comparison
    • 2.8 Domain structure conformation
  • 3 Results and analysis
    • 3.1 Protein blast results
    • 3.2 Domain and homologue superfamily detection
    • 3.3 Summary of simulation, blast and domain detection
    • 3.4 Domain conformation
    • 3.5 Evolution route map -- Domain level
    • 3.6 Sequence identity within the same named domains
    • 3.7 Evolution route map -- Homologue superfamily level
    • 3.8 Gene detection
    • 3.9 Protein domain structural atlas of known genes associated with frameshift disease
  • 4 Conclusion
  • 5 Discussion
  • References
  • Appendix A : Domain frameshift to a new domain
  • Appendix B : Homologue superfamily frameshift to a new Homologue superfamily

List of Figures

  • Figure 1. Schematic representation of frame-shift events with their +1 and -1 versions. [6]
  • Figure 2. Methodology workflow
  • Figure 3. Simulation design.
  • Figure 4. An example of frameshift Simulation design.
  • Figure 5. Position overlap between the original SH3 domain and frameshifted PH domain.
  • Figure 6. Quantity of frameshifted products with blast results and original ones
  • Figure 7. 1-Frameshift sequence identity
  • Figure 8. 2-Frameshift sequence identity
  • Figure 9. An example of Interproscan results.
  • Figure 10. A summary of frameshift mutated proteins
  • Figure 11. Domain conformation example NM_001224.4, the structural protein domain architecture of the original, 1-Frameshift, and 2-Frameshift products are shown above in respective order.
  • Figure 12. Domain conformation example NM_001276698.1
  • Figure 13. An example of evolution route map in domain level.
  • Figure 14. Sequence identity within the same name domains
  • Figure 15. An example of evolution route map in homologue superfamily level
  • Figure 16. Candidate genes compare with known genes associated with frameshift disease
  • Figure 17. Protein domain architecture of NM_007299.3 and its frameshift products
  • Figure 18. Protein domain architecture of NM_001276698.1 and its frameshift products
  • Figure 19. Protein domain architecture of NM_0011654146.1 and its frameshift products
  • Figure 20. The composition of a new protein

1 Background

1.1 Introduction

The research performed in this thesis belongs to the area of molecular biology. This is a critical area of research, since mutations are a contributor to evolution as well as to genetic disease. Unraveling the consequences of frameshift mutation has many benefits. First, it is well understood that beneficial point mutations accumulate and eventually contribute to species evolution [1], but whether an organism can ever benefit from a frameshift mutation remains unclear. In addition, frameshift mutations can lead to cancers, and at present only gene therapy can be used to treat such disease [2]. Studying the consequences of frameshift mutation could uncover potential drug targets and help researchers design medicines. For this reason, the researcher's role in expanding the knowledge and understanding of frameshift mutation is critical and valuable.

Owing to the recent surge in research on frameshift mutation, it is known that such a mutation partially or completely changes the DNA sequence downstream of the site where it occurs [3]. Even so, researchers still need to study how the mutation changes the transcripts and, ultimately, the protein products. In this research, we use computational methods, including mutation simulation, to study the changes in mRNA sequence resulting from frameshift mutation. Furthermore, recent research has shown that a protein-domain-centric approach is better than a gene-centric approach; that is, analyzing the frameshift mutation from a protein domain perspective rather than looking at its mRNA nucleotide sequence [4]. The benefit of the domain-centric approach is clear, because the protein domain is the functional unit of a protein. A frameshift mutation will cause a protein either to lose its function or to gain a new one. In this paper, we try to identify those frameshift mutations that lead to a protein that performs the same or a new function.

The main challenge in using these computational tools is to simulate frameshift mutation in mRNA, translate the frameshifted mRNAs into protein sequences, and finally BLAST them against known proteins and detect their domain structures. These tasks are computationally expensive, especially for novel domain structure conformations. Our results could contribute to future research on evolution and to drug design aimed at precise medicines against genetic disease.

1.2 Frameshift mutation

Frameshift mutation is a genetic mutation caused by indels (insertions or deletions) in DNA whose number of bases is not divisible by three. Because gene expression relies on codons, each consisting of three nucleotides, a frameshift mutation leads to a different reading frame and is ultimately translated into an entirely new peptide compared with the original peptide sequence. Everything after the site of the indel is partially or completely changed [5]. Based on the position of the shifted sequence relative to the original sequence, frameshift mutations can be categorized into two types, +1 and +2, which are also written as +1 and -1 in some literature (Figure 1).
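As a minimal illustration of the definition above (not code from the thesis), the following Python sketch shows why an indel whose length is not a multiple of three changes every downstream codon, while a three-base indel leaves the downstream codons intact. The toy sequence and the helper names `split_codons` and `is_frameshift` are hypothetical.

```python
def split_codons(seq):
    """Split a nucleotide string into complete codons, dropping a trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def is_frameshift(indel_length):
    """An indel shifts the reading frame when its length is not divisible by three."""
    return indel_length % 3 != 0


# Hypothetical toy sequence: ATG GCC AAG TTT
original = "ATGGCCAAGTTT"
# One inserted base after the first codon shifts every later codon.
one_inserted = original[:3] + "C" + original[3:]
# Three inserted bases add one codon but keep the later codons unchanged.
three_inserted = original[:3] + "CCC" + original[3:]

print(split_codons(original))
print(split_codons(one_inserted))
print(split_codons(three_inserted))
```

Comparing the three printed lists makes the +1 case visible: after the single-base insertion none of the original downstream codons survives, whereas the three-base insertion only adds one codon.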

Researchers identified and studied these mutations by analyzing tissue samples collected from patients affected by pathogenic frameshift mutations. Therefore, many frameshift mutations, along with the evolutionary information they may contain, remain unidentified. However, previous studies showed that frameshifted coding genes can be expressed and that frameshift proteins can be functional by themselves [8]. A recent study has shown that the universal genetic code, protein-coding genes, and the genomes of all species were optimized for frameshift tolerance [8]. That work defines frameshift homologs as a set of frameshifted yet functional coding genes/proteins that evolved from a common ancestor gene via frameshift mutation.

Another study shows that a frameshift mutation in the CCR5 gene confers resistance to HIV infection [9]. CCR5 is a co-factor responsible for HIV entry into the cell [10]. A 32-base-pair deletion in CCR5 has been identified as a mutation that negates the likelihood of an HIV infection [11]. This deletion in the open reading frame causes a frameshift that introduces a premature stop codon [12], leading to loss of the HIV-binding function. CCR5-1 is considered the wild type and CCR5-2 the mutant allele. People heterozygous for the CCR5 mutation were less susceptible to HIV infection [13]. In one study, even after one exposure to a high concentration of HIV, no one homozygous for the CCR5-2 mutation was reported as HIV-positive [14]. This kind of frameshift mutation can be considered beneficial to the individual, and thus can be regarded, to some degree, as an evolutionary change. In addition, researchers can mimic this molecular behavior and knock out a protein domain in order to prevent infection by a pathogen.

For these reasons, this paper will help identify a protein domain evolution road map and provide information that lets researchers navigate such maps. The method could also be extended to other species to detect protein domain evolution routes, since protein domains exist in all species. Moreover, the method could be applied to other forms of RNA, such as those with the 'NR_' prefix (non-coding RNA), to help decipher regulatory regions.

1.4 Genetic disorder factors

A genetic disorder is a disease caused by an abnormality in DNA. These abnormalities can range from a single-nucleotide mutation to the deletion or insertion of an entire chromosome [15]. Frameshift mutations are mutations caused by insertions or deletions of one or two nucleotides in a DNA sequence. Because tRNA translates codons, groups of three mRNA nucleotides, into amino acids [16], frameshift mutations shift the tRNA reading frame and thus produce a perturbed protein [17]. These mutations generally occur in hot spots, which are repeated sequences of one or two nucleotides. This is due to a ‘slip’ of the DNA polymerase followed by realignment of the DNA template and the nascent strand during replication [18]. However, frameshift mutations can also occur elsewhere in a DNA sequence. They lead either to an inactive protein or to a protein with an altered structure and function. Both cases are very dangerous and can result in many severe diseases, such as Crohn’s disease [19], cystic fibrosis [20], Tay-Sachs disease [21] and several types of cancer. For this reason, this paper will generate potential cancer development candidates at the level of the protein domain and provide information that helps researchers identify drug targets and design new medicines.
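To make the notion of a slippage hot spot concrete, here is a small Python sketch that scans a DNA string for runs of a repeated one- or two-nucleotide unit. It is only an illustration and not part of the thesis workflow; the function name `find_repeat_runs`, the toy sequence, and the threshold of four repeated units are assumptions chosen for the example.

```python
import re


def find_repeat_runs(seq, min_units=4):
    """Return (start, end, unit) for runs of a 1- or 2-nt unit repeated >= min_units times.

    Coordinates are 0-based and end-exclusive. The threshold is an arbitrary
    illustrative choice, not a biologically calibrated value.
    """
    # Same base repeated, e.g. AAAAA.
    mono = re.compile(r"([ACGT])\1{%d,}" % (min_units - 1))
    # Two different bases repeated as a unit, e.g. CACACACA.
    di = re.compile(r"(([ACGT])(?!\2)[ACGT])\1{%d,}" % (min_units - 1))
    runs = []
    for pattern in (mono, di):
        for match in pattern.finditer(seq):
            runs.append((match.start(), match.end(), match.group(1)))
    return sorted(runs)


toy = "GGCAAAAATCGCACACACAGT"  # hypothetical sequence with one A-run and one CA-run
for start, end, unit in find_repeat_runs(toy):
    print(start, end, unit)
```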

2 Methods

Figure 2. Methodology workflow

2.1 Data collection

The mRNA accession numbers were retrieved from the RefSeq transcripts file of GRCh38 downloaded from NCBI. We then developed a script that connects to the NCBI Entrez API and fetches the mRNAs from the NCBI nucleotide database [22]. In total, 45,139 mRNAs were retrieved from the database in GenBank format. This format allows us to identify the translation start position on each mRNA, which is the position at which we simulate the frameshift mutation.
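The role of the GenBank format here is that its feature table records where the coding sequence begins. The sketch below, which is an assumption about how this could be done and not the thesis script, reads the first CDS location from GenBank-style text with a regular expression. The record is a hypothetical toy, `cds_start` is an illustrative name, and only simple locations such as `7..24` or `join(7..24,...)` are handled.

```python
import re

# Matches a feature-table line such as "     CDS             7..24".
CDS_LINE = re.compile(r"^\s{5}CDS\s+(?:join\()?<?(\d+)")


def cds_start(genbank_text):
    """Return the 0-based translation start of the first CDS feature, or None."""
    for line in genbank_text.splitlines():
        match = CDS_LINE.match(line)
        if match:
            # GenBank coordinates are 1-based; Python indexing is 0-based.
            return int(match.group(1)) - 1
    return None


# Hypothetical, heavily truncated record used only for illustration.
toy_record = """LOCUS       TOY_0001   30 bp    mRNA
FEATURES             Location/Qualifiers
     source          1..30
     CDS             7..24
                     /gene="TOY"
ORIGIN
"""

print(cds_start(toy_record))
```

A full parser such as the one in Biopython handles many more location forms; the regular expression is used here only to keep the example dependency-free.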

2.2 Frameshift simulation

After obtaining the mRNA sequences, we simulated frameshift mutation in the open reading frame. In this step, we use Biopython packages [23] to perform the +1/-1 simulation on the mRNA and then translate the result into a protein sequence (Figure 3). The frameshift starts at the translation start point in the mRNA (marked with a dark red arrow), and all remaining residues are translated; we labeled this region with a pink grill pattern. Because we ignore all stop codons, the poly-A tail of the mRNA is also translated into protein.

Figure 3. Simulation design.

Since a frameshift mutation could occur at any position in the sequence in a real scenario, we set up the simulation without regard to any internal stop codon in the sequence. In other words, the frameshifted sequences we obtain contain multiple stop codons. This allows us to include all possible frameshift cases and all possible resulting protein domains. Figure 4 gives an example in which the frameshifted nucleotide sequence yields 5 internal stop codons. In order to include all possible domains, we set "stop=False" in our program and obtained the peptides. The package we use is SeqIO, which allows us to extract nucleotide sequences from GenBank-format files. The built-in function "translate" directly gives us the frameshift protein products; "Stop=False" is a parameter of the "translate" function.
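The simulation described above can be sketched in plain Python as follows. This is not the thesis code, which relied on Biopython; the version below is a dependency-free approximation under the assumption that a frameshift is modeled by starting translation 1 or 2 nucleotides after the annotated start and reading to the end of the mRNA, writing stop codons as `*` instead of terminating. The names `translate_keep_stops` and `simulate_frameshifts` and the toy mRNA are hypothetical. In Biopython, the `translate` method has a `to_stop` argument; treating it as the counterpart of the "stop=False" setting mentioned in the text is an assumption.

```python
# Standard genetic code, written in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_keep_stops(nucleotides):
    """Translate every complete codon; stop codons become '*' and do not end translation."""
    seq = nucleotides.upper().replace("U", "T")
    # Codons with ambiguous bases are written as 'X'.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def simulate_frameshifts(mrna, cds_start):
    """Return the peptides read from the CDS start with offsets 0, 1 and 2."""
    return {shift: translate_keep_stops(mrna[cds_start + shift:]) for shift in (0, 1, 2)}


# Hypothetical toy mRNA: two untranslated bases, then ATG GCC TAA GGG.
products = simulate_frameshifts("CCATGGCCTAAGGG", cds_start=2)
for shift, peptide in products.items():
    print(shift, peptide)
```

Offset 0 reproduces the original reading frame, including its stop codon as `*`, so the three products can be compared side by side.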

within a local computer, and did the protein BLAST locally. The protein reference sequence file is 'GRCh38_latest_protein.faa'. In this step, we keep * in the sequences as a stop codon for the following reasons. First, the BLAST program recognizes the * mark as a stop codon. Second, a * in a query sequence will not match any amino acid, which lowers the total score of that alignment. Therefore, we keep * in the sequences and BLAST them. The aim of the BLAST step is to select frameshift proteins by comparing them with a well-documented protein database. The frameshifted proteins are therefore searched against human RefSeq proteins (GRCh38_latest_protein.faa, downloaded from NCBI on Jan/15/2018). 'GRCh38_latest_protein.faa' includes NP_-labeled protein sequences as well as XP_-labeled protein sequences. NP_-labeled sequences are those with biochemical evidence, while XP_-labeled sequences are predicted proteins. The threshold is an e-value of 0.0001, and the BLOSUM62 matrix was used. The Expect value was reduced from the default of 10 to 0.0001 in order to speed up BLAST processing and ensure meaningful results. BLOSUM62 is effective at finding a wide range of potential similarities, including those at the 30-40% level, so using it brings a broader range of potentially functional frameshift proteins into our candidate set. We retain the sequences that have BLAST results and discard the others. These retained proteins are called potential functional proteins, since they exist in the real world to some degree. We report the BLAST criteria we used below. The e-value and matrix were selected; all other values are defaults.

Parameter          Value
database           GRCh38_latest_protein.faa
e-value            0.0001
word size          2
gap open           11
gap extend         1
matrix             BLOSUM62
comp based stats   0

Table 1. BLAST parameters
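For readers who want to reproduce a comparable search, the parameters reported for this step can be assembled into a BLAST+ `blastp` command line. The sketch below only builds the argument list and does not run anything. It assumes a local BLAST+ installation and a protein database already prepared from the reference file (for example with `makeblastdb`); the query and output file names are hypothetical, and the exact invocation used in the thesis is not shown in the text.

```python
def build_blastp_command(query_fasta, out_path, db="GRCh38_latest_protein.faa"):
    """Assemble a blastp argument list from the parameters reported in the text."""
    return [
        "blastp",
        "-query", query_fasta,        # frameshifted peptides (hypothetical file name)
        "-db", db,                    # database built from the RefSeq protein file
        "-evalue", "0.0001",          # threshold stated in the text
        "-word_size", "2",
        "-gapopen", "11",
        "-gapextend", "1",
        "-matrix", "BLOSUM62",
        "-comp_based_stats", "0",
        "-out", out_path,             # hypothetical output path
    ]


command = build_blastp_command("frameshift_plus1.fasta", "frameshift_plus1.blast.txt")
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` once the database exists; keeping it as a list avoids shell quoting problems with file names.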

2.4 Domain detection by Interproscan

InterProScan is the software package that scans sequences (protein and nucleic) against InterPro's signatures and returns annotations of the input sequences. By classifying sequences into families and predicting the presence of domains and important sites, InterPro serves as a tool for analyzing the structure and function of protein sequences [26]. To reach this goal, InterPro uses signatures, which are predictive models. These models are provided by several different databases (referred to as member databases) that make up the InterPro resource. This step is also computationally expensive. In order to save time and submit a large number of jobs at once, we downloaded the InterProScan tool and performed protein domain detection locally. The version of the software is interproscan-5.27-66.0, downloaded on Feb 24th, 2018. This software requires a Linux environment. We wrote a bash script that automatically submits thousands of jobs to InterProScan in order. In this step, we aim to identify whether the potential functional proteins have domains. We first detect domains in the +1/-1 proteins and select the candidates in which a domain is found. Only if a frameshifted product turns up positive do we annotate its original sequence with InterProScan. This is a trick to shorten the InterProScan running time.
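The two-pass strategy described above (scan the frameshifted products first, then annotate only the originals whose frameshifted product had a domain) can be sketched as follows. The thesis used a bash script; this Python version is an illustrative stand-in. The accession names, batch size, and file names are hypothetical, and the `interproscan.sh` options shown (`-i`, `-f`, `-o`) are the basic input, format, and output flags, assumed here to be sufficient for a local run.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def interproscan_command(fasta_path, xml_path):
    """Argument list for one local InterProScan run that writes XML output."""
    return ["interproscan.sh", "-i", fasta_path, "-f", "XML", "-o", xml_path]


def originals_to_annotate(frameshift_domains):
    """Keep only accessions whose frameshifted product had at least one domain."""
    return sorted(acc for acc, domains in frameshift_domains.items() if domains)


# Hypothetical first-pass result: accession -> domains found in the frameshifted product.
frameshift_domains = {"NM_TOY_1": ["PH"], "NM_TOY_2": [], "NM_TOY_3": ["SH3"]}

second_pass = originals_to_annotate(frameshift_domains)
jobs = [
    interproscan_command(f"original_batch_{i}.fasta", f"original_batch_{i}.xml")
    for i, _ in enumerate(batches(second_pass, 1))
]
for job in jobs:
    print(" ".join(job))
```

Skipping originals with no frameshifted domain is what saves running time, since InterProScan is invoked only for sequences that can contribute to the later comparison.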

Figure 5. Position overlap between the original SH3 domain and frameshifted PH domain.

2.6 Homologue superfamily analysis

A homologous superfamily is a group of proteins that share a common evolutionary origin, reflected by similarity in their structure [29]. Since superfamily members often display very low similarity at the sequence level, this type of InterPro entry is usually based on a collection of underlying hidden Markov models rather than a single signature [30]. The tool we use is a simple XML parser [31] written in Python, and this step was performed in a Jupyter notebook. We retrieve the entries that satisfy entry.attrib['type']=="HOMOLOGUE_SUPERFAMILY" and extract the accession number, the homologue superfamily name, and the homologue superfamily position in both the original and the frameshifted sequence. We then compare the positions in the original and frameshifted sequences. When two homologue superfamilies overlap, we regard them as evolutionarily connected through frameshift mutation. A homologue superfamily evolution map caused by frameshift mutation was made in order to produce patterns that humans can easily recognize. An interactive protein domain evolutionary map was generated with JavaScript. We also selected only those homologue superfamilies that become a new homologue superfamily, and generated patterns based on their mutation strength and homologue superfamily frequency. This work was performed in Cytoscape.
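The parsing and overlap test described in this section can be sketched with Python's standard `xml.etree.ElementTree` module. The XML below is a simplified, hypothetical stand-in for InterProScan output, so element names and nesting in real files may differ; the type string is taken from the text. The function names `superfamily_hits`, `overlaps`, and `evolution_edges` are illustrative and are not the names used in the thesis notebook.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip an XML namespace prefix such as '{uri}entry' down to 'entry'."""
    return tag.rsplit("}", 1)[-1]


def superfamily_hits(xml_text, wanted_type="HOMOLOGUE_SUPERFAMILY"):
    """Return (name, start, end) for every match whose entry has the wanted type."""
    hits = []
    for match in ET.fromstring(xml_text).iter():
        if not local(match.tag).endswith("-match"):
            continue
        entry = next((e for e in match.iter() if local(e.tag) == "entry"), None)
        if entry is None or entry.attrib.get("type") != wanted_type:
            continue
        for loc in match.iter():
            if local(loc.tag).endswith("-location"):
                hits.append((entry.attrib.get("name"),
                             int(loc.attrib["start"]), int(loc.attrib["end"])))
    return hits


def overlaps(a_start, a_end, b_start, b_end):
    """Two closed intervals overlap when each starts before the other ends."""
    return a_start <= b_end and b_start <= a_end


def evolution_edges(original_hits, shifted_hits):
    """Count original -> frameshifted superfamily pairs whose positions overlap."""
    edges = Counter()
    for o_name, o_start, o_end in original_hits:
        for s_name, s_start, s_end in shifted_hits:
            if o_name != s_name and overlaps(o_start, o_end, s_start, s_end):
                edges[(o_name, s_name)] += 1
    return edges


# Hypothetical, simplified XML for an original sequence.
toy_xml = """<protein-matches><protein><matches>
  <hmmer3-match>
    <signature ac="SSF0001">
      <entry ac="IPR000001" name="Toy-like_sf" type="HOMOLOGUE_SUPERFAMILY"/>
    </signature>
    <locations><hmmer3-location start="10" end="80"/></locations>
  </hmmer3-match>
</matches></protein></protein-matches>"""

original_hits = superfamily_hits(toy_xml)
shifted_hits = [("Other_sf", 60, 120), ("Far_sf", 200, 250)]  # hypothetical frameshifted hits
edges = evolution_edges(original_hits, shifted_hits)
print(dict(edges))
```

The resulting edge counts are the kind of data a network tool can lay out as a route map, with the count serving as one possible edge weight.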

2.7 Gene comparison

We also compare the genes that have frameshifted results according to our analysis with the ClinVar database. ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence [32]. We retrieve all genes (4059) that ClinVar annotates as associated with frameshift mutation disease. We prepare our candidate disease-associated genes as follows. First, we identify the mRNAs for which a frameshifted domain was found. Then we extract the gene names, including gene name synonyms, and use these as the candidate data. There are 863 mRNAs encoding frameshifted proteins in which InterProScan found domains, and 1931 gene names, including synonyms, were extracted from these mRNAs. This step can be considered an evaluation of the whole research work: because everything we do is an in silico experiment, finding matching evidence in the real world suggests that our experimental design is at least partially right. In addition, other potential functional proteins might exist but have not yet been reported. These results will provide new guidance for research in cancer biology.
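The comparison described above amounts to a set intersection over gene names after synonyms are taken into account. The following Python sketch shows one way this could be done; it is an assumption about the procedure, with hypothetical gene names, and the helper names `normalise` and `shared_genes` are illustrative.

```python
def normalise(names):
    """Upper-case and trim gene names so that trivial spelling differences do not matter."""
    return {name.strip().upper() for name in names if name.strip()}


def shared_genes(candidates, clinvar_genes):
    """Return candidate genes whose name or any synonym appears in the ClinVar gene set.

    `candidates` maps a primary gene name to a list of synonyms.
    """
    reference = normalise(clinvar_genes)
    matched = []
    for gene, synonyms in candidates.items():
        if normalise([gene, *synonyms]) & reference:
            matched.append(gene)
    return sorted(matched)


# Hypothetical toy data, not taken from the thesis or from ClinVar.
candidates = {
    "GENE_A": ["GA1"],
    "GENE_B": ["GB_OLD", "gb2"],
    "GENE_C": [],
}
clinvar_genes = ["gene_a", "GB2", "GENE_Z"]

print(shared_genes(candidates, clinvar_genes))
```

Matching on synonyms as well as primary names explains why the number of extracted names can exceed the number of mRNAs, as in the counts reported above.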