Genome Annotation, BBSI, July 14, 2005, Rita Shiang
![Page 1: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/1.jpg)
Genome Annotation
BBSI
July 14, 2005
Rita Shiang
![Page 2: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/2.jpg)
Genome Annotation
Identification of important components in genomic DNA
![Page 3: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/3.jpg)
What is a Gene?
Fundamental unit of heredity
DNA involved in producing a polypeptide; it includes regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns)
Entire DNA sequence including exons, introns, and noncoding transcription-control regions
![Page 4: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/4.jpg)
What Components are Important in Protein Coding Genes?
Sequences that initiate transcription
Sequences that process hnRNA to mRNA
Signals important in translation
![Page 5: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/5.jpg)
TATA Box
Lodish et al., Molecular Cell Biology, 2000, Fig. 10.30.
![Page 6: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/6.jpg)
Other Promoters
Initiator consensus
– 5’ Py Py A(+1) N T/A Py Py Py
N = A, T, G or C; Py = pyrimidine = C or T
GC-rich sequences
– Stretch of 20-50 GC nucleotides ~100 bp upstream of start site (CpG not common in genome)
– Housekeeping genes
– Multiple initiation sites
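The initiator consensus above maps directly onto a regular expression. The following is a minimal sketch, assuming a DNA string on the coding strand; the function name `find_initiators` and the example sequence are made up for illustration and are not from the slides.

```python
import re

# Initiator (Inr) consensus from the slide: Py Py A(+1) N T/A Py Py Py
# Py = C or T, N = any base. The lookahead lets overlapping matches be reported.
INR_PATTERN = re.compile(r"(?=([CT][CT]A[ACGT][TA][CT][CT][CT]))")


def find_initiators(dna):
    """Return (index_of_plus1_A, matched_sequence) for each Inr-like match.

    Indices are 0-based positions in the input string.
    """
    dna = dna.upper()
    # The +1 adenine is the third base of the 8-base consensus.
    return [(m.start() + 2, m.group(1)) for m in INR_PATTERN.finditer(dna)]


if __name__ == "__main__":
    example = "GGGCTCAGTCTTGGG"  # hypothetical sequence
    print(find_initiators(example))
```

A match to this short, degenerate pattern is only a candidate: the pattern will occur by chance many times in a genome, so in practice it would be combined with other evidence.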
![Page 7: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/7.jpg)
Polyadenylation & Cleavage
Addition of a string of As to mRNAs
Polyadenylation signal AAUAAA found before cleavage site
GU or UU rich region ~50 bp from the cleavage site
Stabilizes mRNA transcripts
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.23.
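As a small illustration of scanning for the polyadenylation signal, the sketch below lists every position of AAUAAA in a transcript. It assumes the input may be written as DNA (T) or RNA (U); the function name `polya_signal_positions` is hypothetical, and the sketch does not check the GU/UU-rich region or the distance to a cleavage site.

```python
def polya_signal_positions(seq):
    """Return 0-based start positions of the AAUAAA signal in a transcript."""
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA notation
    signal = "AAUAAA"
    positions = []
    start = rna.find(signal)
    while start != -1:
        positions.append(start)
        # Step by one so that overlapping occurrences are also found.
        start = rna.find(signal, start + 1)
    return positions


if __name__ == "__main__":
    print(polya_signal_positions("CCAATAAAGG"))  # hypothetical sequence
```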
![Page 8: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/8.jpg)
Splicing
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.13.
Electron micrograph of adenovirus DNA and hexon gene mRNA
![Page 9: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/9.jpg)
Splice Reaction
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.15.
![Page 10: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/10.jpg)
Splice Sites
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.14.
![Page 11: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/11.jpg)
Additional Splice Sites
Consensus: Py7NCAG-G (exon) AG-GUAAGU, 98.12%
Nonconsensus GC: 0.76%
U12 introns: AC, PuUAUCCUPy
Other rare sequences: 1%
Py = C or U; Pu = A or G
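The splice-site classes on this slide can be told apart, to a first approximation, by the two bases at each end of the intron. The sketch below is a simplification under that assumption: the function name `classify_intron` and the test sequences are made up, and real U12-type introns are recognized by longer motifs than the terminal dinucleotides alone.

```python
def classify_intron(intron):
    """Classify an intron by its first and last two bases (RNA notation)."""
    rna = intron.upper().replace("T", "U")  # accept DNA or RNA notation
    if len(rna) < 4:
        raise ValueError("intron too short to have two distinct ends")
    ends = (rna[:2], rna[-2:])
    if ends == ("GU", "AG"):
        return "consensus GU-AG"
    if ends == ("GC", "AG"):
        return "nonconsensus GC-AG"
    if ends == ("AU", "AC"):
        return "U12-type AU-AC"
    return "other"


if __name__ == "__main__":
    # Hypothetical intron sequences, written as DNA.
    for seq in ("GTAAGTCCCCTTTTCAG", "GCAAGTCCCCAG", "ATATCCTTCCAC"):
        print(seq, "->", classify_intron(seq))
```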
![Page 12: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/12.jpg)
Translation Signals
5’ cap structure directs ribosomal binding
AUG codes for methionine. The first AUG in a transcript is where translation starts
Open reading frame (ORF)
– Stretch of sequence that codes for amino acids before a stop codon
Translation stop codons: UAG, UAA, UGA
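The translation signals above suggest a simple ORF scan: start at the first AUG and read codon by codon until a stop codon. This is a minimal sketch of that rule only; the function name `first_orf` and the example transcript are hypothetical, and real gene finders weigh much more evidence than this.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}


def first_orf(transcript):
    """Return the ORF from the first AUG through the first in-frame stop codon.

    Returns None if there is no AUG or no in-frame stop codon after it.
    """
    rna = transcript.upper().replace("T", "U")  # accept DNA or RNA notation
    start = rna.find("AUG")
    if start == -1:
        return None
    # Walk in steps of three so only in-frame codons are examined.
    for i in range(start, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return rna[start:i + 3]
    return None


if __name__ == "__main__":
    print(first_orf("GGAUGAAACCCUAGGG"))  # hypothetical transcript
```

Note that the out-of-frame UGA overlapping the start codon in the example is skipped, because only codons in the frame set by the AUG are read.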
![Page 13: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/13.jpg)
Capping of the 5’ end of RNA with 7-methylguanylate (m7G)
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.8.
![Page 14: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/14.jpg)
Known Gene Components
Lodish et al., Molecular Cell Biology, 2000, Fig. 10.34.
![Page 15: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/15.jpg)
Genome Annotation
What is in a genome besides protein coding genes?
![Page 16: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/16.jpg)
Repetitive DNA makes up at least 50% of the genome
Transposon-derived interspersed repeats
Inactive retroposed copies of genes (pseudogenes)
Simple short repeats
Segmental duplications
Blocks of tandemly repeated sequences
– Centromeres
– Telomeres
– Short arm of acrocentric chromosomes
– Ribosomal gene clusters
![Page 17: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/17.jpg)
Non-protein coding genes or non-coding RNA (ncRNA)
tRNA genes
rRNA genes
snRNA genes
– Splicing
– Telomere maintenance
snoRNA genes
Other
– microRNA
![Page 18: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/18.jpg)
Annotation of Genomic DNA
Identifying protein coding genes
Placing the genes on the genome (where are they?)
![Page 19: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/19.jpg)
How Many Genes in the Genome?
Early on, based on reassociation kinetics, the estimate was ~40,000
Walter Gilbert estimated ~100,000 based on gene and genome size
70,000 – 80,000 based on an extrapolated number of CpG islands
With the human genome sequence, the estimate is 30,000 – 40,000
![Page 20: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/20.jpg)
Annotation of Genomic DNA Specifically for Genes that Code for Proteins
Match genomic DNA to genes that have been previously cloned and sequenced looking for sequence similarity using BLAST programs
Predict genes using computer programs to scan genomic DNA using known elements
Many strategies use a combination of both methods
![Page 21: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/21.jpg)
Lodish et al., Molecular Cell Biology, 2000, Fig. 7.14
cDNA Library Construction
![Page 22: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/22.jpg)
Lodish et al., Molecular Cell Biology, 2000, Fig. 7.15
![Page 23: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/23.jpg)
Gene Annotation: Celera
Constructed gene models using sequence from cDNAs
Used UniGene database
– Partitions GenBank sequences (mRNAs & ESTs) into non-redundant set using 3’ UTRs
– 111,064 UniGene clusters for human
![Page 24: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/24.jpg)
Gene Annotation: Celera cont.
Predicts gene boundaries by identifying overlapping sets of EST and protein matches
Known full-length genes were annotated on the map (matched w/50% of the length & >92% identity)
Clusters that did not match a full-length gene were evaluated using other references
– Conservation of genomic sequence between mouse & human
– Similarity between human & rodent transcripts
– Similarity to known proteins
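The matching criteria for known full-length genes (50% of the length and >92% identity) can be pictured as a simple filter over alignments. The sketch below is an illustration only, not Celera's code: the `Alignment` record, its field names, and the choice to treat 50% as an inclusive minimum are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for a known gene aligned to genomic sequence."""
    gene_id: str
    gene_length: int         # length of the known full-length transcript, in bp
    aligned_length: int      # bases of that transcript aligned to the genome
    percent_identity: float  # identity over the aligned region, 0-100


def passes_threshold(aln, min_coverage=0.50, min_identity=92.0):
    """Apply the slide's criteria: 50% of the length and >92% identity.

    Assumption: coverage is inclusive (>= 50%), identity is strict (> 92%).
    """
    coverage = aln.aligned_length / aln.gene_length
    return coverage >= min_coverage and aln.percent_identity > min_identity


if __name__ == "__main__":
    hits = [
        Alignment("geneA", 1000, 600, 95.0),  # made-up values
        Alignment("geneB", 1000, 400, 99.0),
        Alignment("geneC", 1000, 600, 92.0),
    ]
    print([h.gene_id for h in hits if passes_threshold(h)])
```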
![Page 25: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/25.jpg)
Validation
Validated by construction of known genes (RefSeq)
6.1% of RefSeq genes were not annotated by Otto (Celera's automated annotation system)
![Page 26: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/26.jpg)
Gene Annotation - Human Genome Sequencing Consortium
Start with Ensembl predicted genes
– ab initio predictions using Genscan
Based on probabilistic model of genome sequence composition and gene structure
– Confirm similarity to mRNAs, ESTs, protein motifs from all organisms
– Extend protein matches using GeneWise
Compares protein-based information to genomic sequence and allows for frameshifts and large introns
– Produces partial gene predictions
![Page 27: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/27.jpg)
Consortium cont.
Merge Ensembl gene predictions w/ Genie predictions
– Genie identifies matches of mRNAs and ESTs
Employs hidden Markov models (HMMs) to extend matches using ab initio statistical methods
Links information from 5’ and 3’ ESTs from the same cDNA clone to complete a sequence from the ATG to the stop codon
Can generate alternatively spliced products (though only longest used in this build)
Merge results with genes in RefSeq, SWISSPROT and TrEMBL databases
![Page 28: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/28.jpg)
Validation
Validate method by comparing to a new set of known genes, a set of mouse cDNAs and genes on Chromosome 22 (Finished Sequence)
85% sensitivity
13% spurious predictions
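The two validation figures can be read as standard ratios. The sketch below assumes that sensitivity means the fraction of known genes that were recovered, and that the spurious rate means the fraction of predictions with no matching known gene; the slides do not spell out either definition, and the counts used are invented to reproduce the quoted percentages.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of known genes that the method recovered."""
    return true_positives / (true_positives + false_negatives)


def spurious_fraction(true_positives, false_positives):
    """Fraction of all predictions that do not correspond to a real gene."""
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Invented counts, chosen only to illustrate the two ratios.
    print(f"sensitivity: {sensitivity(85, 15):.0%}")
    print(f"spurious:    {spurious_fraction(87, 13):.0%}")
```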
![Page 29: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/29.jpg)
Factors Affecting Gene Annotation
Splice sites do not conform to consensus
Noncoding exons are common
– Exon: what is left after introns are removed by splicing; the term does not refer to a stretch of coding information
– tRNAs are spliced but noncoding
– >35% of human genes have noncoding exons
– No statistical bias so they are difficult to identify
![Page 30: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/30.jpg)
Factors Affecting Gene Annotation Cont.
Internal exons can be very small
– Avg. size of internal exons is ~130 bp
– ~65% of vertebrate exons are 68-208 bp
– >10% are <60 bp
– Exons < 10 bp have been identified
– Invected gene in Drosophila
One of four exons is 6 bp (GTCGAA)
Flanked by introns of 27.6 and 1.1 kb
Not correctly recognized by cDNA alignment software and creates a frameshift in the gene
– Exons of size 0
Resizing exons create an intermediate splice product
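A rough calculation suggests why a 6 bp exon is hard for alignment software to place. The sketch below estimates how often one specific hexamer would occur by chance in sequences the size of the two flanking introns, assuming all four bases are equally likely and independent, which real genomic sequence is not. The function name `expected_chance_hits` is hypothetical.

```python
def expected_chance_hits(region_length, k):
    """Expected exact occurrences of one specific k-mer in a random sequence.

    Assumes equal, independent base frequencies (probability 1/4 per base).
    """
    windows = max(region_length - k + 1, 0)  # number of k-long windows
    return windows / 4 ** k


if __name__ == "__main__":
    # Intron sizes from the slide: 27.6 kb and 1.1 kb; exon size 6 bp.
    for intron_bp in (27_600, 1_100):
        hits = expected_chance_hits(intron_bp, 6)
        print(f"{intron_bp} bp intron: ~{hits:.2f} chance matches to a 6-mer")
```

Under these assumptions a 6-mer is expected several times by chance in the larger intron alone, so sequence identity by itself cannot pin down where such an exon lies.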
![Page 31: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/31.jpg)
Places to View Annotated Genomes
National Center for Biotechnology Information (NCBI)
Ensembl
The Golden Path (UCSC Genome Browser)
Celera
![Page 32: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/32.jpg)
Verification of Annotation in C. elegans by Experimentation
Complete genomic sequence
Small introns
Small intergenic regions
![Page 33: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/33.jpg)
![Page 34: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/34.jpg)
Results
11,984 cDNAs successfully cloned out of a prediction of 19,477
4,365 were not represented by cDNAs or ESTs
Failure of cloning could be due to:
– Wrongly predicted exons
– Very low expressing genes
– Not a real gene
![Page 35: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/35.jpg)
Verification of intron/exon structures
![Page 36: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/36.jpg)
Comparison of a Single Transcript
![Page 37: Genome Annotation BBSI July 14, 2005 Rita Shiang](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649e585503460f94b5159d/html5/thumbnails/37.jpg)
Greater than 50% of intron/exon structures need correcting?