Gene & genome organisation
Computational gene identification
Eubacterial gene
Regulatory elements
Promoter
Translation start
Transcription stop
polyA signal
Transcription start
Translation stop
Exons
Introns
DNA
Eukaryotic gene
Promoters in eukaryotic DNA
1. TATA box
2. Initiators: 5’ Y Y A(+1) N [T/A] Y Y Y 3’
3. CpG islands
EMBOSS CpGPlot
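The initiator consensus above can be read as a pattern over IUPAC codes (Y = C or T, N = any base). A minimal sketch of how it might be searched for, assuming plain DNA strings; the function name `find_initiators` is hypothetical, and a real promoter scan would weigh this weak motif together with other signals:

```python
import re

# Initiator consensus from the slide: Y Y A(+1) N [T/A] Y Y Y
# Y = pyrimidine (C or T), N = any base (IUPAC codes).
# The lookahead lets overlapping matches be reported as well.
INR_PATTERN = re.compile(r"(?=([CT][CT]A[ACGT][TA][CT][CT][CT]))")

def find_initiators(seq):
    """Return 0-based positions of the +1 adenine for every consensus match."""
    seq = seq.upper()
    # The +1 A is the third base of the 8-base motif, hence the offset of 2.
    return [m.start() + 2 for m in INR_PATTERN.finditer(seq)]

print(find_initiators("GGGTCAGTCCCGG"))
```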
polyA tail
Splicing
Translation
RNA (primary transcript)
RNA (spliced)
Protein
Regulated splicing
Primary RNA transcript
Exons may be combined differently during splicing. One gene can in this way give rise to multiple forms of a protein.
Splice variants
[Figure: exon usage (exons 1-13, including 13a) in the splice variants of the tissues listed below]
Nonmuscle
Smooth muscle
Striated muscle
Striated muscle'
Hepatoma
Brain
Alternative splicing of the α-tropomyosin pre-mRNA
Genome                   Size       No. of genes   Genes / Mb
Homo sapiens             3000 Mb    ~40,000?       ~13
Mycoplasma genitalium    0.6 Mb     ~600           ~1000
Higher eukaryote genomes contain a substantial amount of non-coding sequences
Repetitive DNA: ~50 % of human genomic DNA
Mobile elements:
- viral retrotransposons: common in yeast & Drosophila
- non-viral retrotransposons: common in mammals (LINEs, SINEs)
LINEs (long interspersed elements): 6-7 kb
LINE1: 600,000 copies in the human genome = 15 % of genomic DNA
SINEs (short interspersed elements): ~300 bp
Sequence conservation ~80 % within the same species
Alu sequence: the most abundant class of SINE, 1 million copies = 10 % of genomic DNA
Many Alu sequences have cleavage sites for the restriction enzyme AluI (AGCT), hence the name
Originally derived from SRP RNA by reverse transcription
            Repetitive sequences   Genome size
Mammals     35-45 %                ~3 Gb
Fugu        < 15 %                 ~365 Mb
Why so many repetitive elements in higher mammals?
Mobile elements probably had a significant influence on the evolution of higher organisms:
Novel genes and new controls on gene expression were created because mobile elements have served as sites for recombination, leading to gene duplications and other gene rearrangements (exon shuffling).
Detection of repetitive DNA
Dotplot analysis
RepeatMasker: http://ftp.genome.washington.edu/RM/RepeatMasker.html
RepeatMasker is a program that screens DNA sequences for interspersed repeats known to exist in mammalian genomes as well as for low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (replaced by Ns). On average, over 40% of a human genomic DNA sequence is masked by the program. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green.
1159  13.2  3.2  0.0  HSU08988  6563  6781  (22462)  +  MER7A    DNA/MER2_type       1   226   (109)
5901  11.3  2.5  0.8  HSU08988  6782  7720  (21523)  C  TIGGER1  DNA/MER2_type     (0)  2418   1465
1617  12.7  6.3  1.8  HSU08988  7738  8021  (21222)  C  AluSx    SINE/Alu          (4)   298      2
3811   8.5  1.5  1.5  HSU08988  8027  8699  (20544)  C  TIGGER1  DNA/MER2_type   (943)  1475    803
2035  11.0  0.3  0.7  HSU08988  8700  9000  (20243)  C  AluSg    SINE/Alu          (0)   300      1
2055   9.1  4.4  0.0  HSU08988  9003  9695  (19548)  C  TIGGER1  DNA/MER2_type  (1608)   810      2
 691  15.2  0.0  0.0  HSU08988  9705  9816  (19427)  +  MER7A    DNA/MER2_type     223   334    (1)
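The masking step described above (annotated repeats replaced by Ns) can be sketched in a few lines. This is an illustration, not RepeatMasker's own code: it assumes that fields 6 and 7 of each annotation row are the 1-based, inclusive begin and end of the repeat in the query, and the helper names `parse_annotation_row` and `mask_repeats` are hypothetical.

```python
def parse_annotation_row(line):
    """Split one whitespace-separated annotation row into the fields used here.

    Assumed layout: score, %div, %del, %ins, query name, query begin,
    query end, (left), strand, repeat name, repeat class, ...
    """
    fields = line.split()
    return fields[4], int(fields[5]), int(fields[6]), fields[9], fields[10]

def mask_repeats(sequence, intervals, mask_char="N"):
    """Replace every annotated interval (1-based, inclusive) by mask_char."""
    masked = list(sequence)
    for begin, end in intervals:
        # Convert to 0-based half-open range and clip to the sequence length.
        for pos in range(begin - 1, min(end, len(masked))):
            masked[pos] = mask_char
    return "".join(masked)

row = "1617 12.7 6.3 1.8 HSU08988 7738 8021 (21222) C AluSx SINE/Alu (4) 298 2"
query, begin, end, name, family = parse_annotation_row(row)
print(query, begin, end, name, family)
print(mask_repeats("ACGTACGTAC", [(3, 5)]))
```

The masked sequence can then be passed to a gene finder or database search, so that repeats do not produce spurious hits.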
The human genome contains a substantial number of pseudogenes: non-functional gene variants
Non-processed pseudogenes:
- Gene duplication has resulted in a new copy of the gene
- The copy has mutated to become non-functional
Processed pseudogenes:
- Non-functional genomic copies of mRNAs
- Often contain multiple mutations
Human genome: * variation of GC content
* longer introns in AT-rich regions
Gene prediction methods
- Ab initio, pattern recognition
- Database searching
Identification of ORFs
Finding long ORFs
Number of stop codons = 3 (UAA, UAG, UGA)
In a random sequence a stop codon is expected every 64/3 ≈ 21 codons
Average proteins are much longer
Disadvantages:
- short genes are not detected
- some ORFs are false positives
- not suitable for eukaryotes
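The long-ORF idea can be sketched as follows. This is a simplified illustration under stated assumptions: only the forward strand is scanned, only ATG is accepted as a start codon (bacteria also use alternatives such as GTG), and the length threshold `min_codons` is an arbitrary parameter; the function name `find_orfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# With 3 stop codons among 64, a random sequence is expected to contain
# a stop roughly every 64/3 codons.
EXPECTED_CODONS_BETWEEN_STOPS = 64 / len(STOP_CODONS)

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon. min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

Raising `min_codons` removes more false positives but also discards genuine short genes, which is the trade-off listed under the disadvantages.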
LOCUS       AAB32243        47 aa        BCT       03-MAR-1995
DEFINITION  aepH=putative exoenzyme production regulatory peptide [Erwinia
            carotovora, carotovora, Peptide, 47 aa].
ACCESSION   AAB32243
PID         g691744
VERSION     AAB32243.1  GI:691744
DBSOURCE    locus S74077 accession S74077.1
KEYWORDS    .
SOURCE      Pectobacterium carotovorum carotovora.
  ORGANISM  Pectobacterium carotovorum
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Pectobacterium.
REFERENCE   1  (residues 1 to 47)
  AUTHORS   Murata,H., Chatterjee,A., Liu,Y. and Chatterjee,A.K.
  TITLE     Regulation of the production of extracellular pectinase, cellulase,
            and protease in the soft rot bacterium Erwinia carotovora subsp.
            carotovora: evidence that aepH of E. carotovora subsp. carotovora
            71 activates gene expression in E. carotovora subsp. car
  JOURNAL   Appl. Environ. Microbiol. 60 (9), 3150-3159 (1994)
  MEDLINE   95031027
  REMARK    GenBank staff at the National Library of Medicine created this
            entry [NCBI gibbsq 157517] from the original journal article.
            This sequence comes from Fig. 2A.
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..47
                     /organism="Pectobacterium carotovorum"
                     /db_xref="taxon:554"
     Protein         1..47
                     /product="aepH"
                     /name="putative exoenzyme production regulatory peptide"
     CDS             1..47
                     /gene="aepH+"
                     /coded_by="S74077.1:576..719"
                     /note="Author translates GTG start as Val"
ORIGIN
        1 vgqepkgies rkiqdghvrk kvgrqqglwv rttkkekfsr msrdanv
Example of awful ORF prediction
Codon usage for enteric bacterial (highly expressed) genes 7/19/83
AmAcid  Codon   Number    /1000   Fraction  ..
Gly     GGG      13.00     1.89     0.02
Gly     GGA       3.00     0.44     0.00
Gly     GGU     365.00    52.99     0.59
Gly     GGC     238.00    34.55     0.38
Glu     GAG     108.00    15.68     0.22
Glu     GAA     394.00    57.20     0.78
Asp     GAU     149.00    21.63     0.33
Asp     GAC     298.00    43.26     0.67
Val     GUG      93.00    13.50     0.16
Val     GUA     146.00    21.20     0.26
Val     GUU     289.00    41.96     0.51
Val     GUC      38.00     5.52     0.07
Ala     GCG     161.00    23.37     0.26
Ala     GCA     173.00    25.12     0.28
Ala     GCU     212.00    30.78     0.35
Ala     GCC      62.00     9.00     0.10
Arg     AGG       1.00     0.15     0.00
Arg     AGA       0.00     0.00     0.00
Ser     AGU       9.00     1.31     0.03
Ser     AGC      71.00    10.31     0.20
...
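The Fraction column is the share of each codon within its synonymous family. A small sketch that reproduces the glycine rows from the counts in the table; the function name `synonymous_fractions` is hypothetical:

```python
def synonymous_fractions(counts):
    """Map each codon to its count divided by the total of its synonymous family."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Glycine counts taken from the Number column of the table.
gly_counts = {"GGG": 13, "GGA": 3, "GGU": 365, "GGC": 238}
gly_fractions = synonymous_fractions(gly_counts)

for codon, fraction in gly_fractions.items():
    print(codon, f"{fraction:.2f}")
```

The strong preference for GGU and GGC over GGG and GGA is the kind of bias that codon-usage based gene finders exploit.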
Compositional bias in coding regions
CodonPreference
A codon preference plot is constructed by calculating a codon preference statistic for each position of each of the three reading frames. The statistic is calculated over a window of length w, and the window is moved along the sequence in increments of three bases, maintaining the reading frame. The magnitude of the codon preference statistic is a measure of the likeness of a particular window of codons to a predetermined preferred usage.
p = preference parameter = relative likelihood of a codon being found in a gene as opposed to a random sequence
\[ p = \frac{f_{ABC} / F_{ABC}}{r_{ABC} / R_{ABC}} \]
f = frequency of codon ABC (found in the frequency table)
F = sum of frequencies for all codons that are members of ABC's synonymous family
r = frequency of codon ABC in a random sequence
R = sum of frequencies of ABC's synonymous family in a random sequence
Codon preference statistic P
\[ P = e^{\left(\sum_i \log p_i\right) / w} \]
w is between 25 and 50
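A minimal sketch of the two quantities defined above. P is the exponential of the mean log preference over a window of w codons, i.e. the geometric mean of the p values in the window. The function names are hypothetical, and the sketch assumes every p is strictly positive (a codon with zero frequency would need a pseudocount, which is not covered here).

```python
import math

def preference(f, F, r, R):
    """Preference parameter p = (f/F) / (r/R) for one codon."""
    return (f / F) / (r / R)

def codon_preference_statistic(p_values):
    """P = exp(sum(log p_i) / w) over a window of w codons."""
    w = len(p_values)
    return math.exp(sum(math.log(p) for p in p_values) / w)

def sliding_statistic(p_by_codon, w):
    """P for every window of w consecutive codons in one reading frame.

    Moving one codon at a time corresponds to steps of three bases.
    """
    return [codon_preference_statistic(p_by_codon[i:i + w])
            for i in range(len(p_by_codon) - w + 1)]

print(codon_preference_statistic([2.0, 8.0]))
print(sliding_statistic([1.0, 4.0, 1.0], 2))
```

Running `sliding_statistic` separately on the codons of each of the three reading frames gives the three curves of a codon preference plot.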
CodonPreference is a frame-specific gene finder that tries to recognize protein coding sequences by virtue of the similarity of their codon usage to a codon frequency table or by the bias of their composition (usually GC) in the third position of each codon.
Compositional bias of exons
K-tuple method
(from Bishop, ed., Guide to Human Genome Computing)
Consider a sequence S = {s_i} of length L. It can be transformed into a sequence of k-tuples (i.e. oligonucleotides of length k):
\[ W = \{W_{k,i}\} \quad (i = 1, \dots, L - k + 1); \qquad W_{k,i} \in \Omega \]
Here \(\Omega = \{W_k\}\) is the set of all possible oligonucleotides W_k of length k. In this way it is possible to construct a table F with the occurrence frequency F(W_k) for all possible k-tuples of the set of sequences {S} having the function of interest.
Consider two sets of sequences {S(1)} and {S(2)} with mutually exclusive functions, for instance intron and exon. It is possible to calculate the k-tuple frequency tables F1 and F2 for these two sets of sequences. The difference in frequencies between these tables can be used for discrimination. To analyze a test sequence using the F1 and F2 tables, calculate the local discriminant index for the i-th position:
\[ d(i) = \frac{F_1(S_{k,i})}{F_1(S_{k,i}) + F_2(S_{k,i})} \]
d(i) is smoothed using an averaging window of 2w+1 consecutive positions:
\[ D(i) = \frac{1}{2w+1} \sum_{j=i-w}^{i+w} d(j) \]
k= 6 is often used
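The k-tuple method can be sketched as below. This is an illustration under stated assumptions, not the implementation from the cited book: k-tuples absent from both tables get d = 0.5 (no evidence either way), and the averaging window is simply truncated at the sequence ends. All function names are hypothetical.

```python
from collections import Counter

def ktuple_frequencies(sequences, k):
    """Occurrence frequency of every k-tuple in a set of sequences."""
    counts = Counter()
    for s in sequences:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {kt: n / total for kt, n in counts.items()}

def local_discriminant(seq, f1, f2, k):
    """d(i) = F1 / (F1 + F2) for the k-tuple starting at each position i."""
    d = []
    for i in range(len(seq) - k + 1):
        kt = seq[i:i + k]
        a, b = f1.get(kt, 0.0), f2.get(kt, 0.0)
        # Assumption: a k-tuple seen in neither training set is uninformative.
        d.append(a / (a + b) if a + b > 0 else 0.5)
    return d

def smooth(d, w):
    """Average d over up to 2w+1 positions (window truncated at the ends)."""
    out = []
    for i in range(len(d)):
        window = d[max(0, i - w):i + w + 1]
        out.append(sum(window) / len(window))
    return out

f1 = ktuple_frequencies(["AAAA"], 2)   # toy "exon" training set
f2 = ktuple_frequencies(["CCCC"], 2)   # toy "intron" training set
d = local_discriminant("AACC", f1, f2, 2)
print(d)
print(smooth(d, 1))
```

With real data, k = 6 gives 4096 possible hexamers, so the training sets must be large enough for the frequency tables to be reliable.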
Local sites (=signals) used in the prediction of genes
Promoters
Terminators of transcription
Start and stop codons
Splice sites
Branch points
Polyadenylation sites
Signal sensors = methods for detecting signals
Content sensors
Hexamer counts to discriminate between exons and introns
Gene finding methods:
Combination of signal and content sensors
Gene prediction methods
Ab initio: HMM methods
Genscan        http://genes.mit.edu/GENSCAN.html
HMMGene        http://www.cbs.dtu.dk/services/HMMgene/
Genie          http://www.fruitfly.org/seq_tools/genie.html
GeneMark.hmm   http://genemark.biology.gatech.edu/GeneMark/eukhmm.cgi
FGENEH         http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID         http://www1.imim.es/geneid.html
Ab initio: Neural network methods
GRAIL          http://compbio.ornl.gov/Grail-1.3/
NetGene2       http://www.cbs.dtu.dk/services/NetGene2/
Homology based
Blast          http://www.ncbi.nlm.nih.gov/BLAST
Procrustes     http://www-hto.usc.edu/software/procrustes/index.html
Genewise       http://www.sanger.ac.uk/Software/Wise2
Limitations
Non-coding parts, 5’ and 3’ UTRs, and non-coding RNAs are not detected
Lack of suitable training sets of very long genomic sequences
Methods are conservative - they have been trained on “typical genes”