Comp. Genomics
Recitation 7, 2/4/09: PSSMs + Gene finding
Partially based on slides by Irit Gat-Viks and Metsada Pasmanik-Chor
Biological Motifs
• Biological units with common functions frequently exhibit similarities at the sequence level. These include very short “motifs”, such as:
• Gene splice sites
• DNA regulatory binding sites (bound by transcription factors)
• Often it is desirable to model such motifs, to enable searching for new ones. Probabilistic models are very useful. Today we deal with the PSSM, the simplest.
E. Coli Promoters
Regulation of Genes
[Figure: three-panel diagram of gene regulation. Labels: DNA, Gene, Regulatory Element, RNA polymerase (Protein), Transcription Factor (Protein), New protein]
Motif Logo
• Motifs can mutate on less important bases.
• The five motifs listed below have mutations in positions 3 and 5.
• Representations called motif logos illustrate the conserved regions of a motif.
http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html
Position: 1234567
          TGGGGGA
          TGAGAGA
          TGGGGGA
          TGAGAGA
          TGAGGGA
Example: Calmodulin-Binding Motif (calcium-binding proteins)
PSSM Starting Point
• A gap-less MSA of known instances of a given motif.
Representing the motif by either:
• Consensus.
• Position Specific Scoring Matrix (PSSM).
Usage of a PSSM
• For a putative k-mer, e.g. GTGC, multiply the per-position probabilities: p1(G)·p2(T)·p3(G)·p4(C)
• This gives the likelihood of the k-mer given the PSSM model
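As a rough illustration of building a PSSM from a gap-less alignment and multiplying per-position probabilities, here is a minimal Python sketch. The function names (`build_pssm`, `likelihood`, `consensus`) and the pseudocount of 1 are assumptions made for this example and are not part of the original slides. The input is the five 7-mers from the motif logo slide.

```python
from collections import Counter

ALPHABET = "ACGT"

def build_pssm(instances, pseudocount=1.0):
    """Column-wise base probabilities from a gap-less alignment.

    The pseudocount (an assumption of this sketch) keeps unseen bases
    from getting probability zero.
    """
    length = len(instances[0])
    if any(len(seq) != length for seq in instances):
        raise ValueError("all motif instances must have the same length")
    total = len(instances) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in instances)
        pssm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pssm

def likelihood(kmer, pssm):
    """Product p1(x1) * p2(x2) * ... * pk(xk) for a k-mer x."""
    if len(kmer) != len(pssm):
        raise ValueError("k-mer length must match the PSSM length")
    p = 1.0
    for pos, base in enumerate(kmer):
        p *= pssm[pos][base]
    return p

def consensus(pssm):
    """Most probable base at each position."""
    return "".join(max(col, key=col.get) for col in pssm)

# The five motif instances from the motif logo slide.
instances = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
pssm = build_pssm(instances)
```

In practice the product is often replaced by a sum of log-probabilities (or log-odds against a background model) to avoid numerical underflow on longer motifs.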
TATA box motif
Gene finding
• Only part of the genome encodes proteins
• 80-90% in bacteria, about 2% in humans
• Goal: Given a genome sequence, identify gene boundaries
The genetic code
• A protein-coding gene contains an open reading frame (ORF), which begins with an ATG and ends with one of the three stop codons (TAA, TAG, TGA)
Prokaryotic genes
• The ‘easy’ problem
• Difficulty: not all possible ORFs are actually genes
• In E. coli: 6500 ORFs while there are 4290 genes.
• Additional “handles” are needed
Handle #1: Long ORFs
• In random DNA, one stop codon every 64/3 ≈ 21 codons on average.
• Average protein is ~300 codons long.
• => search for long ORFs.
• Problems:
  • Short genes
  • Overlapping long ORFs on opposite strands
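The long-ORF search can be sketched in Python as follows. This is a simplified illustration, not a real gene finder: the function names, the `min_codons` threshold, and the choice to start each ORF at the first ATG after the previous stop are all assumptions of this sketch. Coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=100):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns a list of (strand, start, end, n_codons), where start/end are
    0-based, end-exclusive indices on the scanned strand and n_codons
    counts the ATG but not the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG since the last stop in this frame
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None
    return orfs

# Toy input: a 6-codon ORF (ATG + 5 x GCT) followed by TAA, in frame +3.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
toy_orfs = find_orfs(toy, min_codons=3)
```

A threshold well above 21 codons filters out most ORFs expected by chance, at the cost of missing short genes, which is exactly the first problem listed above.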
Handle #2: Codon frequencies
• Coding DNA is not random:
• In random DNA, expect Leu : Ala : Trp ratio of 6 : 4 : 1
• In real proteins, 6.9 : 6.5 : 1
• Different frequencies for different species.
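The 6 : 4 : 1 ratio for random DNA follows from how many codons the standard genetic code assigns to each amino acid. A short Python check, using the standard code written out in the usual TCAG ordering (the variable names are assumptions of this sketch):

```python
from collections import Counter

# Standard genetic code, codons ordered by first, second, third base in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

# Number of codons per amino acid ('*' marks the three stop codons).
codons_per_aa = Counter(CODON_TABLE.values())

# With all 64 codons equally likely, amino-acid frequencies are
# proportional to these counts: Leu (L), Ala (A), Trp (W).
random_ratio = (codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"])
```

Comparing these counts with observed amino-acid frequencies in real proteins is what makes codon usage a usable signal for coding regions.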
Using Codon Frequencies/Usage
• Assume each codon is independent.
• For codon abc calculate frequency f(abc) in coding region.
• Given coding sequence $a_1 b_1 c_1, \ldots, a_{n+1} b_{n+1} c_{n+1}$
• Calculate
$$p_1 = f(a_1 b_1 c_1)\, f(a_2 b_2 c_2) \cdots f(a_n b_n c_n)$$
$$p_2 = f(b_1 c_1 a_2)\, f(b_2 c_2 a_3) \cdots f(b_n c_n a_{n+1})$$
$$p_3 = f(c_1 a_2 b_2)\, f(c_2 a_3 b_3) \cdots f(c_n a_{n+1} b_{n+1})$$
• The probability that the ith reading frame is the coding region:
$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$
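A minimal Python sketch of this calculation: for each of the three frames, multiply the coding-region frequencies of n consecutive codons to get p1, p2, p3, then normalize so that P_i = p_i / (p1 + p2 + p3). The function name, the `floor` value used for codons missing from the table, and the toy frequency table are assumptions for illustration only. Products are accumulated as log sums to avoid underflow on long sequences.

```python
import math

def frame_probabilities(seq, codon_freq, floor=1e-6):
    """Return [P1, P2, P3] for a sequence of n+1 codons (3(n+1) bases).

    codon_freq maps a codon to its frequency f(abc) in known coding
    regions; codons absent from the table get `floor` (an assumption
    of this sketch, to avoid log(0)).
    """
    n = len(seq) // 3 - 1  # each frame is scored over n codons
    if n < 1:
        raise ValueError("need at least two codons")
    log_p = []
    for offset in range(3):
        total = 0.0
        for k in range(n):
            codon = seq[offset + 3 * k: offset + 3 * k + 3]
            total += math.log(codon_freq.get(codon, floor))
        log_p.append(total)
    # Normalize in a numerically stable way: P_i = p_i / (p1 + p2 + p3).
    m = max(log_p)
    weights = [math.exp(lp - m) for lp in log_p]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical frequency table: GCT is common in coding regions, others rare.
toy_freq = {"GCT": 0.5, "CTG": 0.01, "TGC": 0.01}
probs = frame_probabilities("GCT" * 5, toy_freq)
```

In a real setting the frequency table would be estimated from known genes of the same species, since codon usage differs between species.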
Handle #3: G+C content
• C+G content (“isochore”) has a strong effect on gene density, gene length etc.
• < 43% C+G : 62% of genome, 34% of genes
• > 57% C+G : 3-5% of genome, 28% of genes
• Gene density in C+G-rich regions is 5 times higher than in moderate C+G regions and 10 times higher than in A+T-rich regions
• Amount of intronic DNA is 3 times higher for A+T-rich regions. (Both intron length and number.)
• Etc…
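To use G+C content as a handle, one first needs the G+C fraction along the sequence. A small Python sketch with non-overlapping windows follows; the function names and the default window size are assumptions of this example.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window=1000, step=1000):
    """List of (window_start, gc_fraction) for windows along the sequence."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Windows whose fraction falls below 0.43 or above 0.57 would then fall into the low and high C+G classes quoted above.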
Handle #4: Promoter motifs
• Transcription depends on regulatory regions.
• Common regulatory region: the promoter
• RNA polymerase binds tightly to a specific DNA sequence in the promoter
Gene prediction programs
Scan the sequence in all 6 reading frames:
1. Start and stop codons
2. Long ORF
3. Codon usage
4. GC content
5. Gene features: promoter, terminator, poly A sites, exons and introns, …
[Figure: reading frames labeled Frame +1, Frame +2, Frame +3]
Moving to eukaryotes
• Less of the genome is protein coding + introns are a (very) serious headache
Eukaryote gene structure
• Gene length: 30kb, coding region: 1-2kb
• Binding site: ~6bp; ~30bp upstream of TSS
• Average of 6 exons, 150bp long
• Huge variance:
  - dystrophin: 2.4Mb long
• Blood coagulation factor: 26 exons, 69bp to 3106bp; intron 22 contains another unrelated gene
Splicing
• Splicing: the removal of the introns.
• Performed by complexes called spliceosomes, containing both proteins and snRNA.
• The snRNA recognizes the splice sites through RNA-RNA base-pairing
• Recognition must be precise: a 1nt error can shift the reading frame, making nonsense of its message.
• Many genes have alternative splicing, which changes the protein created.
Splice Sites