analysis of biological sequences sph 140.638...
nuts and bolts
• meet Tuesdays & Thursdays, 8:30-9:50
• no exam; grade derived from 3-4 homework assignments plus a final project (open book, open note, collaborations allowed as long as work is not copied)
• no single recommended textbook. Website has a few recommendations with guidance for choosing a resource.
• I will try to keep it updated with upcoming lecture notes, a “daily dozen” for each lecture, and homework assignments
• “daily dozen” is just some questions (probably not always 12!) that you don’t have to turn in but that you should be able to answer easily after each lecture
Course objectives
• describe the algorithms used in estimating function of biological sequences
• determine which methods are appropriate for analyzing sequences derived from different experiments
• design analysis pipelines that are biologically meaningful and mathematically rigorous
concepts covered
• algorithms, including
  • HMM
  • MCMC
  • dynamic programming
  • heuristic methods
  • enrichment of spatial associations
• experimental methods
  • ChIP
  • RNAseq
  • bisulfite, RRBS, MBDseq, MeDIP
  • variant calling
  • HiC & similar structural methods
waaaay back: prebiotic soup/primordial sandwich
early Earth was too hot for stable molecules, but as atmosphere cooled, molecules formed at random
many hypotheses about what happened next . . .
but eventually molecules appeared that had catalytic capabilities
and could replicate themselves.
amazing property of nucleotides
RNA
• single stranded but self-complementary, so complex 3D structures with enzymatic capacity are possible
RNA
• amino acids were likely also present in the prebiotic soup
Next steps
• an RNA that gained a permanent function could out-reproduce other RNAs
• proteins are much more stable than RNA
• proteins are linear arrangements of information (like RNA)
RNA encodes proteins
the genetic code is a wobbling degenerate
protein synthesis (translation)
unsurprisingly, protein synthesis involves large RNA/protein complexes (ribosomes)
protein synthesis
• translation is energetically expensive
• highly regulated
• ribosomes have proofreading functions
• all components are recycled
becoming a useful protein
and then?
RNA and proteins were working well, and there were probably many “genetic codes” . . . but the RNA that won was the one that invented a stable version of itself
DNA
transcription
• a usually short-lived RNA copy of the DNA is created through transcription
• RNA is exported to the cytoplasm to encode proteins
• some types of RNA do not encode proteins
transcription: the cell knows where to start!
• transcription is expensive and potentially damaging, so it is highly regulated at many levels:
  • signal sequences (activating or repressive)
  • chromatin structure
  • polymerase control (elongation speed, etc)
  • cleavage of nascent RNA
classical eukaryotic promoter
biological signals: how do we find these signal sequences in a big sequence?
motif finding
Simplest example: look for exact matches to a known motif
Next example: imperfect matches to a known motif
Finally: finding enriched motifs in a pile of sequences
additional questions: conservation throughout evolution, coordinated changes
exact match to a known motif: TATA
the TATA box, one of many signals in DNA sequences, marks the location of transcriptional initiation in a large percentage of eukaryotic genes.
does my sequence contain a TATA box?
ACGCTAGCGCATATAGCATGACTAGTATAGCTAGACGAGCTAGCATATCCGAT
finding an exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
TATA (the motif is slid along the sequence, one position at a time)
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
TATA
how many comparisons are needed?
(hint: 4 comparisons for each position x # positions to be compared)
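A minimal sketch of the naive scan in Python, so the hint can be checked by counting. The function name `naive_search` and the `early_exit` flag are illustrative, not from the lecture; positions are 0-based.

```python
def naive_search(seq, motif, early_exit=False):
    """Slide motif along seq; return (hit positions, number of comparisons)."""
    hits = []
    comparisons = 0
    n_starts = len(seq) - len(motif) + 1  # 53 - 4 + 1 = 50 for the slide sequence
    for start in range(n_starts):
        matched = True
        for offset, base in enumerate(motif):
            comparisons += 1
            if seq[start + offset] != base:
                matched = False
                if early_exit:
                    break  # stop comparing at the first mismatch
        if matched:
            hits.append(start)
    return hits, comparisons


seq = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
hits, full_cost = naive_search(seq, "TATA")
_, early_cost = naive_search(seq, "TATA", early_exit=True)
print(hits, full_cost, early_cost)
```

Without early exit the cost is exactly 4 comparisons x 50 start positions = 200. Stopping at the first mismatch needs fewer, because most start positions fail on the first base.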
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
TATA
are there ways to speed this up?
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
    *      * *    *   *  * *   *        *    * *    *
reduce search space by flagging all Ts—how many comparisons are made?
find and catalog all 4mers in advance
how do we store that information? lots of options:
• big text table
• hash table
• tree structure
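One of the storage options, the hash table, can be sketched with a Python dictionary. This is an illustration under simple assumptions (one short sequence held in memory, the hypothetical helper name `build_kmer_index`), not a recommendation for genome-scale data.

```python
from collections import defaultdict


def build_kmer_index(seq, k=4):
    """Map every k-mer in seq to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        index[seq[start:start + k]].append(start)
    return index


seq = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
index = build_kmer_index(seq)
tata_positions = index.get("TATA", [])  # .get avoids adding empty entries
print(tata_positions)
```

Building the index costs one pass over the sequence. After that, looking up any 4mer is a single dictionary access instead of a new scan.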
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
I found a gene!! or did I?
p(TATA) = ?
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
p(TATA at any site) = p(T)*p(A)*p(T)*p(A)
and, assuming that the nucleotides are equally represented (which isn’t true)
p(TATA) = 0.25^4 = 0.0039
our sequence is 53 nucleotides long, so we have 50 possible start sites.
expect 50 * 0.0039 ≈ 0.2 occurrences of TATA
so is our result surprising (do we have a gene)?
What if we’re working with a genome that is 80% AT?
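A rough way to explore this question is to redo the expected-count arithmetic with different base frequencies. The sketch below assumes independent bases, and for the 80% AT case it assumes A and T are equally common (0.40 each) and G and C are equally common (0.10 each); that split is an assumption, not something stated on the slide. The Poisson step is only an approximation, since overlapping TATA sites are not independent.

```python
import math


def motif_probability(motif, base_freqs):
    """Probability of the motif at one site, assuming independent bases."""
    p = 1.0
    for base in motif:
        p *= base_freqs[base]
    return p


def expected_hits(motif, seq_len, base_freqs):
    """Expected number of occurrences over all possible start sites."""
    n_starts = seq_len - len(motif) + 1
    return n_starts * motif_probability(motif, base_freqs)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40}  # assumed split of 80% AT

exp_uniform = expected_hits("TATA", 53, uniform)
exp_at_rich = expected_hits("TATA", 53, at_rich)

# Poisson approximation for the chance of seeing at least one TATA
p_any_uniform = 1 - math.exp(-exp_uniform)
p_any_at_rich = 1 - math.exp(-exp_at_rich)
print(exp_uniform, exp_at_rich, p_any_uniform, p_any_at_rich)
```

Under these assumptions the expected count in a 53-nucleotide sequence rises from about 0.2 to about 1.3, so a single TATA in AT-rich sequence is much less surprising.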
imperfect motifs
Many proteins bind DNA or RNA with less strict sequence preferences.
good example: splicing
Our understanding is still very incomplete . . . but a cell knows how to do it!
[figure: gene structure drawn 5’ to 3’. Labels: CAAT, TATA, Enhancer, Promoter, exon, intron, exon, intron, exon, polyA signal, 5’ UTR, 3’ UTR]
• Start point for transcription
• Start point for translation (AUG)
• Terminator for translation (UGA, UAA, UAG)
• steps added across the slide builds: transcription (Pre-mRNA), splicing (mRNA), translation
Alternative splicing
Splice site and branch site consensus sequences
The problem:
Consensus 5' and 3' splice site sequences and branch site sequences occur frequently in any genome
what is the probability of finding a GT sequence?
More information necessary to define bona fide exons
Splice site and branch site consensus sequences
UGACAUUACUGUGAGUAAAACUUGUUUUCAGGUACAGUAGUCGCAAGUCAUGGUAAGUCCUCUGACUUAACAGGUACUAUAUAUAAAGGAUUAGGUAUGUAUACCUUCAACACAGGUAACUGACUUGGGGCUGCAGGUACAGUCAUGAGUCAUGUCUGUAUCCUUUUGACCUUACAGUGUGAUGGGCAGAGAGGAUGAUGUAAGUAAUGGAUCAUUCGGGGUGAGUAUUUUCAAAAUGGGGGUAAGAAGACUUUCAACAAAGGUAAGACCAUUCAAAAAUAAGGUGAUUGGCACUAUGAAUUAGGUAAGAACUAUUGCGUAACAGGUGAGGCCCUUCGAGCAGAAGGUGAGAACUGACUGGAGCAAGGUAAUUGUGAGUAUGAUGAAGGUAAAUCUUUACAAACUGGAGGUACUUCAAUUUCUUUUUAGGGUUUCACUAAG
base   -2     -1     1      2
A      0.64   0      0      0
C      0.05   0      0      0
G      0.30   0.85   1      0
U      0.01   0.15   0      1
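A frequency table like this one can be used to score candidate sites. The sketch below copies the four columns from the table and treats them as independent; the function names and the uniform background (0.25 per base) are assumptions made for illustration.

```python
# Frequencies copied from the table; list order is positions -2, -1, 1, 2.
FREQS = {
    "A": [0.64, 0.00, 0.0, 0.0],
    "C": [0.05, 0.00, 0.0, 0.0],
    "G": [0.30, 0.85, 1.0, 0.0],
    "U": [0.01, 0.15, 0.0, 1.0],
}


def site_probability(site, freqs=FREQS):
    """Probability of a 4-base site under the table, columns treated as independent."""
    p = 1.0
    for column, base in enumerate(site):
        p *= freqs[base][column]
    return p


def odds_vs_uniform(site, freqs=FREQS):
    """Ratio of the table probability to a uniform background (0.25 per base)."""
    return site_probability(site, freqs) / (0.25 ** len(site))


for site in ("AGGU", "CUGU", "AGGA"):
    print(site, site_probability(site), odds_vs_uniform(site))
```

A site with a base never seen in a column scores exactly zero here. Real scoring schemes usually add pseudocounts to avoid that, which this sketch leaves out.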
Splice site and branch site consensus sequences
Cartegni et al. 2002
interpretation of sequence logos: If the letters occupy the entire vertical space, the height of each letter is the proportion of sequences with that base at that position. If the letters do not occupy the entire vertical space, the height of each letter typically signifies information content.
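The information-content version of a logo can be sketched from per-position frequencies. The example reuses the four-column frequency table shown earlier in this section and assumes the common definition of 2 bits minus the Shannon entropy of the column, with no small-sample correction; the names `information_content` and `letter_heights` are illustrative.

```python
import math

# Per-position base frequencies, copied from the table earlier in this section.
COLUMNS = {
    "-2": {"A": 0.64, "C": 0.05, "G": 0.30, "U": 0.01},
    "-1": {"A": 0.00, "C": 0.00, "G": 0.85, "U": 0.15},
    "1": {"A": 0.00, "C": 0.00, "G": 1.00, "U": 0.00},
    "2": {"A": 0.00, "C": 0.00, "G": 0.00, "U": 1.00},
}


def information_content(column):
    """Bits of information: 2 minus the Shannon entropy of the column."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


def letter_heights(column):
    """Height of each letter in a logo: its frequency times the column's information."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}


for position, column in COLUMNS.items():
    print(position, round(information_content(column), 3), letter_heights(column))
```

Invariant positions reach the full 2 bits, while a position with mixed bases gets a shorter stack even though its letters still sum to 100% of the sequences.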
U1 U2 U2AF
splice signals . . . not much information
how many times would the sequence GT occur in a 3 Gb genome with 60% AT? how about the sequence CAGGTAAG?
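One way to work through this question, as a sketch: assume independent bases, A = T = 0.30 and G = C = 0.20 (an assumed even split of the 60% AT), a genome of 3 billion bases, and count one strand only. The function name is illustrative.

```python
def expected_occurrences(motif, genome_length, at_fraction):
    """Expected count on one strand, assuming independent bases with A = T and G = C."""
    p_at = at_fraction / 2
    p_gc = (1 - at_fraction) / 2
    freqs = {"A": p_at, "T": p_at, "G": p_gc, "C": p_gc}
    p = 1.0
    for base in motif:
        p *= freqs[base]
    n_starts = genome_length - len(motif) + 1
    return n_starts * p


GENOME_LENGTH = 3_000_000_000
gt = expected_occurrences("GT", GENOME_LENGTH, 0.60)
consensus = expected_occurrences("CAGGTAAG", GENOME_LENGTH, 0.60)
print(f"GT: {gt:.3g}  CAGGTAAG: {consensus:.3g}")
```

Under these assumptions GT is expected about 1.8 x 10^8 times and CAGGTAAG about 39,000 times, which is far more than the number of real splice sites one would hope to single out by sequence alone.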
so how does the cell know how to splice things?
as we’ll see, eukaryotic gene prediction is a tricky problem! but cells know where their exons are . . . what are some approaches that we can take?
finding signals in eukaryotic genes: additional tools and approaches
• evolutionary conservation: take advantage of history
• open reading frames (we know what a protein-coding gene should look like!)
evolutionary conservation: where are mutations tolerated?
this doesn’t look like a random coincidence . . .
evolutionary conservation
probability of seeing a mutation is the product of the probability of mutation occurrence (mutation rate) and the probability of retaining the mutation (selection)
surrogate for biological importance?
evolutionary conservation
to think about this in a meaningful way we need metrics. How do you define “surprisingly conserved”?
open reading frames
if you are looking for protein-coding genes, you also want to look for open reading frames.
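A minimal ORF scanner, as a sketch. It assumes a DNA string (T rather than U), ATG as the only start codon, and the three forward reading frames only; a real gene finder would also scan the reverse strand. The function name and the `min_codons` parameter are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for ATG...stop stretches in the forward frames.

    start is the 0-based index of the A in ATG; end is one past the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop opens an ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


example = "CCATGAAATTTTAGGG"
orfs = find_orfs(example)
print(orfs)
```

Raising `min_codons` filters out the short ORFs that turn up by chance in any sequence.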
open reading frames
There are 64 codons and 3 of them are stop codons. If the codons are equally likely to appear, how long would you expect an ORF to be, in random sequence?
frequency of stop codon = 3/64
expected probability of stop codon in random sequence = 3/64
open reading frames
frequency of stop codon = 3/64
expected probability of stop codon in random sequence p = 3/64
the length of ORFs in random sequence follows a geometric distribution (the negative binomial with r = 1), with mean 1/p (21.3 codons, or 64 nucleotides)
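The arithmetic can be checked in a few lines. The sketch assumes every codon is an independent draw with stop probability 3/64, and it counts length as the number of codons up to and including the first stop; `prob_orf_at_least` is an illustrative helper.

```python
P_STOP = 3 / 64  # 3 stop codons out of 64, all codons assumed equally likely

mean_codons = 1 / P_STOP            # codons up to and including the first stop
mean_nucleotides = 3 * mean_codons


def prob_orf_at_least(n_codons, p_stop=P_STOP):
    """P(no stop codon among the first n_codons codons)."""
    return (1 - p_stop) ** n_codons


print(mean_codons, mean_nucleotides, prob_orf_at_least(100))
```

The tail probability is the useful part for gene finding: under this model, a stretch of 100 codons with no stop is expected less than 1% of the time, so long ORFs are unlikely to be chance.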
put the signals together to make a gene:
signal        intron              exon                     intergenic
ORF           mean length 20aa    spans exon               mean length 20aa
splice site   random occurrence   at boundaries + random   random occurrence
conservation  low                 high                     low
this is, of course, an immense oversimplification and ignores lots of biological entities (pseudogenes, conserved noncoding regions etc). Also, these are noisy signals!
highly conserved locus
finding signals in DNA
• lots of data available for various organisms, to infer conservation, binding sites, function
• look for known motifs, known sequence signals
• mine experimental data for overrepresented sequences and motifs
• laboratory approaches for exploration (e.g. mutagenesis) and confirmation
• gene finding algorithms may assess as many data modalities as possible, with weighting
• First generation sequencing
• Pairwise sequence alignment
• Dot plots
• Needleman-Wunsch and Smith-Waterman
• BLAST, gapped BLAST
• Phylogeny
• Multiple sequence alignment
• Next (and nextnext) generation sequencing
• Short read and not-so-short read alignment
• Hidden Markov Models
• ChIPseq
• variant calling
• Gene expression: approaches and statistics
• Functional analysis in genome space
• Alignment-free sequence comparison
• Metagenomics
• Visualizing big data: Circos, Hive plots