Analysis of Biological Sequences SPH 140.638

swheelan@jhmi.edu

nuts and bolts

• meet Tuesdays & Thursdays, 8:30-9:50

• no exam; grade derived from 3-4 homework assignments plus a final project (open book, open note, collaborations allowed as long as work is not copied)

• no single recommended textbook. Website has a few recommendations with guidance for choosing a resource.

• I will try to keep it updated with upcoming lecture notes, a “daily dozen” for each lecture, and homework assignments

• “daily dozen” is just some questions (probably not always 12!) that you don’t have to turn in but that you should be able to answer easily after each lecture

Course objectives

• describe the algorithms used in estimating function of biological sequences

• determine which methods are appropriate for analyzing sequences derived from different experiments

• design analysis pipelines that are biologically meaningful and mathematically rigorous

concepts covered

• algorithms, including
  • HMM
  • MCMC
  • dynamic programming
  • heuristic methods
  • enrichment of spatial associations

• experimental methods
  • ChIP
  • RNAseq
  • bisulfite, RRBS, MBDseq, MeDIP
  • variant calling
  • HiC & similar structural methods

waaaay back: prebiotic soup/primordial sandwich

early Earth was too hot for stable molecules, but as atmosphere cooled, molecules formed at random

many hypotheses about what happened next . . .

but eventually molecules appeared that had catalytic capabilities

and could replicate themselves.

amazing property of nucleotides

RNA

• single stranded but self-complementary, so complex 3D structures with enzymatic capacity are possible

RNA

• amino acids were likely also present in the prebiotic soup

Next steps

• an RNA that gained a permanent function could out-reproduce other RNAs

• proteins are much more stable than RNA

• proteins are linear arrangements of information (like RNA)

RNA encodes proteins

the genetic code is a wobbling degenerate

protein synthesis (translation)

unsurprisingly, protein synthesis involves large RNA/protein complexes (ribosomes)

protein synthesis

• translation is energetically expensive

• highly regulated

• ribosomes have proofreading functions

• all components are recycled

becoming a useful protein

and then?

RNA and proteins were working well, and there were probably many “genetic codes” . . . but the RNA that won was the one that invented a stable version of itself

DNA

transcription

• a usually short-lived RNA copy of the DNA is created through transcription

• RNA is exported to the cytoplasm to encode proteins

• some types of RNA do not encode proteins

transcription: the cell knows where to start!

• transcription is expensive and potentially damaging, so it is highly regulated at many levels:
  • signal sequences (activating or repressive)
  • chromatin structure
  • polymerase control (elongation speed, etc)
  • cleavage of nascent RNA

classical eukaryotic promoter

biological signals: how do we find these signal sequences in a big sequence?

motif finding

Simplest example: look for exact matches to a known motif

Next example: imperfect matches to a known motif

Finally: finding enriched motifs in a pile of sequences

additional questions: conservation throughout evolution, coordinated changes
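
The first two cases above (exact and imperfect matches to a known motif) can be sketched with a single sliding-window search. This is only a minimal illustration: it assumes "imperfect" means a bounded number of mismatches (Hamming distance), which is one of several possible definitions, and the function names are made up for this example.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_motif(sequence: str, motif: str, max_mismatches: int = 0) -> list[int]:
    """Return 0-based start positions where the motif matches the sequence
    with at most max_mismatches mismatches (0 means exact matches only)."""
    k = len(motif)
    return [
        start
        for start in range(len(sequence) - k + 1)
        if hamming_distance(sequence[start:start + k], motif) <= max_mismatches
    ]


# Exact match, then a search that tolerates one mismatch (TATC vs TATA).
exact_hits = find_motif("GGTATAGG", "TATA")
fuzzy_hits = find_motif("GGTATCGG", "TATA", max_mismatches=1)
```

Allowing mismatches finds more candidate sites, but it also raises the number of matches expected by chance, so the question of how surprising a hit is becomes more important.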

exact match to a known motif: TATA

the TATA box is one of many signals in DNA sequences; it marks the location of transcriptional initiation in a large percentage of eukaryotic genes.

does my sequence contain a TATA box?

ACGCTAGCGCATATAGCATGACTAGTATAGCTAGACGAGCTAGCATATCCGAT

finding an exact match to a known motif: TATA

ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT

[animation: the motif TATA is slid along the sequence and compared at each successive position]

exact match to a known motif: TATA

ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT TATA

how many comparisons are needed?

(hint: 4 comparisons for each position x # positions to be compared)
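
As a rough check on the hint, here is a sketch of the brute-force search that counts every character comparison. The `early_exit` flag is an addition for illustration, not something from the slide: it stops comparing at the first mismatch, which is what most practical implementations do.

```python
def naive_search(sequence: str, motif: str, early_exit: bool = False):
    """Slide the motif along the sequence, counting character comparisons.

    Returns (0-based match positions, number of comparisons made).
    """
    positions = []
    comparisons = 0
    k = len(motif)
    for start in range(len(sequence) - k + 1):
        matched = True
        for offset in range(k):
            comparisons += 1
            if sequence[start + offset] != motif[offset]:
                matched = False
                if early_exit:
                    break  # no need to look at the rest of this window
        if matched:
            positions.append(start)
    return positions, comparisons


SEQ = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
positions, comparisons = naive_search(SEQ, "TATA")
# 53 - 4 + 1 = 50 start positions, 4 comparisons each
```

Without early exit the count is simply (motif length) x (number of start positions); with early exit it depends on the sequence, but it can never be larger.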

exact match to a known motif: TATA

ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT TATA

are there ways to speed this up?

exact match to a known motif: TATA

ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT * * * * * * * * * * * *

reduce search space by flagging all Ts: how many comparisons are made?

find and catalog all 4mers in advance

how do we store that information?

lots of options:

• big text table
• hash table
• tree structure
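
Both speed-ups can be sketched in a few lines. This is an illustration under simple assumptions, with hypothetical function names: the first function flags only the positions that start with the motif's first letter, and the second catalogs every 4mer in a Python dictionary, which is a hash table, one of the storage options listed above.

```python
from collections import defaultdict


def candidate_starts(sequence: str, motif: str) -> list[int]:
    """Flag positions holding the motif's first letter that still leave
    room for the whole motif; only these need a full comparison."""
    last_start = len(sequence) - len(motif)
    return [i for i in range(last_start + 1) if sequence[i] == motif[0]]


def build_kmer_index(sequence: str, k: int) -> dict[str, list[int]]:
    """Catalog every k-mer in advance: map each k-mer to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return dict(index)


SEQ = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
flagged = candidate_starts(SEQ, "TATA")       # positions worth checking
index = build_kmer_index(SEQ, 4)
tata_positions = index.get("TATA", [])        # lookup instead of a scan
```

Building the index costs one pass over the sequence, but afterwards any 4mer can be looked up without rescanning. That trade pays off when many motifs are searched against the same sequence.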

exact match to a known motif: TATA

ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT

I found a gene!! or did I?

p(TATA) = ?

exact match to a known motif: TATA

ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT

p(TATA at any site) = p(T)*p(A)*p(T)*p(A)

and, assuming that the nucleotides are equally represented (which isn’t true),

p(TATA) = 0.25^4 = 0.0039

our sequence is 53 nucleotides long, so we have 50 possible start sites.

expect 50 * 0.0039 = 0.2 occurrences of TATA

so is our result surprising (do we have a gene)?

What if we’re working with a genome that is 80% AT?
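
A small sketch of this calculation makes it easy to change the base composition. It assumes that positions are independent and, for the 80% AT case, that A and T are equally frequent (40% each) and C and G are 10% each. That split is an assumption made here, not something the slide states.

```python
from math import prod


def motif_probability(motif: str, base_probs: dict[str, float]) -> float:
    """Probability of the motif at one site, assuming independent positions."""
    return prod(base_probs[base] for base in motif)


def expected_occurrences(motif: str, seq_len: int, base_probs: dict[str, float]) -> float:
    """Expected number of occurrences = number of start sites x p(motif)."""
    start_sites = seq_len - len(motif) + 1
    return start_sites * motif_probability(motif, base_probs)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40}  # assumed 80% AT split

expected_uniform = expected_occurrences("TATA", 53, uniform)  # 50 * 0.25**4
expected_at_rich = expected_occurrences("TATA", 53, at_rich)  # 50 * 0.40**4
```

Under the assumed AT-rich composition the expected count rises from about 0.2 to about 1.3, so a single TATA in a 53-nucleotide sequence is no longer surprising.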

imperfect motifs

Many proteins bind DNA or RNA with less strict sequence preferences.

good example: splicing

Our understanding is still very incomplete . . . but a cell knows how to do it!

[Figure, built up over five slides: structure of a eukaryotic gene, drawn 5’ to 3’. Labeled elements: enhancer; promoter; CAAT and TATA boxes; exon, intron, exon, intron, exon; polyA signal; 5’ UTR and 3’ UTR. Marked positions: start point for transcription; start point for translation (AUG); terminator for translation (UGA, UAA, UAG). Steps shown: transcription produces the pre-mRNA, splicing produces the mRNA, and translation follows.]

Alternative splicing

Splice site and branch site consensus sequences

The problem:

Consensus 5' and 3' splice site sequences and branch site sequences occur frequently in any genome

what is the probability of finding a GT sequence?

More information necessary to define bona fide exons

Splice site and branch site consensus sequences

UGACAUUACUGUGAGUAAAACUUGUUUUCAGGUACAGUAGUCGCAAGUCAUGGUAAGUCCUCUGACUUAACAGGUACUAUAUAUAAAGGAUUAGGUAUGUAUACCUUCAACACAGGUAACUGACUUGGGGCUGCAGGUACAGUCAUGAGUCAUGUCUGUAUCCUUUUGACCUUACAGUGUGAUGGGCAGAGAGGAUGAUGUAAGUAAUGGAUCAUUCGGGGUGAGUAUUUUCAAAAUGGGGGUAAGAAGACUUUCAACAAAGGUAAGACCAUUCAAAAAUAAGGUGAUUGGCACUAUGAAUUAGGUAAGAACUAUUGCGUAACAGGUGAGGCCCUUCGAGCAGAAGGUGAGAACUGACUGGAGCAAGGUAAUUGUGAGUAUGAUGAAGGUAAAUCUUUACAAACUGGAGGUACUUCAAUUUCUUUUUAGGGUUUCACUAAG

position   -2     -1     1      2
A          0.64   0      0      0
C          0.05   0      0      0
G          0.30   0.85   1      0
U          0.01   0.15   0      1
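
One way to use a position frequency table like this is to score a candidate site as the product of the per-position frequencies. The sketch below hard-codes the four columns from the table (positions -2, -1, 1, 2) and assumes that positions are independent. The names are illustrative only.

```python
# Frequencies copied from the table: positions -2, -1, +1, +2 around the 5' splice site.
SPLICE_SITE_FREQS = [
    {"A": 0.64, "C": 0.05, "G": 0.30, "U": 0.01},  # position -2
    {"A": 0.00, "C": 0.00, "G": 0.85, "U": 0.15},  # position -1
    {"A": 0.00, "C": 0.00, "G": 1.00, "U": 0.00},  # position +1
    {"A": 0.00, "C": 0.00, "G": 0.00, "U": 1.00},  # position +2
]


def site_probability(site: str) -> float:
    """Probability of a 4-nucleotide RNA site under the table,
    treating the four positions as independent."""
    if len(site) != len(SPLICE_SITE_FREQS):
        raise ValueError("site must span positions -2, -1, +1, +2")
    probability = 1.0
    for column, base in zip(SPLICE_SITE_FREQS, site):
        probability *= column[base]
    return probability


consensus_score = site_probability("AGGU")   # 0.64 * 0.85 * 1 * 1
impossible_score = site_probability("AGGA")  # 0: no A was seen at position +2
```

The zeros in the table make any unseen base score exactly 0, which is harsh for a table built from a small sample. Adding pseudocounts is a common remedy.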

Splice site and branch site consensus sequences

Cartegni et al. 2002

interpretation of sequence logos: If the letters occupy the entire vertical space, the height of each letter is the proportion of sequences with that base at that position. If the letters do not occupy the entire vertical space, the height of each letter typically signifies information content.

U1 U2 U2AF
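
For the second kind of logo, the information content of a position is commonly computed as 2 bits minus the Shannon entropy of its base frequencies. The sketch below uses that standard formula and ignores the small-sample correction that some logo tools apply. The function names are illustrative.

```python
from math import log2

MAX_BITS = 2.0  # log2(4): the most information one nucleotide position can carry


def information_content(column: dict[str, float]) -> float:
    """Information content in bits: 2 minus the Shannon entropy of the column."""
    entropy = -sum(p * log2(p) for p in column.values() if p > 0)
    return MAX_BITS - entropy


def letter_heights(column: dict[str, float]) -> dict[str, float]:
    """Height of each letter in an information-content logo:
    its frequency times the column's information content."""
    bits = information_content(column)
    return {base: p * bits for base, p in column.items()}


invariant = {"A": 0.0, "C": 0.0, "G": 1.0, "U": 0.0}      # fully conserved
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}    # no preference
minus_two = {"A": 0.64, "C": 0.05, "G": 0.30, "U": 0.01}  # position -2 from the table

bits_invariant = information_content(invariant)
bits_uniform = information_content(uniform)
heights_minus_two = letter_heights(minus_two)
```

A fully conserved position reaches the full 2 bits and a position with no preference has 0 bits. This is why weakly conserved positions look short in such logos.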

splice signals . . . not much information

how many times would the sequence GT occur in a 3 Gb genome with 60% AT? how about the sequence CAGGTAAG?
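
A back-of-the-envelope sketch for these two questions, under assumptions that are not on the slide: positions are independent, A and T are each 30%, G and C are each 20%, and only one strand is counted.

```python
from math import prod

GENOME_LENGTH = 3_000_000_000                           # 3 Gb
BASE_PROBS = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}   # assumed split of 60% AT


def expected_count(motif: str) -> float:
    """Expected occurrences of the motif on one strand of the genome."""
    p = prod(BASE_PROBS[base] for base in motif)
    return (GENOME_LENGTH - len(motif) + 1) * p


def mean_spacing(motif: str) -> float:
    """Average distance in bases between chance occurrences (1 / p)."""
    return 1.0 / prod(BASE_PROBS[base] for base in motif)


gt_count = expected_count("GT")              # roughly 1.8e8
consensus_count = expected_count("CAGGTAAG") # roughly 3.9e4
```

Even the 8-nucleotide consensus is expected tens of thousands of times by chance under this model, which is why the splice signals alone carry too little information to define exons.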

so how does the cell know how to splice things?

as we’ll see, eukaryotic gene prediction is a tricky problem! but cells know where their exons are . . . what are some approaches that we can take?

finding signals in eukaryotic genes: additional tools and approaches

• evolutionary conservation: take advantage of history

• open reading frames (we know what a protein-coding gene should look like!)

evolutionary conservation: where are mutations tolerated?

this doesn’t look like a random coincidence . . .

evolutionary conservation

probability of seeing a mutation is the product of the probability of mutation occurrence (mutation rate) and the probability of retaining the mutation (selection)

surrogate for biological importance?

evolutionary conservation

to think about this in a meaningful way we need metrics. How do you define “surprisingly conserved”?

open reading frames

if you are looking for protein-coding genes, you also want to look for open reading frames.

open reading frames

There are 64 codons and 3 of them are stop codons. If the codons are equally likely to appear, how long would you expect an ORF to be, in random sequence?

frequency of stop codon = 3/64

expected probability of stop codon in random sequence = 3/64

open reading frames

frequency of stop codon = 3/64

expected probability of stop codon in random sequence p = 3/64

the length of ORFs in random sequence follows a geometric distribution (the negative binomial with r = 1), with mean 1/p (21.3 codons, or 64 nucleotides)
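
The 1/p figure can be checked with a quick simulation. This sketch assumes equally likely bases and counts codons up to and including the first stop codon, which is the quantity whose mean is 1/p. The names, trial count and seed are arbitrary choices.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}
BASES = "ACGT"


def random_orf_length(rng: random.Random) -> int:
    """Draw random codons until a stop codon appears; return the number
    of codons drawn, counting the stop codon itself."""
    length = 0
    while True:
        length += 1
        codon = "".join(rng.choice(BASES) for _ in range(3))
        if codon in STOP_CODONS:
            return length


def mean_orf_length(trials: int = 20_000, seed: int = 0) -> float:
    """Average simulated ORF length in codons over many trials."""
    rng = random.Random(seed)
    return sum(random_orf_length(rng) for _ in range(trials)) / trials


simulated_mean = mean_orf_length()
theoretical_mean = 64 / 3  # 1/p with p = 3/64
```

The simulated mean should land close to 21.3 codons. Real coding ORFs are usually far longer than this, which is what makes ORF length a useful signal.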

put the signals together to make a gene:

signal         intron              exon                     intergenic
ORF            mean length 20aa    spans exon               mean length 20aa
splice site    random occurrence   at boundaries + random   random occurrence
conservation   low                 high                     low

this is, of course, an immense oversimplification and ignores lots of biological entities (pseudogenes, conserved noncoding regions etc). Also, these are noisy signals!

highly conserved locus

finding signals in DNA

• lots of data available for various organisms, to infer conservation, binding sites, function

• look for known motifs, known sequence signals

• mine experimental data for overrepresented sequences and motifs

• laboratory approaches for exploration (e.g. mutagenesis) and confirmation

• gene finding algorithms may assess as many data modalities as possible, with weighting

• First generation sequencing
• Pairwise sequence alignment
• Dot plots
• Needleman-Wunsch and Smith-Waterman
• BLAST, gapped BLAST
• Phylogeny
• Multiple sequence alignment
• Next (and nextnext) generation sequencing
• Short read and not-so-short read alignment
• Hidden Markov Models
• ChIPseq variant calling
• Gene expression: approaches and statistics
• Functional analysis in genome space
• Alignment-free sequence comparison
• Metagenomics
• Visualizing big data: Circos, Hive plots