
Analysis of Biological Sequences SPH 140.638


nuts and bolts

• meet Tuesdays & Thursdays, 8:30-9:50

• no exam; grade derived from 3-4 homework assignments plus a final project (open book, open note, collaborations allowed as long as work is not copied)

• no single recommended textbook. Website has a few recommendations with guidance for choosing a resource.

• I will try to keep it updated with upcoming lecture notes, a “daily dozen” for each lecture, and homework assignments

• “daily dozen” is just some questions (probably not always 12!) that you don’t have to turn in but that you should be able to answer easily after each lecture

Course objectives

• describe the algorithms used in estimating function of biological sequences

• determine which methods are appropriate for analyzing sequences derived from different experiments

• design analysis pipelines that are biologically meaningful and mathematically rigorous

concepts covered

• algorithms, including
  • HMM
  • MCMC
  • dynamic programming
  • heuristic methods
  • enrichment of spatial associations

• experimental methods
  • ChIP
  • RNAseq
  • bisulfite, RRBS, MBDseq, MeDIP
  • variant calling
  • HiC & similar structural methods

waaaay back: prebiotic soup/primordial sandwich

early Earth was too hot for stable molecules, but as atmosphere cooled, molecules formed at random

many hypotheses about what happened next . . .

but eventually molecules appeared that had catalytic capabilities

and could replicate themselves.

amazing property of nucleotides

RNA

• single stranded but self-complementary, so complex 3D structures with enzymatic capacity are possible

RNA

• amino acids were likely also present in the prebiotic soup

Next steps

• an RNA that gained a permanent function could out-reproduce other RNAs

• proteins are much more stable than RNA

• proteins are linear arrangements of information (like RNA)

RNA encodes proteins

the genetic code is a wobbling degenerate

protein synthesis (translation)

unsurprisingly, protein synthesis involves large RNA/protein complexes (ribosomes)

protein synthesis

• translation is energetically expensive

• highly regulated

• ribosomes have proofreading functions

• all components are recycled

becoming a useful protein

and then?

RNA and proteins were working well, and there were probably many “genetic codes” . . . but the RNA that won was the one that invented a stable version of itself

transcription

• a usually short-lived RNA copy of the DNA is created through transcription

• RNA is exported to the cytoplasm to encode proteins

• some types of RNA do not encode proteins

transcription: the cell knows where to start!

• transcription is expensive and potentially damaging, so it is highly regulated at many levels:

  • signal sequences (activating or repressive)
  • chromatin structure
  • polymerase control (elongation speed, etc.)
  • cleavage of nascent RNA

classical eukaryotic promoter

biological signals: how do we find these signal sequences in a big sequence?

motif finding

Simplest example: look for exact matches to a known motif

Next example: imperfect matches to a known motif

Finally: finding enriched motifs in a pile of sequences

additional questions: conservation throughout evolution, coordinated changes

exact match to a known motif: TATA

the TATA box is one of many signals in DNA sequences that mark the location of transcriptional initiation in a large percentage of eukaryotic genes.

does my sequence contain a TATA box?

ACGCTAGCGCATATAGCATGACTAGTATAGCTAGACGAGCTAGCATATCCGAT

finding an exact match to a known motif: TATA

ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
TATA (the motif is slid along the sequence one position at a time and compared at each position)

how many comparisons are needed?

(hint: 4 comparisons for each position x # positions to be compared)
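The counting in the hint can be checked with a minimal sketch of the sliding comparison. The function name `naive_search` and the `early_exit` switch are illustrative choices, not part of the lecture; positions are reported 0-based.

```python
def naive_search(seq, motif, early_exit=False):
    """Slide motif along seq; return (0-based hit positions, character comparisons made)."""
    hits = []
    comparisons = 0
    for start in range(len(seq) - len(motif) + 1):
        matched = True
        for offset, base in enumerate(motif):
            comparisons += 1
            if seq[start + offset] != base:
                matched = False
                if early_exit:
                    break  # stop at the first mismatch instead of checking all 4 bases
        if matched:
            hits.append(start)
    return hits, comparisons


seq = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
hits, n = naive_search(seq, "TATA")
hits_early, n_early = naive_search(seq, "TATA", early_exit=True)
print(hits, n)        # 50 start positions x 4 comparisons each
print(hits_early, n_early)
```

Stopping at the first mismatch already removes most of the work, because most start positions fail on the first base.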


ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT TATA

are there ways to speed this up?


ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
    *      * *    *   *  * *   *        *    * *    *

reduce search space by flagging all Ts—how many comparisons are made?
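One way to count the work for the flag-the-Ts idea is sketched below, assuming one comparison per base to flag the Ts and then three further comparisons at each flagged site where the motif still fits. The name `flag_then_search` is hypothetical, and other ways of counting are equally defensible.

```python
def flag_then_search(seq, motif):
    """Flag every occurrence of the motif's first base, then test only the flagged sites."""
    comparisons = 0
    flagged = []
    for i, base in enumerate(seq):
        comparisons += 1  # one comparison per base to find the first letter of the motif
        if base == motif[0]:
            flagged.append(i)

    hits = []
    for start in flagged:
        if start + len(motif) > len(seq):
            continue  # too close to the end for the motif to fit
        matched = True
        for offset in range(1, len(motif)):
            comparisons += 1  # remaining bases of the motif
            if seq[start + offset] != motif[offset]:
                matched = False
        if matched:
            hits.append(start)
    return flagged, hits, comparisons


seq = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
flagged, hits, n = flag_then_search(seq, "TATA")
print(len(flagged), hits, n)
```

Under this accounting the total is 53 comparisons for flagging plus 3 for each of the 11 usable flagged sites.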

find and catalog all 4mers in advance

how do we store that information? lots of options:

• big text table
• hash table
• tree structure
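The hash table option can be sketched with a Python dictionary that maps each 4mer to the list of positions where it starts. The function name `index_kmers` is an assumption made for illustration; positions are 0-based.

```python
from collections import defaultdict


def index_kmers(seq, k=4):
    """Map every k-mer in seq to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        index[seq[start:start + k]].append(start)
    return index


seq = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
index = index_kmers(seq)
print(index.get("TATA", []))  # positions of TATA, found with a single lookup
print(index.get("GGGG", []))  # a 4mer that is absent gives an empty list
```

Building the index costs one pass over the sequence; after that, any 4mer query is a single lookup, at the price of storing the table.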


ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT

I found a gene!! or did I?

p(TATA) = ?


ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT

p(TATA at any site) = p(T)*p(A)*p(T)*p(A)

and, assuming that the nucleotides are equally represented (which isn’t true):

p(TATA) = 0.25^4 = 0.0039

our sequence is 53 nucleotides long, so we have 50 possible start sites.

expect 50 * 0.0039 = 0.2 occurrences of TATA

so is our result surprising (do we have a gene)?

What if we’re working with a genome that is 80% AT?
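The expected-count arithmetic can be sketched as follows. It assumes independent positions and, for the 80% AT case, that A and T are equally common (0.4 each) and C and G are equally common (0.1 each); that split is an assumption, since the slide only gives the total AT content. The name `expected_occurrences` is illustrative.

```python
def expected_occurrences(motif, seq_len, base_freq):
    """Return (p(motif at one site), expected count) assuming independent positions."""
    p = 1.0
    for base in motif:
        p *= base_freq[base]
    n_sites = seq_len - len(motif) + 1  # 53 - 4 + 1 = 50 possible start sites
    return p, n_sites * p


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}  # assumed split of an 80% AT genome

p_uniform, e_uniform = expected_occurrences("TATA", 53, uniform)
p_at, e_at = expected_occurrences("TATA", 53, at_rich)
print(p_uniform, e_uniform)
print(p_at, e_at)
```

With equal base frequencies a single TATA in 53 nucleotides is mildly unexpected; in the AT-rich case more than one chance occurrence is expected, so the same hit carries much less weight.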

imperfect motifs

Many proteins bind DNA or RNA with less strict sequence preferences.

good example: splicing

Our understanding is still very incomplete . . . but a cell knows how to do it!

figure: structure of a eukaryotic gene, drawn 5’ to 3’, with these elements labeled: enhancer, promoter (CAAT and TATA boxes), exon, intron, exon, intron, exon, polyA signal, 5’ UTR, 3’ UTR. Marked on the gene: start point for transcription; start point for translation (AUG); terminator for translation (UGA, UAA, UAG).

figure, built up in steps: transcription produces the pre-mRNA; splicing produces the mRNA; translation follows.

Alternative splicing

Splice site and branch site consensus sequences

The problem:

Consensus 5' and 3' splice site sequences and branch site sequences occur frequently in any genome

what is the probability of finding a GT sequence?

More information is necessary to define bona fide exons


UGACAUUACUGUGAGUAAAACUUGUUUUCAGGUACAGUAGUCGCAAGUCAUGGUAAGUCCUCUGACUUAACAGGUACUAUAUAUAAAGGAUUAGGUAUGUAUACCUUCAACACAGGUAACUGACUUGGGGCUGCAGGUACAGUCAUGAGUCAUGUCUGUAUCCUUUUGACCUUACAGUGUGAUGGGCAGAGAGGAUGAUGUAAGUAAUGGAUCAUUCGGGGUGAGUAUUUUCAAAAUGGGGGUAAGAAGACUUUCAACAAAGGUAAGACCAUUCAAAAAUAAGGUGAUUGGCACUAUGAAUUAGGUAAGAACUAUUGCGUAACAGGUGAGGCCCUUCGAGCAGAAGGUGAGAACUGACUGGAGCAAGGUAAUUGUGAGUAUGAUGAAGGUAAAUCUUUACAAACUGGAGGUACUUCAAUUUCUUUUUAGGGUUUCACUAAG

position   -2     -1     1      2
A          0.64   0      0      0
C          0.05   0      0      0
G          0.30   0.85   1      0
U          0.01   0.15   0      1


Cartegni et al. 2002
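A frequency matrix like this one can be used to score imperfect matches. The sketch below multiplies the per-position frequencies for each 4-base window, which assumes the positions are independent; it also assumes the columns -2, -1, 1, 2 are four consecutive positions around the splice junction. The names `PWM`, `pwm_probability` and `scan` and the cutoff `min_p` are illustrative.

```python
# Frequencies copied from the table; list index 0..3 corresponds to positions -2, -1, 1, 2.
PWM = {
    "A": [0.64, 0.00, 0.0, 0.0],
    "C": [0.05, 0.00, 0.0, 0.0],
    "G": [0.30, 0.85, 1.0, 0.0],
    "U": [0.01, 0.15, 0.0, 1.0],
}
WIDTH = 4


def pwm_probability(window):
    """Probability of a 4-base window under the matrix, assuming independent positions."""
    p = 1.0
    for pos, base in enumerate(window):
        p *= PWM[base][pos]
    return p


def scan(seq, min_p=0.1):
    """Return (0-based start, window, probability) for windows scoring at least min_p."""
    results = []
    for start in range(len(seq) - WIDTH + 1):
        window = seq[start:start + WIDTH]
        p = pwm_probability(window)
        if p >= min_p:
            results.append((start, window, p))
    return results


# The first 40 bases of the example RNA sequence above.
fragment = "UGACAUUACUGUGAGUAAAACUUGUUUUCAGGUACAGUAG"
hits = scan(fragment)
print(hits)
```

Because positions 1 and 2 are invariant in this matrix, any window without GU there scores zero; the two upstream positions only rank the remaining candidates.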

interpretation of sequence logos: If the letters occupy the entire vertical space, the height of each letter is the proportion of sequences with that base at that position. If the letters do not occupy the entire vertical space, the height of each letter typically signifies information content.
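For the information-content style of logo, the height of a column is commonly computed as 2 bits minus the Shannon entropy of the column, and each letter gets its frequency times that height. The sketch below applies this to the four columns of the frequency table above, ignoring the small-sample correction that logo software usually applies; the names `COLUMNS`, `information_content` and `letter_heights` are illustrative.

```python
import math

# Columns of the splice-site frequency table, keyed by position.
COLUMNS = {
    "-2": {"A": 0.64, "C": 0.05, "G": 0.30, "U": 0.01},
    "-1": {"A": 0.00, "C": 0.00, "G": 0.85, "U": 0.15},
    "1": {"A": 0.00, "C": 0.00, "G": 1.00, "U": 0.00},
    "2": {"A": 0.00, "C": 0.00, "G": 0.00, "U": 1.00},
}


def information_content(column):
    """Bits of information in one column: log2(4) minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


def letter_heights(column):
    """Height of each letter in an information-content logo: frequency x column height."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}


for position, column in COLUMNS.items():
    print(position, round(information_content(column), 3), letter_heights(column))
```

An invariant position reaches the full 2 bits, while a position with mixed bases is drawn much shorter even if one base dominates.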

U1 U2 U2AF

splice signals . . . not much information

how many times would the sequence GT occur in a 3 Gb genome with 60% AT?

how about the sequence CAGGTAAG?
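A rough answer can be sketched under simplifying assumptions: positions are independent, 60% AT means p(A) = p(T) = 0.3 and p(C) = p(G) = 0.2, the genome is exactly 3 billion bases, and only one strand is counted. The name `expected_count` is illustrative.

```python
GENOME_LENGTH = 3_000_000_000
BASE_FREQ = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}  # assumed split of 60% AT


def expected_count(motif, genome_length=GENOME_LENGTH, base_freq=BASE_FREQ):
    """Expected number of occurrences of motif on one strand, assuming independent bases."""
    p = 1.0
    for base in motif:
        p *= base_freq[base]
    return p * (genome_length - len(motif) + 1)


gt_count = expected_count("GT")
site_count = expected_count("CAGGTAAG")
print(f"GT: about {gt_count:.3g} occurrences")
print(f"CAGGTAAG: about {site_count:.3g} occurrences")
```

Under these assumptions GT is expected roughly 180 million times and even the 8-base sequence roughly 39,000 times, so neither is specific enough on its own to identify a real splice site.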

so how does the cell know how to splice things?

as we’ll see, eukaryotic gene prediction is a tricky problem! but cells know where their exons are . . . what are some approaches that we can take?

finding signals in eukaryotic genes: additional tools and approaches

• evolutionary conservation: take advantage of history

• open reading frames (we know what a protein-coding gene should look like!)

evolutionary conservation: where are mutations tolerated?

this doesn’t look like a random coincidence . . .

evolutionary conservation

probability of seeing a mutation is the product of the probability of mutation occurrence (mutation rate) and the probability of retaining the mutation (selection)

surrogate for biological importance?

evolutionary conservation

to think about this in a meaningful way we need metrics. How do you define “surprisingly conserved”?

open reading frames

if you are looking for protein-coding genes, you also want to look for open reading frames.

open reading frames

There are 64 codons and 3 of them are stop codons. If the codons are equally likely to appear, how long would you expect an ORF to be, in random sequence?

frequency of stop codons = 3/64

expected probability of a stop codon at any codon position in random sequence: p = 3/64

the length of ORFs in random sequence follows a geometric distribution (the negative binomial distribution with r = 1), with mean 1/p (21.3 codons, or 64 nucleotides)
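The mean of 1/p can be checked numerically, and the same model gives the chance that a long ORF arises by accident. The sketch below assumes all 64 codons are equally likely and counts length as the number of codons read up to and including the first stop; the function names are illustrative.

```python
import random

P_STOP = 3 / 64  # probability that a random codon is a stop codon


def prob_orf_longer_than(k, p=P_STOP):
    """Probability of reading k codons in a row without meeting a stop codon."""
    return (1 - p) ** k


def simulate_orf_length(rng, p=P_STOP):
    """Number of random codons read up to and including the first stop codon."""
    length = 1
    while rng.random() >= p:  # this codon is not a stop, so keep reading
        length += 1
    return length


mean_codons = 1 / P_STOP
rng = random.Random(0)  # fixed seed so the simulation is repeatable
lengths = [simulate_orf_length(rng) for _ in range(20000)]
sim_mean = sum(lengths) / len(lengths)

print(f"theoretical mean: {mean_codons:.1f} codons ({3 * mean_codons:.0f} nucleotides)")
print(f"simulated mean:   {sim_mean:.1f} codons")
print(f"P(more than 100 codons without a stop) = {prob_orf_longer_than(100):.4f}")
```

The tail probability is the useful part for gene finding: an open stretch of more than 100 codons is rare in random sequence, so long ORFs are a signal worth following up.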

put the signals together to make a gene:

signal         intron              exon                     intergenic
ORF            mean length 20aa    spans exon               mean length 20aa
splice site    random occurrence   at boundaries + random   random occurrence
conservation   low                 high                     low

this is, of course, an immense oversimplification and ignores lots of biological entities (pseudogenes, conserved noncoding regions etc). Also, these are noisy signals!

highly conserved locus

finding signals in DNA

• lots of data available for various organisms, to infer conservation, binding sites, function

• look for known motifs, known sequence signals

• mine experimental data for overrepresented sequences and motifs

• laboratory approaches for exploration (e.g. mutagenesis) and confirmation

• gene finding algorithms may assess as many data modalities as possible, with weighting

• First generation sequencing
• Pairwise sequence alignment
• Dot plots
• Needleman-Wunsch and Smith-Waterman
• BLAST, gapped BLAST
• Phylogeny
• Multiple sequence alignment
• Next (and nextnext) generation sequencing
• Short read and not-so-short read alignment
• Hidden Markov Models
• ChIPseq
• variant calling
• Gene expression: approaches and statistics
• Functional analysis in genome space
• Alignment-free sequence comparison
• Metagenomics
• Visualizing big data: Circos, Hive plots
