last lecture summary

Upload: varden

Post on 24-Feb-2016


TRANSCRIPT

Page 1: Last lecture summary

Last lecture summary

Page 2: Last lecture summary

New generation sequencing (NGS)
• The completion of the human genome was just the start of the modern DNA sequencing era – “high-throughput next generation sequencing” (NGS).
• New approaches reduce time and cost.
• Holy Grail of sequencing – a complete human genome for less than $1000.
• 1st generation – Sanger dideoxy method
• 2nd generation – sequencing by synthesis (pyrosequencing)
• 3rd generation – single molecule sequencing

Page 3: Last lecture summary

cDNA, EST libraries
• cDNA – made by reverse transcriptase, contains only expressed genes (no introns)
• cDNA library – a collection of different DNA sequences that have been incorporated into a vector
• EST – Expressed Sequence Tag
  • short, unedited (single-pass read), randomly selected subsequence (200-800 bp) of a cDNA sequence, generated from either the 5' or the 3' end
  • higher quality in the middle
• cDNA/EST – direct evidence of the transcriptome

Page 4: Last lecture summary

What is sequence alignment?

CTTTTCAAGGCTTA GGCTTATTATTGC

CTTTTCAAGGCTTA GGCTATTATTGC

CTTTTCAAGGCTTA GGCT-ATTATTGC

Fragment overlaps

Page 5: Last lecture summary

What is sequence alignment?

“EST clustering”

CCCCATGGTGGCGGCAGGTGACAG
CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
TGGCGGCTCGGGACAGTCGCGCATAAT
CCATGGTGGTGGCTGGGGATAGTA
TGAGGCAGTCGCGCATAATTCCG

consensus

TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG

Page 6: Last lecture summary

Sequence alignment
• Procedure of comparing sequences
• Point mutations – easy (gapless alignment)

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT

• More difficult example

ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

• However, gaps can be inserted to get something like this (gapped alignment)

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT

• insertion × deletion – indel

Page 7: Last lecture summary

Why align sequences – continuation
• The draft human genome is available
• Automated gene finding is possible
• Gene: AGTACGTATCGTATAGCGTAA
• What does it do?
• One approach: Is there a similar gene in another species?
  • Align sequences with known genes
  • Find the gene with the “best” match

Page 8: Last lecture summary

Flavors of sequence alignment
• gapped × gapless
• pairwise × multiple
• global × local

Page 9: Last lecture summary

Evolution of sequences
• The sequences are the products of molecular evolution.
• When sequences share a common ancestor, they tend to exhibit similarity in their sequences, structures and biological functions.

[Diagram labels: DNA1, DNA2; Protein1, Protein2; sequence similarity, similar 3D structure, similar function]

Similar sequences produce similar proteins.

However, this statement is not a rule. See Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. 2000;1(5). PMID: 11178260.

Page 10: Last lecture summary

Homology
• Sequences diverge over time
• Common ancestor – homologous sequences
• The variation between sequences – changes that occurred during evolution in the form of substitutions (mutations) and/or indels.
• Traces of evolution may still remain in certain portions of the sequences, allowing the common ancestry to be identified.
• Residues performing key roles are conserved (preserved) by natural selection.
• Orthology vs paralogy

Page 11: Last lecture summary

New stuff

Page 12: Last lecture summary

Scoring systems I
• DNA and protein sequences can be aligned so that the number of identically matching pairs is maximized.

A T T G - - - T
A - - G A C A T

• Counting the number of matches gives us a score (3 in this case). A higher score means a better alignment.
• This procedure can be formalized using a substitution matrix.

Identity matrix

    A  T  C  G
A   1
T   0  1
C   0  0  1
G   0  0  0  1
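The counting of identical pairs can be sketched in a few lines of Python. This is only an illustration of the identity scoring described on the slide; the function name `identity_score` and the rule that a gap never counts as a match are assumptions made for this sketch.

```python
# Identity scoring: +1 for every column in which both rows carry the same
# residue, 0 otherwise. Gap characters never count as a match (assumption).
GAP = "-"

def identity_score(seq_a: str, seq_b: str) -> int:
    """Count identically matching columns of two already aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != GAP)

# The alignment from the slide, written without spaces.
top = "ATTG---T"
bottom = "A--GACAT"
print(identity_score(top, bottom))  # 3
```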

Page 13: Last lecture summary

Scoring systems II
• Identity matrix: adequate for nucleic acids, not sufficient for proteins.
• Amino acids are not exchanged with equal probability, as one might assume in theory.
• For example, substitution of aspartic acid (D) by glutamic acid (E) is frequently observed, whereas a change from aspartic acid to tryptophan (W) is very rare.

[Figure: D, E, W]

Page 14: Last lecture summary

Scoring systems II
• Why is that?

1. Triplet-based genetic code

GAT (D) → GAA (E) needs one nucleotide change, GAT (D) → TGG (W) needs three.

2. D and E have similar properties, but D and W differ considerably. D is hydrophilic, W is hydrophobic; a D → W mutation can greatly alter the 3D structure and consequently the function.
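The first argument can be checked by counting how many nucleotide positions differ between two codons. The helper name `codon_differences` is hypothetical, and the sketch only compares the specific codons quoted on the slide; each amino acid has other codons as well.

```python
def codon_differences(codon_a: str, codon_b: str) -> int:
    """Number of positions at which two codons of equal length differ."""
    if len(codon_a) != len(codon_b):
        raise ValueError("codons must have equal length")
    return sum(1 for x, y in zip(codon_a, codon_b) if x != y)

print(codon_differences("GAT", "GAA"))  # 1  (D -> E)
print(codon_differences("GAT", "TGG"))  # 3  (D -> W)
```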

Page 15: Last lecture summary

Genetic code

http://www.doctortee.com/dsu/tiftickjian/bio100/gene-expression.html

Page 16: Last lecture summary

Gaps or no gaps

Page 17: Last lecture summary

Scoring DNA sequence alignment (1)
• Match score: +1
• Mismatch score: +0
• Gap penalty: –1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (–1)

Score = +11
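The arithmetic above can be reproduced with a short Python sketch. The function name `score_linear` and its parameter names are assumptions; the scores are the ones given on the slide, and every gap position is penalized equally.

```python
def score_linear(seq_a: str, seq_b: str,
                 match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Score an existing alignment with a constant penalty per gap position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap          # gap position
        elif a == b:
            score += match        # identical residues
        else:
            score += mismatch     # substitution
    return score

top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_linear(top, bottom))  # 11  (18 matches, 2 mismatches, 7 gaps)
```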

Page 18: Last lecture summary

Length penalties
• We want to find alignments that are evolutionarily likely.
• Which of the following alignments seems more likely to you?

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

• We can achieve this by penalizing the opening of a new gap more than the extension of an existing gap.

Page 19: Last lecture summary

Scoring DNA sequence alignment (2)
• Match/mismatch score: +1/+0
• Origination/length penalty: –2/–1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Origination: 2 × (–2)
• Length: 7 × (–1)

Score = +7
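A minimal sketch of the origination/length scheme follows. The name `score_affine` and its parameters are assumptions, and as a simplification a gap in one row directly followed by a gap in the other row is treated as a single run. Applied to the two candidate alignments from the "Length penalties" slide, the single long gap scores much better than the six scattered gaps, although both contain 20 matches and 7 gap positions.

```python
def score_affine(seq_a: str, seq_b: str, match: int = 1, mismatch: int = 0,
                 gap_open: int = -2, gap_extend: int = -1) -> int:
    """Score an existing alignment: each gap run pays gap_open once,
    and every gap position (including the first) pays gap_extend."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open     # origination penalty, once per run
            score += gap_extend       # length penalty, once per position
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score

top = "ACGTCTGATACGCCGTATAGTCTATCT"
print(score_affine(top, "----CTGATTCGC---ATCGTCTATCT"))  # 7
print(score_affine(top, "ACGTCTGAT-------ATAGTCTATCT"))  # 11 (one gap run)
print(score_affine(top, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # 1  (six gap runs)
```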

Page 20: Last lecture summary

Substitution matrices
• Substitution (score) matrices give scores for amino acid substitutions. A higher score means the substitution is more probable.
• Conservative substitutions – conserve the physical and chemical properties of the amino acids, limit structural/functional disruption
• Substitution matrices should reflect:
  • Physicochemical properties of amino acids.
  • Different frequencies of individual amino acids occurring in proteins.
  • Interchangeability of the genetic code.

Page 21: Last lecture summary

PAM matrices I
• How to assign scores? Let’s get nature – evolution – involved!
• If you choose a set of proteins with very similar sequences, you can do the alignment manually.
• Also, if the sequences in your set are similar, there is a high probability that each amino acid difference is due to a single mutation.
• From the frequencies of mutations in the set of similar protein sequences, probabilities of substitutions can be derived.
• This is exactly the approach taken by Margaret Dayhoff in 1978 to construct PAM (Point Accepted Mutation) matrices.

Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.

Page 22: Last lecture summary

PAM matrices II
• Alignments of 71 groups of very similar (at least 85% identity) protein sequences. 1572 substitutions were found.
• These mutations do not significantly alter the protein function. Hence they are called accepted mutations (accepted by natural selection).
• Probabilities that any one amino acid would mutate into any other were calculated.
• If I know the probabilities for the individual amino acids, what is the probability for the given sequence?
  • Product
• But to calculate the score, we would like to sum, not multiply. How to achieve this?
  • Logarithm

Excellent discussion of the derivation and use of PAM matrices: George DG, Barker WC, Hunt LT. Mutation data matrix and its uses. Methods Enzymol. 1990;183:333-51. PMID: 2314281.
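The product-versus-sum point can be illustrated numerically. The three probabilities below are hypothetical values chosen only for the demonstration; they are not taken from Dayhoff's data.

```python
import math

# Hypothetical per-position probabilities (illustration only).
probs = [0.5, 0.25, 0.1]

# Probability of the whole sequence: product of the per-position values.
product = math.prod(probs)

# Working with logarithms turns the product into a sum.
log_sum = sum(math.log10(p) for p in probs)

print(product)        # probability computed by multiplying
print(10 ** log_sum)  # same value recovered from the summed logarithms
```

Summing logarithms is what allows an alignment score to be the sum of per-position scores.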

Page 23: Last lecture summary

PAM matrices III
• Dayhoff’s definition of accepted mutation was thus based on empirically observed amino acid substitutions.
• The unit used is a PAM. Two sequences are 1 PAM apart if they have 99% identical residues.
• The PAM1 matrix is the result of computing the probability of one substitution per 100 amino acids.
• The PAM1 matrix represents probabilities of point mutations over a certain evolutionary time.
  • in Drosophila 1 PAM corresponds to ~2.62 million years
  • in Human 1 PAM corresponds to ~4.58 million years

Page 24: Last lecture summary

PAM1 matrix

numbers are multiplied by 10 000

Page 25: Last lecture summary

Higher PAM matrices
• What to do if I want to get probabilities over a much longer evolutionary time?
• Dayhoff proposed a model of evolution that is a Markov process.
• Such a Markov process can be treated as a linear dynamical system.

Page 26: Last lecture summary

Linear dynamical system I

A new species of frog has been introduced into an area where it has too few natural predators. In an attempt to restore the ecological balance, a team of scientists is considering introducing a species of bird which feeds on this frog. Experimental data suggest that the populations of frogs and birds from one year to the next can be modeled by linear relationships. Specifically, it has been found that if the quantities F_k and B_k represent the populations of the frogs and birds in the k-th year, then

[equations not preserved]

The question is this: in the long run, will the introduction of the birds reduce or eliminate the frog population growth?

Page 27: Last lecture summary

Linear dynamical system II

• So this system evolves in time according to x(k+1) = A x(k). Such a system is called a discrete linear dynamical system; the matrix A is called the transition matrix.
• If we need to know the state of the system at time k = 50, we have to compute x(50) = A^50 x(0).
• And the same is true for Dayhoff’s model of evolution.
• If we need to obtain probability matrices for a higher percentage of accepted mutations (i.e. covering a longer evolutionary time), we take matrix powers.
• Let’s say we want PAM120 – 120 mutations fixed on average per 100 residues. We compute PAM1^120.
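A minimal sketch of the matrix-power step, in plain Python. The 3 × 3 matrix `toy_pam1` is hypothetical (a real PAM1 matrix is 20 × 20 and its values are not reproduced here), and the choice to make each row sum to 1 is an assumption of this sketch; the helper names `mat_mul` and `mat_pow` are also made up.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    # Start from the identity matrix and multiply `power` times.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result

# Hypothetical 3-state transition matrix: each row sums to 1 and about
# 1% of the residues change per step, mimicking the idea of PAM1.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

toy_pam120 = mat_pow(toy_pam1, 120)
for row in toy_pam120:
    print([round(p, 3) for p in row])
```

Each row of the result still sums to 1, but the diagonal (probability of seeing the same residue) is much smaller than in the one-step matrix.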

Page 28: Last lecture summary

Higher PAM matrices
• Biologically, the PAM120 matrix means that in 100 amino acids there have been 120 substitutions on average, while in PAM250 there have been 2.5 amino acid mutations at each site.
• This may sound unusual, but remember that over evolutionary time it is possible that an alanine was changed to glycine, then to valine, and then back to alanine.
• These are called silent substitutions.

Page 29: Last lecture summary

PAM 120

[Figure: PAM120 matrix with amino acids grouped as small, polar; small, nonpolar; polar or acidic; basic; large, hydrophobic; aromatic. Source: Zvelebil, Baum, Understanding bioinformatics.]

• Positive score – frequency of substitutions is greater than would have occurred by random chance.
• Zero score – frequency is equal to that expected by chance.
• Negative score – frequency is less than would have occurred by random chance.
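The sign rule follows from expressing a score as the logarithm of a ratio between observed and chance-expected frequency. The sketch below assumes a log-odds form with base 10 and a scaling factor of 10; the function name `log_odds` and the example frequencies are hypothetical.

```python
import math

def log_odds(observed: float, expected: float, scale: float = 10.0) -> float:
    """Log-odds score: positive when a substitution is seen more often
    than expected by chance, zero when equal, negative when rarer."""
    return scale * math.log10(observed / expected)

print(log_odds(0.02, 0.01))   # observed twice as often as chance -> positive
print(log_odds(0.01, 0.01))   # equal to chance -> 0.0
print(log_odds(0.005, 0.01))  # half as often as chance -> negative
```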

Page 30: Last lecture summary

PAM matrices assumptions
• Mutation of an amino acid is independent of previous mutations at the same position (Markov process requirement).
• Only PAM1 was “measured”; all others are extrapolations (i.e. predictions based on some model).
• Each amino acid position is equally mutable.
• Mutations are assumed to be independent of surrounding residues.
• Forces responsible for sequence evolution over short times are the same as those over longer times.
• PAM matrices are based on protein sequences available in 1978 (bias towards small, globular proteins)
• New generation of Dayhoff-type matrices – e.g. PET91

Page 31: Last lecture summary

How to calculate score?

[Figure: alignment scored with a substitution matrix. Source: Selzer, Applied bioinformatics.]
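Scoring an ungapped protein alignment with a substitution matrix amounts to looking up each aligned pair and summing. The values in `TOY_SCORES` are hypothetical and are not taken from any real PAM matrix; they only mirror the idea that D ↔ E is favorable and D ↔ W is not. The function names are likewise assumptions.

```python
# Hypothetical toy scores for three amino acids; NOT values from a real
# PAM matrix. Only one triangle is stored because the matrix is symmetric.
TOY_SCORES = {
    ("D", "D"): 4, ("E", "E"): 4, ("W", "W"): 10,
    ("D", "E"): 3, ("D", "W"): -7, ("E", "W"): -7,
}

def pair_score(a: str, b: str) -> int:
    """Look up the score for a residue pair in either order."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]

def alignment_score(seq_a: str, seq_b: str) -> int:
    """Sum of pair scores over an ungapped alignment of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))

print(alignment_score("DEW", "EEW"))  # 3 + 4 + 10 = 17
print(alignment_score("DEW", "WEW"))  # -7 + 4 + 10 = 7
```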
