sequence comparison: introduction and...

Scoring Alignments Genome 373 Genomic Informatics Elhanan Borenstein


Post on 06-Jun-2020


Page 1: Sequence comparison: Introduction and motivationelbo.gs.washington.edu/.../slides/2A-Sequence_comparison_scoring.… · •Sequence comparison: –Find the best alignment of two sequences

Scoring Alignments

Genome 373

Genomic Informatics

Elhanan Borenstein


A quick review

Course logistics

Genomes (so many genomes)

The computational bottleneck


Informatic Challenges: Examples

• Sequence comparison:

– Find the best alignment of two sequences

– Find the best match (alignment) of a given sequence in a large dataset of sequences

– Find the best alignment of multiple sequences

• Motif and gene finding

• Relationship between sequences

– Phylogeny

• Clustering and classification

• Many many many more …


Motivation

• Why compare two protein or DNA sequences?

– Determine whether they are descended from a common ancestor (homologous)

– Infer a common function

– Locate functional elements (motifs or domains)

– Infer protein or RNA structure, if the structure of one of the sequences is known

– Analyze sequence evolution

– Infer the species from which a sequence originated


One of many commonly used tools that depend on sequence alignment.


Sequence Alignment


Mission: Find the best alignment between two sequences.

Find the best alignment of GAATC and CATAC:

GAATC
CATAC

GAATC-
CA-TAC

GAAT-C
C-ATAC

GAAT-C
CA-TAC

-GAAT-C
C-A-TAC

GA-ATC
CATA-C

(some of a very large number of possibilities)


This is an optimization problem!

What do we need to solve this problem?


• A method for scoring alignments

• A “search” algorithm for finding the alignment with the best score


Scoring Principles

• Score each locus independently.

• The alignment score will be the sum of the scores in all loci.

• Perfect Matches will get a positive (good) score.

• What about mismatches?

GAATC

CATAC

(Transitions, i.e. A↔G and C↔T, are typically about 2x as frequent as transversions in real sequences, so a transition mismatch can reasonably be penalized less than a transversion.)


Scoring Aligned Bases

• A reasonable substitution matrix:

      A    C    G    T
A    10   -5    0   -5
C    -5   10   -5    0
G     0   -5   10   -5
T    -5    0   -5   10

GAATC
CATAC

-5 + 10 + -5 + -5 + 10 = 5
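The per-position sum above can be sketched in a few lines of Python. This is only an illustration: the names `SUBSTITUTION` and `score_ungapped` are made up for this sketch and do not come from the slides; the matrix values are the ones shown above.

```python
# Substitution matrix copied from the slide:
# match = 10, transition (A<->G, C<->T) = 0, transversion = -5.
SUBSTITUTION = {
    "A": {"A": 10, "C": -5, "G": 0, "T": -5},
    "C": {"A": -5, "C": 10, "G": -5, "T": 0},
    "G": {"A": 0, "C": -5, "G": 10, "T": -5},
    "T": {"A": -5, "C": 0, "G": -5, "T": 10},
}


def score_ungapped(x: str, y: str) -> int:
    """Sum the independent per-position scores of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(SUBSTITUTION[a][b] for a, b in zip(x, y))


print(score_ungapped("GAATC", "CATAC"))  # 5, matching the sum above
```

Because each position is scored independently, the alignment score is just a sum over columns, which is what later makes a dynamic programming search possible.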

What about gaps?

What About Gaps?

GAAT-C
CA-TAC

-5 + 10 + ? + 10 + ? + 10 = ?

What if gaps have no penalty?

What do gaps mean?


Scoring Gaps?

• Linear gap penalty: every gap position receives a score of d:

GAAT-C     d = -4
CA-TAC

-5 + 10 + -4 + 10 + -4 + 10 = 17

• Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e:

G--AATC    d = -4
CATA--C    e = -1

-5 + -4 + -1 + 10 + -4 + -1 + 10 = 5
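Both gap schemes can be sketched with one scoring function, treating the linear penalty as the special case where extending a gap costs the same as opening it. The function name `score_alignment` and its parameters are assumptions made for this sketch; the matrix and the values d = -4 and e = -1 are the ones used in the examples.

```python
SUBSTITUTION = {
    "A": {"A": 10, "C": -5, "G": 0, "T": -5},
    "C": {"A": -5, "C": 10, "G": -5, "T": 0},
    "G": {"A": 0, "C": -5, "G": 10, "T": -5},
    "T": {"A": -5, "C": 0, "G": -5, "T": 10},
}


def score_alignment(x_aln, y_aln, gap_open=-4, gap_extend=None):
    """Score a gapped alignment column by column.

    With gap_extend=None every gap position costs gap_open (linear penalty).
    Otherwise the first position of a gap run costs gap_open and each
    following position of the same run costs gap_extend (affine penalty).
    """
    if gap_extend is None:
        gap_extend = gap_open
    if len(x_aln) != len(y_aln):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    prev_gap_row = None  # which sequence had a gap in the previous column
    for a, b in zip(x_aln, y_aln):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        gap_row = "x" if a == "-" else "y" if b == "-" else None
        if gap_row is None:
            total += SUBSTITUTION[a][b]
        elif gap_row == prev_gap_row:
            total += gap_extend  # continuing an existing gap run
        else:
            total += gap_open  # starting a new gap run
        prev_gap_row = gap_row
    return total


print(score_alignment("GAAT-C", "CA-TAC"))             # 17 (linear, d = -4)
print(score_alignment("G--AATC", "CATA--C", -4, -1))   # 5 (affine, d = -4, e = -1)
```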


Same Method Applies to AA (amino acids)

[Figure: BLOSUM62 score matrix, covering the regular 20 amino acids plus ambiguity codes and stop]

YMEGDLEIAPDAK
VL--DKELSPDGT

Y mutates to V: receives -1
M mutates to L: receives 2
E gets deleted: receives -10
G gets deleted: receives -10
D matches D: receives 6

Score of these first five columns = -1 + 2 + -10 + -10 + 6 = -13
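The five column scores listed above can be reproduced with a small lookup. This sketch deliberately contains only the three BLOSUM62 entries quoted on the slide; `BLOSUM62_SUBSET`, `GAP`, and `column_score` are hypothetical names, and a real scorer would load the full matrix.

```python
# Only the BLOSUM62 entries quoted on the slide (keys are sorted residue pairs).
BLOSUM62_SUBSET = {
    ("V", "Y"): -1,
    ("L", "M"): 2,
    ("D", "D"): 6,
}
GAP = -10  # score the slide assigns to each deleted residue


def column_score(a: str, b: str) -> int:
    """Score one alignment column: a gap, or a substitution looked up symmetrically."""
    if a == "-" or b == "-":
        return GAP
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


# First five columns of the alignment shown above.
x, y = "YMEGD", "VL--D"
scores = [column_score(a, b) for a, b in zip(x, y)]
print(scores, sum(scores))  # [-1, 2, -10, -10, 6] -13
```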


Exhaustive search

• Align the two sequences: GAATC and CATAC

Simple (exhaustive search) algorithm:

1) Construct all possible alignments

2) Use the substitution matrix and gap penalty to score each alignment

3) Pick the alignment with the best score
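The three steps of the exhaustive search can be sketched directly, which is only practical for very short sequences. This is an illustration under assumptions: it uses the nucleotide matrix and the linear gap penalty d = -4 from the earlier examples, and the names `all_alignments`, `score`, and `best_alignment` are made up here.

```python
SUBSTITUTION = {
    "A": {"A": 10, "C": -5, "G": 0, "T": -5},
    "C": {"A": -5, "C": 10, "G": -5, "T": 0},
    "G": {"A": 0, "C": -5, "G": 10, "T": -5},
    "T": {"A": -5, "C": 0, "G": -5, "T": 10},
}
GAP = -4  # linear gap penalty, as in the earlier example


def all_alignments(x, y):
    """Step 1: yield every alignment of x and y as a pair of gapped strings."""
    if not x and not y:
        yield "", ""
        return
    if x and y:  # align the first letters of both sequences
        for ax, ay in all_alignments(x[1:], y[1:]):
            yield x[0] + ax, y[0] + ay
    if x:  # first letter of x opposite a gap
        for ax, ay in all_alignments(x[1:], y):
            yield x[0] + ax, "-" + ay
    if y:  # first letter of y opposite a gap
        for ax, ay in all_alignments(x, y[1:]):
            yield "-" + ax, y[0] + ay


def score(x_aln, y_aln):
    """Step 2: score one alignment with the matrix and the gap penalty."""
    return sum(
        GAP if "-" in (a, b) else SUBSTITUTION[a][b]
        for a, b in zip(x_aln, y_aln)
    )


def best_alignment(x, y):
    """Step 3: pick an alignment with the best score (ties broken arbitrarily)."""
    return max(all_alignments(x, y), key=lambda pair: score(*pair))


best = best_alignment("GAATC", "CATAC")
print(best, score(*best))  # the best score is 17; several alignments tie for it
```

The recursion revisits the same suffix pairs over and over, which is exactly the redundancy a dynamic programming approach avoids.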


How many possibilities?

• How many different possible alignments of two sequences of length n exist?

n     number of alignments
5     2.5 x 10^2
10    1.8 x 10^5
20    1.4 x 10^11
30    1.2 x 10^17
40    1.1 x 10^23
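The slides do not say how these numbers were obtained. They agree, to the precision shown, with the central binomial coefficient C(2n, n), so the sketch below uses that formula as an assumption; the exact count depends on which alignments are treated as distinct. The function name `alignment_count` is hypothetical.

```python
from math import comb


def alignment_count(n: int) -> int:
    """Central binomial coefficient C(2n, n); assumed to be the formula behind the table."""
    return comb(2 * n, n)


# Print the count for each sequence length in the table, in scientific notation.
for n in (5, 10, 20, 30, 40):
    print(n, f"{alignment_count(n):.1e}")
```

Whatever the counting convention, the growth is exponential in n, which is why exhaustive search is hopeless for sequences of realistic length.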

Mission: Find the best alignment between two sequences.

• A method for scoring alignments

• A “search” algorithm for finding the alignment with the best score: the Needleman–Wunsch Algorithm (dynamic programming)

The Needleman–Wunsch Algorithm

• An algorithm for global alignment of two sequences

• A Dynamic Programming (DP) approach

– Yes, it’s a weird name.

– DP is closely related to recursion and to mathematical induction

• We can prove that the resulting score is optimal.
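The slides up to this point name the algorithm but do not yet give its recurrence. As a preview, here is a minimal sketch of the standard textbook formulation, assuming the nucleotide matrix and the linear gap penalty d = -4 from the earlier examples. It computes only the optimal score, not the alignment itself, and the name `needleman_wunsch_score` is made up for this sketch.

```python
SUBSTITUTION = {
    "A": {"A": 10, "C": -5, "G": 0, "T": -5},
    "C": {"A": -5, "C": 10, "G": -5, "T": 0},
    "G": {"A": 0, "C": -5, "G": 10, "T": -5},
    "T": {"A": -5, "C": 0, "G": -5, "T": 10},
}


def needleman_wunsch_score(x: str, y: str, gap: int = -4) -> int:
    """Optimal global alignment score of x and y under a linear gap penalty."""
    rows, cols = len(y) + 1, len(x) + 1
    # F[i][j] = best score for aligning the first i letters of y
    # with the first j letters of x.
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * gap  # a prefix of x aligned entirely against gaps
    for i in range(1, rows):
        F[i][0] = i * gap  # a prefix of y aligned entirely against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + SUBSTITUTION[y[i - 1]][x[j - 1]],  # align two letters
                F[i - 1][j] + gap,  # letter of y opposite a gap
                F[i][j - 1] + gap,  # letter of x opposite a gap
            )
    return F[-1][-1]


print(needleman_wunsch_score("GAATC", "CATAC"))  # 17
```

Each cell is filled once from three neighbours, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.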


DP matrix

         j    0    1    2    3    etc.
                   G    A    A    T    C
i   0
    1    C
    2    A
    3    T
    4    A
    5    C
