Bioinformatics PhD. Course


Upload: ross-barton

Post on 01-Jan-2016

Page 1: Bioinformatics PhD. Course

Bioinformatics PhD. Course

Summary (approximate)

• 1. Biological introduction

• 2. Comparison of short sequences (<10,000 bp)

• 4. Sequence assembly

• 3. Comparison of large sequences (up to 250,000,000 bp)

• 5. Efficient data search structures and algorithms

• 6. Proteins...

Page 2: Bioinformatics PhD. Course

2. Comparison of short sequences (<10,000 bp)

Summary (more or less)

• 2.1 Dot matrix

• 2.2 Pairwise alignment

• 2.3 Hash algorithms

• 2.4 Multiple alignment

Page 3: Bioinformatics PhD. Course

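2.1 Dot matrix

Given two sequences, how can we analyse their degree of identity?

By searching for the parts that match:

[Figure: dot matrix comparing S1 and S2. The cell for position x of S1 and position y of S2 holds 1 if both characters coincide, 0 otherwise.]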
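The single-character comparison described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `dot_matrix` and `render` are hypothetical and do not come from the course material.

```python
def dot_matrix(s1, s2):
    """Return m where m[x][y] is 1 if s1[x] == s2[y], and 0 otherwise."""
    return [[1 if a == b else 0 for b in s2] for a in s1]


def render(matrix):
    """Draw the matrix as text, one row per character of s1: '*' for 1, '.' for 0."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


if __name__ == "__main__":
    # Runs of '*' along a diagonal mark stretches where the two sequences agree.
    print(render(dot_matrix("accacca", "acctgag")))
```

Every cell is filled independently, so the cost is one character comparison per pair of positions.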

Page 4: Bioinformatics PhD. Course

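2.1 Dot matrix

Given two sequences, how can we analyse their degree of identity?

By searching for the parts that match:

[Figure: the same dot matrix, now comparing a stretch of characters starting at x in S1 with a stretch starting at y in S2. Is the cell still 1 only if both characters coincide?]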

Page 5: Bioinformatics PhD. Course

2.1 Dot matrix

What is the cost of the algorithm?

When are the matches relevant?

accaccacaccacaacgagcata … acctgagcgatat

acc..t

L = window length

• m(i,j) = 1 iff S1(i..i+L-1) = S2(j..j+L-1): exact matching.

• m(i,j) = 1 iff at least k of the L characters coincide: approximate matching.

• m(i,j) = k iff exactly k of the L characters coincide: approximate matching, with the count kept as a score.
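The three definitions of m(i,j) can be compared side by side in a small sketch. It assumes 0-based indices, windows of exactly L characters, and reads "k over L coincide" as "at least k of L"; the names `window_matches` and `windowed_dot_matrix` are hypothetical.

```python
def window_matches(s1, s2, i, j, L):
    """Count positions at which the windows s1[i:i+L] and s2[j:j+L] coincide."""
    return sum(1 for a, b in zip(s1[i:i + L], s2[j:j + L]) if a == b)


def windowed_dot_matrix(s1, s2, L, k=None, scored=False):
    """Build m(i, j) for every pair of full-length windows.

    k=None       exact matching: 1 iff all L characters coincide.
    k=<int>      approximate matching: 1 iff at least k of L coincide (assumed reading).
    scored=True  m(i, j) is the number of coinciding characters.
    """
    threshold = L if k is None else k
    rows = len(s1) - L + 1  # number of window start positions in s1
    cols = len(s2) - L + 1  # number of window start positions in s2
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            count = window_matches(s1, s2, i, j, L)
            row.append(count if scored else int(count >= threshold))
        matrix.append(row)
    return matrix
```

For example, `windowed_dot_matrix("accacc", "acctga", 3)` marks only the two places where `acc` occurs in both sequences, while passing `k=2` also marks the window pair `cca` / `cct`.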

Page 6: Bioinformatics PhD. Course

2.1. Dot matrix: algorithm cost

accaccacaccacaacgagcata … acctgagcgatat

acc..t

• length(S1) · length(S2) · L, in other words $O(n^2 L)$.

• Is length(S1) · length(S2) achievable? In other words, can the cost be $O(n^2)$, independent of L?
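One possible answer to the question above, offered as a sketch rather than as the course's own solution: two consecutive windows on the same diagonal share L-1 character pairs, so the count for cell (i, j) can be derived from cell (i-1, j-1) in constant time. The function name `scored_dot_matrix_fast` is hypothetical, and indices are assumed to be 0-based.

```python
def scored_dot_matrix_fast(s1, s2, L):
    """Return m where m[i][j] counts coincidences between s1[i:i+L] and s2[j:j+L]."""
    rows = len(s1) - L + 1
    cols = len(s2) - L + 1
    if rows <= 0 or cols <= 0:
        return []
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 or j == 0:
                # First cell of a diagonal: count the whole window directly.
                m[i][j] = sum(s1[i + t] == s2[j + t] for t in range(L))
            else:
                # Slide one step along the diagonal: drop the pair that leaves
                # the window and add the pair that enters it.
                leaving = s1[i - 1] == s2[j - 1]
                entering = s1[i + L - 1] == s2[j + L - 1]
                m[i][j] = m[i - 1][j - 1] - leaving + entering
    return m
```

Only the cells in the first row and first column pay the full cost of L comparisons. All other cells cost a constant amount, so the total is O(n^2 + nL), which is O(n^2) because L cannot exceed n.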

Page 7: Bioinformatics PhD. Course

2.1. Dot matrix: signals

[Figure: three dot-matrix panels. A: transposons. B: S1 = S2. C: random.]

When are signals statistically significant?

Page 8: Bioinformatics PhD. Course

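2.1. Dot matrix: statistical significance

We need to define a random model against which to compare the signals.

[Figure: a window of length L starting at x in S1, compared with a window of length L starting at y in S2.]

We define the random variable X as the number of characters that coincide in the window. Then

$\mathrm{Prob}(X = k) = \binom{L}{k}\, p^k (1-p)^{L-k}$

where p is the probability that two characters coincide under the random model.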

What is its expected value?
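Since X follows a binomial distribution with parameters L and p, its expected value is E[X] = L·p. The sketch below checks this numerically by summing k · Prob(X = k), and also computes the tail probability Prob(X >= k) that one would compare a signal against. The function names are hypothetical, and the value p = 0.25 in the usage lines is an assumption (four equally likely, independent nucleotides), not a figure from the slides.

```python
from math import comb


def match_pmf(k, L, p):
    """Prob(X = k): exactly k of the L window positions coincide by chance."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)


def expected_matches(L, p):
    """E[X] from the definition: sum over k of k * Prob(X = k)."""
    return sum(k * match_pmf(k, L, p) for k in range(L + 1))


def tail_probability(k, L, p):
    """Prob(X >= k): chance of at least k coincidences in a random window."""
    return sum(match_pmf(j, L, p) for j in range(k, L + 1))


if __name__ == "__main__":
    L, p = 10, 0.25  # assumed values, for illustration only
    print(expected_matches(L, p))      # agrees with L * p up to rounding
    print(tail_probability(8, L, p))   # how surprising 8 matches out of 10 would be
```

A window whose match count has a very small tail probability under this model is unlikely to be a chance signal.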