burrows-wheeler transform and bwt-indexkoutcher/lectures/lecture4-1.pdf · succinct and compressed...

Burrows-Wheeler transform and BWT-index


Succinct and compressed indexes

•  a succinct index takes space, in bits, proportional to that of the text itself

•  previous indexes are not succinct: they take O(n) computer words, that is, O(n·log(n)) bits

•  a compressed index takes space, in bits, proportional to that of the compressed text

•  a self-index does not require storing the text

Burrows-Wheeler transform

T=acatacagatg$

Sort all cyclic rotations of T. SA[i] is the starting position in T of the i-th rotation; F and L are the first and last columns.

  i  SA[i]  F             L
  1   12    $ acatacagat  g
  2    5    a cagatg$aca  t
  3    1    a catacagatg  $
  4    7    a gatg$acata  c
  5    3    a tacagatg$a  c
  6    9    a tg$acataca  g
  7    6    c agatg$acat  a
  8    2    c atacagatg$  a
  9   11    g $acatacaga  t
 10    8    g atg$acatac  a
 11    4    t acagatg$ac  a
 12   10    t g$acatacag  a

•  The last column L is the BWT; the first column F is T[SA[i]]

•  BWT[i]=T[SA[i]-1] if SA[i]≠1, otherwise $. For T=acatacagatg$ this gives BWT=gt$ccgaataaa

•  BWT was defined for the purpose of compression: it tends to compress better than the input text

•  BWT is reversible!
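The definition above can be checked on the running example with a few lines of Python. This is a minimal sketch: the function names `suffix_array` and `bwt_from_sa` are illustrative, and the suffix array is built by naive sorting, which is only reasonable for short texts.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text (text must end with a unique '$')."""
    # Naive construction: sort suffix start positions by the suffix itself.
    # With a unique terminator this gives the same order as sorting rotations.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def bwt_from_sa(text, sa):
    """BWT[i] = T[SA[i]-1] if SA[i] != 1, otherwise '$' (1-based positions)."""
    # text is a 0-based Python string, so T[p-1] in 1-based terms is text[p-2].
    return "".join(text[p - 2] if p != 1 else "$" for p in sa)


T = "acatacagatg$"
SA = suffix_array(T)
BWT = bwt_from_sa(T, SA)
print(SA)   # [12, 5, 1, 7, 3, 9, 6, 2, 11, 8, 4, 10]
print(BWT)  # gt$ccgaataaa
```

Practical tools use linear-time suffix array construction instead of sorting whole suffixes; the naive version is used here only because it mirrors the definition.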

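•  Obs 1: the first column (F) is easy to reconstruct: it is the sorted sequence of the characters of the BWT, so it can be represented by an array C[x]=∑_{y<x} occ(y,T) for each letter x, where occ(y,T) is the number of occurrences of y in T

•  Ex: C=[0,1,6,8,10] for the letters $, a, c, g, t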

•  Obs 2: for identical chars, their relative order in F and L is the same: if L[i]=x is the k-th x in L, it is the same character of T as the k-th x in F, found at row j=LF[i]

LF[i]=C[BWT[i]]+rank[BWT[i],i], where rank[x,i] is the number of occurrences of x in BWT[1..i]

Ex: LF[1]=C[g]+rank[g,1]=8+1=9

Reversing the BWT (only the F and L columns are needed): start with T= $ at row 1, the row that begins with $; its L character is the one preceding $ in T. Then follow LF, prepending one character of T at each step.

  row  1: L=g   T= g$
  row  9: L=t   T= tg$
  row 12: L=a   T= atg$
  row  6: L=g   T= gatg$
  …
  row  3: L=$   stop: T= acatacagatg$
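The reconstruction steps above can be sketched in Python as follows. The function name `inverse_bwt` is illustrative, rows are 0-based here (so row 0 is the row starting with $), and rank is precomputed naively in a list rather than answered by a succinct structure.

```python
from collections import Counter


def inverse_bwt(bwt):
    """Rebuild T from its BWT using only the C array and rank (0-based rows)."""
    # C[x] = number of characters of the text that are strictly smaller than x.
    counts = Counter(bwt)
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ_before[i] = occurrences of bwt[i] in bwt[0..i-1] (an exclusive rank).
    seen = Counter()
    occ_before = []
    for ch in bwt:
        occ_before.append(seen[ch])
        seen[ch] += 1

    # Row 0 starts with '$'; its L character is the one preceding '$' in T.
    out = ["$"]
    row = 0
    while bwt[row] != "$":
        out.append(bwt[row])
        row = C[bwt[row]] + occ_before[row]  # LF step, 0-based
    return "".join(reversed(out))


print(inverse_bwt("gt$ccgaataaa"))  # acatacagatg$
```

With 0-based rows and an exclusive rank the LF formula needs no correction term; it is the same mapping as the 1-based formula with an inclusive rank used on the slides.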

LF function

LF[i]=C[BWT[i]]+rank[BWT[i],i]

SA = 12 5 1 7 3 9 6 2 11 8 4 10

•  LF[i] yields the index (in SA) of the suffix immediately preceding (in T) the i-th suffix (in SA). Formally, SA[LF[i]]=SA[i]-1 (for SA[i]≠1).

Ex: BWT[7]=a, so LF[7]=C[a]+rank[a,7]=1+1=2, and SA[LF[7]] = SA[2] = 5 = SA[7]-1 = 6-1
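The identity SA[LF[i]]=SA[i]-1 can be verified on the example with a short Python sketch. The helper name `lf_mapping` is illustrative; lists are padded with an unused entry at index 0 so that indices match the 1-based numbering of the slides, and rank is computed naively by counting.

```python
def lf_mapping(bwt):
    """Return LF as a 1-based list (index 0 unused): LF[i] = C[BWT[i]] + rank(BWT[i], i)."""
    C, total = {}, 0
    for ch in sorted(set(bwt)):
        C[ch] = total
        total += bwt.count(ch)
    lf = [0]
    for i in range(1, len(bwt) + 1):
        ch = bwt[i - 1]
        rank = bwt[:i].count(ch)  # inclusive rank: occurrences in BWT[1..i]
        lf.append(C[ch] + rank)
    return lf


BWT = "gt$ccgaataaa"
SA = [0, 12, 5, 1, 7, 3, 9, 6, 2, 11, 8, 4, 10]  # 1-based, index 0 unused
LF = lf_mapping(BWT)

for i in range(1, len(BWT) + 1):
    if SA[i] != 1:  # the row with SA[i] = 1 wraps around to the '$' row
        assert SA[LF[i]] == SA[i] - 1
print(LF[1], LF[7])  # 9 2
```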

rank function

LF[i]=C[BWT[i]]+rank[BWT[i],i]

  i            1  2  3  4  5  6  7  8  9 10 11 12
  BWT[i]       g  t  $  c  c  g  a  a  t  a  a  a
  earlier occ. 0  0  0  0  1  1  0  1  1  2  3  4

The values shown count the occurrences of BWT[i] strictly before row i; the inclusive rank[BWT[i],i] used in the LF formula is one more (e.g. rank[g,1]=1).

How about general queries rank[a,i] for any letter a and any position i?

rank/select functions

•  given a string T, efficiently answer queries rank(a,i): the number of a’s in T[1..i]

•  the rank function (on bit vectors) turns out to be a fundamental algorithmic block for building succinct data structures [Jacobson 89]

•  on bit vectors, rank can be supported in O(1) time using o(n) additional bits of memory

•  complementary function select(a,j): output the position of the j-th occurrence of a in T. select can also be supported in O(1) time

•  [Jacobson 89]: using rank/select to represent binary trees to support navigation

Implementing rank on bitmaps

•  consider a bitmap B of size n

•  tabulate rank within all blocks of size (log n)/2: there are 2^((log n)/2) = √n different blocks and (log n)/2 possible queries, with the result taking (log log n) bits. Overall space: O(√n·(log n)·(log log n)) = o(n)

•  idea: compute “cumulative rank” for block borders

•  this takes 2n/(log n) · (log n) = 2n bits: too much! Trick: introduce two levels of blocks

•  split B into n/(log² n) superblocks of size (log² n); compute cumulative rank. This takes n/(log² n) · (log n) = n/(log n) = o(n) bits

•  split each superblock into blocks of size (log n)/2; compute cumulative rank inside the superblock; the result takes O(log log n) bits. Therefore we only need 2n/(log n) · (log log n) = o(n) bits

•  a query adds three values: the cumulative rank of the superblock, the rank of the block inside its superblock, and the tabulated rank inside the block

DONE!
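The two-level layout above can be sketched in Python. This is an illustration under simplifying assumptions: the class name `RankBitmap` is made up, bits are kept in a plain list instead of packed words, positions are 0-based, and the in-block lookup table is replaced by a direct sum over at most one block.

```python
import math


class RankBitmap:
    """Two-level rank directory over a list of 0/1 bits (a sketch, not bit-packed)."""

    def __init__(self, bits):
        self.bits = list(bits)
        log_n = max(2, int(math.log2(max(len(self.bits), 4))))
        self.block = max(1, log_n // 2)   # block size ~ (log n)/2
        self.per_super = 2 * log_n        # blocks per superblock
        self.super_rank = []  # ones before each superblock (absolute count)
        self.block_rank = []  # ones before each block, relative to its superblock
        total = in_super = 0
        for b, start in enumerate(range(0, len(self.bits), self.block)):
            if b % self.per_super == 0:   # a new superblock (~ log^2 n bits) begins
                self.super_rank.append(total)
                in_super = 0
            self.block_rank.append(in_super)
            ones = sum(self.bits[start:start + self.block])
            total += ones
            in_super += ones

    def rank1(self, i):
        """Number of 1s among the first i bits."""
        i = max(0, min(i, len(self.bits)))
        if i == 0:
            return 0
        b = (i - 1) // self.block          # block containing the last counted bit
        s = b // self.per_super            # superblock containing that block
        # A real implementation replaces this sum by one table lookup.
        inside = sum(self.bits[b * self.block:i])
        return self.super_rank[s] + self.block_rank[b] + inside


bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
rb = RankBitmap(bits)
print(rb.rank1(7))  # 4
```

The number of 0s among the first i bits is i minus `rank1(i)`, so one directory serves both bit values.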

rank for large alphabets: wavelet trees

Tree over the alphabet {$,a,b,c,d,r}; each node stores one bit per character of its subsequence (0 = goes to the left child, 1 = goes to the right child):

  Sε:   {$,a,b} → 0    {c,d,r} → 1
  S0:   {$,a} → 0      {b} → 1
  S00:  $ and a (a → 1)
  S1:   {c,d} → 0      {r} → 1
  S10:  c and d

•  Space: n·log(σ) bits

•  rank(a,11)=? The code of a is 001, so follow the bits 0, 0, 1 from the root: rank(0,11,Sε)=5, then rank(0,5,S0)=3, then rank(1,3,S00)=2, hence rank(a,11)=2

•  rank is computed in O(log(σ)) time
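The top-down rank computation can be sketched with a small pointer-based wavelet tree in Python. Several things here are assumptions for illustration: the class name `WaveletTree`, the example text `abracadabra$` (the slide's own text is not recoverable from the transcript, so the numbers differ from the slide), the way the alphabet is halved at each node, and the use of prefix-sum lists in place of succinct bitmap rank.

```python
class WaveletTree:
    """Pointer-based wavelet tree supporting rank(c, i) (a sketch with plain lists)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return  # leaf: a single symbol, nothing to store
        mid = len(self.alphabet) // 2
        lo, hi = self.alphabet[:mid], self.alphabet[mid:]
        self.lo_set = set(lo)
        # prefix[i] = number of 0-bits (symbols routed left) among the first i symbols;
        # it plays the role of rank(0, i) on this node's bitmap.
        self.prefix = [0]
        for ch in seq:
            self.prefix.append(self.prefix[-1] + (ch in self.lo_set))
        self.left = WaveletTree([ch for ch in seq if ch in self.lo_set], lo)
        self.right = WaveletTree([ch for ch in seq if ch not in self.lo_set], hi)

    def rank(self, c, i):
        """Occurrences of c among the first i symbols of the sequence."""
        if self.left is None:  # leaf: all i symbols here are the leaf's symbol
            return i if self.alphabet and c == self.alphabet[0] else 0
        if c in self.lo_set:
            return self.left.rank(c, self.prefix[i])
        return self.right.rank(c, i - self.prefix[i])


wt = WaveletTree("abracadabra$")
print(wt.rank("a", 11))  # 5
```

Each level performs one bitmap rank and descends to one child, which is where the O(log(σ)) query time comes from when the alphabet is split evenly.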

String matching with BWT

T=acatacagatg$

P=taca

Backward search: process P from its last letter to its first, maintaining the interval of rows whose suffixes start with the current pattern suffix.

•  S : current string (pattern suffix)
•  [e,f] : current interval
•  x : letter

compute new interval for xS:

e:= C[x]+rank[x,e-1]+1
f:= C[x]+rank[x,f]

Here rank[x,i] counts the occurrences of x in BWT[1..i]. The interval is empty, and P does not occur, as soon as e>f.

For P=taca, starting from [e,f]=[1,12]:

  a    → [2,6]
  ca   → [7,8]
  aca  → [2,3]
  taca → [11,11]
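The interval update can be sketched in Python as below. The function name `backward_search` is illustrative, rank is computed by naive counting rather than by a rank structure, and rows are 1-based with an inclusive rank (the number of occurrences of x in BWT[1..i]), so the lower bound uses rank at e-1.

```python
def backward_search(bwt, pattern):
    """Return the 1-based interval [e, f] of rows prefixed by pattern, or None."""
    C, total = {}, 0
    for ch in sorted(set(bwt)):
        C[ch] = total
        total += bwt.count(ch)

    def rank(x, i):
        # Occurrences of x in BWT[1..i]; a real index answers this without scanning.
        return bwt[:i].count(x)

    e, f = 1, len(bwt)
    for x in reversed(pattern):
        if x not in C:
            return None
        e = C[x] + rank(x, e - 1) + 1
        f = C[x] + rank(x, f)
        if e > f:
            return None  # empty interval: the pattern does not occur
    return e, f


BWT = "gt$ccgaataaa"  # BWT of acatacagatg$
print(backward_search(BWT, "taca"))  # (11, 11)
print(backward_search(BWT, "aca"))   # (2, 3)
```

The number of occurrences is f-e+1, obtained after |P| steps without looking at the text or at SA.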

What position is it?? The search returns a row of the sorted matrix (an index in SA), not a position in T.

For P=taca the matching row is row 11, and SA[11]=4: it is position 4 in the 1-based numbering used for SA (position 3 when counting from 0).

BWT-index (FM-index)

T=acatacagatg$

Reporting positions needs SA, but the full SA takes O(n·log(n)) bits.

•  Solution: store only a fraction of values of SA

•  Storing one value out of log(n) leads to O(n·log(n)/log(n))=O(n) bits

•  A value SA[i] that is not stored is recovered by following LF until a stored value is reached

•  Search time becomes O(|P|+occ·log(n))

[Ferragina, Manzini 00]

FM-index includes:
•  BWT
•  selection of SA values
•  auxiliary structures: array C, rank, position marking …

  SA (full):    12  5  1  7  3  9  6  2 11  8  4 10
  SA (sampled): 12  ·  ·  ·  ·  ·  6  2  ·  8  4 10   (here: the even values)
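Recovering a missing SA value from the sampled ones can be sketched as follows. The function name `locate` and the dictionary `SAMPLED` are illustrative: the dictionary stands in for the position-marking structure plus the stored values, it keeps the even SA values of the example, and rank is again computed by naive counting.

```python
def locate(bwt, sampled_sa, row):
    """Return SA[row] (1-based) by following LF until a sampled row is reached."""
    C, total = {}, 0
    for ch in sorted(set(bwt)):
        C[ch] = total
        total += bwt.count(ch)
    n = len(bwt)
    steps = 0
    while row not in sampled_sa:
        ch = bwt[row - 1]
        row = C[ch] + bwt[:row].count(ch)  # LF step: one position to the left in T
        steps += 1
    # Each LF step decreased the text position by one; wrap around past position 1.
    return (sampled_sa[row] + steps - 1) % n + 1


BWT = "gt$ccgaataaa"  # BWT of acatacagatg$
# row -> SA value, keeping only the even values of SA.
SAMPLED = {1: 12, 7: 6, 8: 2, 10: 8, 11: 4, 12: 10}

print(locate(BWT, SAMPLED, 11))  # 4 (row found for P=taca)
print(locate(BWT, SAMPLED, 2))   # 5 (one LF step to row 11, then 4 + 1)
```

If one text position out of log(n) is sampled, at most about log(n) LF steps are needed per occurrence, which is where the occ·log(n) term in the search time comes from.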

BWT-index: practical issues

•  BWT-index can be implemented using ~3 bits/char (!!)

•  BWT-index is now widely used in practical bioinformatics software: BWA, bowtie, SOAP2 (mapping), CGA (assembly)

•  Variant: bi-directional BWT-index

•  other succinct data structures exist (including the compact suffix array) and continue to appear

•  construction may require much more space than the resulting structure

•  external memory algorithms are important

•  algorithms specialized to multi-core or GPU processor architectures

•  dynamic indexes