sequence assembly - princeton university · 2017. 9. 15. · sequence assembly from corrupted...
TRANSCRIPT
Sequence assemblyfrom corrupted shotgun reads
Shirshenducanguly Elhanan Mossel Miklos Z.
Ratz
U. Washington UC Berkeley & U
. Penn Microsoft Research
1¥016
DNA sequencing :
• Prevalent technique : shotgun sequencing
• Goal of de novo assembly :
reconstruct X from reads R
DNA
sequencing
:
robustalgorithms.2.ee
Prevalent technique : shotgun sequencing
• Goal of de novo assembly :
reconstruct X from reads R
•. Many sequencing technologiesw/ different error profiles
: Are there robust assembly algorithms ?
DNA
sequencing
: robustalgorithmsn
• Prevalent technique : shotgun sequencing
• Goal of de novo assembly :
reconstruct X from reads R
•. Many sequencing technologiesw/ different error profiles
: Are there robust assembly algorithms ?
A : Yes,
and a simple sequential algorithm works well.
Sequencing technologies• Sanger sequencing
- 800-1000 bp reads,
< 1% error
- very expensive
•. Next gen ( Edgar) sequencing- high throughput , cheap Example :
- short reads ( too -200 bp ) Illuming- low error rate ( 1- 3%)
•• Emerging ( 3rd gen) technologiesExamples :
- long reads ( > 10000 bp )• Paasio 's SMRT
- high error rate ( 10 - 22% ) • Oxford Nanopore
Sequencing technologies• Sanger sequencing
- 800-1000 bp reads,
< 1% error
- very expensive
•• Next gen ( Edgar) sequencing- high throughput , cheap Example :
- short reads ( too -200 bp ) Illuming- low error rate ( 1- 3%)
•• Emerging ( 3rd gen) technologiesExamples :
- long reads ( > 10000 bp )• Paasio 's SMRT
- high error rate ( 10 - 22% ) • Oxford Nanopore
: how robust are reconstruction algorithmsWirt .
different sequencing technologies ?
Adversarial corruption / error model
• Instead of getting true reads R ,
get corrupted reads FL
• Assume only that
ed ( Ri ,R~i)< EL
where ed = edit distance ,
and L = length of R ; .
Approximate reconstruction problem1
.
Choose XE ["
uniformly at random,
E = { A ,C , G ,T]
2.
Draw reads
R = { R, ,R , , ... ,Rµ } of length L
from uniformly random positions
3.
Get corrupted reads
k={ RT,RT , . . spin }
Satisfying edl Ri ,ki)< EL
Approximate reconstruction problem1
.
Choose XE ["
uniformly at random,
E={ A ,C , G ,T]
2.
Draw reads
R = { R, ,R , , ... ,Rµ } of length L
from uniformly random positions
3.
Get corrupted reads
k={ RT,RT , . . spin }
Satisfying edl Ri ,ki)< EL
Goal : approximate reconstruction
Output : F- Re )eE* st.
ed ( I ,X ) ecen
Wlprob . 21-8.
Main obstructions to reconstruction
I.
Short reads lead to repeats ( Ukkonen ' 92 )repeat -
limited
regime
Main obstructions to reconstruction
1.
Short reads lead to repeats ( Ukkonen ' 92 )repeat -
limited
regime
2. Need enough reads to cover Xcoverage
-
limited
regime
Lander,
Waterman ( 1988 ) :
Nav = Naval ,
D=Eln ( to )
Exact reconstruction ( to )Thin ( A
.Motahan , G. Bresler
, D Tse , 2013 )+
These are the only obstructions.
Exact reconstruction ( to )Thin ( A
.Motahan , G
.Bresler
, D Tse , 2013 )+
These are the only obstructions.
More precisely : let X be random ,L= I lnlnl
,5<112 .
Then :
repeated • if I < ¥2 , then exact reconstruction is impossible ;vigifted . if I > Ii then king ,NTo÷ = 1.
Exact reconstruction ( to )Thin ( A
.Motahan , G
.Bresler
, D Tse , 2013 )+
These are the only obstructions.
More precisely : let X be random ,L= I ln (nl
,5<42 .
Then :
repeated • if I < ¥2 , then exact reconstruction is impossible ;viwaited . if I > ¥ ,then nltg ,Nvmj÷ = 1
.
low
For arbitrary sequences:
G.
Bresler,
M Bresler, D .
Tse ( 2013 )thresholds based on repeat statistics of genome
Approximate reconstruction
Thy Approximate reconstruction is possible ,
if L and N are large enough .
Approximate reconstruction
Thy Approximate reconstruction is possible ,
if L and N are large enough .
More precisely : Let X be random ,and L=Iln(n )
.
For every C >3 there exist constants E= EK ) ,Eo=E°K , C)
, CECYE, C)
st. for every E e ( 0
, ED if [ 2 Ek,
NZC ' Nav /E
then there exists an approximate reconstruction algorithmfor error rate E with approximation factor C
.
Approximate reconstruction
Thy Approximate reconstruction is possible ,
if L and N are large enough .
More precisely : Let X be random ,and L=Iln(n )
.
For every C >3 there exist constants E= EK ) ,Eo=E°K , C)
, ECYE , C)
st. for every E e ( 0
, ED if [ 2 Ek,
NZC ' Nav /E
then there exists an approximate reconstruction algorithmfor error rate E with approximation factor C
.
Comments• simple sequential algorithm works
•. dependence of [ and N on E not necessary( but get worse C)
••• best achievable C might depend on [ and N
• related work :
Motahari ,Ramchandran
, Tse ,Ma ( 2013 ) ; Shomorony ,
Gurtade,Tse ( 2015 )
Edit distance between random strings
Lemme Xm ,Yme Em independent,
Uniformly random .
Then
lim In ed ( Xm ,Ym) = Cind > 0
.
For KK 4 :
• empirically Cnd = 0.51 .
• volume argument : Cind > 0.33.
Edit distance between random strings
^
utftmntntqrxanmdgtmmewenmindependent,
} )limtmedl Xm ,Ym)=Cind>0 .
Lenny XEIM uniformly random .
Then :
For kI=4 :
• empirically cindaost . ed(X[1,m],X[ t.tk ,m+k])=2k• volume argument : Cind >0 -33
. foray k£cm with prob .71 - £ " ?
Sequential reconstruction algorithm~
Rk XEL
pi
• Fix a appropriately .
• Given Pin , find TZETZ s .
t.
ed ( ksiftix,Fire ' ) -42+214 el
• Concatenates pen and lIn%×.
• at each step , gain = (1- Le)L ,make error s3EL
Summary• introduced adversarial read error model
• approximate reconstruction is possible• simple sequential algorithm works
• edit distance results key to analysis
Summary• introduced adversarial read error model
• approximate reconstruction is possible• simple sequential algorithm works
• edit distance results key to analysis
Challenges• determine fundamental limits
of approximate reconstruction
• results for arbitrary sequences•. bridge gap between models
Summary• introduced adversarial read error model
• approximate reconstruction is possible• simple sequential algorithm works
• edit distance results key to analysis
Challenges• determine fundamental limits
of approximate reconstruction
• results for arbitrary sequences•. bridge gap between models
Thank you!