A statistical base-caller for the Illumina Genome Analyzer
Wally Gilks, University of Leeds
DNA sequencing technologies
• Sanger sequencing
• “Next-Generation” sequencing
• Roche 454
• ABI SOLiD
• Illumina (Solexa)
• “Next-Next (3rd) Generation” sequencing
• VisiGen
• Helicos
• Oxford Nanopore
Illumina Genome Analyzer
• Description of technology
• Technological problems
• Our statistical model for base-calling
• Comparing our accuracy with Illumina’s
One tile of a flow cell
Chi, K.R., Nature Methods 5, 11-14 (2008)
[Figure: one tile of a flow cell, showing sequence clusters (30,000 per tile).]
DNA sample preparation (over-simplified)
1) Extract DNA
2) Randomly shatter
3) Attach adapter sequence
Sequence clusters on the flow cell
[Figure: Cluster 1, Cluster 2 and Cluster 3 on the flow-cell surface. Labels: adapter sequence, sequence fragment.]
Sticky-T: solution
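• Regress the raw intensity $x^{\mathrm{raw}}_{kic}$ for cluster c against cycle number i, for each dye k.
• Normalise:
$$x_{kic} = \frac{x^{\mathrm{raw}}_{kic} - \hat{\alpha}_k - \hat{\beta}_k\, i}{\hat{\sigma}_k}$$
where $\hat{\alpha}_k$ and $\hat{\beta}_k$ are the fitted intercept and slope for dye k and $\hat{\sigma}_k$ is a fitted scale. (The Greek letters did not survive in this transcript; these names are stand-ins.)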
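The regress-then-normalise step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one straight line per dye, pooled over clusters, and assumes that normalising means subtracting the fitted trend and dividing by the residual standard deviation. The function name `normalise_dye` and the toy intensities are hypothetical.

```python
from math import sqrt

def normalise_dye(raw):
    """Fit one pooled straight line (intensity vs. cycle) for a single dye,
    then return (intercept, slope, scale, normalised intensities).

    raw[c][i - 1] is the raw intensity of cluster c at cycle i (cycles start at 1).
    """
    points = [(i, y) for row in raw for i, y in enumerate(row, start=1)]
    n = len(points)
    mean_i = sum(i for i, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((i - mean_i) ** 2 for i, _ in points)
    sxy = sum((i - mean_i) * (y - mean_y) for i, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_i
    # Assumed scale: residual standard deviation with n - 2 degrees of freedom.
    rss = sum((y - intercept - slope * i) ** 2 for i, y in points)
    scale = sqrt(rss / (n - 2))
    normalised = [
        [(y - intercept - slope * i) / scale for i, y in enumerate(row, start=1)]
        for row in raw
    ]
    return intercept, slope, scale, normalised

# Toy data: two clusters whose T-dye intensity creeps up by 1 unit per cycle.
raw_t = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
intercept, slope, scale, norm = normalise_dye(raw_t)
print(slope, norm[0], norm[1])
```

After normalisation each cluster's values no longer drift with cycle number, which is the point of removing the trend.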
The “cross-talk” problem
• Ideally, base “A” would produce a strong and distinct intensity on the A dye.
• Similarly for the other bases.
• But in reality, base “A” can produce a signal on the “C” dye, and so on.
• This is called dye “cross-talk”.
Cross-talk: solution
Model the normalised intensity at cycle i in cluster c:
$$x_{ic} = (x_{Aic}, x_{Cic}, x_{Gic}, x_{Tic})'$$
as a 4-dimensional multivariate normal distribution
$$x_{ic} \mid b \sim N(\mu_{ib}, V_{ib})$$
whose mean vector $\mu_{ib}$ and variance matrix $V_{ib}$ depend on cycle number i and true base b.
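For reference, the density of a 4-dimensional multivariate normal can be written out explicitly. Here $\mu_{ib}$ and $V_{ib}$ denote the mean vector and variance matrix for cycle i and true base b. Under this model, cross-talk would presumably appear as non-zero entries of $\mu_{ib}$ on dyes other than b; that reading is an interpretation, not something stated on the slide.

```latex
p(x_{ic} \mid b) = (2\pi)^{-2}\, |V_{ib}|^{-1/2}
  \exp\left\{ -\tfrac{1}{2}\, (x_{ic} - \mu_{ib})'\, V_{ib}^{-1}\, (x_{ic} - \mu_{ib}) \right\}
```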
The “phase” problem
[Figure: a strand (A C T G A A ...) at cycle 4, in three panels: ideal, misphased, misphased.]
Phase problem: solution
• Assume a probability, written $\phi_c$ here, of a base-incorporation error at a given cycle i, constant over all cycles, but depending on cluster c.
• This implies a probability of $(1 - \phi_c)^i$ of being correctly phased at cycle i.
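The step from a per-cycle error probability to the probability of still being in phase can be spelled out. The sketch below writes $\phi_c$ for the per-cycle error probability (a symbol chosen here) and assumes errors are independent across cycles and that a molecule stays misphased once it has slipped. The numerical values are purely illustrative.

```latex
P(\text{correctly phased at cycle } i)
  = \prod_{j=1}^{i} (1 - \phi_c)
  = (1 - \phi_c)^{i},
\qquad \text{e.g. } \phi_c = 0.01,\ i = 36:\quad 0.99^{36} \approx 0.70 .
```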
The “drop-off” problem
[Figure: a strand (A C T G A A ...) at cycle 4, in two panels: ideal, dropped off. Caption: sequencing reactions terminated, perhaps due to failure of block release.]
Drop-off problem: solution
• Assume a probability, written $\delta$ here, of dropping off at a given cycle i, constant over all cycles and clusters.
• This implies a probability of $(1 - \delta)^i$ of not having dropped off before cycle i.
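A small simulation can check the geometric survival formula against a cluster of many molecules. This is only a sketch: the drop-off probability, cycle count and cluster size are made-up values, and `surviving_fraction` is a hypothetical helper, not part of any pipeline.

```python
import random

def surviving_fraction(drop_off, cycles, molecules, seed=0):
    """Simulate independent molecules; each active molecule drops off
    with probability drop_off at every cycle. Returns the fraction
    still active after the given number of cycles."""
    rng = random.Random(seed)
    alive = molecules
    for _cycle in range(cycles):
        # Count the molecules that survive this cycle.
        alive = sum(1 for _m in range(alive) if rng.random() >= drop_off)
    return alive / molecules

drop_off = 0.02   # hypothetical per-cycle drop-off probability
cycles = 36       # hypothetical read length
simulated = surviving_fraction(drop_off, cycles, molecules=20000)
formula = (1.0 - drop_off) ** cycles
print(f"simulated {simulated:.3f}, formula {formula:.3f}")
```

With a large number of molecules the simulated fraction should sit close to the formula, which illustrates why cluster-level intensity decays smoothly across cycles.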
Putting it all together
• We do not know when a molecule becomes misphased or drops off. We integrate over these events.
• Many identical molecules in each cluster: assume their independence, motivating normal theory.
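The resulting model of the mean intensity vector
$$x_{ic} = (x_{Aic}, x_{Cic}, x_{Gic}, x_{Tic})'$$
at cycle i in cluster c, when the true base is b, is:
$$x_{ic} \mid b \sim N(\mu_{ib}, V_{ib})$$
where $\mu_{ib}$ and $V_{ib}$ combine the phasing and drop-off probabilities above with fixed parameters, a cluster-specific parameter and the known base frequency.
[Equations defining $\mu_{ib}$ and $V_{ib}$: not recoverable from the transcript.]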
Base-calling
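Posterior probability that cluster c at cycle i has base b is:
$$p(b \mid x_{ic}) = \frac{\pi_b\, p(x_{ic} \mid b, \hat{\theta}, \hat{\phi}_c)}{\sum_{b'} \pi_{b'}\, p(x_{ic} \mid b', \hat{\theta}, \hat{\phi}_c)}$$
where
$$x_{ic} \mid b \sim N(\mu_{ib}, V_{ib})$$
as described above. Here $\pi_b$ is the known base frequency, $\hat{\theta}$ the estimated fixed parameters and $\hat{\phi}_c$ the estimated cluster-specific parameter (the symbol names are stand-ins; the originals are lost in the transcript).
Call b to maximise this posterior.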
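The posterior calculation can be sketched with NumPy. This is an illustration under stated assumptions, not the authors' code: the mean vectors, variance matrices and base frequencies below are invented numbers standing in for fitted values, and the dependence on cycle and cluster is dropped for brevity. `log_mvn_pdf` and `call_base` are hypothetical helper names.

```python
import numpy as np

BASES = "ACGT"

def log_mvn_pdf(x, mean, cov):
    """Log-density of a multivariate normal N(mean, cov) evaluated at x."""
    diff = x - mean
    _sign, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + quad)

def call_base(x, means, covs, base_freq):
    """Return (called base, posterior probabilities) via Bayes' rule."""
    log_post = np.array([
        np.log(base_freq[b]) + log_mvn_pdf(x, means[b], covs[b])
        for b in BASES
    ])
    log_post -= log_post.max()      # stabilise before exponentiating
    post = np.exp(log_post)
    post /= post.sum()              # normalise over the four bases
    return BASES[int(np.argmax(post))], dict(zip(BASES, post))

# Invented parameters: each base also leaks onto one neighbouring dye.
means = {
    "A": np.array([1.0, 0.4, 0.0, 0.0]),
    "C": np.array([0.3, 1.0, 0.0, 0.0]),
    "G": np.array([0.0, 0.0, 1.0, 0.4]),
    "T": np.array([0.0, 0.0, 0.3, 1.0]),
}
covs = {b: 0.05 * np.eye(4) for b in BASES}
base_freq = {b: 0.25 for b in BASES}

x = np.array([0.9, 0.5, 0.05, 0.0])   # intensities on the A, C, G, T dyes
base, post = call_base(x, means, covs, base_freq)
print(base, post)
```

The maximum posterior value doubles as the probability that the call is correct, which is the quantity a quality score would be derived from.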
BLASTing reads
• Study should be designed with many replicates
• BLAST is used to group similar reads
• A consensus sequence is called for each group
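Calling a consensus for one group of reads can be sketched as a per-position majority vote. This assumes the reads in a group are already aligned and of equal length, which is a simplification; the grouping by BLAST is not shown, and the slides do not say how the consensus was actually computed. `consensus` is a hypothetical helper.

```python
from collections import Counter

def consensus(reads):
    """Majority-vote consensus of equal-length, pre-aligned reads."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must all have the same length")
    # zip(*reads) yields one column of bases per position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )

group = ["ACGT", "ACGA", "ACGT"]
print(consensus(group))
```

The consensus can then be compared base by base with each read to count base-calling errors.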
Conclusion
• Currently, our method performs about as well as the Illumina pipeline.
• Our method produces a posterior probability of correctness of each base call.
• Further work addressing heavy tails in the residuals should improve results.
• Others are trying to estimate the phase at each cycle for each cluster.