Revealing Sequence Variation Patterns in Rice
with Machine Learning Methods
Regina Bohnert
Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany
July 18, 2008
Regina Bohnert (FML) Sequence Variation Patterns in Rice July 18, 2008 1 / 13
Motivation
What distinguishes the sequences of subpopulations with different traits?
Identify sequence variations within one species
Basis for further evolutionary and functional studies
Genome-wide identification of sequence polymorphisms
High-density oligonucleotide microarrays for high-throughput resequencing
Array-based resequencing applied to:
Human (Hinds et al., Science, 2005)
Arabidopsis thaliana (Clark et al., Science, 2007)
Mouse (Frazer et al., Nature, 2007)
Oryza sativa (rice)
Oryza sativa
Prominent model organism
Most important food source
Representative of grass family
Closely related to other cereals
372 Mb genome on 12 chromosomes
Challenges relative to A. thaliana
Different experimental design
Data not as clean
No gold standard set of labelled sequences
Image: from Koehler’s Medicinal-Plants, 1887
The Resequencing Data – Tiling Arrays
[Figure: array surface and hybridisation signal for two probe quartets, one per strand; each quartet consists of one reference probe and three SNP probes that differ only at the central base]
Probe quartet, one strand:
TACCGGTCGGAAGTCGATCGGTTGA
TACCGGTCGGAACTCGATCGGTTGA
TACCGGTCGGAATTCGATCGGTTGA
TACCGGTCGGAAATCGATCGGTTGA
Target: ATGGCCAGCCTTGAGCTAGCCAACTTGAAT
Probe quartet, opposite strand:
TCAACCGATCGAATTCCGACCGGTA
TCAACCGATCGATTTCCGACCGGTA
TCAACCGATCGAGTTCCGACCGGTA
TCAACCGATCGACTTCCGACCGGTA
Target: AGTTGGCTAGCTCAAGGCTGGCCATAGGTA
Oligonucleotides on glass surface
ssDNA labelled with fluorescence as target DNA
Rice resequencing arrays
Tiling strategy with 1 bp resolution
Each base queried with a forward and a reverse quartet
800 million oligos on 246 arrays for each of the 20 cultivars
∼ 32 % of the genome represented
Target DNA amplified by long-range PCR
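The tiling strategy above can be illustrated with a small sketch that builds the forward and reverse quartet for one queried base. The function names, the 25-mer default and the choice to write probes in the orientation of the strand they are read from are assumptions made for illustration; the actual array design is not described at this level of detail in the slides.

```python
# Hypothetical helper names; a sketch of quartet construction, not the real array design.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def quartet(reference, pos, probe_len=25):
    """Return {base: probe} for the four probes centred on reference[pos].

    The probe carrying the reference base is the reference probe,
    the other three are the SNP probes.
    """
    half = probe_len // 2
    if pos - half < 0 or pos + half >= len(reference):
        raise ValueError("position too close to the sequence end")
    window = reference[pos - half : pos + half + 1]
    return {b: window[:half] + b + window[half + 1 :] for b in "ACGT"}


def forward_and_reverse_quartets(reference, pos, probe_len=25):
    """Query the same base on both strands."""
    forward = quartet(reference, pos, probe_len)
    rc = reverse_complement(reference)
    # The queried base sits at the mirrored index on the opposite strand.
    reverse = quartet(rc, len(reference) - 1 - pos, probe_len)
    return forward, reverse


if __name__ == "__main__":
    ref = "TACCGGTCGGAACTCGATCGGTTGA"  # 25-mer taken from the slide
    fwd, rev = forward_and_reverse_quartets(ref, 12)
    for base in "ACGT":
        print(base, fwd[base], rev[base])
```

With the 25-mer from the slide as reference, the generated probes coincide with the eight probe sequences shown in the figure.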
Polymorphism Detection
[Figure: log mean intensity (about 4 to 9) of the A, C, G and T probes along the sequence, one panel each for the reference, Cultivar A (SNP labelled A:C) and Cultivar B (SNPs labelled A:C and A:T)]
Reference:  ATGCTTTCTGGACTTCAGAAAAATACTGTCATCAT
Cultivar A: ATGCTTTCTGGACTTCAGCAAAATACTGTCATCAT
Cultivar B: ATGCTTTCTGGACTTCTGCAAAATACTGTCATCAT
Data analysis challenge
Hybridisation signal dependent on sequence properties of oligomer, repeats, amplicon, cultivar
Measurement noise
⇒ Machine learning based SNP calling
Problematic cases
Highly polymorphic regions
Deletions and insertions
⇒ Margin-based prediction of polymorphic regions
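To make the data analysis challenge concrete, here is a naive baseline that simply calls the base whose probe has the highest intensity in each quartet. This is not the method used in the talk. It is the kind of rule that breaks down under sequence-dependent signal and measurement noise, which is the motivation given above for a learned caller. The function names and the intensity values are hypothetical.

```python
# Naive quartet-based caller; a baseline sketch, not the approach presented in the talk.
def naive_call(quartet_intensity):
    """Return the base whose probe shows the highest (log) intensity."""
    return max("ACGT", key=lambda base: quartet_intensity[base])


def naive_snp_candidates(reference, intensities):
    """Return (position, reference base, called base) wherever the call differs."""
    candidates = []
    for pos, (ref_base, quartet_intensity) in enumerate(zip(reference, intensities)):
        call = naive_call(quartet_intensity)
        if call != ref_base:
            candidates.append((pos, ref_base, call))
    return candidates


if __name__ == "__main__":
    reference = "AGA"
    # Made-up log mean intensities, one quartet per reference position.
    intensities = [
        {"A": 8.5, "C": 5.0, "G": 5.2, "T": 4.9},
        {"A": 5.1, "C": 5.0, "G": 8.1, "T": 5.3},
        {"A": 5.4, "C": 8.3, "G": 5.0, "T": 5.1},
    ]
    print(naive_snp_candidates(reference, intensities))
```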
SNP Calling – ML Approach
[Figure: two-dimensional feature space with axes Feature 1 and Feature 2]
Support Vector Machines (SVM)
Extract features
Array data
Sequence
Repetitiveness
Labelled data generated by sequencing of randomly selected fragments
Apply SVMs using RBF kernel in a two-layered approach
2nd layer exploits information across cultivars
cf. Clark et al., Science, 2007
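As a reminder of what the RBF kernel computes, the sketch below evaluates k(x, y) = exp(-gamma * ||x - y||^2) and the decision value of an already trained kernel SVM. The support vectors, coefficients and gamma are made-up numbers, and the two-layered design from the slide is not reproduced here.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma=1.0):
    """Decision value f(x) = sum_i c_i * k(sv_i, x) + b, where c_i = alpha_i * y_i.

    A positive value is read as the positive class (here: SNP).
    """
    return sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, dual_coefs)
    ) + bias


if __name__ == "__main__":
    # Hypothetical trained model with two support vectors in a 2-D feature space.
    svs = [[0.0, 0.0], [2.0, 2.0]]
    coefs = [1.0, -1.0]
    print(svm_decision([0.1, 0.2], svs, coefs, bias=0.0))
    print(svm_decision([1.9, 2.1], svs, coefs, bias=0.0))
```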
SNP Calling – Results
Predicted SNPs
∼ 1.2 M MB SNP calls
∼ 1.3 M ML SNP calls
∼ 760,000 SNPs in MB∩ML at ∼ 160,000 positions

         Recall   Precision
MB       14 %     91 %
ML       21 %     92 %
MB∩ML    11 %     97 %

[Figure: precision [%] (77 to 95) versus recall [%] (13 to 30) for MB and ML at noncoding, coding and all sites]
MB: Model based approach by Perlegen Sciences
ML: Proposed machine learning approach
Precision = TP / (TP + FP), Recall = TP / P
Visit Poster M09 and P28
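The two measures above, and the effect of intersecting two call sets, can be sketched as follows. The position sets are toy values chosen for illustration and are unrelated to the rice data; P is taken to be the number of known true SNPs.

```python
def precision_recall(predicted, known_positive):
    """Precision = TP / (TP + FP), Recall = TP / P, computed on sets of positions."""
    predicted, known = set(predicted), set(known_positive)
    tp = len(predicted & known)
    fp = len(predicted - known)
    precision = tp / (tp + fp) if predicted else float("nan")
    recall = tp / len(known) if known else float("nan")
    return precision, recall


if __name__ == "__main__":
    known = {10, 20, 30, 40, 50}   # toy set of true SNP positions
    mb = {10, 20, 99}              # toy model-based calls
    ml = {10, 20, 30, 77}          # toy machine-learning calls
    for name, calls in (("MB", mb), ("ML", ml), ("MB∩ML", mb & ml)):
        precision, recall = precision_recall(calls, known)
        print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the intersection keeps only calls made by both methods, so precision rises while recall cannot exceed that of either method, the same trade-off visible in the table.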
SNP Calling – Database
http://irfgc.irri.org/cgi-bin/gbrowse/oryzasnp10/
Predicting Polymorphic Regions
[Figure: log max intensity (about 6 to 12) along roughly 500 bp for the reference and Cultivar A, with tracks for labels (not polymorphic, PR), predictions (PR) and known polymorphisms (SNP, insertion, deletion)]
Difficulties when SNPs occur in close vicinity
Approach: Predict genomic segments with a label sequence learning algorithm
Conserved and polymorphic regions (PRs)
cf. Zeller et al., Genome Research, 2008
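The idea of labelling whole segments rather than single positions can be illustrated with a two-state Viterbi decoder. This is a simplified stand-in, not the label sequence learning method cited above: the per-position scores, the symmetric emission scheme and the switch penalty are all assumptions made for the sketch.

```python
def viterbi_two_state(pr_scores, switch_penalty=2.0):
    """Label each position 0 (conserved) or 1 (polymorphic region).

    A position with score s contributes +s if labelled 1 and -s if labelled 0;
    every change of label along the sequence costs switch_penalty.
    """
    if not pr_scores:
        return []
    first = pr_scores[0]
    best = [-first, first]      # best path score ending in state 0 / state 1
    back = []                   # back[i][s] = predecessor of state s at position i + 1
    for s in pr_scores[1:]:
        new_best, ptr = [], []
        for state, emit in ((0, -s), (1, s)):
            stay = best[state]
            switch = best[1 - state] - switch_penalty
            if stay >= switch:
                new_best.append(stay + emit)
                ptr.append(state)
            else:
                new_best.append(switch + emit)
                ptr.append(1 - state)
        best = new_best
        back.append(ptr)
    # Trace the best path backwards from the better final state.
    state = 0 if best[0] >= best[1] else 1
    labels = [state]
    for ptr in reversed(back):
        state = ptr[state]
        labels.append(state)
    labels.reverse()
    return labels


if __name__ == "__main__":
    print(viterbi_two_state([-1, -1, 1, 1, 1, -1, -1], switch_penalty=0.5))
    print(viterbi_two_state([-1, -1, 0.5, -1, -1], switch_penalty=2.0))
```

The switch penalty is what turns per-position evidence into contiguous segments: a single weakly positive position is not enough to open a polymorphic region.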
Predicting PRs – Results
Between ∼ 65,000 and ∼ 203,000 PRs predicted per cultivar
∼ 27 % recall at a precision of 80 %
Between 1.7 % and 5.1 % of the genome covered
[Figure: precision versus recall (both 0 to 1) for all, coding, UTRs + introns and intergenic regions]
Precision = TP / (TP + FP), Recall = TP / P
Long Deletion Example
Disease resistance protein SlVe1 precursor on chromosome 12
[Figure: chromosome 12, positions 6,222,000 to 6,228,000, with one track per cultivar (Aswina, Azucena, Cypress, Dom_Sufid, Dular, FR 13A, IR 64, LTH, M 202, Minghui 63, Moroberekan, N 22, Pokkali, Rayada, SHZ2, Sadu Cho, Swarna, Tainung 67, Zhenshan 97) and a gene track showing Os12g0217300]
Conclusions
Created the first whole-genome inventory of polymorphisms for rice
Highly polymorphic: ∼ 0.2 % in SNPs, ∼ 2.4 % in PRPs
Intersection of MB and ML calls provides highly reliable SNP predictions
Used to genotype many more rice cultivars
Polymorphic region predictions
Important for more detailed analyses (e.g. dideoxy sequencing)
Useful for primer design to increase PCR success rates
Acknowledgements
Friedrich Miescher Laboratory
Gunnar Rätsch
Georg Zeller
Gabriele Schweikert
MPI for Developmental Biology
Detlef Weigel
Richard Clark
Michigan State University, USA
Robin Buell
Kevin Childs
Perlegen Sciences, USA
Renee Stokowski
Dennis Ballinger
Kelly Frazer
David Cox
IRRI, The Philippines
Kenneth McNally
Victor Ulat
Hei Leung
Colorado State University, USA
Jan Leach