Revealing Sequence Variation Patterns in Rice

with Machine Learning Methods

Regina Bohnert

Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany

July 18, 2008

Regina Bohnert (FML) Sequence Variation Patterns in Rice July 18, 2008 1 / 13

Motivation

What distinguishes the sequences of subpopulations with different traits?

Identify sequence variations within one species

Basis for further evolutionary and functional studies

Genome-wide identification of sequence polymorphisms

High-density oligonucleotide microarrays for high-throughput resequencing

Array-based resequencing applied for

Human (Hinds et al., Science, 2005)
Arabidopsis thaliana (Clark et al., Science, 2007)
Mouse (Frazer et al., Nature, 2007)
Oryza sativa (rice)

Oryza sativa

Prominent model organism

Most important food source

Representative of grass family

Closely related to other cereals

372 Mb genome on 12 chromosomes

Challenges relative to A. thaliana

Different experimental design

Data not as clean

No gold standard set of labelled sequences

from Koehler’s Medicinal-Plants, 1887

The Resequencing Data – Tiling Arrays

[Figure: array surface and hybridisation signal. Each base is queried by a quartet of 25-mer probes that differ only in the central base (A, C, G or T): one reference probe and three SNP probes per strand.]

Forward quartet, aligned to the target DNA:

TACCGGTCGGAAGTCGATCGGTTGA
TACCGGTCGGAACTCGATCGGTTGA
TACCGGTCGGAATTCGATCGGTTGA
TACCGGTCGGAAATCGATCGGTTGA
|||||||||||||||||||||||||
ATGGCCAGCCTTGAGCTAGCCAACTTGAAT

Reverse quartet, aligned to the target DNA:

TCAACCGATCGAATTCCGACCGGTA
TCAACCGATCGATTTCCGACCGGTA
TCAACCGATCGAGTTCCGACCGGTA
TCAACCGATCGACTTCCGACCGGTA
|||||||||||||||||||||||||
AGTTGGCTAGCTCAAGGCTGGCCATAGGTA

Oligonucleotides on glass surface

ssDNA labelled with fluorescence as target DNA

Rice resequencing arrays

Tiling strategy with 1 bp resolution

Each base queried with a forward and reverse quartet

800 million oligos on 246 arrays for each of the 20 cultivars

∼ 32 % of the genome represented

Target DNA amplified by long-range PCR
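
The quartet design can be illustrated with a short sketch. It treats the 25-mer from the figure as a window on one strand and derives the forward and reverse quartets by varying the central base. The function names and the strand convention are assumptions made for illustration, not part of any array design software.

```python
# Illustrative sketch only: names and strand convention are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def probe_quartet(window):
    """Return four probes that differ only in the central base of `window`."""
    if len(window) % 2 == 0:
        raise ValueError("window length must be odd to have a central base")
    mid = len(window) // 2
    return {base: window[:mid] + base + window[mid + 1:] for base in "ACGT"}


def quartets_for_position(strand, pos, length=25):
    """Build the forward and reverse quartets querying `strand[pos]`."""
    half = length // 2
    if pos < half or pos + half >= len(strand):
        raise ValueError("position too close to the end of the sequence")
    window = strand[pos - half: pos + half + 1]
    forward = probe_quartet(window)
    reverse = probe_quartet(reverse_complement(window))
    return forward, reverse


# The 25-mer with a central C, taken from the figure above.
SLIDE_WINDOW = "TACCGGTCGGAACTCGATCGGTTGA"
forward, reverse = quartets_for_position(SLIDE_WINDOW, 12)
for base in "ACGT":
    print(base, forward[base], reverse[base])
```

Applied to the sequence from the figure, the reverse quartet built this way reproduces the four reverse probes shown there, which suggests that the two quartets interrogate the same position on opposite strands.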

Polymorphism Detection

[Figure: log mean intensity (axis 4 to 9) of the A, C, G and T probes along the sequence, shown for the reference, cultivar A (SNP label A:C) and cultivar B (SNP labels A:C and A:T).]

Reference:  A T G C T T T C T G G A C T T C A G A A A A A T A C T G T C A T C A T
Cultivar A: A T G C T T T C T G G A C T T C A G C A A A A T A C T G T C A T C A T
Cultivar B: A T G C T T T C T G G A C T T C T G C A A A A T A C T G T C A T C A T

Data analysis challenge

Hybridisation signal dependent on sequence properties of oligomer, repeats, amplicon, cultivar

Measurement noise

⇒ Machine learning based SNP calling

Problematic cases

Highly polymorphic regions

Deletions and insertions

⇒ Margin-based prediction of polymorphic regions
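
As a small worked example, the three sequences from the figure can be compared position by position to list the substitutions that the A:C and A:T labels refer to. This is only a sketch on already known sequences. On the arrays, the cultivar sequence is not available and has to be inferred from noisy intensities, which is the actual difficulty. The function name `list_snps` is hypothetical.

```python
def list_snps(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Sequences copied from the slide; spaces are removed before comparison.
REFERENCE = "A T G C T T T C T G G A C T T C A G A A A A A T A C T G T C A T C A T".replace(" ", "")
CULTIVAR_A = "A T G C T T T C T G G A C T T C A G C A A A A T A C T G T C A T C A T".replace(" ", "")
CULTIVAR_B = "A T G C T T T C T G G A C T T C T G C A A A A T A C T G T C A T C A T".replace(" ", "")

for name, sequence in (("Cultivar A", CULTIVAR_A), ("Cultivar B", CULTIVAR_B)):
    for position, ref_base, alt_base in list_snps(REFERENCE, sequence):
        print(f"{name}: position {position} {ref_base}:{alt_base}")
```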

SNP Calling – ML Approach

[Figure: scatter plot with axes Feature 1 and Feature 2]

Support Vector Machines (SVM)

Extract features

Array data
Sequence
Repetitiveness

Labelled data generated by sequencing of randomly selected fragments

Apply SVMs using RBF kernel in a two-layered approach

2nd layer exploits information across cultivars

cf. Clark et al., Science, 2007
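
The following is a minimal sketch of the two ingredients named above: the decision value of an SVM with an RBF kernel, and one conceivable way of feeding first-layer scores from several cultivars into a second layer. The slides do not specify the second-layer features, so `second_layer_input` and its summary statistics are purely hypothetical, as are the support vectors and coefficients used in any call.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_score(x, support_vectors, coefficients, bias, gamma):
    """Decision value f(x) = sum_i c_i * k(sv_i, x) + bias, with c_i = alpha_i * y_i."""
    return bias + sum(
        c * rbf_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, coefficients)
    )


def second_layer_input(first_layer_scores, cultivar):
    """Hypothetical second-layer features for one position and one cultivar.

    Combines the cultivar's own first-layer score with the maximum and mean
    score of the other cultivars at the same position (an assumption).
    """
    own = first_layer_scores[cultivar]
    others = [s for name, s in first_layer_scores.items() if name != cultivar]
    if not others:
        raise ValueError("at least two cultivars are required")
    return [own, max(others), sum(others) / len(others)]
```

The intuition behind a second layer of this kind is that a weak call in one cultivar becomes more plausible when other cultivars show a strong signal at the same position.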

SNP Calling – Results

Predicted SNPs

∼ 1.2 M MB SNP calls

∼ 1.3 M ML SNP calls

∼ 760,000 SNPs in MB∩ML at ∼ 160,000 positions

         Recall   Precision
MB       14 %     91 %
ML       21 %     92 %
MB∩ML    11 %     97 %

[Figure: precision [%] (axis 77 to 95) plotted against recall [%] (axis 13 to 30) for noncoding, coding and all sites, with MB and ML marked.]

MB: Model-based approach by Perlegen Sciences

ML: Proposed machine learning approach

$\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{P}$

Visit Poster M09 and P28
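
The two evaluation measures can be written out as a small helper. The counts in the example call are invented for illustration and are not taken from the study. `positives` stands for P, the number of true SNPs in the labelled data.

```python
def precision_recall(true_positives, false_positives, positives):
    """Return (precision, recall) with precision = TP / (TP + FP), recall = TP / P."""
    called = true_positives + false_positives
    if called == 0 or positives == 0:
        raise ValueError("need at least one call and one known positive")
    return true_positives / called, true_positives / positives


# Hypothetical counts: 100 calls, 92 of them correct, 400 known SNPs.
precision, recall = precision_recall(true_positives=92, false_positives=8, positives=400)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Requiring a call from both methods can only shrink the set of calls, so recall cannot rise above that of either method, while precision may improve, as in the MB∩ML row of the table.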

SNP Calling – Database

http://irfgc.irri.org/cgi-bin/gbrowse/oryzasnp10/

Predicting Polymorphic Regions

[Figure: log max intensity (axis 6 to 12) over a window of about 500 bp for the reference and cultivar A, with tracks for labels (PR, not polymorphic), predictions (PR) and known polymorphisms (SNP, insertion, deletion).]

Difficulties when SNPs occur in close vicinity of other polymorphisms

Approach: Predict genomic segments with a label sequence learning algorithm

Conserved regions
Polymorphic regions (PRs)

cf. Zeller et al., Genome Research, 2008
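
To make the idea of labelling whole segments concrete, here is a deliberately simplified sketch: a two-state Viterbi decoder that assigns "conserved" or "polymorphic" to every position, given one score per position and a penalty for switching state. It is a stand-in for illustration only and not the margin-based method referred to on the slide. The scores and the `switch_penalty` parameter are assumptions.

```python
def segment(scores, switch_penalty=2.0):
    """Label each position as 'conserved' or 'polymorphic'.

    scores[i] > 0 favours 'polymorphic', scores[i] < 0 favours 'conserved'.
    The returned path maximises the summed per-position evidence minus
    `switch_penalty` for every change of state (Viterbi decoding).
    """
    if not scores:
        return []
    states = ("conserved", "polymorphic")

    def evidence(state, score):
        return score if state == "polymorphic" else -score

    best = {state: evidence(state, scores[0]) for state in states}
    backpointers = []
    for score in scores[1:]:
        new_best, pointers = {}, {}
        for state in states:
            candidates = {
                prev: best[prev] - (switch_penalty if prev != state else 0.0)
                for prev in states
            }
            prev = max(candidates, key=candidates.get)
            new_best[state] = candidates[prev] + evidence(state, score)
            pointers[state] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace the best path back from the final position.
    state = max(best, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical scores: a weak dip inside a region of strong evidence.
print(segment([-1.0, -1.0, 3.0, -0.5, 3.0, -1.0, -1.0]))
```

With a positive switch penalty the weak dip in the middle is absorbed into one contiguous polymorphic segment, whereas independent per-position decisions would split it.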

Predicting PRs – Results

Between ∼ 65,000 and ∼ 203,000 PRs predicted per cultivar

∼ 27 % recall at a precision of 80 %

Between 1.7 % and 5.1 % of the genome covered

[Figure: precision plotted against recall (both axes 0 to 1) for all, coding, UTRs + introns, and intergenic regions.]

$\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{P}$
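
A statement such as "recall at a precision of 80 %" corresponds to one point on a precision-recall curve. The sketch below shows one way such a point could be read off from scored predictions by sweeping the decision threshold. The data are invented, scores are assumed to be distinct, and the helper name is hypothetical.

```python
def recall_at_precision(scored, positives, min_precision):
    """Highest recall reachable while precision stays at or above `min_precision`.

    scored: list of (score, is_true) pairs, one per prediction.
    positives: total number of true items P, including those never predicted.
    """
    best_recall = 0.0
    true_positives = false_positives = 0
    # Lowering the threshold admits predictions in order of decreasing score.
    for _, is_true in sorted(scored, key=lambda pair: pair[0], reverse=True):
        if is_true:
            true_positives += 1
        else:
            false_positives += 1
        precision = true_positives / (true_positives + false_positives)
        if precision >= min_precision:
            best_recall = max(best_recall, true_positives / positives)
    return best_recall


# Hypothetical predictions; two further true items were never predicted (P = 5).
PREDICTIONS = [(0.9, True), (0.8, True), (0.7, False),
               (0.6, True), (0.5, False), (0.4, False)]
print(recall_at_precision(PREDICTIONS, positives=5, min_precision=0.8))
```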

Long Deletion Example

Disease resistance protein SlVe1 precursor on chromosome 12

[Figure: genome browser view of chromosome 12, positions 6,222,000 to 6,228,000, with one track per cultivar and a gene track.]

Cultivar tracks: Aswina, Azucena, Cypress, Dom_Sufid, Dular, FR 13A, IR 64, LTH, M 202, Minghui 63, Moroberekan, N 22, Pokkali, Rayada, SHZ2, Sadu Cho, Swarna, Tainung 67, Zhenshan 97

Genes: Os12g0217300

Conclusions

Created the first whole-genome inventory of polymorphisms for rice

Highly polymorphic: ∼ 0.2 % in SNPs, ∼ 2.4 % in PRPs

Intersection of MB and ML calls provides highly reliable SNP predictions

Used to genotype many more rice cultivars

Polymorphic region predictions

Important for more detailed analyses (e.g. dideoxy sequencing)
Useful for primer design to increase PCR success rates
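
Intersecting the two call sets can be sketched with plain set operations. The record layout (cultivar, chromosome, position, allele) and the example calls are assumptions made for illustration. The sketch also shows why the number of shared SNP calls can exceed the number of distinct positions: the same position may be called in several cultivars.

```python
def intersect_calls(mb_calls, ml_calls):
    """Return the calls made by both methods and the distinct positions they cover.

    Each call is a tuple (cultivar, chromosome, position, allele); this layout
    is hypothetical.
    """
    shared = set(mb_calls) & set(ml_calls)
    positions = {(chromosome, position) for _, chromosome, position, _ in shared}
    return shared, positions


# Invented example calls.
MB_CALLS = {("IR 64", 1, 100, "C"), ("N 22", 1, 100, "C"), ("IR 64", 2, 50, "T")}
ML_CALLS = {("IR 64", 1, 100, "C"), ("N 22", 1, 100, "C"), ("Pokkali", 3, 7, "G")}

shared, positions = intersect_calls(MB_CALLS, ML_CALLS)
print(len(shared), "shared calls at", len(positions), "position(s)")
```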

Acknowledgements

Friedrich Miescher Laboratory

Gunnar Rätsch

Georg Zeller

Gabriele Schweikert

MPI for Developmental Biology

Detlef Weigel

Richard Clark

Michigan State University, USA

Robin Buell

Kevin Childs

Perlegen Sciences, USA

Renee Stokowski

Dennis Ballinger

Kelly Frazer

David Cox

IRRI, The Philippines

Kenneth McNally

Victor Ulat

Hei Leung

Colorado State University, USA

Jan Leach
