Integration of Genomic Data using Spotfire
Harry Harlow, Ph.D.
Director, Advanced Data Analysis
Monsanto, Genomics Group
St. Louis, Missouri
Talk Outline
• About Monsanto
• Quick Primer (TxP, MxP, Bioinformatics)
• Example Experiments
• Pushing the Envelope
• Conclusions and Acknowledgements
Monsanto
• Biotech company focused on agriculture
• Products
  – Roundup® herbicide (rapidly degrading glyphosate)
  – Seeds (Corn, Cotton, Soy, Canola, Potato, Sugar Beet)
    • Roundup Ready® seeds (glyphosate tolerant)
    • Insect resistant seeds
  – Animal Products
    • Posilac® bovine somatotropin (dairy cow productivity)
    • Swine Genetics
• A great place to work
Integration of Genomic and Breeding Data is Essential to Find the Genes that will Become our Products
[Figure: diagram linking performance information, phenotype information, pedigree information, and gene, protein, and metabolite expression. Trait labels: oil quality, yield, drought stress, disease, maturity.]
Key Requirement for Success: The Designed Experiment
• Begin with the end in mind
• Responsive biological system (the biological “model”)
• Know your limitations
  – Analytic methods
  – Limits of “rotten” luck
  – Data combinability
  – Ability to analyze the results
  – Data coherency
• Robust and understandable DOE
• Informatics, biostatistics, and computer science
  – data capture
  – data reduction
  – analysis
  – visualization
  – lead follow-up
• Sufficient funding to conduct meaningful experiments
• Leadership support
Transcriptional Profiling produces a lot of data
• Typical experiment uses between 10 and 250 microarrays
  – each chip has 6,000 to 26,000 data vectors
  – data is relative, and either 1 or 2 channels
  – noise/error is a function of signal intensity, but researchers like results as a fold change ratio
• Typical experiment has ~1,000,000 data vectors for TxP results alone
• Is most useful combined with phenotype
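The ranges quoted on this slide imply a wide spread in data volume. As a rough check, the following minimal Python sketch multiplies the number of arrays by the data vectors per chip; the function name is illustrative and not from the talk.

```python
# Rough data-volume estimate for a transcriptional profiling (TxP) experiment.
# The ranges come from the slide; the function name is illustrative only.

def data_vector_range(n_units_min, n_units_max, per_unit_min, per_unit_max):
    """Return (smallest, largest) number of data vectors in an experiment."""
    return n_units_min * per_unit_min, n_units_max * per_unit_max

# 10 to 250 microarrays, each with 6,000 to 26,000 data vectors.
txp_low, txp_high = data_vector_range(10, 250, 6_000, 26_000)
print(f"TxP data vectors: {txp_low:,} to {txp_high:,}")
```

The quoted figure of about 1,000,000 data vectors sits inside this range.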
Process of high-throughput gene expression analysis
[Figure: process flow diagram. Labels: cDNA libraries; sequencing in each well; unigene libraries; amplification of cDNAs; cDNA vector; array cDNA; cDNA microarray; cDNA nylon array; hybridization; profiling analysis; database; probes; important genes; gene interactions; products.]
Metabolic Profiling Sample Analysis
[Figure: sample analysis by MS and two HPLC systems (RP column and NP column).]
Metabolite Profiling Data
• Typical experiment uses between 100 and 2,000 assays
  – each assay has 50 to 200 data vectors
  – data is relative
  – noise/error is a function of signal intensity, but researchers like fold change
• Typical experiment has ~100,000 data vectors
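Because the data are relative and researchers prefer fold change, results are often reported on a log2 scale, where a doubling and a halving are symmetric (+1 and -1). The sketch below is a minimal illustration with hypothetical signal values. It also computes the average log2 intensity, which is one way to inspect how noise varies with signal. The function names are assumptions, not part of the talk.

```python
import numpy as np

def log2_fold_change(treated, control):
    """Log2 ratio of treated to control signal; +1 is a doubling, -1 a halving."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    return np.log2(treated / control)

def mean_log2_intensity(treated, control):
    """Average log2 signal, useful for checking how noise varies with intensity."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    return 0.5 * (np.log2(treated) + np.log2(control))

# Hypothetical relative signals for three analytes (not data from the talk).
control = [100.0, 200.0, 4000.0]
treated = [200.0, 100.0, 4000.0]
lfc = log2_fold_change(treated, control)
intensity = mean_log2_intensity(treated, control)
print(lfc, intensity)
```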
Bioinformatics & Gene Annotation: From Sequence to Function
[Figure: annotation workflow from sequence to function. Recoverable labels:]
• Sequencing: EST, full-length, genomic, unigene
• Comparisons to experiment: transcription regulation (gene expression); mapping (molecular markers); mutation (gene machines); protein expression and function (proteomics)
• Annotation
  – Similarity searches: BLAST, PSI-BLAST, FASTA, Smith-Waterman, FrameSearch
  – Protein domain/motif: Prosite/Profiles, Teiresias, Hidden Markov Models (Pfam), Clusters of orthologs (COGs)
  – Phylogeny
  – Machine learning (SVM)
  – Structural prediction: homology modeling, threading, etc.
  – Gene prediction (Gene A, Gene B, Gene C on a chromosome)
• Functional classification: metabolic pathway; regulatory pathway; protein interaction networks; comparative genomics (evolution of genes & species)
• Databases
• Key areas: tool integration, data integration, data mining, data visualization
Overall data for a typical experiment
• Phenotypic data: 10’s-1000’s of samples
  – 10’s of measurements, including growth rates, respiration, and physical measurements
  – Various tissues and time points (repeated measures)
• Performance data: 10’s-100’s of samples
  – yield information like seed number, morphology, and Kg/Ha
• Genomics information
  – MxP: 100,000’s of data vectors
  – TxP: 1,000,000’s of data vectors
  – Gene annotation: 100’s of pieces of evidence per gene
Genomics Experimental Classifications
1. A:B experiments
2. Correlation experiments
3. Perturbation experiments - Cholesterol inhibition
4. Developmental profile experiments - Soy embryo development
5. Composite designs (a convolution of a developmental profile with a response surface experimental design)
Example Experiments:
• Maize Experiment (Composite)
• Soybean Embryo (Developmental)
• Yeast (Biosynthetic Perturbation)
Experimental Goal
• Identify genes and metabolites that correlate with kernel properties and yield
• Develop an understanding of maize plant and seed development
• Identify developmental points that impact kernel composition
Repeated measures model
Model:
$$Y_{ikl} = \mu + b_k + \mathrm{line}_i + m_{ik} + \mathrm{time}_l + (\mathrm{line} \times \mathrm{time})_{il} + e_{ikl}$$
where $b_k$ = block error, $m_{ik}$ = whole-plot error, and $e_{ikl}$ = random correlated error.
• Error is assumed independent
• Consequences of assuming “independent” error when it is “correlated” include overestimating the statistical significance of an effect
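The warning about correlated error can be illustrated with a small simulation. The sketch below is not the model fitted in the talk. It assumes a first-order autoregressive, AR(1), error structure with a hypothetical correlation of 0.8 between consecutive time points, and compares the standard error you would compute under independence with the actual spread of the time-averaged value.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_errors(n_series, n_times, rho, rng):
    """Simulate AR(1) errors with unit marginal variance, one row per series."""
    e = np.empty((n_series, n_times))
    e[:, 0] = rng.standard_normal(n_series)
    innov_sd = np.sqrt(1.0 - rho**2)
    for t in range(1, n_times):
        e[:, t] = rho * e[:, t - 1] + innov_sd * rng.standard_normal(n_series)
    return e

# Hypothetical settings: 10 repeated measures, correlation 0.8 between neighbors.
n_sim, n_times, rho = 5000, 10, 0.8
errors = ar1_errors(n_sim, n_times, rho, rng)

# Standard error of the time-averaged value if errors were independent.
naive_se = errors.std(axis=1, ddof=1).mean() / np.sqrt(n_times)
# Actual spread of the time-averaged value across the simulated series.
actual_se = errors.mean(axis=1).std(ddof=1)
print(f"naive SE: {naive_se:.2f}, actual SE: {actual_se:.2f}")
```

Under these assumptions the naive standard error is much smaller than the actual one, so a test built on it would overstate significance.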
Example Data for a Developmental Curve for a Single Tissue
[Figure: transcriptional data for the hybrid control and the experimental line; axis labels: Degree of Maturity and Relative Age (0 to 1). Inset scatter plot of tDate vs. AvgOfNitrogen for tissue 02 Low Leaf, sampled 06/21/00 to 09/20/00.]
Example data: One analyte during development
[Figure: three scatter plots of tDate vs. AvgOfSumPPM (PPM over time, 06/21/00 to 09/20/00) for tissues 01 Root, 04 Low Internode, and 08 High Leaf.]
Correlation Analysis: MxP vs. TxP (Genes by Metabolites)
[Figure: correlation matrix with genes on one axis and metabolites on the other.]
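A genes-by-metabolites correlation matrix can be computed directly when both data sets are measured on the same samples. The following is a minimal sketch using Pearson correlation on synthetic data. The choice of Pearson correlation, the function name, and the data are assumptions for illustration, since the talk does not specify the statistic used.

```python
import numpy as np

def cross_correlation(genes, metabolites):
    """Pearson correlation of every gene with every metabolite.

    genes:       array of shape (n_samples, n_genes)
    metabolites: array of shape (n_samples, n_metabolites)
    returns:     array of shape (n_genes, n_metabolites)
    """
    g = np.asarray(genes, dtype=float)
    m = np.asarray(metabolites, dtype=float)
    # Standardize each column, then average the products over samples.
    g = (g - g.mean(axis=0)) / g.std(axis=0)
    m = (m - m.mean(axis=0)) / m.std(axis=0)
    return g.T @ m / g.shape[0]

# Synthetic example: metabolite 0 tracks gene 0, metabolite 1 opposes gene 1.
rng = np.random.default_rng(1)
genes = rng.normal(size=(12, 3))
metabolites = np.column_stack([2.0 * genes[:, 0] + 1.0, -genes[:, 1]])
corr = cross_correlation(genes, metabolites)
print(corr.round(2))
```

With ~1,000,000 TxP and ~100,000 MxP data vectors per experiment, the full matrix is large, which is one reason a visual overview is useful.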
Integration of Genomic and Breeding Data is Essential to Find the Genes that will Become our Products
[Figure: repeated from the earlier slide of the same title.]
Utilization of a priori knowledge and a custom weighting scheme statistic
• Dealt with multidimensional data of various dimensions (missing data)
• Generates a “weighting” or probability density function (PDF)
• In this case, we matched TxP/MxP results with the known genetics
• Started with p-values for the TxP/MxP results
• Critical values generated by using SAS/Matlab on sampled data
• Formula
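[Formula not recoverable from the extracted slide. Legible elements: a sum over all interactions of terms in δ_i, δ_j, and δ_ij, combining an exponential (exp) with a term of the form cos(tan⁻¹(…) − π/4); a definition of δ_i in terms of the differences v − v_j, summed over j and divided by n. Slide annotations: “Vector equivalent to ‘z-score’”, “Phase factor”, “Directional consistency”.]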
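The slide notes that critical values were generated from sampled data. The authors' weighting statistic cannot be reproduced here, so the sketch below swaps in a simple stand-in: it estimates a critical value for the absolute Pearson correlation between one gene and one metabolite by permuting sample labels. The statistic, the function name, and the data are all assumptions for illustration.

```python
import numpy as np

def permutation_critical_value(x, y, alpha=0.05, n_perm=2000, seed=0):
    """Empirical (1 - alpha) critical value of |r| under a no-association null."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling y breaks any real pairing between x and y.
        null[i] = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
    return np.quantile(null, 1.0 - alpha)

# Hypothetical gene and metabolite profiles over 20 samples.
rng = np.random.default_rng(42)
gene = np.arange(20, dtype=float)
metabolite = gene + rng.normal(scale=2.0, size=20)
observed = abs(np.corrcoef(gene, metabolite)[0, 1])
critical = permutation_critical_value(gene, metabolite)
print(f"observed |r| = {observed:.2f}, critical value = {critical:.2f}")
```

The same resampling idea applies to any statistic whose null distribution is awkward to derive analytically.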
A visualization of the weighting scheme in 3D: Two views
[Figure: two 3D scatter plots. Legend: consistent with genetics; borderline consistent.]
An example weighting scheme in three dimensions
• Intensity values generated on a regular spherical lattice
• Points are sized and colored according to the PDF
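One way to build such a lattice is to step evenly through the two spherical angles and convert to Cartesian coordinates. The sketch below does this on a unit sphere and attaches a weight to each point. The weight function is a hypothetical stand-in for the PDF, which the slide does not define in recoverable form, and the grid sizes are arbitrary.

```python
import numpy as np

def spherical_lattice(n_theta=9, n_phi=16, radius=1.0):
    """Points on a regular latitude/longitude grid of a sphere, as (x, y, z)."""
    theta = np.linspace(0.0, np.pi, n_theta)                      # polar angle
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)    # azimuth
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    xyz = radius * np.stack([np.sin(theta) * np.cos(phi),
                             np.sin(theta) * np.sin(phi),
                             np.cos(theta)], axis=-1)
    return xyz.reshape(-1, 3)

def toy_weight(points, direction):
    """Hypothetical stand-in for the PDF: largest along `direction`."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return np.exp(points @ d - 1.0)   # at most 1 on the unit sphere

points = spherical_lattice()
weights = toy_weight(points, direction=[0.0, 0.0, 1.0])
sizes = 10.0 + 90.0 * weights         # marker sizes for a 3D scatter plot
print(points.shape, weights.min(), weights.max())
```

The grid is regular in angle, so points crowd together near the poles. The `sizes` and `weights` arrays can be passed to any 3D scatter plot as marker size and color.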
Experimental Results
• A few good genes
  – potential new products
• New gene annotation
  – association of previously unknown genes with a phenotype or biochemical process
• Identification of key developmental stages
• Reference database of integrated information
So… we have a few genes. Now what?
• Experimental verification (RT-PCR, Northerns, etc.)
• The bioinformatics work really begins
  – Most genes have limited annotation
    • “the good” annotation (25 to 60% depending on genome)
    • “the bad” junk annotation (25 to 50% “putative protein”)
    • “the ugly” misleading annotation (up to 25% BLAST match)
  – Genes of interest need additional work
• Where oh where are my genes?
  – Microarrays tend to have incomplete gene sets
    • In many research areas custom arrays are needed
    • Tissue-specific “unigene” sets are labor intensive to produce
    • Missing genes identified by doing the experiment
Example #2 Soybean Embryo Development
Experimental Protocol
• Soy plants grown in greenhouse
• Tag flowers after emerging (= day 0 of DAF)
• Collect samples up to 60 DAF and stage them
• Separate seed into embryo, endosperm, and seed coat
• Freeze samples
• RNA extraction, lipid analysis, protein analysis, micronutrient analysis
[Figure: embryo stages 1, 4, 8, 10, 12, and 15.]
Soybean Embryogenesis (from Goldberg et al. 1989)
Regulation of Prevalent mRNA Sequences during Seed Development and Germination (from Goldberg et al. 1989)
Levels of phospholipids and TAG during soybean embryo development
[Figure: phospholipids and TAG as lipids (% by dry weight, 0 to 14) versus development stage, E4L to E15L.]
Phenotypic Data essential for successful TxP
[Figure: protein gel across developing embryo stages, with molecular weight markers at 116, 97.4, 66.3, 55.4, 36.5, 31.0, and 21.5 and bands labeled Lx, α’, α, 11S, and β; micronutrient levels (0 to 400) for stages E5 to E14; oil.]
Transcriptional data shows excellent coherency
[Figure: cy3 vs. cy5 scatter plots for Stage 5 vs. Stage 8 and Stage 8 vs. Stage 5, with samples in natural sample order.]
Statistical Analysis
• Flux analysis of TAG accumulation
• Correlation analysis
• Cluster analysis
[Figure: four scatter plots of expression (axis label “afiBDE”) by sample ID across stages, one per gene family: transcription factors, histones, kinases, and cyclins. Cyclin panel legend:]
– GT-(U97327) calcyclin binding protein [Mus musculus] {cloneid:700851112}
– (AJ011893) cyclin D3.1 protein [Nicotiana tabacum]
– GT-(D50870) mitotic cyclin a2-type [Glycine max] {cloneid:700988642}
– (AJ011892) cyclin D2.1 protein [Nicotiana tabacum]
[Figure: a cluster of 121 genes plotted with TAG and phospholipids across stages 3 to 14; the profile is divided into regions of negative or zero slope, zero slope, and positive slope.]
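The cluster figure partitions a profile into regions of negative, zero, and positive slope. A minimal sketch of that labeling step follows. The stage numbers are taken from the figure axis, while the values, the tolerance, and the function name are hypothetical.

```python
import numpy as np

def classify_slopes(stages, values, tol=0.5):
    """Label each interval between consecutive stages by the sign of its slope."""
    slopes = np.diff(values) / np.diff(stages)
    labels = np.where(slopes > tol, "positive",
                      np.where(slopes < -tol, "negative", "zero"))
    return slopes, labels

# Stage numbers as on the figure axis; profile values are made up.
stages = [3, 4, 5, 7, 10, 11, 13, 14]
values = [0.0, -1.0, -2.0, -2.0, -2.0, 4.0, 16.0, 22.0]
slopes, labels = classify_slopes(stages, values)
for start, end, label in zip(stages[:-1], stages[1:], labels):
    print(f"stage {start} to {end}: {label}")
```

Genes whose interval labels match a target pattern can then be grouped before or after a conventional cluster analysis.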
Flux Analysis
[Figure: lipids (% by dry weight, 0 to 12) versus development stage (4 to 15).]
$$\mathrm{TAG} = \int_{\mathrm{E1L}}^{\mathrm{E15L}} \mathrm{Flux}\; dt$$
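Since TAG is the time integral of flux, the flux can be estimated as the rate of change of TAG between consecutive stages. The sketch below uses hypothetical TAG values and treats each stage as one unit of time, which is an assumption; real stage spacing in days would replace `np.diff(stage)`.

```python
import numpy as np

# Hypothetical TAG levels (% of dry weight) at stages 4 to 15; not the talk's data.
stage = np.arange(4, 16, dtype=float)
tag = np.array([0.5, 0.6, 0.8, 1.2, 2.0, 3.5, 5.5, 7.8, 9.0, 10.0, 10.5, 10.7])

# Flux on each interval: rate of change of TAG between consecutive stages.
dt = np.diff(stage)
flux = np.diff(tag) / dt

# Integrating the flux recovers the net TAG accumulated over the time course.
accumulated = np.sum(flux * dt)
peak_interval = int(np.argmax(flux))
print(f"net TAG accumulated: {accumulated:.1f}")
print(f"peak flux between stages {stage[peak_interval]:.0f} "
      f"and {stage[peak_interval + 1]:.0f}")
```

The flux profile, rather than the TAG level itself, is the natural quantity to correlate with expression of biosynthetic genes.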
Genes correlated with TAG
[Figure: two scatter plots of expression by sample ID across stages for genes correlated with TAG.]
Functional Genome Analysis
• Parallel event tracking to identify structured information flow resulting in a state change.
• What circuits are working?
• Where are the Control Points?
• What is the range of responses?
• Where is information Conserved?
Example #3 Yeast Perturbation Study
Experimental Protocol
• Inhibition of cholesterol biosynthetic pathways
• Nylon experiment with complete genome
• Time course experiment (logarithmic sampling times)
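Logarithmic sampling places more time points early in the response, when changes after a perturbation are usually fastest. A minimal sketch follows; the start time, end time, number of points, and units are hypothetical.

```python
import numpy as np

def log_sampling_times(first, last, n_points):
    """Sampling times evenly spaced on a log scale between first and last."""
    return np.geomspace(first, last, n_points)

# Hypothetical schedule: 7 samples from 0.25 to 16 hours after adding the inhibitor.
times = log_sampling_times(0.25, 16.0, 7)
print(times)
```

With these settings each sampling time is double the previous one.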
Spotfire Connection into our gene databases
Genesis
Conclusions
• Designed experiments are required
• Data integration is key to success and a moving target in a developing field like bioinformatics
• We have a need for flexible data visualization tools like Spotfire
• Today, we employ primarily mature “off-the-shelf” tools for our data mining and statistical analysis
• As a company, we have taken a very applied and pragmatic approach to delivering the genes of utility that will be our new products
Acknowledgements
• IT - Mark Showers, Andrew Davis & team
• Bioinformatics - Stan Letovsky & team
• Transcriptional Profiling - George Michaels & team
• Metabolomics - Roger Wiegand & team
• Breeding - Kristin Schneider & team
• Quality Traits - Brad Fabbri & team
• Biostatistics - Anabayan Kessavalou
• Monsanto Leadership
• Spotfire