Integration of Genomic Data using Spotfire

Harry Harlow, Ph.D.
Director, Advanced Data Analysis
Monsanto, Genomics Group
St. Louis, Missouri

Talk Outline

• About Monsanto
• Quick Primer (TxP, MxP, Bioinformatics)
• Example Experiments
• Pushing the Envelope
• Conclusions and Acknowledgements

Monsanto

• BioTech company focused on agriculture
• Products
  – Roundup® herbicide (rapidly degrading glyphosate)
  – Seeds (Corn, Cotton, Soy, Canola, Potato, Sugar Beet)
    • Roundup Ready® seeds (glyphosate tolerant)
    • Insect resistant seeds
  – Animal Products
    • Posilac® bovine somatotropin (dairy cow productivity)
    • Swine Genetics
• A great place to work

Integration of Genomic and Breeding Data is Essential to Find the Genes that will Become our Products

[Figure: diagram relating Performance Information, Phenotype Information, Pedigree Information, and Gene, Protein, and Metabolite Expression. Trait labels: Oil Quality, Yield, Drought Stress, Disease, Maturity.]

Key Requirement for Success: The Designed Experiment

• Begin with the end in mind
• Responsive biological system (the biological “model”)
• Know your limitations
  – Analytic methods: limits of “rotten” luck
  – Data combinability: ability to analyze the results
  – Data coherency
• Robust and Understandable DOE
• Informatics, biostatistics, and computer science
  – data capture, data reduction, analysis, visualization, lead follow-up
• Sufficient funding to conduct meaningful experiments
• Leadership support

Transcriptional Profiling produces a lot of data

• Typical experiment uses between 10 and 250 microarrays
  – each chip has 6000 to 26000 data vectors
  – data is relative, and either 1 or 2 channels
  – noise/error is a function of signal intensity, but researchers like results as a fold change ratio
• Typical experiment has ~1,000,000 data vectors for TxP results alone
• Is most useful combined with phenotype
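As a rough illustration of the numbers on this slide, the sketch below multiplies out the quoted array and chip counts and converts a pair of channel intensities into a log2 fold change. The intensity values, the fixed additive error of 50 units, and the function names are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
import math

def data_vectors(n_arrays, vectors_per_chip):
    """Total number of data vectors produced by an experiment."""
    return n_arrays * vectors_per_chip

def log2_fold_change(channel_a, channel_b):
    """Log2 ratio of two channel intensities (0 means no change)."""
    return math.log2(channel_a / channel_b)

# Range implied by the slide: 10 to 250 arrays, 6000 to 26000 vectors per chip
print(data_vectors(10, 6000), data_vectors(250, 26000))

# Hypothetical intensities: a 2-fold increase and a 2-fold decrease
print(log2_fold_change(800.0, 400.0), log2_fold_change(400.0, 800.0))

# Assumption for illustration: the same additive error (50 units) distorts
# the ratio far more at low signal intensity than at high intensity.
for signal in (100.0, 10000.0):
    print(signal, round(log2_fold_change(signal + 50.0, signal), 3))
```

Working on the log scale keeps increases and decreases symmetric around zero, which is one reason fold changes are usually reported as log ratios.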

Process of high-throughput gene expression analysis

[Figure: workflow diagram. Recoverable labels: cDNA libraries; Sequencing; Unigene Libraries; cDNA vector in each well; Amplification of cDNAs; Array cDNA; cDNA Nylon Array; cDNA microarray; Hybridization; Database; Profiling Analysis; Probes; Important Genes; Gene Interactions; Products. An inset chart has a y-axis from -20 to 20 and an x-axis from 1 to 10.]

Metabolic Profiling Sample Analysis

[Figure: instrument setup. Labels: MS; Two HPLC Systems; RP column; NP column.]

Metabolite Profiling Data

• Typical experiment uses between 100 and 2000 assays
  – each assay has 50 to 200 data vectors
  – data is relative
  – noise/error is a function of signal intensity, but researchers like fold change
• Typical experiment has ~100,000 data vectors

Bioinformatics & Gene Annotation: From Sequence to Function

[Figure: annotation workflow from sequence to function. Recoverable labels:]

• Sequencing: EST; Unigene; Full-length; Genomic
• Transcription regulation (Gene expression)
• Mapping (Molecular markers)
• Mutation (Gene machines)
• Protein expression and function (Proteomics)
• Comparisons to experiment
• ANNOTATION
  – Similarity searches: BLAST, PSI-BLAST, FASTA, Smith-Waterman, Frame-Search
  – Protein domain/motif: Prosite/Profiles, Teiresias, Hidden Markov Models (Pfam), Clusters of orthologs (COGs)
  – Phylogeny
  – Machine Learning SVM
  – Structural prediction: Homology modeling, Threading, etc.
  – Gene prediction (Gene A, Gene B, Gene C on a chromosome)
• FUNCTIONAL CLASSIFICATION
  – Metabolic Pathway
  – Regulatory Pathway
  – Protein interaction networks
  – Comparative genomics (evolution genes & species)
• Databases
• Key areas: Tool integration; Data integration; Data mining; Data visualization

Overall data for a typical experiment

• Phenotypic data: 10’s-1000’s of samples
  – 10’s of measurements including growth rates, respiration, and physical measurements
  – Various tissues and time points (repeated measures)
• Performance data: 10’s-100’s of samples
  – yield information like seed number, morphology, and kg/ha
• Genomics Information
  – MxP: 100,000’s of data vectors
  – TxP: 1,000,000’s of data vectors
  – Gene annotation: 100’s of pieces of evidence per gene

Genomics Experimental Classifications

1. A:B experiments
2. Correlation experiments
3. Perturbation experiments - Cholesterol inhibition
4. Developmental profile experiments - Soy embryo development
5. Composite Designs (a convolution of a developmental profile with a response surface experimental design)

Example Experiments:

• Maize Experiment (Composite)
• Soybean Embryo (Developmental)
• Yeast (Biosynthetic Perturbation)

Experimental Goal

• Identify genes and metabolites that correlate with kernel properties and yield
• Develop an understanding of maize plant and seed development
• Identify developmental points that impact kernel composition

Repeated measures model

Model:

Y_{ikl} = \mu + b_k + \mathrm{line}_i + m_{ik} + \mathrm{time}_l + (\mathrm{line} \times \mathrm{time})_{il} + e_{ikl}

where b_k = block error, m_{ik} = whole-plot error, and e_{ikl} = random correlated error.

• Error is assumed independent
• Consequences of assuming “independent” error when it is “correlated” include overestimating the statistical significance of an effect
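The overestimation described in the last bullet can be seen in a small simulation. The sketch below is not the analysis used in the talk: it assumes a simplified design with two lines, six plants per line, eight time points per plant, no true line effect, and a per-plant random effect that plays the role of the whole-plot error m_{ik}. All sizes, standard deviations, and function names are hypothetical. It compares a naive t-test that treats every measurement as independent with a t-test on per-plant means.

```python
import random

N_PLANTS, N_TIMES = 6, 8
# Two-sided 5% critical values of Student's t for the two analyses:
# naive test has 2*6*8 - 2 = 94 df, plant-mean test has 2*6 - 2 = 10 df.
CRIT_NAIVE, CRIT_MEANS = 1.986, 2.228

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_variance(a) + (nb - 1) * sample_variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1.0 / na + 1.0 / nb)) ** 0.5

def simulate_line(rng, plant_sd=1.0, noise_sd=1.0):
    """One list of repeated measurements per plant, with no line effect."""
    plants = []
    for _ in range(N_PLANTS):
        plant_effect = rng.gauss(0.0, plant_sd)  # shared by all time points of a plant
        plants.append([plant_effect + rng.gauss(0.0, noise_sd) for _ in range(N_TIMES)])
    return plants

def false_positive_rates(n_sims=500, seed=1):
    """Fraction of null simulations declared significant by each analysis."""
    rng = random.Random(seed)
    naive_hits = means_hits = 0
    for _ in range(n_sims):
        a, b = simulate_line(rng), simulate_line(rng)
        flat_a = [y for plant in a for y in plant]
        flat_b = [y for plant in b for y in plant]
        if abs(t_statistic(flat_a, flat_b)) > CRIT_NAIVE:
            naive_hits += 1
        if abs(t_statistic([mean(p) for p in a], [mean(p) for p in b])) > CRIT_MEANS:
            means_hits += 1
    return naive_hits / n_sims, means_hits / n_sims

naive_rate, means_rate = false_positive_rates()
print("naive:", naive_rate, "plant means:", means_rate)
```

Under these assumptions the naive test rejects the null far more often than the nominal 5%, while the test on plant means stays close to it. A mixed model with an explicit correlation structure would be the fuller treatment; the plant-mean test is only the simplest way to respect the grouping.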

Example Data for a Developmental Curve for a Single Tissue

[Figure: curves for Hybrid Control and Experimental Line. Panels include a scatter plot of tDate vs. AvgOfNitrogen for tissue “02 Low Leaf” (06/21/00 to 09/20/00) and a Transcriptional Data plot. Axis labels: Degree of Maturity, Relative Age (0 to 1).]

Example data: One analyte during development

[Figure: three scatter plots of tDate vs. AvgOfSumPPM (PPM against time, 06/21/00 to 09/20/00) for tissues “01 Root”, “04 Low Internode”, and “08 High Leaf”.]

Correlation Analysis: MxP vs. TxP (Genes by Metabolites)

[Figure: plot with axes labeled Genes and Metabolites.]
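A genes-by-metabolites correlation table of this kind can be sketched as follows. This is a minimal illustration, assuming that every gene and metabolite profile is measured over the same ordered set of samples and that Pearson correlation is the measure of interest; the talk does not state which correlation statistic was used. The gene and metabolite names and values are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def correlation_table(genes, metabolites):
    """Nested dict: table[gene][metabolite] = correlation across samples."""
    return {
        gene: {met: pearson(g_profile, m_profile) for met, m_profile in metabolites.items()}
        for gene, g_profile in genes.items()
    }

# Hypothetical profiles over five shared samples (for example, time points)
genes = {
    "gene_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_B": [5.0, 4.0, 3.0, 2.0, 1.0],
}
metabolites = {
    "met_X": [2.0, 4.0, 6.0, 8.0, 10.0],
    "met_Y": [1.0, 3.0, 2.0, 5.0, 4.0],
}

table = correlation_table(genes, metabolites)
for gene, row in table.items():
    print(gene, {met: round(r, 2) for met, r in row.items()})
```

With millions of TxP vectors and hundreds of thousands of MxP vectors, a real table would be computed with vectorized matrix operations rather than nested loops, but the quantity in each cell is the same.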

Integration of Genomic and Breeding Data is Essential to Find the Genes that will Become our Products

[Figure: the same diagram shown earlier in the talk, relating Performance Information, Phenotype Information, Pedigree Information, and Gene, Protein, and Metabolite Expression.]

Utilization of a-priori knowledge and a custom weighting scheme statistic

• Dealt with multidimensional data of various dimensions (missing data)
• Generates a “weighting” or probability density function (PDF)
• In this case, we matched TxP/MxP results with the known genetics.
• Started with p-values for the TxP/MxP results
• Critical values generated by using SAS/Matlab on sampled data
• Formula

[Formula: not recoverable from the extracted slide text. Legible fragments include a sum over all interactions with indices i and j, an exponential term, a term combining cos, tan^{-1}, and \pi/4, and a vector \delta_i described as “equivalent to a z-score”. Slide annotations: “Phase factor” and “Directional consistency”.]
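The slide states that critical values were generated from sampled data in SAS/Matlab. The general idea can be sketched in Python as follows. Because the weighting formula itself cannot be recovered from the slide, the sketch uses a stand-in statistic (a sum of squared z-scores); that statistic, the three-dimensional null sample, and every function name here are assumptions for illustration only.

```python
import random

def placeholder_statistic(values):
    """Stand-in statistic (sum of squares); NOT the weighting statistic from the talk."""
    return sum(v * v for v in values)

def empirical_critical_value(statistic, draw_null_sample, n_draws=5000, alpha=0.05, seed=0):
    """Estimate the (1 - alpha) quantile of a statistic under a sampled null."""
    rng = random.Random(seed)
    null_values = sorted(statistic(draw_null_sample(rng)) for _ in range(n_draws))
    index = min(n_draws - 1, int((1.0 - alpha) * n_draws))
    return null_values[index]

def draw_three_z_scores(rng):
    """Hypothetical null: three independent standard-normal z-scores."""
    return [rng.gauss(0.0, 1.0) for _ in range(3)]

crit = empirical_critical_value(placeholder_statistic, draw_three_z_scores)
print("empirical 5% critical value:", round(crit, 2))
```

For this particular stand-in the exact answer is known (the 95th percentile of a chi-square distribution with 3 degrees of freedom, about 7.81), which gives a check on the sampling procedure. For a custom statistic with no closed-form distribution, sampling is the practical way to obtain critical values.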

A visualization of the weighting scheme in 3D

[Figure: two views of a 3D scatter plot. Legend: Consistent with genetics; Borderline consistent.]

• An example weighting scheme in three dimensions
• Intensity values generated on a regular spherical lattice
• Points are sized and colored according to the PDF

Experimental Results

• A few good genes
  – potential new products
• New gene annotation
  – association of previously unknown genes with a phenotype or biochemical process
• Identification of key developmental stages
• Reference database of integrated information

So… We have a few genes, now what?

• Experimental Verification (RT-PCR, Northerns, etc.)
• The bioinformatics work really begins
  – Most genes have limited annotation
    • “the good” annotation (25 to 60% depending on genome)
    • “the bad” junk annotation (25 to 50% “putative protein”)
    • “the ugly” misleading annotation (up to 25% BLAST match)
  – Genes of interest need additional work
• Where oh where are my genes?
  – Microarrays tend to have incomplete gene sets
    • In many research areas custom arrays are needed
    • Tissue specific “unigene” sets are labor intensive to produce
    • Missing genes identified by doing the experiment

Example #2 Soybean Embryo Development

Experimental Protocol

[Figure: protocol flowchart. Recoverable steps, order inferred from the labels:]

• Soy plants grown in greenhouse
• Tag flowers after emerging (= day 0 of DAF)
• Collect samples up to 60 DAF and stage them
• Separate seed into embryo, endosperm and seed coat
• Freeze samples
• RNA extraction, Lipid analysis, Protein analysis, Micronutrient analysis
• Stages shown: 1, 4, 8, 10, 12, 15

[Figure: Soybean Embryogenesis (from Goldberg et al. 1989)]

[Figure: Regulation of Prevalent mRNA sequences during Seed Development and Germination (from Goldberg et al. 1989)]

Levels of phospholipids and TAG during soybean embryo development

[Figure: chart of lipids (% by dry weight, 0 to 14) against development stage (E4L to E15L). Series: Phospholipids, TAG.]

[Figure: protein gel with lanes E4L to E15L. Marker values: 116, 97.4, 66.3, 55.4, 36.5, 31.0, 21.5. Band labels: Lx, α’, α, 11S, 11S, β.]

[Figure: chart of Micronutrient (0 to 400) against developing embryo stage (E5 to E14).]

Phenotypic Data essential for successful TxP

[Figure: panel labeled Oil; scatter plots labeled “Stage 5 vs Stage 8” and “Stage 8 vs Stage 5” with axes cy3 and cy5.]

Transcriptional data shows excellent coherency

[Figure: axis label Natural Sample Order.]

Statistical Analysis

• Flux analysis of TAG accumulation
• Correlation analysis
• Cluster analysis

[Figure: scatter plots (y-axis afiBDE, x-axis NSampleID, samples 001stag... to 041stag...) for four gene families: Transcription Factors, Histones, Kinases, and Cyclins. Cyclin panel legend:]

• GT-(U97327) calcyclin binding protein [Mus musculus] {cloneid: 700851112}
• (AJ011893) cyclin D3.1 protein [Nicotiana tabacum]
• GT-(D50870) mitotic cyclin a2-type [Glycine max] {cloneid: 700988642}
• (AJ011892) cyclin D2.1 protein [Nicotiana tabacum]

[Figure: expression cluster plotted with TAG and Phospholipids over stages 3, 4, 5, 7, 10, 11, 13, 14. Phase annotations: negative or zero slope; zero slope; positive slope. 121 genes in this cluster.]
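The slope annotations on the cluster figure suggest grouping genes by the sign of their expression trend in successive developmental phases. The sketch below shows one way such a grouping could be done; it is an assumption, not the clustering method used in the talk. The stage values come from the figure, while the phase boundaries, the slope tolerance, and the gene profiles are hypothetical.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def slope_pattern(stages, profile, phases, tol=0.1):
    """One character per phase: '+', '-', or '0' for a slope within +/- tol."""
    signs = []
    for start, stop in phases:
        s = slope(stages[start:stop], profile[start:stop])
        signs.append("+" if s > tol else "-" if s < -tol else "0")
    return "".join(signs)

def matches(pattern, allowed):
    """True if each phase's sign is among the signs allowed for that phase."""
    return all(sign in options for sign, options in zip(pattern, allowed))

stages = [3, 4, 5, 7, 10, 11, 13, 14]   # stages shown on the slide
phases = [(0, 3), (3, 5), (5, 8)]        # hypothetical index ranges for three phases
profiles = {                             # hypothetical expression profiles
    "gene_1": [2.0, 1.0, 0.0, 0.0, 0.0, 1.0, 3.0, 4.0],
    "gene_2": [0.0, 0.0, 0.0, 0.1, 0.1, 0.0, 2.0, 3.0],
    "gene_3": [0.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
}

patterns = {g: slope_pattern(stages, p, phases) for g, p in profiles.items()}
# Target cluster: negative or zero slope, then zero slope, then positive slope
target = ["-0", "0", "+"]
cluster = sorted(g for g, pat in patterns.items() if matches(pat, target))
print(patterns, cluster)
```

Encoding each gene as a short sign pattern makes a cluster definition such as "flat or falling, then flat, then rising" a simple lookup, at the cost of ignoring the magnitude of each change.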

Flux Analysis

[Figure: lipids (% by dry weight, 0 to 12) against stage (4 to 15).]

TAG = \int_{E1L}^{E15L} \mathrm{Flux}\, dt
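If the TAG level is the time integral of the flux, the flux can be estimated by differencing the measured TAG curve, and integrating that estimate should give back the curve. The sketch below illustrates this relationship only. It assumes that developmental stage can stand in for time with unit spacing, and the TAG values are hypothetical numbers in the range shown on the figure, not data from the talk.

```python
def finite_difference_flux(times, levels):
    """Average flux on each interval: change in level over change in time."""
    return [
        (levels[i + 1] - levels[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]

def integrate_flux(times, flux, initial_level=0.0):
    """Accumulate piecewise-constant flux back into a level curve."""
    levels = [initial_level]
    for i, f in enumerate(flux):
        levels.append(levels[-1] + f * (times[i + 1] - times[i]))
    return levels

stages = list(range(4, 16))  # stages 4 to 15, treated as time points (assumption)
tag = [0.1, 0.2, 0.5, 1.0, 2.0, 3.5, 5.5, 7.5, 9.0, 10.0, 10.5, 10.8]  # hypothetical % dry weight

flux = finite_difference_flux(stages, tag)
rebuilt = integrate_flux(stages, flux, initial_level=tag[0])
peak_interval = flux.index(max(flux))
print("peak flux between stages", stages[peak_interval], "and", stages[peak_interval + 1])
```

Genes whose expression profile tracks the flux curve, rather than the accumulated TAG level, are the natural candidates for involvement in synthesis, which is presumably why the flux is analyzed separately from the level.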

Genes correlated with TAG

[Figure: two scatter plots against NSampleID (samples 001... to 041...).]

Functional Genome Analysis

• Parallel event tracking to identify structured information flow resulting in a state change.

• What circuits are working?

• Where are the Control Points?

• What is the range of responses?

• Where is information Conserved?

Example #3 Yeast Perturbation Study

Experimental Protocol

• Inhibition of cholesterol biosynthetic pathways
• Nylon experiment with complete genome
• Time course experiment (logarithmic sampling times)
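Logarithmic sampling places time points at a constant ratio rather than a constant interval, which concentrates measurements early in the response. A minimal sketch is given below; the start time, end time, number of points, and the unit (minutes) are hypothetical, since the slide does not list the actual sampling times.

```python
def log_spaced_times(first, last, n_points):
    """Sampling times with a constant ratio between consecutive points."""
    if n_points < 2 or first <= 0 or last <= first:
        raise ValueError("need n_points >= 2 and 0 < first < last")
    ratio = (last / first) ** (1.0 / (n_points - 1))
    return [first * ratio ** i for i in range(n_points)]

# Hypothetical schedule: 7 samples between 5 and 320 minutes after treatment
times = log_spaced_times(5.0, 320.0, 7)
print([round(t, 1) for t in times])
```

With these example values each sampling time is roughly double the previous one, so fast early changes and slow late changes are both covered with few arrays.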

Spotfire Connection into our gene databases

Genesis

Conclusions

• Designed experiments are required
• Data integration is key to success and a moving target in a developing field like bioinformatics
• We have a need for flexible data visualization tools like Spotfire
• Today, we employ primarily mature “off-the-shelf” tools for our data mining and statistical analysis
• As a company, we have taken a very applied and pragmatic approach to delivering the genes of utility that will be our new products.

Acknowledgements

• IT - Mark Showers, Andrew Davis & team
• Bioinformatics - Stan Letovsky & team
• Transcriptional Profiling - George Michaels & team
• Metabolomics - Roger Wiegand & team
• Breeding - Kristin Schneider & team
• Quality Traits - Brad Fabbri & team
• Biostatistics - Anabayan Kessavalou
• Monsanto Leadership
• Spotfire
