all the data: challenges - stanford...

65
All the Data: challenges Susan Holmes http://www-stat.stanford.edu/˜susan/ Bio-X and Statistics, Stanford University STAMPS, 2015 thanks, NIH-TR01 and NIH-R01GM086884 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Upload: others

Post on 24-Feb-2021

17 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

All the Data: challenges

SusanHolmeshttp://www-stat.stanford.edu/˜susan/

Bio-X andStatistics, StanfordUniversity

STAMPS,2015thanks, NIH-TR01andNIH-R01GM086884

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 2: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

What'shappeninginBiologicalDataAnalysis?

1985 1998 2012

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 3: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

What'shappeninginBiologicalDataAnalysis?1985 1998 2012

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 4: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

messy

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 5: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

(Big)Challengesforthoseofusworkingfromthegroundup

Heterogeneity. Heteroscedasticity. InformationLeaks. MultiplicityofChoices. Reproducibility.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 6: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Part I

`Homogeneous data are all alike;

all heterogeneous data are heterogeneous

in their own way.'

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 7: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

HeterogeneityofData Status: response/explanatory. Hidden(latent)/measured. Types:

Continuous Binary, categorical Graphs/Trees Images Maps/SpatialInformation Rankings

Amountsofdependency: independent/timeseries/spatial. Differenttechnologiesused(454, phyloseq, Illumina,

MassSpec, RNA-seq).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 8: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

HeterogeneousDataObjectsInputanddatamanipulationwith phyloseq

(McMurdieandHolmes, 2013, PlosONE)AsalwaysinR:objectorienteddata.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 9: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

apepackage

OTU Abundanceotu_table

Sample Variablessample_data

Taxonomy TabletaxonomyTable

Phylogenetic Treephylo

otu_table sample_data tax_table phy_tree

otu_table sample_data tax_table

read.treeread.nexusread_tree

as as as

import

phyloseqconstructor:

Biostringspackage

Reference Seq.XStringSet

DNAStringSet RNAStringSet

AAStringSet

phyloseq

Experiment Data

otu_table,sam_data,tax_table,phy_treerefseq

Accessors:get_taxaget_samplesget_variablensamplesntaxarank_namessample_namessample_sumssample_variablestaxa_namestaxa_sums

Processors:filter_taxamerge_phyloseqmerge_samplesmerge_taxaprune_samplesprune_taxasubset_taxasubset_samplestip_glomtax_glom

matrix matrixdata.frame

optional

refseq

data

data structure & APIphyloseq

http://joey711.github.io/phyloseq/

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 10: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

phyloseq

Preprocessing

Import

Direct Plots

plot_network plot_heatmap plot_ordination

distance ordinate

Summary / ExploratoryGraphics

filter_taxafilterfun_samplegenefilter_sampleprune_taxaprune_samplessubset_taxasubset_samplestransform_sample_counts

import_biomimport_mothurimport_pyrotaggerimport_qiimeimport_RDP

plot_tree

plot_richness

plot_bar

bootstrappermutation testsregressiondiscriminant analysismultiple testinggap statisticclusteringprocrustes

Inference, Testing

sample data

OTU cluster output

Input

raw

phyloseqprocessed

work flowphyloseq

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 11: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

graphics

−0.3

−0.2

−0.1

0.0

0.1

0.2

0.3

−0.4 −0.2 0.0 0.2 0.4NMDS1

NMDS

2

SampleType

FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue

plot_ordination, NMDS, wUF

FreshwaterFreshwater (creek)FreshwaterFreshwater (creek)Freshwater (creek)SoilSoilSoilSkinSkinSkinM

ockM

ockM

ockFecesFecesFecesFecesSedim

ent (estuary)TongueTongueO

ceanO

ceanO

ceanSedim

ent (estuary)Sedim

ent (estuary)

SampleType

OTU

1

100

10000

Abundance

plot_heatmap; bray−curtis, NMDS

SeqTech

IlluminaPyro454Sanger

Enterotype 1

23

plot_network; Enterotype data, bray−curtis, max.dist=0.25

CytophagaEmticicia

Sphingobacterium

Segetibacter

Haliscomenobacter

Pedobacter

Bacteroides

Alistipes

Bacteroides

Cytophaga

Porphyromonas

Prevotella

Parabacteroides

Algoriphagus

Odoribacter

CandidatusAquirestis

Capnocytophaga

Porphyromonas

Spirosoma

Prevotella

Balneola

Prevotella

Hymenobacter

Prevotella

76

73

75

75

79

67

81

84

84

82

75

Abundance

12562515625

SampleType

FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue

Order Bacteroidales

FlavobacterialesSphingobacteriales

plot_tree; Bacteroidetes−only. Merged samples, tip_glom=0.1

0e+00

2e+05

4e+05

6e+05

Feces

Freshwater

Freshwater (creek)

Mock

Ocean

Sediment (estuary)

Skin

Soil

Tongue

SampleType

Abun

danc

e

FamilyBacteroidaceaeBalneolaceaeCryomorphaceaeCyclobacteriaceaeFlavobacteriaceaeFlexibacteraceaePorphyromonadaceaePrevotellaceaeRikenellaceaeSaprospiraceaeSphingobacteriaceae

plot_bar; Bacteroidetes−only

S.obs S.chao1 S.ACE

2000

4000

6000

8000

FALSE TRUE FALSE TRUE FALSE TRUEHuman Associated Samples

Num

ber o

f OTU

sSampleType

FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue

plot_ordination()

plot_network()

plot_bar()

plot_heatmap()

plot_tree()

plot_richness()

phyloseq

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 12: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

graphics

plot_ordination()

−1.0

−0.5

0.0

0.5

1.0

1.5

−1 0 1CA1

CA2

SampleTypeFecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue

Samples Only; type="samples"

−1.0

−0.5

0.0

0.5

1.0

1.5

−1 0 1CA1

CA2

Class

FlavobacteriaSphingobacteriaBacteroidiasamples

SampleType

FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTonguetaxa

type

samplestaxa

Biplot; type="biplot"

Bacteroidia Flavobacteria Sphingobacteria

−1.0

−0.5

0.0

0.5

1.0

1.5

−1 0 1 −1 0 1 −1 0 1CA1

CA2

Class

BacteroidiaFlavobacteriaSphingobacteria

Taxa Only; type="taxa"

samples taxa

−1.0

−0.5

0.0

0.5

1.0

1.5

−1 0 1 −1 0 1CA1

CA2

ClassFlavobacteriaSphingobacteriaBacteroidiasamples

SampleTypeFecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTonguetaxa

Split Plot; type="split"

type="scree"

split

taxa-only

biplot

samples-only

phyloseq

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 13: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

markdown !(code + console) +

figures

phyloseq + !ggplot2 + !etc.

# Main title!This is an [R Markdown](my.link.com) document of my recent analysis.!## Subsection: some codeHere is some import code, etc.```rlibrary("phyloseq")library("ggplot2")physeq = import_biom(“datafile.biom”)plot_richness(physeq)```

source.Rmd

Complete HTML5

knitr::knit2html()

microbiome data

Our Goal with Collaborators:!Reproducible analysis workflow with R-markdown

Better Reproducibility

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 14: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Part II

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 15: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

ExampleofStudy

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 16: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Summaryofthestudy Choosethedatatransformation(hereproportions

replacedtheoriginalcounts).... log, rlog, subsample, prop, orig.

Takeasubsetofthedata, somesamplesdeclaredasoutliers. ... leaveout0, 1, 2,..,9, +criteria(10)......

Filteroutcertaintaxa(unknownlabels, rare, etc...)... removeraretaxa(thresholdat0.01%, 1%, 2%,...)

Chooseadistance.... 40choicesinvegan/phyloseq.

Chooseanordinationmethodandnumberofcoordinates.... MDS,NMDS,k=2,3,4,5..

Chooseaclusteringmethod, chooseanumberofclusters.... PAM,KNN,densitybased, hclust...

Chooseanunderlyingcontinuousvariable(gradientorgroupofvariables: manifold).

Chooseagraphicalrepresentation.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 17: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Therearethusmorethan200millionpossiblewaysofanalyzingthisdata:

5× 100× 10× 40× 8× 16× 2× 4 = 204800000

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 18: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Part III

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 19: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Some real data (Caporoso et al, 2011)

> GlobalPatterns

phyloseq-class experiment-level object

otu_table() OTU Table: [ 19216 taxa and 26 samples ]

sample_data()Sample Data: [ 26 samples by 7 sample variables ]

tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]

phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ]

> sample_sums(GlobalPatterns)

CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr

864077 1135457 697509 1543451 2076476 718943 433894 186297

.....

NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1

1478965 1652754 58688 493126 279704 937466 1211071 1216137

> summary(sample_sums(GlobalPatterns))

Min. 1st Qu. Median Mean 3rd Qu. Max.

58690 567100 1107000 1085000 1527000 2357000

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 20: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Howtodealwithdifferentnumbersofreads?

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 21: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

CurrentMethod: RarefyingAdhoc librarysizenormalizationbyrandomsubsamplingwithoutreplacement.1. Selectaminimumlibrarysize, NL,min. Thishasalsobeen

calledthe rarefactionlevel thoughwewillnotusethetermhere.

2. Discardlibraries(microbiomesamples)thathavefewerreadsthan NL,min.

3. Subsampletheremaininglibrarieswithoutreplacementsuchthattheyallhavesize NL,min.

Often NL,min ischosentobeequaltothesizeofthesmallestlibrarythatisnotconsidered defective, andtheprocessofidentifyingdefectivesamplescomeswithariskofsubjectivityandbias. Inmanycasesresearchershavealsofailedtorepeattherandomsubsamplingstep(3)orrecordthepseudorandomnumbergenerationseed/process---bothofwhichareessentialforreproducibility.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 22: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

ReductionofDatatoProportions

Manysoftwareprogramsautomaticallyreducethedatatorelativeproportions, losingtheinformationaboutlibrarysizesorreadcounts.Thismakescomparisonsverydifficult.StatisticalFormulation: Whenmakinga(testing)decision,reducingresultsfromaBinomialdistributionintoaproportiondoesnotgivean admissible procedure.Definition:Anadmissibleruleisanoptimalruleformakingadecisioninthesensethatthereisnootherrulethatisalwaysbetterthanit.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 23: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Aimofthestudies: DifferentialAbundance

Likedifferentiallyexpressedgenes, aspecies/OTU isconsidereddifferentiallyabundantifitsmeanproportionissignificantlydifferentbetweentwoormoresampleclassesintheexperimentaldesign.OptimalityCriteria:SensititivityorPower TruePositiveRate.Specificity TrueNegativeRate.

Wehavetocontrolformanysourcesoferror(blocking,modeling, etc..)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 24: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

NegativeBinomialorGamma-Poisson

McMurdieandHolmes(2014)``WasteNot, WantNot: Whyrarefyingmicrobiomedataisinadmissible'',NegativeBinomialasahierarchicalmixtureforreadcountsIftechnicalreplicateshavesamenumberofreads: sj,Poissonvariationwithmean µ = sjui.Taxa i incidenceproportion ui.Numberofreadsforthesample j andtaxa i wouldbe

Kij ∼ Poisson (sjui)

SameasDESeqforRNA-seq

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 25: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

ImprovementinPowerandFDR

Performanceofdifferentialabundancedetectionwithandwithoutrarefyingsummarizedby“AreaUndertheCurve”(AUC) metricofaReceiverOperatorCurve(ROC) (verticalaxis).Briefly, theAUC valuevariesfrom0.5(random)to1.0(perfect).Thehorizontalaxisindicatestheeffectsize, shownasthefactorappliedtoOTU countstosimulateadifferentialabundance.Eachcurvetracestherespectivenormalizationmethod’smeanperformanceofthatpanel, withaverticalbarindicatingastandarddeviationinperformanceacrossallreplicatesandmicrobiometemplates.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 26: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Theright-handsideofthepanelrowsindicatesthemedianlibrarysize, N,whilethedarknessoflineshadingindicatesthenumberofsamplespersimulatedexperiment.Colorshadeandshapeindicatethenormalizationmethod.DetectionamongmultipletestswasdefinedusingaFalseDiscoveryRate(Benjamini-Hochberg)significancethresholdof0.05.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 27: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Part IV

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 28: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

UnifracDistance(LozuponeandKnight, 2005)

isadistancebetweengroupsoforganismsthatarerelatedtoeachotherbyatree.SupposewehavetheOTUspresentinsample1(blue)andinsample2(red).Question: Dothetwosamplesdifferphylogenetically?ItisdefinedastheratioofthesumofthelengthsofthebranchesleadingtomembersofgroupA ormembersofgroupB butnotbothtothetotalbranchlengthofthetree.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 29: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

WeightedUnifracdistance A modificationofUniFrac,weightedUniFracisdefinedin(Lozuponeetal., 2007)as

n∑i=1

bi × | AiAT

− BiBT

|

n = numberofbranchesinthetree bi =lengthoftheithbranch Ai =numberofdescendantsof

ithbranchingroupA AT =totalnumberofsequences

ingroupA

[?]orcanbeseenasbyprobabilistsastheWassersteindistance(earthmovers)[?]. . .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 30: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Costelloetal. 2010

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 31: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Rao'sDistance

Westartwithadistancebetweenindividuals.Theheterogeneityofapopulation(Hi )istheaveragedistancebetweenmembersofthatpopulation.Theheterogeneitybetweentwopopulations(Hij)istheaveragedistancebetweenamemberofpopulation i andamemberofpopulation j.Thedistancebetweentwopopulationsis

Dij = Hij −1

2(Hi + Hj)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 32: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 33: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

DecompositionofDiversity

Ifwehavepopulations 1, . . . , k withfrequencies π1, . . . , πk,thenthediversityofallthepopulationstogetheris

H0 =

k∑i=1

πiHi +∑i

∑j

πiπjDij = H(w) + D(b)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 34: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

DoublePrincipalCoordinateAnalysisPurdom, 2005hadtheideaofusingamethoddevelopedbyPavoineetal. calledDPCoA [?]implementedin ade4 [?].Supposewehavenspeciesinplocationsanda(euclidean)matrix ∆ givingthesquaresofthepairwisedistancesbetweenthespecies. Thenwecan

Usethedistancesbetweenspeciestofindanembeddingin n− 1 -dimensionalspacesuchthattheeuclideandistancesbetweenthespeciesisthesameasthedistancesbetweenthespeciesdefinedin ∆.

Placeeachoftheplocationsatthebarycenterofitsspeciesprofile. TheeuclideandistancesbetweenthelocationswillbethesameasthesquarerootoftheRaodissimilaritybetweenthem.

UsePCA tofindalower-dimensionalrepresentationofthelocations.

Givethespeciesandcommunitiescoordinatessuchthattheinertiadecomposesthesamewaythediversitydoes.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 35: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

FukuyamaandHolmes, 2012.Method Originaldescription Newformula PropertiesDPCoA square root of Rao's distance

basedonthesquarerootofthepatristicdistances

[∑

i bi(Ai/AT − Bi/BT)2]1/2 Mostsensitivetooutliers, leastsensitive to noise, upweightsdeep differences, gives OTUlocations

wUniFrac∑

i bi |Ai/AT − Bi/BT|∑

i bi |Ai/AT − Bi/BT| Less sensitive to outliers/moresensitivetonoisethanDPCoA

UniFrac fractionofbranchesleadingtoexactlyonegroup

∑i bi1

Ai/AT−Bi/BTAi/AT+Bi/BT

≥ 1 Sensitive to noise, upweightsshallowdifferencesonthetree

Summaryofthemethodsunderconsideration. ``Outliers"referstohighlyabundantOTUs, andnoisereferstonoiseindetectinglow-abundanceOTUs(seethetextformoredetail).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 36: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

AntibioticTimeCourseData

Measurementsofabout2500differentbacterialOTUsfromstoolsamplesofthreepatients(D,E,F)Eachpatientsampled ∼ 50timesduringthecourseoftreatmentwithciprofloxacin(anantibiotic).TimescategorizedasPreCp, 1stCp, 1stWPC (weekpostcipro), Interim, 2ndCp, 2ndWPC,andPostCp.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 37: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

UniFrac

Axis 1: 14.7%

Axi

s 2:

10.

3%

−0.2

−0.1

0.0

0.1

0.2

−0.2 −0.1 0.0 0.1 0.2 0.3 0.4

weighted UniFrac

Axis 1: 47.6%A

xis

2: 1

2.3%

−0.2

−0.1

0.0

0.1

0.2

−0.4−0.3−0.2−0.1 0.0 0.1 0.2 0.3

weighted UF on presence/absence

Axis 1: 32.7%

Axi

s 2:

15.

1%

−0.05

0.00

0.05

0.10

−0.10−0.050.000.050.100.150.20

subject

D

E

F

ComparingtheUniFracvariants. Fromlefttoright:PCoA/MDS withunweightedUniFrac, withweightedUniFrac,andwithweightedUniFracperformedonpresence/absencedataextractedfromtheabundancedatausedintheothertwoplots

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 38: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

(a) MDS of OTUs

Axis 1: 6.2%

Axi

s 2:

3.7

%

−1.5

−1.0

−0.5

0.0

0.5

−1.0 −0.5 0.0 0.5

(c) DPCoA OTU plot

CS1

CS

2

−1.0

−0.5

0.0

0.5

1.0

−1.5 −1.0 −0.5 0.0 0.5 1.0

phylum

4C0d−2

Actinobacteria

Bacteroidetes

Candidate division TM7

Cyanobacteria

Firmicutes

Fusobacteria

Lentisphaerae

Proteobacteria

Synergistetes

Verrucomicrobia

(b) DPCoA community plot

Axis 1: 40.9%

Axi

s 2:

13.

3%−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

−1.0 −0.5 0.0 0.5

subject

D

E

F

(a)PCoA/MDS oftheOTUsbasedonthepatristicdistance, (b)communityand(c)speciespointsforDPCoA afterremovingtwooutlyingspecies.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 39: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

AntibioticStress

Wenextwanttovisualizetheeffectoftheantibiotic.OrdinationsofthecommunitiesduetoDPCoA andUniFracwithinformationaboutthewhetherthecommunitywasstressedornotstressed(precipro, interim, andpostciprowereconsidered``notstressed'', whilefirstcipro, firstweekpostcipro, secondcipro, andsecondweekpostciprowereconsidered``stressed'').WeseethatforUniFrac, thefirstaxisseemstoseparatethestressedcommunitiesfromthenotstressedcommunities.DPCoA alsoseemstoseparatetheoutthestressedcommunitiesalongthefirstaxis(inthedirectionassociatedwithBacteroidetes), althoughonlyforsubjectsD andE.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 40: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Axis1

Axi

s2

−0.2

−0.1

0.0

0.1

0.2

D 1

D 2

E 1 E 2

F 1F 2

−0.2 −0.1 0.0 0.1 0.2 0.3 0.4

Antibiotic stress

1: not stressed

2: stressed

Subject

D

E

F

PCoA/MDS withunweightedUniFrac. Thelabelsrepresentsubjectplusantibioticcondition.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 41: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Axis1

Axi

s2

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

D 1D 2

E 1

E 2

F 1F 2

−1.0 −0.5 0.0 0.5

CommunitypointsasrepresentedbyDPCoA.Thelabelsrepresentsubjectplusantibioticcondition.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 42: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

ConclusionsforAntibioticStress

SinceUniFracemphasizesshallowdifferencesonthetreeandsincePCoA/MDS withUniFracseemstoseparatethesubjectsfromeachotherbetterthantheothertwomethods, wecanconcludethatthedifferencesbetweensubjectsaremainlyshallowones. However, DPCoA alsoseparatesthesubjectsandthestressedversusnon-stressedcommunities, andexaminingthecommunityandOTU ordinationscantellusaboutthedifferencesinthecompositionsofthesecommunities.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 43: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Part V

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 44: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Non-reproducibility: EnterotypesStudyNature WebLink

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 45: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

MultiplicityofChoices/Tests/Tuning

Data Filter Normalize Smooth Project WeightWeight

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 46: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

SuboptimalWorkflow

Anoptimizationwhereyouonlyallowedyourselftwostepsoneparalleltoeachcoordinate....

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 47: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Iteration: DocumentationandRecordkeeping

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 48: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

DocumentedChoices

Dataminingdiary: arecordofeverysampledropped, everynormalization, everychoiceofweight.ExamplefortheEnterotypedata Cloud Enterotype

example

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 49: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

RationalScientificEconomics

Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?

Datadredging. Researchisrecreatedfromscratcheverytime(new

clusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.

Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 50: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

RationalScientificEconomics

Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?Datadredging.

Researchisrecreatedfromscratcheverytime(newclusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?

Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.

Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 51: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

RationalScientificEconomics

Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?Datadredging.

Researchisrecreatedfromscratcheverytime(newclusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.

Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 52: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Complexitiesofproofbysimulation

Itisatheoremthat...Proofverifiedbypeerreview.Oursimulationshows....

Howwerechoicesmade, robustness. Doesthemodelfitthedata? Goodnessoffitishard

whendoing200,000datapoint. Howwastheprogramchecked(theoryavailable, test

case)? Crowdchecking....

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 53: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

ReproducibleSimulationsandSubsampling

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 54: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

SupplementaryMaterial

ClusterSimulationCodeDispersionsurvey

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 55: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

KeepalltheData

......includingtheanalyses

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 56: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Example

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 57: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Goalsalreadyattained:

Modelingmixtures: VarianceStabilizingtransformations(nodatadiscarded).

Waysofcombiningheterogeneousdata: distancesandmultivariaterepresentation(nodatadiscarded).

Throughscripting: multiplemethodsshowninparallelanditeratively(noanalyseshidden).

Dataintegration phyloseq (trees, tables, networks)throughanobjectorientedapproach.

Threshold, sensitivitytestsandmodelingsimulations(tuningmadevisible).

Reproducibility: opensourcestandards, publicationofsourcecodeanddatausingR/Bioconductor. (phyloseq).

Shiny-phyloseq.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 58: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

BenefittingfromthetoolsandschoolsofStatisticians.......

Thankstothe R community:HadleyWickhamfor ggplot2, Chessel, Jombart, Dray,Thioulouse ade4 , GordonSmythandhisteamfor edgeR,WolfgangHuber, MichaelLovefor DESeq2 andEmmanuelParadisfor ape.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 59: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Collaborators:

DavidRelman AlfredSpormann ElisabethPurdomJustinSonnenburgandPersiDiaconis.

PostdoctoralFellows Paul(Joey)McMurdie, AlexAlekseyenko(NYU),BenCallahan, AngelaMarcobal.Students: MilingShen, AldenTimme, KatieShelef, YanaHoy,JohnChakerian, JuliaFukuyama, KrisSankaran.Fundingfrom CIMI-Toulouse, NIH-TR01, NIH/NIGMS R01,NSF-VIGRE andNSF-DMS.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 60: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

phyloseq

JoeyMcMurdie(joey711ongithub).AvailableinBioconductor.HowcanI (mystudents, mypostdocs...) learnmore?Askme.http://www-stat.stanford.edu/˜susan/

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 61: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Multi-tablemethods

Inertia, Co-InertiaWegeneralizeitinseveraldirectionsthroughtheideaofinertia.Asinphysics, wedefineinertiaasaweightedsumofdistancesofweightedpoints.Thisenablesustouseabundancedatainacontingencytableandcomputeitsinertiawhichinthiscasewillbetheweightedsumofthesquaresofdistancesbetweenobservedandexpectedfrequencies, suchasisusedincomputingthechisquarestatistic.Anothergeneralizationofvariance-inertiaistheusefulPhylogeneticdiversityindex. (computingthesumofdistancesbetweenasubsetoftaxathroughthetree).Wealsohavesuchgeneralizationsthatcovervariabilityofpointsonagraphtakenfromstandardspatialstatistics.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 62: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Co-Inertia

Whenstudyingtwovariablesmeasuredatthesamelocations,forinstancePH andhumiditythestandardquantificationofcovariationisthecovariance.

sum(x1 ∗ y1 + x2 ∗ y2 + x3 ∗ y3)

ifxandyco-vary-inthesamedirectionthiswillbebig.A simplegeneralizationtothiswhenthevariabilityismorecomplicatedtomeasureasaboveisdonethroughCo-Inertiaanalysis(CIA).Co-inertiaanalysis(CIA) isamultivariatemethodthatidentifiestrendsorco-relationshipsinmultipledatasetswhichcontainthesamesamplesorthesametimepoints. Thatistherowsorcolumnsofthematrixhavetobeweightedsimilarlyandthusmustbematchable.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 63: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

RV coefficient

TheglobalmeasureofsimilarityoftwodatatablesasopposedtotwovectorscanbedonebyageneralizationofcovarianceprovidedbyaninnerproductbetweentablesthatgivestheRV coefficient, anumberbetween0and1, likeacorrelationcoefficient, butfortables.

RV(A,B) = Tr(A′B)√Tr(A′A)

√Tr(B′B)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 64: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

ExampleCurrentworkwithJoeyMcMurdie, LesDethfelsen, DavidRelman, AngelaMarcobal, JustinSonnenburg. Data:

Taxa Readcounts(3patientstakingcipro: twotimecourses): .

Mass-Spec PositiveandNegativeionMassSpecfeaturesandtheirintensities: .

RNA-seq Metagenomicdataongenes:.

HereistheRV tableofthethreearraytypes:

> fourtable$RV

Taxa Kegg MassSpec+ MassSpec-

Taxa 1 0.565 0.561 0.670

Kegg 0.565 1 0.686 0.644

MassSpec+ 0.561 0.686 1 0.568

MassSpec- 0.670 0.644 0.568 1

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Page 65: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity

Multipletablemethods

InPCA wecomputethevariance-covariancematrix, inmultipletablemethodswecantakeacubeoftablesandcomputetheRV coefficientoftheircharacterizingoperators.Wethendiagonalizethisandfindthebestweighted`ensemble'.Thisiscalledthe`compromise'andalltheindividualtablescanbeprojectedontoit.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .