all the data: challenges - stanford...

All the Data: challenges

SusanHolmeshttp://www-stat.stanford.edu/˜susan/

Bio-X andStatistics, StanfordUniversity

STAMPS,2015thanks, NIH-TR01andNIH-R01GM086884

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

What'shappeninginBiologicalDataAnalysis?

1985 1998 2012

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

What'shappeninginBiologicalDataAnalysis?1985 1998 2012

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

messy

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

(Big)Challengesforthoseofusworkingfromthegroundup

Heterogeneity. Heteroscedasticity. InformationLeaks. MultiplicityofChoices. Reproducibility.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Part I

`Homogeneous data are all alike;

all heterogeneous data are heterogeneous

in their own way.'

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

HeterogeneityofData Status: response/explanatory. Hidden(latent)/measured. Types:

Continuous Binary, categorical Graphs/Trees Images Maps/SpatialInformation Rankings

Amountsofdependency: independent/timeseries/spatial. Differenttechnologiesused(454, phyloseq, Illumina,

MassSpec, RNA-seq).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

HeterogeneousDataObjectsInputanddatamanipulationwith phyloseq

(McMurdieandHolmes, 2013, PlosONE)AsalwaysinR:objectorienteddata.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

apepackage

OTU Abundanceotu_table

Sample Variablessample_data

Taxonomy TabletaxonomyTable

Phylogenetic Treephylo

otu_table sample_data tax_table phy_tree

otu_table sample_data tax_table

read.treeread.nexusread_tree

as as as

import

phyloseqconstructor:

Biostringspackage

Reference Seq.XStringSet

DNAStringSet RNAStringSet

AAStringSet

phyloseq

Experiment Data

otu_table,sam_data,tax_table,phy_treerefseq

Accessors:get_taxaget_samplesget_variablensamplesntaxarank_namessample_namessample_sumssample_variablestaxa_namestaxa_sums

Processors:filter_taxamerge_phyloseqmerge_samplesmerge_taxaprune_samplesprune_taxasubset_taxasubset_samplestip_glomtax_glom

matrix matrixdata.frame

optional

refseq

data

data structure & APIphyloseq

http://joey711.github.io/phyloseq/

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

phyloseq

Preprocessing

Import

Direct Plots

plot_network plot_heatmap plot_ordination

distance ordinate

Summary / ExploratoryGraphics

filter_taxafilterfun_samplegenefilter_sampleprune_taxaprune_samplessubset_taxasubset_samplestransform_sample_counts

import_biomimport_mothurimport_pyrotaggerimport_qiimeimport_RDP

plot_tree

plot_richness

plot_bar

bootstrappermutation testsregressiondiscriminant analysismultiple testinggap statisticclusteringprocrustes

Inference, Testing

sample data

OTU cluster output

Input

raw

phyloseqprocessed

work flowphyloseq

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

graphics

−0.3

−0.2

−0.1

0.0

0.1

0.2

0.3

−0.4 −0.2 0.0 0.2 0.4NMDS1

NMDS

2

SampleType

FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue

plot_ordination, NMDS, wUF

FreshwaterFreshwater (creek)FreshwaterFreshwater (creek)Freshwater (creek)SoilSoilSoilSkinSkinSkinM

ockM

ockM

ockFecesFecesFecesFecesSedim

ent (estuary)TongueTongueO

ceanO

ceanO

ceanSedim

ent (estuary)Sedim

ent (estuary)

SampleType

OTU

1

100

10000

Abundance

plot_heatmap; bray−curtis, NMDS

SeqTech

IlluminaPyro454Sanger

Enterotype 1

23

plot_network; Enterotype data, bray−curtis, max.dist=0.25

CytophagaEmticicia

Sphingobacterium

Segetibacter

Haliscomenobacter

Pedobacter

Bacteroides

Alistipes

Bacteroides

Cytophaga

Porphyromonas

Prevotella

Parabacteroides

Algoriphagus

Odoribacter

CandidatusAquirestis

Capnocytophaga

Porphyromonas

Spirosoma

Prevotella

Balneola

Prevotella

Hymenobacter

Prevotella

76

73

75

75

79

67

81

84

84

82

75

Abundance

12562515625

SampleType


Order Bacteroidales

FlavobacterialesSphingobacteriales

plot_tree; Bacteroidetes−only. Merged samples, tip_glom=0.1

0e+00

2e+05

4e+05

6e+05

Feces

Freshwater

Freshwater (creek)

Mock

Ocean

Sediment (estuary)

Skin

Soil

Tongue

SampleType

Abun

danc

e

FamilyBacteroidaceaeBalneolaceaeCryomorphaceaeCyclobacteriaceaeFlavobacteriaceaeFlexibacteraceaePorphyromonadaceaePrevotellaceaeRikenellaceaeSaprospiraceaeSphingobacteriaceae

plot_bar; Bacteroidetes−only

S.obs S.chao1 S.ACE

2000

4000

6000

8000

FALSE TRUE FALSE TRUE FALSE TRUEHuman Associated Samples

Num

ber o

f OTU

sSampleType


plot_ordination()

plot_network()

plot_bar()

plot_heatmap()

plot_tree()

plot_richness()

phyloseq

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

graphics

plot_ordination()

−1.0

−0.5

0.0

0.5

1.0

1.5

−1 0 1CA1

CA2

SampleTypeFecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue

Samples Only; type="samples"

−1.0

−0.5

0.0

0.5

1.0

1.5

−1 0 1CA1

CA2

Class

FlavobacteriaSphingobacteriaBacteroidiasamples

SampleType

FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTonguetaxa

type

samplestaxa

Biplot; type="biplot"

Bacteroidia Flavobacteria Sphingobacteria

−1.0

−0.5

0.0

0.5

1.0

1.5

−1 0 1 −1 0 1 −1 0 1CA1

CA2

Class

BacteroidiaFlavobacteriaSphingobacteria

Taxa Only; type="taxa"

samples taxa

−1.0

−0.5

0.0

0.5

1.0

1.5

−1 0 1 −1 0 1CA1

CA2

ClassFlavobacteriaSphingobacteriaBacteroidiasamples

SampleTypeFecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTonguetaxa

Split Plot; type="split"

type="scree"

split

taxa-only

biplot

samples-only

phyloseq

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

markdown !(code + console) +

figures

phyloseq + !ggplot2 + !etc.

# Main title!This is an [R Markdown](my.link.com) document of my recent analysis.!## Subsection: some codeHere is some import code, etc.```rlibrary("phyloseq")library("ggplot2")physeq = import_biom(“datafile.biom”)plot_richness(physeq)```

source.Rmd

Complete HTML5

knitr::knit2html()

microbiome data

Our Goal with Collaborators:!Reproducible analysis workflow with R-markdown

Better Reproducibility

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Part II

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ExampleofStudy

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Summaryofthestudy Choosethedatatransformation(hereproportions

replacedtheoriginalcounts).... log, rlog, subsample, prop, orig.

Takeasubsetofthedata, somesamplesdeclaredasoutliers. ... leaveout0, 1, 2,..,9, +criteria(10)......

Filteroutcertaintaxa(unknownlabels, rare, etc...)... removeraretaxa(thresholdat0.01%, 1%, 2%,...)

Chooseadistance.... 40choicesinvegan/phyloseq.

Chooseanordinationmethodandnumberofcoordinates.... MDS,NMDS,k=2,3,4,5..

Chooseaclusteringmethod, chooseanumberofclusters.... PAM,KNN,densitybased, hclust...

Chooseanunderlyingcontinuousvariable(gradientorgroupofvariables: manifold).

Chooseagraphicalrepresentation.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Therearethusmorethan200millionpossiblewaysofanalyzingthisdata:

5× 100× 10× 40× 8× 16× 2× 4 = 204800000

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Part III

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Some real data (Caporoso et al, 2011)

> GlobalPatterns

phyloseq-class experiment-level object

otu_table() OTU Table: [ 19216 taxa and 26 samples ]

sample_data()Sample Data: [ 26 samples by 7 sample variables ]

tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]

phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ]

> sample_sums(GlobalPatterns)

CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr

864077 1135457 697509 1543451 2076476 718943 433894 186297

.....

NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1

1478965 1652754 58688 493126 279704 937466 1211071 1216137

> summary(sample_sums(GlobalPatterns))

Min. 1st Qu. Median Mean 3rd Qu. Max.

58690 567100 1107000 1085000 1527000 2357000

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Howtodealwithdifferentnumbersofreads?

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

CurrentMethod: RarefyingAdhoc librarysizenormalizationbyrandomsubsamplingwithoutreplacement.1. Selectaminimumlibrarysize, NL,min. Thishasalsobeen

calledthe rarefactionlevel thoughwewillnotusethetermhere.

2. Discardlibraries(microbiomesamples)thathavefewerreadsthan NL,min.

3. Subsampletheremaininglibrarieswithoutreplacementsuchthattheyallhavesize NL,min.

Often NL,min ischosentobeequaltothesizeofthesmallestlibrarythatisnotconsidered defective, andtheprocessofidentifyingdefectivesamplescomeswithariskofsubjectivityandbias. Inmanycasesresearchershavealsofailedtorepeattherandomsubsamplingstep(3)orrecordthepseudorandomnumbergenerationseed/process---bothofwhichareessentialforreproducibility.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ReductionofDatatoProportions

Manysoftwareprogramsautomaticallyreducethedatatorelativeproportions, losingtheinformationaboutlibrarysizesorreadcounts.Thismakescomparisonsverydifficult.StatisticalFormulation: Whenmakinga(testing)decision,reducingresultsfromaBinomialdistributionintoaproportiondoesnotgivean admissible procedure.Definition:Anadmissibleruleisanoptimalruleformakingadecisioninthesensethatthereisnootherrulethatisalwaysbetterthanit.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Aimofthestudies: DifferentialAbundance

Likedifferentiallyexpressedgenes, aspecies/OTU isconsidereddifferentiallyabundantifitsmeanproportionissignificantlydifferentbetweentwoormoresampleclassesintheexperimentaldesign.OptimalityCriteria:SensititivityorPower TruePositiveRate.Specificity TrueNegativeRate.

Wehavetocontrolformanysourcesoferror(blocking,modeling, etc..)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

NegativeBinomialorGamma-Poisson

McMurdieandHolmes(2014)``WasteNot, WantNot: Whyrarefyingmicrobiomedataisinadmissible'',NegativeBinomialasahierarchicalmixtureforreadcountsIftechnicalreplicateshavesamenumberofreads: sj,Poissonvariationwithmean µ = sjui.Taxa i incidenceproportion ui.Numberofreadsforthesample j andtaxa i wouldbe

Kij ∼ Poisson (sjui)

SameasDESeqforRNA-seq

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ImprovementinPowerandFDR

Performanceofdifferentialabundancedetectionwithandwithoutrarefyingsummarizedby“AreaUndertheCurve”(AUC) metricofaReceiverOperatorCurve(ROC) (verticalaxis).Briefly, theAUC valuevariesfrom0.5(random)to1.0(perfect).Thehorizontalaxisindicatestheeffectsize, shownasthefactorappliedtoOTU countstosimulateadifferentialabundance.Eachcurvetracestherespectivenormalizationmethod’smeanperformanceofthatpanel, withaverticalbarindicatingastandarddeviationinperformanceacrossallreplicatesandmicrobiometemplates.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Theright-handsideofthepanelrowsindicatesthemedianlibrarysize, N,whilethedarknessoflineshadingindicatesthenumberofsamplespersimulatedexperiment.Colorshadeandshapeindicatethenormalizationmethod.DetectionamongmultipletestswasdefinedusingaFalseDiscoveryRate(Benjamini-Hochberg)significancethresholdof0.05.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Part IV

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

UnifracDistance(LozuponeandKnight, 2005)

isadistancebetweengroupsoforganismsthatarerelatedtoeachotherbyatree.SupposewehavetheOTUspresentinsample1(blue)andinsample2(red).Question: Dothetwosamplesdifferphylogenetically?ItisdefinedastheratioofthesumofthelengthsofthebranchesleadingtomembersofgroupA ormembersofgroupB butnotbothtothetotalbranchlengthofthetree.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

WeightedUnifracdistance A modificationofUniFrac,weightedUniFracisdefinedin(Lozuponeetal., 2007)as

n∑i=1

bi × | AiAT

− BiBT

|

n = numberofbranchesinthetree bi =lengthoftheithbranch Ai =numberofdescendantsof

ithbranchingroupA AT =totalnumberofsequences

ingroupA

[?]orcanbeseenasbyprobabilistsastheWassersteindistance(earthmovers)[?]. . .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Costelloetal. 2010

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Rao'sDistance

Westartwithadistancebetweenindividuals.Theheterogeneityofapopulation(Hi )istheaveragedistancebetweenmembersofthatpopulation.Theheterogeneitybetweentwopopulations(Hij)istheaveragedistancebetweenamemberofpopulation i andamemberofpopulation j.Thedistancebetweentwopopulationsis

Dij = Hij −1

2(Hi + Hj)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

DecompositionofDiversity

Ifwehavepopulations 1, . . . , k withfrequencies π1, . . . , πk,thenthediversityofallthepopulationstogetheris

H0 =

k∑i=1

πiHi +∑i

∑j

πiπjDij = H(w) + D(b)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

DoublePrincipalCoordinateAnalysisPurdom, 2005hadtheideaofusingamethoddevelopedbyPavoineetal. calledDPCoA [?]implementedin ade4 [?].Supposewehavenspeciesinplocationsanda(euclidean)matrix ∆ givingthesquaresofthepairwisedistancesbetweenthespecies. Thenwecan

Usethedistancesbetweenspeciestofindanembeddingin n− 1 -dimensionalspacesuchthattheeuclideandistancesbetweenthespeciesisthesameasthedistancesbetweenthespeciesdefinedin ∆.

Placeeachoftheplocationsatthebarycenterofitsspeciesprofile. TheeuclideandistancesbetweenthelocationswillbethesameasthesquarerootoftheRaodissimilaritybetweenthem.

UsePCA tofindalower-dimensionalrepresentationofthelocations.

Givethespeciesandcommunitiescoordinatessuchthattheinertiadecomposesthesamewaythediversitydoes.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

FukuyamaandHolmes, 2012.Method Originaldescription Newformula PropertiesDPCoA square root of Rao's distance

basedonthesquarerootofthepatristicdistances

[∑

i bi(Ai/AT − Bi/BT)2]1/2 Mostsensitivetooutliers, leastsensitive to noise, upweightsdeep differences, gives OTUlocations

wUniFrac∑

i bi |Ai/AT − Bi/BT|∑

i bi |Ai/AT − Bi/BT| Less sensitive to outliers/moresensitivetonoisethanDPCoA

UniFrac fractionofbranchesleadingtoexactlyonegroup

∑i bi1

Ai/AT−Bi/BTAi/AT+Bi/BT

≥ 1 Sensitive to noise, upweightsshallowdifferencesonthetree

Summaryofthemethodsunderconsideration. ``Outliers"referstohighlyabundantOTUs, andnoisereferstonoiseindetectinglow-abundanceOTUs(seethetextformoredetail).

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

AntibioticTimeCourseData

Measurementsofabout2500differentbacterialOTUsfromstoolsamplesofthreepatients(D,E,F)Eachpatientsampled ∼ 50timesduringthecourseoftreatmentwithciprofloxacin(anantibiotic).TimescategorizedasPreCp, 1stCp, 1stWPC (weekpostcipro), Interim, 2ndCp, 2ndWPC,andPostCp.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

UniFrac

Axis 1: 14.7%

Axi

s 2:

10.

3%

−0.2

−0.1

0.0

0.1

0.2

−0.2 −0.1 0.0 0.1 0.2 0.3 0.4

weighted UniFrac

Axis 1: 47.6%A

xis

2: 1

2.3%

−0.2

−0.1

0.0

0.1

0.2

−0.4−0.3−0.2−0.1 0.0 0.1 0.2 0.3

weighted UF on presence/absence

Axis 1: 32.7%

Axi

s 2:

15.

1%

−0.05

0.00

0.05

0.10

−0.10−0.050.000.050.100.150.20

subject

D

E

F

ComparingtheUniFracvariants. Fromlefttoright:PCoA/MDS withunweightedUniFrac, withweightedUniFrac,andwithweightedUniFracperformedonpresence/absencedataextractedfromtheabundancedatausedintheothertwoplots

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

(a) MDS of OTUs

Axis 1: 6.2%

Axi

s 2:

3.7

%

−1.5

−1.0

−0.5

0.0

0.5

−1.0 −0.5 0.0 0.5

(c) DPCoA OTU plot

CS1

CS

2

−1.0

−0.5

0.0

0.5

1.0

−1.5 −1.0 −0.5 0.0 0.5 1.0

phylum

4C0d−2

Actinobacteria

Bacteroidetes

Candidate division TM7

Cyanobacteria

Firmicutes

Fusobacteria

Lentisphaerae

Proteobacteria

Synergistetes

Verrucomicrobia

(b) DPCoA community plot

Axis 1: 40.9%

Axi

s 2:

13.

3%−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

−1.0 −0.5 0.0 0.5

subject

D

E

F

(a)PCoA/MDS oftheOTUsbasedonthepatristicdistance, (b)communityand(c)speciespointsforDPCoA afterremovingtwooutlyingspecies.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

AntibioticStress

Wenextwanttovisualizetheeffectoftheantibiotic.OrdinationsofthecommunitiesduetoDPCoA andUniFracwithinformationaboutthewhetherthecommunitywasstressedornotstressed(precipro, interim, andpostciprowereconsidered``notstressed'', whilefirstcipro, firstweekpostcipro, secondcipro, andsecondweekpostciprowereconsidered``stressed'').WeseethatforUniFrac, thefirstaxisseemstoseparatethestressedcommunitiesfromthenotstressedcommunities.DPCoA alsoseemstoseparatetheoutthestressedcommunitiesalongthefirstaxis(inthedirectionassociatedwithBacteroidetes), althoughonlyforsubjectsD andE.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Axis1

Axi

s2

−0.2

−0.1

0.0

0.1

0.2

D 1

D 2

E 1 E 2

F 1F 2

−0.2 −0.1 0.0 0.1 0.2 0.3 0.4

Antibiotic stress

1: not stressed

2: stressed

Subject

D

E

F

PCoA/MDS withunweightedUniFrac. Thelabelsrepresentsubjectplusantibioticcondition.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Axis1

Axi

s2

−0.8

−0.6

−0.4

−0.2

0.0

0.2

0.4

D 1D 2

E 1

E 2

F 1F 2

−1.0 −0.5 0.0 0.5

CommunitypointsasrepresentedbyDPCoA.Thelabelsrepresentsubjectplusantibioticcondition.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ConclusionsforAntibioticStress

SinceUniFracemphasizesshallowdifferencesonthetreeandsincePCoA/MDS withUniFracseemstoseparatethesubjectsfromeachotherbetterthantheothertwomethods, wecanconcludethatthedifferencesbetweensubjectsaremainlyshallowones. However, DPCoA alsoseparatesthesubjectsandthestressedversusnon-stressedcommunities, andexaminingthecommunityandOTU ordinationscantellusaboutthedifferencesinthecompositionsofthesecommunities.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Part V

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Non-reproducibility: EnterotypesStudyNature WebLink

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

http://www.nature.com/nature/journal/v506/n7489/full/nature13075.html

MultiplicityofChoices/Tests/Tuning

Data Filter Normalize Smooth Project WeightWeight

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

SuboptimalWorkflow

Anoptimizationwhereyouonlyallowedyourselftwostepsoneparalleltoeachcoordinate....

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Iteration: DocumentationandRecordkeeping

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

DocumentedChoices

Dataminingdiary: arecordofeverysampledropped, everynormalization, everychoiceofweight.ExamplefortheEnterotypedata Cloud Enterotype

example

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

http://rstudio-pubs-static.s3.amazonaws.com/2574_d1cd758ba04748fba1112a3ebb2b8381.html

http://rstudio-pubs-static.s3.amazonaws.com/2574_d1cd758ba04748fba1112a3ebb2b8381.html

RationalScientificEconomics

Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?

Datadredging. Researchisrecreatedfromscratcheverytime(new

clusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.

Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .


Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?Datadredging.

Researchisrecreatedfromscratcheverytime(newclusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?

Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.


. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .


Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?Datadredging.

Researchisrecreatedfromscratcheverytime(newclusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.


. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Complexitiesofproofbysimulation

Itisatheoremthat...Proofverifiedbypeerreview.Oursimulationshows....

Howwerechoicesmade, robustness. Doesthemodelfitthedata? Goodnessoffitishard

whendoing200,000datapoint. Howwastheprogramchecked(theoryavailable, test

case)? Crowdchecking....

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ReproducibleSimulationsandSubsampling

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

SupplementaryMaterial

ClusterSimulationCodeDispersionsurvey

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

http://joey711.github.io/waste-not-supplemental/simulation-cluster-accuracy/simulation-cluster-accuracy-server.html

http://joey711.github.io/waste-not-supplemental/dispersion-survey/dispersion.html

KeepalltheData

......includingtheanalyses

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Example

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

http://127.0.0.1:3268

Goalsalreadyattained:

Modelingmixtures: VarianceStabilizingtransformations(nodatadiscarded).

Waysofcombiningheterogeneousdata: distancesandmultivariaterepresentation(nodatadiscarded).

Throughscripting: multiplemethodsshowninparallelanditeratively(noanalyseshidden).

Dataintegration phyloseq (trees, tables, networks)throughanobjectorientedapproach.

Threshold, sensitivitytestsandmodelingsimulations(tuningmadevisible).

Reproducibility: opensourcestandards, publicationofsourcecodeanddatausingR/Bioconductor. (phyloseq).

Shiny-phyloseq.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

BenefittingfromthetoolsandschoolsofStatisticians.......

Thankstothe R community:HadleyWickhamfor ggplot2, Chessel, Jombart, Dray,Thioulouse ade4 , GordonSmythandhisteamfor edgeR,WolfgangHuber, MichaelLovefor DESeq2 andEmmanuelParadisfor ape.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Collaborators:

DavidRelman AlfredSpormann ElisabethPurdomJustinSonnenburgandPersiDiaconis.

PostdoctoralFellows Paul(Joey)McMurdie, AlexAlekseyenko(NYU),BenCallahan, AngelaMarcobal.Students: MilingShen, AldenTimme, KatieShelef, YanaHoy,JohnChakerian, JuliaFukuyama, KrisSankaran.Fundingfrom CIMI-Toulouse, NIH-TR01, NIH/NIGMS R01,NSF-VIGRE andNSF-DMS.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

phyloseq

JoeyMcMurdie(joey711ongithub).AvailableinBioconductor.HowcanI (mystudents, mypostdocs...) learnmore?Askme.http://www-stat.stanford.edu/˜susan/

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

http://www-stat.stanford.edu/~susan/

Multi-tablemethods

Inertia, Co-InertiaWegeneralizeitinseveraldirectionsthroughtheideaofinertia.Asinphysics, wedefineinertiaasaweightedsumofdistancesofweightedpoints.Thisenablesustouseabundancedatainacontingencytableandcomputeitsinertiawhichinthiscasewillbetheweightedsumofthesquaresofdistancesbetweenobservedandexpectedfrequencies, suchasisusedincomputingthechisquarestatistic.Anothergeneralizationofvariance-inertiaistheusefulPhylogeneticdiversityindex. (computingthesumofdistancesbetweenasubsetoftaxathroughthetree).Wealsohavesuchgeneralizationsthatcovervariabilityofpointsonagraphtakenfromstandardspatialstatistics.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Co-Inertia

Whenstudyingtwovariablesmeasuredatthesamelocations,forinstancePH andhumiditythestandardquantificationofcovariationisthecovariance.

sum(x1 ∗ y1 + x2 ∗ y2 + x3 ∗ y3)

ifxandyco-vary-inthesamedirectionthiswillbebig.A simplegeneralizationtothiswhenthevariabilityismorecomplicatedtomeasureasaboveisdonethroughCo-Inertiaanalysis(CIA).Co-inertiaanalysis(CIA) isamultivariatemethodthatidentifiestrendsorco-relationshipsinmultipledatasetswhichcontainthesamesamplesorthesametimepoints. Thatistherowsorcolumnsofthematrixhavetobeweightedsimilarlyandthusmustbematchable.

. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

RV coefficient

TheglobalmeasureofsimilarityoftwodatatablesasopposedtotwovectorscanbedonebyageneralizationofcovarianceprovidedbyaninnerproductbetweentablesthatgivestheRV coefficient, anumberbetween0and1, likeacorrelationcoefficient, butfortables.

RV(A,B) = Tr(A′B)√Tr(A′A)

√Tr(B′B)

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

ExampleCurrentworkwithJoeyMcMurdie, LesDethfelsen, DavidRelman, AngelaMarcobal, JustinSonnenburg. Data:

Taxa Readcounts(3patientstakingcipro: twotimecourses): .

Mass-Spec PositiveandNegativeionMassSpecfeaturesandtheirintensities: .

RNA-seq Metagenomicdataongenes:.

HereistheRV tableofthethreearraytypes:

> fourtable$RV

Taxa Kegg MassSpec+ MassSpec-

Taxa 1 0.565 0.561 0.670

Kegg 0.565 1 0.686 0.644

MassSpec+ 0.561 0.686 1 0.568

MassSpec- 0.670 0.644 0.568 1

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

Multipletablemethods

InPCA wecomputethevariance-covariancematrix, inmultipletablemethodswecantakeacubeoftablesandcomputetheRV coefficientoftheircharacterizingoperators.Wethendiagonalizethisandfindthebestweighted`ensemble'.Thisiscalledthe`compromise'andalltheindividualtablescanbeprojectedontoit.

. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .

all the data: challenges - stanford...

Documents