all the data: challenges - stanford...
TRANSCRIPT
All the Data: challenges
SusanHolmeshttp://www-stat.stanford.edu/˜susan/
Bio-X andStatistics, StanfordUniversity
STAMPS,2015thanks, NIH-TR01andNIH-R01GM086884
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
What'shappeninginBiologicalDataAnalysis?
1985 1998 2012
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
What'shappeninginBiologicalDataAnalysis?1985 1998 2012
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
messy
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
(Big)Challengesforthoseofusworkingfromthegroundup
Heterogeneity. Heteroscedasticity. InformationLeaks. MultiplicityofChoices. Reproducibility.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part I
`Homogeneous data are all alike;
all heterogeneous data are heterogeneous
in their own way.'
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
HeterogeneityofData Status: response/explanatory. Hidden(latent)/measured. Types:
Continuous Binary, categorical Graphs/Trees Images Maps/SpatialInformation Rankings
Amountsofdependency: independent/timeseries/spatial. Differenttechnologiesused(454, phyloseq, Illumina,
MassSpec, RNA-seq).
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
HeterogeneousDataObjectsInputanddatamanipulationwith phyloseq
(McMurdieandHolmes, 2013, PlosONE)AsalwaysinR:objectorienteddata.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
apepackage
OTU Abundanceotu_table
Sample Variablessample_data
Taxonomy TabletaxonomyTable
Phylogenetic Treephylo
otu_table sample_data tax_table phy_tree
otu_table sample_data tax_table
read.treeread.nexusread_tree
as as as
import
phyloseqconstructor:
Biostringspackage
Reference Seq.XStringSet
DNAStringSet RNAStringSet
AAStringSet
phyloseq
Experiment Data
otu_table,sam_data,tax_table,phy_treerefseq
Accessors:get_taxaget_samplesget_variablensamplesntaxarank_namessample_namessample_sumssample_variablestaxa_namestaxa_sums
Processors:filter_taxamerge_phyloseqmerge_samplesmerge_taxaprune_samplesprune_taxasubset_taxasubset_samplestip_glomtax_glom
matrix matrixdata.frame
optional
refseq
data
data structure & APIphyloseq
http://joey711.github.io/phyloseq/
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
phyloseq
Preprocessing
Import
Direct Plots
plot_network plot_heatmap plot_ordination
distance ordinate
Summary / ExploratoryGraphics
filter_taxafilterfun_samplegenefilter_sampleprune_taxaprune_samplessubset_taxasubset_samplestransform_sample_counts
import_biomimport_mothurimport_pyrotaggerimport_qiimeimport_RDP
plot_tree
plot_richness
plot_bar
bootstrappermutation testsregressiondiscriminant analysismultiple testinggap statisticclusteringprocrustes
Inference, Testing
sample data
OTU cluster output
Input
raw
phyloseqprocessed
work flowphyloseq
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
graphics
−0.3
−0.2
−0.1
0.0
0.1
0.2
0.3
−0.4 −0.2 0.0 0.2 0.4NMDS1
NMDS
2
SampleType
FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue
plot_ordination, NMDS, wUF
FreshwaterFreshwater (creek)FreshwaterFreshwater (creek)Freshwater (creek)SoilSoilSoilSkinSkinSkinM
ockM
ockM
ockFecesFecesFecesFecesSedim
ent (estuary)TongueTongueO
ceanO
ceanO
ceanSedim
ent (estuary)Sedim
ent (estuary)
SampleType
OTU
1
100
10000
Abundance
plot_heatmap; bray−curtis, NMDS
SeqTech
IlluminaPyro454Sanger
Enterotype 1
23
plot_network; Enterotype data, bray−curtis, max.dist=0.25
CytophagaEmticicia
Sphingobacterium
Segetibacter
Haliscomenobacter
Pedobacter
Bacteroides
Alistipes
Bacteroides
Cytophaga
Porphyromonas
Prevotella
Parabacteroides
Algoriphagus
Odoribacter
CandidatusAquirestis
Capnocytophaga
Porphyromonas
Spirosoma
Prevotella
Balneola
Prevotella
Hymenobacter
Prevotella
76
73
75
75
79
67
81
84
84
82
75
Abundance
12562515625
SampleType
FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue
Order Bacteroidales
FlavobacterialesSphingobacteriales
plot_tree; Bacteroidetes−only. Merged samples, tip_glom=0.1
0e+00
2e+05
4e+05
6e+05
Feces
Freshwater
Freshwater (creek)
Mock
Ocean
Sediment (estuary)
Skin
Soil
Tongue
SampleType
Abun
danc
e
FamilyBacteroidaceaeBalneolaceaeCryomorphaceaeCyclobacteriaceaeFlavobacteriaceaeFlexibacteraceaePorphyromonadaceaePrevotellaceaeRikenellaceaeSaprospiraceaeSphingobacteriaceae
plot_bar; Bacteroidetes−only
S.obs S.chao1 S.ACE
2000
4000
6000
8000
FALSE TRUE FALSE TRUE FALSE TRUEHuman Associated Samples
Num
ber o
f OTU
sSampleType
FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue
plot_ordination()
plot_network()
plot_bar()
plot_heatmap()
plot_tree()
plot_richness()
phyloseq
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
graphics
plot_ordination()
−1.0
−0.5
0.0
0.5
1.0
1.5
−1 0 1CA1
CA2
SampleTypeFecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue
Samples Only; type="samples"
−1.0
−0.5
0.0
0.5
1.0
1.5
−1 0 1CA1
CA2
Class
FlavobacteriaSphingobacteriaBacteroidiasamples
SampleType
FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTonguetaxa
type
samplestaxa
Biplot; type="biplot"
Bacteroidia Flavobacteria Sphingobacteria
−1.0
−0.5
0.0
0.5
1.0
1.5
−1 0 1 −1 0 1 −1 0 1CA1
CA2
Class
BacteroidiaFlavobacteriaSphingobacteria
Taxa Only; type="taxa"
samples taxa
−1.0
−0.5
0.0
0.5
1.0
1.5
−1 0 1 −1 0 1CA1
CA2
ClassFlavobacteriaSphingobacteriaBacteroidiasamples
SampleTypeFecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTonguetaxa
Split Plot; type="split"
type="scree"
split
taxa-only
biplot
samples-only
phyloseq
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
markdown !(code + console) +
figures
phyloseq + !ggplot2 + !etc.
# Main title!This is an [R Markdown](my.link.com) document of my recent analysis.!## Subsection: some codeHere is some import code, etc.```rlibrary("phyloseq")library("ggplot2")physeq = import_biom(“datafile.biom”)plot_richness(physeq)```
source.Rmd
Complete HTML5
knitr::knit2html()
microbiome data
Our Goal with Collaborators:!Reproducible analysis workflow with R-markdown
Better Reproducibility
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part II
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ExampleofStudy
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Summaryofthestudy Choosethedatatransformation(hereproportions
replacedtheoriginalcounts).... log, rlog, subsample, prop, orig.
Takeasubsetofthedata, somesamplesdeclaredasoutliers. ... leaveout0, 1, 2,..,9, +criteria(10)......
Filteroutcertaintaxa(unknownlabels, rare, etc...)... removeraretaxa(thresholdat0.01%, 1%, 2%,...)
Chooseadistance.... 40choicesinvegan/phyloseq.
Chooseanordinationmethodandnumberofcoordinates.... MDS,NMDS,k=2,3,4,5..
Chooseaclusteringmethod, chooseanumberofclusters.... PAM,KNN,densitybased, hclust...
Chooseanunderlyingcontinuousvariable(gradientorgroupofvariables: manifold).
Chooseagraphicalrepresentation.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Therearethusmorethan200millionpossiblewaysofanalyzingthisdata:
5× 100× 10× 40× 8× 16× 2× 4 = 204800000
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part III
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Some real data (Caporoso et al, 2011)
> GlobalPatterns
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data()Sample Data: [ 26 samples by 7 sample variables ]
tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ]
> sample_sums(GlobalPatterns)
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
864077 1135457 697509 1543451 2076476 718943 433894 186297
.....
NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
1478965 1652754 58688 493126 279704 937466 1211071 1216137
> summary(sample_sums(GlobalPatterns))
Min. 1st Qu. Median Mean 3rd Qu. Max.
58690 567100 1107000 1085000 1527000 2357000
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Howtodealwithdifferentnumbersofreads?
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
CurrentMethod: RarefyingAdhoc librarysizenormalizationbyrandomsubsamplingwithoutreplacement.1. Selectaminimumlibrarysize, NL,min. Thishasalsobeen
calledthe rarefactionlevel thoughwewillnotusethetermhere.
2. Discardlibraries(microbiomesamples)thathavefewerreadsthan NL,min.
3. Subsampletheremaininglibrarieswithoutreplacementsuchthattheyallhavesize NL,min.
Often NL,min ischosentobeequaltothesizeofthesmallestlibrarythatisnotconsidered defective, andtheprocessofidentifyingdefectivesamplescomeswithariskofsubjectivityandbias. Inmanycasesresearchershavealsofailedtorepeattherandomsubsamplingstep(3)orrecordthepseudorandomnumbergenerationseed/process---bothofwhichareessentialforreproducibility.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ReductionofDatatoProportions
Manysoftwareprogramsautomaticallyreducethedatatorelativeproportions, losingtheinformationaboutlibrarysizesorreadcounts.Thismakescomparisonsverydifficult.StatisticalFormulation: Whenmakinga(testing)decision,reducingresultsfromaBinomialdistributionintoaproportiondoesnotgivean admissible procedure.Definition:Anadmissibleruleisanoptimalruleformakingadecisioninthesensethatthereisnootherrulethatisalwaysbetterthanit.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Aimofthestudies: DifferentialAbundance
Likedifferentiallyexpressedgenes, aspecies/OTU isconsidereddifferentiallyabundantifitsmeanproportionissignificantlydifferentbetweentwoormoresampleclassesintheexperimentaldesign.OptimalityCriteria:SensititivityorPower TruePositiveRate.Specificity TrueNegativeRate.
Wehavetocontrolformanysourcesoferror(blocking,modeling, etc..)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
NegativeBinomialorGamma-Poisson
McMurdieandHolmes(2014)``WasteNot, WantNot: Whyrarefyingmicrobiomedataisinadmissible'',NegativeBinomialasahierarchicalmixtureforreadcountsIftechnicalreplicateshavesamenumberofreads: sj,Poissonvariationwithmean µ = sjui.Taxa i incidenceproportion ui.Numberofreadsforthesample j andtaxa i wouldbe
Kij ∼ Poisson (sjui)
SameasDESeqforRNA-seq
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ImprovementinPowerandFDR
Performanceofdifferentialabundancedetectionwithandwithoutrarefyingsummarizedby“AreaUndertheCurve”(AUC) metricofaReceiverOperatorCurve(ROC) (verticalaxis).Briefly, theAUC valuevariesfrom0.5(random)to1.0(perfect).Thehorizontalaxisindicatestheeffectsize, shownasthefactorappliedtoOTU countstosimulateadifferentialabundance.Eachcurvetracestherespectivenormalizationmethod’smeanperformanceofthatpanel, withaverticalbarindicatingastandarddeviationinperformanceacrossallreplicatesandmicrobiometemplates.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Theright-handsideofthepanelrowsindicatesthemedianlibrarysize, N,whilethedarknessoflineshadingindicatesthenumberofsamplespersimulatedexperiment.Colorshadeandshapeindicatethenormalizationmethod.DetectionamongmultipletestswasdefinedusingaFalseDiscoveryRate(Benjamini-Hochberg)significancethresholdof0.05.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part IV
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
UnifracDistance(LozuponeandKnight, 2005)
isadistancebetweengroupsoforganismsthatarerelatedtoeachotherbyatree.SupposewehavetheOTUspresentinsample1(blue)andinsample2(red).Question: Dothetwosamplesdifferphylogenetically?ItisdefinedastheratioofthesumofthelengthsofthebranchesleadingtomembersofgroupA ormembersofgroupB butnotbothtothetotalbranchlengthofthetree.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
WeightedUnifracdistance A modificationofUniFrac,weightedUniFracisdefinedin(Lozuponeetal., 2007)as
n∑i=1
bi × | AiAT
− BiBT
|
n = numberofbranchesinthetree bi =lengthoftheithbranch Ai =numberofdescendantsof
ithbranchingroupA AT =totalnumberofsequences
ingroupA
[?]orcanbeseenasbyprobabilistsastheWassersteindistance(earthmovers)[?]. . .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Costelloetal. 2010
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Rao'sDistance
Westartwithadistancebetweenindividuals.Theheterogeneityofapopulation(Hi )istheaveragedistancebetweenmembersofthatpopulation.Theheterogeneitybetweentwopopulations(Hij)istheaveragedistancebetweenamemberofpopulation i andamemberofpopulation j.Thedistancebetweentwopopulationsis
Dij = Hij −1
2(Hi + Hj)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DecompositionofDiversity
Ifwehavepopulations 1, . . . , k withfrequencies π1, . . . , πk,thenthediversityofallthepopulationstogetheris
H0 =
k∑i=1
πiHi +∑i
∑j
πiπjDij = H(w) + D(b)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DoublePrincipalCoordinateAnalysisPurdom, 2005hadtheideaofusingamethoddevelopedbyPavoineetal. calledDPCoA [?]implementedin ade4 [?].Supposewehavenspeciesinplocationsanda(euclidean)matrix ∆ givingthesquaresofthepairwisedistancesbetweenthespecies. Thenwecan
Usethedistancesbetweenspeciestofindanembeddingin n− 1 -dimensionalspacesuchthattheeuclideandistancesbetweenthespeciesisthesameasthedistancesbetweenthespeciesdefinedin ∆.
Placeeachoftheplocationsatthebarycenterofitsspeciesprofile. TheeuclideandistancesbetweenthelocationswillbethesameasthesquarerootoftheRaodissimilaritybetweenthem.
UsePCA tofindalower-dimensionalrepresentationofthelocations.
Givethespeciesandcommunitiescoordinatessuchthattheinertiadecomposesthesamewaythediversitydoes.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
FukuyamaandHolmes, 2012.Method Originaldescription Newformula PropertiesDPCoA square root of Rao's distance
basedonthesquarerootofthepatristicdistances
[∑
i bi(Ai/AT − Bi/BT)2]1/2 Mostsensitivetooutliers, leastsensitive to noise, upweightsdeep differences, gives OTUlocations
wUniFrac∑
i bi |Ai/AT − Bi/BT|∑
i bi |Ai/AT − Bi/BT| Less sensitive to outliers/moresensitivetonoisethanDPCoA
UniFrac fractionofbranchesleadingtoexactlyonegroup
∑i bi1
Ai/AT−Bi/BTAi/AT+Bi/BT
≥ 1 Sensitive to noise, upweightsshallowdifferencesonthetree
Summaryofthemethodsunderconsideration. ``Outliers"referstohighlyabundantOTUs, andnoisereferstonoiseindetectinglow-abundanceOTUs(seethetextformoredetail).
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AntibioticTimeCourseData
Measurementsofabout2500differentbacterialOTUsfromstoolsamplesofthreepatients(D,E,F)Eachpatientsampled ∼ 50timesduringthecourseoftreatmentwithciprofloxacin(anantibiotic).TimescategorizedasPreCp, 1stCp, 1stWPC (weekpostcipro), Interim, 2ndCp, 2ndWPC,andPostCp.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
UniFrac
Axis 1: 14.7%
Axi
s 2:
10.
3%
−0.2
−0.1
0.0
0.1
0.2
−0.2 −0.1 0.0 0.1 0.2 0.3 0.4
weighted UniFrac
Axis 1: 47.6%A
xis
2: 1
2.3%
−0.2
−0.1
0.0
0.1
0.2
−0.4−0.3−0.2−0.1 0.0 0.1 0.2 0.3
weighted UF on presence/absence
Axis 1: 32.7%
Axi
s 2:
15.
1%
−0.05
0.00
0.05
0.10
−0.10−0.050.000.050.100.150.20
subject
D
E
F
ComparingtheUniFracvariants. Fromlefttoright:PCoA/MDS withunweightedUniFrac, withweightedUniFrac,andwithweightedUniFracperformedonpresence/absencedataextractedfromtheabundancedatausedintheothertwoplots
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
(a) MDS of OTUs
Axis 1: 6.2%
Axi
s 2:
3.7
%
−1.5
−1.0
−0.5
0.0
0.5
−1.0 −0.5 0.0 0.5
(c) DPCoA OTU plot
CS1
CS
2
−1.0
−0.5
0.0
0.5
1.0
−1.5 −1.0 −0.5 0.0 0.5 1.0
phylum
4C0d−2
Actinobacteria
Bacteroidetes
Candidate division TM7
Cyanobacteria
Firmicutes
Fusobacteria
Lentisphaerae
Proteobacteria
Synergistetes
Verrucomicrobia
(b) DPCoA community plot
Axis 1: 40.9%
Axi
s 2:
13.
3%−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
−1.0 −0.5 0.0 0.5
subject
D
E
F
(a)PCoA/MDS oftheOTUsbasedonthepatristicdistance, (b)communityand(c)speciespointsforDPCoA afterremovingtwooutlyingspecies.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
AntibioticStress
Wenextwanttovisualizetheeffectoftheantibiotic.OrdinationsofthecommunitiesduetoDPCoA andUniFracwithinformationaboutthewhetherthecommunitywasstressedornotstressed(precipro, interim, andpostciprowereconsidered``notstressed'', whilefirstcipro, firstweekpostcipro, secondcipro, andsecondweekpostciprowereconsidered``stressed'').WeseethatforUniFrac, thefirstaxisseemstoseparatethestressedcommunitiesfromthenotstressedcommunities.DPCoA alsoseemstoseparatetheoutthestressedcommunitiesalongthefirstaxis(inthedirectionassociatedwithBacteroidetes), althoughonlyforsubjectsD andE.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Axis1
Axi
s2
−0.2
−0.1
0.0
0.1
0.2
D 1
D 2
E 1 E 2
F 1F 2
−0.2 −0.1 0.0 0.1 0.2 0.3 0.4
Antibiotic stress
1: not stressed
2: stressed
Subject
D
E
F
PCoA/MDS withunweightedUniFrac. Thelabelsrepresentsubjectplusantibioticcondition.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Axis1
Axi
s2
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
D 1D 2
E 1
E 2
F 1F 2
−1.0 −0.5 0.0 0.5
CommunitypointsasrepresentedbyDPCoA.Thelabelsrepresentsubjectplusantibioticcondition.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ConclusionsforAntibioticStress
SinceUniFracemphasizesshallowdifferencesonthetreeandsincePCoA/MDS withUniFracseemstoseparatethesubjectsfromeachotherbetterthantheothertwomethods, wecanconcludethatthedifferencesbetweensubjectsaremainlyshallowones. However, DPCoA alsoseparatesthesubjectsandthestressedversusnon-stressedcommunities, andexaminingthecommunityandOTU ordinationscantellusaboutthedifferencesinthecompositionsofthesecommunities.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Part V
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Non-reproducibility: EnterotypesStudyNature WebLink
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
MultiplicityofChoices/Tests/Tuning
Data Filter Normalize Smooth Project WeightWeight
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
SuboptimalWorkflow
Anoptimizationwhereyouonlyallowedyourselftwostepsoneparalleltoeachcoordinate....
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Iteration: DocumentationandRecordkeeping
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
DocumentedChoices
Dataminingdiary: arecordofeverysampledropped, everynormalization, everychoiceofweight.ExamplefortheEnterotypedata Cloud Enterotype
example
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
RationalScientificEconomics
Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?
Datadredging. Researchisrecreatedfromscratcheverytime(new
clusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.
Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
RationalScientificEconomics
Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?Datadredging.
Researchisrecreatedfromscratcheverytime(newclusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?
Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.
Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
RationalScientificEconomics
Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?Datadredging.
Researchisrecreatedfromscratcheverytime(newclusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.
Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Complexitiesofproofbysimulation
Itisatheoremthat...Proofverifiedbypeerreview.Oursimulationshows....
Howwerechoicesmade, robustness. Doesthemodelfitthedata? Goodnessoffitishard
whendoing200,000datapoint. Howwastheprogramchecked(theoryavailable, test
case)? Crowdchecking....
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ReproducibleSimulationsandSubsampling
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
SupplementaryMaterial
ClusterSimulationCodeDispersionsurvey
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
KeepalltheData
......includingtheanalyses
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Goalsalreadyattained:
Modelingmixtures: VarianceStabilizingtransformations(nodatadiscarded).
Waysofcombiningheterogeneousdata: distancesandmultivariaterepresentation(nodatadiscarded).
Throughscripting: multiplemethodsshowninparallelanditeratively(noanalyseshidden).
Dataintegration phyloseq (trees, tables, networks)throughanobjectorientedapproach.
Threshold, sensitivitytestsandmodelingsimulations(tuningmadevisible).
Reproducibility: opensourcestandards, publicationofsourcecodeanddatausingR/Bioconductor. (phyloseq).
Shiny-phyloseq.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
BenefittingfromthetoolsandschoolsofStatisticians.......
Thankstothe R community:HadleyWickhamfor ggplot2, Chessel, Jombart, Dray,Thioulouse ade4 , GordonSmythandhisteamfor edgeR,WolfgangHuber, MichaelLovefor DESeq2 andEmmanuelParadisfor ape.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Collaborators:
DavidRelman AlfredSpormann ElisabethPurdomJustinSonnenburgandPersiDiaconis.
PostdoctoralFellows Paul(Joey)McMurdie, AlexAlekseyenko(NYU),BenCallahan, AngelaMarcobal.Students: MilingShen, AldenTimme, KatieShelef, YanaHoy,JohnChakerian, JuliaFukuyama, KrisSankaran.Fundingfrom CIMI-Toulouse, NIH-TR01, NIH/NIGMS R01,NSF-VIGRE andNSF-DMS.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
phyloseq
JoeyMcMurdie(joey711ongithub).AvailableinBioconductor.HowcanI (mystudents, mypostdocs...) learnmore?Askme.http://www-stat.stanford.edu/˜susan/
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Multi-tablemethods
Inertia, Co-InertiaWegeneralizeitinseveraldirectionsthroughtheideaofinertia.Asinphysics, wedefineinertiaasaweightedsumofdistancesofweightedpoints.Thisenablesustouseabundancedatainacontingencytableandcomputeitsinertiawhichinthiscasewillbetheweightedsumofthesquaresofdistancesbetweenobservedandexpectedfrequencies, suchasisusedincomputingthechisquarestatistic.Anothergeneralizationofvariance-inertiaistheusefulPhylogeneticdiversityindex. (computingthesumofdistancesbetweenasubsetoftaxathroughthetree).Wealsohavesuchgeneralizationsthatcovervariabilityofpointsonagraphtakenfromstandardspatialstatistics.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Co-Inertia
Whenstudyingtwovariablesmeasuredatthesamelocations,forinstancePH andhumiditythestandardquantificationofcovariationisthecovariance.
sum(x1 ∗ y1 + x2 ∗ y2 + x3 ∗ y3)
ifxandyco-vary-inthesamedirectionthiswillbebig.A simplegeneralizationtothiswhenthevariabilityismorecomplicatedtomeasureasaboveisdonethroughCo-Inertiaanalysis(CIA).Co-inertiaanalysis(CIA) isamultivariatemethodthatidentifiestrendsorco-relationshipsinmultipledatasetswhichcontainthesamesamplesorthesametimepoints. Thatistherowsorcolumnsofthematrixhavetobeweightedsimilarlyandthusmustbematchable.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
RV coefficient
TheglobalmeasureofsimilarityoftwodatatablesasopposedtotwovectorscanbedonebyageneralizationofcovarianceprovidedbyaninnerproductbetweentablesthatgivestheRV coefficient, anumberbetween0and1, likeacorrelationcoefficient, butfortables.
RV(A,B) = Tr(A′B)√Tr(A′A)
√Tr(B′B)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
ExampleCurrentworkwithJoeyMcMurdie, LesDethfelsen, DavidRelman, AngelaMarcobal, JustinSonnenburg. Data:
Taxa Readcounts(3patientstakingcipro: twotimecourses): .
Mass-Spec PositiveandNegativeionMassSpecfeaturesandtheirintensities: .
RNA-seq Metagenomicdataongenes:.
HereistheRV tableofthethreearraytypes:
> fourtable$RV
Taxa Kegg MassSpec+ MassSpec-
Taxa 1 0.565 0.561 0.670
Kegg 0.565 1 0.686 0.644
MassSpec+ 0.561 0.686 1 0.568
MassSpec- 0.670 0.644 0.568 1
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
Multipletablemethods
InPCA wecomputethevariance-covariancematrix, inmultipletablemethodswecantakeacubeoftablesandcomputetheRV coefficientoftheircharacterizingoperators.Wethendiagonalizethisandfindthebestweighted`ensemble'.Thisiscalledthe`compromise'andalltheindividualtablescanbeprojectedontoit.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .