bioinformatics long-term support (wabi) systems biology ... · piano –a platform for gsa (in r)...
TRANSCRIPT
LeifVä[email protected]
BioinformaticsLong-termSupport(WABI)SystemsBiologyFacility@Chalmers
Gene-setanalysisanddataintegration
Outline
Will try to be practical, without getting to the detail of code-level
• Gene-setanalysis- Whatandwhy?• Gene-setcollections• MethodsforGSA• Gene-setdirectionality,overlap/interactions,biases• Thingstoconsider
2
Whatisgene-setanalysis(GSA)?
Immuneresponse
Pyruvate
Gene-level data Gene-setdata(results)
PPARG
Gene-setanalysisGO-termsPathwaysChromosomal locationsTranscription factorsHistone modificationsDiseasesetc…
Samples
Genes
Wewillfocusontranscriptomics anddifferentialexpressionanalysisHowever,GSAcaninprinciplebeusedonalltypesofgenome-widedata.
3
Manynamesforgene-setanalysis(GSA)
• Functionalannotation• Pathwayanalysis• Gene-setenrichmentanalysis• GO-termanalysis• Genelistenrichmentanalysis• …
4
Whygene-setanalysis(GSA)?
• Interpretationofgenome-wideresults• Gene-setsare(typically)fewerthanallthegenesandhave
moredescriptivenames• Difficulttomanagealonglistofsignificantgenes• Detectpatternsthatwouldbedifficulttodiscernsimplyby
manuallygoingthroughe.g.thelistofdifferentiallyexpressedgenes
• Integratesexternalinformationintotheanalysis• Lesspronetofalse-positivesonthegene-level• Topgenesmightnotbetheinterestingones,several
coordinatedsmallerchanges
5
Gene-sets
6
Sowhataboutgene-sets?
• Dependsontheresearchquestion• Severaldatabases/resourcesavailableprovidinggene-set
collections(e.g.MSigDB,Enrichr)• Includeddirectlyinsomeanalysistools• GO-termsareprobablyoneofthemostwidelyusedgene-sets
GO-termsPathwaysChromosomal locationsTranscription factorsHistone modificationsDiseasesMetabolitesetc…
7
Gene-setexample:Geneontology(GO)terms
• Hierarchical graph with three categories (or parents):Biological process, Molecular function, Cellular compartment
• Terms get more and more detailed moving down the hierarchy• Genes can belong to multiple GO terms
8
Gene-setexample:Metabolicpathwaysormetabolites
9
Gene-setexample:Transcriptionfactortargets
10
Gene-setexample:Hallmarkgene-sets
“Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.”
Liberzon et al. (2015) Cell Systems 1:417-425
http://software.broadinstitute.org/gsea/msigdb/collections.jsp
11
Wheretogetgene-setcollections?
http://amp.pharm.mssm.edu/Enrichr/#statshttp://software.broadinstitute.org/gsea/msigdb/index.jsp
12Parsed info from various databases. Focus on human.
13
• GO annotations for many specieshttp://geneontology.org/page/download-annotations
• clusterProfiler (R/Bioconductor package)http://bioconductor.org/packages/devel/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html#go-gene-set-enrichment-analysis
Wheretogetgene-setcollections?
Not working with human data?
doi: 10.1002/ajmg.b.32328
Wheretogetgene-setcollections?
• Soonerorlateryouwillrunintotheproblemofmatchingyourdatatogene-setcollectionsduetotheexistenceofseveralgeneIDtypes
14
Wheretogetgene-setcollections?
http://www.ensembl.org/biomart/martview
One way to map different gene IDs to each other, or to assemble a gene-set collection with the gene IDs used by your data
See also: DAVID https://david.ncifcrf.gov/content.jsp?file=conversion.htmlMygene http://mygene.info/ and http://bioconductor.org/packages/release/bioc/html/mygene.html 15
Gene-setanalysistoolsandmethods
16
ToolsandmethodsforGSA
OmicsTools (several platforms)http://omictools.com/gene-set-analysis-category
Bioconductor (R packages)https://bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment
Some examples:• Hypergeometric test / Fisher’s exact test
(a.k.a overrepresentation analysis)• DAVID (browser)• Enrichr (browser)• GSEA (Java, R)• piano (R)
17
Also exists e.g.:• GSA for GWAS, miRNA, …• Network-based• PlantGSEA• GSA controlling for length bias in RNA-seq• …
There are hundreds of tools to choose between…
Overrepresentationanalysis
All genes (universe)
Selected list of genes GO:003478
GO:000237GO:002736
GO:009835
Is this overlap bigger than expected by
random chance?
8 292 19768
Selected Not selectedIn GO-term
Not in GO-term
Hypergeometric test(Fisher’s exact test)
18
Overrepresentationanalysis
https://david.ncifcrf.gov/home.jsphttp://amp.pharm.mssm.edu/Enrichr/
19
Overrepresentationanalysis
• Requiresacutoff(arbitrary)• Omitstheactualvaluesofthegene-levelstatistics• Goodfore.g.overlapofsignificantgenesintwocomparisons• Computationallyfast
Incontrast,gene-setanalysisiscutoff-freeandusesallgene-leveldataandcandetectsmallbutcoordinatechangesthatcollectivelycontributetosomebiologicalprocess.
20
8 2
92 19768
Selected Not selected
In GO-termNot in GO-term
114 4513
A vs ctrl B vs ctrl
GSA:asimpleexample
Samples
Genes
Gene-set 1
Gene-set 2
𝑆" = 𝑚𝑒𝑎𝑛(𝐺")
• S is the gene-set statistic• G are gene-level statistics of the genes in the gene-set
𝑆+ = −0.1
𝑆0 = 6.2
Permute the gene-labels (or sample labels) and redo the calculations over and over again (e.g. 10,000 times)!
𝑝" = fractionof𝑆=>?@AB>Cthatismoreextremethan𝑆"
𝑆=>?@AB>C
-606
21
𝑆" = 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛(𝐺", [remaininggenes])
Gene-levelstatistics
• p-values• t-values,etc• Fold-changes• Ranks• Correlations• Signaltonoiseratio• …
22
GSEA
Mootha et al Nature Genetics, 2003; Subramanian PNAS 2005
23
Piano– aplatformforGSA(inR)
• Reporterfeatures• Parametricanalysisofgene-setenrichment,PAGE• Tailstrength• Wilcoxonrank-sumtest• Gene-setenrichmentanalysis,GSEA(twoimplementations)• Mean• Median• Sum• Maxmean
Consensusresult
Disclaimer: The author of this presentation is the developer of piano 24
Directionality,overlap,interaction,biases…
25
Directionalityofgene-sets
Disclaimer: The author of this presentation is the developer of piano 26
Gene-setoverlapandinteraction
• High number of very overlapping gene-sets (representing a similar biological theme) can bias interpretation and take attention from other biological themes that are represented by fewer gene-sets.
27
Image from Enrichment Map http://dx.doi.org/10.1371/journal.pone.0013984
Gene-overlap network
Gene-setoverlapandinteraction
Examples of gene-set “interactions”
• High number of very overlapping gene-sets (representing a similar biological theme) can bias interpretation and take attention from other biological themes that are represented by fewer gene-sets.
• Can be valuable to take gene-set interaction into account (e.g. www.sysbio.se/kiwi)
28
ConsiderationswhenperformingGSA
• Biasingene-setcollections(populardomains,multifunctionalgenes,…)• Gene-setnamescanbemisleading(revisitthegenes!)• Considerthegene-setsize,i.e.numberofgenes(specificorgeneral)• Positiveandnegativeassociationbetweengenesandgene-setsmakesgene-
levelfold-changestrickytointerpretcorrectly• (Typically)binaryassociationtogene-sets,doesnottakeintoaccount
varyinglevelsofinfluencefromindividualgenesontheprocessthatisrepresentedbythegene-sets
• Remembertorevisitthegene-leveldata!Arethegenessignificant?Aretheycorrectlyassignedtothespecificgene-set?
• Remembertoadjustformultipletesting
Gene-setanalysisisaveryefficientandusefultooltointerpretyourgenome-widedata!JustremembertocriticallyevaluatetheresultsJ
29