Biological Interpretation of Microarray Data Helen Lockstone DTC Bioinformatics Course 9th February 2010


Biological Interpretation of Microarray Data

Helen Lockstone

DTC Bioinformatics Course

9th February 2010


Overview

• Interpreting microarray results
  – Gene lists to biological knowledge

• The Gene Ontology Consortium
  – Defined terms to describe gene function

• Functional analysis tools
  – Methods
  – DAVID/GSEA


Microarray Pipeline

Design and perform experiment

Process and normalise data

Statistical analysis

Differentially expressed genes

Biological interpretation


Biological Interpretation

• An obvious way to gain biological insight is to assess the differentially expressed genes in terms of their known function(s)

• Requires an automated and objective (statistical) approach

• Functional profiling or pathway analysis


Early functional analyses

• Manually annotate list of differentially expressed (DE) genes

• Extremely time-consuming, not systematic, user-dependent

• Group together genes with similar function

• Conclude functional categories with most DE genes important in disease/condition under study

• BUT may not be the right conclusion


GO and functional analysis

[Figure: significant genes by functional category]

Functional category    Number of sig genes
Immune response        40
Metabolism             20
Transcription          20
Energy production      10
Neurotransmission      5
Protein transport      5
TOTAL                  100

Immune response category contains 40% of all significant genes - by far the largest category.

Reasonable to conclude that immune response may be important in the condition being studied?


However ….

• What if 40% of the genes on the array were involved in immune response?

• Then only as many significant immune response genes were detected as expected by chance

• Need to consider not only the number of significant genes for each category, but also total number on the array


Same example, relative to array

Functional category    Number of genes    Actual number of     Expected number of
                       on array           significant genes    significant genes
Immune response        8000               40                   40
Metabolism             4000               20                   20
Transcription          2000               20                   10
Energy production      4000               10                   20
Neurotransmission      200                5                    1
Protein transport      1800               5                    9
ALL                    20000              100

Expected number of significant genes for category X = (num sig genes ÷ total genes on array)*(num genes in category X on array)
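
The expected counts in the table can be reproduced with a few lines of Python. This is a minimal sketch: the counts are copied from the example table, while the dictionary and function names are illustrative choices rather than part of the original slides.

```python
# Counts copied from the example table.
genes_on_array = {
    "Immune response": 8000,
    "Metabolism": 4000,
    "Transcription": 2000,
    "Energy production": 4000,
    "Neurotransmission": 200,
    "Protein transport": 1800,
}
observed_sig = {
    "Immune response": 40,
    "Metabolism": 20,
    "Transcription": 20,
    "Energy production": 10,
    "Neurotransmission": 5,
    "Protein transport": 5,
}


def expected_sig(category_size, total_sig, total_on_array):
    """Expected number of significant genes in a category by chance alone."""
    # Same formula as on the slide, with the multiplication done first so
    # these toy numbers stay exact in floating point.
    return total_sig * category_size / total_on_array


# Summing the categories gives the totals only because, in this toy example,
# every gene belongs to exactly one category.
total_on_array = sum(genes_on_array.values())  # 20000
total_sig = sum(observed_sig.values())         # 100

expected = {
    category: expected_sig(size, total_sig, total_on_array)
    for category, size in genes_on_array.items()
}

for category in genes_on_array:
    print(f"{category:20s} observed={observed_sig[category]:3d} "
          f"expected={expected[category]:.0f}")
```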


Same example, relative to array

• Now, transcription and neurotransmission categories appear more interesting as many more significant genes were observed than expected by chance

• Largest categories are not necessarily the most interesting!
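
One simple way to see which categories stand out is to divide the observed count by the expected count. The sketch below uses the numbers from the example table; the name `fold_enrichment` is an illustrative choice, and a ratio alone says nothing about statistical significance.

```python
# (observed, expected) significant-gene counts from the example table.
counts = {
    "Immune response": (40, 40),
    "Metabolism": (20, 20),
    "Transcription": (20, 10),
    "Energy production": (10, 20),
    "Neurotransmission": (5, 1),
    "Protein transport": (5, 9),
}

# Ratio > 1: more significant genes than expected; ratio < 1: fewer.
fold_enrichment = {cat: obs / exp for cat, (obs, exp) in counts.items()}

# Categories ordered from most to least over-represented.
ranked = sorted(fold_enrichment, key=fold_enrichment.get, reverse=True)

for cat in ranked:
    print(f"{cat:20s} {fold_enrichment[cat]:.2f}")
```

The largest category (immune response) has a ratio of exactly 1, while the two small categories at the top of the ranking are the ones with more significant genes than chance predicts.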


Major bioinformatic developments

• Comparing against the array requires annotating the entire set of genes

• The Gene Ontology Consortium (www.geneontology.org)

• Automated, statistical approaches for annotating gene lists and performing functional profiling


The Gene Ontology Consortium


GO Consortium

• Developed three structured and controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner

• Has become a major resource for microarray data interpretation



The Gene Ontology

• Molecular Function: basic activity or task

– e.g. catalytic activity, calcium ion binding

• Biological Process: broad objective or goal

– e.g. signal transduction, immune response

• Cellular Component: location or complex

– e.g. nucleus, mitochondrion


GO Structure

• Hierarchical tree

• Annotated with most specific annotation, forming path to top of tree

• Genes annotated with all relevant terms

• Annotations based on published studies and also electronic inferences
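
The idea that a most-specific annotation implies every term on the path to the top can be sketched as below. The hierarchy here is a toy example whose parent links are illustrative and not copied from GO, and the gene names and function names are hypothetical. The code allows a term to have several parents, since GO is strictly a directed acyclic graph rather than a simple tree.

```python
def ancestors(term, parents):
    """Collect every term reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def propagate(direct_annotations, parents):
    """Add all ancestor terms to each gene's most specific annotations."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


# Toy hierarchy (child -> list of parents); illustrative links only.
toy_parents = {
    "synaptic transmission": ["cell-cell signaling"],
    "cell-cell signaling": ["cell communication"],
    "cell communication": ["biological process"],
    "immune response": ["biological process"],
}

# Hypothetical genes with their most specific annotations.
direct = {
    "gene_1": {"synaptic transmission"},
    "gene_2": {"immune response"},
}

full = propagate(direct, toy_parents)
for gene, terms in full.items():
    print(gene, sorted(terms))
```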


GO Terms

• GO ID: GO:0007268

• GO term: synaptic transmission

• Ontology: biological process

• Definition: The process of communication from a neuron to a target (neuron, muscle, or secretory cell) across a synapse


Graphical view


http://www.ncbi.nlm.nih.gov/sites/entrez


Functional Profiling Tools


Functional profiling tools

Identify GO categories with significantly more DE genes than expected by chance (i.e. over-represented among DE genes relative to representation on array as a whole)

Correct for testing multiple GO categories

Hypergeometric Distribution or Fisher’s Exact Test
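
As a rough illustration of the test named above, the sketch below computes the hypergeometric upper-tail probability for each category in the earlier example table, using only the Python standard library. This upper-tail probability is the same quantity a one-sided Fisher's exact test would report for the corresponding 2x2 table. The slides do not say which multiple-testing correction to use; Benjamini-Hochberg is assumed here purely for illustration, and all function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n_sig, n_category, n_array):
    """P(X >= k) when n_sig genes are drawn without replacement from an array
    of n_array genes, n_category of which belong to the category."""
    max_k = min(n_sig, n_category)
    numerator = sum(
        comb(n_category, i) * comb(n_array - n_category, n_sig - i)
        for i in range(k, max_k + 1)
    )
    return numerator / comb(n_array, n_sig)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# category: (genes on array, observed significant genes), from the example table.
table = {
    "Immune response": (8000, 40),
    "Metabolism": (4000, 20),
    "Transcription": (2000, 20),
    "Energy production": (4000, 10),
    "Neurotransmission": (200, 5),
    "Protein transport": (1800, 5),
}
N_ARRAY, N_SIG = 20000, 100

raw = {
    cat: hypergeom_upper_tail(k, N_SIG, size, N_ARRAY)
    for cat, (size, k) in table.items()
}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

for cat in table:
    print(f"{cat:20s} p={raw[cat]:.4f} adjusted={adjusted[cat]:.4f}")
```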


Khatri and Draghici. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics (2005) 21(18):3587-95

Functional profiling tools


Functional profiling tools

• Freely-available stand-alone/web-based tools
  – User-friendly graphical interface and simple to use
  – Extensive documentation, plus tutorials/technical support

• Reduces a large number of DE genes to a smaller number of significantly enriched GO categories – more easily interpreted in biological context

• Considering sets of genes increases power
  – individual genes could be false positives but a set of functionally related genes all showing significant changes is more robust


DAVID Results


Advantages

• Increasingly support data (probe IDs) from different microarray platforms

• Accept various probe/gene identifiers

• Web-based tools automatically retrieve most up-to-date GO annotations

• Most automatically map from probe IDs to a gene ID - multiple significant probes for one gene could otherwise skew results
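
The last point can be illustrated with a small sketch: collapsing significant probes to unique genes before counting, so that a gene represented by several probes is counted once. The probe IDs, gene IDs and mapping below are hypothetical placeholders, not real platform annotations.

```python
# Hypothetical probe-to-gene mapping (illustrative IDs only).
probe_to_gene = {
    "probe_001": "GENE_A",
    "probe_002": "GENE_A",  # second probe for the same gene
    "probe_003": "GENE_B",
    "probe_004": "GENE_C",
}

# Hypothetical list of significant probes from a statistical analysis.
significant_probes = ["probe_001", "probe_002", "probe_003"]

# Collapse to unique genes; probes without a mapping are skipped.
significant_genes = sorted(
    {probe_to_gene[p] for p in significant_probes if p in probe_to_gene}
)

print(len(significant_probes), "significant probes ->",
      len(significant_genes), "significant genes:", significant_genes)
```

Counting probes would give GENE_A twice the weight of GENE_B in any category it belongs to; counting genes avoids that.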


Further considerations

• Reference list must be appropriate for accurate statistical analysis

• Up- and down-regulated genes can be submitted separately or as a combined list

• Unannotated genes cannot be used in the analysis; gene ontology evolving; well-studied systems over-represented


Gene set enrichment analysis

• Majority of tools based on idea of identifying GO categories significantly enriched in list of differentially expressed genes

• Requires some threshold to define genes as ‘significant’

• Recent tool called GSEA takes a different approach by considering all assayed genes


GSEA: Key Features

• Ranks all genes on array based on their differential expression

• Identifies gene sets whose member genes are clustered either towards top or bottom of the ranked list (i.e. up- or down-regulated)

• Enrichment score calculated for each category

• Permutation test to identify significantly enriched categories

• Extensive gene sets provided via MSigDB (Molecular Signatures Database) – GO, chromosome location, KEGG pathways, transcription factor or microRNA target genes


GSEA

• Each gene category tested by traversing ranked list

• Running score starts at 0; weighted increment when a member gene is encountered, weighted decrement otherwise

• Enrichment score (ES) = point where the running score is most different from zero

[Figure: ranked gene list for Disease vs. Control, running from the most significantly up-regulated genes, through unchanged genes, to the most significantly down-regulated genes]
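
The traversal described on this slide can be sketched in a few lines of Python. This is a simplified illustration and not the GSEA software itself: it assumes the increment at each member gene is weighted by the absolute ranking score raised to a power `p`, and that the decrement at non-member genes is uniform. The gene names and scores are made up.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene IDs ordered from most up- to most down-regulated.
    scores: ranking statistic for each gene, in the same order.
    """
    is_member = [gene in gene_set for gene in ranked_genes]
    n_miss = is_member.count(False)
    hit_weights = [abs(s) ** p if m else 0.0 for s, m in zip(scores, is_member)]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for weight, member in zip(hit_weights, is_member):
        if member:
            running += weight / total_hit_weight  # weighted increment
        else:
            running -= 1.0 / n_miss               # decrement for non-members
        if abs(running) > abs(best):              # keep largest deviation from 0
            best = running
    return best


# Made-up ranked list: g1 most up-regulated, g6 most down-regulated.
ranked_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(ranked_genes, scores, {"g1", "g2"})
es_bottom = enrichment_score(ranked_genes, scores, {"g5", "g6"})
print(es_top, es_bottom)
```

A set clustered at the top of the list gives a positive score and a set clustered at the bottom gives a negative one, matching the up- or down-regulated interpretation.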


GSEA algorithm


GSEA: Permutation Test

[Figure: null distribution of enrichment scores, with the actual ES marked]

• Randomise data (groups), rank genes again and repeat test 1000 times

• Null distribution of 1000 ES for each gene set

• FDR q-value computed – corrected for gene set size and testing multiple gene sets
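
A label-permutation test of this kind can be sketched as follows. Several assumptions are made for illustration only: genes are ranked by a plain difference of group means, the expression values and gene names are invented, and the function returns an empirical p-value for a single gene set. The FDR q-value step, which also accounts for gene set size and multiple gene sets, is not reproduced here.

```python
import random


def rank_by_mean_difference(expression, labels):
    """Rank genes by mean(disease) - mean(control); 1 = disease, 0 = control."""
    stats = {}
    for gene, values in expression.items():
        disease = [v for v, lab in zip(values, labels) if lab == 1]
        control = [v for v, lab in zip(values, labels) if lab == 0]
        stats[gene] = sum(disease) / len(disease) - sum(control) / len(control)
    ranked = sorted(stats, key=stats.get, reverse=True)
    return ranked, [stats[g] for g in ranked]


def enrichment_score(ranked, scores, gene_set):
    """Running-sum ES: weighted step up at members, uniform step down otherwise."""
    hit_total = sum(abs(s) for g, s in zip(ranked, scores) if g in gene_set)
    n_miss = sum(g not in gene_set for g in ranked)
    if hit_total == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g, s in zip(ranked, scores):
        running += abs(s) / hit_total if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_test(expression, labels, gene_set, n_perm=1000, seed=0):
    """Return observed ES, the null ES values, and an empirical p-value."""
    rng = random.Random(seed)
    observed = enrichment_score(
        *rank_by_mean_difference(expression, labels), gene_set
    )
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomise group membership
        null.append(
            enrichment_score(
                *rank_by_mean_difference(expression, shuffled), gene_set
            )
        )
    if observed >= 0:
        extreme = sum(es >= observed for es in null)
    else:
        extreme = sum(es <= observed for es in null)
    return observed, null, extreme / n_perm


# Invented data: three disease samples followed by three control samples.
expression = {
    "up1": [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],
    "up2": [4.0, 4.1, 3.8, 1.2, 1.0, 1.1],
    "flat1": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "flat2": [3.0, 2.9, 3.1, 3.0, 3.1, 2.9],
    "down1": [1.0, 0.9, 1.1, 4.0, 4.2, 3.9],
    "down2": [2.0, 2.1, 1.9, 5.0, 5.1, 4.9],
}
labels = [1, 1, 1, 0, 0, 0]

observed, null, p_value = permutation_test(expression, labels, {"up1", "up2"})
print(f"observed ES = {observed:.2f}, empirical p = {p_value:.3f}")
```

With only three samples per group there are just 20 distinct ways to assign the labels, so the empirical p-value cannot become very small however strong the signal; real analyses need more samples for the permutation test to have resolution.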


Biological Interpretation

• Due to the GO hierarchy, several related categories may share the subset of genes that is driving the significant enrichment score, so all of them will be significant

• Interpretation still requires substantial work
  – search literature and public databases
  – likely functional consequences of the changes
  – are the genes identified as significant within each GO category up- or down-regulated?
  – genes within a category can have opposite effects e.g. apoptosis would include genes that induce or repress apoptosis


Biological Interpretation

• Too many categories found significant
  – Size filter
  – More stringent significance threshold
  – Related categories (redundancy)

• No significant categories
  – Relax significance level slightly – e.g. 0.25 recommended by GSEA as exploratory analysis

• No significant genes
  – GSEA most suitable
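
The size filter and significance threshold can be applied as a simple post-processing step. In the sketch below the result rows, category names and size bounds are hypothetical values chosen for illustration; only the relaxed level of 0.25 comes from the slide.

```python
# Hypothetical enrichment results:
# (category name, number of genes in category, adjusted p-value or q-value).
results = [
    ("category_A", 8, 0.001),
    ("category_B", 120, 0.030),
    ("category_C", 2500, 0.004),
    ("category_D", 60, 0.200),
]


def filter_categories(results, min_size=15, max_size=500, alpha=0.05):
    """Keep categories within the size bounds and at or below the threshold."""
    return [
        (name, size, p)
        for name, size, p in results
        if min_size <= size <= max_size and p <= alpha
    ]


strict = filter_categories(results)               # stringent threshold
relaxed = filter_categories(results, alpha=0.25)  # exploratory threshold

print("strict: ", [name for name, _, _ in strict])
print("relaxed:", [name for name, _, _ in relaxed])
```

Very small and very large categories are dropped regardless of their p-value, and relaxing the threshold brings in additional categories that should be treated as exploratory.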


Commercial Tool Suites

• Ingenuity Pathway Analysis (Ingenuity Systems, CA)
  – Developed own extensive ontology over past 10 years
  – Includes gene interactions, disease/drug information
  – PhD-level curators mining the literature
  – Used by many pharmaceutical companies


For more information

• Gene Ontology: http://www.geneontology.org

• Affymetrix: http://www.affymetrix.com

• DAVID: http://david.abcc.ncifcrf.gov

• GSEA: http://www.broad.mit.edu/gsea/

• Ingenuity: http://www.ingenuity.com/products/pathways_analysis.html

• NCBI: http://www.ncbi.nlm.nih.gov/