Data mining with the Gene Ontology
Josep Lluís Mosquera
April 2005
Grup de Recerca en Estadística i Bioinformàtica
GOing into Biological Meaning
Motivation
• High-throughput methodologies pose different challenges:
1) The experiment in itself
2) Statistical analysis of results
3) Biological interpretation
• In gene-expression microarray studies, independently of the technology or analysis methods used, one generally obtains long lists of genes.
QUESTION: What does this mean?
Rationale
• Bioinformatics resources hold data, often in the form of sequences which are annotated in scientific natural language.
• Annotation in this form is human-readable and understandable, but it is not easy to interpret computationally.
PROBLEM: There is no set of terms and descriptions that is common to all organisms.
What can we do?
• An ontology provides a set of vocabulary terms covering a conceptual domain. These terms:
Must:
o have a definition
o be placed within a structure of relationships
May have one or more parents.
May be linked by two kinds of relationships:
o ‘is-a’ between parent and child
o ‘part-of’ between part and whole
• In this context, the Gene Ontology (GO) is a very useful resource for the initial interpretation of gene lists.
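The requirements listed above (a definition, a place in a structure of relationships, possibly several parents, and ‘is-a’ or ‘part-of’ links) can be sketched as a small data structure. This is only an illustration: the term identifiers, the definitions, and the helper `parents_of` are hypothetical and are not real GO accessions or part of any GO tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Term:
    """One vocabulary term: an identifier, a definition, and typed links to parents."""
    term_id: str
    definition: str
    # Each entry is (relationship, parent_id); relationship is "is-a" or "part-of".
    parents: List[Tuple[str, str]] = field(default_factory=list)

# Hypothetical toy terms; the identifiers are placeholders, not real GO accessions.
ontology = {
    "T:root": Term("T:root", "top-level node"),
    "T:cell": Term("T:cell", "a cell", [("is-a", "T:root")]),
    "T:membrane": Term("T:membrane", "a membrane", [("part-of", "T:cell")]),
    "T:organelle": Term("T:organelle", "an organelle", [("part-of", "T:cell")]),
    # A term may have more than one parent.
    "T:organelle_membrane": Term(
        "T:organelle_membrane",
        "a membrane surrounding an organelle",
        [("is-a", "T:membrane"), ("part-of", "T:organelle")],
    ),
}

def parents_of(term_id, relationship=None):
    """Return the parent ids of a term, optionally filtered by relationship type."""
    return [parent for rel, parent in ontology[term_id].parents
            if relationship is None or rel == relationship]

print(parents_of("T:organelle_membrane"))
print(parents_of("T:organelle_membrane", "part-of"))
```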
Gene Ontology Consortium
But... what’s the GO?
• It is an ontology with clear definitions of its terms and of the relationships between them. It starts at the top level (GO), whose children are three independent ontologies:
o Molecular Functions (MF)
o Biological Processes (BP)
o Cellular Components (CC)
Graphical Overview
• There are more than 16K nodes in GO
GO database
• Consists of two essential parts:
The current ontologies:
o Vocabulary
o Structure
The current annotations:
o Create a link between the known genes and the associated GOs that define their function.
THE CHALLENGE: Use the annotations and structure of the GOs to understand the biological meaning in a large dataset of genes.
Genes and GO terms
• Each gene can have several associated GO terms
• Each GO term can be connected to several other GO terms higher in the hierarchy; these are associated with the gene too.
• We call:
o path: the list of GO terms between the root and the annotated GO term.
o split: each GO term in the path.
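The notions of path and split can be illustrated with a toy hierarchy. The sketch below assumes that a path includes both the root and the annotated term, and that the splits of a term are all terms found on any of its paths; the hierarchy `PARENTS` and the gene annotation are hypothetical.

```python
# Hypothetical toy hierarchy: child -> list of parents (placeholder ids, not real GO accessions).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # two parents, so two paths lead back to the root
    "D": ["C"],
}

def paths_to_root(term):
    """Return every path from the root down to `term`, each as a list of term ids."""
    if not PARENTS[term]:
        return [[term]]
    return [path + [term]
            for parent in PARENTS[term]
            for path in paths_to_root(parent)]

def splits(term):
    """Return the set of all terms that appear on any path to `term`."""
    return {t for path in paths_to_root(term) for t in path}

# Hypothetical annotation: one gene annotated with one term.
gene_annotations = {"gene1": ["D"]}
for gene, terms in gene_annotations.items():
    for term in terms:
        print(gene, term, paths_to_root(term), sorted(splits(term)))
```

Even in this tiny case one annotated term yields two paths and five splits, which is why a list of 100 genes quickly accumulates thousands of splits.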
Our context
• A list of 100 genes will usually have many hundreds of associated GO terms and several thousand associated splits.
OBJECTIVE: Give biological meaning to lists of differentially expressed genes by means of the Gene Ontology (GO).
Statistical Methods
• Let us consider:
o N genes on a microarray:
  M belong to a given GO term category (A)
  N - M do not belong to it (category Ac)
o K of the N genes are selected and assigned to a given class (e.g. regulated genes)
o x of these K genes will be in A (see Example)
STATISTICAL HYPOTHESIS:
H0: GO category A is equally represented on the microarray and in the class of differentially regulated genes
H1: GO category A is over- (or under-) represented in the class of differentially regulated genes relative to the microarray
Hypergeometric Distribution (1/2)
We ask: Assuming sampling without replacement, what is the probability of having exactly x genes of category A?
• The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modelled by a hypergeometric distribution with parameters (N, M, K):
$$P(X = x) = \frac{\binom{M}{x} \binom{N-M}{K-x}}{\binom{N}{K}}$$
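The hypergeometric probability of exactly x genes of category A among the K selected ones can be evaluated directly from binomial coefficients. A minimal sketch, assuming Python 3.8+ for `math.comb`; the function name `hypergeom_pmf` and the toy values of N, M and K are illustrative only.

```python
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(X = x): x genes of category A among K selected, when M of the N genes are in A."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

# Toy numbers chosen for illustration only.
N, M, K = 20, 5, 4
probs = [hypergeom_pmf(x, N, M, K) for x in range(0, min(M, K) + 1)]
print(probs)
print(sum(probs))  # probabilities over all feasible x add up to 1, up to rounding
```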
Hypergeometric Distribution (2/2)
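• So, under the null hypothesis, the p_value of observing x or more genes in A is:
$$p\_value = P(X \ge x \mid H_0) = \sum_{k=x}^{K} \frac{\binom{M}{k} \binom{N-M}{K-k}}{\binom{N}{K}}$$
• This corresponds to a one-sided test in which small p_values relate to over-represented GO terms.
• For under-represented categories, the p_value can be calculated as 1 - p_value (exactly, 1 - p_value + P(X = x), since the observed count belongs to both tails).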
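Both tails of the test can be computed by summing hypergeometric terms. The sketch below uses exact integer arithmetic and converts to a float only at the end; the function names and toy numbers are assumptions made for illustration. It also shows that the lower tail P(X <= x) differs from 1 - p_value by the probability of the observed count itself.

```python
from fractions import Fraction
from math import comb

def over_representation_p_value(x, N, M, K):
    """P(X >= x | H0) for a hypergeometric count, summed exactly."""
    numerator = sum(comb(M, k) * comb(N - M, K - k) for k in range(x, K + 1))
    return float(Fraction(numerator, comb(N, K)))

def under_representation_p_value(x, N, M, K):
    """P(X <= x | H0), the lower tail including the observed count."""
    numerator = sum(comb(M, k) * comb(N - M, K - k) for k in range(0, x + 1))
    return float(Fraction(numerator, comb(N, K)))

# Toy numbers chosen for illustration only.
N, M, K, x = 20, 5, 4, 2
p_over = over_representation_p_value(x, N, M, K)
p_under = under_representation_p_value(x, N, M, K)
p_exact_x = comb(M, x) * comb(N - M, K - x) / comb(N, K)
print(p_over, p_under)
print(p_under - (1 - p_over), p_exact_x)  # the two tails overlap in P(X = x)
```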
Disadvantages
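• The hypergeometric distribution is rather difficult and time-consuming to calculate when N is large.
• It can be shown that
$$\mathrm{Hyp}(N, M, K) \xrightarrow{N \to \infty} \mathrm{Bin}\left(K, \frac{M}{N}\right)$$
• Using this approximation, the p_value for over-represented GO terms can be calculated as
$$p\_value = 1 - \sum_{i=0}^{x-1} \binom{K}{i} \left(\frac{M}{N}\right)^{i} \left(1 - \frac{M}{N}\right)^{K-i}$$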
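The quality of the binomial approximation can be checked numerically by growing N while keeping the proportion M/N and the sample size K fixed. This is a sketch with hypothetical toy numbers and function names, not a statement about any particular microarray.

```python
from math import comb

def hypergeom_tail(x, N, M, K):
    """Exact P(X >= x) under the hypergeometric model."""
    return sum(comb(M, k) * comb(N - M, K - k) for k in range(x, K + 1)) / comb(N, K)

def binomial_tail(x, N, M, K):
    """Approximate P(X >= x) using Bin(K, M/N)."""
    p = M / N
    return 1.0 - sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(0, x))

K, x = 10, 4
gaps = []
for scale in (1, 10, 100):
    N, M = 100 * scale, 20 * scale  # M/N stays at 0.2 while N grows
    exact = hypergeom_tail(x, N, M, K)
    approx = binomial_tail(x, N, M, K)
    gaps.append(abs(exact - approx))
    print(N, exact, approx)
```

The gap between the two tails shrinks as N grows relative to K, which is the regime in which the approximation is appropriate.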
Alternative approaches
• Let us assume
                 Differentially regulated genes (D)    Dc     Genes on microarray
Category A       n11                                   n12    N1.
Ac               n21                                   n22    N2.
Total            N.1                                   N.2    N..
where N = N.., M = N1., K = N.1 and x = n11
• Using this notation, alternatives include:
o the chi-square (χ²) test for equality of two proportions
o Fisher’s Exact Test
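Chi-square Test (χ²)
• The χ² statistic can be calculated as
$$\chi^2 = \frac{N \left( \left| n_{11} n_{22} - n_{12} n_{21} \right| - \frac{N}{2} \right)^2}{N_{1\cdot} \, N_{2\cdot} \, N_{\cdot 1} \, N_{\cdot 2}} \sim \chi^2_1$$
PROBLEMS: It cannot:
1. Distinguish between under- and over-represented gene categories.
2. Be used for small samples, i.e. when $\frac{N_{i\cdot} N_{\cdot j}}{N} < 5$ for some cell (i, j).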
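The chi-square statistic for a 2x2 table, together with the small-sample check on expected counts, can be sketched as follows. The continuity correction (subtracting N/2) is assumed here and can be switched off; the function names and the toy table are hypothetical.

```python
def chi_square_2x2(n11, n12, n21, n22, yates=True):
    """Chi-square statistic for a 2x2 table, optionally with the N/2 continuity correction."""
    N = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    diff = abs(n11 * n22 - n12 * n21)
    if yates:
        diff = max(diff - N / 2, 0.0)
    return N * diff**2 / (row1 * row2 * col1 * col2)

def min_expected_count(n11, n12, n21, n22):
    """Smallest expected cell count N_i. * N_.j / N; the test is unreliable below 5."""
    N = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    return min(r * c / N for r in rows for c in cols)

# Toy table chosen for illustration only.
table = (30, 70, 20, 180)
print(chi_square_2x2(*table), chi_square_2x2(*table, yates=False))
print(min_expected_count(*table))
```

The statistic is compared with a chi-square distribution with 1 degree of freedom, whose 5% critical value is about 3.84. Because the difference is squared, the sign of the departure is lost, which is why the test cannot tell over- from under-representation.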
Fisher’s Exact Test
• This test treats the marginal totals as fixed and uses the hypergeometric distribution to calculate the probability of observing each individual table as:
$$P = \frac{N_{1\cdot}! \, N_{2\cdot}! \, N_{\cdot 1}! \, N_{\cdot 2}!}{N! \, n_{11}! \, n_{12}! \, n_{21}! \, n_{22}!}$$
• One can calculate a table containing all possible combinations of n11, n12, n21, n22.
• The p_value for a particular occurrence is the sum of all probabilities lower than or equal to the probability corresponding to the observed combination.
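The enumeration described above can be sketched by letting n11 range over every value compatible with the fixed margins. The binomial-coefficient form used below is algebraically the same as the factorial formula; the function names and the toy table are illustrative assumptions.

```python
from math import comb

def table_probability(n11, row1, col1, N):
    """Probability of the 2x2 table with top-left cell n11, given fixed margins."""
    return comb(row1, n11) * comb(N - row1, col1 - n11) / comb(N, col1)

def fisher_exact_two_sided(n11, n12, n21, n22):
    """Sum the probabilities of all tables no more likely than the observed one."""
    row1, col1 = n11 + n12, n11 + n21
    N = n11 + n12 + n21 + n22
    lowest = max(0, row1 + col1 - N)   # smallest feasible n11
    highest = min(row1, col1)          # largest feasible n11
    p_observed = table_probability(n11, row1, col1, N)
    total = sum(
        p
        for p in (table_probability(k, row1, col1, N) for k in range(lowest, highest + 1))
        # small relative tolerance so that ties are not lost to rounding
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)

# Toy table chosen for illustration only.
print(fisher_exact_two_sided(3, 1, 1, 3))
```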
Correction for Multiple Tests
• As the number of GO terms tested for significance is large, p_values have to be corrected for multiple testing. For instance:
o Methods controlling the False Discovery Rate (FDR):
  Benjamini and Hochberg (assuming independence)
  Benjamini and Yekutieli (dropping independence)
o Methods controlling the Family-Wise Error Rate (FWER):
  Holm correction
  Westfall and Young
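Two of the corrections named above can be sketched in a few lines: the Benjamini and Hochberg step-up procedure for the FDR and the Holm step-down procedure for the FWER. Both functions return adjusted p_values in the original order; the function names and the input values are hypothetical, and the resampling-based Westfall and Young method is not covered.

```python
def benjamini_hochberg(p_values):
    """Adjusted p-values controlling the FDR (Benjamini-Hochberg step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value down to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def holm(p_values):
    """Adjusted p-values controlling the FWER (Holm step-down)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, min(1.0, p_values[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted

# Hypothetical p-values, one per tested GO term.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
print(holm(raw))
```

A GO term is then reported when its adjusted p_value falls below the chosen threshold.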
Example
N = 9177 genes on microarray
M = 467 in GO category A
N - M = 8710 in Ac
K = 173 genes picked randomly
x = 51 genes of category A
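With these numbers, the expected count of category A genes among the K selected ones is K·M/N, about 8.8, far below the observed 51. The over-representation p_value can be computed exactly with integer arithmetic, as in this sketch; the variable names are illustrative, and the printed value is left for the reader to run.

```python
from fractions import Fraction
from math import comb

N, M, K, x = 9177, 467, 173, 51  # values from the example

# Expected number of category A genes in a random selection of K genes.
expected = K * M / N

# Exact upper tail P(X >= x | H0) of the hypergeometric distribution.
numerator = sum(comb(M, k) * comb(N - M, K - k) for k in range(x, K + 1))
p_value = Fraction(numerator, comb(N, K))

print(f"expected count in A: {expected:.2f}")
print(f"P(X >= {x} | H0) = {float(p_value):.3e}")
```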
Miguel.... GO!