Data mining with the Gene Ontology

Josep Lluís Mosquera

April 2005

Grup de Recerca en Estadística i Bioinformàtica

GOing into Biological Meaning

Motivation

• High throughput methodologies pose different challenges:

1) The experiment itself

2) Statistical analysis of results

3) Biological interpretation

• In gene-expression microarray studies, independently of the technology or analysis methods used, one generally obtains long lists of genes.

QUESTION: What does this mean?

Rationale

• Bioinformatics resources hold data, often in the form of sequences which are annotated in scientific natural language.

• Annotation in this form is human-readable and understandable, but it is not easy to interpret computationally.

PROBLEM: There is no common set of terms and descriptions shared across all organisms.

What can we do?

• An ontology provides a set of vocabulary terms covering a conceptual domain. These terms:

Must:

o have a definition

o be placed within a structure of relationships

May have one or more parents.

May be linked by two kinds of relationships:

o ‘is-a’ between parent and child

o ‘part-of’ between part and whole

• In this context, the Gene Ontology (GO) is a very useful resource for the initial interpretation of gene lists.

http://www.geneontology.org/

Gene Ontology Consortium

But... what’s the GO?

• It is an ontology with clear definitions of its terms and of the relationships between them. It starts at a top-level node (GO), whose children are three independent ontologies:

o Molecular Functions (MF)

o Biological Processes (BP)

o Cellular Components (CC)

Graphical Overview

• There are more than 16K nodes in GO

GO database

• The GO database consists of two essential parts:

The current ontologies:

o Vocabulary

o Structure

The current annotations:

o Create a link between the known genes and the associated GO terms that define their function.

THE CHALLENGE: Use the annotations and the structure of GO to understand the biological meaning in a large dataset of genes.

Genes and GO terms

• Each gene can have several associated GO terms

• Each GO term can be connected to several other GO terms higher in the hierarchy; these are associated with the gene too.

• We call:

o path: the list of GO terms between the root and the annotated GO term.

o split: each GO term in the path.
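The path and split definitions can be illustrated on a toy hierarchy. The sketch below is only an illustration: the term names and the `PARENTS` dictionary are hypothetical placeholders, not real GO identifiers, and the function names are assumptions introduced here.

```python
# Toy fragment of an ontology: each term maps to the list of its parents.
# All term names are hypothetical placeholders, not real GO identifiers.
PARENTS = {
    "root": [],
    "process": ["root"],
    "metabolism": ["process"],
    "signalling": ["process"],
    "kinase_cascade": ["metabolism", "signalling"],  # a term may have several parents
}


def paths_to_root(term, parents=PARENTS):
    """Return every path (list of terms) from the root down to `term`."""
    if not parents[term]:
        return [[term]]
    paths = []
    for parent in parents[term]:
        for path in paths_to_root(parent, parents):
            paths.append(path + [term])
    return paths


def splits(term, parents=PARENTS):
    """Return the set of all terms lying on any path from the root to `term`."""
    return {t for path in paths_to_root(term, parents) for t in path}


if __name__ == "__main__":
    for path in paths_to_root("kinase_cascade"):
        print(" -> ".join(path))
    print(sorted(splits("kinase_cascade")))
```

Because a term may have more than one parent, a single annotation can yield several paths, which is why the number of splits grows much faster than the number of genes.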

Our context

• A list of 100 genes will usually have many hundreds of associated GO terms and several thousand associated splits.

OBJECTIVE: To assign biological meaning to lists of differentially expressed genes by means of the Gene Ontology (GO).

Statistical Methods

• Let us consider:

o N genes on a microarray:

M belong to a given GO term category (A)

N - M do not belong to it (category A^c)

o K of the N genes are selected and assigned to a given class (e.g. regulated genes)

o x of these K genes will be in A (EXAMPLE)

STATISTICAL HYPOTHESIS:

H0: GO category A is equally represented on the microarray and in the class of differentially regulated genes.

H1: GO category A is over- (or under-) represented in the class of differentially regulated genes relative to the microarray.

Hypergeometric Distribution (1/2)

We ask: Assuming sampling without replacement, what is the probability of having exactly x genes of category A?

• The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modelled by a hypergeometric distribution with parameters (N, M, K):

$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}}$$
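The probability of exactly x genes of category A can be evaluated directly from binomial coefficients. The following is a minimal sketch using only the Python standard library; the function name `hypergeom_pmf` and the toy numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_pmf(x, N, M, K):
    """P(X = x): probability that exactly x of the K selected genes belong
    to a category containing M of the N genes (sampling without replacement)."""
    if x < 0 or x > K or x > M or K - x > N - M:
        return 0.0  # impossible outcome
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


if __name__ == "__main__":
    # Toy numbers (hypothetical): 10 genes, 4 in category A, 3 selected.
    for x in range(4):
        print(x, hypergeom_pmf(x, N=10, M=4, K=3))
```

Python integers have arbitrary precision, so the binomial coefficients are exact and only the final division is done in floating point.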

Hypergeometric Distribution (2/2)

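• So, under the null hypothesis, the p_value of having x or more genes in A will be:

$$p\_value = P(X \ge x \mid H_0) = \sum_{k=x}^{K} \frac{\binom{M}{k}\binom{N-M}{K-k}}{\binom{N}{K}}$$

• This corresponds to a one-sided test in which small p_values relate to over-represented GO terms.

• For under-represented categories, the p_value can be calculated as 1 - p_value (for a discrete count, the exact lower tail is P(X ≤ x) = 1 - P(X ≥ x + 1)).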
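The one-sided p_values for over- and under-representation can be sketched as two tail sums of the hypergeometric distribution. This is a minimal illustration with the standard library; the function names and the toy numbers are assumptions, not part of the original slides.

```python
from math import comb


def hypergeom_upper_tail(x, N, M, K):
    """p_value for over-representation: P(X >= x) under H0."""
    start = max(x, 0)
    stop = min(K, M)  # X cannot exceed the sample size or the category size
    favourable = sum(comb(M, k) * comb(N - M, K - k) for k in range(start, stop + 1))
    return favourable / comb(N, K)


def hypergeom_lower_tail(x, N, M, K):
    """p_value for under-representation: P(X <= x) = 1 - P(X >= x + 1)."""
    return 1.0 - hypergeom_upper_tail(x + 1, N, M, K)


if __name__ == "__main__":
    # Toy numbers (hypothetical): 10 genes, 4 in category A, 3 selected.
    print(hypergeom_upper_tail(2, N=10, M=4, K=3))
    print(hypergeom_lower_tail(1, N=10, M=4, K=3))
```

Note that the two tails overlap at the observed count x, so they do not sum to one when both include it.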

Disadvantages

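• The hypergeometric distribution is rather difficult and time-consuming to calculate when N is large.

• One can prove that

$$\mathrm{Hyp}(N, M, K) \xrightarrow{N \to \infty} \mathrm{Bin}\left(K, \frac{M}{N}\right)$$

• Using this approximation, the p_value for over-represented GO terms can be calculated as

$$p\_value = 1 - \sum_{i=0}^{x-1} \binom{K}{i} \left(\frac{M}{N}\right)^{i} \left(1 - \frac{M}{N}\right)^{K-i}$$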
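The quality of the binomial approximation can be checked numerically by comparing it with the exact hypergeometric tail. The sketch below assumes hypothetical values of N, M, K and x chosen so that K is small relative to N; the function names are introduced here for illustration only.

```python
from math import comb


def binom_upper_tail(x, K, p):
    """P(X >= x) for X ~ Bin(K, p), computed as 1 - sum_{i=0}^{x-1} P(X = i)."""
    lower = sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(x))
    return 1.0 - lower


def hypergeom_upper_tail(x, N, M, K):
    """Exact P(X >= x) for the hypergeometric distribution with parameters (N, M, K)."""
    favourable = sum(
        comb(M, k) * comb(N - M, K - k) for k in range(max(x, 0), min(K, M) + 1)
    )
    return favourable / comb(N, K)


if __name__ == "__main__":
    # Hypothetical values: large N, small sample K.
    N, M, K, x = 10_000, 500, 20, 3
    print("binomial      :", binom_upper_tail(x, K, M / N))
    print("hypergeometric:", hypergeom_upper_tail(x, N, M, K))
```

The two values agree closely when K is much smaller than N; the approximation degrades as the selected list becomes a substantial fraction of the microarray.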

Alternative approaches

• Let us assume the following 2×2 contingency table:

                   Differentially regulated genes (D)    D^c    Genes on microarray
    Category A     n11                                   n12    N1.
    A^c            n21                                   n22    N2.
                   N.1                                   N.2    N..

where N=N.., M=N1., K=N.1 and x=n11

• Using this notation, alternatives include:

o χ² test for equality of two proportions

o Fisher’s Exact Test

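Chi-square Test (χ²)

• The χ² statistic can be calculated as

$$\chi^2 = \frac{N_{..}\left(\left|n_{11}n_{22} - n_{12}n_{21}\right| - \frac{N_{..}}{2}\right)^2}{N_{1.}\,N_{2.}\,N_{.1}\,N_{.2}} \sim \chi^2_1$$

PROBLEMS: It cannot:

1. Distinguish between under- and over-represented gene categories.

2. Be used for small samples, i.e. when

$$\frac{N_{i.}\,N_{.j}}{N_{..}} < 5$$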
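A minimal sketch of the χ² computation for a 2×2 table follows. It uses the continuity-corrected statistic N..(|n11 n22 - n12 n21| - N../2)² / (N1. N2. N.1 N.2), obtains the p_value from the χ² distribution with one degree of freedom, and reports the smallest expected cell count so that the small-sample rule can be checked. The function names and the example table are hypothetical.

```python
from math import erfc, sqrt


def chi_square_yates(n11, n12, n21, n22):
    """Continuity-corrected chi-square statistic for a 2x2 table."""
    row1, row2 = n11 + n12, n21 + n22  # N1., N2.
    col1, col2 = n11 + n21, n12 + n22  # N.1, N.2
    total = row1 + row2                # N..
    numerator = total * (abs(n11 * n22 - n12 * n21) - total / 2) ** 2
    return numerator / (row1 * row2 * col1 * col2)


def chi_square_p_value(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2))


def min_expected_count(n11, n12, n21, n22):
    """Smallest expected cell count Ni. * N.j / N.. over the four cells."""
    row_totals = (n11 + n12, n21 + n22)
    col_totals = (n11 + n21, n12 + n22)
    total = sum(row_totals)
    return min(r * c / total for r in row_totals for c in col_totals)


if __name__ == "__main__":
    table = (30, 10, 10, 30)  # hypothetical counts n11, n12, n21, n22
    stat = chi_square_yates(*table)
    print(stat, chi_square_p_value(stat), min_expected_count(*table))
```

The statistic depends only on the absolute value of n11 n22 - n12 n21, which is why the test cannot tell over-representation from under-representation.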

Fisher’s Exact Test

• This test treats the marginal totals as fixed and uses the hypergeometric distribution to calculate the probability of observing each individual table as:

$$P = \frac{N_{1.}!\,N_{2.}!\,N_{.1}!\,N_{.2}!}{N_{..}!\,n_{11}!\,n_{12}!\,n_{21}!\,n_{22}!}$$

• One can calculate a table containing all possible combinations of n11, n12, n21 and n22.

• The p_value for a particular occurrence is the sum of all probabilities lower than or equal to the probability corresponding to the observed combination.
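The enumeration described for Fisher's exact test can be sketched as follows: with the margins fixed, every possible table is determined by its top-left cell n11, so one loops over the admissible values of n11 and sums the probabilities that do not exceed that of the observed table. The function names and the example table are assumptions made for illustration.

```python
from math import comb


def table_probability(n11, row1, row2, col1):
    """Probability of the 2x2 table with top-left cell n11 and fixed margins.

    Algebraically equal to N1.! N2.! N.1! N.2! / (N..! n11! n12! n21! n22!).
    """
    return comb(row1, n11) * comb(row2, col1 - n11) / comb(row1 + row2, col1)


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Sum the probabilities of all tables no more likely than the observed one."""
    row1, row2, col1 = n11 + n12, n21 + n22, n11 + n21
    lowest = max(0, col1 - row2)   # smallest admissible n11
    highest = min(row1, col1)      # largest admissible n11
    p_observed = table_probability(n11, row1, row2, col1)
    # Small relative tolerance so that ties with the observed table are included.
    threshold = p_observed * (1 + 1e-9)
    total = sum(
        p
        for p in (table_probability(a, row1, row2, col1) for a in range(lowest, highest + 1))
        if p <= threshold
    )
    return min(1.0, total)


if __name__ == "__main__":
    print(fisher_exact_two_sided(3, 1, 1, 3))  # hypothetical small table
```

Unlike the χ² approximation, this calculation remains valid for small expected counts, at the cost of enumerating every admissible table.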

Correction for Multiple Tests

• As the number of GO terms tested for significance is large, p_values have to be corrected for multiple testing. For instance:

o Methods controlling the False Discovery Rate (FDR):

Benjamini and Hochberg (assuming independence)

Benjamini and Yekutieli (dropping independence)

o Methods controlling the Family-Wise Error Rate (FWER):

Holm correction

Westfall and Young
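Two of the corrections listed above, Holm (FWER) and Benjamini and Hochberg (FDR), are simple enough to sketch in a few lines. The code below computes adjusted p_values in plain Python; the function names and the example p_values are hypothetical, and the resampling-based Westfall and Young procedure is not covered.

```python
def holm(pvalues):
    """Holm step-down adjusted p_values (controls the FWER)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjusted p_values (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p_value down
        i = order[rank]
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical p_values, one per GO term
    print(holm(raw))
    print(benjamini_hochberg(raw))
```

Both functions return the adjusted values in the order of the input, so they can be matched back to the GO terms that produced them.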

Example

N = 9177 genes on the microarray

M = 467 genes in GO category A

N - M = 8710 genes in A^c

K = 173 genes picked randomly

x = 51 genes of category A
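The numbers of this example can be plugged into the hypergeometric model. The sketch below builds the 2×2 table implied by N, M, K and x, computes the count expected under H0 (K·M/N), and evaluates the exact upper-tail p_value; the variable names are assumptions and no result other than the input figures is taken from the slides.

```python
from math import comb

# Values from the example
N, M, K, x = 9177, 467, 173, 51

# 2x2 contingency table implied by these values
n11 = x                  # selected and in category A
n12 = M - x              # in category A but not selected
n21 = K - x              # selected but not in category A
n22 = N - M - K + x      # neither selected nor in category A
table = (n11, n12, n21, n22)

# Number of category-A genes expected among the K selected genes under H0
expected = K * M / N

# Exact hypergeometric p_value for over-representation: P(X >= x)
p_value = sum(
    comb(M, k) * comb(N - M, K - k) for k in range(x, min(K, M) + 1)
) / comb(N, K)

if __name__ == "__main__":
    print("table        :", table)
    print("expected n11 :", round(expected, 2))
    print("p_value      :", p_value)
```

With about 9 genes expected and 51 observed, category A is strongly over-represented in the selected list.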

Miguel.... GO!
