
Post on 22-Dec-2015


Gene Set Enrichment Analysis

Petri Törönen
petri(DOT)toronen(AT)helsinki.fi

What, Why, How…

• Gene expression data/analysis
• Problems with gene expression data analysis
• Earlier solutions
• My solution
• Comparisons
• Conclusions / Warnings

Genome-wide gene expression

• Genome-wide Gene Expression (GE) analysis: a standard lab tool
• Various methods
• Aim to understand biological differences across the samples at gene level
• If you don’t work with GE data:
– Gene Set Methods can be used with most other large-scale data sets

Typical pipelines

Pipeline 1:
• Generate the GE data
• Pre-processing (normalization etc.)
• Define Differentially Expressed genes
• Find over-represented biological processes
• Draw biological conclusions

Pipeline 2:
• Generate the GE data
• Pre-processing (normalization etc.)
• Define Differentially Expressed genes
• Cluster selected genes
• Draw biological conclusions

Pipeline 3:
• Generate the GE data
• Pre-processing (normalization etc.)
• Define Differentially Expressed genes
• Generate a classification of samples using GE profiles of genes
• Classify unknown samples
• Draw biological conclusions

What can go wrong?

• Is the definition of Differentially Expressed genes always reasonable?
– datasets with large noise levels
– p-value thresholds
– sudden jump to signif. regulation
– genes with weak regulation
• Is the set of Diff. Expr. genes the main goal?
=> Biological Processes are usually more informative.

What can go wrong?

Analysis of data with one threshold. A biological process with weak regulation goes unnoticed.
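To make the single-threshold problem concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy scores, the two cutoffs (2.0 and 1.0), and the helper names `hypergeom_tail` and `overrep_pvalue`. It scores over-representation with a standard hypergeometric tail probability, which is one common choice and not necessarily the test used in the talk.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the gene set."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy differential expression scores (invented for illustration).
# Genes 0-9: the gene set, all weakly up-regulated (score 1.5).
# Genes 10-14: strongly regulated genes outside the set (score 3.0).
# Genes 15-99: unregulated background (score 0.0).
scores = [1.5] * 10 + [3.0] * 5 + [0.0] * 85
gene_set = set(range(10))

def overrep_pvalue(threshold):
    """Select genes above the threshold, then test the gene set for over-representation."""
    selected = {g for g, s in enumerate(scores) if s >= threshold}
    hits = len(selected & gene_set)
    return hits, hypergeom_tail(hits, len(gene_set), len(selected), len(scores))

strict_hits, strict_p = overrep_pvalue(2.0)  # strict cutoff: the weak set is invisible
loose_hits, loose_p = overrep_pvalue(1.0)    # looser cutoff: the whole set is recovered
print(strict_hits, strict_p)
print(loose_hits, loose_p)
```

With the strict cutoff none of the weakly regulated members is selected, so the gene set cannot show up as over-represented, even though every member moves in the same direction.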

Solution

• Analyze sets of genes instead of genes
• Gene Set: Genes belonging to the same pathway, biological process, complex and/or Gene Ontology class
• Benefits: A group of genes is less sensitive to error than a single gene*
• Benefits: Easy interpretation of the results
• Something to support the gene-based analysis

Gene set analysis pipeline

Inputs: Expression data, Sample labels, Class data (pre-defined gene sets)

Gene level:
• Generate the GE data
• Pre-processing (normalization etc.)
• Define continuous Diff. Expr. score for genes
• Generate permuted data

Gene set level:
• Calculate a gene set score for each gene set (real data)
• Calculate the gene set score for each gene set (permuted data)
• Look for gene sets that show stronger signal in real data than in permuted data
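The pipeline can be sketched end to end in a few lines of Python. This is a toy under stated assumptions, not the implementation behind the talk: the diff. expr. score is a plain difference of group means, the gene set score is a plain average (a placeholder for whichever scoring method is chosen), and all names (`diff_scores`, `set_score`, `empirical_pvalues`) and data are hypothetical.

```python
import random
from statistics import mean

def diff_scores(expr, labels):
    """Continuous diff. expr. score per gene: mean(group 1) - mean(group 0)."""
    scores = {}
    for gene, values in expr.items():
        g1 = [v for v, l in zip(values, labels) if l == 1]
        g0 = [v for v, l in zip(values, labels) if l == 0]
        scores[gene] = mean(g1) - mean(g0)
    return scores

def set_score(scores, members):
    """Placeholder gene set score: average diff. expr. score of the members."""
    return mean(scores[g] for g in members)

def empirical_pvalues(expr, labels, gene_sets, n_perm=200, seed=0):
    """Compare each gene set score in real data with scores from permuted sample labels."""
    rng = random.Random(seed)
    real = diff_scores(expr, labels)
    observed = {name: set_score(real, m) for name, m in gene_sets.items()}
    exceed = {name: 0 for name in gene_sets}
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)                  # permute the sample labels
        perm = diff_scores(expr, shuffled)     # recompute gene-level scores
        for name, m in gene_sets.items():
            if abs(set_score(perm, m)) >= abs(observed[name]):
                exceed[name] += 1
    # Add-one correction keeps the empirical p-value above zero.
    return {name: (exceed[name] + 1) / (n_perm + 1) for name in gene_sets}

labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {
    "up1":   [0, 0, 0, 0, 1, 1, 1, 1],
    "up2":   [0, 0, 0, 0, 2, 2, 2, 2],
    "flat1": [0, 1, 0, 1, 0, 1, 0, 1],
    "flat2": [1, 0, 1, 0, 1, 0, 1, 0],
}
gene_sets = {"regulated": {"up1", "up2"}, "unregulated": {"flat1", "flat2"}}
pvalues = empirical_pvalues(expr, labels, gene_sets)
print(pvalues)
```

The gene-level part (scores) and the gene set-level part (set scores, comparison with permuted data) stay separate, so the placeholder `set_score` can be swapped for any of the scoring methods discussed next.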

Methods for gene set scoring

• Average-based methods
• Rank-based methods
• Other methods (omitted here)

Average based methods

• Calculate the average regulation of gene set (Tian et al. PNAS)

• Can something go wrong with it?
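One way it can go wrong, shown as a small Python sketch with invented scores: when half of a gene set goes up and half goes down, the average cancels out. The function name `average_score` and the two toy gene sets are assumptions for illustration only.

```python
from statistics import mean

def average_score(scores, members):
    """Average-based gene set score: mean diff. expr. score of the members."""
    return mean(scores[g] for g in members)

# Coherent set: every member is up-regulated.
coherent = {"a": 2.0, "b": 1.8, "c": 2.2, "d": 2.0}
# Non-coherent set: equally strong regulation, but half up and half down.
mixed = {"a": 2.0, "b": -2.0, "c": 2.2, "d": -2.2}

coherent_avg = average_score(coherent, coherent.keys())
mixed_avg = average_score(mixed, mixed.keys())
print(coherent_avg, mixed_avg)
```

Both sets are strongly regulated, but only the coherent one gets a high average score.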

Rank based methods

• Steps:
– order genes by differential expression
– test every possible threshold in the ordered list
– look for over-(/under-)representation of the gene set above the threshold
– select the strongest score

• Expression values are (often) discarded!

• Iterative Group Analysis, Kolmogorov-Smirnov test (KS), modified KS (Gene Set Enrichment Analysis package, MIT)

[Figure: ordered gene expression data next to the analyzed gene classes, with the threshold and the analyzed subset marked. Black = class member, white = not a member.]
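The rank-based steps can be sketched as a threshold scan in Python. This is an illustration under assumptions, not the exact algorithm of iGA, KS or the GSEA package: over-representation at each threshold is measured with a hypergeometric tail probability, and the names `hypergeom_tail` and `best_threshold_score` as well as the toy data are hypothetical.

```python
from math import comb, log10

def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the gene set."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def best_threshold_score(scores, members):
    """Scan every threshold in the ordered list; return (best -log10 p, position)."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # order genes by diff. expr.
    N, K = len(ranked), len(members)
    best = (0.0, 0)
    hits = 0
    for n, gene in enumerate(ranked, start=1):             # threshold after the n-th gene
        if gene in members:
            hits += 1
        score = -log10(hypergeom_tail(hits, K, n, N))      # over-representation above it
        if score > best[0]:
            best = (score, n)                              # keep the strongest score
    return best

# Toy data: 20 genes with scores 20, 19, ..., 1; members sit at ranks 1, 2, 3 and 5.
scores = {f"g{i}": 21 - i for i in range(1, 21)}
members = {"g1", "g2", "g3", "g5"}
best_score, best_pos = best_threshold_score(scores, members)

# Only the order matters: a monotone rescaling of the scores gives the same result.
cubed = {g: s ** 3 for g, s in scores.items()}
same_score, same_pos = best_threshold_score(cubed, members)
print(best_score, best_pos)
```

The second call shows what "expression values are discarded" means in practice: cubing the scores changes every gap between genes but not the ranking, so the score is unchanged.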

Permutations

• Needed to evaluate significance
• Two types:
• Row Randomization
– mix labels gene set / gene class
• Column Randomization
– mix sample labels, used to calculate diff. expr.
• Column Randomization preferred

[Figure panels: Row rand., Col. rand.]
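The two randomization types differ in what gets shuffled. A minimal Python sketch, with hypothetical function names and toy inputs:

```python
import random

def row_randomization(gene_sets, all_genes, rng):
    """Mix gene set labels: give each gene set random genes, keeping its size."""
    genes = sorted(all_genes)
    return {name: set(rng.sample(genes, len(members)))
            for name, members in gene_sets.items()}

def column_randomization(labels, rng):
    """Mix sample labels; diff. expr. scores must then be recalculated from them."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(1)
gene_sets = {"setA": {"g1", "g2", "g3"}}
all_genes = {f"g{i}" for i in range(1, 11)}
random_sets = row_randomization(gene_sets, all_genes, rng)
shuffled_labels = column_randomization([0, 0, 0, 1, 1, 1], rng)
print(random_sets, shuffled_labels)
```

Column randomization leaves each gene's expression profile untouched, so correlations between genes are preserved in the permuted data. That is a commonly cited reason for preferring it, although the slide itself does not give one.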

Summary of methods

• Average-based methods are weak with non-coherent regulation

• Rank-based methods usually omit gene expression values => steps between all genes equally significant

My brilliant proposal

• Combine the two method groups:
– Order genes by diff. expr. score
– Test every threshold position
– At each threshold calculate a difference score (Diff)
• Scale the difference with STD and average estimates (Törönen et al. 2009)
• Get a Z-score scaling for the difference => Gene Set Z-score (GSZ)

My brilliant proposal

• An over-representation (hypergeometric) score weighted with diff. expr. score

• GSZ compares the Diff to the mean and STD we obtain when the class is randomly distr. in the ordered list.

• Considers both: variance in the expr. values and variance in the number of gene set members in the list
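A GSZ-like score can be sketched in Python, with two loud assumptions. First, the Diff formula is not preserved in this transcript; the sketch assumes Diff is the sum of scores of gene set members above the threshold minus the sum of scores of non-members above the threshold. Second, the mean and STD under random class placement are estimated here by Monte Carlo sampling, as a stand-in for the analytical estimates of Törönen et al. 2009. The names `diff_profile` and `gsz_like` and the toy data are hypothetical.

```python
import random
from statistics import mean, pstdev

def diff_profile(ranked_scores, is_member):
    """Assumed Diff at each threshold: sum(member scores) - sum(non-member scores)."""
    profile, total = [], 0.0
    for x, m in zip(ranked_scores, is_member):
        total += x if m else -x
        profile.append(total)
    return profile

def gsz_like(scores, members, n_rand=500, seed=0):
    """Z-score of Diff at every threshold against randomly placed gene sets."""
    rng = random.Random(seed)
    genes = sorted(scores, key=scores.get, reverse=True)   # order by diff. expr. score
    x = [scores[g] for g in genes]
    observed = diff_profile(x, [g in members for g in genes])
    null = []
    for _ in range(n_rand):                                # class randomly distributed
        fake = set(rng.sample(genes, len(members)))
        null.append(diff_profile(x, [g in fake for g in genes]))
    z = []
    for n, d in enumerate(observed):
        column = [p[n] for p in null]                      # null Diff values at threshold n
        sd = pstdev(column)
        z.append((d - mean(column)) / sd if sd > 0 else 0.0)
    return z

# Toy data: 30 genes with scores from 3.0 down to -2.8; the set is the top 5 genes.
scores = {f"g{i}": 3.0 - 0.2 * i for i in range(30)}
members = {f"g{i}" for i in range(5)}
z = gsz_like(scores, members)
print(max(z))
```

Because the null distribution comes from randomly placed sets of the same size scored on the same expression values, it reflects both sources of variance named above: the expression values and the number of members above each threshold.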

My brilliant proposal

• Many popular Gene Set scoring methods are variants of the GSZ method:
– hypergeometric testing
– Pearson correlation
– Max-Mean (Efron, Tibshirani)
– Random Sets (Newton et al.)

GSZ profile from ALL data (Chiaretti et al.) for one GO class vs. 7 quantiles (0, 5, 25, 50, 75, 95, 100) from 500 permutations. Different positions correspond to other competing methods.

Evaluation

• Stability of the scores as the threshold goes through the gene list?
• Red line: Strongest signal from positive data (across all GO classes)
• Blue lines: various quantiles (same as before) across all GO classes
• Compare with KS and modified KS (right column; MIT, PNAS and Nature Gen.)
• Same data, same permutation!!

GSZ with diff. parameter values. Third box shows default parameter values.

Pay attention to stability of blue lines.

More evaluation

• GSZ is also stable against gene set size variations
– most methods are not
• Several Gene Set scoring methods were tested with artificial positive and random datasets
– GSZ showed the best overall ability to separate the two dataset types
• Methods were evaluated by splitting the real data into two halves and testing how well the results match
– GSZ was best in predicting its own results from the other half
– GSZ was best in predicting the summary of all methods from the other half

More evaluation

• Compare different gene set scoring functions
• Test with two popular datasets against GO classes
• Calculate the empirical -log(p-values) for strongest GO classes from each method
• Blue line = GSZ, green line = T-test, red = KS, magenta = iGA, cyan = modified KS

[Figure panels: ALL data set, p53 data set; Pooled data, Class data]

More evaluation

• Select biologically relevant GO classes as biologically positive
• Look at how many such classes each method finds across the top ranks (GSZ = blue line)

Here the ALL dataset. GSZ outperforms others at bigger ranks. Similar results were obtained with the p53 dataset.

Comparison with other programs

• Selected SignalPathway (green line), GSEA (cyan) and GSA (black) for comparison

• Evaluation was done again using the biologically positive classes

• Comparing programs less clear (more variables)

Here again the ALL dataset. Similar results with p53. GSZ outperforms others at large

Summary

• GSZ, weighted over-representation score
• Math link to many other popular methods
• Stable across GO class sizes and across gene list positions
• Good performance in artificial datasets
• Best performance with many evaluations from two real datasets

Other applications

• siRNA data vs. gene IDs (discussed)
• Linkage data vs. biological processes (discussed)
• BLAST result list vs. descriptions (in usage)
• BLAST result list vs. GO classes (in usage)

Warnings

• Quality of gene expression data
• Enough samples for permutations
• Each gene should occur only once in the expression data
• Filter genes without annotations (with GO data)
• Use Column Permutations
• Quality of gene sets / annotations

Wake up!!