Bioinformatics 2 - Lecture 4

Gabriele Schweikert

University of Edinburgh

February 8, 2013

Gabriele Schweikert Bioinformatics 2 - Lecture 4 1

http://www.arthursclipart.org/medical/humanbody/page 01.html

XX -Seq

Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

Gene regulation by transcription factor binding

Hobert, Science, 2008

Epigenomics

Marks, Nature Reviews Cancer, 2001

Introduction: ChIP-Seq

- Cross-linking

- DNA fragmentation

- Enrichment with specific antibody (ChIP)

- Profiling of enriched DNA (Seq)

Figure labels: DNA-binding protein, DNA, individual sequencing read (tag), read (tag) density

adapted from Kim and Park, 2011

ChIP-Seq analysis pipeline

Park, Nature Reviews Genetics, 2009

Differential profile analysis

compare binding profiles in different conditions/tissues

find regions which are significantly different between conditions A and B.

Two fundamentally different questions:

1 Is the level of enrichment at a given position different in two samples?

2 Can this difference be attributed to the difference in experimental conditions? That is, are we confident that it is due to the experimental treatment and not to fluctuations ("biological variation")?

→ We are more interested in answering the second question

→ This requires biological replicates

→ We also need an input control (non-ChIP genomic DNA) to account for sequencing bias

Pipeline: Differential profile analysis

1 quality control

2 alignment (BWA)

3 filtering (duplicates)

4 define regions of interest (peak calling)

5 strand shift correction

6 normalization

7 differential profile analysis

Strand shift

Park, Nature Reviews Genetics, 2009

Peak Calling

in general, only a small fraction of the genome shows significant enrichment (binding)

discriminate true peaks in sequence coverage (protein binding sites) from the background

more than 31 open source methods ("peak callers"), based on approaches such as:

1 find overlapping extended reads
2 sliding window approaches
3 Gaussian kernel density estimators
4 look for bimodal peaks

Peak Callers

Wilbanks and Facciotti, 2010
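
One of the strategies listed under Peak Calling, the window-based scan, can be sketched roughly as follows. This is a simplified illustration and not the algorithm of any particular peak caller: it uses non-overlapping windows (a true sliding window would use a smaller step), a uniform Poisson background, and hypothetical names and defaults (`call_windows`, `window=200`, `alpha=1e-3`).

```python
import math
from collections import Counter

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by direct summation."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
              for i in range(k))
    return max(0.0, 1.0 - cdf)

def call_windows(read_starts, genome_length, window=200, alpha=1e-3):
    """Return (start, end, count) for windows enriched over a uniform background."""
    n_windows = math.ceil(genome_length / window)
    counts = Counter(pos // window for pos in read_starts)
    # expected reads per window if reads were spread uniformly (assumption)
    lam = len(read_starts) / n_windows
    peaks = []
    for w in sorted(counts):
        if poisson_sf(counts[w], lam) < alpha:
            start = w * window
            peaks.append((start, min(start + window, genome_length), counts[w]))
    return peaks

# toy data: sparse background plus a pile-up of reads near position 5000
reads = list(range(0, 10000, 250)) + [5000 + i for i in range(0, 100, 2)]
peaks = call_windows(reads, genome_length=10000)
print(peaks)
```

In practice the background would be estimated from the input control rather than assumed uniform.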

Peak Calling / sliding window

peak callers differ greatly in their results

potentially use several peak callers

performance depends on the type of peak:

1 punctate peaks for most transcription factor binding sites
2 potentially large, extended peaks for histone modifications (e.g. H3K27me3)

alternatively, use sliding windows for very extended regions

or use fixed windows around annotated sites (e.g. +/- 2000 bp windows around transcription start sites for H3K4me3)

→ Output: a set of genomic regions
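
The fixed-window option can be illustrated with a short sketch. It assumes transcription start sites are given as (chromosome, 0-based position) pairs and regions are half-open intervals; the function names `tss_windows` and `count_reads` are hypothetical and chosen only for this example.

```python
def tss_windows(tss_list, chrom_sizes, flank=2000):
    """Build +/- flank bp windows around TSSs, clipped to chromosome bounds.

    Returns half-open intervals (chrom, start, end).
    """
    regions = []
    for chrom, pos in tss_list:
        start = max(0, pos - flank)
        end = min(chrom_sizes[chrom], pos + flank + 1)
        regions.append((chrom, start, end))
    return regions

def count_reads(regions, reads):
    """Count read positions falling into each region (naive scan, fine for toy data)."""
    return [sum(1 for c, p in reads if c == chrom and start <= p < end)
            for chrom, start, end in regions]

regions = tss_windows([("chr1", 1500), ("chr1", 10000)], {"chr1": 11000})
counts = count_reads(
    regions,
    [("chr1", 100), ("chr1", 9000), ("chr1", 10999), ("chr2", 10000)],
)
print(regions, counts)
```

The resulting list of regions plays the same role as a peak caller's output: one count per region and sample is passed on to normalization and testing.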

Strand shift correction

1 use cross correlation profiles to estimate fragment length

2 shift / extend reads on forward / reverse strand
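
The two steps above can be sketched as follows, under simplifying assumptions: reads are represented only by their 5' positions, the cross-correlation is unnormalised (a plain count of forward/reverse position pairs at each lag), and the names `estimate_shift` and `shift_reads` are hypothetical.

```python
from collections import Counter

def estimate_shift(fwd_starts, rev_starts, max_shift=400):
    """Return the lag d that maximises the forward/reverse strand cross-correlation."""
    fwd = Counter(fwd_starts)
    rev = Counter(rev_starts)
    best_d, best_score = 0, -1
    for d in range(max_shift + 1):
        # number of (forward, reverse) read pairs exactly d bp apart
        score = sum(n * rev.get(p + d, 0) for p, n in fwd.items())
        if score > best_score:
            best_d, best_score = d, score
    return best_d

def shift_reads(fwd_starts, rev_starts, d):
    """Move forward reads by +d/2 and reverse reads by -d/2 towards the fragment centre."""
    half = d // 2
    return sorted([p + half for p in fwd_starts] + [p - half for p in rev_starts])

# toy data: two binding sites, fragments of 150 bp
fwd = [925, 930, 920, 2925, 2928]
rev = [p + 150 for p in fwd]
d = estimate_shift(fwd, rev)
shifted = shift_reads(fwd, rev, d)
print(d, shifted)
```

After shifting, the forward and reverse tags of each site pile up at the same positions, so the estimated lag serves as a proxy for the mean fragment length.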

Normalization

sequencing depth (number of clusters) varies between samples → normalization is needed

if sample A has been sequenced more deeply than sample B, counts are expected to be higher in A

can we use the total number of reads per sample (library size)?

this only works if we assume that the total number of molecules in the sample is the same

differential regions with high counts distort the ratio of total reads

Normalization

Robinson and Oshlack, Genome Biology, 2010

Normalization (simple example)

Condition 1 (Condition A)

                                  Blue  Yellow  Green   Red  Total
molecules in sample                400     600    400   600   2000
fraction                           0.2     0.3    0.2   0.3
nb of reads (exp 1)                200     300    200   300   1000
nb of reads (exp 2)                100     150    100   150    500
after lib normalization (exp 2)    200     300    200   300   1000

Condition 2 (Condition B)

                                  Blue  Yellow  Green   Red  Total
molecules in sample                400     600    400   400   1800
fraction                          0.22    0.33   0.22  0.22
nb of reads (exp 3)                222     333    222   222   1000
after lib normalization (exp 3)    222     333    222   222   1000
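
The arithmetic of this example can be replayed in a few lines. The sketch below uses the read counts from the tables; the helper name `library_size_normalize` is hypothetical. It shows that scaling by library size repairs the depth difference between exp 1 and exp 2, but makes the unchanged regions look about 11% higher in exp 3, because only Red lost molecules.

```python
def library_size_normalize(counts, target_total):
    """Scale counts so that they sum to target_total."""
    total = sum(counts)
    return [c * target_total / total for c in counts]

regions = ["Blue", "Yellow", "Green", "Red"]
exp1 = [200, 300, 200, 300]  # condition 1
exp2 = [100, 150, 100, 150]  # condition 1, half the sequencing depth
exp3 = [222, 333, 222, 222]  # condition 2, only Red has fewer molecules

exp2_norm = library_size_normalize(exp2, sum(exp1))
exp3_norm = library_size_normalize(exp3, sum(exp1))

# ratio condition 2 / condition 1 after library-size normalization
ratios = {r: b / a for r, a, b in zip(regions, exp1, exp3_norm)}
for region, ratio in ratios.items():
    print(f"{region}: {ratio:.2f}")
```

The true molecule ratios are 1.0 for Blue, Yellow and Green and about 0.67 for Red, so every region is biased upwards by this normalization.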

Normalization (Anders and Huber, 2010)

for each gene, divide the counts from sample A by the counts for sample B

this gives a per-gene estimate of the size ratio of sample A to sample B

use the median of all these ratios

what assumption do we make about samples A and B?

the majority of events do not change between sample A and sample B

Normalization

                       Blue  Yellow  Green   Red
nb of reads (exp 1)     200     300    200   300
nb of reads (exp 3)     222     333    222   222
geometric mean          210     316    210   258

Determine normalization factor (each count divided by the geometric mean of its column; the median per experiment is the factor):

                       Blue  Yellow  Green   Red  median
nb of reads (exp 1)    0.95    0.95   0.95  1.16    0.95
nb of reads (exp 3)    1.05    1.05   1.05  0.86    1.05

Counts after normalization:

                       Blue  Yellow  Green   Red
nb of reads (exp 1)     211     316    211   316
nb of reads (exp 3)     211     316    211   211
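
The median-of-ratios calculation in these tables can be reproduced with the sketch below. It is a minimal illustration of the idea, not the DESeq implementation; `size_factors` is a hypothetical name, and skipping regions that contain a zero count is an assumption made to keep the logarithm defined.

```python
import math
from statistics import median

def size_factors(samples):
    """samples: list of equal-length count lists, one per sample.

    Returns one normalization factor per sample.
    """
    n_regions = len(samples[0])
    geo = []
    for i in range(n_regions):
        vals = [s[i] for s in samples]
        if all(v > 0 for v in vals):
            # geometric mean of this region across samples
            geo.append(math.exp(sum(math.log(v) for v in vals) / len(vals)))
        else:
            geo.append(None)  # region skipped (assumption)
    return [median(s[i] / g for i, g in enumerate(geo) if g is not None)
            for s in samples]

exp1 = [200, 300, 200, 300]
exp3 = [222, 333, 222, 222]
f1, f3 = size_factors([exp1, exp3])
norm1 = [round(c / f1) for c in exp1]
norm3 = [round(c / f3) for c in exp3]
print(round(f1, 2), round(f3, 2), norm1, norm3)
```

Because the median ignores the one changing region (Red), the three unchanged regions end up with equal normalized counts in both experiments.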

Simulation: Biological Replicates

Simulation: add big changes (-) (at promoters)

Normalization check

Total counts: 1 : 0.76 : 1.12 : 0.88

Differential peak calling

Clouaire et al., 2012

Differential peak calling

Which peaks are statistically significantly different?

→ The problem is related to the detection of differentially expressed genes in RNA-Seq

→ Current approaches are mostly adopted from RNA-Seq based methods, e.g.

DESeq (Anders and Huber, 2010)
DBChIP (Liang and Keles, 2012)
DiffBind (Ross-Innes et al., 2012)

→ Peaks are represented by a single value: total counts

Differential Peak Calling

Challenges with count data from NGS

small number of replicates (mind you: these are large experiments, we look at hundreds of thousands of binding sites; however, each binding site is only tested a few times, usually < 3!)
→ no rank-based or permutation methods

large dynamic range (0...10^6) between binding sites

distribution is discrete, positive, skewed
→ no (log-)normal model

Differential Peak Calling

to decide whether a peak is significantly different under one condition vs. another, we need to estimate the variance

→ variance estimated from comparing two replicates

→ variance depends strongly on the mean

→ share information between peaks

Anders and Huber, 2010

Differential Peak Calling

whenever things are counted, which distribution comes to mind?

Poisson distribution

for Poisson-distributed data, the variance is equal to the mean

→ in NGS data we observe overdispersion (greater variability than expected from this simple model!)

Types of noise

1 Shot noise

- unavoidable
- dominant for small peaks
- can be computed

2 Technical noise

- from sample preparation and sequencing

3 Biological noise

- differences between samples of the same condition
- dominant for high-count peaks
- can't be computed, needs to be estimated

The negative binomial distribution

Two-stage hierarchical process: Gamma distribution + Poisson

from Anders, BioC 2010
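
The two-stage process can be simulated to see the overdispersion directly. The sketch below is an illustration under assumed parameter values (mean 50, dispersion 0.1): a site-specific rate is drawn from a Gamma distribution and the count is then drawn from a Poisson with that rate. The sampler `poisson_sample` uses Knuth's multiplication method, and all function names are hypothetical.

```python
import math
import random
from statistics import mean, variance

def poisson_sample(lam, rng):
    """Draw from Poisson(lam) with Knuth's method; adequate for moderate lam."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def negative_binomial_sample(mu, dispersion, rng):
    """Gamma-Poisson mixture: rate ~ Gamma(mean mu, variance dispersion * mu^2)."""
    shape = 1.0 / dispersion
    rate = rng.gammavariate(shape, mu * dispersion)  # arguments: shape, scale
    return poisson_sample(rate, rng)

rng = random.Random(0)
mu, dispersion = 50.0, 0.1  # assumed values for illustration
pois = [poisson_sample(mu, rng) for _ in range(20000)]
nb = [negative_binomial_sample(mu, dispersion, rng) for _ in range(20000)]

print("Poisson:  mean %.1f, variance %.1f" % (mean(pois), variance(pois)))
print("Gamma-Poisson: mean %.1f, variance %.1f" % (mean(nb), variance(nb)))
```

With these parameters the mixture has expected variance mu + dispersion * mu^2 = 300, against 50 for the plain Poisson: the first term corresponds to shot noise and the second to the extra variability between samples.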

Testing

Model: The binding intensity (counts) for a given site in sample $j$ stems from a negative binomial distribution with mean $s_j \mu_\rho$ and variance $s_j^2 v(\mu_\rho)$

$s_j$: relative size of library $j$ (normalization factor)
$\mu_\rho$: mean value for condition $\rho$
$v(\mu_\rho)$: fitted variance for mean $\mu_\rho$

Null hypothesis: The intensity of binding is not influenced by the experimental condition $\rho$: $\mu_{\rho_1} = \mu_{\rho_2}$

Model fitting

Estimate variance from replicates

Fit a line to get the variance-mean dependence v(µ) (local regression for a gamma-family generalized linear model)

Anders and Huber, 2010
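
To make the idea of a variance-mean dependence concrete, here is a deliberately simplified sketch. It does not perform the local regression named above; instead it swaps in a one-parameter curve v(µ) = µ + α µ², fitted by least squares to per-site means and variances from replicates. The function name `fit_dispersion` and the toy data are assumptions for illustration.

```python
import math
from statistics import mean, variance

def fit_dispersion(replicate_counts):
    """Fit v(mu) = mu + alpha * mu^2 to per-site replicate counts.

    replicate_counts: list of tuples, each holding the normalized counts
    of one site across replicates. Returns the least-squares alpha.
    """
    num = den = 0.0
    for counts in replicate_counts:
        m, v = mean(counts), variance(counts)
        num += (v - m) * m ** 2  # excess over Poisson variance, weighted by mu^2
        den += m ** 4
    return max(0.0, num / den)

def fitted_variance(mu, alpha):
    """Variance predicted for a site with mean mu."""
    return mu + alpha * mu ** 2

# toy data: two replicates per site, built so that variance = mu + 0.1 * mu^2
sites = []
for m in (10.0, 100.0, 1000.0):
    half_gap = math.sqrt((m + 0.1 * m ** 2) / 2)
    sites.append((m - half_gap, m + half_gap))

alpha = fit_dispersion(sites)
print(alpha, fitted_variance(100.0, alpha))
```

Pooling all sites into one fit is what "sharing information between peaks" amounts to: a single site with two or three replicates gives a very noisy variance estimate, whereas the fitted curve is supported by every site.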

Model fitting

For conditions A and B, add the counts from all replicates: $K_{iA}$, $K_{iB}$

Consider $K_{iA}$, $K_{iB}$ as NB-distributed with moments as estimated and fitted

Calculate the probability of observing the actual sums or more extreme ones, conditioned on A = B.

DESeq, Anders and Huber, 2010
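
A test of this kind can be sketched as follows. This is an illustration under stated assumptions, not the DESeq code: the means and variances of the two summed counts under the null are passed in directly, the p-value is computed conditional on the observed total, and "more extreme" is taken to mean "no more probable than the observed split". The names `nb_logpmf` and `nb_exact_test` are hypothetical.

```python
import math

def nb_logpmf(k, mu, var):
    """Log-pmf of a negative binomial with given mean and variance (needs var > mu)."""
    p = mu / var
    r = mu * mu / (var - mu)
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log1p(-p))

def nb_exact_test(k_a, k_b, mu_a, var_a, mu_b, var_b):
    """P-value for the split (k_a, k_b) given the total k_a + k_b."""
    total = k_a + k_b
    logp = [nb_logpmf(a, mu_a, var_a) + nb_logpmf(total - a, mu_b, var_b)
            for a in range(total + 1)]
    shift = max(logp)  # subtract the maximum to avoid underflow
    probs = [math.exp(lp - shift) for lp in logp]
    observed = probs[k_a]
    # small tolerance so that ties with the observed split are included
    extreme = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return extreme / sum(probs)

p_same = nb_exact_test(100, 100, mu_a=100, var_a=600, mu_b=100, var_b=600)
p_diff = nb_exact_test(50, 200, mu_a=125, var_a=900, mu_b=125, var_b=900)
print(p_same, p_diff)
```

In a full analysis the means would come from the pooled, size-factor-scaled counts and the variances from the fitted v(µ); here they are simply supplied as example numbers.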

Correction for multiple testing

The false discovery rate (see lecture 1)

Defined as the expectation of the ratio of false positives (type I errors) to total positives (number of times the null is rejected)

Assume we are testing m hypotheses; the Benjamini-Hochberg procedure for a given FDR α works as follows:

1 Rank the p-values in increasing order;
2 Find the largest k such that $p_k \le \frac{k}{m}\alpha$;
3 Reject all null hypotheses 1, ..., k
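
The three steps translate almost line by line into code. The sketch below is a plain-Python illustration; the function name `benjamini_hochberg` and the example p-values are made up for this purpose.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans in input order: True where the null is rejected."""
    m = len(pvalues)
    # step 1: rank p-values in increasing order
    order = sorted(range(m), key=lambda i: pvalues[i])
    # step 2: find the largest rank k with p_k <= k / m * alpha
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k = rank
    # step 3: reject the k hypotheses with the smallest p-values
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected

rejected = benjamini_hochberg([0.04, 0.001, 0.9, 0.02], alpha=0.05)
print(rejected)
```

Note that step 2 looks for the largest qualifying rank, so a p-value that misses its own threshold is still rejected if a larger p-value further down the ranking meets its threshold.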

DESeq results: MA plot

Anders and Huber, 2010

Example: Differential oestrogen receptor binding

Example: The colors of Chromatin

Filion et al, 2010

Example: ENCODE
