Bioinformatics 2 - Lecture 4

Gabriele Schweikert

University of Edinburgh

February 8, 2013


http://www.arthursclipart.org/medical/humanbody/page 01.html


XX-Seq

Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)


Gene regulation by transcription factor binding

Hobert, Science, 2008


Epigenomics

Marks, Nature Reviews Cancer, 2001


Introduction: ChIP-Seq

- Cross-linking

- DNA fragmentation

- Enrichment with specific antibody (ChIP)

- Profiling of enriched DNA (Seq)

[Figure labels: DNA-binding protein; DNA; individual sequencing read (tag); read (tag) density]

adapted from Kim and Park, 2011


ChIP-Seq analysis pipeline

Park, Nature Reviews Genetics, 2009


Differential profile analysis

compare binding profiles in different conditions/tissues

find regions which are significantly different between conditions A and B.


Two fundamentally different questions:

1 Is the level of enrichment at a given position different in two samples?

2 Can this difference be attributed to the difference in experimental conditions? I.e., are we confident that it is due to the experimental treatment and not due to fluctuations ("biological variation")?

→ We are more interested in answering the second question

→ Requires 'biological replicates'

→ We also need an input control ('non-ChIP genomic DNA') to account for sequencing bias


Pipeline: Differential profile analysis

1 quality control

2 alignment (BWA)

3 filtering (duplicates)

4 define regions of interest (peak calling)

5 strand shift correction

6 normalization

7 differential profile analysis


Strand shift

Park, Nature Reviews Genetics, 2009


Peak Calling

in general only a small fraction of the genome shows significant enrichment (binding)

discriminate true peaks in sequence coverage (protein binding sites) from the background

> 31 open source methods ('peak callers')

1 find overlapping extended reads

2 sliding window approaches

3 Gaussian kernel density estimators

4 look for bimodal peaks
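The sliding window approach (item 2) can be sketched as follows. This is a toy illustration, not any of the published peak callers: it assumes read start positions on a single chromosome and a uniform Poisson background, and the function names, window size, step and threshold are all hypothetical choices.

```python
import math
from collections import Counter

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def sliding_window_peaks(read_starts, genome_length, window=200, step=50, alpha=1e-5):
    """Return (start, end, count) for windows enriched over a uniform background."""
    # Expected reads per window if all reads were spread uniformly (assumption).
    lam = len(read_starts) * window / genome_length
    starts = Counter(read_starts)
    peaks = []
    for w_start in range(0, genome_length - window + 1, step):
        count = sum(starts[p] for p in range(w_start, w_start + window))
        if poisson_sf(count, lam) < alpha:
            peaks.append((w_start, w_start + window, count))
    return peaks

# Toy data: 100 evenly spread background reads plus 60 reads piled up in 100 bp.
background = list(range(0, 10_000, 100))
bound_site = [5_000 + (7 * i) % 100 for i in range(60)]
peaks = sliding_window_peaks(background + bound_site, genome_length=10_000)
```

In practice the background would be estimated from the input control rather than assumed uniform, and overlapping significant windows would be merged into regions.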


Peak Callers

Wilbanks and Facciotti, 2010


Peak Calling / sliding window

great differences in results

potentially use several peak callers

performance depends on type of peak

1 punctate peaks for most transcription factor binding sites

2 potentially, large extended peaks for histone modifications (e.g. H3K27me3)

alternatively use sliding windows for very extended regions

use fixed windows around annotated sites (e.g. +/- 2000 bp windows around transcription start sites for H3K4me3)

→ Output: a set of genomic regions
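The fixed-window option could look roughly like this: build +/- 2000 bp windows around a list of transcription start sites and count the reads falling into each. The TSS coordinates, function names and the half-open interval convention are assumptions made for the illustration.

```python
from bisect import bisect_left

def tss_windows(tss_positions, flank=2000, chrom_length=None):
    """Build half-open (start, end) windows of +/- flank bp around each TSS."""
    windows = []
    for tss in tss_positions:
        start = max(0, tss - flank)          # clip at the chromosome start
        end = tss + flank + 1                # +1 so that tss + flank is included
        if chrom_length is not None:
            end = min(end, chrom_length)     # clip at the chromosome end
        windows.append((start, end))
    return windows

def count_reads(windows, read_positions):
    """Count the reads whose position lies inside each half-open window."""
    reads = sorted(read_positions)
    return [bisect_left(reads, end) - bisect_left(reads, start)
            for start, end in windows]

# Hypothetical TSS coordinates and read positions on one chromosome.
windows = tss_windows([1000, 10000])
counts = count_reads(windows, [0, 500, 3000, 3001, 8000, 12000, 12001])
```

The result is one count per annotated site, i.e. the same "set of genomic regions" output that a peak caller would deliver.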


Strand shift correction

1 use cross correlation profiles to estimate fragment length

2 shift / extend reads on forward / reverse strand
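A minimal sketch of both steps, assuming we have the 5' end positions of forward and reverse reads on one chromosome. It scores each candidate shift by raw co-occurrence counts rather than a normalized correlation coefficient, and the function names and the maximum shift are hypothetical.

```python
from collections import Counter

def estimate_fragment_length(forward_starts, reverse_starts, max_shift=400):
    """Return the shift d that maximizes sum_p fwd[p] * rev[p + d]."""
    fwd = Counter(forward_starts)
    rev = Counter(reverse_starts)
    best_shift, best_score = 0, -1
    for d in range(max_shift + 1):
        score = sum(n * rev[p + d] for p, n in fwd.items())
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift

def shift_reads(forward_starts, reverse_starts, fragment_length):
    """Move each 5' end by half the fragment length toward the fragment centre."""
    half = fragment_length // 2
    return [p + half for p in forward_starts] + [p - half for p in reverse_starts]

# Toy data: every fragment is 150 bp long, so reverse 5' ends sit 150 bp downstream.
forward = [100, 105, 110, 500, 503]
reverse = [p + 150 for p in forward]
fragment_length = estimate_fragment_length(forward, reverse)
centred = shift_reads(forward, reverse, fragment_length)
```

After shifting, forward and reverse tags from the same binding event pile up at a common position instead of forming two offset peaks.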


Normalization

sequencing depth (number of clusters) varies between samples
→ normalization

if sample A has been sampled deeper than sample B, counts are expected to be higher in A

can we use total number of reads per sample (library size)?

only works if we assume that the total number of molecules in the sample is the same

differential regions with high counts distort the ratio of total reads.


Normalization

Robinson and Oshlack, Genome Biology, 2010


Normalization (simple example)

Condition 1

                                 Blue  Yellow  Green   Red  Total
molecules in sample               400     600    400   600   2000
fraction                          0.2     0.3    0.2   0.3
nb of reads (exp 1)               200     300    200   300   1000
nb of reads (exp 2)               100     150    100   150    500
after lib normalization (exp 2)   200     300    200   300   1000

Condition 2

                                 Blue  Yellow  Green   Red  Total
molecules in sample               400     600    400   400   1800
fraction                         0.22    0.33   0.22  0.22
nb of reads (exp 3)               222     333    222   222   1000
after lib normalization (exp 3)   222     333    222   222   1000

[Figure panels: Condition A, Condition B]
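The library-size normalization in this example can be reproduced with a few lines. The read counts are the ones from the table; the function name and the target total of 1000 reads are choices made for this sketch.

```python
def library_size_normalize(counts, target_total=1000):
    """Scale counts so that they sum to target_total."""
    total = sum(counts)
    return [c * target_total / total for c in counts]

colours = ["Blue", "Yellow", "Green", "Red"]
exp1 = [200, 300, 200, 300]   # condition 1, 1000 reads
exp2 = [100, 150, 100, 150]   # condition 1, sequenced half as deep
exp3 = [222, 333, 222, 222]   # condition 2, only Red has changed

norm2 = library_size_normalize(exp2)
norm3 = library_size_normalize(exp3)

for name, a, b, c in zip(colours, exp1, norm2, norm3):
    print(f"{name:6s} exp1={a:6.1f} exp2={b:6.1f} exp3={c:6.1f} exp3/exp1={c / a:.2f}")
```

Scaling by library size fully corrects exp 2, which differs from exp 1 only in depth. For exp 3 it does not: Blue, Yellow and Green have the same number of molecules in both conditions, yet they come out about 11% higher than in exp 1, because the drop in Red changes the total.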


Normalization (Anders and Huber, 2010)

for each gene, divide the counts from sample A by the counts from sample B

→ a per-gene estimate of the size ratio of sample A to sample B

use the median of all these ratios

what assumption do we make about samples A and B?

the majority of events do not change between sample A and sample B


Normalization

                                 Blue  Yellow  Green   Red
nb of reads (exp 1)               200     300    200   300
nb of reads (exp 3)               222     333    222   222
geometric mean                    210     316    210   258

Determine normalization factor:

                                 Blue  Yellow  Green   Red  median
exp 1 / geometric mean           0.95    0.95   0.95  1.16    0.95
exp 3 / geometric mean           1.05    1.05   1.05  0.86    1.05

Counts after normalization:

                                 Blue  Yellow  Green   Red
nb of reads (exp 1)               211     316    211   316
nb of reads (exp 3)               211     316    211   211
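The calculation above can be sketched as follows, using the counts from exp 1 and exp 3. It follows the median-of-ratios idea described on the slides; the function name and the handling of zero counts (such regions are simply skipped) are assumptions of this sketch, not a reproduction of the DESeq code.

```python
import math
from statistics import median

def size_factors(samples):
    """Median-of-ratios size factors; samples is a list of per-region count lists."""
    n_regions = len(samples[0])
    # Geometric mean across samples for every region (regions with a zero are skipped).
    geo_means = []
    for j in range(n_regions):
        values = [s[j] for s in samples]
        if all(v > 0 for v in values):
            geo_means.append(math.exp(sum(math.log(v) for v in values) / len(values)))
        else:
            geo_means.append(None)
    # Size factor of a sample: median of its ratios to the geometric means.
    factors = []
    for s in samples:
        ratios = [s[j] / g for j, g in enumerate(geo_means) if g is not None]
        factors.append(median(ratios))
    return factors

exp1 = [200, 300, 200, 300]   # Blue, Yellow, Green, Red
exp3 = [222, 333, 222, 222]
f1, f3 = size_factors([exp1, exp3])
norm1 = [c / f1 for c in exp1]
norm3 = [c / f3 for c in exp3]
```

Because the median ignores the single changing region (Red), the three unchanged regions end up with equal normalized counts in both experiments, and only Red remains different.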


Simulation: Biological Replicates


Simulation: add big changes (-) (at promoters)


Normalization check

Total counts: 1 : 0.76 : 1.12 : 0.88


Differential peak calling

Clouaire et al., 2012


Differential peak calling

Which peaks are statistically significantly different?

→ Problem is related to the detection of differentially expressed genes in RNA-Seq

→ Current approaches are mostly adapted from RNA-Seq based methods, e.g.

DESeq (Anders and Huber, 2010)

DBChIP (Liang and Keles, 2012)

DiffBind (Ross-Innes et al., 2012)

→ Peaks are represented by a single value: total counts

Gabriele Schweikert Bioinformatics 2 - Lecture 4 26

Page 55: Bioinformatics 2 - Lecture 4

Differential Peak Calling

Challenges with count data from NGS

small number of replicates (mind you: these are large experiments, we look at hundreds of thousands of binding sites; however, each binding site is only tested a few times, usually <3!)
→ no rank-based or permutation methods

large dynamic range ($0 \ldots 10^6$) between binding sites

distribution is discrete, positive, skewed
→ no (log-)normal model

Gabriele Schweikert Bioinformatics 2 - Lecture 4 27

Page 58: Bioinformatics 2 - Lecture 4

Differential Peak Calling

To decide whether a peak is significantly different under one condition vs. another, we need to estimate the variance

→ variance estimated from comparing two replicates

→ variance depends strongly on the mean

→ share information between peaks

Anders and Huber, 2010

Gabriele Schweikert Bioinformatics 2 - Lecture 4 28

Page 60: Bioinformatics 2 - Lecture 4

Differential Peak Calling

Whenever things are counted, which distribution comes to mind?

Poisson distribution

for Poisson-distributed data, the variance is equal to the mean

→ in NGS data we observe overdispersion (greater variability than expected from this simple model!)
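A quick way to see overdispersion is to compare the sample variance of replicate counts with their mean: for Poisson data the ratio should be close to 1. The sketch below is only illustrative; the function name `dispersion_index` and the replicate counts are hypothetical, and with so few replicates the ratio for a single peak is a very noisy estimate.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Return sample variance divided by sample mean (about 1 for Poisson data)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; the ratio is undefined")
    return variance(counts) / m

# Hypothetical read counts for one peak across four replicates (illustrative only).
poisson_like = [90, 110, 95, 105]
overdispersed = [60, 140, 85, 115]

print("Poisson-like :", dispersion_index(poisson_like))
print("Overdispersed:", dispersion_index(overdispersed))
```

Both lists have the same mean of 100, so the difference in the ratio comes entirely from the spread between replicates.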

Gabriele Schweikert Bioinformatics 2 - Lecture 4 29

Page 64: Bioinformatics 2 - Lecture 4

Types of noise

1 Shot noise

unavoidable
dominant for small peaks
can be computed

2 Technical noise

from sample preparation and sequencing

3 Biological noise

differences between samples of the same condition
dominant for high-count peaks
can't be computed, needs to be estimated

Gabriele Schweikert Bioinformatics 2 - Lecture 4 30

Page 65: Bioinformatics 2 - Lecture 4

The negative binomial distribution

Two-stage hierarchical process: Gamma distribution + Poisson

from Anders, BioC 2010
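The two-stage process can be simulated directly: draw a rate from a gamma distribution, then draw a count from a Poisson distribution with that rate. The sketch below uses only the Python standard library; the helper names, the mean `mu = 20` and the gamma shape `shape = 4` are assumptions chosen for illustration. For this mixture the variance should be close to `mu + mu**2 / shape`, which exceeds the Poisson variance `mu`.

```python
import math
import random
from statistics import mean, variance

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_gamma_poisson(mu, shape, rng):
    """Stage 1: gamma-distributed rate with mean mu. Stage 2: Poisson count at that rate."""
    lam = rng.gammavariate(shape, mu / shape)  # arguments are (shape, scale)
    return sample_poisson(lam, rng)

rng = random.Random(0)            # fixed seed so the run is repeatable
mu, shape, n = 20.0, 4.0, 20000   # illustrative values, not taken from the lecture

poisson_counts = [sample_poisson(mu, rng) for _ in range(n)]
nb_counts = [sample_gamma_poisson(mu, shape, rng) for _ in range(n)]

print("Poisson       : mean", mean(poisson_counts), "variance", variance(poisson_counts))
print("Gamma-Poisson : mean", mean(nb_counts), "variance", variance(nb_counts))
print("Expected gamma-Poisson variance:", mu + mu ** 2 / shape)
```

The gamma stage plays the role of sample-to-sample variation in the true binding intensity, and the Poisson stage plays the role of shot noise.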

Gabriele Schweikert Bioinformatics 2 - Lecture 4 31

Page 66: Bioinformatics 2 - Lecture 4

Testing

Model: The binding intensity (counts) for a given site in sample $j$ stems from a negative binomial distribution with mean $s_j \mu_\rho$ and variance $s_j^2 v(\mu_\rho)$

$s_j$: relative size of library $j$ (normalization factor)
$\mu_\rho$: mean value for condition $\rho$
$v(\mu_\rho)$: fitted variance for mean $\mu_\rho$

Null hypothesis: The intensity of binding is not influenced by the experimental condition $\rho$: $\mu_{\rho_1} = \mu_{\rho_2}$
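For reference, a negative binomial specified by its mean and variance can be rewritten in the common $(r, p)$ parametrization. The symbols $m$, $\sigma^2$, $r$ and $p$ are introduced here for illustration only and do not appear on the slide; the conversion requires the variance to exceed the mean.

```latex
\begin{align*}
m &= s_j \mu_\rho, \qquad \sigma^2 = s_j^2\, v(\mu_\rho), \\
p &= \frac{m}{\sigma^2}, \qquad r = \frac{m^2}{\sigma^2 - m} \quad (\sigma^2 > m), \\
\Pr(K = k) &= \frac{\Gamma(k + r)}{k!\,\Gamma(r)}\; p^{r} (1 - p)^{k}, \qquad k = 0, 1, 2, \dots
\end{align*}
```

With these definitions the distribution has mean $r(1-p)/p = m$ and variance $r(1-p)/p^2 = \sigma^2$, as required by the model.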

Gabriele Schweikert Bioinformatics 2 - Lecture 4 32

Page 68: Bioinformatics 2 - Lecture 4

Model fitting

Estimate variance from replicates

Fit a line to get the variance-mean dependence $v(\mu)$ (local regression for a gamma-family generalized linear model)

Anders and Huber, 2010
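The two steps can be sketched as follows. This is a deliberately simplified stand-in, not the DESeq procedure: instead of local regression with a gamma-family GLM it fits an ordinary least-squares line to log(variance) against log(mean), and it does not correct for shot noise. The function names, counts and size factors are hypothetical.

```python
import math
from statistics import mean, variance

def per_peak_moments(counts, size_factors):
    """Return (mean, variance) of size-factor-normalized replicate counts for each peak."""
    moments = []
    for row in counts:
        scaled = [k / s for k, s in zip(row, size_factors)]
        moments.append((mean(scaled), variance(scaled)))
    return moments

def fit_loglog_line(moments):
    """Least-squares fit of log(variance) = a + b * log(mean); returns v(mu) as a function."""
    pts = [(math.log(m), math.log(v)) for m, v in moments if m > 0 and v > 0]
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pts)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda mu: math.exp(intercept + slope * math.log(mu))

# Hypothetical counts: rows are peaks, columns are two replicates of one condition.
counts = [[12, 9], [48, 30], [150, 160], [900, 610], [2500, 1500]]
size_factors = [1.0, 0.8]  # hypothetical relative library sizes

v = fit_loglog_line(per_peak_moments(counts, size_factors))
print("fitted variance at mean 100:", v(100.0))
```

Because every peak contributes a point to the fit, the fitted $v(\mu)$ pools information across peaks, which is what makes a variance estimate from two or three replicates usable.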

Gabriele Schweikert Bioinformatics 2 - Lecture 4 33

Page 69: Bioinformatics 2 - Lecture 4

Model fitting

For conditions A and B, add counts from all replicates: $K_{iA}$, $K_{iB}$

Consider $K_{iA}$, $K_{iB}$ as NB-distributed with moments as estimated and fitted

Calculate the probability of observing the actual sums or more extreme ones, conditioned on A = B.

DESeq, Anders and Huber, 2010
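A minimal sketch of such a test for one binding site is given below. It conditions on the observed total and sums the probabilities of all splits of that total that are no more likely than the observed split. The means and variances of the two summed counts under the null hypothesis are taken as given inputs here (in practice they come from the estimation and fitting steps); the function names and example numbers are hypothetical.

```python
import math

def nb_pmf(k, mu, var):
    """Negative binomial probability of count k, parametrized by mean and variance (var > mu)."""
    p = mu / var
    r = mu * mu / (var - mu)
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

def nb_exact_test(k_a, k_b, mu_a, var_a, mu_b, var_b):
    """P-value: given the observed total, probability of a split no more likely than the observed one."""
    total = k_a + k_b
    probs = [nb_pmf(a, mu_a, var_a) * nb_pmf(total - a, mu_b, var_b)
             for a in range(total + 1)]
    observed = probs[k_a]
    # The small tolerance keeps ties with the observed split despite rounding.
    extreme = sum(pr for pr in probs if pr <= observed * (1 + 1e-9))
    return extreme / sum(probs)

# Hypothetical null moments: both conditions have mean 100 and variance 400.
print(nb_exact_test(110, 90, 100.0, 400.0, 100.0, 400.0))
print(nb_exact_test(160, 60, 100.0, 400.0, 100.0, 400.0))
```

The more lopsided split (160 vs 60) yields the smaller p-value, since fewer splits of the same total are as unlikely under equal means.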

Gabriele Schweikert Bioinformatics 2 - Lecture 4 34

Page 70: Bioinformatics 2 - Lecture 4

Correction for multiple testing

The false discovery rate (see lecture 1)

Defined as the expectation of the ratio of false positives (type I errors) to total positives (number of times the null is rejected)

Assume we are testing $m$ hypotheses; the Benjamini-Hochberg procedure for a given FDR $\alpha$ works as follows:

1 Rank p-values in increasing order;
2 Find the largest $k$ s.t. $p_k \le \frac{k}{m}\alpha$;
3 Reject all null hypotheses $1, \ldots, k$
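The three steps of the procedure can be sketched in a few lines of Python. The function name `benjamini_hochberg` and the example p-values are illustrative assumptions; the function returns one boolean per input hypothesis, in the original order.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected at FDR alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # step 1: rank by increasing p-value
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank  # step 2: remember the largest rank that satisfies the bound
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True  # step 3: reject the k_max smallest p-values
    return rejected

# Hypothetical p-values from ten tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, alpha=0.05))
```

Note that the procedure takes the largest qualifying rank, so a p-value that misses its own threshold is still rejected if a larger p-value further down the ranking meets its threshold.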

Gabriele Schweikert Bioinformatics 2 - Lecture 4 35

Page 71: Bioinformatics 2 - Lecture 4

DESeq results: MA plot

Anders and Huber, 2010

Gabriele Schweikert Bioinformatics 2 - Lecture 4 36

Page 72: Bioinformatics 2 - Lecture 4

Example: Differential oestrogen receptor binding

Gabriele Schweikert Bioinformatics 2 - Lecture 4 37

Page 73: Bioinformatics 2 - Lecture 4

Example: The colors of Chromatin

Filion et al, 2010

Gabriele Schweikert Bioinformatics 2 - Lecture 4 38

Page 74: Bioinformatics 2 - Lecture 4

Example: ENCODE

Gabriele Schweikert Bioinformatics 2 - Lecture 4 39