Bioinformatics 2 - Lecture 4

Gabriele Schweikert

University of Edinburgh

February 8, 2013

http://www.arthursclipart.org/medical/humanbody/page 01.html

XX -Seq

Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

Gene regulation by transcription factor binding

Hobert, Science, 2008

Epigenomics

Marks, Nature Reviews Cancer, 2001

Introduction: ChIP-Seq

- Cross-linking
- DNA fragmentation
- Enrichment with specific antibody (ChIP)
- Profiling of enriched DNA (Seq)

Figure labels: DNA-binding protein; individual sequencing read (tag); read (tag) density

adapted from Kim and Park, 2011

ChIP-Seq analysis pipeline

Park, Nature Reviews Genetics, 2009

Differential profile analysis

compare binding profiles in different conditions/tissues

find regions which are significantly different between condition A and B.

Two fundamentally different questions:

1 Is the level of enrichment at a given position different in two samples?

2 May this difference be attributed to the difference in experimental conditions? I.e., are we confident that it is due to the experimental treatment and not due to fluctuations ("biological variation")?

→ We are more interested in answering the second question
→ Requires 'biological replicates'
→ We also need input control: 'non-ChIP genomic DNA', to account for sequencing bias

Pipeline: Differential profile analysis

1 quality control

2 alignment (BWA)

3 filtering (duplicates)

4 define regions of interest (peak calling)

5 strand shift correction

6 normalization

7 differential profile analysis

Strand shift

Park, Nature Reviews Genetics, 2009

Peak Calling

in general only a small fraction of the genome shows significant enrichment (binding)

discriminate true peaks in sequence coverage (protein binding sites) from the background

> 31 open source methods ('peak callers')

1 find overlapping extended reads
2 sliding window approaches
3 Gaussian kernel density estimators
4 look for bimodal peaks

Peak Callers

Wilbanks and Facciotti, 2010

Peak Calling / sliding window

great differences in results

potentially use several peak callers

performance depends on type of peak
1 punctate peaks for most transcription factor binding sites
2 potentially, large extended peaks for histone modifications (e.g. H3K27me3)

alternatively use sliding windows for very extended regions

use fixed windows around annotated sites (e.g. +/- 2000 bp windows around transcription start sites for H3K4me3)

→ Output: a set of genomic regions
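The fixed-window alternative can be sketched in a few lines. This is a minimal illustration, assuming a single chromosome, reads reduced to one integer position each, and inclusive window boundaries; the function names `tss_windows` and `count_reads_in_windows` are hypothetical and not part of any peak caller.

```python
from bisect import bisect_left, bisect_right

def tss_windows(tss_positions, flank=2000):
    """Return (start, end) windows of +/- flank bp around each annotated site.

    Assumption: one chromosome, coordinates clipped at 0.
    """
    return [(max(0, tss - flank), tss + flank) for tss in tss_positions]

def count_reads_in_windows(read_positions, windows):
    """Count reads whose position falls inside each window (both ends inclusive)."""
    sorted_reads = sorted(read_positions)
    return [
        bisect_right(sorted_reads, end) - bisect_left(sorted_reads, start)
        for start, end in windows
    ]

# Hypothetical toy data: two annotated start sites and six read positions.
windows = tss_windows([1500, 10000])
counts = count_reads_in_windows([10, 3500, 3501, 8000, 9000, 12001], windows)
```

The result is one count per region, which is the form of input the later normalization and testing steps work on.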

Strand shift correction

1 use cross correlation profiles to estimate fragment length

2 shift / extend reads on forward / reverse strand
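The two steps above can be sketched as follows. This is a simplified illustration, assuming reads are given as 5' start positions on one chromosome and that the cross-correlation is a plain count of forward/reverse position pairs at each candidate shift; `estimate_strand_shift` and `shift_reads` are hypothetical helper names, and real tools typically work on binned, smoothed profiles.

```python
from collections import Counter

def estimate_strand_shift(forward_starts, reverse_starts, max_shift=500):
    """Return the shift d (in bp) that maximises the strand cross-correlation.

    The score for a shift d is the number of (forward, reverse) read pairs
    whose positions differ by exactly d.
    """
    fwd = Counter(forward_starts)
    rev = Counter(reverse_starts)

    def score(shift):
        return sum(n * rev[pos + shift] for pos, n in fwd.items())

    return max(range(max_shift + 1), key=score)

def shift_reads(forward_starts, reverse_starts, shift):
    """Move forward reads downstream and reverse reads upstream by shift // 2."""
    half = shift // 2
    return ([p + half for p in forward_starts],
            [p - half for p in reverse_starts])

# Hypothetical toy data: every forward read has a reverse partner 180 bp away,
# plus one unrelated reverse read.
forward = [100, 250, 400, 1000]
reverse = [p + 180 for p in forward] + [5000]
d = estimate_strand_shift(forward, reverse)
fwd_shifted, rev_shifted = shift_reads([100], [280], d)
```

After shifting, reads from both strands pile up at the same position, which is where the protein is assumed to have been bound.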

Normalization

sequencing depth (number of clusters) varies between samples → normalization

if sample A has been sampled deeper than sample B, counts are expected to be higher in A

can we use total number of reads per sample (library size)?

only works if we assume that the total number of molecules in the sample is the same

differential regions with high counts distort the ratio of total reads.

Normalization

Robinson and Oshlack, Genome Biology, 2010

Normalization (simple example)

Condition 1

                                 Blue  Yellow  Green   Red  Total
molecules in sample               400     600    400   600   2000
fraction                          0.2     0.3    0.2   0.3
nb of reads (exp 1)               200     300    200   300   1000
nb of reads (exp 2)               100     150    100   150    500
after lib normalization (exp 2)   200     300    200   300   1000

Condition 2

                                 Blue  Yellow  Green   Red  Total
fraction                         0.22    0.33   0.22  0.22
nb of reads (exp 3)               222     333    222   222   1000
after lib normalization (exp 3)   222     333    222   222   1000
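The numbers in the table can be reproduced with a short sketch of library-size normalization. The helper name `scale_to_library_size` is hypothetical; the read counts are the ones from the table.

```python
def scale_to_library_size(counts, target_total):
    """Rescale counts so that they sum to target_total (library-size normalization)."""
    total = sum(counts)
    return [c * target_total / total for c in counts]

# Read counts from the table, in the order Blue, Yellow, Green, Red.
exp1 = [200, 300, 200, 300]   # condition 1
exp2 = [100, 150, 100, 150]   # condition 1, sequenced at half the depth
exp3 = [222, 333, 222, 222]   # condition 2

exp2_scaled = scale_to_library_size(exp2, sum(exp1))
exp3_scaled = scale_to_library_size(exp3, sum(exp1))

# Apparent fold change condition 2 vs condition 1 after library normalization.
fold_change = [b / a for a, b in zip(exp1, exp3_scaled)]
```

Library-size scaling makes the two replicates of condition 1 agree exactly. Between conditions, however, Blue, Yellow and Green all appear about 11% higher and Red appears lower. If, as the fractions suggest, only Red actually changed, the apparent increase of the other three is an artefact of forcing the totals to match.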

Gabriele Schweikert (University of Edinburgh) Machine Learning for Genomic and Epigenomic Research 6

Normalization

[Figure: sample composition in Condition 1 (fractions 0.2, 0.3, 0.2, 0.3) and Condition 2 (fractions 0.22, 0.33, 0.22, 0.22); panels labelled Condition 1, Condition 2, Condition A and Condition B]

Normalization (Anders and Huber, 2010)

for each gene divide counts from sample A by the counts for sample B

per gene estimate for the size ratio of sample A to sample B

use median of all these ratios

what is the assumption we make about sample A and B?

the majority of events is not changing in sample A vs sample B

Normalization

                         Blue  Yellow  Green   Red
nb of reads (exp 1)       200     300    200   300
nb of reads (exp 3)       222     333    222   222
geometric mean            210     316    210   258

Determine normalization factor (counts divided by the geometric mean; the median over the four colours is the factor):

                         Blue  Yellow  Green   Red  median
exp 1 / geometric mean   0.95    0.95   0.95  1.16    0.95
exp 3 / geometric mean   1.05    1.05   1.05  0.86    1.05

Counts after normalization:
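The calculation can be sketched as below. This is a minimal illustration of the median-of-ratios idea using the counts from the table, assuming all counts are positive (a zero count would break the logarithm); `size_factors` is a hypothetical helper name, not the DESeq API.

```python
import math
from statistics import median

def size_factors(samples):
    """Median-of-ratios size factors.

    samples: list of count lists, one list per sample, same feature order.
    Each count is divided by the geometric mean of that feature across
    samples; the median of these ratios is the sample's size factor.
    """
    n_features = len(samples[0])
    geo_means = [
        math.exp(sum(math.log(s[i]) for s in samples) / len(samples))
        for i in range(n_features)
    ]
    return [
        median(s[i] / geo_means[i] for i in range(n_features))
        for s in samples
    ]

# Read counts from the table, in the order Blue, Yellow, Green, Red.
exp1 = [200, 300, 200, 300]
exp3 = [222, 333, 222, 222]

factor1, factor3 = size_factors([exp1, exp3])
exp1_norm = [c / factor1 for c in exp1]
exp3_norm = [c / factor3 for c in exp3]
```

With these factors the three unchanged colours get the same normalized count in both experiments, and only Red differs. The median is robust here because a single changed feature cannot move it.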

Normalization

Simulation: Biological Replicates

Simulation: add big changes (-) (at promoters)

Normalization check

Total counts: 1 : 0.76 : 1.12 : 0.88

Normalization check

Differential peak calling

Clouaire et al., 2012

Which peaks are statistically significantly different?

→ Problem is related to detection of differentially expressed genes in RNA-Seq

→ Current approaches mostly adopted from RNA-Seq based methods, e.g.
DESeq (Anders and Huber, 2010)
DBChIP (Liang and Keles, 2012)
DiffBind (Ross-Innes et al., 2012)

→ Peaks are represented by a single value: total counts

Differential Peak Calling

Challenges with count data from NGS

small number of replicates (mind you: these are large experiments, we look at hundreds of thousands of binding sites, however each binding site is only tested a few times, usually <3!)
→ no rank based or permutation methods

large dynamic range (0 ... $10^6$) between binding sites

distribution is discrete, positive, skewed
→ no (log-)normal model

to decide whether a peak is significantly different under one condition vs another we need to estimate the variance

→ variance estimated from comparing two replicates

→ variance depends strongly on the mean

→ share information between peaks

Anders and Huber, 2010

whenever things are counted, which distribution comes to mind?

Poisson distribution

for Poisson-distributed data, the variance is equal to the mean

→ in NGS data we observe overdispersion (greater variability than expected from this simple model!)
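The variance-equals-mean property, and a simple way to spot overdispersion, can be checked numerically. This is a small sketch; the replicate counts in the example are hypothetical, and `poisson_moments` and `dispersion_index` are illustrative helper names.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing k events at rate lam (log-space for stability)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_moments(lam):
    """Mean and variance of a Poisson distribution, summed directly from the pmf."""
    k_max = int(lam + 20 * math.sqrt(lam) + 20)  # far enough into the tail
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 is expected for Poisson data."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return var / mean

mean, var = poisson_moments(50.0)

# Hypothetical counts for one peak in two biological replicates.
index = dispersion_index([90, 130])
```

A ratio well above 1, as for the hypothetical replicate pair here, is what overdispersion looks like. With only two replicates a single ratio is very noisy, which is why information is shared between peaks.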

Types of noise

1 Shot noise
- unavoidable
- dominant for small peaks
- can be computed

2 Technical noise
- from sample preparation and sequencing

3 Biological noise
- differences between samples of the same condition
- dominant for high count peaks
- can't be computed, needs to be estimated

The negative binomial distribution

Two-stage hierarchical process: Gamma distribution + Poisson

from Anders, BioC 2010
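The two-stage process can be simulated directly. This is a minimal sketch, assuming the gamma stage is parameterised by a mean and a dispersion (so that the gamma variance is dispersion times mean squared) and using Knuth's multiplication method for the Poisson stage, which is only suitable for moderate rates; all function names are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson count (Knuth's method; adequate for moderate lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_gamma_poisson(mean, dispersion, rng):
    """Stage 1: draw the true rate from a gamma distribution (biological noise).
    Stage 2: draw the observed count from a Poisson at that rate (shot noise)."""
    shape = 1.0 / dispersion
    scale = mean * dispersion          # gamma mean = shape * scale = mean
    rate = rng.gammavariate(shape, scale)
    return sample_poisson(rate, rng)

def sample_moments(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v

rng = random.Random(0)
draws = [sample_gamma_poisson(20.0, 0.25, rng) for _ in range(20000)]
m, v = sample_moments(draws)
```

The resulting counts follow a negative binomial distribution with variance mean + dispersion × mean², so for mean 20 and dispersion 0.25 the variance is expected to be close to 120 rather than the Poisson value of 20.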

Testing

Model: The binding intensity (counts) for a given site in sample j stems from a negative binomial distribution with mean $s_j \mu_\rho$ and variance $s_j^2 v(\mu_\rho)$

$s_j$: relative size of library j (normalization factor)
$\mu_\rho$: mean value for condition ρ
$v(\mu_\rho)$: fitted variance for mean $\mu_\rho$

Null hypothesis: The intensity of binding is not influenced by the experimental condition ρ: $\mu_{\rho_1} = \mu_{\rho_2}$

Model fitting

Estimate variance from replicates

Fit a line to get the variance-mean dependence v(µ) (local regression for a gamma-family generalized linear model)

Model fitting

For condition A and B, add counts from all replicates: $K_{iA}$, $K_{iB}$

Consider $K_{iA}$, $K_{iB}$ as NB-distributed with moments as estimated and fitted

calculate the probability of observing the actual sums or more extreme ones, conditioned on A = B.

DESeq, Anders and Huber, 2010
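The test can be sketched as follows. This is an illustration only, not the DESeq implementation. It assumes the means and variances of the two sums under the null hypothesis are already given, that "more extreme" means "at most as probable as the observed pair among all pairs with the same total", and that each variance exceeds its mean. The function names and example numbers are hypothetical.

```python
import math

def nb_log_pmf(k, mean, var):
    """Log-probability of count k under a negative binomial with given mean and variance.

    Requires var > mean. Uses P(k) = Gamma(k+r) / (Gamma(r) k!) * p^r * (1-p)^k
    with p = mean / var and r = mean^2 / (var - mean).
    """
    p = mean / var
    r = mean * mean / (var - mean)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log1p(-p))

def nb_exact_test(k_a, k_b, mean_a, var_a, mean_b, var_b):
    """P-value: probability of all pairs (a, b) with a + b = k_a + k_b that are
    no more probable than the observed pair, relative to all such pairs."""
    total = k_a + k_b

    def joint(a):
        return math.exp(nb_log_pmf(a, mean_a, var_a)
                        + nb_log_pmf(total - a, mean_b, var_b))

    observed = joint(k_a)
    probs = [joint(a) for a in range(total + 1)]
    # Small relative tolerance so the observed pair itself is always included.
    extreme = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return extreme / sum(probs)

# Hypothetical sums for one site, with null moments mean 160 and variance 1440.
p_same = nb_exact_test(160, 160, 160.0, 1440.0, 160.0, 1440.0)
p_diff = nb_exact_test(300, 20, 160.0, 1440.0, 160.0, 1440.0)
```

Equal sums give a p-value of 1, while the strongly unbalanced pair gives a very small one. The resulting p-values, one per site, then need to be corrected for multiple testing.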

Correction for multiple testing

The false discovery rate (see lecture 1)

Defined as the expectation of the ratio of false positives (type I errors) to total positives (number of times the null is rejected)

Assume we are testing m hypotheses; the Benjamini-Hochberg procedure for a given FDR α works as follows:

1 Rank p-values in increasing order;
2 Find the largest k s.t. $p_{(k)} \le \frac{k}{m}\alpha$;
3 Reject all null hypotheses 1, ..., k
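The three steps can be written out as a short sketch; the p-values in the example are made up for illustration and `benjamini_hochberg` is a hypothetical helper name.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Step 1: rank p-values in increasing order (keep the original indices).
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 2: find the largest rank k with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Step 3: reject the hypotheses with the k smallest p-values.
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject

# Made-up p-values for five sites.
decisions = benjamini_hochberg([0.035, 0.001, 0.6, 0.028, 0.025], alpha=0.05)
```

Note that the p-value 0.025 is rejected even though it exceeds its own threshold (2/5 × 0.05 = 0.02): the procedure rejects everything up to the largest rank that passes, not only the ranks that pass individually.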

DESeq results: MA plot

Example: Differential oestrogen receptor binding

Example: The colors of Chromatin

Filion et al, 2010

Example: ENCODE
