introduc)on*to*chroman*ip*– sequencing( chip7seq ... · cross7correlaon*plots* −500 0 500 1000...

78
Introduc)on to Chroma)n IP – sequencing (ChIPseq) data analysis Linköping, 21 April 2016 Agata Smialowska BILS / NBIS, SciLifeLab, Stockholm University Introduc)on to Bioinforma)cs using NGS data

Upload: others

Post on 09-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Introduc)on  to  Chroma)n  IP  –  sequencing  (ChIP-­‐seq)  data  analysis  

Linköping,  21  April  2016    

Agata  Smialowska  BILS  /  NBIS,  SciLifeLab,  Stockholm  University  

Introduc)on  to  Bioinforma)cs  using  NGS  data    

Page 2: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Chroma)n  state  and  gene  expression  PEV  Posi)on  effect  variega)on  in  Drosophila  eye  (nature.com)  

Juxtaposi)on  of  eye  colour  genes  with  heterochroma)n  results  in  the  “moWled”  eye  coloura)on  (red  and  white).    

Proteins,  which  bind  heterochroma)n,  act  to  “spread”  the  silencing  signal  by  providing  a  forward  feedback  loop.    

Heterochroma)n  Protein  1;  Histone  methyltransferase  Su(var)3-­‐9;  H3K9  methyla)on  

First  observed  by  H.  Muller  1930  

Page 3: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Chroma)n  immunoprecipita)on  

RnDsystems  

Page 4: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Applica)ons  General  transcrip)on  machinery  

Page 5: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Applica)ons  

Promoter-­‐associated  transcrip)on  factors  

Page 6: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Applica)ons  

Distal  enhancers  

Page 7: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Applica)ons  

Histone  modifica)ons  and  variants   Ac)va)on  states  

Co-­‐factors  

Page 8: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

design  study    obtain  input  chroma)n    perform  precipita)on    construct  library    sequence  library    bioinforma1c  analysis  

Workflow  of  a  ChIP-­‐seq  study  

Page 9: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

ChIP-­‐seq  workflow  

Liu,  PoW  and  Huss,  BMC  Biology  2010  

Page 10: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Cri)cal  factors  

•  An)body  selec)on  •  Library  cloning  and  sequencing  •  Algorithm  for  peak  detec)on  •  Proper  control  sample  (input  chroma)n  or  mock  IP)  

•  Reproducibility  in  chroma)n  fragmenta)on  •  Cross-­‐linker  choice  •  Enough  material  and  biological  replicates  

Page 11: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Experiment  design  

•  Sound  experimental  design:  replica)on,  randomisa)on  and  blocking  (R.A.  Fisher,  1935)  

•  In  the  absence  of  a  proper  design,  it  is  essen)ally  impossible  to  par))on  biological  varia)on  from  technical  varia)on  

•  Sequencing  depth:  depends  on  the  structure  of  the  signal;  cannot  be  linearly  scaled  to  genome  size  

•  Single-­‐  vs.  paired-­‐end  reads:  PE  improves  read  mapping  confidence  and  gives  a  direct  measure  of  fragment  size,  which  otherwise  has  to  be  modelled  or  es)mated  

Page 12: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Ideal  design:    Each  sample  has  a  matched  input  Input  sequenced  to  a  comparable  depth    as  IP  sample    ≥2  biological  replicates  for  site  iden)fica)on  ≥3  biological  replicates  for  differen)al  binding  

input   library/sequencing  

X  ChIP  

replicates  

input   library/sequencing  ChIP  

replicates  

✓  input   library/sequencing  

ChIP  replicates  

under-­‐sequenced  input  

ChIP  

well-­‐sequenced  input  

ChIP  

X  

Experiment  design  

Page 13: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Importance  of  biological  replicates  

sample   technical  replicates  are  generally  a  waste  of  )me  and  money    

libraries   sequencing  

X  replicates   libraries   sequencing  

origin   many  studies  do  not  account  for  batch  effects  i.  )me  ii.  origin  

so  if  you  care  about  reproducibility  

samples  

experiment  

✓  )me  -­‐-­‐-­‐-­‐-­‐-­‐-­‐>  

experiment1   experiment2   Experiment3…   libraries,  sequencing,  etc  

X  

Page 14: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

pooled  data  

under-­‐sequenced  data  

X  

if  you  need  to  pool  your  data,  then  it  is  under-­‐sequenced  

pooled  data   actual  replicates  

Importance  of  sequencing  depth  

Page 15: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Sequencing  depth  depends  on  data  type  

TF:  20  M  

point-­‐source   mixed  signal   broad  signal  

No  clear  guidelines  for  mixed  and  broad  type  of  peaks  

Transcrip)on  Factors  

Chroma)n    Remodellers    

Histone  marks  

Chroma)n    Remodellers    

Histone  marks    

RNA  polymerase  II  

Human:   ?   ?  

H3K4me3:  25  M   H3K36me3:  35  M   H3K27me3:  40  M  

H3K9me3:  >55    M  

Source:  The  ENCODE  consor)um;    Jung  et  al,  NAR  2014  

Page 16: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

The  ENCODE  (Encyclopedia  of  DNA  Elements)  Consor)um  and  the  Roadmap  Epigenomics  Consor)um  are  a  vast  resource  of  various  kinds  of  func)onal  genomics  data  (as  well  as  RNA-­‐seq  data).  

 

Page 17: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

•  ChIP  –  sequencing:  introduc)on  from  a  bioinforma)cs  point  of  view  

 •  Principles  of  analysis  of  ChIP-­‐seq  data  

•  ChIP-­‐seq:  downstream  analyses  

•  Resources  

•  Exercise  overview  

Page 18: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

•  ChIP  –  sequencing:  introduc)on  from  a  bioinforma)cs  point  of  view  

 •  Principles  of  analysis  of  ChIP-­‐seq  data  

•  ChIP-­‐seq:  downstream  analyses  

•  Resources  

•  Exercise  overview  

Page 19: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Chroma)n  =  DNA  +  proteins  

Park,  Nature  Rev  Gene)cs,  2009    

Page 20: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Data  analysis  

Page 21: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Profile  of  protein  binding  sites  vs.  input  

Park,  Nature  Rev  Gene)cs,  2009    

Chromator  (Drosophila)  –  protein  binding  methylated  histones  

Page 22: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

design  study    obtain  input  chroma)n    perform  precipita)on    construct  library    sequence  library    library  quality  control  filter  sequences    align  sequences    filter  alignments    iden1fy  peaks  /  regions  of  enrichment    assess  data  quality    understand  the  data  /  results    downstream  analyses  

Workflow  of  a  ChIP-­‐seq  study  

Itera)ve  process  

Page 23: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

•  ChIP  –  sequencing:  introduc)on  from  a  bioinforma)cs  point  of  view  

 •  Principles  of  analysis  of  ChIP-­‐seq  data  

•  ChIP-­‐seq:  downstream  analyses  

•  Resources  

•  Exercise  overview  

Page 24: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Two  ques)ons  to  address  

•  1.  Did  the  ChIP  part  of  the  ChIP-­‐seq  experiment  work?  Was  the  enrichment  successful?  

•  2.  Where  are  the  binding  sites  (of  the  protein  of  interest)?  

Page 25: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Word  of  cau)on!  

ChIP-­‐seq  experiments  are  more  unpredictable  than  RNA-­‐seq!  Error  sources:    chroma)n  structure    PCR  over-­‐amplifica)on    non-­‐specific  an)body    other  things?  

Page 26: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

ChIP-­‐seq  QC:  did  the  ChIP  work?  

•  1.  Inspect  the  signal  (mapped  reads,  coverage  profiles)  in  genome  browser  

•  2.  Compute  peak-­‐independent  quality  metrics  (cross  correla)on,  cumula)ve  enrichment)  

•  3.  Assess  replicate  consistency  (correla)ons  between  replicates  of  the  same  condi)on)  

Page 27: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

tag  density  distribu)on  reproducibility  similarity  of  coverage  signal  at  known  sites  …  Sposng  inconsistencies  Confounding  factors  Under-­‐sequenced  libraries  …  

Page 28: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

How  do  I  know  my  data  is  of  good  quality?  

Marinov  et  al,  G3  2013    

Library  complexity  

Page 29: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Sequence  duplica)on  level  >  80%  (low  complexity  library)  

Quality  control:  tag  uniqueness  –  library  complexity  metric  

NRF:  Non-­‐redundant  frac)on  (of  reads):  propor)on  of  unique  tags  /  total    less  than  20%  of  reads  should  be  duplicates  for  10  million  reads  sequenced  (ENCODE)  

FastQC  Babraham  Ins)tute  

Page 30: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

How  do  I  know  my  data  is  of  good  quality?  

Marinov  et  al,  G3  2013    

Objec)ve  (i.e.  peak  independent)  metrics  to  quan)fy  enrichment  in  ChIP-­‐seq;    for  TF  in  mammalian  systems:    Normalised  Strand  Correla)on  NSC  Rela)ve  Strand  Correla)on  RSC  

Large-­‐scale  quality  analysis  of  published  ChIP-­‐seq  data  sets:  20%  low  quality  25%  intermediate  quality  30%  inputs  have  metrics  similar  to  IPs  

Page 31: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Strand  cross-­‐correla)on  

Carroll  et  al,  Front  Genet  2014  

The  correla)on  between  signal  of  the  5ʹ′  end  of  reads  on  the  (+)  and  (-­‐)  strands  is  assessed  axer  successive  shixs  of  the  reads  on  the  (+)  strand  and  the  point  of  maximum  correla)on  between  the  two  strands  is  used  as  an  es)ma)on  of  fragment  length.  

Strand  shix  

Cross  c

orrela)o

n  

Page 32: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Strand  cross-­‐correla)on  

Carroll  et  al,  Front  Genet  2014  

NSC  =  Max  CC  value  (fLen)  

Min  CC  RSC  =  

Max  CC  –  Min  CC  

Phantom  CC  –  Min  CC  

Page 33: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Cross-­‐correla)on  plots  

−500 0 500 1000 1500

0.20

00.

205

0.21

00.

215

0.22

00.

225

strand−shift (105,455)

cros

s−co

rrela

tion

ENCFF000OWMed.sorted.1.bam.picard.bam

NSC=1.14102,RSC=1.06452,Qtag=1

−500 0 500 1000 1500

0.28

60.

288

0.29

00.

292

0.29

40.

296

0.29

80.

300

strand−shift (100,265,245)

cros

s−co

rrela

tion

ENCFF000PET.sorted.1.bam.picard.bam

NSC=1.01443,RSC=0.289702,Qtag=−1

−500 0 500 1000 1500

0.19

0.20

0.21

0.22

0.23

strand−shift (130)

cros

s−co

rrela

tion

ENCFF000PMG.sorted.1.bam

NSC=1.28071,RSC=0.987276,Qtag=0

−500 0 500 1000 15000.25

0.26

0.27

0.28

0.29

0.30

strand−shift (125)

cros

s−co

rrela

tion

ENCFF000PMJ.sorted.1.bam

NSC=1.21367,RSC=1.39752,Qtag=1

−500 0 500 1000 1500

0.27

40.

275

0.27

60.

277

0.27

8

strand−shift (90,200,210)

cros

s−co

rrela

tion

ENCFF000PON.sorted.1.bam.picard.bam

NSC=1.0166,RSC=0.92739,Qtag=0

Very  good  enrichment  

Acceptable  enrichment   Poor  enrichment,  

possibly  undersequenced  

No  clustering  Good  input  

Read  clustering  Bad  input  

Input  

ChIP  

Page 34: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Cumula)ve  enrichment  aka  “Fingerprint”    is  another    metric  for  successful  enrichment  

hWp://deeptools.readthedocs.org  Diaz  et  al,  Genome  Biol  2012  

Page 35: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Park,  Nature  Rev  Gene)cs,  2009    

Page 36: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Peak  calling  

appropriate  methodologies  depend  on  data  type  

SPP  MACS2  

punctate   mixed  signal   broad  signal  

-­‐   -­‐  

This  is  an  ac)ve  area  of  algorithm  development  

Transcrip)on  Factors  

Chroma)n    Remodellers    

Histone  marks  

Chroma)n    Remodellers    

Histone  marks    

RNA  polymerase  II  

Page 37: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Principle  of  peak  detec)on  

Symmetry  in  reads  mapped  to  opposite  DNA  strands  

Computa)on  of  enrichment  model  

Page 38: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Pepke,  2009  

Page 39: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Comparison  of  peak  calling  algorithms  

Wilbanks  2010  

Page 40: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Point-­‐source  vs.  broad  peak  detec)on  

Wilbanks  2010  

Sequence-­‐specific  binding  (TFs)   Distributed  binding  (histones,  RNApol2)  

Page 41: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Comparison  of  enriched  regions  detected  by  various  algorithms  

 

Jung  2014  

55M  human  

Page 42: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Comparison  of  enriched  regions  detected  by  various  algorithms  

 

Jung  2014  

55M  human  

Page 43: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

“Hyper-­‐chippable”  regions  

Carroll  et  al,  Front  Genet  2014  

DER  –  Duke  Excluded  Regions  (11  repeat  classes)  UHS  –  Ultra  High  Signal  (open  chroma)n)  DAC  –  consensus  excluded  regions  

Reads  mapped  to  these  regions  should  be  filtered  out  prior  to  peak  calling    Tracks  available  from  UCSC  for  human,  mouse,  fly  and  worm  

Page 44: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Quality  considera)ons  

•  ChIP-­‐seq  quality  guidelines  from  the  ENCODE  project  (Rela)ve  strand  cross-­‐correla)on,  Irreproducible  discovery  rate)  

•  An)body  valida)on  •  Appropriate  sequencing  depth  (depending  on  genome  size  and  

peak  type).  For  human  genome  and  broad-­‐source  peaks,  min.  40-­‐50M  reads  is  required.  

•  Experimental  replica)on  •  Frac)on  of  reads  in  peaks  (FRiP)  >  1%  •  Cross  correla)on  (correla)on  of  the  density  of  sequences  aligned  to  

opposite  DNA  strands  axer  shixing  by  the  fragment  size)  •  Experimental  verifica)on  of  known  binding  sites  (and  sites  not  

bound  as  nega)ve  controls)  

Page 45: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

ChIP-­‐exo:  improvement  in  binding  site  iden)fica)on  

Rhee  and  Pugh,  Cell  2011    

Page 46: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Other  func)onal  genomics  techniques  

Clifford  et  al,  Nature  Rev  Genet,    2014  

Page 47: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

•  ChIP  –  sequencing:  introduc)on  from  a  bioinforma)cs  point  of  view  

 •  Principles  of  analysis  of  ChIP-­‐seq  data  

•  ChIP-­‐seq:  downstream  analyses  

•  Resources  

•  Exercise  overview  

Page 48: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

ChIPseq  downstream  analyses  

•  Valida)on  (wet  lab)  

•  Downstream  analysis  –  Mo)f  discovery  –  Annota)on  –  Integra)on  of  binding  and  expression  data  –  Integra)on  of  various  binding  datasets  –  Differen)al  binding  

 

Page 49: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Peak  annota)on  

Iden)fica)on  of  nearest  genomic  features  •  BEDtools,  •  BEDops,  •  PeakAnnotator,  •  CisGenome,  •  In  R  /  Bioconductor:  ChIPpeakAnno  

Page 50: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Mo)f  detec)on  

•  Enrichment  of  known  sequence  mo)fs  (CEAS,  Transfac  Match,  HOMER,  RSAT)  

•  De  novo  mo)f  detec)on  (MEME,  CisFinder,  HMS,  DREME,  ChIPMunk,  HOMER,  RSAT)  

Enrichment  of  known  mo)fs  (Homer):  

Page 51: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Signal  visualisa)on  and  interpreta)on  

Binding  profile  of  a  TF  in  rela)on  to  the  transcrip)on  start  site  

deepTools  ngsplots  seqMiner  

•  Clustering  •  Heatmaps  •  Profiles  •  Comparison  of  

different  datasets  

Page 52: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Differen)al  occupancy  

•  Use  algorithms  developed  for  differen)al  expression  and  summarise  reads  mapped  in  peaks;  normalisa)on;  sta)s)cal  tes)ng;  R  environment  –  edgeR  /  csaw  –  DiffBind  (implements  several  normalisa)on  methods)    

•  Calculate  enrichment  in  sliding  windows  –  DROMPA  –  Diffreps  

Page 53: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

•  ChIP  –  sequencing:  introduc)on  from  a  bioinforma)cs  point  of  view  

 •  Principles  of  analysis  of  ChIP-­‐seq  data  

•  ChIP-­‐seq:  downstream  analyses  

•  Resources  

•  Exercise  overview  

Page 54: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Where  to  obtain  data?  

Page 55: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

The  ENCODE  project  

www.encodeproject.org    •  Encyclopedia  of  DNA  elements  •  Iden)fica)on  of  regulatory  DNA  elements  in  human  (and  mouse)  genome  

•  240  human  and  55  mouse  DNA  binding  proteins  •  1464  human  and  432  mouse  samples  •  RNA  profiling,  protein-­‐DNA  interac)on,  chroma)n  

condensa)on,  DNA  methyla)on,  …  •  2009  -­‐  ongoing  

Page 56: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Human  ACTB  locus  as  seen  in  the  UCSC  Genome  Browser  

Gene  model  Alterna)ve  transcripts  Histone  modifica)ons  Chroma)n  structure  

Transcrip)on  factor  binding  sites  DNA  conserva)on  Single  nucleo)de  polymorphisms  (SNP)  Repeats  

Page 57: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Human  ACTB  locus  as  seen  in  the  UCSC  Genome  Browser  

Gene  model  Alterna)ve  transcripts  Histone  modifica)ons  Chroma)n  structure  

Transcrip)on  factor  binding  sites  DNA  conserva)on  Single  nucleo)de  polymorphisms  (SNP)  Repeats  

Page 58: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Human  ACTB  locus  as  seen  in  the  UCSC  Genome  Browser  

Gene  model  Alterna)ve  transcripts  Histone  modifica)ons  Chroma)n  structure  

Transcrip)on  factor  binding  sites  DNA  conserva)on  Single  nucleo)de  polymorphisms  (SNP)  Repeats  

Page 59: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Human  ACTB  locus  as  seen  in  the  UCSC  Genome  Browser  

Gene  model  Alterna)ve  transcripts  Histone  modifica)ons  Chroma)n  structure  

Transcrip)on  factor  binding  sites  DNA  conserva)on  Single  nucleo)de  polymorphisms  (SNP)  Repeats  

Page 60: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Human  ACTB  locus  as  seen  in  the  UCSC  Genome  Browser  

Gene  model  Alterna)ve  transcripts  Histone  modifica)ons  Chroma)n  structure  

Transcrip)on  factor  binding  sites  DNA  conserva)on  Single  nucleo)de  polymorphisms  (SNP)  Repeats  

Page 61: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

The  Epigenomics  Roadmap  Project  

hWp://www.roadmapepigenomics.org/    •  Reference  human  epigenomes  •  DNA  methyla)on,  histone  modifica)ons,  chroma)n  

accessibility  and  small  RNA  transcripts    •  Stem  cells  and  primary  ex  vivo  )ssues    •  111  )ssue  and  cell  types  •  2,804  genome-­‐wide  datasets  

Page 62: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Further  reading  

•  Impact  of  ar)fact  removal  on  ChIP  quality  metrics  in  ChIP-­‐seq  and  ChIP-­‐exo  data.  Carrol  et  al,  Front.  Genet.  2014  

•  Impact  of  sequencing  depth  in  ChIP-­‐seq  experiments.  Jung  et  al,  NAR  2014  

•  ChIP-­‐seq  guidelines  and  prac)ces  of  the  ENCODE  and  modENCODE  consor)a.  Landt  et  al,  Genome  Res.  2012  

•  hWp://genome.ucsc.edu/ENCODE/qualityMetrics.html#defini)ons  

•  hWps://www.encodeproject.org/data-­‐standards  

Page 63: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Bioconductor  ChIP-­‐seq  resources  •  General  purpose  tools:  

–  Rsubread  (read  mapping;  not  ideal  for  global  alignment)  –  Rbow)e  (global  alignment)  –  GenomicRanges  (tools  for  manipula)ng  range  data)  –  Rsamtools  (SAM  /  BAM  support)  –  htSeqTools  (tools  for  NGS  data;  post-­‐alignment  QC)  –  chipseq  (u)li)es  for  ChIP-­‐seq  analysis)  –  Csaw  (a  pipeline  for  ChIP-­‐seq  analysis,  including  sta)s)cal  analysis  of  differen)al  occupancy)  

•  Peak  calling  –  SPP  –  BayesPeak  (HMM  and  Bayesian  sta)s)cs)  –  MOSAiCS  (model-­‐based  one  and  two  Sample  Analysis  and  Inference  for  ChIP-­‐Seq)  –  iSeq  (Hidden  Ising  models)  –  ChIPseqR  (developed  to  analyse  nucleosome  posi)oning  data)  

•  Quality  control  –  ChIPQC  

•  Differen)al  occupancy  –  edgeR  –  DESeq,  DESeq2  –  DiffBind  (compa)ble  with  objects  used  for  ChIPQC,  wrapper  for  DESeq  and  edgeR  DE  func)ons)  

•  Peak  Annota)on  –  ChIPpeakAnno  (annota)ng  peaks  with  genome  context  informa)on)  

Page 64: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

•  ChIP  –  sequencing:  introduc)on  from  a  bioinforma)cs  point  of  view  

 •  Principles  of  analysis  of  ChIP-­‐seq  data  

•  ChIP-­‐seq:  downstream  analyses  

•  Resources  

•  Exercise  overview  

Page 65: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Exercise  •  1.  Quality  control  •  2.  Read  preprocessing  •  3.  Peak  calling  •  4.  Exploratory  analysis  (sample  clustering)  •  5.  Visualisa)on  •  6.  Sta)s)cal  analysis  of  differen)al  occupancy      

Page 66: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Did  my  ChIP  work?  

Cross-­‐correla)on   Cumula)ve  enrichment  

−500 0 500 1000 1500

0.10

0.12

0.14

0.16

0.18

0.20

0.22

0.24

strand−shift (100)

cros

s−co

rrela

tion

ENCFF000PED.chr12.bam

NSC=2.50193,RSC=1.87725,Qtag=2

Page 67: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Exploratory  analysis  

Clustering  of  libraries    by  reads  mapped  in  bins,  genome  –  wide  (spearman)  

Clustering  of  libraries    by  reads  mapped  in  peaks  (pearson)  

HeLa  

Sknsh  &  

HepG

2  ne

ural  

HepG

2  

neural  

Sknsh  HeLa  

HepG2  

I  

Ch  

Ch  

I  

I  

Ch  

Page 68: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Binding  profile  around  TSS  

Page 69: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Ques)ons?  

[email protected]  

Page 70: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

That’s  all  for  now,    

)me  to  do  some  hands-­‐on  work  

Page 71: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation
Page 72: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Library  quality  control  and  preprocessing  

•  FastQC  /  Prinseq  

•  Trim  adapters  if  any  adapter  sequences  are  present  in  the  reads  (as  determined  by  the  QC)  

•  In  some  cases,  you’ll  observe  k-­‐mer  enrichment  (especially  if  the  data  is  ChIP-­‐exo,  a  new  varia)on  of  ChIP-­‐seq)  –  it  is  not  necessarily  a  bad  thing,  if  sequence  duplica)on  levels  are  low;  however  it  may  indicate  low  complexity  of  the  library  –  a  warning  sign  that  the  enrichment  in  ChIP  was  not  successful  or  the  libraries  are  over-­‐amplified  (oxen  the  laWer  is  the  consequence  of  the  former)  

Page 73: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Sequence  duplica)on  level  >  70%  (low  complexity  library)  

Quality  control:  tag  uniqueness  –  library  complexity  metric  

NRF:  Non-­‐redundant  frac)on  (of  reads):  propor)on  of  unique  tags  /  total    less  than  20%  of  reads  should  be  duplicates  for  10  million  reads  sequenced  (ENCODE)  

Page 74: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Mapping  reads  to  the  reference  genome  

•  Choose  the  right  reference:  assembly  version  (not  always  the  newest  is  best)  and  type  (primary  assembly,  or  assembly  from  individual  chromosome  sequences  +  non-­‐chromosomal  con)gs;  not  the  top  level  assembly);  choose  the  matching  annota)on  file  (GTF,  GFF)  

•  Read  mapping:  global  alignment  •  Mappers  (=  aligners):  Bow)e,  BWA,  BBMap,  Novoalign,  …  (lots  of  tools  are  

available)  

•  Visualise  data  in  genome  browser  –  BAM  files  or  tracks  (wig,  bedgraph,  bigWig)  –  Local  (IGV)  or  web-­‐based  (UCSC  genome  browser)  –  Data  quality  assessment  

Page 75: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Cross-­‐correla)on  profiles,  RSC  and  NSC  

•  Metrics  to  quan)fy  the  fragment  length  signal  and  the  ra)o  of  fragment  length  signal  to  read  length  signal    

•  Rela)ve  Cross  Correla)on  (RSC)  -­‐      ChIP  to  ar)fact  signal  

•  Normalised  Cross  Correla)on  (NSC)      

•  TFs:  fragment  lengths  are  oxen  greater  than  the  size  of  the  DNA  binding  event,  the  dis)nct  clustering  of  (+)  and  (-­‐)  reads  around  this  site  is  very  apparent  

•  NSC>1.1  (higher  values  indicate  more  enrichment;  1  =  no  enrichment)  •   RSC>0.8  (0  =  no  signal;  <1  low  quality  ChIP;  >1  high  enrichment  •  Broad  peaks:  this  clustering  may  be  more  diffuse  (fragment  length  <  peak)  

CC(Fragment  length)  min  (CC)  

CC(Fragment  length)-­‐min  (CC)    CC  (read  length)  –  min  (CC)    

Page 76: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Comparison  of  peak  calling  algorithms  

Wilbanks  2010  

Peak  overlap  (Ho  et  al,  2012)  

>  50  %  

20  %  

Page 77: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Effect  of  sequencing  depth  on  regions  detected  by  various  algorithms  

 

Jung  2014  

Page 78: Introduc)on*to*Chroman*IP*– sequencing( ChIP7seq ... · Cross7correlaon*plots* −500 0 500 1000 1500 0.200 0.205 0.210 0.215 0.220 0.225 strand−shift (105,455) cross − correlation

Fold  enrichment  =  signal  /  background