The ENCODE DCC
Eurie L. Hong, Ph.D. • Project Manager, ENCODE DCC
PI: J. Michael Cherry, Ph.D.
Department of Genetics • Stanford University School of Medicine
https://www.encodeproject.org/
What is the ENCODE Consortium?
Image credit: NHGRI
Production labs • Analysis groups
Role: Data generation
• Perform assays
• Perform analyses
• Validate data
• Submit data files
• Submit metadata
Role: Data organization
• Define submission process
• Data processing & validation
• Data file storage
• Metadata curation
Role: Data access
• Web-based searches
• Data downloads
Genome Browser
ENCODE portal (DCC)
Role of the Data Coordination Center
Diagram labels: Data files, Metadata, DCC, Integrative websites, Scientific community
DCC goals for implementation
Transparency of methods
• How was the experiment performed?
• What software was used to analyze the data?
Reproducibility of results
• What files were used?
• What software and parameters were used for the pipelines?
Interoperability with other genomic projects
• Can the pipeline software we use be used by other projects?
• Can the metadata allow easy integration with other data?
Data volume: diversity of assays
Modified from PLoS Biol 9:e1001046, 2011 (M. Pazin)
Approximately 30 different assays
Data volume: number of assays
(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)
Transparency & reproducibility: Capture the experimental design
Experiment
  Biological replicate 1
    Technical replicate 1
      Raw data file (fastq) → Software & pipelines → Processed file (bam)
  Biological replicate 2
    Technical replicate 1
      Raw data file (fastq) → Software & pipelines → Processed file (bam)
Control experiment
  Biological replicate 1
    Technical replicate 1
      Raw data file (fastq) → Software & pipelines → Processed file (bam)
Processed files (bam) → Software & pipelines → Processed file (peak calls)
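The experimental design on this slide (experiment, biological and technical replicates, raw and processed files, and the software that links them) can be sketched as a small data model. This is a minimal illustration, not the DCC's actual metadata schema: every class name, field name, and file name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DataFile:
    """A file plus the provenance needed to reproduce it (hypothetical fields)."""
    name: str
    file_format: str                 # e.g. "fastq", "bam", "peak calls"
    software: str = ""               # tool or pipeline that produced the file
    derived_from: List["DataFile"] = field(default_factory=list)


@dataclass
class TechnicalReplicate:
    number: int
    files: List[DataFile] = field(default_factory=list)


@dataclass
class BiologicalReplicate:
    number: int
    technical_replicates: List[TechnicalReplicate] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    replicates: List[BiologicalReplicate] = field(default_factory=list)
    controls: List["Experiment"] = field(default_factory=list)


def raw_inputs(f: DataFile) -> List[str]:
    """Follow derived_from links back to the raw files a result depends on."""
    if not f.derived_from:
        return [f.name]
    names: List[str] = []
    for parent in f.derived_from:
        for name in raw_inputs(parent):
            if name not in names:
                names.append(name)
    return names


def one_replicate(prefix: str, number: int) -> Tuple[BiologicalReplicate, DataFile]:
    """Build one biological replicate with a fastq and the bam derived from it."""
    fastq = DataFile(f"{prefix}.fastq", "fastq")
    bam = DataFile(f"{prefix}.bam", "bam", "mapping pipeline", [fastq])
    tech = TechnicalReplicate(1, [fastq, bam])
    return BiologicalReplicate(number, [tech]), bam


def build_example() -> Tuple[Experiment, DataFile]:
    """Mirror the slide: two replicates, one control experiment, one peak file."""
    rep1, bam1 = one_replicate("rep1", 1)
    rep2, bam2 = one_replicate("rep2", 2)
    ctl_rep, ctl_bam = one_replicate("control", 1)
    control = Experiment("control experiment", [ctl_rep])
    experiment = Experiment("experiment", [rep1, rep2], [control])
    peaks = DataFile("peaks", "peak calls", "peak-calling pipeline",
                     [bam1, bam2, ctl_bam])
    return experiment, peaks
```

Calling `raw_inputs(peaks)` on the example returns the three fastq names. This is the kind of question ("What files were used?") that recording the design is meant to answer.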
Data interoperability: uniform processing pipelines
(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)
Processing of TF ChIP-seq assays
Steps: Map Reads, Filter, Pool • Subsample Pseudoreplicates • Call Peaks • IDR • Signal Tracks
Files:
• FASTQ (SE/PE): Replicates, Controls
• BAM: Replicates, Pooled Reps, Controls
• BAM: 2 Pseudoreplicates per replicate, 2 Pseudoreplicates per pool
• peak: Replicates, Pseudoreplicates, Pools
• peak: IDR-thresholded Peak Calls
• bigWig: Replicates, Pooled Replicates
Specification document (Anshul Kundaje): https://docs.google.com/document/d/1lG_Rd7fnYgRpSIqrIfuVlAz2dW1VaSQThzk836Db99c/edit?usp=sharing
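The pooling and pseudoreplicate steps named on this slide can be illustrated with a short sketch. It assumes pseudoreplicates are made by randomly shuffling reads and splitting them into two halves, which is a common approach but is not confirmed here as what the specification document prescribes. The function names, the `seed` parameter, and the read identifiers are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def pool(replicates: Sequence[Sequence[str]]) -> List[str]:
    """Combine the reads from all replicates into one pooled list."""
    pooled: List[str] = []
    for reads in replicates:
        pooled.extend(reads)
    return pooled


def pseudoreplicates(reads: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the reads and split them into two halves (two pseudoreplicates)."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Toy input: read identifiers stand in for aligned reads.
rep1 = ["r1_a", "r1_b", "r1_c", "r1_d"]
rep2 = ["r2_a", "r2_b", "r2_c"]

per_replicate = [pseudoreplicates(r) for r in (rep1, rep2)]  # 2 per replicate
per_pool = pseudoreplicates(pool([rep1, rep2]))              # 2 per pool
```

Seeding the shuffle keeps the split reproducible, which matters when the same experiment is reprocessed later.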
Relative CPU time for ChIP-seq (original)
Chart legend: Map, Signal Tracks, Subsample, Call Peaks, IDR
Chart callouts: IDR, Peak Calling
Relative CPU time per step for a typical transcription factor ChIP-seq experiment.
IDR can take much longer if there are many regions, as in a typical histone ChIP.
Nikhil Podduturi
Data volume: TF ChIP-seq
(includes mouse & human)
Performance Comparison: IDR analysis, CPU (re-engineered) vs GPU
Chart axis: Clock Time (Seconds), Log10 scale, 1 to 10000
CPU: 60 min
NVIDIA GPU: 30 sec
~120x Speed Increase
Nikhil Podduturi
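The ~120x figure follows directly from the two clock times on the slide (60 min on CPU, 30 sec on GPU). A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Ratio of the baseline clock time to the new clock time."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


cpu_seconds = 60 * 60   # 60 min on CPU, from the slide
gpu_seconds = 30        # 30 sec on GPU, from the slide

print(speedup(cpu_seconds, gpu_seconds))  # 120.0
```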
Impact on use for data processing
Re-engineered
• improved stability
• tests!
• ability to run on CPU or GPU
Faster processing
• recalculation of entire data corpus against new genome build
• allow determination of data-based thresholds and cut-offs
Public availability
• Can be run on GPU instances available at AWS
• GPU implementation of IDR: https://github.com/ENCODE-DCC/idr-GPU
• TF ChIP-seq: https://github.com/ENCODE-DCC/tf_chipseq
• Others available: https://github.com/ENCODE-DCC
Next Steps
Data validation
• GPU vs CPU results
Pipeline release
• Integration into ChIP-seq pipeline
• Deployment via AWS instances and at DNAnexus
Adapt additional software components
• SPP: https://github.com/nikhilRP/spp-GPU
• Hotspots: https://github.com/nikhilRP/hotspot-GPU
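One way the GPU vs CPU validation step could be approached is to compare the per-region values from the two implementations within a floating-point tolerance. The sketch below is an assumption about how such a check might look, not the DCC's validation procedure: the function name, the region identifiers, and the tolerance values are hypothetical.

```python
import math
from typing import Dict, List


def compare_results(cpu: Dict[str, float], gpu: Dict[str, float],
                    rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> List[str]:
    """Return region ids missing from one side or differing beyond tolerance."""
    mismatches: List[str] = []
    for region in sorted(set(cpu) | set(gpu)):
        if region not in cpu or region not in gpu:
            mismatches.append(region)  # present in only one result set
        elif not math.isclose(cpu[region], gpu[region],
                              rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append(region)  # values disagree beyond tolerance
    return mismatches


# Toy values keyed by hypothetical region ids.
cpu_values = {"peak_1": 0.010, "peak_2": 0.200, "peak_3": 0.950}
gpu_values = {"peak_1": 0.010, "peak_2": 0.200000001, "peak_3": 0.900}

print(compare_results(cpu_values, gpu_values))  # ['peak_3']
```

A tolerance is used rather than exact equality because floating-point arithmetic on different hardware can differ in the last few digits.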
ENCODE DCC
Nikhil Podduturi, Laurence Rowe, Forrest Tanaka
Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, Seth Strattan
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang, Marcus Ho
@encodedcc [email protected]
Data Wranglers
Software engineers
QA, sysadmins, admin, biocurator assistant
https://github.com/ENCODE-DCC/
The ENCODE DCC is funded by NHGRI Grant U41HG006992
ENCODE Uniform Processing Pipeline Work
DNAnexus (PaaS): Brett Hannigan, Andrey Kislyuk, Mike Lin, Singer Ma, Ohad Rodeh
NVIDIA Corporation: NVIDIA Academic Hardware donation program, donation of two Kepler K40 GPUs; NVIDIA’s NVBIO framework
Ben Hitz, Seth Strattan, Nikhil Podduturi
ChIP-seq against transcription factors: Anshul Kundaje
ChIP-seq against histone marks: Anshul Kundaje
RNA-seq: ENCODE RNA working group
Whole genome bisulfite sequencing: Junko Tsuji, Zhiping Weng
DNase-seq: Alvin Qin, Shirley Liu