Implementation of GPU-based bioinformatic tools at the ENCODE DCC
DESCRIPTION
An overview of the assays performed and distributed by the ENCODE DCC, and a summary of the uniform processing pipelines being implemented by the ENCODE Consortium. We also discuss how using GPUs affects the speed of running the ChIP-seq pipeline.
TRANSCRIPT
![Page 1: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/1.jpg)
The ENCODE DCC
Eurie L. Hong, Ph.D. • Project Manager, ENCODE DCC
PI: J. Michael Cherry, Ph.D.
Department of Genetics • Stanford University School of Medicine
https://www.encodeproject.org/
![Page 2: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/2.jpg)
What is the ENCODE Consortium?
Image credit: NHGRI
![Page 3: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/3.jpg)
Role of the Data Coordination Center
[Diagram: Production labs and Analysis groups → data files, metadata → DCC → ENCODE portal (DCC), Genome Browser, integrative websites → scientific community]
Role: Data generation
Tasks:
• Perform assays
• Perform analyses
• Validate data
• Submit data files
• Submit metadata
Role: Data organization
Tasks:
• Define submission process
• Data processing & validation
• Data file storage
• Metadata curation
Role: Data access
Tasks:
• Web-based searches
• Data downloads
![Page 4: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/4.jpg)
DCC goals for implementation
Transparency of methods
• How was the experiment performed?
• What software was used to analyze the data?
Reproducibility of results
• What files were used?
• What software and parameters were used for the pipelines?
Interoperability with other genomic projects
• Can the pipeline software we use be used by other projects?
• Can the metadata allow easy integration with other data?
![Page 5: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/5.jpg)
Data volume: diversity of assays
Modified from PLoS Biol 9:e1001046, 2011 (M. Pazin)
Approximately 30 different assays
![Page 6: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/6.jpg)
Data volume: number of assays
(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)
![Page 7: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/7.jpg)
Transparency & reproducibility: Capture the experimental design
[Diagram: an experiment and its control experiment, broken down into biological and technical replicates, with the files produced at each step]
Experiment
• Biological replicate 1, Technical replicate 1: Raw data file (fastq) → Software & pipelines → Processed file (bam)
• Biological replicate 2, Technical replicate 1: Raw data file (fastq) → Software & pipelines → Processed file (bam)
Control experiment
• Biological replicate 1, Technical replicate 1: Raw data file (fastq) → Software & pipelines → Processed file (bam)
Software & pipelines → Processed file (peak calls)
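The experimental design on this slide can be read as a small data model: experiments own replicates, replicates own files, and each processed file records what it was derived from and which software produced it. The sketch below is one possible way to capture that in Python; the class names, field names, file names, and software labels are all illustrative assumptions and are not taken from the ENCODE portal schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataFile:
    """A file plus the provenance needed to reproduce it (hypothetical fields)."""
    name: str
    file_format: str                       # e.g. "fastq", "bam", "peak calls"
    derived_from: List["DataFile"] = field(default_factory=list)
    software: Optional[str] = None         # placeholder label for the pipeline step


@dataclass
class Replicate:
    biological: int
    technical: int
    files: List[DataFile] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    replicates: List[Replicate] = field(default_factory=list)
    control: Optional["Experiment"] = None


def raw_inputs(f: DataFile) -> List[str]:
    """Follow derived_from links back to the raw files a result depends on."""
    if not f.derived_from:
        return [f.name]
    names: List[str] = []
    for parent in f.derived_from:
        for n in raw_inputs(parent):
            if n not in names:
                names.append(n)
    return names


def build_example() -> DataFile:
    """Build the layout from the slide with made-up file names."""
    bams = []
    for stem in ("rep1", "rep2", "control"):
        fastq = DataFile(f"{stem}.fastq", "fastq")
        bams.append(DataFile(f"{stem}.bam", "bam", [fastq], software="mapper (placeholder)"))
    rep1 = Replicate(1, 1, [bams[0].derived_from[0], bams[0]])
    rep2 = Replicate(2, 1, [bams[1].derived_from[0], bams[1]])
    ctrl_rep = Replicate(1, 1, [bams[2].derived_from[0], bams[2]])
    control = Experiment("control experiment", [ctrl_rep])
    Experiment("experiment", [rep1, rep2], control=control)
    # The peak calls depend on every BAM, including the control.
    return DataFile("peaks.bed", "peak calls", bams, software="peak caller (placeholder)")


if __name__ == "__main__":
    print(raw_inputs(build_example()))
```

Storing the parent files and software on each file is what lets a question such as "What files were used?" be answered by walking the links rather than by reading lab notes.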
![Page 8: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/8.jpg)
Data interoperability: uniform processing pipelines
(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)
![Page 9: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/9.jpg)
Processing of TF ChIP-seq assays
Input: FASTQ (SE/PE) for replicates and controls
Steps: Map Reads → Filter → Pool → Subsample Pseudoreplicates → Call Peaks → IDR, plus Signal Tracks
Files produced:
• BAM: replicates, pooled reps, controls
• BAM: 2 pseudoreplicates per replicate, 2 pseudoreplicates per pool
• peak: replicates, pseudoreplicates, pools
• peak: IDR-thresholded peak calls
• bigWig: replicates, pooled replicates
Specification document (Anshul Kundaje): https://docs.google.com/document/d/1lG_Rd7fnYgRpSIqrIfuVlAz2dW1VaSQThzk836Db99c/edit?usp=sharing
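The pooling and pseudoreplicate steps on this slide can be illustrated with a toy example. The sketch below assumes that a pseudoreplicate pair is made by shuffling a read set and splitting it into two halves, which gives the "2 pseudoreplicates per replicate" and "2 pseudoreplicates per pool" shown above; the function names and read labels are hypothetical, and the specification document remains the authority on how the real pipeline does this.

```python
import random
from typing import List, Sequence, Tuple


def pool(*replicates: Sequence[str]) -> List[str]:
    """Concatenate the reads of several replicates into one pooled set."""
    pooled: List[str] = []
    for rep in replicates:
        pooled.extend(rep)
    return pooled


def make_pseudoreplicates(reads: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the reads and split them into two halves (assumed scheme)."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(reads)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    # Toy read names standing in for aligned reads.
    rep1 = [f"rep1_read{i}" for i in range(6)]
    rep2 = [f"rep2_read{i}" for i in range(4)]

    per_replicate = [make_pseudoreplicates(r) for r in (rep1, rep2)]
    per_pool = make_pseudoreplicates(pool(rep1, rep2))

    for a, b in per_replicate + [per_pool]:
        print(len(a), len(b))
```

Each read lands in exactly one half, so peaks called on the two halves can be compared against each other in the IDR step.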
![Page 10: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/10.jpg)
Relative CPU time for ChIP-seq (original)
[Chart segments: Map, Signal Tracks, Subsample, Call Peaks, IDR; callout labels: IDR, Peak Calling]
Relative CPU time per step for a typical transcription factor ChIP-seq experiment. IDR can take much longer if there are many regions, as in a typical histone ChIP.
Nikhil Podduturi
![Page 11: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/11.jpg)
Data volume: TF ChIP-seq
(includes mouse & human)
![Page 12: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/12.jpg)
Performance Comparison: IDR analysis CPU (re-engineered) vs GPU
[Bar chart: clock time in seconds, log10 scale from 1 to 10000. CPU: 60 min; NVIDIA GPU: 30 sec]
~120x Speed Increase
Nikhil Podduturi
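The ~120x figure follows directly from the two clock times on this slide: 60 minutes is 3600 seconds, and 3600 / 30 = 120. A short sketch of that arithmetic, including where each bar falls on the log10 axis (the function name is an illustrative choice):

```python
import math


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Ratio of the baseline clock time to the new clock time."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


cpu_seconds = 60 * 60   # 60 min on the CPU, from the slide
gpu_seconds = 30        # 30 sec on the GPU, from the slide

print(speedup(cpu_seconds, gpu_seconds))      # 120.0
print(round(math.log10(cpu_seconds), 2))      # 3.56, between the 1000 and 10000 ticks
print(round(math.log10(gpu_seconds), 2))      # 1.48, between the 10 and 100 ticks
```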
![Page 13: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/13.jpg)
Impact on use for data processing
Re-engineered
• improved stability
• tests!
• ability to run on CPU or GPU
Faster processing
• recalculation of entire data corpus against new genome build
• allow determination of data-based thresholds and cut-offs
Public availability
• Can be run on GPU instances available at AWS
• GPU implementation of IDR: https://github.com/ENCODE-DCC/idr-GPU
• TF ChIP-seq: https://github.com/ENCODE-DCC/tf_chipseq
• Others available: https://github.com/ENCODE-DCC
![Page 14: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/14.jpg)
Next Steps
Data validation
• GPU vs CPU results
Pipeline release
• Integration into ChIP-seq pipeline
• Deployment via AWS instances and at DNAnexus
Adapt additional software components
• SPP: https://github.com/nikhilRP/spp-GPU
• Hotspots: https://github.com/nikhilRP/hotspot-GPU
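The "GPU vs CPU results" validation step above could be approached as a tolerance-based comparison, since floating-point results from a GPU and a CPU rarely match bit for bit. The sketch below is only one way to frame that check: the function name, the peak identifiers, the scores, and the tolerances are all made up for illustration and do not come from the idr-GPU repository.

```python
import math
from typing import Dict, List


def compare_results(cpu: Dict[str, float], gpu: Dict[str, float],
                    rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> List[str]:
    """Return a list of human-readable differences between two result sets."""
    problems: List[str] = []
    for key in sorted(set(cpu) | set(gpu)):
        if key not in cpu:
            problems.append(f"{key}: missing from CPU output")
        elif key not in gpu:
            problems.append(f"{key}: missing from GPU output")
        elif not math.isclose(cpu[key], gpu[key], rel_tol=rel_tol, abs_tol=abs_tol):
            problems.append(f"{key}: CPU={cpu[key]} GPU={gpu[key]}")
    return problems


if __name__ == "__main__":
    # Made-up per-peak scores, not real pipeline output.
    cpu_scores = {"peak_1": 0.0125, "peak_2": 0.4300, "peak_3": 0.0010}
    gpu_scores = {"peak_1": 0.0125, "peak_2": 0.4300000001, "peak_4": 0.2000}

    for line in compare_results(cpu_scores, gpu_scores):
        print(line)
```

Reporting missing entries separately from numeric mismatches helps distinguish a difference in which peaks were kept from ordinary rounding differences.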
![Page 15: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/15.jpg)
ENCODE DCC
Nikhil Podduturi, Laurence Rowe, Forrest Tanaka
Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, Seth Strattan
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang, Marcus Ho
@encodedcc [email protected]
Data Wranglers
Software engineers
QA, sysadmins, admin, biocurator assistant
https://github.com/ENCODE-DCC/
The ENCODE DCC is funded by NHGRI Grant U41HG006992
![Page 16: Implementation of GPU-based bioinformatic tools at the ENCODE DCC](https://reader034.vdocuments.us/reader034/viewer/2022042607/5598568c1a28aba11d8b46a6/html5/thumbnails/16.jpg)
ENCODE Uniform Processing Pipeline Work
DNAnexus (PaaS): Brett Hannigan, Andrey Kislyuk, Mike Lin, Singer Ma, Ohad Rodeh
NVIDIA Corporation: NVIDIA Academic Hardware donation program, donation of two Kepler K40 GPUs; NVIDIA’s NVBIO framework
Ben Hitz, Seth Strattan, Nikhil Podduturi
ChIP-seq against transcription factors: Anshul Kundaje
ChIP-seq against histone marks: Anshul Kundaje
RNA-seq: ENCODE RNA working group
Whole genome bisulfite sequencing: Junko Tsuji, Zhiping Weng
DNase-seq: Alvin Qin, Shirley Liu