Implementation of GPU-based bioinformatic tools at the ENCODE DCC

The ENCODE DCC

Eurie L. Hong, Ph.D. • Project Manager, ENCODE DCC
PI: J. Michael Cherry, Ph.D.

Department of Genetics • Stanford University School of Medicine

https://www.encodeproject.org/

What is the ENCODE Consortium?

Image credit: NHGRI

Production labs, Analysis groups

Role: Data generation
• Perform assays
• Perform analyses
• Validate data
• Submit data files
• Submit metadata

Role: Data organization
• Define submission process
• Data processing & validation
• Data file storage
• Metadata curation

Role: Data access
• Web-based searches
• Data downloads

Genome Browser

ENCODE portal (DCC)

Role of the Data Coordination Center

[Diagram labels: Data files, Metadata, DCC, Integrative websites, Scientific community]

Transparency of methods
• How was the experiment performed?
• What software was used to analyze the data?

Reproducibility of results
• What files were used?
• What software and parameters were used for the pipelines?

Interoperability with other genomic projects
• Can the pipeline software we use be used by other projects?
• Can the metadata allow easy integration with other data?

DCC goals for implementation

Data volume: diversity of assays

Modified from PLoS Biol 9:e1001046, 2011 (M. Pazin)

Approximately 30 different assays

Data volume: number of assays

(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)

Transparency & reproducibility: Capture the experimental design

[Diagram labels: Experiment; Biological replicate 1; Technical replicate 1; Raw data file (fastq); Processed file (bam); Processed file (peak calls); Software & pipelines; Control experiment]
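
The experimental design on this slide (an experiment with biological and technical replicates, raw and processed files, the software that produced them, and a control experiment) can be sketched as a small data model. This is an illustrative sketch only: the class names, field names, file names, and software labels are hypothetical and are not the ENCODE DCC metadata schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataFile:
    name: str
    file_format: str                      # e.g. "fastq", "bam", "peak calls"
    derived_from: List["DataFile"] = field(default_factory=list)
    software: Optional[str] = None        # tool or pipeline step that produced the file


@dataclass
class Replicate:
    biological_replicate: int
    technical_replicate: int
    files: List[DataFile] = field(default_factory=list)


@dataclass
class Experiment:
    label: str
    replicates: List[Replicate] = field(default_factory=list)
    control: Optional["Experiment"] = None


def provenance(data_file: DataFile) -> List[str]:
    """Return the names of every upstream file, nearest ancestors first."""
    names: List[str] = []
    queue = list(data_file.derived_from)
    while queue:
        parent = queue.pop(0)
        names.append(parent.name)
        queue.extend(parent.derived_from)
    return names


# Hypothetical example: one replicate with a raw file and two processed files.
fastq = DataFile("rep1.fastq", "fastq")
bam = DataFile("rep1.bam", "bam", derived_from=[fastq], software="hypothetical-mapper")
peaks = DataFile("rep1.peaks", "peak calls", derived_from=[bam],
                 software="hypothetical-peak-caller")
control = Experiment("control experiment")
experiment = Experiment("experiment", [Replicate(1, 1, [fastq, bam, peaks])], control)
```

Recording `derived_from` and `software` on each file is what lets the questions "What files were used?" and "What software was used?" be answered by walking the links, as `provenance(peaks)` does.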


Data interoperability:uniform processing pipelines

(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)

Processing of TF ChIP-seq assays

[Pipeline diagram: steps and the files they produce]

Input
• FASTQ (SE/PE): Replicates, Controls

Map Reads, Filter, Pool
• BAM: Replicates, Pooled Reps, Controls

Subsample Pseudoreplicates
• BAM: 2 Pseudoreplicates per replicate, 2 Pseudoreplicates per pool

Call Peaks
• peak: Replicates, Pseudoreplicates, Pools

IDR
• peak: IDR-thresholded Peak Calls

Signal Tracks
• bigWig: Replicates, Pooled Replicates

Specification document (Anshul Kundaje): https://docs.google.com/document/d/1lG_Rd7fnYgRpSIqrIfuVlAz2dW1VaSQThzk836Db99c/edit?usp=sharing
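
The "Subsample Pseudoreplicates" step can be illustrated with a short sketch. It assumes that a pseudoreplicate is made by randomly shuffling a set of reads and splitting it into two halves; the specification document linked above is the authoritative description. The function name and the `seed` parameter are hypothetical, and reads are represented as plain strings for simplicity.

```python
import random
from typing import List, Sequence, Tuple


def make_pseudoreplicates(reads: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the reads and split them into two pseudoreplicates of near-equal size."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Two pseudoreplicates per replicate, and two per pool of all replicates.
rep1 = [f"rep1_read{i}" for i in range(6)]
rep2 = [f"rep2_read{i}" for i in range(5)]
per_replicate = [make_pseudoreplicates(r) for r in (rep1, rep2)]
per_pool = make_pseudoreplicates(rep1 + rep2)
```

Each pair of pseudoreplicates is disjoint and together covers all of the input reads, so downstream peak calls on the two halves can be compared against each other.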

Relative CPU time for ChIP-seq (original)

[Figure: chart of relative CPU time per step: Map, Signal Tracks, Subsample, Call Peaks, IDR. Callouts: IDR, Peak Calling]

Relative CPU time per step for a typical transcription factor ChIP-seq experiment.
IDR can take much longer if there are many regions, as in a typical histone ChIP.

Nikhil Podduturi

Data volume: TF ChIP-seq

(includes mouse & human)

Performance Comparison: IDR analysis, CPU (re-engineered) vs GPU

[Figure: bar chart of clock time in seconds (log10 scale, 1 to 10000) for CPU and NVIDIA GPU]

CPU: 60 min
NVIDIA GPU: 30 sec
~120x Speed Increase

Nikhil Podduturi
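
The ~120x figure follows directly from the two clock times reported on the slide (60 min on CPU, 30 sec on GPU). The arithmetic, as a minimal sketch with illustrative variable names:

```python
import math

cpu_seconds = 60 * 60      # 60 min on CPU, from the slide
gpu_seconds = 30           # 30 sec on GPU, from the slide

speedup = cpu_seconds / gpu_seconds
print(f"{speedup:.0f}x")   # prints "120x"

# Positions of the two bars on a log10 clock-time axis.
cpu_log10 = math.log10(cpu_seconds)
gpu_log10 = math.log10(gpu_seconds)
```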

Impact on use for data processing

Re-engineered
• improved stability
• tests!
• ability to run on CPU or GPU

Faster processing
• recalculation of entire data corpus against new genome build
• allow determination of data-based thresholds and cut-offs

Public availability
• Can be run on GPU instances available at AWS
• GPU implementation of IDR: https://github.com/ENCODE-DCC/idr-GPU
• TF ChIP-seq: https://github.com/ENCODE-DCC/tf_chipseq
• Others available: https://github.com/ENCODE-DCC

Next Steps

Data validation
• GPU vs CPU results

Pipeline release
• Integration into ChIP-seq pipeline
• Deployment via AWS instances and at DNAnexus

Adapt additional software components
• SPP: https://github.com/nikhilRP/spp-GPU
• Hotspots: https://github.com/nikhilRP/hotspot-GPU
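
The "GPU vs CPU results" validation listed under Next Steps could take the form of a tolerance-based comparison, since floating-point output from two implementations may differ in the last digits. The sketch below is an assumption about how such a check might look, not the DCC's validation procedure; the function name, the tolerance, and the sample values are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def compare_results(cpu: Sequence[float], gpu: Sequence[float],
                    rel_tol: float = 1e-6) -> List[Tuple[int, float, float]]:
    """Return (index, cpu_value, gpu_value) for every pair that differs beyond rel_tol."""
    if len(cpu) != len(gpu):
        raise ValueError("CPU and GPU outputs have different lengths")
    return [(i, c, g) for i, (c, g) in enumerate(zip(cpu, gpu))
            if not math.isclose(c, g, rel_tol=rel_tol)]


cpu_scores = [0.01, 0.05, 0.20]            # made-up values for illustration
gpu_scores = [0.01, 0.05000000001, 0.25]   # second value differs only by rounding noise
mismatches = compare_results(cpu_scores, gpu_scores)
```

An empty list would mean the two runs agree within the chosen tolerance; here only the third pair is reported.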


ENCODE DCC

Nikhil Podduturi, Laurence Rowe, Forrest Tanaka

Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, Seth Strattan

Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz

Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang, Marcus Ho

@encodedcc

Data Wranglers

Software engineers

QA, sysadmins, admin, biocurator assistant

https://github.com/ENCODE-DCC/

The ENCODE DCC is funded by NHGRI Grant U41HG006992

ENCODE Uniform Processing Pipeline Work

DNAnexus (PaaS): Brett Hannigan, Andrey Kislyuk, Mike Lin, Singer Ma, Ohad Rodeh

NVIDIA Corporation: NVIDIA Academic Hardware donation program; donation of two Kepler K40 GPUs; NVIDIA’s NVBIO framework

Ben Hitz, Seth Strattan, Nikhil Podduturi

ChIP-seq against transcription factors: Anshul Kundaje
ChIP-seq against histone marks: Anshul Kundaje
RNA-seq: ENCODE RNA working group
Whole genome bisulfite sequencing: Junko Tsuji, Zhiping Weng
DNase-seq: Alvin Qin, Shirley Liu
