a massively parallel solution to a massively parallel problem hpc4ngs workshop, may 21-22, 2012...

GPGPU in NGS BioinformaticsA massively parallel solution to a massively parallel problem

HPC4NGS Workshop, May 21-22, 2012Principe Felipe Research Center, Valencia

Brian Lam, PhD

Talk Outline

Next-generation sequencing and current challenges

What is GPGPU computing?

The use of GPGPU in NGS bioinformatics

How it works – the BarraCUDA project

System requirements for GPU computing

What if I want to develop my own GPU code?

What to look out for in the near future?

Conclusions

Talk Outline

What is GPGPU computing?

Conclusions

DNA sequencing

“DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases A, G, C, T in a molecule of DNA.” - Wikipedia

Next-generation DNA sequencing

Commercialised in 2005 by 454 Life Science

Sequencing in a massively parallel fashion

How it works

An example: Whole genome sequencing

Extract genomic DNA from blood/mouth swabs Break into small DNA fragments of 200-400 bp Attach DNA fragments to a surface (flow

cells/slides/microtitre plates) at a high density Perform concurrent “cyclic sequencing reaction”

to obtain the sequence of each of the attached DNA fragments

An Illumina HiSeq 2000 can interrogate 825K spots / mm2

GTCCTGA

Capturing the sequencing signals

TATTTTT

ATTCNGG

Not to scale

What do we get from the sequencers

Billions of short DNA sequences, also called sequence reads ranging from

25 to 400 bp

The throughput of NGS has increased dramatically

Current bioinformatics pipeline

Sequence Alignment

Image AnalysisBase Calling

Variant Calling, Peak calling

Just how complex is the pipeline?

Source: Genologics

Talk Outline

What is Many-core/GPGPU computing?

Conclusions

Many-core computing architectures

A physical die that contains a ‘large number’ of processing

i.e. computation can be done in a massively parallel manner

Modern graphics cards (GPUs) consist of hundreds to thousands of computing cores

Just how fast are today’s GPUs

If GPUs are so fast why do we need CPUs?

GPUs are fast, but there is a catch: SIMD – Single instruction, multiple data

CPUs are powerful multi-purpose processors MIMD – Multiple Instructions, multiple

CPU vs GPU

SIMD – pros and cons

The very same (low level) instruction is applied to multiple data at the same time e.g. a GTX680 can do addition to 1536 data

point at a time, versus 16 on a 16-core CPU.

Branching results in serialisations

The ALUs on GPU are usually much more primitive compared to their CPU counterparts.

GPU in scientific computing

Scientific computing often deal with large amount of data, and in many occasions, applying the same set of instructions to these data. Examples:▪ Monte Carlo simulations▪ Image analysis▪ Next-generation sequencing data analysis

Why bother?

Low capital cost and energy efficient Dell 12-core workstation £5,000, ~1kW Dell 40-core computing cluster £ 20,000+, ~6kW

NVIDIA Geforce GTX680 (1536 cores): £400, <0.2kW NVIDIA C2070 (448 cores): £1000, 0.2kW

Many supercomputers also contain multiple GPU nodes for parallel computations

GPGPU in bioinformatics

Examples: CUDASW++ 6.3X

MUMmerGPU 3.5X

GPU-HMMer 60-100x

Talk Outline

Conclusions

A few GPGPU programs are available

Ion-torrent server (T7500 workstation) uses GPUs for base-calling

MummerGPU – comparisons among genomes

BarraCUDA, Soap3 – short read alignments

Talk Outline

Conclusions

Heterogeneous computing

Current bioinformatics pipeline

Sequence Alignment

Image AnalysisBase Calling

Variant Calling, Peak calling

Sequence alignment

Sequence alignment is a crucial step in the bioinformatics pipeline for downstream

analyses

This step often takes many CPU hours to perform

Usually done on HPC clusters

The BarraCUDA Project – an example of GPGPU computing

The main objective of the BarraCUDA project is to develop a software that runs on GPU/ many-core architectures

i.e. to map sequence reads the same way as they come out from the NGS instrument

Software Pipeline

Genome

Read library

Copy read library to GPU

Copy genome to GPU

Copy alignment

results to CPU

Write to disk

Alignment Results

AlignmentAlignment

Alignment

AlignmentAlignment

Alignment

AlignmentAlignment

Alignment

Burrows-Wheeler transform

Originally intended for data compression, performs reversible transformation of a string

In 2000, Ferragina and Manzini introduced BWT-based index data structure for fast substring matching at O(n)

Sub-string matching is performed in a tree traversal-like manner

Used in major sequencing read mapping programs e.g. BWA, Bowtie, Soap2

How it works – a backward search algorithm

matching substring ‘banan’

In mathematical terms

Modified from Li & Durbin Bioinformatics 2009, 25:14, 1754-1760

Programming code

BWT_exactmatch(READ,i,k,l){

if (i < 0) then return RESULTS;k = C(READ[i]) + O(READ[i],k-1)+1;l = C(READ[i]) + O(READ[i],l);if (k <= l) then BWT_exactmatch(READ,i-1,k,l);

}main(){

Calculate reverse BWT string B from reference string X Calculate arrays C(.) and O (.,.) from B Load READS from disk

For every READ in READS do{i = |READ|; Positionk = 0; Lower boundl = |X|; Upper boundBWT_exactmatch(READ,i,k,l);

Write RESULTS to disk}

Modified from Li & Durbin Bioinformatics 2009, 25:14, 1754-1760

Porting to GPU

Simple data parallelism

Used mainly the GPU for matching

Porting to GPU

__device__BWT_GPU_exactmatch(W,i,k,l){ if (i < 0) then return RESULTS; k = C(W[i]) + O(W[i],k-1)+1; l = C(W[i]) + O(W[i],l); if (k <= l) then BWT_GPU_exactmatch(W,i-1,k,l);}__global__GPU_kernel(){ W = READS[thread_no]; i = |W|; Position k = 0; Lower bound l = |X|; Upper bound BWT_GPU_exactmatch(W,i,k,l);}

main(){

Calculate reverse BWT string B from reference string X Calculate array C(.) and O(.,.) from B Load READS from disk Copy B, C(.) and O(.) to GPU Copy READS to GPU

Launch GPU_kernel with <<|READS|>> concurrent threads

COPY Alignment Results back from GPU Write RESULTS to disk}

How fast is that?

Very fast indeed, using a Tesla C2050, we can match 25 million 100bp reads to the BWT in just over 1 min.

But… is this relevant?

Inexact Matching?

matching substring ‘anb’ where‘b’ is subsituted with an ‘a’

Inexact matching requires base substitution within the query substring

Search space complexity = O(9n)!

Klus et al., BMC Res Notes 2012 5:27

BarraCUDA started its life as a GPU version of BWA

It _________worked! 10% faster than 8x X5472 cores @ 3GHz BWA uses a greedy breadth-first search approach (takes

up to 40MB per thread) Not enough workspace for thousands of concurrent

kernel threads (@ 4KB) – i.e. reduced accuracy – NOT GOOD ENOUGH!

partially

BarraCUDA uses a depth-first search approach

hithit

SIMD – the catch

The very same (low level) instruction is applied to multiple data at the same time e.g. a GTX680 can do addition to 1536 data

point at a time, versus 16 on a 16-core CPU.

Branching results in serialisations

The ALUs on GPU are usually much more primitive compared to their CPU counterparts.

Branch serialisation

Branch divergence

Multi-kernel design

hithit

CPU thread queue:

Thread 1.1

Thread 1.2

Thread 1

Mapping accuracy

BWA 0.5.8 BarraCUDA

Mapping 89.95% 91.42%

Error 0.06% 0.05%

Mapping speed

5160 (1C) X5670 (1C)

5160 (2C) 5160 (4Cs) X5670 (6Cs)

M20900

160153

Porting to GPU

Simple data parallelism

Used mainly the GPU for matching

ScalabilityKlus et al., BMC Res Notes 2012 5:27

Talk Outline

Conclusions

System requirements

Hardware The system must have at least one decent GPU

▪ NVIDIA Geforce 210 will not work!

One or more PCIe x16 slots

A decent power supply with appropriate power connectors▪ 550W for one Tesla C2075, and + 225W for each additional cards▪ Don’t use any Molex converters!

Ideally, dedicate cards for computation and use a separate cheap card for display

System requirements

Software CUDA▪ CUDA toolkit▪ Appropriate NVIDIA drivers

OpenCL▪ CUDA toolkit (NVIDIA)▪ AMD APP SDK (AMD)▪ Appropriate AMD/NVIDIA drivers

System requirements

For example: CUDAtoolkit v4.2

Talk Outline

Conclusions

Coding effort

In the end, how much effort have we put in so far?

3300 lines of new code for the alignment core▪ Compared to 1000 lines in BWA▪ Still on going!

1000 lines for SA to linear space conversion

CUDA vs OpenCL

CUDA (in general) is faster and more complete Tied to NVIDIA hardware

OpenCL Still new, AMD and Apple are pushing

hard on this Can run on a variety of hardware

including CPUs

Where to start

NVIDIA developer zone http://developer.nvidia.com/

OpenCL developer zone http://developer.amd.com/zones/

OpenCLZone/Pages/default.aspx

Things to consider while coding

Different hardware has different capabilities GT200 is very different from GF100, in terms of

cache arrangement and concurrent kernel executions

GF100 is also different from GK110 where the latter can do dynamic parallelism

Low level data management is often required, e.g. earmarking data for memory type, coalescing

memory access

OpenACC directives

The easy way out! A set of compiler directives

It allows programmers to use ‘accelerator’ without explicitly writing GPU code

Supported by CAPS, Cray, NVIDIA, PGI

How it works

Source: NVIDIA

More information?

http://www.openacc-standard.org/

Talk Outline

Conclusions

Future Outlook

The game is changing quickly CPUs are also getting more cores : AMD 16-core Opteron 6200

series

Many-core platforms are evolving Intel MIC processors (50 x86 cores) NVIDIA Kepler platform (1536 cores) AMD Radeon 7900 series (2048 cores)

SIMD nodes are becoming more common in supercomputers

OpenCL

Talk Outline

Conclusions

Few software are available for NGS bioinformatics yet, more to come

BarraCUDA is one of the first attempts to accelerate NGS bioinformatics pipeline

Significant amount of coding is usually required, but more programming tools are becoming available

Many-core is still evolving rapidly and things can be very different in the next 2-3 years

Acknowledgements

IMS-MRL, CambridgeGiles YeoPetr KlusSimon Lam

NIHR-CBRC, CambridgeIan McFarlaneCassie Ragnauth

Whittle Lab, CambridgeGraham PullanTobias Brandvik

Microbiology, University College CorkDag Lyberg

Gurdon Institute, CambridgeNicole Cheung

HPCS, CambridgeStuart Rankin

NVIDIA CorporationThomas BradleyTimothy Lanfear

Thank you.

a massively parallel solution to a massively parallel problem hpc4ngs workshop, may 21-22, 2012...

bp slide

multiple data slide

phd slide

genologics slide

wikipedia slide

peak calling slide

parallel fashion slide

dna sequencing

Documents

massively parallel algorithms - mit...

massively parallel domain decomposition...

multiprocessor cache for massively parallel soc

reliably massively parallel symbolic computing

massively parallel holography and other coherent …

massively parallel exact diagonalization of strongly

a massively parallel architecture for bioinformatics

massively parallel signal processing for wireless

advances in massively parallel electromagnetic simulation

massively parallel reinforcement learning with an

massively parallel processor architectures for resource

massively parallel computing

massively parallel data processing using cuda

massively parallel tumor multigene sequencing to evaluate...

reactive molecular dynamics on massively parallel ... ·...

massively parallel functional analysis of missense

a massively parallel solid mechanics solver in alya a...

massively parallel sequencing for biodiversity

prototyping a scalable massively-parallel supercomputer

massively parallel...