
Biomanycores, a repository of interoperable open-source code for many-cores bioinformatics

Jean-Stéphane Varré, Stéphane Janot, Mathieu Giraud
[email protected]

Sequoia Bioinformatics
LIFL – UMR CNRS 8022 – Université Lille 1, France

INRIA Lille Nord-Europe, France

June 2009

J.-S. Varré, S. Janot, M. Giraud (LIFL) Biomanycores June 2009 1 / 20

Outline

High-performance computing

Graphical Processing Units and bioinformatics

biomanycores.org
  - aim of the project
  - what has been done?
  - future developments


High Performance Bioinformatics – Manycores

1970 – 2002: Moore’s law = increasing frequencies

problems: power consumption, heat dissipation

from now on: Moore’s law continues with multiple cores
  - from multicores: dual-cores, quad-cores, octo-cores...
  - to manycores:
      - Graphics processing units (GPUs): Nvidia GTX 285 ⇒ 30 × 8 cores, 1.2 GHz, 40 (×8) GFlops
      - CPU-GPU convergence: Intel Larrabee


High Performance Bioinformatics – Manycores

GPGPU = General-Purpose computation on GPU

until 2007: tweaking graphics primitives

2007: Nvidia CUDA

2009: OpenCL (Khronos Group)
  - Dec 08: 1.0 specification
  - May 09: beta release of an Nvidia compiler
  - AMD/ATI compiler coming soon

⇒ portable manycore applications?

With GPGPU...

10× / 100× peak speed-up, low costs ($50–$500)

even with loss due to parallelism, 10× speed-up is possible

(relatively) easy with CUDA / OpenCL, requires some learning


GPU + Bioinformatics – Methods

“Graphical” GPGPU (2005/06):

  Method             Speed-up     Reference
  RAxML              up to 2×     Charalambous et al. 2005
  ClustalW           up to 7×     Liu et al. 2006

CUDA (since 2007):

  Method             Speed-up     Reference
  mummerGPU          up to 10×    Schatz et al. 2007
  Smith-Waterman     up to 15×    Manavski and Valle 2008
  Neighbor-Joining   up to 26×    Liu et al. 2009
  RNAfold            up to 17×    Rizk and Lavenier 2009

∼ 10 papers between 2007 and 2009


GPU + Bioinformatics – Specific Bioinformatics HPC Events

HiComb (IEEE Workshop on High Performance Computational Biology)
  since 2002
  in conjunction with IPDPS [May 09, Rome]

PBC (Parallel Bio-Computing Workshop)
  since 2005, every two years
  in conjunction with PPAM [Sept 09, Wroclaw]

HiBi (Workshop on High Performance Computational Systems Biology)
  [Oct 09, Trento]


Sequoia Bioinformatics – LIFL, INRIA, Université Lille 1, France

H. Touzet’s group, 14 people (including 5 PhD students)

Large-scale sequence analysis

Sequence comparisons, seed-based heuristics

RNA, transcription factors, NRPS

High-Performance Bioinformatics
  - SIMD flexible read mapper (L. Noé, M. Gîrdea)
  - GPU PWM scan / P-value (22× – 77× on a GTX 280)
  - GPU ADP (6.1× – 22.8× on a GTX 280, with U. Bielefeld)
  - GPU & bit-parallelism pattern matching (ongoing)
  - Supported by NVIDIA (Professor Partnership, 2009)


GPU + Position-Weight Matrices (PWM)
Parallel Position Weight Matrices Algorithms. M. Giraud and J.-S. Varré. ISPDC’09

PWMs are used for modeling transcription factor binding sites, transcription start sites, protein domains, ...

score threshold or P-value computation: requires enumerating words

occurrences: requires quickly scanning a very long sequence
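The two tasks named above can be sketched naively in pure Python. This is not the authors' GPU algorithm, only a CPU reference under stated assumptions: the 3-column matrix `PWM` and its scores are hypothetical, the background is uniform over ACGT, and the P-value is obtained by brute-force enumeration of all 4^m words, which is exactly the exponential cost that motivates a parallel implementation.

```python
from itertools import product

ALPHABET = "ACGT"

# Hypothetical 3-column PWM (illustrative scores, not taken from the slides):
# one dict of per-nucleotide scores for each position of the motif.
PWM = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
]


def score(pwm, word):
    """Sum the column scores of `word`, which must be as long as `pwm`."""
    return sum(column[letter] for column, letter in zip(pwm, word))


def scan(pwm, sequence, threshold):
    """Return (position, score) for every window scoring at least `threshold`."""
    m = len(pwm)
    hits = []
    for i in range(len(sequence) - m + 1):
        s = score(pwm, sequence[i:i + m])
        if s >= threshold:
            hits.append((i, s))
    return hits


def p_value(pwm, threshold):
    """Fraction of all 4^m words scoring at least `threshold` (uniform background)."""
    m = len(pwm)
    count = sum(1 for word in product(ALPHABET, repeat=m)
                if score(pwm, word) >= threshold)
    return count / float(len(ALPHABET) ** m)


hits = scan(PWM, "TTACGACGT", threshold=6.0)
pval = p_value(PWM, threshold=6.0)
print(hits)
print(pval)
```

Every window in `scan` and every word in `p_value` is scored independently of the others, so both loops are natural candidates for one-thread-per-item parallelisation.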

[Figure: sequence logo of a PWM (WebLogo 3.0), y-axis in bits]

[Figure: two plots of speedup versus matrix length. First plot (logarithmic scale, 1× to 100×): CPU (one thread), GeForce 8800, GTX 280, GTX 280 (+ atomic). Second plot (5× to 25×): CPU (one thread), GeForce 8800, GTX 280]


HPC Bioinformatics for human beings?

Research in High-Performance Computing
  - nice ideas, nice papers
  - but not always exploited

A few HPC bioinformatics framework projects...

⇒ far from the everyday usage of bioinformaticians and biologists



www.biomanycores.org

1. Share OpenCL code (currently CUDA)
   = public repository, open-source

2. Make it easy
   = Bio* integration

3. Benchmark
   algorithms, implementations, hardware


Already included projects

SWcuda – Smith-Waterman protein alignment
  - CRIBI Genomics, University of Padova, Italy
  - S. A. Manavski, G. Valle, CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment, BMC Bioinformatics 2008, 9(S2):S10

pknotsRG – pseudoknots of an RNA sequence
  - Universität Bielefeld, Germany
  - J. Reeder, P. Steffen, R. Giegerich, pknotsRG: RNA pseudoknot folding including near-optimal structures and sliding windows, Nucl. Acids Res., 2007

cudaPWM – scan a PWM against a DNA sequence
  - Sequoia, LIFL, INRIA, Université Lille 1
  - M. Giraud, J.-S. Varré, Parallel Position Weight Matrices Algorithms, ISPDC’09

Interfaces to BioJava 1.6, BioPerl 1.52, and Biopython 1.50b
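For readers unfamiliar with the first project listed above, the following is a minimal CPU sketch of the Smith-Waterman local alignment score. It is not the SWcuda code. The match, mismatch and gap values are illustrative assumptions, and the gap penalty is linear; protein aligners typically use a substitution matrix such as BLOSUM62 with affine gap penalties instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings `a` and `b` (linear gap penalty).

    Only two rows of the dynamic-programming matrix are kept, since each
    cell depends on its left, upper and upper-left neighbours only.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))
```

Searching a bank means computing this score for one query against many independent sequences, which is what makes the problem attractive for a manycore device.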


Biopython + CRIBI SW

    from Bio import SeqIO
    from Biomanycores import PadovaSW

    # bank of sequences to search
    bank = SeqIO.parse(open("uniprot-start.fa"), "fasta")

    # run SWcuda for each query and print the parsed results
    for query in SeqIO.parse(open("prot64.fa"), "fasta"):
        handle = PadovaSW.run(query, bank)
        result = PadovaSW.SWParser().parse()
        print(result)


Biopython + CRIBI SW – Tests on a GeForce 8800

    biopython$ time python sw-demo.py cuda
    ** cd ../bin/ ; ./swcuda config.gpu ../tmp/swcuda.fa ../tmp/swcuda.bank
    ** 1.846s
    12098 results...
    [(84.0, 0, 0, 'sp|P30350|ADH1_ANAPL'), (81.0, 0, 0, 'sp|P23991|ADH1_CHICK'), (81.0, 0, 0, 'sp|P72324|ADHI_RHOS4'), (79.0, 0, 0, 'sp|P29274|AA2AR_HUMAN'), ...]
    real 2.81 user 1.79 sys 0.27

    biopython$ time python sw-demo.py cpu
    ** cd ../bin/ ; ./swcuda config.cpu ../tmp/swcuda.fa ../tmp/swcuda.bank
    ** 16.604s
    12098 results...
    [(84.0, 0, 0, 'sp|P30350|ADH1_ANAPL'), (81.0, 0, 0, 'sp|P23991|ADH1_CHICK'), (81.0, 0, 0, 'sp|P72324|ADHI_RHOS4'), (79.0, 0, 0, 'sp|P29274|AA2AR_HUMAN'), ...]
    real 17.57 user 16.42 sys 0.14

10× – 15× paper speedup (BMC Bioinformatics 2008, 9S2)

8.7× application speedup

6.2× final speedup (including Biopython/Biomanycores)
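Why the final speedup is lower than the application speedup can be read off the timings above. The sketch below assumes that the `**` lines report the time spent inside the swcuda binary and that `real` is the wall-clock time of the whole Python script. The ratio of the two printed application timings comes out slightly above the 8.7× quoted on the slide, which presumably comes from a different run.

```python
# Timings in seconds, copied from the two runs shown above.
gpu = {"application": 1.846, "real": 2.81}
cpu = {"application": 16.604, "real": 17.57}


def speedup(slow, fast):
    """Ratio of the slow running time to the fast one."""
    return slow / fast


app_speedup = speedup(cpu["application"], gpu["application"])
final_speedup = speedup(cpu["real"], gpu["real"])

# Time spent outside the external binary (interpreter start-up, file
# conversion, result parsing), assuming "real" covers the whole script.
gpu_overhead = gpu["real"] - gpu["application"]
cpu_overhead = cpu["real"] - cpu["application"]

print("application speedup: %.2f" % app_speedup)
print("final speedup:       %.2f" % final_speedup)
print("overhead (gpu, cpu): %.3f s, %.3f s" % (gpu_overhead, cpu_overhead))
```

Under these assumptions the overhead is close to one second in both runs. It is a fixed cost that does not shrink with the GPU, so it weighs more heavily on the faster run.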


BioPerl + CRIBI SW

BioPerl tutorial

    use Bio::Tools::pSW;

    $factory = new Bio::Tools::pSW('-matrix' => 'blosum62.bla', '-gap' => 12, '-ext' => 2);

    $factory->align_and_show($seq1, $seq2, STDOUT);
    $aln = $factory->pairwise_alignment($seq1, $seq2);

With biomanycores

    use Bio::SeqIO;
    use Biomanycores::PadovaSW;

    $factory = PadovaSW->new();

    $factory->swcuda($inputseq, $bank);
    @r = $factory->parse_result();


BioJava + PWM

    import org.biojavax.bio.seq.RichSequence;
    import org.biojava.bio.dp.SimpleWeightMatrix;
    ...
    import org.biomanycores.bio.pwm.*;
    ...
    {
        LillePWMScan scanner = new LillePWMScan(launcher);

        // read the sequence
        RichSequenceIterator it = null;
        BufferedReader in1 = new BufferedReader(new FileReader(args[1]));
        it = RichSequence.IOTools.readFastaDNA(in1, null);
        RichSequence query = it.nextRichSequence();

        // read a weight matrix
        SimpleWeightMatrix pwm = PFMParser.PARSER.get(args[2], alph, "ACGT");

        // scan the sequence
        List<PWMHit> al = scanner.scan(query, pwm, threshold);
    }


Challenges

Different APIs, different philosophies
  - BioJava: no external program execution?
  - Object representation (alignments)
  - Object existence (PWM)

Minimal modifications to the source code of applications
  - CribiSW: command-line arguments

Real-world pipelines?
  - Bio* are not HPC frameworks
  - Succession of several programs

Usage: requires CUDA / OpenCL SDK
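Wrapping an unmodified application through its command-line arguments can be sketched as follows. This is not the Biomanycores implementation. The helper names `build_swcuda_command` and `run_timed` are hypothetical; only the argument order (configuration file, query file, bank file) is taken from the swcuda command line shown in the Biopython demo log. The demonstration at the end runs the Python interpreter itself as a stand-in so that the sketch works without a GPU or the swcuda binary.

```python
import subprocess
import sys
import time


def build_swcuda_command(config, query_fasta, bank_fasta, binary="./swcuda"):
    """Argument list in the order seen in the demo log (hypothetical helper)."""
    return [binary, config, query_fasta, bank_fasta]


def run_timed(command, cwd=None):
    """Run an external command; return (exit code, stdout bytes, elapsed seconds)."""
    start = time.time()
    process = subprocess.Popen(command, cwd=cwd, stdout=subprocess.PIPE)
    stdout, _ = process.communicate()
    return process.returncode, stdout, time.time() - start


# What a GPU run could look like (not executed here):
command = build_swcuda_command("config.gpu", "../tmp/swcuda.fa", "../tmp/swcuda.bank")
print(" ".join(command))

# Stand-in external program so the sketch runs anywhere.
code, out, elapsed = run_timed([sys.executable, "-c", "print('ok')"])
print(code, out.strip(), "%.3fs" % elapsed)
```

Switching between GPU and CPU then only means passing a different configuration file, which keeps the wrapped application's source untouched.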


Licenses

Projects must have an open-source license

Bio* interfaces: same license as the parent API
  - BioJava: LGPL 2.1
  - BioPerl: Perl Artistic License
  - Biopython: Biopython license


www.biomanycores.org

1. Share OpenCL code (currently CUDA)
   = public repository, open-source
   ⇒ bring new projects

2. Make it easy
   = Bio* integration
   ⇒ integrate new projects
   ⇒ improve current interfaces

3. Benchmark
   algorithms, implementations, hardware
   ⇒ think!
