
SAAHPC 2012

Wednesday July 11th, 2012

FPGA-Accelerated Isotope Pattern Calculator for Use in Simulated Mass Spectrometry Peptide and Protein Chemistry

Carlo Pascoe (speaker), David Box, Herman Lam, Alan George

NSF Center for High-Performance Reconfigurable Computing (CHREC)

Dept. of Electrical and Computer Engineering, University of Florida

Gainesville FL, USA

Email: {pascoe, box, hlam, george}@chrec.org

2

Motivation

Protein Identification Algorithms (PIAs)
  Heavily utilized in pharmaceutical research and cancer diagnostics
  Current industry-standard methods are unreliable (at best!) [1,2]
  Highly accurate algorithms with the potential to revolutionize accuracy exist, but they are unused or underused because of extreme computational intensity and prohibitive execution times
  Must be accelerated for feasible use

Approach: Accelerate the Isotope Pattern Calculator (IPC), a dominant subroutine common in de novo PIAs
  Provide a customizable design for general use
  Capitalize on reconfigurable computing at scale to achieve sustainable supercomputing performance

Objective: Develop a sustainable solution for increasing the speed, and thus the achievable accuracy, of many PIAs

Presentation Outline

Background
  Protein Identification
  De Novo PIAs
  Theoretical Mass Spectrum Generation
IPC Problem Description
  Elemental Isotope SADs
  Stage 1: SED Calculation
  Stage 2: SED Combination
  Additional IPC Functionality
A Configurable & Scalable IPC Hardware Architecture
  SED Calculation Reduced to LUTs
  SED Iterative Combination in Hardware
Performance Evaluation on Novo-G
  Single-FPGA Performance
  Multi-FPGA Performance
Summary & Conclusions
Future Work
Q&A

3

SAD: Single-Atom Distribution, SED: Single-Element Distribution

Protein Identification

Protein: biochemical molecule consisting of one or more polypeptides
  Macromolecular chains of linked amino acids
Current protein ID approach
  Methodically fragment protein sample
  Analyze with mass spectrometer
  Employ PIAs to generate string representing amino acid primary structure
  Algorithms classified as database or de novo

4

[Figure: three panels captioned "This…", "To this…", "To this…", ending in the peptide amino acid sequence RPPGFSPFR]

De Novo PIAs

General de novo approach
  Make an educated guess for the amino acid string
  Generate theoretical mass spectra and compare to the experimental spectrum
  Iteratively refine the guess until theoretical and experimental spectra match
Theoretically need to consider all linear combinations of amino acids
  Number of candidates grows exponentially with final sequence length
Employ diverse heuristic pruning methods to limit the protein search space
  Necessary for practical use on conventional computing systems
  Often leads to false identifications (e.g., N and GG can have the same mass)

5

By accelerating a key computation common to many de novo algorithms, algorithm developers can employ less restrictive pruning criteria, potentially allowing a greater degree of accuracy in less time
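The N versus GG ambiguity and the exponential candidate growth can be checked with a few lines of Python. This is a minimal sketch: the residue formulas (glycine C2H3NO, asparagine C4H6N2O2) and the monoisotopic masses are standard values assumed here, not taken from the slides, and all function names are illustrative.

```python
from collections import Counter

# Monoisotopic masses (assumed standard values, not from the slides).
MONO_MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

# Elemental composition of amino acid residues (amino acid minus H2O).
RESIDUES = {
    "G": Counter({"C": 2, "H": 3, "N": 1, "O": 1}),  # glycine
    "N": Counter({"C": 4, "H": 6, "N": 2, "O": 2}),  # asparagine
}

def composition(sequence):
    """Sum the residue compositions of a string of one-letter codes."""
    total = Counter()
    for aa in sequence:
        total += RESIDUES[aa]
    return total

def mono_mass(sequence):
    """Monoisotopic mass of the summed residues."""
    return sum(MONO_MASS[el] * n for el, n in composition(sequence).items())

def candidate_count(length, alphabet=20):
    """Number of linear sequences of a given length over 20 amino acids."""
    return alphabet ** length

print(composition("N") == composition("GG"))  # same atoms, hence same mass
print(mono_mass("N"), mono_mass("GG"))
print(candidate_count(9))                     # candidates for a 9-residue peptide
```

Because one asparagine residue and two glycine residues contain exactly the same atoms, no mass measurement of the intact fragment can tell them apart, which is why aggressive pruning can lock in a wrong guess.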

Theoretical Mass Spectrum Generation

Majority of execution time for many highly accurate de novo algorithms
Calculation comprises:
  Decomposition of the candidate sequence string into many amino acid substrings
  Generation of probable mass contributions for each predicted substring
  Histogram-like combination of probable masses to form a theoretical distribution directly comparable to experimental mass spectra
Complicated by the fact that, in nature, elements occur as a mixture of isotopes
  Neutron quantity differences imply a distribution of possible molecule masses
Use IPC subroutine to predict possible masses
  Enumerates all possible combinations of constituent element isotopes to produce a list of mass/probability pairs

6

Although a relatively simple calculation for the smallest of molecules, IPC executions for medium- to large-sized molecules quickly become a computational bottleneck of many chemistry applications, most notably de novo protein identification.

IPC Problem Description

Given a chemical formula and a database of element isotope SADs, produce a list of mass/probability pairs representing the distribution of possible molecular masses

Analogous to evaluating $\prod_{j} \left( \sum_{i} E_{j,i} \right)^{N_j}$, where $E_{j,i}$ represents the ith isotope of the jth unique element in a chemical formula containing $N_j$ atoms of the jth element type

Problem reducible to two-stage process

1. Compute each single element distribution (SED)

2. Combine SEDs to form final distribution
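One way to make the two stages concrete, offered as an interpretation rather than the slide's own notation, is to write each isotope as a term $p\,x^{m}$ with probability $p$ and mass $m$ (the formal variable $x$ is an assumption introduced here). For two hydrogen atoms with isotopes $(m_0, p_0)$ and $(m_1, p_1)$:

```latex
\left(p_0 x^{m_0} + p_1 x^{m_1}\right)^{2}
  = p_0^{2}\, x^{2 m_0} + 2 p_0 p_1\, x^{m_0 + m_1} + p_1^{2}\, x^{2 m_1}
```

Each term of the expansion is one mass/probability pair: the exponent is the mass and the coefficient is the probability. Stage 1 expands the power for each element, and Stage 2 multiplies the expanded factors together.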

7


Elemental Isotope SADs

8

Stage 1: SED Calculation

Consider SEDs of hydrogen computed from its SAD

9

ALGORITHM 1. Calculate H_N SED
p0 ← 0.999885, m0 ← 1.007825
p1 ← 0.000115, m1 ← 2.014102
FOR n1 ← 0 to N
    n0 ← N – n1
    p ← C(N, n1) · p0^n0 · p1^n1    (C(N, n1) is the binomial coefficient)
    m ← n0·m0 + n1·m1
    PRINT (m, p)
END LOOP

H1:
M= 1.007825, p= 9.99885e-01
M= 2.014102, p= 1.15e-04

H2:
M= 2.01565, p= 9.99770e-01
M= 3.02193, p= 2.2997e-04
M= 4.02820, p= 1.3225e-08

HN: a really long list with many low-probability peaks! Impose a threshold probability.
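The hydrogen calculation can be sketched in Python as a binomial expansion. This is an illustrative reading of ALGORITHM 1, not the authors' code; the function name and the `threshold` argument are assumptions made for this example.

```python
from math import comb

def two_isotope_sed(n_atoms, p0, m0, p1, m1, threshold=0.0):
    """Return (mass, probability) peaks for n_atoms of a two-isotope element."""
    peaks = []
    for n1 in range(n_atoms + 1):      # atoms of the heavier isotope
        n0 = n_atoms - n1              # atoms of the lighter isotope
        prob = comb(n_atoms, n1) * p0**n0 * p1**n1
        mass = n0 * m0 + n1 * m1
        if prob >= threshold:          # optional threshold probability
            peaks.append((mass, prob))
    return peaks

# Hydrogen SAD values from the slide.
H = dict(p0=0.999885, m0=1.007825, p1=0.000115, m1=2.014102)

for mass, prob in two_isotope_sed(2, **H):
    print(f"M= {mass:.5f}, p= {prob:.4e}")

# With a threshold, even H256 collapses to a handful of peaks.
print(len(two_isotope_sed(256, **H)), len(two_isotope_sed(256, **H, threshold=1e-5)))
```

Without a threshold the list always has N + 1 entries; with one, only the few peaks that carry meaningful probability survive.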

Stage 1: SED Calculation

Can modify ALGORITHM 1 to handle any element with two stable isotopes (e.g., Helium, Carbon, Nitrogen, etc.)

What if an element has more than two stable isotopes? Consider SEDs of Sulfur

10

ALGORITHM 2. Calculate S_N SED
p0 ← 0.9493, m0 ← 31.972079
p1 ← 0.0076, m1 ← 32.971459
p2 ← 0.0429, m2 ← 33.967867
p3 ← 0.0002, m3 ← 35.967081
FOR n1 ← 0 to N
    FOR n2 ← 0 to N – n1
        FOR n3 ← 0 to N – (n1 + n2)
            n0 ← N – (n1 + n2 + n3)
            p ← N! / (n0! · n1! · n2! · n3!) · p0^n0 · p1^n1 · p2^n2 · p3^n3
            m ← n0·m0 + n1·m1 + n2·m2 + n3·m3
            PRINT (m, p)
        END LOOP
    END LOOP
END LOOP

S_N: a REALLY, REALLY long list!

Computation Significantly Increases as the Number of Stable Isotopes Increases
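The nested loops generalize to any number of stable isotopes as a multinomial expansion. The following is a minimal sketch under that reading; the recursive helper and the `threshold` argument are assumptions made for illustration, and the sulfur values are the ones listed in ALGORITHM 2.

```python
from math import factorial

def sed(n_atoms, sad, threshold=0.0):
    """Enumerate the single-element distribution for n_atoms of one element.

    sad is a list of (mass, probability) pairs, one per stable isotope.
    """
    peaks = []

    def recurse(isotope, remaining, counts):
        if isotope == len(sad) - 1:
            counts = counts + [remaining]        # last isotope takes the rest
            coeff = factorial(n_atoms)           # multinomial coefficient
            prob, mass = 1.0, 0.0
            for n, (m, p) in zip(counts, sad):
                coeff //= factorial(n)
                prob *= p ** n
                mass += n * m
            prob *= coeff
            if prob >= threshold:
                peaks.append((mass, prob))
            return
        for n in range(remaining + 1):
            recurse(isotope + 1, remaining - n, counts + [n])

    recurse(0, n_atoms, [])
    return peaks

SULFUR = [(31.972079, 0.9493), (32.971459, 0.0076),
          (33.967867, 0.0429), (35.967081, 0.0002)]

for n in (1, 2, 16, 64):
    full = sed(n, SULFUR)
    kept = sed(n, SULFUR, threshold=1e-5)
    print(f"S{n}: {len(full)} peaks, {len(kept)} above 1e-5")
```

For four isotopes the unfiltered list has C(N + 3, 3) entries, so the cost grows roughly with the cube of the atom count rather than linearly as in the two-isotope case.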

Stage 2: SED Combination

With Stage 1 complete, analogous to evaluating $\prod_{j} \left( \sum_{i} S_{j,i} \right)$, where $S_{j,i}$ represents the ith peak from the SED generated for the jth unique element

Removal of exponent allows for straightforward combination

11

Simple Example: H2O

H2 SED:
M= 2.01565, p= 9.9977e-01
M= 3.02193, p= 2.2997e-04
M= 4.02820, p= 1.3225e-08

O1 SED:
M= 15.9949, p= 9.9757e-01
M= 16.9991, p= 3.8e-04
M= 17.9992, p= 2.05e-03

Combination (masses add, probabilities multiply):
M= 2.01565 + 15.9949 = 18.0106, p= 9.9977e-01 * 9.9757e-01 = 9.9734e-01
M= 2.01565 + 16.9991 = 19.0148, p= 9.9977e-01 * 3.8e-04 = 3.7991e-04
M= 2.01565 + 17.9992 = 20.0149, p= 9.9977e-01 * 2.05e-03 = 2.0495e-03
M= 3.02193 + 15.9949 = 19.0168, p= 2.2997e-04 * 9.9757e-01 = 2.2941e-04
M= 3.02193 + 16.9991 = 20.0210, p= 2.2997e-04 * 3.8e-04 = 8.7389e-08
M= 3.02193 + 17.9992 = 21.0211, p= 2.2997e-04 * 2.05e-03 = 4.7144e-07
M= 4.02820 + 15.9949 = 20.0231, p= 1.3225e-08 * 9.9757e-01 = 1.3193e-08
M= 4.02820 + 16.9991 = 21.0273, p= 1.3225e-08 * 3.8e-04 = 5.0255e-12
M= 4.02820 + 17.9992 = 22.0274, p= 1.3225e-08 * 2.05e-03 = 2.7111e-11
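The pairwise combination can be sketched as a Cartesian product in which masses add and probabilities multiply. This is an illustrative sketch: `combine` is a hypothetical helper name, and the H2 distribution is derived here from the hydrogen SAD values rather than copied from the slide.

```python
from itertools import product

def combine(dist_a, dist_b):
    """Pairwise combination: masses add, probabilities multiply."""
    return [(ma + mb, pa * pb) for (ma, pa), (mb, pb) in product(dist_a, dist_b)]

# H2 SED derived from the hydrogen SAD (binomial expansion for two atoms).
p0, m0 = 0.999885, 1.007825
p1, m1 = 0.000115, 2.014102
h2 = [(2 * m0, p0 * p0), (m0 + m1, 2 * p0 * p1), (2 * m1, p1 * p1)]

# O1 SED is simply the oxygen SAD from the slide.
o1 = [(15.9949, 9.9757e-01), (16.9991, 3.8e-04), (17.9992, 2.05e-03)]

h2o = combine(h2, o1)
for mass, prob in h2o:
    print(f"M= {mass:.4f}, p= {prob:.4e}")
```

A formula with more elements is handled by folding `combine` over the remaining SEDs one at a time, so the list length is the product of the SED lengths unless peaks are filtered along the way.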

Additional IPC Functionality

12

Simple Example Continued, H2O:
M= 18.0106, p= 9.9734e-01
M= 19.0148, p= 3.7991e-04
M= 20.0149, p= 2.0495e-03
M= 19.0168, p= 2.2941e-04
M= 20.0210, p= 8.7389e-08
M= 21.0211, p= 4.7144e-07
M= 20.0231, p= 1.3193e-08
M= 21.0273, p= 5.0255e-12
M= 22.0274, p= 2.7111e-11

Probability Threshold: Filter prob < PT (e.g., PT = 1.0e-05)
M= 18.0106, p= 9.9734e-01
M= 19.0148, p= 3.7991e-04
M= 20.0149, p= 2.0495e-03
M= 19.0168, p= 2.2941e-04

Sort by Mass
M= 18.0106, p= 9.9734e-01
M= 19.0148, p= 3.7991e-04
M= 19.0168, p= 2.2941e-04
M= 20.0149, p= 2.0495e-03
M= 20.0210, p= 8.7389e-08
M= 20.0231, p= 1.3193e-08
M= 21.0211, p= 4.7144e-07
M= 21.0273, p= 5.0255e-12
M= 22.0274, p= 2.7111e-11

Sort by Probability
M= 18.0106, p= 9.9734e-01
M= 20.0149, p= 2.0495e-03
M= 19.0148, p= 3.7991e-04
M= 19.0168, p= 2.2941e-04
M= 21.0211, p= 4.7144e-07
M= 20.0210, p= 8.7389e-08
M= 20.0231, p= 1.3193e-08
M= 22.0274, p= 2.7111e-11
M= 21.0273, p= 5.0255e-12

Window Filter: Filter any peaks after the Nth (e.g., N = 6)
Applied to the probability-sorted list:
M= 18.0106, p= 9.9734e-01
M= 20.0149, p= 2.0495e-03
M= 19.0148, p= 3.7991e-04
M= 19.0168, p= 2.2941e-04
M= 21.0211, p= 4.7144e-07
M= 20.0210, p= 8.7389e-08
Applied to the mass-sorted list:
M= 18.0106, p= 9.9734e-01
M= 19.0148, p= 3.7991e-04
M= 19.0168, p= 2.2941e-04
M= 20.0149, p= 2.0495e-03
M= 20.0210, p= 8.7389e-08
M= 20.0231, p= 1.3193e-08

Mass Peak Centroiding: Essentially a moving average filter over close peaks, weighted by probability
M= 18.0106, p= 9.9734e-01
M= 19.0156, p= 6.0932e-04
M= 20.0149, p= 2.0496e-03
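The four post-processing options can be sketched as small list operations. This is a minimal illustration, not the authors' implementation: the grouping rule used for centroiding (merge peaks within `width` of the first peak in a group) and all function names are assumptions, and the input is the four H2O peaks that survive the 1.0e-05 threshold.

```python
def threshold_filter(peaks, pt):
    """Drop peaks whose probability is below pt."""
    return [(m, p) for m, p in peaks if p >= pt]

def sort_by_mass(peaks):
    return sorted(peaks, key=lambda mp: mp[0])

def sort_by_probability(peaks):
    return sorted(peaks, key=lambda mp: mp[1], reverse=True)

def window_filter(peaks, n):
    """Keep only the first n peaks of an already sorted list."""
    return peaks[:n]

def centroid(peaks, width):
    """Merge mass-sorted peaks lying within `width` of the start of a group.

    The merged mass is the probability-weighted mean; probabilities add.
    """
    groups = []
    for m, p in sort_by_mass(peaks):
        if groups and m - groups[-1]["start"] < width:
            groups[-1]["mp"] += m * p
            groups[-1]["p"] += p
        else:
            groups.append({"start": m, "mp": m * p, "p": p})
    return [(g["mp"] / g["p"], g["p"]) for g in groups]

peaks = [(18.0106, 9.9734e-01), (19.0148, 3.7991e-04),
         (20.0149, 2.0495e-03), (19.0168, 2.2941e-04)]

for mass, prob in centroid(threshold_filter(peaks, 1.0e-05), width=0.5):
    print(f"M= {mass:.4f}, p= {prob:.4e}")
```

Centroiding reduces the two peaks near mass 19 to a single peak at their weighted mean, which is what shortens the output stream when the downstream consumer only needs nominal-mass resolution.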

A Configurable & Scalable IPC Hardware Architecture

13

Adapt the two-stage procedure to a configurable & scalable hardware architecture capable of converting a stream of independent chemical formula queries into a delimited stream of variable-quantity mass/probability pairs

[Figure: block diagram. Stream From Host, Single-Element Distribution Lookup & Query Scheduler, Distribution Calculators 1 … M, Result Serializer, Stream To Host]

Stream consists of chemical formula query information and control data
Single module handles Stage 1 functionality
Multiple modules handle Stage 2 computation
No. of modules independent from input stream data and host
Result distributions returned in same order as received in input stream
SED Calculation Reduced to LUTs

14

Precompute SEDs exactly and pull SEDs from LUTs at runtime (vs. SADs vs. FCFDs)
SEDs presorted by probability, filtered at runtime with configurable threshold prob.
Single bank of LUTs feeds all distribution calculators
In-stream control [3]
Token-based round-robin scheduler

Sample LUT address space for SEDs (addresses 0 to 2047):
  H1 − H256
  C1 − C256
  N1 − N256
  O1 − O256
  S1 − S64
  60 other elements with 16 SEDs per element

[Figure: Single-Element Distribution Lookup & Query Scheduler. Stream From Host feeds the Query Scheduler; BRAM 1 … BRAM N each pass through a Filter; outputs and Control Data are pushed (Push 1 … Push M, Stream Pop) into FIFO (1,1) … FIFO (N+1,1) through FIFO (1,M) … FIFO (N+1,M). Annotations: "Equation from Slide 8"; "Insert Centroiding Here if so Desired"]

SAD: Single-Atom Distribution, SED: Single-Element Distribution, FCFD: Full Chemical Formula Distribution
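The sample address space works out to exactly 2048 entries (4 × 256 + 64 + 60 × 16). A lookup into such a table could be sketched as below; the block ordering, the zero-based addresses, and the `other_index` argument are all assumptions made for illustration, since the slide only lists the block sizes.

```python
# Hypothetical layout: blocks placed in the order the slide lists them.
MAJOR = [("H", 256), ("C", 256), ("N", 256), ("O", 256), ("S", 64)]
OTHER_COUNT, OTHER_DEPTH = 60, 16   # 60 other elements, 16 SEDs each

def build_bases():
    """Assign each major element a base address and return the next free one."""
    bases, addr = {}, 0
    for symbol, depth in MAJOR:
        bases[symbol] = (addr, depth)
        addr += depth
    return bases, addr

BASES, OTHER_BASE = build_bases()

def lut_address(symbol, n_atoms, other_index=None):
    """Address of the precomputed SED for n_atoms of an element."""
    if symbol in BASES:
        base, depth = BASES[symbol]
    else:
        if other_index is None or not 0 <= other_index < OTHER_COUNT:
            raise ValueError("unknown element")
        base, depth = OTHER_BASE + other_index * OTHER_DEPTH, OTHER_DEPTH
    if not 1 <= n_atoms <= depth:
        raise ValueError("atom count outside the precomputed range")
    return base + n_atoms - 1

print(lut_address("H", 1), lut_address("S", 64), lut_address("Fe", 16, other_index=59))
```

The per-element depth caps the atom counts a query may contain, so the table sizes trade on-chip memory against the largest molecule the design can serve.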

SED Iterative Combination in Hardware

15

A single-cycle SED combination architecture sized for the worst case is excessively wasteful when processing the common case; employ iterative combination to boost hardware utilization

X: No. of parallel multipliers and adders
Y: Buffer depth

Result reporting circuitry operates independently of distribution calculation

[Figure: Distribution Calculator M block diagram. Labeled blocks: FIFO (1,M) … FIFO (N+1,M), Input Switch Network, Mult/Adder Bank, Filter, Pre-Sort, Insertion Sort Logic, Current Iteration Buffer, Previous Iteration Buffer, Final Result Buffer, Distribution Calculation Controller, Result Collection FIFO, Result Reporting Controller, To Result Serializer]

ALGORITHM 3. Distribution Calculator Procedure

WHILE Control ≠ “done”
    SED[1…N] ← FIFO[1…N].pop(), Control ← FIFO[N+1].pop()
    IF Control = “begin”
        PrevItBuff[1…N] ← SED[1…N], PrevItBuff[N+1…Y] ← (-1,0)
        CurrItBuff[1…Y] ← (-1,0)
    IF Control = “middle” or Control = “end”
        WHILE tmp ← PrevItBuff[1…Y].shift() ≠ (-1,0)
            i ← 1
            WHILE i ≤ N and SED[i].prob > 0
                MultAdd[1…X].mass ← SED[i…i+X−1].mass + tmp.mass
                MultAdd[1…X].prob ← SED[i…i+X−1].prob ∗ tmp.prob
                PSort[1…X] ← Sort(Filter(MultAdd[1…X], TP))
                CurrItBuff[1…Y] ← InSort(CurrItBuff[1…Y], PSort[1…X])
                i ← i + X
            END LOOP
        END LOOP
        PrevItBuff[1…Y] ← CurrItBuff[1…Y]
        CurrItBuff[1…Y] ← (-1,0)
        IF Control = “end”
            FinalResBuff[1…Y] ← PrevItBuff[1…Y]
END LOOP
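The iterative procedure can be emulated in software to see how X, Y and TP interact. This is a rough sketch, not the hardware design: it assumes the buffers are kept sorted by mass and that the window keeps the Y lightest peaks, neither of which the slide states, and it replaces the insertion sort logic with Python's `sorted`.

```python
def distribution_calculator(seds, x=2, y=128, tp=0.0):
    """Iteratively combine SEDs, keeping at most y peaks sorted by mass."""
    prev = sorted(seds[0])[:y]                   # "begin": load the first SED
    for sed in seds[1:]:                         # "middle" / "end" iterations
        curr = []
        for t_mass, t_prob in prev:              # shift peaks out of PrevItBuff
            for i in range(0, len(sed), x):      # x multiply/add units per step
                batch = [(m + t_mass, p * t_prob) for m, p in sed[i:i + x]]
                batch = sorted(mp for mp in batch if mp[1] >= tp)  # Filter + Pre-Sort
                curr = sorted(curr + batch)[:y]  # stands in for InSort
        prev = curr                              # swap buffers for the next SED
    return prev

# H2 derived from the hydrogen SAD, O1 taken from the oxygen SAD.
p0, m0, p1, m1 = 0.999885, 1.007825, 0.000115, 2.014102
h2 = [(2 * m0, p0 * p0), (m0 + m1, 2 * p0 * p1), (2 * m1, p1 * p1)]
o1 = [(15.9949, 9.9757e-01), (16.9991, 3.8e-04), (17.9992, 2.05e-03)]

for mass, prob in distribution_calculator([h2, o1], x=2, y=6, tp=0.0):
    print(f"M= {mass:.4f}, p= {prob:.4e}")
```

Each inner step handles at most X products, so a larger X shortens the loop at the cost of more multipliers and adders, while Y bounds how many peaks are carried from one SED to the next.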

Performance Evaluation on Novo-G

The previously discussed hardware architecture was implemented in VHDL and tested on Novo-G [4,5]

Initial experiments ran on a single Altera Stratix IV E530 FPGA in a GiDEL PROCStar IV board, with an Intel Xeon E5620 CPU for host support

16

Novo-G Annual Growth
  2009: 96 top-end Stratix-III FPGAs, each with 4.25GB SDRAM
  2010: 96 more Stratix-III FPGAs, each with 4.25GB SDRAM
  2011: 96 top-end Stratix-IV FPGAs, each with 8.50GB SDRAM
  2012: 96 more Stratix-IV FPGAs, each with 8.50GB SDRAM

Single-device implementation scaled up to a single Novo-G “ps4” compute node
  i.e., up to 16 E530s in 4 PROCStar IVs
  Implications of scaling to multiple compute nodes of Novo-G discussed

Software baseline: highly optimized, serial C++ code mirroring hardware algorithm
  Executed on single E5620 core
  Orders of magnitude faster than code at [6]
  Hardware and software results compared to confirm hardware correctness

Single-FPGA Performance

17

TABLE I. Single-FPGA performance for several parameter configurations.

Columns N through M make up the Configuration*.

No.   N   X    Y  Qi.f Mass  Qi.f Prob  Cen   M  Freq (MHz)  Speedup†/DC  Speedup†/FPGA
1.   16   1  128      16.16       1.31    N  12         115            −             72
2.   16   1  128       14.8       1.15    N  20         145            −            115
3.   16   2  128       14.8       1.15    N  15         145            −            115
4.   16   3  128       14.8       1.15    N  10         110            −             87
5.   12   2  128       14.8       1.15    N  14         155            −            123
6.   16   2   80       14.8       1.15    N  21         150            −            120
7.   12   2   80       14.8       1.15    N  22         155            −            127
8.   16   1  128      16.16       1.31    Y  11         120           17            186
9.   16   1  128      14.12       1.23    Y  13         135           18            236
10.  16   1  128       14.8       1.15    Y  16         140           24            384
11.  16   1  128      14.12       1.23    Y  13         135           18            236
12.  16   2  128      14.12       1.23    Y  10         130           32            325
13.  16   3  128      14.12       1.23    Y   6         100           35            214
14.  16   2  128      14.12       1.23    Y  10         130           32            325
15.  12   2  128      14.12       1.23    Y  10         135           34            338
16.   8   2  128      14.12       1.23    Y  11         135           35            381
17.  16   2  100      14.12       1.23    Y  13         135           34            444
18.  16   2   80      14.12       1.23    Y  14         130           32            454
19.  16   2   60      14.12       1.23    Y  17         130           33            566
20.  12   2   80      14.12       1.23    Y  15         135           34            516

* N: max peaks/SED, X: number of parallel peak computations per DC, Y: max peaks/output-window, Qi.f: fixed-point word width in bits (integer.fractional), Cen: centroiding capability enabled (Yes/No), M: DCs per FPGA.

† w.r.t. C++ software processing 16×2^20 statistically representative queries (based on relative elemental abundance in amino acids) in 821 seconds on a single E5620 core.

Performance Trends for Various IPC Parameter Configurations

Configurations bandwidth limited
  Computation-bound problem in software becomes I/O-bound in FPGAs
Reducing calculation word width
  Reduced logic usage & increased operating frequency vs. reduced result precision
Increasing parallel computations per DC
  Increased operations per clock cycle vs. increased logic usage, reduced routability, & reduced operating frequency
Reducing distribution window width
  Reduced logic usage vs. reduced result exactness
Suitable “sweet spot,” achieving remarkable speedup while ensuring results remain scientifically relevant
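The precision side of the word-width tradeoff can be estimated directly from the Qi.f formats in Table I. The sketch below assumes an unsigned fixed-point representation with round-to-nearest and saturation; the slides do not say how the hardware rounds, so treat these choices and the function names as illustrative.

```python
def resolution(frac_bits):
    """Smallest representable step of a Qi.f format."""
    return 2.0 ** -frac_bits

def quantize(value, int_bits, frac_bits):
    """Round to an unsigned Qi.f grid, saturating at the largest code."""
    scale = 1 << frac_bits
    max_code = (1 << (int_bits + frac_bits)) - 1
    code = min(max(round(value * scale), 0), max_code)
    return code / scale

# Formats that appear in Table I: (integer bits, fractional bits).
for name, (i, f) in {"mass Q16.16": (16, 16), "mass Q14.12": (14, 12),
                     "mass Q14.8": (14, 8), "prob Q1.31": (1, 31),
                     "prob Q1.23": (1, 23), "prob Q1.15": (1, 15)}.items():
    print(f"{name}: step {resolution(f):.3e}")

print(quantize(18.0106, 14, 12))     # mass of the main H2O peak
print(quantize(4.7144e-07, 1, 23))   # a low-probability peak
print(quantize(2.7111e-11, 1, 23))   # below one step, rounds to zero
```

A narrower probability word acts as an implicit threshold: any peak smaller than half a step rounds to zero regardless of the configured threshold probability.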

Multi-FPGA Performance

18

TABLE II. Multi-FPGA performance for several node configurations.

No.  Boards/Node  FPGAs/Board  Total FPGAs  Speedup†/FPGA  Total Speedup†
1.             1            1            1            517             517
2.             1            2            2            516            1031
3.             1            3            3            398            1192
4.             1            4            4            315            1259
5.             2            1            2            516            1033
6.             2            2            4            502            2009
7.             2            4            8            313            2510
8.             4            1            4            492            1968
9.             4            4           16            209            3340

† w.r.t. C++ software processing 2^30 statistically representative queries (based on relative elemental abundance) in 52,478 seconds on a single E5620 core.

Performance Trends of “sweet spot” for Various Novo-G Node Configurations

Multiple FPGA advantage?

Increasing FPGAs per PROCStar IV
  Scalability limited by I/O bandwidth
  PROCStar IV only supports 8-lane, Gen 1 PCIe
  Expect increased scalability with a system configuration employing more lanes and/or the more recent Gen 3 PCIe standard

Increasing PROCStar IVs per node
  Available system bandwidth far exceeds the board link bandwidth bottleneck observed with a single board
  Scalability now limited by CPU resources
  Novo-G “ps4” nodes have 8 physical cores (16 logical with hyper-threading) vs. max 32 threads for row 9
  Expect increased scalability with a system configuration employing more physical cores

Multi-node scaling expectations
  Assuming input queries are pre-partitioned, no communication is required between compute nodes
  Overhead limited to initialization & completion synchronization, so expect performance to scale almost linearly with additional nodes
  We plan to verify these expectations by scaling to multiple compute nodes in Novo-G as future work
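The scaling trends can be quantified from Table II by dividing each total speedup by what perfect scaling of the single-FPGA result would give. This small sketch only rearranges numbers already in the table; the efficiency metric and the runtime estimate are derived quantities, not figures reported on the slides.

```python
ROWS = [  # (boards per node, FPGAs per board, total speedup) from Table II
    (1, 1, 517), (1, 2, 1031), (1, 3, 1192), (1, 4, 1259),
    (2, 1, 1033), (2, 2, 2009), (2, 4, 2510), (4, 1, 1968), (4, 4, 3340),
]
BASELINE_SECONDS = 52478   # software time for the Table II workload

def efficiency(boards, fpgas_per_board, total_speedup, single=517):
    """Fraction of ideal linear scaling relative to one FPGA."""
    return total_speedup / (boards * fpgas_per_board * single)

for boards, per_board, speedup in ROWS:
    eff = efficiency(boards, per_board, speedup)
    runtime = BASELINE_SECONDS / speedup
    print(f"{boards} board(s) x {per_board} FPGA(s): "
          f"efficiency {eff:.2f}, est. runtime {runtime:.1f} s")
```

Read this way, adding boards with one FPGA each stays close to ideal, while filling a board with four FPGAs roughly halves per-device efficiency, which matches the I/O-bandwidth explanation given above.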

Summary & Conclusions

19

Presented first FPGA-based Isotope Pattern Calculator
  Computationally intense subroutine common in de novo PIAs
  Provides 23 customization parameters for general use
  Discussed parameter tradeoffs & experimentally demonstrated effect on performance

Between 72× and 566× speedup† on a single FPGA
Up to 1259× speedup† on a single board (4 FPGAs)
Up to 3340× speedup† on a single node (16 FPGAs)

† with respect to a highly optimized, serial C++ IPC implementation

Wide range of achieved single-node performance due to embarrassingly parallel scalability restricted by real-world system limitations such as insufficient I/O bandwidth and CPU resources

Can enable use of previously dismissed protein identification algorithms with potentially revolutionary accuracy yet obscene execution time on conventional computing platforms

Still much to be done before this is a reality for protein identification

Future Work

Continue scaling design to multiple nodes of Novo-G
Integrate FPGA-accelerated IPC into full de novo PIA
  First integrate with full theoretical spectrum generator
  Move more of algorithm onto FPGA to lessen bandwidth bottleneck issue
Explore the possibility of a GPU-accelerated IPC
  GPU amenable given minor modifications to the algorithm as stated
  Preliminary design already mapped out, ready for implementation & testing

20

Implement non-sorted output option
  Sorting fundamentally integral to current DC design
  Non-sorting DC would allow greater parallelization while utilizing fewer resources
  If sorted distribution not required by targeted PIA, expect much greater performance

Thank You For Listening! Any Questions?

21

References

[1] A. W. Bell et al., “A HUPO test sample study reveals common problems in mass spectrometry-based proteomics,” Nat. Methods, vol. 6, pp. 423-430, 2009.

[2] E. A. Kapp et al., “An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: Sensitivity and specificity analysis,” Proteomics, vol. 5, pp. 3475-3490, 2005.

[3] C. Pascoe et al., “Reconfigurable supercomputing with scalable systolic arrays and in-stream control for wavefront genomics processing,” Proc. of Symposium on Application Accelerators in High-Performance Computing (SAAHPC), TN, 2010.

[4] A. George, H. Lam, A. Lawande, C. Pascoe, and G. Stitt, “Novo-G: A View at the HPC Crossroads for Scientific Computing,” Proc. of the Int. Conf. on Eng. of Reconf. Sys. and Algs. (ERSA), NV, 2010.

[5] A. George, H. Lam, and G. Stitt, “Novo-G: At the Forefront of Scalable Reconfigurable Computing,” IEEE Computing in Sci. & Eng. (CiSE), vol. 13, no. 1, pp. 82-86, Jan./Feb. 2011.

[6] Dirk (2005), Isotopic Pattern Calculator, http://isotopatcalc.sourceforge.net/index.php, File: gips-0.7.tar.gz.

22
