SAAHPC 2012
Wednesday July 11th, 2012
FPGA-Accelerated Isotope Pattern Calculator
for Use in Simulated Mass Spectrometry
Peptide and Protein Chemistry
Carlo Pascoe (speaker), David Box, Herman Lam, Alan George
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Dept. of Electrical and Computer Engineering, University of Florida
Gainesville FL, USA
Email: {pascoe, box, hlam, george}@chrec.org
Motivation
Protein Identification Algorithms (PIAs)
  Heavily utilized in pharmaceutical research and cancer diagnostics
  Current industry-standard methods unreliable (at best!) [1,2]
  Highly accurate algorithms with the potential to revolutionize accuracy exist, but they are unused or underutilized due to extreme computational intensity and prohibitive execution times
  Must accelerate for feasible use
Approach: Accelerate Isotope Pattern Calculator (IPC), a dominant subroutine common in de novo PIAs
  Provide customizable design for general use
  Capitalize on reconfigurable computing at scale to achieve sustainable supercomputing performance
Objective: Develop sustainable solution for increasing the speed, and thus achievable accuracy, of many PIAs
Presentation Outline
Background
  Protein Identification
  De Novo PIAs
  Theoretical Mass Spectrum Generation
IPC Problem Description
  Elemental Isotope SADs
  Stage 1: SED Calculation
  Stage 2: SED Combination
  Additional IPC Functionality
A Configurable & Scalable IPC Hardware Architecture
  SED Calculation Reduced to LUTs
  SED Iterative Combination in Hardware
Performance Evaluation on Novo-G
  Single-FPGA Performance
  Multi-FPGA Performance
Summary & Conclusions
Future Work
Q&A
SAD: Single-Atom Distribution, SED: Single-Element Distribution
Protein Identification
Protein: biochemical molecule consisting of one or more polypeptides
  Macromolecular chains of linked amino acids
Current protein ID approach
  Methodically fragment protein sample
  Analyze with mass spectrometer
  Employ PIAs to generate string representing amino acid primary structure
  Algorithms classified as database or de novo
[Figure: "This… to this… to this…" progression ending in the peptide amino acid sequence RPPGFSPFR]
De Novo PIAs
General de novo approach
  Make educated guess for amino acid string
  Generate theoretical mass spectra and compare to experimental spectrum
  Iteratively refine guess until theoretical and experimental spectra match
Theoretically need to consider all linear combinations of amino acids
  Number of candidates grows exponentially with final sequence length
Employ diverse heuristic pruning methods to limit protein search space
  Necessity for practical use on conventional computing systems
  Often leads to false identifications (e.g., N and GG can have same mass)
By accelerating key computation common in many de novo algorithms, algorithm developers can employ less restrictive pruning criteria, potentially
allowing a greater degree of accuracy in less time
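The two points above (exponential growth of the candidate space, and the N versus GG ambiguity) can be illustrated with a minimal Python sketch. The residue formulas are standard chemistry (an amino acid residue is the free amino acid minus one H2O); the names `RESIDUE_FORMULA`, `composition`, and `candidate_count` are hypothetical helpers introduced only for this illustration, and the 20-letter alphabet is an assumption about the candidate space.

```python
from collections import Counter

# Elemental composition of two amino acid residues
# (free amino acid minus one H2O). Illustrative subset only.
RESIDUE_FORMULA = {
    "G": Counter({"C": 2, "H": 3, "N": 1, "O": 1}),  # glycine
    "N": Counter({"C": 4, "H": 6, "N": 2, "O": 2}),  # asparagine
}


def composition(sequence):
    """Sum the residue formulas over a peptide substring."""
    total = Counter()
    for residue in sequence:
        total += RESIDUE_FORMULA[residue]
    return total


def candidate_count(length, alphabet_size=20):
    """Number of linear amino acid strings of the given length."""
    return alphabet_size ** length


if __name__ == "__main__":
    # Identical elemental composition implies identical mass distribution.
    print(composition("N") == composition("GG"))
    for length in (2, 5, 10):
        print(length, candidate_count(length))
```

Because N and GG share one elemental formula, no mass-only measurement can separate them, which is why pruning heuristics that discard one of the two candidates early can produce false identifications.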
Theoretical Mass Spectrum Generation
Majority of execution time for many highly accurate de novo algorithms
Calculation comprises:
  Decomposition of candidate sequence string into many amino acid substrings
  Generation of probable mass contributions for each predicted substring
  Histogram-like combination of probable masses to form theoretical distribution directly comparable to experimental mass spectra
Complicated by the fact that, in nature, elements occur as a mixture of isotopes
  Neutron quantity differences suggest distribution of possible molecule masses
Use IPC subroutine to predict possible masses
  Enumerates all possible combinations of constituent element isotopes to produce a list of mass/probability pairs
Although a relatively simple calculation for the smallest of molecules, IPC executions for medium- to large-sized molecules quickly become a computational bottleneck of many chemistry applications, most notably de novo protein identification.
IPC Problem Description
Given a chemical formula and a database of element isotope SADs, produce a list of mass/probability pairs representing the distribution of possible molecular masses
Analogous to evaluating $\prod_{j} \left( \sum_{i} E_{j,i} \right)^{N_j}$, where $E_{j,i}$ represents the ith isotope of the jth unique element in a chemical formula containing $N_j$ atoms of the jth element type
Problem reducible to two-stage process
  1. Compute each single-element distribution (SED)
  2. Combine SEDs to form final distribution
Elemental Isotope SADs
Stage 1: SED Calculation
Consider SEDs of Hydrogen from SAD

ALGORITHM 1. Calculate H_N SED
  p0 ← 0.999885, m0 ← 1.007825
  p1 ← 0.000115, m1 ← 2.014102
  FOR n1 ← 0 to N
    n0 ← N − n1
    p ← (N! / (n0! · n1!)) · p0^n0 · p1^n1
    m ← n0 · m0 + n1 · m1
    PRINT (m, p)
  END LOOP

H1: M= 1.007825, p= 9.99885e-01; M= 2.014102, p= 1.15e-04
H2: M= 2.01565, p= 9.99770e-01; M= 3.02193, p= 2.2997e-04; M= 4.02820, p= 1.3225e-06
HN: a really long list with many low-probability peaks!
Impose Threshold Probability
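A minimal Python sketch of the two-isotope SED calculation follows, assuming the binomial form of ALGORITHM 1. The function name `two_isotope_sed` and the `threshold` argument are hypothetical names chosen for this illustration; the hydrogen SAD values are the ones given on the slide.

```python
from math import comb

# Hydrogen single-atom distribution (SAD) from the slide: (mass, probability).
H_SAD = [(1.007825, 0.999885), (2.014102, 0.000115)]


def two_isotope_sed(n_atoms, sad, threshold=0.0):
    """Enumerate the SED of n_atoms atoms of a two-isotope element.

    Peaks with probability below `threshold` are dropped, mirroring the
    "impose threshold probability" step.
    """
    (m0, p0), (m1, p1) = sad
    peaks = []
    for n1 in range(n_atoms + 1):          # atoms of the heavier isotope
        n0 = n_atoms - n1                  # atoms of the lighter isotope
        p = comb(n_atoms, n1) * p0**n0 * p1**n1
        m = n0 * m0 + n1 * m1
        if p >= threshold:
            peaks.append((m, p))
    return peaks


if __name__ == "__main__":
    for mass, prob in two_isotope_sed(2, H_SAD):
        print(f"M= {mass:.5f}, p= {prob:.4e}")
    # Unfiltered, H_N has N + 1 peaks; a threshold trims the long tail.
    print(len(two_isotope_sed(256, H_SAD)))
    print(len(two_isotope_sed(256, H_SAD, threshold=1e-5)))
```

Evaluated directly from the SAD values above, the heaviest H2 peak is p1² = 0.000115² ≈ 1.3225e-08.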
Stage 1: SED Calculation
Can modify ALGORITHM 1 to handle any element with two stable isotopes (e.g., Helium, Carbon, Nitrogen, etc.)
What if an element has more than two stable isotopes? Consider SEDs of Sulfur

ALGORITHM 2. Calculate S_N SED
  p0 ← 0.9493, m0 ← 31.972079
  p1 ← 0.0076, m1 ← 32.971459
  p2 ← 0.0429, m2 ← 33.967867
  p3 ← 0.0002, m3 ← 35.967081
  FOR n1 ← 0 to N
    FOR n2 ← 0 to N − n1
      FOR n3 ← 0 to N − (n1 + n2)
        n0 ← N − (n1 + n2 + n3)
        p ← (N! / (n0! · n1! · n2! · n3!)) · p0^n0 · p1^n1 · p2^n2 · p3^n3
        m ← n0 · m0 + n1 · m1 + n2 · m2 + n3 · m3
        PRINT (m, p)
      END LOOP
    END LOOP
  END LOOP

SN: a REALLY, REALLY long list!
Computation Significantly Increases as the Number of Stable Isotopes Increases
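The growth in work for a four-isotope element can be sketched in Python as below, assuming the multinomial form of ALGORITHM 2. `four_isotope_sed` is a hypothetical name for this illustration; the sulfur SAD values are taken from the slide.

```python
from math import factorial

# Sulfur single-atom distribution (SAD) from the slide: (mass, probability).
S_SAD = [
    (31.972079, 0.9493),
    (32.971459, 0.0076),
    (33.967867, 0.0429),
    (35.967081, 0.0002),
]


def four_isotope_sed(n_atoms, sad, threshold=0.0):
    """Enumerate the SED of n_atoms atoms of a four-isotope element."""
    peaks = []
    for n1 in range(n_atoms + 1):
        for n2 in range(n_atoms - n1 + 1):
            for n3 in range(n_atoms - (n1 + n2) + 1):
                n0 = n_atoms - (n1 + n2 + n3)
                counts = (n0, n1, n2, n3)
                # Multinomial coefficient N! / (n0! n1! n2! n3!).
                coeff = factorial(n_atoms)
                for c in counts:
                    coeff //= factorial(c)
                p = float(coeff)
                m = 0.0
                for c, (mass, prob) in zip(counts, sad):
                    p *= prob**c
                    m += c * mass
                if p >= threshold:
                    peaks.append((m, p))
    return peaks


if __name__ == "__main__":
    # Unfiltered peak count grows as C(N + 3, 3), versus N + 1 for hydrogen.
    for n in (1, 4, 16, 64):
        print(n, len(four_isotope_sed(n, S_SAD)))
```

The triple loop visits every isotope count combination, so the unfiltered list grows cubically in N for sulfur, compared with linearly for a two-isotope element.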
Stage 2: SED Combination
With Stage 1 complete, analogous to evaluating $\prod_{j} \left( \sum_{i} S_{j,i} \right)$, where $S_{j,i}$ represents the ith peak from the SED generated for the jth unique element
Removal of exponent allows for straightforward combination
Simple Example: H2O
H2 SED:
  M= 2.01565, p= 9.9977e-01
  M= 3.02193, p= 2.2997e-04
  M= 4.02820, p= 1.3225e-06
O1 SED:
  M= 15.9949, p= 9.9757e-01
  M= 16.9991, p= 3.8e-04
  M= 17.9992, p= 2.05e-03
Combined (add masses, multiply probabilities):
  M= 2.01565 + 15.9949 = 18.0106, p= 9.9977e-01 * 9.9757e-01 = 9.9734e-01
  M= 2.01565 + 16.9991 = 19.0148, p= 9.9977e-01 * 3.8e-04 = 3.7991e-04
  M= 2.01565 + 17.9992 = 20.0149, p= 9.9977e-01 * 2.05e-03 = 2.0495e-03
  M= 3.02193 + 15.9949 = 19.0168, p= 2.2997e-04 * 9.9757e-01 = 2.2941e-04
  M= 3.02193 + 16.9991 = 20.0210, p= 2.2997e-04 * 3.8e-04 = 8.7389e-08
  M= 3.02193 + 17.9992 = 21.0211, p= 2.2997e-04 * 2.05e-03 = 4.7144e-07
  M= 4.02820 + 15.9949 = 20.0231, p= 1.3225e-06 * 9.9757e-01 = 1.3193e-06
  M= 4.02820 + 16.9991 = 21.0273, p= 1.3225e-06 * 3.8e-04 = 5.0255e-10
  M= 4.02820 + 17.9992 = 22.0274, p= 1.3225e-06 * 2.05e-03 = 2.7111e-09
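The pairwise combination in this example can be sketched in Python as a Cartesian product that adds masses and multiplies probabilities. `combine_seds` and `combine_all` are hypothetical names for this illustration; the input SEDs are copied from the slide as printed.

```python
from functools import reduce

# SEDs as printed on the slide: (mass, probability).
H2_SED = [(2.01565, 9.9977e-01), (3.02193, 2.2997e-04), (4.02820, 1.3225e-06)]
O1_SED = [(15.9949, 9.9757e-01), (16.9991, 3.8e-04), (17.9992, 2.05e-03)]


def combine_seds(sed_a, sed_b):
    """Combine two distributions: every pair adds mass, multiplies probability."""
    return [(ma + mb, pa * pb) for ma, pa in sed_a for mb, pb in sed_b]


def combine_all(seds):
    """Fold any number of SEDs into one full-formula distribution."""
    return reduce(combine_seds, seds)


if __name__ == "__main__":
    for mass, prob in combine_all([H2_SED, O1_SED]):
        print(f"M= {mass:.4f}, p= {prob:.4e}")
```

Without filtering, the output length is the product of the input lengths, which is why the thresholding and windowing steps matter for larger formulas.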
Additional IPC Functionality
Simple Example Continued: H2O
  M= 18.0106, p= 9.9734e-01
  M= 19.0148, p= 3.7991e-04
  M= 20.0149, p= 2.0495e-03
  M= 19.0168, p= 2.2941e-04
  M= 20.0210, p= 8.7389e-08
  M= 21.0211, p= 4.7144e-07
  M= 20.0231, p= 1.3193e-06
  M= 21.0273, p= 5.0255e-10
  M= 22.0274, p= 2.7111e-09
Probability Threshold: filter peaks with prob < PT (e.g., PT = 1.0e-05)
  M= 18.0106, p= 9.9734e-01
  M= 19.0148, p= 3.7991e-04
  M= 20.0149, p= 2.0495e-03
  M= 19.0168, p= 2.2941e-04
Sort by Mass
  M= 18.0106, p= 9.9734e-01
  M= 19.0148, p= 3.7991e-04
  M= 19.0168, p= 2.2941e-04
  M= 20.0149, p= 2.0495e-03
  M= 20.0210, p= 8.7389e-08
  M= 20.0231, p= 1.3193e-06
  M= 21.0211, p= 4.7144e-07
  M= 21.0273, p= 5.0255e-10
  M= 22.0274, p= 2.7111e-09
Sort by Probability
  M= 18.0106, p= 9.9734e-01
  M= 20.0149, p= 2.0495e-03
  M= 19.0148, p= 3.7991e-04
  M= 19.0168, p= 2.2941e-04
  M= 20.0231, p= 1.3193e-06
  M= 21.0211, p= 4.7144e-07
  M= 20.0210, p= 8.7389e-08
  M= 22.0274, p= 2.7111e-09
  M= 21.0273, p= 5.0255e-10
Window Filter: filter any peaks after the Nth (e.g., N = 6)
  Applied after sort by probability:
    M= 18.0106, p= 9.9734e-01
    M= 20.0149, p= 2.0495e-03
    M= 19.0148, p= 3.7991e-04
    M= 19.0168, p= 2.2941e-04
    M= 20.0231, p= 1.3193e-06
    M= 21.0211, p= 4.7144e-07
  Applied after sort by mass:
    M= 18.0106, p= 9.9734e-01
    M= 19.0148, p= 3.7991e-04
    M= 19.0168, p= 2.2941e-04
    M= 20.0149, p= 2.0495e-03
    M= 20.0210, p= 8.7389e-08
    M= 20.0231, p= 1.3193e-06
Mass Peak Centroiding: essentially a moving average filter over close peaks, weighted by probability
  M= 18.0106, p= 9.9734e-01
  M= 19.0156, p= 6.0932e-04
  M= 20.0149, p= 2.0509e-03
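The four post-processing options can be sketched in Python as small list transforms. All function names are hypothetical, and the centroiding rule is an assumption: the slide only describes a probability-weighted moving average over close peaks, so the `tolerance` parameter (maximum mass distance to the running centroid) is introduced here purely for illustration.

```python
# H2O peaks as printed on the slide: (mass, probability).
WATER = [
    (18.0106, 9.9734e-01), (19.0148, 3.7991e-04), (20.0149, 2.0495e-03),
    (19.0168, 2.2941e-04), (20.0210, 8.7389e-08), (21.0211, 4.7144e-07),
    (20.0231, 1.3193e-06), (21.0273, 5.0255e-10), (22.0274, 2.7111e-09),
]


def threshold_filter(peaks, pt):
    """Drop peaks whose probability is below pt."""
    return [(m, p) for m, p in peaks if p >= pt]


def sort_by_mass(peaks):
    return sorted(peaks, key=lambda peak: peak[0])


def sort_by_probability(peaks):
    return sorted(peaks, key=lambda peak: peak[1], reverse=True)


def window_filter(peaks, n):
    """Keep only the first n peaks of an already-sorted list."""
    return peaks[:n]


def centroid(peaks, tolerance):
    """Merge mass-adjacent peaks into probability-weighted centroids.

    Assumed rule: a peak joins the current group when its mass lies within
    `tolerance` of the group's running centroid.
    """
    merged = []
    for m, p in sort_by_mass(peaks):
        if merged and m - merged[-1][0] <= tolerance:
            cm, cp = merged[-1]
            total = cp + p
            merged[-1] = ((cm * cp + m * p) / total, total)
        else:
            merged.append((m, p))
    return merged


if __name__ == "__main__":
    window = window_filter(sort_by_mass(WATER), 6)
    for mass, prob in centroid(window, tolerance=0.01):
        print(f"M= {mass:.4f}, p= {prob:.4e}")
```

With a tolerance of 0.01 mass units, the six-peak mass-sorted window collapses to three centroids whose probabilities are the sums of their members.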
A Configurable & Scalable IPC Hardware Architecture
Adapt two-stage procedure to a configurable & scalable hardware architecture capable of converting a stream of independent chemical formula queries into a delimited stream of variable-quantity mass/probability pairs
  Stream Consists of Chemical Formula Query Information and Control Data
  Single Module Handles Stage 1 Functionality
  Multiple Modules Handle Stage 2 Computation
  No. of Modules Independent from Input Stream Data and Host
  Result Distributions Returned in Same Order as Received in Input Stream
[Figure: Stream From Host → Single-Element Distribution Lookup & Query Scheduler → Distribution Calculators 1 through M → Result Serializer → Stream To Host]
SED calculation reduced to LUTs
  Precompute SEDs Exactly, Pull SEDs from LUTs at Runtime (vs. SADs vs. FCFDs)
  SEDs Presorted by Probability, Filtered at Runtime with Configurable Threshold Prob.
  Sample LUT Address Space for SEDs (addresses 0 to 2047):
    H1 − H256
    C1 − C256
    N1 − N256
    O1 − O256
    S1 − S64
    60 Other Elements with 16 SEDs per Element
  In-Stream Control [3]
  Single Bank of LUTs Feeds All Distribution Calculators
  Token-Based Round Robin Scheduler
  Insert Centroiding Here if so Desired
[Figure: Single-Element Distribution Lookup & Query Scheduler. Labels: Stream From Host, Query Scheduler, BRAM 1 through BRAM N (each followed by a Filter), Control Data, FIFO (1,1) through FIFO (N+1,M), Push 1 through Push M, Stream Pop, Distribution Calculators 1 through M. The figure also reproduces the equation from Slide 8.]
SAD: Single-Atom Distribution, SED: Single-Element Distribution, FCFD: Full Chemical Formula Distribution
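The sample LUT address space can be modeled in Python as below. The region sizes come from the slide; the ordering of the regions, the base offsets, and the names `build_lut_map` and `sed_address` are assumptions made only for this illustration and are not confirmed by the source.

```python
# (element symbol, number of precomputed SEDs) for the five large regions.
MAJOR_ELEMENTS = [("H", 256), ("C", 256), ("N", 256), ("O", 256), ("S", 64)]
OTHER_ELEMENT_COUNT = 60   # remaining elements
OTHER_ELEMENT_DEPTH = 16   # SEDs stored per remaining element


def build_lut_map():
    """Return ({symbol: (base, depth)}, total_entries) for the major elements.

    Assumes regions are packed back to back in the order listed on the slide.
    """
    bases = {}
    address = 0
    for symbol, depth in MAJOR_ELEMENTS:
        bases[symbol] = (address, depth)
        address += depth
    total = address + OTHER_ELEMENT_COUNT * OTHER_ELEMENT_DEPTH
    return bases, total


def sed_address(symbol, n_atoms, bases):
    """LUT address of the SED for n_atoms atoms of a major element."""
    base, depth = bases[symbol]
    if not 1 <= n_atoms <= depth:
        raise ValueError(f"{symbol}{n_atoms} is outside the precomputed range")
    return base + n_atoms - 1


if __name__ == "__main__":
    bases, total = build_lut_map()
    print(total)
    print(sed_address("S", 64, bases))
```

The region sizes sum to 4 × 256 + 64 + 60 × 16 = 2048 entries, which matches the 0 to 2047 address range on the slide.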
SED Iterative Combination in Hardware
A single-cycle SED combination architecture sized for the worst case is excessively wasteful when processing the common case; employ iterative combination to boost hardware utilization
  X: No. of Parallel Multipliers and Adders
  Y: Buffer Depth
  Result Reporting Circuitry Operates Independently of Distribution Calculation
[Figure: Distribution Calculator M. Labels: FIFO (1,M) through FIFO (N+1,M), Input Switch Network, Mult/Adder Bank, Filter, Pre-Sort, Insertion Sort Logic, Current Iteration Buffer, Previous Iteration Buffer, Final Result Buffer, Distribution Calculation Controller, Result Collection FIFO, Result Reporting Controller, To Result Serializer]
ALGORITHM 3. Distribution Calculator Procedure
  WHILE Control ≠ “done”
    SED[1…N] ← FIFO[1…N].pop(), Control ← FIFO[N+1].pop()
    IF Control = “begin”
      PrevItBuff[1…N] ← SED[1…N], PrevItBuff[N+1…Y] ← (-1,0)
      CurrItBuff[1…Y] ← (-1,0)
    IF Control = “middle” or Control = “end”
      WHILE tmp ← PrevItBuff[1…Y].shift() ≠ (-1,0)
        i ← 1
        WHILE i ≤ N and SED[i].prob > 0
          MultAdd[1…X].mass ← SED[i…i+X−1].mass + tmp.mass
          MultAdd[1…X].prob ← SED[i…i+X−1].prob ∗ tmp.prob
          PSort[1…X] ← Sort(Filter(MultAdd[1…X], TP))
          CurrItBuff[1…Y] ← InSort(CurrItBuff[1…Y], PSort[1…X])
          i ← i + X
        END LOOP
      END LOOP
      PrevItBuff[1…Y] ← CurrItBuff[1…Y]
      CurrItBuff[1…Y] ← (-1,0)
      IF Control = “end”
        FinalResBuff[1…Y] ← PrevItBuff[1…Y]
  END LOOP
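A software model of the iterative combination in ALGORITHM 3 can be sketched in Python as follows. This is not the VHDL design: Python lists stand in for the FIFOs and buffers, the control tokens are replaced by list position, and ordering the buffers by descending probability is an assumption (the slides say SEDs are presorted by probability but do not state the insertion sort key). The name `distribution_calculator` and its argument names are hypothetical.

```python
def distribution_calculator(seds, tp, y, x=1):
    """Iteratively combine SEDs, keeping at most y peaks per iteration.

    seds: list of SEDs, each a list of (mass, prob) sorted by prob descending
    tp:   threshold probability; products below it are filtered out
    y:    buffer depth (max peaks kept after every iteration)
    x:    peaks of the incoming SED processed per step (parallel mult/adders)
    """
    prev = list(seds[0])[:y]                      # "begin": load first SED
    for sed in seds[1:]:                          # "middle" / "end" tokens
        curr = []
        for tmp_mass, tmp_prob in prev:           # shift out previous peaks
            for i in range(0, len(sed), x):       # x peaks at a time
                batch = [(m + tmp_mass, p * tmp_prob) for m, p in sed[i:i + x]]
                batch = [peak for peak in batch if peak[1] >= tp]
                # Stand-in for pre-sort plus insertion sort into the buffer.
                curr = sorted(curr + batch, key=lambda peak: peak[1],
                              reverse=True)[:y]
        prev = curr
    return prev


if __name__ == "__main__":
    h2 = [(2.01565, 9.9977e-01), (3.02193, 2.2997e-04), (4.02820, 1.3225e-06)]
    o1 = [(15.9949, 9.9757e-01), (17.9992, 2.05e-03), (16.9991, 3.8e-04)]
    for mass, prob in distribution_calculator([h2, o1], tp=1.0e-05, y=128, x=2):
        print(f"M= {mass:.4f}, p= {prob:.4e}")
```

Because the buffer is truncated to `y` entries after every insertion, a small `y` trades result exactness for less storage, and the work per iteration is bounded by `y` times the SED length rather than by the full Cartesian product.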
Performance Evaluation on Novo-G
Previously discussed hardware architecture implemented in VHDL and tested on Novo-G [4,5]
  Initial experiments on single Altera Stratix IV E530 FPGA in GiDEL PROCStar IV board along with an Intel Xeon E5620 CPU for host support
  Single-device implementation scaled up to a single Novo-G “ps4” compute node, i.e., up to 16 E530s in 4 PROCStar IVs
  Implications of scaling to multiple compute nodes of Novo-G discussed
Software baseline: highly optimized, serial C++ code mirroring hardware algorithm
  Executed on single E5620 core
  Orders of magnitude faster than code at [6]
  Hardware and software results compared to confirm hardware correctness
Novo-G Annual Growth
  2009: 96 top-end Stratix-III FPGAs, each with 4.25GB SDRAM
  2010: 96 more Stratix-III FPGAs, each with 4.25GB SDRAM
  2011: 96 top-end Stratix-IV FPGAs, each with 8.50GB SDRAM
  2012: 96 more Stratix-IV FPGAs, each with 8.50GB SDRAM
Single-FPGA Performance
TABLE I. Single-FPGA performance for several parameter configurations.
     Configuration*                                  Freq   Speedup†  Speedup†
No.  N   X  Y    Qi.f Mass  Qi.f Prob  Cen  M        (MHz)  /DC       /FPGA
1.   16  1  128  16.16      1.31       N    12       115    −         72
2.   16  1  128  14.8       1.15       N    20       145    −         115
3.   16  2  128  14.8       1.15       N    15       145    −         115
4.   16  3  128  14.8       1.15       N    10       110    −         87
5.   12  2  128  14.8       1.15       N    14       155    −         123
6.   16  2  80   14.8       1.15       N    21       150    −         120
7.   12  2  80   14.8       1.15       N    22       155    −         127
8.   16  1  128  16.16      1.31       Y    11       120    17        186
9.   16  1  128  14.12      1.23       Y    13       135    18        236
10.  16  1  128  14.8       1.15       Y    16       140    24        384
11.  16  1  128  14.12      1.23       Y    13       135    18        236
12.  16  2  128  14.12      1.23       Y    10       130    32        325
13.  16  3  128  14.12      1.23       Y    6        100    35        214
14.  16  2  128  14.12      1.23       Y    10       130    32        325
15.  12  2  128  14.12      1.23       Y    10       135    34        338
16.  8   2  128  14.12      1.23       Y    11       135    35        381
17.  16  2  100  14.12      1.23       Y    13       135    34        444
18.  16  2  80   14.12      1.23       Y    14       130    32        454
19.  16  2  60   14.12      1.23       Y    17       130    33        566
20.  12  2  80   14.12      1.23       Y    15       135    34        516
* N: max peaks/SED, X: number of parallel peak computations per DC, Y: max peaks/output-window, Qi.f: fixed-point word width in bits (integer.fractional), Cen: centroiding capability enabled (Yes/No), M: DCs per FPGA.
† w.r.t. C++ software processing 16×2^20 statistically representative queries (based on relative elemental abundance in amino acids) in 821 seconds on a single E5620 core.
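The speedup figures in Table I can be turned into absolute numbers with a few lines of Python. This sketch reads the footnote's query count as 16 × 2^20 (an interpretation of the flattened superscript) and uses hypothetical function names; the derived throughputs are implied by the table, not separately reported measurements.

```python
BASELINE_QUERIES = 16 * 2**20   # footnote query count, read as 16 x 2^20
BASELINE_SECONDS = 821.0        # single E5620 core, from the footnote


def baseline_throughput():
    """Software baseline in queries per second."""
    return BASELINE_QUERIES / BASELINE_SECONDS


def implied_fpga_throughput(speedup):
    """Queries per second implied by a reported speedup factor."""
    return speedup * baseline_throughput()


def implied_runtime_seconds(speedup):
    """Time to process the same query set at a reported speedup."""
    return BASELINE_SECONDS / speedup


if __name__ == "__main__":
    print(f"baseline: {baseline_throughput():.0f} queries/s")
    for speedup in (72, 566):   # lowest and highest single-FPGA rows
        print(speedup,
              f"{implied_fpga_throughput(speedup):.3e} queries/s",
              f"{implied_runtime_seconds(speedup):.2f} s")
```

For the centroiding-enabled rows, the per-FPGA speedup is also roughly M times the per-DC speedup (for example, 16 × 24 = 384 in row 10).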
Performance Trends for Various IPC Parameter Configurations
Configurations Bandwidth Limited
  Computation-bound problem in software becomes I/O-bound in FPGAs
Reducing Calculation Word Width
  Reduced Logic Usage & Increased Operating Frequency vs. Reduced Result Precision
Increasing Parallel Computations per DC
  Increased Operations per Clock Cycle vs. Increased Logic Usage, Reduced Routability, & Operating Frequency
Reducing Distribution Window Width
  Reduced Logic Usage vs. Reduced Result Exactness
Suitable “sweet spot,” achieving remarkable speedup while ensuring results remain scientifically relevant
Multi-FPGA Performance
TABLE II. Multi-FPGA performance for several node configurations.
No.  Boards/Node  FPGAs/Board  Total FPGAs  Speedup†/FPGA  Total Speedup†
1.   1            1            1            517            517
2.   1            2            2            516            1031
3.   1            3            3            398            1192
4.   1            4            4            315            1259
5.   2            1            2            516            1033
6.   2            2            4            502            2009
7.   2            4            8            313            2510
8.   4            1            4            492            1968
9.   4            4            16           209            3340
† w.r.t. C++ software processing 2^30 statistically representative queries (based on relative elemental abundance) in 52,478 seconds on a single E5620 core.
Performance Trends of “sweet spot” for Various Novo-G Node Configurations
Increasing FPGAs per PROCStar IV
  Scalability limited by I/O bandwidth
  PROCStar IV only supports 8-lane, Gen 1 PCIe
  Expect increased scalability with system config. employing more lanes and/or more recent Gen 3 PCIe standard
Increasing PROCStar IVs per Node
  Available system bandwidth far exceeds board link bandwidth bottleneck observed with single board
  Scalability now limited by CPU resources
  Novo-G “ps4” nodes have 8 physical cores (16 logical with hyper-threading) vs. max 32 threads for row 9
  Expect increased scalability with system config. employing more physical cores
Multi-Node Scaling Expectations
  Assuming input queries are pre-partitioned, no required communication between compute nodes
  Overhead limited to initialization & completion synchronization, so expect performance to scale almost linearly with additional nodes
  We plan to verify these expectations by scaling to multiple compute nodes in Novo-G as future work
Multiple FPGA Advantage?
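The scaling trends above can be quantified with a short Python sketch that computes parallel efficiency from the Table II totals and the theoretical PCIe link rates behind the bandwidth argument. The efficiency definition (total speedup divided by FPGA count times the single-FPGA speedup) and all names are choices made for this illustration; the PCIe figures are per-direction line rates after encoding overhead, not measured throughput.

```python
# (boards per node, FPGAs per board, total speedup) from Table II.
TABLE_II = [
    (1, 1, 517), (1, 2, 1031), (1, 3, 1192), (1, 4, 1259), (2, 1, 1033),
    (2, 2, 2009), (2, 4, 2510), (4, 1, 1968), (4, 4, 3340),
]
SINGLE_FPGA_SPEEDUP = 517


def efficiency(boards, fpgas_per_board, total_speedup):
    """Fraction of ideal linear scaling achieved by a configuration."""
    ideal = boards * fpgas_per_board * SINGLE_FPGA_SPEEDUP
    return total_speedup / ideal


def pcie_gbytes_per_s(generation, lanes):
    """Theoretical one-direction PCIe bandwidth in GB/s.

    Gen 1: 2.5 GT/s per lane with 8b/10b encoding.
    Gen 3: 8 GT/s per lane with 128b/130b encoding.
    """
    rates = {1: (2.5, 8 / 10), 3: (8.0, 128 / 130)}
    gigatransfers, encoding = rates[generation]
    return gigatransfers * encoding * lanes / 8


if __name__ == "__main__":
    for boards, per_board, total in TABLE_II:
        print(boards, per_board, f"{efficiency(boards, per_board, total):.2f}")
    print(f"Gen 1 x8: {pcie_gbytes_per_s(1, 8):.2f} GB/s")
    print(f"Gen 3 x8: {pcie_gbytes_per_s(3, 8):.2f} GB/s")
```

Adding FPGAs on one board drops efficiency faster than adding boards, consistent with the shared board link being the first bottleneck.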
Summary & Conclusions
Presented first FPGA-based Isotope Pattern Calculator
  Computationally intense subroutine common in de novo PIAs
  Provides 23 customization parameters for general use
  Discussed parameter tradeoffs & experimentally demonstrated effect on performance
  Between 72× and 566× speedup† on a single FPGA
  Up to 1259× speedup† on a single board (4 FPGAs)
  Up to 3340× speedup† on a single node (16 FPGAs)
  † with respect to a highly optimized, serial C++ IPC implementation
Wide range of achieved single-node performance due to embarrassingly parallel scalability restricted by real-world system limitations such as insufficient I/O bandwidth and CPU resources
Can enable use of previously dismissed protein identification algorithms with potentially revolutionary accuracy yet obscene execution time on conventional computing platforms
  Still much to be done before this is a reality for protein identification
Future Work
Continue scaling design to multiple nodes of Novo-G
Integrate FPGA-accelerated IPC into full de novo PIA
  First integrate with full theoretical spectrum generator
  Move more of algorithm onto FPGA to lessen bandwidth bottleneck issue
Explore the possibility of a GPU-accelerated IPC
  GPU amenable given minor modifications to the algorithm as stated
  Preliminary design already mapped out, ready for implementation & testing
Implement non-sorted output option
  Sorting fundamentally integral to current DC design
  Non-sorting DC would allow greater parallelization while utilizing fewer resources
  If sorted distribution not required by targeted PIA, expect much greater performance
Thank You for Listening! Any Questions?
References
[1] A. W. Bell et al., “A HUPO test sample study reveals common problems in mass spectrometry-based proteomics,” Nat. Methods, vol. 6, pp. 423-430, 2009.
[2] E. A. Kapp et al., “An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: Sensitivity and specificity analysis,” Proteomics, vol. 5, pp. 3475-3490, 2005.
[3] C. Pascoe et al., “Reconfigurable supercomputing with scalable systolic arrays and in-stream control for wavefront genomics processing,” Proc. of Symposium on Application Accelerators in High-Performance Computing (SAAHPC), TN, 2010.
[4] A. George, H. Lam, A. Lawande, C. Pascoe, and G. Stitt, “Novo-G: A View at the HPC Crossroads for Scientific Computing,” Proc. of the Int. Conf. on Eng. of Reconf. Sys. and Algs. (ERSA), NV, 2010.
[5] A. George, H. Lam, and G. Stitt, “Novo-G: At the Forefront of Scalable Reconfigurable Computing,” IEEE Computing in Sci. & Eng. (CiSE), vol. 13, no. 1, pp. 82-86, Jan./Feb. 2011.
[6] Dirk (2005), Isotopic Pattern Calculator, http://isotopatcalc.sourceforge.net/index.php, File: gips-0.7.tar.gz.