
Parallelism or Paralysis: The Essential High Performance Debate

There is an insatiable appetite for performance in capital markets. Not just investment performance, but computational performance as well – and more so with each passing day. Buy side, sell side, intermediary or vendor – and no matter the nature of the underlying strategies or mix of services – all of these firms will compete increasingly on output from high-performance computing (HPC) infrastructure. The ability to perform increasingly complex calculations on increasingly complex data flows, at higher update frequencies, ranks ever higher in today's pantheon of competitive necessities. Navigating global markets can no longer be done with the low precision of overnight batch runs; it now demands the enhanced precision of intraday, near-real-time, and real-time calculations.

Parallelism – a topic that will be new to many, but one that TABB Group believes will quickly become part of the common technology vernacular in our business – offers a significant key to HPC challenges. On the backs of increasingly parallel storage, compute, and network architectures, specifically developed (and re-developed) software can now deliver performance that is orders of magnitude greater than that of serial software running on serial hardware.

Fair warning: Parallel programming is hard. It will not be a challenge tackled by all trading firms and their solution providers, and certainly not for all use cases. There is no known way to decompose and parallelize many computational challenges of today. However, for those use cases that do apply, exceedingly few have been structured to exploit parallelism. This leaves an incredible wealth of performance capabilities lying untapped on nearly every computer system. This is a call to first movers to get up to speed on the competitive advantages of parallelism.

E. Paul Rowady, Jr.

V12:030 | June 2014

www.tabbgroup.com

Data and Analytics


Introduction

An unprecedented, global wave of regulations, increased competition, a rapid pace of transformation, and increasingly complex data flows has brought with it a growing need for high-performance capabilities. A few steps back from the bleeding edge of speed being explored by some trading firms exists a spectrum of computationally intensive use cases that has spawned an ongoing search for new methods and tools to employ more number-crunching horsepower. In many ways, these challenges – sometimes known as throughput computing applications – are far more complex than the pure speed challenges. As such, this is a broad area of development – which we have been calling Latency 2.0: Bigger Workloads, Faster – that is more generally being addressed by high-performance computing (HPC) platforms.

Parallelism is a topic within the overall HPC juggernaut that is gaining both awareness and deployments because of its potential to meet some of these performance demands. The benefits of parallelism can be achieved on multiple and complementary levels – ranging from storage architectures to compute architectures to network architectures – and have already demonstrated potential for dramatic performance gains. Modern server architectures are now increasingly parallel. Therefore, compute performance is now a function of the level of parallelism enabled by your software (see Exhibit 1, below).

Exhibit 1 Only Parallel Software + Parallel Hardware = Parallel Performance

Source: TABB Group, Intel

Where graphics processing units (GPUs) were until recently seen as the tool of choice for harvesting the benefits of highly parallel processing in general-purpose applications, new central processing unit (CPU) architectures are now proving equally powerful, and at a lower total cost of ownership (TCO) – a more detailed comparison follows.


Computational Targets

Parallelism can significantly boost the performance and throughput of problems that are well suited to it – that is, large problems that can be decomposed easily into smaller problems that are then solved in parallel. It is technology's answer to divide et impera – divide and conquer.

Truth be told, however, parallelism may not be right for all firms and is definitely not right for all use cases. On top of the hardware and code development challenges, there are applications for which no way to parallelize them is known today. For instance, use cases related to alpha discovery and capture are particularly tough to optimize because their out-of-sample behavior is so dynamic. From a programmer's perspective, parallel programming also brings a number of challenges to the table that do not exist in sequential programming. These include less mature developer tooling (IDEs, compilers, debuggers, etc.), the added complexity of writing thread-safe parallel algorithms, and exposure to low-level programming languages such as C and C++. Specific capital markets examples include:

- Derivatives pricing (including swaps) and volatility estimation;
- Portfolio optimizations;
- Credit value adjustment (CVA) and other "xVA" calculations; and
- Value-at-Risk (VaR), stress tests and other risk analytics.

These examples represent a subset of a much broader spectrum of computational challenges that generally exhibit the greatest potential for parallelism, including those that use or require:

Linear Algebra: Used for matrix and vector multiplication; often a fundamental building block of workloads in finance, bioinformatics and fluid dynamics.

Monte Carlo Simulations: Taking a single function, executing it in parallel on independent data sets, and using the results to glean actionable knowledge – for example, identifying probabilistic outcomes, including "fat tail" events, in a financial portfolio (a brief code sketch of this pattern follows this list).

Fast Fourier Transformations (FFTs): Converting signals from the time domain to the frequency domain, and vice versa; essential for signal processing applications and useful for options pricing and other financial time series analysis, including applications in high-frequency trading.

Image Processing: With each pixel calculated separately, there is ample opportunity for parallelism – for example, facial recognition software.

Map/reduce: Taking large datasets and distributing the data filtering, sorting and aggregation workloads across multiple nodes in a compute cluster – for example, Google’s MapReduce or Hadoop.
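To make the Monte Carlo pattern above concrete, the following is a minimal, illustrative sketch (not taken from the report or any vendor toolkit it discusses) of how a pricing simulation can be decomposed across CPU threads using only standard C++. The function name, payoff, parameter values and task count are all hypothetical; production code would add variance reduction, a vetted random number strategy and careful validation.

```cpp
// Illustrative only: price a European call by Monte Carlo, splitting the
// paths across independent tasks (task-level parallelism). Standard C++11.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <future>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

// Each task simulates its own slice of paths with its own random stream.
double simulate_chunk(std::uint64_t paths, std::uint64_t seed,
                      double s0, double k, double r, double sigma, double t) {
    std::mt19937_64 gen(seed);
    std::normal_distribution<double> z(0.0, 1.0);
    double sum = 0.0;
    for (std::uint64_t i = 0; i < paths; ++i) {
        // Terminal price under geometric Brownian motion
        double st = s0 * std::exp((r - 0.5 * sigma * sigma) * t +
                                  sigma * std::sqrt(t) * z(gen));
        sum += std::max(st - k, 0.0);   // call payoff
    }
    return sum;
}

int main() {
    const double s0 = 100.0, k = 105.0, r = 0.02, sigma = 0.25, t = 1.0;
    const std::uint64_t total_paths = 4000000;
    const unsigned tasks = std::max(1u, std::thread::hardware_concurrency());
    const std::uint64_t paths_per_task = total_paths / tasks;

    // Fan out: one asynchronous task per hardware thread.
    std::vector<std::future<double>> futures;
    for (unsigned i = 0; i < tasks; ++i)
        futures.push_back(std::async(std::launch::async, simulate_chunk,
                                     paths_per_task, std::uint64_t(1234 + i),
                                     s0, k, r, sigma, t));

    // Reduce: combine partial sums and discount back to today.
    double sum = 0.0;
    for (auto& f : futures) sum += f.get();
    const double price = std::exp(-r * t) * sum / (paths_per_task * tasks);
    std::cout << "Estimated call price: " << price << "\n";
}
```

Because each slice of paths is independent, the decomposition requires no synchronization beyond the final reduction, which is exactly why this pattern parallelizes so well.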


To further establish a comparative sense of the challenges that parallelism is most likely to address, a list of throughput computing kernels and their characteristics can be found in Exhibit 2, below:

Exhibit 2 Sample List – Throughput Computing Kernels and Applications

Source: TABB Group, Intel


Shifting Hardware Considerations

General-purpose graphics processing units (GPGPUs) have earned a place in the high-performance spotlight over the past few years because they are specifically designed for parallel processing and, thus, throughput computing applications. As a result, trading strategy and risk analytics developers in search of higher performance gravitated to GPUs, which were seen at the time as the only option for harvesting the benefits of parallelism. And, for a time, this may have been a solid assumption. The logic was straightforward, although easier said than done: Plug a GPU with specifically (and often painstakingly) designed code into an x86-based server and you're off to the races, so to speak. Of course, studies touting the 100x performance advantage of GPUs over multi-core CPUs didn't hurt.

Today, this logic no longer holds up nearly as well – and the claims of the early comparative studies have largely been debunked. Recent studies of the latest CPU architectures (using processors with and without coprocessors), some of which are highlighted below, show the same or similar performance benefits as GPUs across a diverse array of throughput "kernels" while also delivering the additional benefits of lower costs, less operational risk, and greater returns on investment from existing infrastructure and tools.

These advantages are critically important in the current environment. Mounting global markets headwinds are forcing firms to take a more holistic view of technology costs and deployment strategies. This is true even in areas where the demand for higher performance is very strong, such as pricing, valuation and risk analytics. Unfortunately, the luxury of experimentation – even in areas with strong demand – still comes with boundaries. This means that the TCO of "single-purpose-built" technologies – like GPUs – is less compelling than originally envisioned. Sure, if your firm is firing on all cylinders with GPUs – and has all the talent in place to extract a satisfactory ROI – then you will likely want to stick with what you have. But this is the exception rather than the rule. Most trading firms and most computational challenges (that would apply in the first place) are still sitting at square one with un-optimized, serial and/or single-threaded code.

Meanwhile, developments over the past five years allow similar levels of performance to be achieved on CPUs. It turns out that x86-based chip architectures may have much more to offer the growing high-performance crowd and represent a better fit for this "more-for-less" era. Consider that processing speed is no longer only about fiddling with the balance of frequency (GHz) and cycles per instruction (CPI). For the past 20 years, increased performance came mainly from improving CPI and increasing clock frequency. But that lever is played out; no one is going to ship 100 GHz processors anytime soon. The laws of physics get in the way.
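As a rough, illustrative model (our shorthand, not a formula from the report): serial performance is bounded by instruction count, CPI and clock frequency, and with the clock-frequency lever largely exhausted, the remaining headroom comes from multiplying cores and SIMD lanes.

```latex
% Simplified, illustrative performance model (not from the report).
% Serial execution time:
\[
  T_{\text{serial}} \approx \frac{\text{instructions} \times \text{CPI}}{\text{clock frequency}}
\]
% With frequency roughly capped, peak throughput now scales with "width":
\[
  \text{peak throughput} \approx \text{clock frequency} \times \text{cores} \times \text{SIMD lanes} \times \text{ops per lane per cycle}
\]
```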


Wider Hardware

Now, computers aren't necessarily becoming faster; they are becoming "wider." The "width" of new computers is a third factor contributing to new levels of performance. So the benefits of parallelism are accessible not only from multiple cores, but also, independently, from vector execution within each core using SIMD (single instruction, multiple data) technology. (This is the difference, by the way, between thread-level parallelism, or TLP, and data-level parallelism, or DLP.) Executing SIMD instructions means a single instruction operates on numerous single-precision (SP) or double-precision (DP) values at the same time. In this sense, processing speed depends on the number of "lanes," a number that has recently been expanding on the back of the evolution from 128-bit to 256-bit and now up to 512-bit vector registers. (The figures 128, 256 and 512 refer to the number of bits in a vector register; the "lanes" are the number of data items that fit into those registers.)

Another way to think of this is that supercomputers keep getting smaller – on the order of up to 40 cores in a single 2U chassis. On a walk down a different (memory) lane, achieving that much compute in a single machine used to require a lot more money, a lot more data center space, and a lot more power than it does today.
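For a sense of scale, a 256-bit register holds eight SP values (or four DP values) and a 512-bit register holds sixteen; a hypothetical 40-core server with 512-bit vector units therefore exposes on the order of 40 x 16 = 640 SP lanes per cycle, before accounting for fused multiply-add. The sketch below (illustrative only, and assuming an AVX-capable x86 CPU plus a compiler flag such as -mavx) shows data-level parallelism directly with AVX intrinsics: one add instruction operating on eight single-precision lanes at a time.

```cpp
// Illustrative data-level parallelism with AVX intrinsics: each 256-bit
// register holds 8 single-precision "lanes," so one _mm256_add_ps retires
// 8 additions per instruction. Requires an AVX-capable x86 CPU (e.g. -mavx).
#include <immintrin.h>
#include <cstddef>
#include <iostream>
#include <vector>

void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {               // vector body: 8 lanes per step
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];     // scalar tail for leftovers
}

int main() {
    std::vector<float> a(1000, 1.5f), b(1000, 2.5f), c(1000);
    add_arrays(a.data(), b.data(), c.data(), a.size());
    std::cout << c[0] << "\n";                 // prints 4
}
```

In practice, much of this vectorization is left to the compiler or to libraries rather than hand-written intrinsics, but the lane arithmetic is the same either way.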


Software Going Wider

With a somewhat broader list of hardware choices in mind, we can now turn to software development, starting with a simple axiom: If your code isn't parallelized, it won't matter how many cores you have, how wide the computer is, or which processing architecture it uses. Applications that have not been created or modified to utilize high degrees of parallelism – via tasks, threads, vectors and so on – will be severely limited in the benefit they derive from hardware designed to offer high degrees of parallelism. Code optimization projects can range from minor work to major restructuring to expose and exploit parallelism through multiple tasks and the use of vectors. This is where the more nuanced choices related to programmer skills, toolsets, hardware design, power utilization and other factors come into play (see Exhibit 3, below).

Exhibit 3 Comparative Analysis: CPU vs. GPU

Source: TABB Group, Intel
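To illustrate what exposing parallelism through tasks and vectors can look like in practice, the loop below is a minimal, hypothetical sketch (not a recipe from this report): the same calculation is shown serially and then restructured with OpenMP so the compiler can spread iterations across threads and vectorize within each thread. The function names and workload are invented for illustration; real pricing or risk code would first need to confirm that iterations are genuinely independent before applying such directives.

```cpp
// Illustrative serial-to-parallel refactor using OpenMP (compile with an
// OpenMP-enabled compiler, e.g. -fopenmp). The workload is invented.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Serial baseline: one iteration at a time, one lane at a time.
double revalue_serial(const std::vector<double>& spots, double shock) {
    double total = 0.0;
    for (std::size_t i = 0; i < spots.size(); ++i)
        total += std::log(spots[i] * shock + 1.0);   // stand-in for a revaluation
    return total;
}

// Parallel version: iterations are independent, so they can be split across
// threads (thread-level parallelism) and vectorized within each thread
// (data-level parallelism). The reduction clause combines the per-thread sums.
double revalue_parallel(const std::vector<double>& spots, double shock) {
    double total = 0.0;
    #pragma omp parallel for simd reduction(+:total)
    for (std::size_t i = 0; i < spots.size(); ++i)
        total += std::log(spots[i] * shock + 1.0);
    return total;
}

int main() {
    std::vector<double> spots(1000000, 100.0);
    std::cout << revalue_serial(spots, 1.01) << " "
              << revalue_parallel(spots, 1.01) << "\n";  // same value (up to FP rounding)
}
```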

Parallel programming is challenging and requires advanced expertise. You are going to pay the programmer and software development costs whether you go the CPU or the GPU route. This is due to the added complexity of task decomposition, mapping and synchronization – challenges that do not exist in sequential programming. And because of the mostly pioneering nature of new parallelization efforts, they can be error-prone. Currently, the main weapon against this complexity is experience.

But the analysis must also reflect the costs and benefits of multi-use hardware versus specialized hardware. If parallelizing your code is challenging to begin with, then parallelizing your code on specialized hardware is going to be more challenging, more expensive, and simply riskier from an operational perspective. Furthermore, and perhaps even more important, there is a critical temporal component to this decision as well: It will take time no matter which course you choose. Hardware design – and the evolution of hardware improvements – comes into play in a big way here. Leveraging the benefits of GPUs will usually require a total re-write of existing code, deferring any production benefits until the end of that re-writing process. Depending on numerous factors – principally the nature of the problem and the programmer skills at hand – this performance enhancement process can typically be measured in weeks or even several months. This point is extremely important: Since the advertised 100x (or more) performance enhancement of GPUs – over base-case, un-optimized code – will not happen overnight, it may take more time than originally expected to achieve fully optimized performance gains on the various use cases and on the existing hardware.

By comparison, accessing the same or similar parallel benefits on new CPUs can be achieved incrementally. Even this exercise comes at some cost: Fully taking advantage of the parallel compute capabilities of modern x86 hardware requires a solid understanding of low-level programming, machine architecture, and thread concurrency. Consider an experiment in which a performance comparison of parallelized and serial code is performed on the six latest vintages of Intel Xeon platforms, including the version due later in 2014 (see Exhibit 4, below).

Exhibit 4 Incremental Performance Improvement of CPUs

Source: TABB Group, Intel


This experiment shows that the average performance improvement across these six platforms and nine use cases (or kernels) is more than 86x, with a peak improvement of 375x for the latest platform on the single-precision Monte Carlo kernel. When we place these results in the context of additional studies, including one from 2010¹ that specifically compares the performance of GPUs and CPUs and finds an average GPU performance advantage of only 2.5x, the case for equivalence, given the CPU architecture improvements since 2010, becomes even more compelling. This improving performance trajectory of CPUs for parallel computing is further supported by a very recent study (May 2014) conducted by the Securities Technology Analysis Center (STAC)², which yielded the following highlights, among others. [STAC-A2 is a technology-neutral benchmark suite developed by bank quants to represent a non-trivial calculation typical in computational finance: calculating Greeks for multi-asset American options using modern methods.]

In the end-to-end Greeks benchmark (STAC-A2.β2.GREEKS.TIME), this system was:

- The fastest of any system published to date (cold, warm, and mean results);
- 34% faster than the average speed of the next fastest system, which used GPUs (SUT ID: NVDA131118); and
- More than 9x the average speed of the previous Intel 4-socket system tested (SUT ID: INTC130607b).

In the capacity benchmarks (STAC-A2.β2.GREEKS.MAX_ASSETS and STAC-A2.β2.GREEKS.MAX_PATHS), this system handled:

- The most assets of any system;
- Over 63% more assets than the next best system, which used GPUs (SUT ID: NVDA131118);
- The most paths of any system; and
- More than 58% more paths than the next best system, which used GPUs (SUT ID: NVDA131118).

¹ "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," Intel, June 2010.
² STAC-A2 Benchmark tests on a stack consisting of the STAC-A2 Pack for Intel Composer XE (Rev C) with Intel MKL 11.1, Intel Compiler XE 14, and Intel Threading Building Blocks 4.2 on an Intel White Box using 4 x Intel Xeon E7-4890 v2 @ 2.80 GHz (Ivy Bridge EP) processors with 1TB RAM and Red Hat Enterprise Linux 6.5 (SUT ID: INTC140509).

Of course, there is plenty of art to go with this science, since the aggregate community knowledgebase that would normally support such efforts is still in a formative stage – and there is no way of knowing when the maximum performance level has been reached (outside of the growing archive of benchmarks such as those referenced above). It may take months of trial and error to achieve an initial 10x performance enhancement over baseline sequential code with either architecture. In the meantime, particularly if you are just getting started on the parallelism journey, a GPU re-write yields no production code until it is finished. With x86, however, you can start the optimization effort using tools and building blocks you are already familiar with (rather than something completely unknown), with a reasonable expectation that an incremental increase in performance can be deployed in production while additional incremental improvements are pursued on the research bench.


Conclusion

Parallel programming will not always be as challenging as it is today. It is early; support from communities, libraries and tools will grow. As such, your firm's software parallelization strategy – and it will need one if it does not already have one – is critical to future success. Most trading firms and their solution partners will need to go down the path of parallelizing their code for certain use cases sooner or later, particularly if you believe (as we do) that demand for higher-performance computing applications will only increase along the road ahead.

Two factors support this claim. No. 1, the most relevant applications in capital markets – such as option pricing and certain risk analytics – have not been structured to exploit parallelism. This leaves an incredible wealth of capabilities lying untapped on nearly every computer system. No. 2, as the spectrum of high-performance applications for known use cases expands – and as the community, libraries and broader knowledgebase of tools accumulate – TABB believes that new and previously inconceivable use cases will become conceivable. You cannot see these use cases from where you are today. Being on the HPC/parallelism journey is the only way to unleash this new level of creativity.

With that in mind, parallelism is one choice that all capital markets firms need to explore. And, today, there is more than one way to get there. Even if you move to GPUs later, you should explore x86 first to test performance improvements on existing hardware. Consider this: Applications that show positive results with GPUs should always benefit from CPUs (stand-alone or with coprocessors), because the same fundamentals of vectorization – one form of parallelization – must be present. The opposite, however, is never true: The flexibility of CPUs includes support for applications that cannot run on GPUs. This is the main reason that a system built around CPUs will have broader applicability than a system built around GPUs.

Ask yourself: Is my code even taking advantage of the hardware I already have? Chances are, your answer is "No." That needs to change.


About TABB Group

TABB Group is a financial markets research and strategic advisory firm focused exclusively on capital markets. Founded in 2003 and based on the methodology of first-person knowledge, TABB Group analyzes and quantifies the investing value chain, from the fiduciary and investment manager, to the broker, exchange and custodian. Our goal is to help senior business leaders gain a truer understanding of financial markets issues and trends so they can grow their businesses. TABB Group members are regularly cited in the press and speak at industry conferences. For more information about TABB Group, visit www.tabbgroup.com.

The Author

E. Paul Rowady, Jr.

Paul Rowady, Principal and Director of Data and Analytics Research, joined TABB Group in 2009. He has more than 24 years of capital markets, proprietary trading and hedge fund experience, with a background in strategy research, risk analytics and technology development. Paul also has specific expertise in derivatives, highly automated trading systems, and numerous data management initiatives. He is a featured speaker at capital markets, data and technology events; is regularly quoted in national, financial and industry media; and has provided live and taped commentary for CNBC, National Public Radio, and client media channels. With TABB, Paul's research and consulting focus ranges from market data, risk analytics, high-performance computing, social media impacts and data visualization to OTC derivatives reform, clearing and collateral management, and includes authorship of reports such as "Faster to Smarter: The Single Global Markets Megatrend," "Fixed Income Market Data: Growth of Context and the Rate of Triangulation," "Patterns in the Words: The Evolution of Machine-Readable Data," "Enhanced Risk Discovery: Exploration into the Unknown," "The Risk Analytics Library: Time for a Single Source of Truth," "The New Global Risk Transfer Market: Transformation and the Status Quo," "Real-Time Market Data: Circus of the Absurd," and "Quantitative Research: The World After High-Speed Saturation." Paul earned a Master of Management from the J. L. Kellogg Graduate School of Management at Northwestern University and a B.S. in Business Administration from Valparaiso University. He was also awarded a patent related to data visualization for trading applications in 2008.

© 2014 The TABB Group, LLC. All Rights Reserved. May not be reproduced by any means without express permission.

www.tabbgroup.com
New York +1.646.722.7800
Westborough, MA +1.508.836.2031
London +44 (0) 203 207 9027