

FP7-215216

Architecture Paradigms and Programming Languages for Efficient programming of multiple COREs

Specific Targeted Research Project (STReP) THEME ICT-1-3.4

Benchmark and Application Selection for the Evaluation of the Microgrid Architecture and its Tool Chain

Deliverable D2.2, Issue 1

Workpackage WP2

Author(s): Daniel Rolls, Frank Penczek, Artjoms Šinkarovs, Carl Joslin

Reviewer(s): C.Jesshope, S.-B. Scholz

WP/Task No.: WP2 Number of pages: 18

Issue date: 28.5.2009 Dissemination level: Public

Purpose: The purpose of this deliverable is to select and implement applications in SaC, SL and C as benchmarks in Apple-CORE. This deliverable sets out our choice of benchmarks for Apple-CORE and the rationale for the choice.

Results: The chosen applications are representative of a varied selection of applications from high-performance computation, embedded systems and mainstream applications.

Conclusion: The applications selected range broadly over the benchmarking suites that we consider important representatives of high-performance computation, embedded systems and mainstream applications, both for demonstrating industrial relevance and for challenging partners to exploit parallelism in applications considered difficult to parallelise.

Approved by the project coordinator: Yes Date of delivery to the EC: 28.5.2009


Document history

When Who Comments

2009/03/15 Daniel Rolls Initial version

2009/03/19 Daniel Rolls Review of suites

2009/03/19 Carl Joslin NPB1

2009/03/19 Frank Penczek Spelling

2009/03/21 Daniel Rolls Added NAS suites

2009/03/27 Daniel Rolls More suites

2009/03/29 Daniel Rolls More suites

2009/04/11 Daniel Rolls Restructured

2009/05/04 Artjoms Šinkarovs Quicksort, H264, N-Body

2009/05/10 Frank Penczek Added a few references and English fixes

2009/05/13 Frank Penczek Added Talis and Lattice Boltzmann

2009/05/18 Daniel Rolls Added BLAS

2009/05/18 Daniel Rolls Added challenges section

2009/05/22 Daniel Rolls Added intro to selections section, submitted to Chris

2009/05/23 Daniel Rolls Subsections added to selection section and various changes from feedback from partners

Project co-funded by the European Commission within the 7th Framework Programme (2007-11).


Table of Contents

1 Introduction

2 Methodology

3 Benchmarking Suites
  3.1 Benchmark Suites for Embedded Systems
  3.2 Benchmark Suites for High-Performance Computing
  3.3 Benchmark Suites for Mainstream Computing

4 Benchmark Selection
  4.1 Basic Linear Algebra Subprograms (BLAS)
  4.2 The Livermore Loops
  4.3 Advanced Encryption Standard (AES)
  4.4 Sudoku Solver
  4.5 Quicksort
  4.6 Embarrassingly Parallel
  4.7 Conjugate gradient
  4.8 Multigrid
  4.9 3D FFT PDE
  4.10 Integer Sort
  4.11 H.264
  4.12 Bzip2
  4.13 N-Body
  4.14 Hidden Markov Models
  4.15 TVD
  4.16 MTI Radar Application
  4.17 Numerical Lattice Boltzmann Simulation
  4.18 Implementation and Measuring

5 Challenges

6 Conclusion

Appendices


1 Introduction

This document defines the set of benchmarking applications for Apple-CORE. To meet our varied aims we have selected benchmarking algorithms from a range of benchmarking suites representative of our core markets as stated in the DOW. The presented selection is drawn from the extensive range of established benchmark suites to ensure industrial significance as well as diversity. Care has been taken to compose a balanced selection of inherently parallel algorithms and algorithms for which parallelisation is technically challenging. All implementations and measurements from our selections will be published on our online benchmarking system, unibench, which is described in deliverable D2.1a.

In the following section this report develops a methodology for selecting a range of applications and algorithms which we deem indispensable in any fair comparison between emerging and established architectures and technologies. The main objective is to substantiate the claims made about the new technologies developed within Apple-CORE.

In Section 3 we familiarise the reader with the available pool of benchmark suites from which we draw our selection. The selection of applications, kernels and benchmarks is presented in Section 4. This document also exposes the architectural and technical challenges we have encountered while implementing our selection of benchmarks; Section 5 is dedicated to this matter. Section 6 concludes and gives some final remarks.

2 Methodology

The primary intention of this document is to select a range of benchmarks representative of our core markets of

• high-performance computation;

• embedded systems; and

• mainstream applications.

The selection must not consist solely of benchmarks that are easy to parallelise. Benchmarks that are easy to parallelise allow us to see the best-case impact of new architectures and compilers. Benchmarks that are difficult to parallelise allow us to see the limitations of architectures and compilers. Benchmarks with a range of memory access patterns will allow us to exercise the distributed memory system: we need benchmarks across distributed shared memory with both local data and global access patterns.

Benchmarking suites are often composed with particular philosophies, which in turn depend on the target audiences of the suite. Suites exist for the evaluation of languages, platforms, user bases, parallelisable code and application areas. We feel it important to draw upon the efforts of the writers of benchmarking suites to identify core kernels representative of applications in the areas of high-performance computing, embedded systems and mainstream applications.

It is common for benchmarking suites to identify kernels of computation representative of much larger problems. These kernels need to be carefully selected so that it can reasonably be argued that they are truly representative of the applications from which they were taken and also provide a fair means of comparison without bias. This is a difficult and subjective process, and differences in opinion can lead to the creation of yet more benchmarking suites.

In the following section we provide a comprehensive analysis of benchmarking suites categorised by our core markets of high-performance computation, embedded systems and mainstream applications to ensure a broad selection of benchmarks as stated in our aims. By surveying the suites in this way we hope to select a broad and fair representative set from each field. The final selection will consist of benchmarks selected from this analysis and from collaboration with industrial partners.


3 Benchmarking Suites

In this section we set out an extensive list of benchmark suites that we consider important. Most benchmark suites consist of several sub-benchmarks, and some suites also contain sub-suites. We list all sub-suites where applicable and present sub-benchmarks in table form. Within such a table we assign an ID to each sub-benchmark, which we use in later sections to refer to a specific benchmark.

3.1 Benchmark Suites for Embedded Systems

EMBC

This benchmark collection consists of several sub-suites:

• AutoBench automotive/industrial [?]

• ConsumerBench consumer market [?]

• DenBench digital entertainment [?]

• GrinderBench embedded Java platforms [?]

• Networking IP, TCP, routing and QoS benchmarks [?]

• OABench office automation [?]

• TeleBench embedded telecommunication applications [?]

All of the above suites consist of several representative applications; we list the applications from AutoBench and DenBench here to give an overview.

Id Application/Algorithm/Description

EMBCAB1 Angle to Time Conversion
EMBCAB2 Basic Integer and Floating Point
EMBCAB3 Bit Manipulation
EMBCAB4 Cache Buster
EMBCAB5 CAN Remote Data Request
EMBCAB6 Fast Fourier Transform (FFT)
EMBCAB7 Finite Impulse Response (FIR) Filter
EMBCAB8 Inverse Discrete Cosine Transform (iDCT)
EMBCAB9 Inverse Fast Fourier Transform (iFFT)
EMBCAB10 Infinite Impulse Response (IIR) Filter
EMBCAB11 Matrix Arithmetic
EMBCAB12 Pointer Chasing
EMBCAB13 Pulse Width Modulation (PWM)
EMBCAB14 Road Speed Calculation
EMBCAB15 Table Lookup and Interpolation
EMBCAB16 Tooth to Spark


Id Application/Algorithm/Description

EMBCDB1 AES
EMBCDB2 DES
EMBCDB3 Calculating the DENmark and other DENbench Consolidated Scores
EMBCDB4 High-Pass Gray-Scale Filter
EMBCDB5 Huffman Decoding
EMBCDB6 MP3 Decode
EMBCDB7 MPEG-2 Decode
EMBCDB8 MPEG-2 Encode
EMBCDB9 MPEG-4 Decode
EMBCDB10 MPEG-4 Encode
EMBCDB11 RGB to CMYK Conversion
EMBCDB12 RGB to YIQ Conversion
EMBCDB13 RSA

The Extreme Benchmark Suite

The Extreme Benchmark Suite [?] aims to measure high-performance embedded systems. The benchmarks of this suite are:

Id Application/Algorithm/Description

EB1 CJPEG
EB2 802.11a transmitter
EB3 vector addition
EB4 FIR, matrix transpose, inverse cosine transform

The MiBench suite

The MiBench suite [?] comprises multiple benchmark suites, which are comparable to the EMBC suites above. In addition to the already mentioned suites, MiBench contains a sub-suite on cryptographic algorithms:

Id Application/Algorithm/Description

MI1 blowfish encryption
MI2 blowfish decryption
MI3 pgp sign
MI4 pgp verify
MI5 rijndael encryption
MI6 rijndael decryption
MI7 sha

Miscellaneous

For completeness we shall also reference GraalBench for 3D graphics on mobile phones [?].

3.2 Benchmark Suites for High-Performance Computing

SPEC HPG

The SPEC High-Performance Group has developed a benchmark that represents large, real applications from the scientific and technical computing domain. It is suitable for both shared and distributed memory machines. The benchmark suite consists of two main sub-suites:

• MPI2007 benchmark suite for measuring performance of compute-intensive applications using the Message-Passing Interface (MPI) across a wide range of cluster and SMP hardware [?]


• SPECOMP benchmark suite for evaluating performance based on OpenMP [?]

The benchmark suites contain several applications. The applications of MPI2007 are the following:

Id Application/Algorithm/Description

SHPG1 Physics: Quantum Chromodynamics
SHPG2 Fluid Dynamics
SHPG3 Computational Electromagnetics
SHPG4 Fluid Dynamics
SHPG5 Oceanography
SHPG6 Ray Tracing
SHPG7 Molecular Dynamics
SHPG8 Weather Prediction
SHPG9 Finite Element Modeling
SHPG10 Hydrodynamics
SHPG11 Quantum Chemistry
SHPG12 Physics / Hydrodynamics
SHPG13 Fluid Dynamics

The applications of SPECOMP are listed in the following table:

Id Application/Algorithm/Description

SHPG14 quantum chromodynamics
SHPG15 shallow water modeling
SHPG16 multi-grid solver in 3D potential field
SHPG17 parabolic/elliptic partial differential equations
SHPG18 fluid dynamics analysis of oscillatory instability
SHPG19 neural network simulation of adaptive resonance theory
SHPG20 finite element simulation of earthquake modeling
SHPG21 computational chemistry
SHPG22 finite-element crash simulation
SHPG23 solves problems regarding temperature, wind and distribution of pollutants
SHPG24 genetic algorithm code

BioPerf Benchmark Suite

BioPerf [?] is a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of high-performance computer architectures for these workloads.

Id Application/Algorithm/Description

BIP1 Blast, heuristic methods to search sequence databases
BIP2 ClustalW, multiple sequence alignment program
BIP3 HMM, hidden Markov model application for sequencing
BIP4 T-Coffee, sequencing application
BIP5 FASTA, local similarity search algorithm
BIP6 Glimmer, gene locator
BIP7 Phylip, phylogeny inference package
BIP8 GRAPPA, genome rearrangements analysis
BIP9 CE, combinatorial extensions
BIP10 Predator, protein structure analysis

High-Performance Challenge Benchmark

The high-performance challenge benchmark suite [?] consists of seven individual tests.


Id Application/Algorithm/Description

HPCP1 HPL - the Linpack TPP benchmark, which measures the floating-point rate of execution for solving a linear system of equations.
HPCP2 DGEMM - measures the floating-point rate of execution of double-precision real matrix-matrix multiplication.
HPCP3 STREAM - a simple synthetic benchmark program that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for a simple vector kernel.
HPCP4 PTRANS (parallel matrix transpose) - exercises the communications where pairs of processors communicate with each other simultaneously. It is a useful test of the total communications capacity of the network.
HPCP5 RandomAccess - measures the rate of integer random updates of memory (GUPS).
HPCP6 FFT - measures the floating-point rate of execution of double-precision complex one-dimensional Discrete Fourier Transform (DFT).
HPCP7 Communication bandwidth and latency - a set of tests to measure latency and bandwidth of a number of simultaneous communication patterns; based on b_eff (effective bandwidth benchmark).

HPL Benchmark Suite

The HPL benchmark suite [?] aims to be a portable implementation of the high-performance Linpack benchmark for distributed-memory computers. It consists of one main application which solves linear equations using LU decomposition and distributes data across all available nodes of a distributed memory machine.

Id Application/Algorithm/Description

HPL1 HPL algorithm consisting of LU, panel factorisation, panel broadcast, look-ahead, update, backward substitution and solution checking.

Parsec Benchmark Suite

The Princeton Application Repository for Shared-Memory Computers (PARSEC) [?] is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors.

It consists of the following applications:


Id Application/Algorithm/Description

PARSEC1 blackscholes, option pricing with the Black-Scholes Partial Differential Equation (PDE)
PARSEC2 bodytrack, body tracking of a person
PARSEC3 canneal, simulated cache-aware annealing to optimize routing cost of a chip design
PARSEC4 dedup, next-generation compression with data deduplication
PARSEC5 facesim, simulates the motions of a human face
PARSEC6 ferret, content similarity search server
PARSEC7 fluidanimate, fluid dynamics for animation purposes with the Smoothed Particle Hydrodynamics (SPH) method
PARSEC8 freqmine, frequent itemset mining
PARSEC9 raytrace, real-time raytracing
PARSEC10 streamcluster, online clustering of an input stream
PARSEC11 swaptions, pricing of a portfolio of swaptions
PARSEC12 vips, image processing
PARSEC13 x264, H.264 video encoding

The Parsec benchmark suite has been compared [?] to SPLASH-2, which is listed below.

SPLASH-2

The SPLASH-2 benchmark suite [?] consists of parallel applications to study shared-memory systems.

Id Application/Algorithm/Description

SPL1 Barnes, n-body problem
SPL2 Cholesky, matrix factorisation
SPL3 FFT, fast Fourier transformation
SPL4 FMM, n-body problem
SPL5 LU, matrix decomposition
SPL6 Ocean, large-scale ocean movement simulation
SPL7 Radiosity, equilibrium distribution of light
SPL8 Radix, integer radix sort
SPL9 Raytrace, three-dimensional scene ray tracing
SPL10 Volrend, three-dimensional volume rendering
SPL11 Water-Nsq, evaluation of forces and potentials in water molecules
SPL12 Water-Sp, as Water-Nsq, but with a more efficient algorithm

BLAS and Related Benchmarks

The Basic Linear Algebra Subprograms [?, ?, ?, ?, ?, ?] are routines that provide standard building blocks for performing basic vector and matrix operations. The subprograms are divided into three levels: the Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable and widely available, they are commonly used in the development of linear algebra software.


Id Application/Algorithm/Description

BLAS1 dot product
BLAS2 constant times a vector plus a vector
BLAS3 set up rotation
BLAS4 apply rotation
BLAS5 modify/apply rotation
BLAS6 copy x into y
BLAS7 swap x and y
BLAS8 2-norm (Euclidean)
BLAS9 sum of absolute values
BLAS10 index of element having maximum absolute value

Closely related to BLAS are linear algebra libraries and packages such as LAPACK [?], LINPACK [?], EISPACK [?] and self-optimising libraries such as ATLAS [?, ?]. We do not describe them in detail in this document and subsume them under BLAS.

NAS Parallel Benchmarks

The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications.

• NPB1 These are the original benchmarks [?]. Vendors and others implement the detailed specifications in the NPB 1 report, using algorithms and programming models appropriate to their different machines.

• NPB2 These benchmark implementations [?] are written and distributed by NAS. They are intended to be run with little or no tuning, and approximate the performance a typical user can expect to obtain for a portable parallel program.

• NPB3 These are parallel implementations using OpenMP, High Performance Fortran (HPF) and Java, respectively [?, ?, ?]. They were derived from the NPB-serial implementations released with NPB 2.

• GridNPB3 These benchmarks are designed specifically to rate the performance of computational grids [?]. Each of the four benchmarks consists of a collection of communicating tasks derived from the NPB. They symbolize distributed applications typically run on grids. The distribution contains serial and concurrent reference implementations in Fortran and Java, including a version that uses Globus as grid middleware.

The NPB3 benchmark suite consists of the following applications:


Id Application/Algorithm/Description

NAS1 BT, CFD application that uses an implicit algorithm to solve 3-dimensional (3-D) compressible Navier-Stokes equations.
NAS2 SP, simulated CFD application that has a similar structure to BT. The finite-differences solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions.
NAS3 LU, simulated CFD application that uses symmetric successive over-relaxation (SSOR).
NAS4 FT, contains the computational kernel of a 3-D Fast Fourier Transform (FFT)-based spectral method.
NAS5 MG, uses a V-cycle MultiGrid method to compute the solution of the 3-D scalar Poisson equation.
NAS6 CG, Conjugate Gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix.
NAS7 EP, Embarrassingly Parallel benchmark. It generates pairs of Gaussian random deviates according to a specific scheme.

Livermore Loops

The Livermore loops are a selection of common patterns of nested loops from scientific source code from the Lawrence Livermore National Laboratory [?]. They were originally written in Fortran. The suite was designed to evaluate vector computers and their parallelising compilers at the Lawrence Livermore labs and represents a set of kernel loops with a range of characteristics: of varying lengths (LML2 has loops varying from 2 to N/2 in powers of 2), with and without dependencies, and from simple to very complex loops. They range from embarrassingly parallel to very hard to parallelise.

Id Application/Algorithm/Description

LML1 hydrodynamics fragment
LML2 Cholesky conjugate gradient
LML3 inner product
LML4 linear system solver
LML5 tridiagonal linear system solver
LML6 general linear recurrence equations
LML7 equation of state fragment
LML8 alternating direction implicit integration
LML9 integrate predictors
LML10 difference predictors
LML11 first sum
LML12 first difference
LML13 2-D particle in a cell
LML14 1-D particle in a cell
LML15 casual Fortran
LML16 Monte Carlo search
LML17 implicit conditional computation
LML18 2-D explicit hydrodynamics fragment
LML19 general linear recurrence equations
LML20 discrete ordinates transport
LML21 matrix-matrix product
LML22 Planckian distribution
LML23 2-D implicit hydrodynamics fragment
LML24 location of a first array minimum


3.3 Benchmark Suites for Mainstream Computing

SPEC benchmarks

The Standard Performance Evaluation Corporation offers a variety of benchmark suites, each targeting a certain application area. We list only the most current instance of each suite here.

SPEC CPU2006: This benchmark [?] is designed to provide performance measurements that can be used to compare compute-intensive workloads on different computer systems. SPEC CPU2006 contains two benchmark suites: CINT2006 for measuring and comparing compute-intensive integer performance, and CFP2006 for measuring and comparing compute-intensive floating-point performance.

SPEC Viewperf 10: This benchmark [?] measures graphics and workstation performance. It compares the performance of systems running in higher-quality graphics modes that use full-scene anti-aliasing, and measures how effectively graphics subsystems scale when running multithreaded graphics content.

SPEC APC: Graphics benchmark that measures performance based on the workload of a typical user, including functions such as wireframe modeling, shading, texturing, lighting, blending, inverse kinematics, object creation and manipulation, editing, scene creation, particle tracing, animation and rendering. It is also available as a benchmarking tool for selected CAD applications.

SPEC HPC: Benchmarks for high-performance computing. These are treated in detail in Section 3.2.

SPEC jAppServer: This benchmark [?] is designed to measure the performance of J2EE 1.3 application servers.

SPEC jbb2005: This benchmark [?] was developed for evaluating the performance of servers running typical Java business applications. JBB2005 represents an order processing application for a wholesale supplier. The benchmark can be used to evaluate the performance of hardware and software aspects of Java Virtual Machine (JVM) servers. This benchmark has been ported.

SPEC jms2007: This benchmark [?] is aimed at evaluating the performance of enterprise message-oriented middleware servers based on JMS (Java Message Service). It provides a standard workload and performance metrics for competitive product comparisons, as well as a framework for in-depth performance analysis of enterprise messaging platforms.

SPEC jvm2008: A benchmark suite [?] for measuring the performance of a Java Runtime Environment (JRE), containing several real-life applications and benchmarks focusing on core Java functionality.

SPEC mail2009: The benchmark [?] measures the performance of enterprise mail servers compliant with the SMTP and IMAP4 protocols. The benchmark mimics a workload derived from a 40,000-user corporate mail store with SSL/TLS encryption.

SPEC sfs2008: This benchmark [?] is designed to evaluate the speed and request-handling capabilities of file servers utilizing the NFSv3 and CIFS protocols. This benchmark has a relatively long history [?].

SPEC Power ssj2008: A benchmark for evaluating the power and performance characteristics [?] of volume server class computers. This benchmark addresses the performance of server-side Java applications.

SPEC web2005: The benchmark [?, ?] implementation emulates users sending browser requests over broadband Internet connections to a web server. It provides several workloads (non-exhaustive): a banking site (HTTPS), an e-commerce site (HTTP/HTTPS mix) and a support site (HTTP). Dynamic content is implemented in PHP and JSP.


We focus here on the SPEC benchmark suites that are relevant to this project.

The SPEC CPU benchmark suite consists of the following integer operation based algorithms:

Id Application/Algorithm/Description

SPECCPUI1 perlbench
SPECCPUI2 bzip2
SPECCPUI3 gcc
SPECCPUI4 mcf
SPECCPUI5 gobmk
SPECCPUI6 hmmer
SPECCPUI7 sjeng
SPECCPUI8 libquantum
SPECCPUI9 h264ref
SPECCPUI10 omnetpp
SPECCPUI11 astar
SPECCPUI12 xalancbmk

and the following floating-point operation based algorithms:

Id Application/Algorithm/Description

SPECCPUF1 bwaves
SPECCPUF2 gamess
SPECCPUF3 milc
SPECCPUF4 zeusmp
SPECCPUF5 gromacs
SPECCPUF6 cactusADM
SPECCPUF7 leslie3d
SPECCPUF8 namd
SPECCPUF9 dealII
SPECCPUF10 soplex
SPECCPUF11 povray
SPECCPUF12 calculix
SPECCPUF13 GemsFDTD
SPECCPUF14 tonto
SPECCPUF15 lbm
SPECCPUF16 wrf
SPECCPUF17 sphinx3
SPECCPUF18 specrand

The SPEC HPC benchmark suite is treated in detail in Section 3.2.

Futuremark Benchmark Suites

These benchmark suites are often referred to as xMARK benchmarks.

• 3DMARK 3DMark Vantage [?] is a standard PC gaming benchmark. It employs 3D graphics generated in real time to measure system performance.

• PCMARK The benchmark [?] collection aims to provide a selection of common tasks, such as viewing and editing photos, video, music and other media, gaming, communications, productivity and security, to measure standard PC performance.

These benchmarks are popular among PC enthusiasts and are included here merely for completeness. The suites are mainly targeted at consumer-grade PCs and have no further relevance here, as the provided measurements are covered by other benchmark suites as well.


BAPCO Benchmark Suite

The Business Applications Performance Corporation offers benchmark suites to characterise the performance of mainstream computing machinery.

• SYSmark [?, ?] is a benchmark used to characterize the performance of a business client. It contains a variety of workloads that represent a range of activities that a desktop worker may encounter.

• MobileMark [?] provides a performance-qualified battery life metric based on real-world applications.

The measurements are based on typical office applications, as outlined below.

Id Application/Algorithm/Description

SMM1 Adobe After Effects 7.0, special effects to be added to movies
SMM2 Adobe Photoshop CS2, images manipulated and compressed
SMM3 Sony Vegas 7.0, digital movie
SMM4 Microsoft Windows Media Encoder 9 Series, compressed soundtracks and videos
SMM5 Adobe Illustrator CS2, images manipulated
SMM6 Microsoft PowerPoint 2003, presentation
SMM7 Adobe Photoshop CS2, images manipulated and compressed
SMM8 Adobe Flash 8, vector graphics and animation
SMM9 Adobe Illustrator CS2, images manipulated
SMM10 Microsoft Project 2003, project management
SMM11 Microsoft Excel 2003, calculation sheets
SMM12 Microsoft Outlook 2003, emails, calendars, scheduler
SMM13 Microsoft PowerPoint 2003, slide presentations
SMM14 Microsoft Word 2003, formatted text documents
SMM15 WinZip Computing WinZip Pro 10.0, compressed archives
SMM16 Autodesk 3ds max 8.0, 3D rendered images, 3D vector scenes/models
SMM17 SketchUp 5, 3D scene

Mediabench

Mediabench [?, ?] consists of applications and a few kernels mainly targeted at multimedia applications. They include audio and video codecs, video converters and renderers. Other applications include cryptographic applications and character recognition systems.

The Mediabench benchmark suite consists of the following applications:

Id Application/Algorithm/Description

MB1 JPEG, Lossy compression for still images. (Encode/Decode)
MB2 MPEG, Lossy compression for video. (Encode/Decode)
MB3 GSM, European standard for speech transcoding. (Encode/Decode)
MB4 G721, CCITT voice compression. (Encode/Decode)
MB5 PGP, IDEA/RSA public-key encryption algorithm. (Encrypt/Decrypt)
MB6 PEGWIT, Elliptical curve public-key encryption algorithm. (Encrypt/Decrypt)
MB7 Ghostscript, Postscript interpreter.
MB8 Mesa, Public OpenGL clone. (Mipmap, Osdemo, Texgen)
MB9 RASTA, Speech recognition.
MB10 EPIC, Experimental image compression. (Encode/Decode)
MB11 ADPCM, Adaptive differential pulse code modulation audio coding. (Encode/Decode)


Dwarf  Dwarf Name                    Coverage
1      Dense Linear Algebra          4.1
2      Sparse Linear Algebra         4.7
3      Spectral Methods              4.9
4      N-Body                        4.13
5      Structured Grids              4.8
6      Unstructured Grids            4.7
7      Monte Carlo                   4.6
8      Combinational Logic           4.3
9      Graph Traversal               4.11
10     Dynamic Programming           4.11
11     Back-track and Branch+Bound   4.5
12     Graphical Models              4.14
13     Finite State Machines         4.11

Table 1: The coverage of the thirteen Dwarfs from Berkeley's position paper [?].

4 Benchmark Selection

Composing a benchmarking suite is tough: HiPEAC note that in academia there is usually 'an absence of real applications' [?]. An in-depth analysis of available benchmarks has shown that a surprisingly small core collection of kernels is represented repeatedly in benchmark suites of varying focus. We have drawn upon common applications from the benchmarks and have found representative kernels that span these suites to provide our selection of kernels.

Some of the applications we have selected were suggested by our industrial collaborators. These applications are a media codec, an embedded signal-processing application and a partial differential equation solver commonly used in CFD applications. The rest were selected as representative kernels and applications. CFD applications are usually included in suites for high-performance computing: the NAS benchmark suites [?], SPECOMP [?], MPI2007 [?] and Parsec [?] suites all have CFD applications or kernels.

Compression is heavily represented in Section 3, particularly with multimedia codecs [?, ?, ?]. A kernel for a compression algorithm from SPEC will be implemented along with the multimedia codec suggested by an industrial partner. The first will be a representative kernel whilst the second will consist of SaC code linked with C code.

We have specifically selected some applications and common kernels representing applications from the survey above. From the benchmarks in Section 3, many core kernels and applications reappear regularly. Linear algebra kernels are commonly represented in benchmarking suites [?, ?, ?, ?] and have a dedicated suite [?] that we can make use of to ensure a representative selection of linear algebra kernels. This suite of small, generic but highly challenging problems complements the Livermore Loops selected in the previous benchmark selection gathering process. The Livermore Loops are commonly known and representative of real loops in scientific software.

To broaden our coverage of high-performance computing, a representative of a physics simulation, the n-body problem [?, ?], was selected. To add bioinformatics to our repertoire, a Hidden Markov Model was added. This features in SPECCPUI6 and BIP3.

Lastly we chose to add an encryption cipher and a search-based problem. Encryption ciphers feature regularly [?, ?, ?, ?], so a well-respected block cipher was selected as a representative of an industrial-grade cipher. Search-based problems are extremely common, so we thought it essential to implement a tree-search-based algorithm.

In addition to the survey of benchmarking suites, we also looked closely at 'The Landscape of Parallel Computing Research' review paper from Berkeley [?]. They take a fresh look at kernel selection for a broad range of applications whilst trying not to be too application specific.


Each kernel is referred to as a dwarf. The paper sets out with a very similar agenda to ours. They try not to focus on particular markets and have representative experts with expertise in 'circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming and numerical analysis' [?]. It would be naive for us not to make use of this comprehensive survey. Table 1 illustrates our coverage of the Dwarfs.

4.1 Basic Linear Algebra Subprograms (BLAS)

These algebra routines implement operations on vectors and matrices. We will implement a carefully selected subset of BLAS to cover the major operations in linear algebra. We will also cover a range of memory access patterns, since this suite includes operations with global access patterns, like matrix multiplication, and operations with local access patterns, like matrix addition. This covers [BLAS1-BLAS10] and relates to [EB4, HPCP1, HPL1].
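To make the contrast in access patterns concrete, the following C sketch (ours; names and signatures are illustrative, not those of any BLAS library) shows a Level 1 operation beside a naive Level 3 operation:

    #include <stddef.h>

    /* Level 1: y := a*x + y (AXPY, cf. BLAS2 above). Each iteration is
       independent and touches only nearby data, giving a purely local
       access pattern. */
    void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Level 3: C := A*B for n-by-n row-major matrices (naive GEMM).
       Every element of C reads a full row of A and a full column of B,
       so the access pattern is global rather than local. */
    void dgemm_naive(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }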

This benchmark already has many published results [?, ?, ?, ?, ?, ?], which will give us the opportunity to test absolute performance against other architectures and not just their speedup due to concurrency.

4.2 The Livermore Loops

All Livermore loops [LML1-LML24] are included in the selection because, as a set, they represent a range of kernels that will challenge any parallelising compiler. These loops contain a mix of both local and global memory access patterns. They include both independent and dependent loops, which need to be recognised and mapped to the restrictions on the Microgrid's shared register communication and synchronisation, as well as reductions, which can be recognised and reordered. The more complex loops will also challenge the low-level compiler in optimising the use of register resources.
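As an illustration of the spread of difficulty, two of the kernels are sketched below in C (transcribed loosely from the well-known Fortran originals; array extents and scalar parameters are illustrative):

    /* Livermore kernel 1 (hydrodynamics fragment): every iteration is
       independent, so the loop parallelises directly. */
    void kernel1(int n, double q, double r, double t,
                 double *x, const double *y, const double *z)
    {
        for (int k = 0; k < n; k++)
            x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
    }

    /* Livermore kernel 5 (tri-diagonal elimination, below diagonal):
       x[i] depends on x[i-1], a loop-carried dependency that must be
       recognised and either sequentialised or restructured. */
    void kernel5(int n, double *x, const double *y, const double *z)
    {
        for (int i = 1; i < n; i++)
            x[i] = z[i] * (y[i] - x[i - 1]);
    }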

4.3 Advanced Encryption Standard (AES)

We will fully implement the Advanced Encryption Standard (AES) block encryption standard for symmetric encryption. The standard is simple enough to feasibly implement and run on our toolchain and is the current recognised standard for block encryption [?]. As with most block ciphers, it consists of rounds which must be run in sequence, and consequently it is not an obvious choice for parallelisation. We hope to exploit parallelism within each round itself to illustrate the advantages of rapid thread creation and to show what performance can be achieved without using dedicated hardware accelerators for cryptography. This covers [EMBCDB1, MI5, MI6].
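The round structure that constrains parallelisation is sketched below for AES-128 in C. This is a structural sketch only, not our implementation: the four standard round steps are left as prototypes. Within a round, SubBytes acts on all 16 state bytes independently and MixColumns acts on the four columns independently, which is where fine-grained threads can be created.

    #include <stdint.h>

    /* Standard AES round steps (prototypes only); each operates on the
       4x4 byte state. */
    void sub_bytes(uint8_t state[16]);
    void shift_rows(uint8_t state[16]);
    void mix_columns(uint8_t state[16]);
    void add_round_key(uint8_t state[16], const uint8_t *round_key);

    /* AES-128 block encryption: ten rounds that must run in sequence,
       so parallelism has to come from inside each round. The expanded
       key holds 11 round keys of 16 bytes each. */
    void aes128_encrypt_block(uint8_t state[16],
                              const uint8_t round_keys[176])
    {
        add_round_key(state, &round_keys[0]);
        for (int round = 1; round < 10; round++) {
            sub_bytes(state);
            shift_rows(state);
            mix_columns(state);
            add_round_key(state, &round_keys[16 * round]);
        }
        sub_bytes(state);            /* the final round omits MixColumns */
        shift_rows(state);
        add_round_key(state, &round_keys[160]);
    }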

4.4 Sudoku Solver

As an illustrative representative of a divide-and-conquer search-based problem with backtracking, we are producing Sudoku solver implementations in SaC and SL. The term divide and conquer subsumes a set of common and universal problems. This relates to [BIP2, SPL1, HPCP4].
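The shape of the search is illustrated by the following C sketch (ours, and deliberately naive): each candidate digit opens an independent subtree, which is where speculative parallelism can be introduced at the price of wasted work on failing branches.

    #include <stdbool.h>

    /* A 9x9 grid; 0 marks an empty cell. */
    static bool legal(int g[9][9], int r, int c, int v)
    {
        for (int i = 0; i < 9; i++)
            if (g[r][i] == v || g[i][c] == v)
                return false;
        int br = 3 * (r / 3), bc = 3 * (c / 3);
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                if (g[br + i][bc + j] == v)
                    return false;
        return true;
    }

    /* Depth-first search with backtracking over the 81 cells. */
    bool solve(int g[9][9], int cell)
    {
        if (cell == 81)
            return true;
        int r = cell / 9, c = cell % 9;
        if (g[r][c] != 0)
            return solve(g, cell + 1);
        for (int v = 1; v <= 9; v++) {
            if (legal(g, r, c, v)) {
                g[r][c] = v;
                if (solve(g, cell + 1))
                    return true;
                g[r][c] = 0;         /* backtrack */
            }
        }
        return false;
    }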

4.5 Quicksort

Quicksort [?] was chosen as a representative of classic computer science algorithms that is well understood and documented. Whilst we concentrate on the demands of as broad a range of real-world problems as possible, we feel it important to have a 'classic' and widely understood benchmark algorithm in both SL and SaC to ensure that we have results accessible to those with little knowledge of benchmark kernels and outside of the application domains from which we have borrowed. Quicksort will be an interesting test of the memory subsystem, since the size of the array returned from each recursive call in this doubly recursive algorithm cannot be determined statically, and parallelising quicksort will require parallelising these recursive calls.
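A minimal sequential C sketch (ours) shows where the parallelism lies: the two recursive calls at the end operate on disjoint sub-arrays whose sizes are only known at run time.

    static void swap(double *a, double *b) { double t = *a; *a = *b; *b = t; }

    /* In-place quicksort with Hoare-style partitioning around a
       middle-element pivot. */
    void quicksort(double *a, long lo, long hi)
    {
        if (lo >= hi)
            return;
        double pivot = a[(lo + hi) / 2];
        long i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j)
                swap(&a[i++], &a[j--]);
        }
        quicksort(a, lo, j);   /* these two calls are independent and */
        quicksort(a, i, hi);   /* can, in principle, run concurrently */
    }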


4.6 Embarrassingly Parallel

This NAS kernel requires little inter-thread communication and as such is worthwhile for showing achievable upper limits. The particular benchmark chosen by NASA as an example of an embarrassingly parallel benchmark was Gaussian random deviate pair generation [?]. Each randomly generated pair is independent, and as such the job of pair generation can be split between identical cores with a linear speedup in the number of cores. The benchmark includes tabulation and verification jobs that must happen after generation to produce a final result. Memory access patterns are local. This collection of results itself requires communication between threads, and so even the 'embarrassingly parallel' benchmark is not as embarrassingly parallel as one might expect! This is [NAS7].
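The core of the computation can be sketched in C as below. We use the polar acceptance-rejection method and the C library generator purely for brevity; the NAS specification prescribes its own linear congruential generator and a particular tabulation of the pairs.

    #include <math.h>
    #include <stdlib.h>

    /* Generate one pair of independent Gaussian deviates: draw (x, y)
       uniformly in the square and accept when the point falls inside
       the unit circle. Pairs are independent of each other, so the
       generation of N pairs splits trivially across cores; only the
       final tabulation is a reduction. */
    void gaussian_pair(double *g1, double *g2)
    {
        double x, y, t;
        do {
            x = 2.0 * rand() / (double)RAND_MAX - 1.0;
            y = 2.0 * rand() / (double)RAND_MAX - 1.0;
            t = x * x + y * y;
        } while (t >= 1.0 || t == 0.0);
        double f = sqrt(-2.0 * log(t) / t);
        *g1 = x * f;
        *g2 = y * f;
    }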

4.7 Conjugate gradient

This conjugate gradient numerical method from the parallel NAS benchmark suite computes on a large, sparse matrix and is considered typical of computations on unstructured grids. These are identified as important in Berkeley's paper (see Table 1). The conjugate gradient method solves systems of linear equations with symmetric and positive-definite matrices. This particular benchmark estimates the smallest eigenvalue of the randomly generated, large, sparse matrix. This is [NAS6].
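The iteration itself is compact. The following C sketch (ours) uses a dense matrix-vector product for brevity, whereas the NAS kernel's matrix is sparse, which makes the product's memory accesses irregular.

    #include <stddef.h>

    static double dot(size_t n, const double *a, const double *b)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    /* Conjugate gradient iteration for A x = b with A symmetric
       positive-definite; r, p and q are caller-provided work vectors
       of length n. The matrix-vector product dominates the cost and
       is the main parallelisable kernel. */
    void cg(size_t n, const double *A, const double *b, double *x,
            double *r, double *p, double *q, int iters)
    {
        for (size_t i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
        double rho = dot(n, r, r);
        for (int it = 0; it < iters; it++) {
            for (size_t i = 0; i < n; i++) {          /* q = A p */
                double s = 0.0;
                for (size_t j = 0; j < n; j++) s += A[i * n + j] * p[j];
                q[i] = s;
            }
            double alpha = rho / dot(n, p, q);
            for (size_t i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            double rho_new = dot(n, r, r);
            double beta = rho_new / rho;
            rho = rho_new;
            for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        }
    }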

4.8 Multigrid

A kernel from the NAS suite that uses the V-cycle multigrid numerical method to find a solution for a partial differential equation known as the discrete Poisson problem. We consider this benchmark important since multigrid methods are commonly used to solve partial differential equations, as was acknowledged in Berkeley's position paper [?]. This is [NAS5, SHPG16].

4.9 3D FFT PDE

A complete Fast Fourier Transform (FFT) from the parallel NAS suite used to solve a three-dimensional partial differential equation. FFT algorithms compute Discrete Fourier Transforms and have applications in a whole host of other fields, from signal processing to data compression. Three-dimensional FFTs in particular are important in many computational fluid dynamics applications. This benchmark gives us the opportunity to test implementations with known global memory access patterns. This is [NAS4] and also [HPCP6, SPL3].
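A recursive radix-2 Cooley-Tukey sketch in C (ours; the benchmark's FFT is specified differently and is applied along each dimension of a 3-D grid) shows both the independent sub-transforms and the global, strided combination step.

    #include <complex.h>
    #include <math.h>
    #include <stddef.h>

    /* In-place FFT of a (n a power of two); scratch is a caller-provided
       buffer of length n whose contents are clobbered. */
    void fft(size_t n, double complex *a, double complex *scratch)
    {
        if (n <= 1)
            return;
        size_t h = n / 2;
        for (size_t i = 0; i < h; i++) {
            scratch[i]     = a[2 * i];       /* even-index samples */
            scratch[h + i] = a[2 * i + 1];   /* odd-index samples  */
        }
        fft(h, scratch, a);                  /* the two half-size   */
        fft(h, scratch + h, a);              /* transforms are independent */
        const double pi = acos(-1.0);
        for (size_t k = 0; k < h; k++) {     /* global butterfly step */
            double complex w = cexp(-2.0 * pi * I * (double)k / (double)n);
            a[k]     = scratch[k] + w * scratch[h + k];
            a[k + h] = scratch[k] - w * scratch[h + k];
        }
    }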

4.10 Integer Sort

This is a parallel sort algorithm, known as a bucket sort, from the parallel NAS suite. Bucket sort works by defining buckets as non-overlapping ranges over all possible values and then placing each element of the unsorted array into a 'bucket' based on the range it falls within. The final sorted array is produced by sorting the values from each bucket in order and placing them into the new array. The algorithm is considered important when using particle method codes in CFD applications [?]. This is [NAS1-NAS3, SPL8].
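A sequential C sketch of the one-value-per-bucket (counting) variant is given below (ours; the NAS kernel ranks keys rather than moving records, and a parallel version uses per-thread histograms merged by reduction).

    #include <stdlib.h>

    /* Sort n integer keys in the range [0, max_key) from in[] into out[].
       Phase 1 builds a histogram, phase 2 turns it into bucket offsets
       by a prefix sum, phase 3 scatters each key to its final place. */
    void bucket_sort(size_t n, int max_key, const int *in, int *out)
    {
        size_t *count = calloc((size_t)max_key + 1, sizeof *count);
        for (size_t i = 0; i < n; i++)        /* histogram */
            count[in[i] + 1]++;
        for (int k = 0; k < max_key; k++)     /* prefix sum */
            count[k + 1] += count[k];
        for (size_t i = 0; i < n; i++)        /* scatter */
            out[count[in[i]]++] = in[i];
        free(count);
    }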

4.11 H.264

The H.264 standard defines a set of video compression standards. We have had extensive discussions on H.264 with NXP; it is one of their key algorithms. H.264 finds application in the consumer market, currently most prominently in high-definition television broadcasting.

Extensive research effort has been spent on producing parallelised implementations of the algorithms comprising H.264 [?, ?, ?]. H.264 is difficult because, although there is concurrency within blocks, the amount of computation per block is highly variable and dependent on the data within the blocks. There is also concurrency across blocks, but it comes with complex dependencies.


Rather than implementing H.264 in SaC in full, we are implementing part of a large C application in SaC and hope to show how performance could be increased for signal-processing applications on Microgrid architectures. This relates to [SPECCPUI9, PARSEC13].

4.12 Bzip2

The bzip2 [?] algorithm is an example of an extensively used and well-known compression algorithm. We feel it important to attempt this sort of algorithm to test the Microgrid against the demands of general-purpose computing and to discover how it copes with them. This covers [SPECCPUI2, SMM16, EMBCDB5].

4.13 N-Body

The N-body problem describes the evolution of a system of N bodies that obey the classical laws of mechanics. Each body interacts with all the other bodies. N-body algorithms have numerous applications in areas such as astrophysics, molecular dynamics and plasma physics [?]. N-body is also part of the SPEC [?] benchmark suite and is a Dwarf from Berkeley's review paper [?].

The classic implementation of an N-body simulation is of complexity O(N²) and is not suitable for large-scale simulations. It is, however, widely used in benchmark suites for its simplicity. The existence of an N-body implementation on the Microgrid architecture is therefore imperative for direct comparison to classical architectures.
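The O(N²) force evaluation is easily sketched in C (ours; G, eps and the data layout are illustrative). The outer loop parallelises directly, since each body's acceleration is computed independently of the others.

    #include <math.h>
    #include <stddef.h>

    typedef struct { double x, y, z, mass; } Body;

    /* One O(N^2) step: accumulate the gravitational acceleration on
       each body from all other bodies. eps is a softening term that
       avoids the singularity when two bodies nearly coincide. */
    void accelerations(size_t n, const Body *b, double (*acc)[3],
                       double G, double eps)
    {
        for (size_t i = 0; i < n; i++) {
            double ax = 0.0, ay = 0.0, az = 0.0;
            for (size_t j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = b[j].x - b[i].x;
                double dy = b[j].y - b[i].y;
                double dz = b[j].z - b[i].z;
                double r2 = dx * dx + dy * dy + dz * dz + eps * eps;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                ax += G * b[j].mass * dx * inv_r3;
                ay += G * b[j].mass * dy * inv_r3;
                az += G * b[j].mass * dz * inv_r3;
            }
            acc[i][0] = ax; acc[i][1] = ay; acc[i][2] = az;
        }
    }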

A commonly used optimisation of this problem is the Barnes-Hut algorithm [?], which reduces the complexity to O(N log N). This algorithm is well suited to parallelisation [?] and is therefore interesting in its own right. An implementation of the Barnes-Hut algorithm allows us to compare the performance of SaC and SL against an extensive set of scalable, parallel implementations. This covers [SPL1].

4.14 Hidden Markov Models

Hidden Markov Models are statistical models based on Markov chains. They are used in many areas, for example gene prediction, speech recognition, speech synthesis and finance. This covers [SPECCPUI6, BIP3].

4.15 TVD

TVD, or total variation diminishing, is an equation solver with applications in physics. It uses the TVD property to solve a partial differential equation to simulate a shock wave. The implementations will be compared to Fortran implementations, written by an experienced computational scientist, run on legacy but parallel hardware. This touches [SHPG2, SHPG13, PARSEC7].

4.16 MTI Radar Application

The MTI radar application [?] is a signal-processing application from the embedded and real-time systems domain. It is under development by Thales Research and has become a joint project of that institution and the University of Hertfordshire. High throughput and low latency are the natural requirements for this application, in which adaptive filter algorithms are applied to incoming radar burst echoes.

The main goal of this project is to re-design existing low-level but high-performance C implementations of signal-processing functions and required algorithms in a high-level language without sacrificing runtime performance.

This project demonstrates the suitability of SaC as an implementation language and of the Microgrid architecture as an execution platform for applications from this industrial domain. For this reason it is imperative to include this application in our pool of chosen applications and benchmarks.


4.17 Numerical Lattice Boltzmann Simulation

The Lattice Boltzmann method [?, ?] is used for the numerical simulation of physical phenomena and serves as an alternative to classical solvers of partial differential equations. The primary domain of application is fluid dynamics. The LBM is an invaluable tool in fundamental research, as it keeps the cycle between the elaboration of a theory and the formulation of a corresponding numerical model short [?].

Recent contact with The MathWorks [?] has revealed difficulties in designing a parallel implementation of the required algorithms that offers adequate performance for their renowned Matlab product range. The SaC implementation of Lattice Boltzmann algorithms is under way, and their performance on the new architecture will be evaluated within Apple-CORE.

The opportunity to break into this segment of the market must not be missed, and we therefore include Lattice Boltzmann in our benchmark selection.

4.18 Implementation and Measuring

This subsection discusses the methodology for implementing, publishing and measuring the kernels and applications selected above. SL is a system language and is not designed to be used by application programmers: no benefit can be obtained by writing large applications in SL. For large SaC applications, handwritten SL will not be produced, but automatically parallelised versions of the SaC code will still be compared.

Details about benchmarks and specifications appear on unibench. Doing this promotes careful documentation. Many benchmarks will appear on unibench multiple times. Each benchmark has its own specification and implementations, sometimes including a reference implementation. Once inputs are specified, unibench can automatically determine which experiments it can run and when to run them, although these schedules can be overridden. Once unibench has collected useful results, diagrams are produced and the results can be publicly released.

Benchmarks for SaC will initially be measured on legacy hardware for ease of testing. Performance results will be automatically collected by unibench using all available routes through the toolchain.

Results will be collected in emulated time for a given clock frequency. The latter is estimated based on a detailed analysis of the implementation of the DRISC core structures [?], which has also been used to estimate the number of cores that could be implemented for a given die and feature size. This will enable us to make realistic comparisons of performance with other published results for the same code on different targets. Providing speedup figures without showing real performance is a very old trick in benchmarking: by using inefficient algorithms (e.g. software floating point rather than hardware), speedup figures can often be inflated. So although we are interested in speedup, we will not consider it in isolation from real performance.

5 Challenges

The disruptive technology of Apple-CORE has created various technical challenges, causing us to look carefully at the technical implications of our choice. This created particular challenges when evaluating applications for our industrial partners, since we were forced to find representative kernels for specific applications rather than kernels representative of an application area.

Challenges arose from the speed limitations of running simulations, the lack of system calls, unfinished IO libraries and restrictions on the SaC features available on early versions of the toolchain. Our experiences are summarised below.

The Apple-CORE toolchain is intentionally large and has grown with the recent addition of the SL language. Figure 1 shows the paths through the toolchain from SL source alone. Simulators are inherently substantially slower than the hardware they simulate, which limits the practical cumulative processing time of any application that can be run on a simulator. Prior to the porting of a microkernel onto the platform, applications making use of system calls will have to be adapted.


[Figure 1 (diagram): SL source code is translated by the SL compiler along four chains: SL to sequential C (pure C code, compiled by a native C compiler), SL to µTC-ptl (C++ code using µTC-ptl, compiled by a native C++ compiler; the POSIX Threads implementation), SL to C + MT asm (C code with inline MT assembly, compiled by a stock GCC cross-compiler to Linux-Alpha/Sparc; the "alternate" Microgrid chain) and SL to µTC (compiled by the µTC core compiler; future work). After post-compilation filtering and linking, the resulting native or Microgrid executables are executed as a native run, a simulated run or an FPGA run.]

Figure 1: A diagram to show the many paths in the toolchain, run either natively, through a simulator or on FPGA.


IO-bound applications in particular are affected by this: applications that use IO functions from libraries would require an infeasible investment of man-hours to rewrite without these library function calls, and whilst the linking of C modules is planned, we feel it would not currently be a wise investment of time to implement complex IO functions. Consequently, we currently restrict ourselves to very basic IO functions.

The languages used with the toolchain have to be restricted too. For example, the feasibility of porting SaC programs to the Microgrid architecture has to be carefully considered, since recursive concurrency creation has hard limits on a given processor and, by implication, over an entire cluster where creates are distributed to all cores in the cluster. These limits are imposed by the family table size, thread table size and register file size. Solutions to these limits have to be explored; they include sequentialisation at some stage in the concurrency tree or delegation to new clusters where another set of resources can be exploited.

This will require some code to be modified; for example, some SaC code will be inlined. This creates challenges when benchmarking recursive versions of quicksort, for instance. A restricted subset of the SaC standard library is viable for simulation on the current iteration of the toolchain, as reflected by the 'sacsl' language on unibench. Parts of the SaC standard library will need to be temporarily rewritten to allow for the limited IO on early implementations and prototypes.

Taking the above challenges into consideration, our benchmark selection needs to be comprehensively analysed so that a feasible range of applications representative of our core markets can be covered.

Due to the technical challenges mentioned above, no large applications (tens of thousands of lines) will be completely implemented in SL or SaC. Instead we will need to carefully select kernels representative of the applications we wish to consider. Chosen kernels must be suitably computationally bound so that meaningful measurements can be taken, but the kernels must still be representative of the applications they stand for. The limits are largely architectural and not imposed by SL, although SL may add a small constant (e.g. one family table entry to allow for function calls).

6 Conclusion

This document gave the selection of application kernels for high-performance computation, embedded systems and mainstream applications that will be implemented in SL and SaC for Apple-CORE. The benchmarks in this document, the latest implementations and the latest results are publicly viewable on our online, automated benchmark system, unibench. To view these visit

unibench.apple-core.info

and click on benchmarks, followed by Apple-CORE.