Rethinking Component-Based Software Engineering
Don Batory, Bryan Marker, Rui Gonçalves, Robert van de Geijn, and Janet Siegmund
Department of Computer Science, University of Texas at Austin
Austin, Texas 78746
DxT-1
Introduction
• Software Engineering (SE) largely aims at techniques and tools to aid masses of programmers whose code is used by hordes
• these programmers need all the help they can get
• In many areas, programming tasks are so difficult that only a few expert programmers can do them – and their code is used by hordes
• these experts need all the help they can get, too
DxT-2
Our Focus is CBSE for…
• Dataflow domains:
– nodes are computations
– edges denote node inputs and outputs
• General: Virtual Instruments (LabVIEW), applications of streaming languages…
• Our domains:
• Distributed-Memory Dense Linear Algebra Kernels
• Parallel Relational Query Processors
• Crash Fault-Tolerant File Servers
DxT-3
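The dataflow-graph idea on this slide can be made concrete with a small sketch. All names here (`Node`, `wire`) are hypothetical illustrations, not DxT's actual data structures: nodes are computations, and edges carry values from a producer's output to a consumer's inputs.

```python
# Minimal dataflow-graph sketch (illustrative; names are invented).
# Nodes are computations; edges deliver producer outputs to consumer inputs.

class Node:
    def __init__(self, name, fn):
        self.name = name      # label for the computation
        self.fn = fn          # the computation itself
        self.inputs = []      # upstream nodes feeding this one

    def evaluate(self, cache=None):
        # Evaluate upstream nodes first, then apply this node's computation.
        cache = {} if cache is None else cache
        if self.name not in cache:
            args = [n.evaluate(cache) for n in self.inputs]
            cache[self.name] = self.fn(*args)
        return cache[self.name]

def wire(producer, consumer):
    """Add a dataflow edge: producer's output becomes a consumer input."""
    consumer.inputs.append(producer)

# Example: a tiny graph computing (a + b) * 2
a = Node("a", lambda: 3)
b = Node("b", lambda: 4)
add = Node("add", lambda x, y: x + y)
dbl = Node("double", lambda x: 2 * x)
wire(a, add); wire(b, add); wire(add, dbl)
print(dbl.evaluate())  # → 14
```

The same structure underlies virtual instruments and streaming programs: only the node computations and the wiring differ per domain.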
[Figure: example dataflow graph with computation nodes A, B, C, D connected by edges α, β, δ]
Approach
• CBSE experts produce “Big Bang” spaghetti diagrams (dataflow graphs)
• We derive dataflow graphs from domain knowledge (DxT)
• When we have proofs of each step, the result is correct by construction
• Details later…
DxT-4
Correct By Construction
[Figure: stepwise derivation of a parallel hash join A*B = HJOIN(A, B). HSPLIT partitions A and B into A1…An and B1…Bn; a Bloom filter built from A (BLOOM) prunes each Bi via BFILTER to B’1…B’n; per-partition HJOINs run in parallel; MERGE combines the partial results into A*B]
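The refinement above can be sketched in miniature. This is an illustrative sequential simulation under assumed names (`hsplit`, `bloom_build`, `bloom_filter`, `hjoin`, `parallel_join` are invented for this sketch, not DxT's components): a Bloom filter built from A's join keys prunes B's tuples before the join, and hash-splitting puts matching keys in the same partition so the per-partition joins could run in parallel.

```python
# Sketch of the hash-join refinement: BLOOM/BFILTER prune B, HSPLIT
# partitions both inputs, HJOIN runs per partition, MERGE concatenates.
import hashlib

def _hashes(key, m, k=3):
    # k bit positions in a filter of size m, derived from one digest.
    h = hashlib.sha256(str(key).encode()).digest()
    return [int.from_bytes(h[4*i:4*i+4], "big") % m for i in range(k)]

def bloom_build(keys, m=1024):           # BLOOM: build filter from A's keys
    bits = [False] * m
    for key in keys:
        for p in _hashes(key, m):
            bits[p] = True
    return bits

def bloom_filter(tuples, bits):          # BFILTER: drop B-tuples that cannot join
    m = len(bits)
    return [t for t in tuples if all(bits[p] for p in _hashes(t[0], m))]

def hsplit(tuples, n):                   # HSPLIT: hash-partition on the join key
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(t[0]) % n].append(t)
    return parts

def hjoin(a_tuples, b_tuples):           # HJOIN: classic in-memory hash join
    table = {}
    for k, v in a_tuples:
        table.setdefault(k, []).append(v)
    return [(k, v, w) for k, w in b_tuples for v in table.get(k, [])]

def parallel_join(A, B, n=4):
    bits = bloom_build([k for k, _ in A])
    a_parts = hsplit(A, n)
    b_parts = hsplit(bloom_filter(B, bits), n)
    out = []                             # MERGE: concatenate partition results
    for ap, bp in zip(a_parts, b_parts):
        out.extend(hjoin(ap, bp))
    return out

A = [(1, "a1"), (2, "a2")]
B = [(2, "b2"), (3, "b3")]
print(parallel_join(A, B))  # → [(2, 'a2', 'b2')]
```

Bloom-filter false positives are harmless here: a B-tuple that slips through BFILTER is still dropped by HJOIN, which is why the refinement preserves the original graph's semantics.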
State of the Art for Distributed-Memory Dense Linear Algebra Kernels
• Portability of DLA kernels is a problem:
• may not work – distributed-memory kernels don’t work on sequential machines
• may not perform well
• the choice of algorithms to use may be different
• cannot “undo” optimizations and reapply others
• if hardware is different enough, kernels are coded from scratch
DxT-5
Why? Because Performance is Key!
• Applications that make DLA kernel calls are common in scientific computing:
• simulation of airflow, climate change, weather forecasting
• Applications are run on extraordinarily expensive machines• time on these machines = $$• higher performance means quicker/cheaper runs or more accurate results
• Application developers naturally want peak performance to justify costs
DxT-6
Distributed DLA Kernels
• Deals with SPMD (Single Program, Multiple Data) architectures• same program is run on each processor but with different inputs
• Expected operations to support are fixed – but with lots of variants
DxT-7
Level 3 Basic Linear Algebra Subprograms (BLAS3): basically matrix-matrix operations
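The SPMD model described above can be illustrated with a toy sketch (names invented; real distributed-memory code would use something like MPI, with one process per rank): every "processor" runs the same program, parameterized only by its rank, on its own slice of the data.

```python
# Toy SPMD illustration: one program, many ranks, different input slices.

def program(args):
    """The single program: identical on every rank, different input slice."""
    rank, chunk = args
    return rank, sum(x * x for x in chunk)    # local partial computation

def spmd_run(data, nprocs=4):
    # Split data so each rank gets its own portion.
    chunks = [data[r::nprocs] for r in range(nprocs)]
    # In a real SPMD setting each call below runs as a separate process
    # (e.g. one MPI rank per node); here we simulate the ranks sequentially.
    partials = [program((r, chunks[r])) for r in range(nprocs)]
    return sum(p for _, p in partials)        # "reduce" the partial results

print(spmd_run(list(range(10))))  # sum of squares 0..9 → 285
```

The point of the slide is that the program text is identical across processors; all variation comes from the rank and the data distribution.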
BLAS3   # of Variants   Operation
Gemm    12              general matrix-matrix multiply
Hemm     8              Hermitian matrix-matrix multiply
Her2k    4              Hermitian rank-2k update
Herk     4              Hermitian rank-k update
Symm     8              symmetric matrix-matrix multiply
Syr2k    4              symmetric rank-2k update
Trmm    16              triangular matrix-matrix multiply
Trsm    16              solving a non-singular triangular system of equations
12 Variants of Distributed Gemm
• Gemm computes C := α·op(A)·op(B) + β·C
where op(A) = A (Normal) or Aᵀ (Transposed), and likewise for op(B)
• Specialize the implementation for distributed memory based on which matrix dimension (m, n, or k) is largest
• 4 transpose combinations × 3 size cases = 12 variants
• Similar distinctions for other operations
DxT-10
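The Gemm definition above can be written as a plain reference implementation. This sketch only pins down the *semantics* C := α·op(A)·op(B) + β·C with the four transpose combinations; it is deliberately unoptimized and says nothing about the distributed variants:

```python
# Reference (unoptimized) semantics of Gemm on nested lists.

def op(M, trans):
    """Apply the op() of the slide: identity (Normal) or transpose."""
    if not trans:
        return M
    return [list(col) for col in zip(*M)]

def gemm(alpha, A, B, beta, C, transA=False, transB=False):
    A, B = op(A, transA), op(B, transB)
    m, k, n = len(A), len(A[0]), len(B[0])
    assert len(B) == k and len(C) == m and len(C[0]) == n
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)] for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
print(gemm(1.0, A, B, 0.0, C))  # → [[19.0, 22.0], [43.0, 50.0]]
```

Each of the 12 distributed variants computes exactly this result; they differ only in how A, B, and C are distributed and communicated, chosen by the transpose flags and which dimension dominates.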
Further
• Want to optimize “LAPACK-level” algorithms, which call DLA and BLAS3 operations:
• solvers
• decomposition functions (e.g., Cholesky factorization)
• eigenvalue problems
• Have to generate high-performance algorithms for these operations too
• Our work mechanizes the decisions of experts on van de Geijn’s FLAME project, in particular the Elemental library (J. Poulson)
• rests on 20 years of polishing and creating elegant, layered designs of DLA libraries and their computations
DxT-11
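For concreteness, here is a textbook unblocked Cholesky factorization sketch: given a symmetric positive-definite A, compute lower-triangular L with A = L·Lᵀ. This is only the mathematical kernel; FLAME/Elemental derive blocked, distributed variants of this same loop, which is what DxT generates and optimizes.

```python
# Unblocked Cholesky factorization (column-by-column), illustrative only.
import math

def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract contributions of earlier columns.
        s = A[j][j] - sum(L[j][p] ** 2 for p in range(j))
        L[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][p] * L[j][p] for p in range(j))) / L[j][j]
    return L

A = [[4.0, 2.0], [2.0, 5.0]]
print(cholesky(A))  # → [[2.0, 0.0], [1.0, 2.0]]
```

Blocked variants replace the scalar operations with BLAS3 calls (Trsm, Syrk/Herk, Gemm), which is why generating fast BLAS3 kernels matters for LAPACK-level performance.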
Performance Results
• Target machines:
• Benchmarked against ScaLAPACK
• the vendor’s standard option for distributed-memory machines; auto-tuned or manually tuned
• the only alternative available for the target machines, except for FLAME
• DxT automatically generated & optimized BLAS3 and Cholesky FLAME algorithms
DxT-12
Machine # of Cores Peak Performance
Argonne’s BlueGene/P (Intrepid) 8,192 27+ TFLOPS
Texas Advanced Computing Center (Lonestar) 240 3.2 TFLOPS
DxT-13
[Figure: bar chart “BLAS3 Performance on Intrepid”, y-axis Performance (GFLOPS), 0–18000, comparing ScaLAPACK vs. DxTer across variants of Gemm (NN, NT, TN, TT), Symm (LL, RL, LU, RU), Syr2k, Syrk, Trmm, and Trsm]
Cholesky Factorization
DxT-14
DxT Not Limited to DLA
• DLA components are stateless – but DxT does not require stateless components
• DxT was originally developed for stateful Crash-Fault-Tolerant Servers
• Correct by construction, can design high-performing programs, and best of all: can teach it to undergrads!
• Gave the project to an undergraduate class of 30+ students
• Had them build Gamma – the classical parallel join algorithm circa 1990 – using the same DxT techniques we used for DLA code generation
• We asked them to compare this with the “big bang” approach, which directly implements the spaghetti diagram (final design)
DxT-15
Preliminary User Study #s
• Compared to “Big Bang”
• 25/28 = 89%
DxT-16
They Really Loved It
DxT-17
I have learned the most from this project than any other CS project I have ever done.
Honestly, I don't believe that software engineers ever have a source (to provide a DxT explanation) in real life. If there was such a thing we would lose our jobs, because there is an explanation which even a monkey can implement.
It's so much easier to implement (using DxT). The big-bang makes it easy to make so many errors, because you can't test each section separately. DxT might take a bit longer, but saves you so much time debugging, and is a more natural way to build things. You won't get lost in your design trying to do too many things at once.
I even made my OS group do DxT implementation on the last 2 projects due to my experience implementing gamma.
What are Secrets Behind DxT?
DxT-18
I’m sorry – I ran out of time…
questions?