The Return of the SIMD Computers:
UCSC Kestrel and Beyond
Andrea Di Blas
School of Engineering, University of California, Santa Cruz
Outline
Introduction: UCSC Kestrel
“Synchronous” applications
“Asynchronous” applications
A long time ago (late 1980s) the computing community had high hopes and expectations for a new kind of architecture, the “Single Instruction-Multiple Data” (SIMD) parallel computers.
Many started building “massively-parallel” SIMD computers with thousands of processors.
However, almost all were short-lived. Their high cost, the ever-increasing power of the evil serial CPUs and, above all, the effort required to program such an unfamiliar architecture, forced big SIMD machines to an early retirement.
By the mid-90s, SIMD machines were already disappearing from Top500, the list of the world’s largest supercomputers.
But in late 1998, a small group at UC Santa Cruz finally had the first working prototype of a new kind of high-performance, low-cost SIMD co-processor.
Originally designed for computational biology, it proved extremely powerful in a variety of other applications.
In the computing galaxy, a new SIMD star was born…
MIMD and SIMD
Multiple Instruction-Multiple Data
Single Instruction-Multiple Data
Image Filters on Kestrel
2D Gaussian filter
Edge detector
2D Gaussian convolution
The 2D Gaussian kernel is separable
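Separability means the 2D kernel is the outer product of a 1D Gaussian with itself, so one 2D pass can be replaced by a horizontal 1D pass followed by a vertical 1D pass. The sketch below is a plain-Python illustration of that identity, not the Kestrel implementation; the function names, the `sigma` value, and the small test image are all assumptions chosen for the example.

```python
import math

def gaussian_1d(size, sigma):
    """Normalized 1D Gaussian taps for an odd kernel size."""
    half = size // 2
    taps = [math.exp(-(i * i) / (2.0 * sigma * sigma))
            for i in range(-half, half + 1)]
    total = sum(taps)
    return [t / total for t in taps]

def convolve_2d(image, kernel):
    """Direct 2D filtering over the 'valid' region (no border padding)."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(cols - k + 1)]
            for r in range(rows - k + 1)]

def convolve_separable(image, taps):
    """Same filter as two 1D passes: horizontal first, then vertical."""
    k = len(taps)
    rows, cols = len(image), len(image[0])
    horizontal = [[sum(taps[j] * image[r][c + j] for j in range(k))
                   for c in range(cols - k + 1)]
                  for r in range(rows)]
    return [[sum(taps[i] * horizontal[r + i][c] for i in range(k))
             for c in range(cols - k + 1)]
            for r in range(rows - k + 1)]

# Hypothetical 12x12 test image with 8-bit values.
image = [[(7 * r + 13 * c) % 256 for c in range(12)] for r in range(12)]

taps = gaussian_1d(5, sigma=1.0)
kernel_2d = [[a * b for b in taps] for a in taps]  # outer product of the taps

direct = convolve_2d(image, kernel_2d)
separable = convolve_separable(image, taps)

max_abs_diff = max(abs(d - s)
                   for d_row, s_row in zip(direct, separable)
                   for d, s in zip(d_row, s_row))
print("max difference between direct and separable:", max_abs_diff)
```

For a k x k kernel the direct form needs k*k multiplications per output pixel, while the two-pass form needs about 2k, which is why the property is worth pointing out.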
2D Gaussian convolution
512x512-pixel image (8 bpp); time in s by kernel size
Kernel size     5x5     7x7     9x9     11x11
CPU time        0.050   0.070   0.070   0.080
Kestrel time    0.016   0.017   0.018   0.019
SPEEDUP         3.12    4.12    3.89    4.21
CPU: 1 GHz Pentium-III, 256 MB RAM, cc -O2
Kestrel runs at 20 MHz!
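The SPEEDUP row is the CPU time divided by the Kestrel time, and the closing remark invites a comparison of clock rates. The short sketch below redoes that arithmetic from the table values. The per-cycle figure is only a crude assumption (speedup multiplied by the clock ratio) and ignores every other architectural difference between the two machines.

```python
# Times in seconds, copied from the 2D Gaussian convolution table.
cpu_s = {"5x5": 0.050, "7x7": 0.070, "9x9": 0.070, "11x11": 0.080}
kestrel_s = {"5x5": 0.016, "7x7": 0.017, "9x9": 0.018, "11x11": 0.019}

# Speedup = CPU time / Kestrel time.
speedup = {size: cpu_s[size] / kestrel_s[size] for size in cpu_s}

CPU_HZ = 1_000_000_000      # 1 GHz Pentium-III
KESTREL_HZ = 20_000_000     # Kestrel at 20 MHz
clock_ratio = CPU_HZ // KESTREL_HZ

# Assumption: a rough "work per clock cycle" ratio, not a measured quantity.
per_cycle_ratio = {size: s * clock_ratio for size, s in speedup.items()}

for size in cpu_s:
    print(f"{size:>5}: speedup {speedup[size]:.2f}, "
          f"per-cycle ratio roughly {per_cycle_ratio[size]:.0f}")
```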
Edge detector
512x512-pixel image (8 bpp)
            time [s]
CPU         0.040
Kestrel     0.018
SPEEDUP     2.22
CPU: 1 GHz Pentium-III, 256 MB RAM, cc -O2
Asynchronous applications
Mandelbrot Set
2D Median filter
“SIMD Phase Programming Model”
Simple methodology to turn a sequential, data-dependent algorithm into a SIMD-parallel one
Can be used with “partitionable” problems
Provides dynamic load balancing without the need for a high-level support system
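To see why per-element load balancing matters on a lockstep machine, consider the toy simulation below. It is not the Kestrel code and does not claim to reproduce the model's actual mechanics. It is a sketch under stated assumptions: each processing element ("lane") owns a partition of work items, each item needs a data-dependent number of iterations, and loading the next item is treated as free. The names `synchronous_steps`, `phased_steps`, and `partitions` are hypothetical.

```python
def synchronous_steps(partitions):
    """All lanes start item k together and wait for the slowest lane."""
    rounds = max(len(items) for items in partitions)
    total = 0
    for k in range(rounds):
        total += max(items[k] for items in partitions if k < len(items))
    return total

def phased_steps(partitions):
    """Each lane moves on to its next item as soon as the current one is done.

    Every pass of the loop models one controller cycle with two phases:
    a 'load' phase for idle lanes and a 'compute' phase for busy lanes.
    Assumption: the load phase costs nothing.
    """
    remaining = [0] * len(partitions)  # iterations left on the current item
    next_idx = [0] * len(partitions)   # index of the next unprocessed item
    steps = 0
    while True:
        # Load phase: only lanes that have finished their item take part.
        for lane, items in enumerate(partitions):
            while remaining[lane] == 0 and next_idx[lane] < len(items):
                remaining[lane] = items[next_idx[lane]]
                next_idx[lane] += 1
        if all(r == 0 for r in remaining):
            break
        # Compute phase: one lockstep iteration for every busy lane.
        remaining = [r - 1 if r > 0 else 0 for r in remaining]
        steps += 1
    return steps

# Hypothetical iteration counts per item, one list per lane.
partitions = [[1, 9, 1], [9, 1, 1], [1, 1, 9], [2, 2, 2]]

print("synchronous lockstep steps:", synchronous_steps(partitions))
print("phased lockstep steps:     ", phased_steps(partitions))
```

In the synchronous version every round costs as much as its slowest item, so the total is a sum of maxima. In the phased version the total is bounded by the busiest single lane, which is never larger and is often much smaller when iteration counts vary widely, as they do for the Mandelbrot set.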
Mandelbrot set
512x512-pixel image (16 bpp); time in s by max # of iterations
Max # of iterations        1000    5000    10000
CPU time                   4.88    22.21   44.37
Kestrel time (synch)       3.65    17.18   34.79
Kestrel time (SPPM)        3.55    8.73    15.11
SPEEDUP (SPPM vs CPU)      1.37    2.54    2.94
SPEEDUP (SPPM vs synch)    1.03    1.97    2.30
CPU: 500 MHz UltraSPARC-II, 640 MB RAM, cc -xO3
2D Median filter
512x512-pixel image (8 bpp)
Window size     5x5     7x7     9x9     11x11
CPU time        0.190   0.370   0.540   0.760
Kestrel time    0.054   0.076   0.105   0.141
SPEEDUP         3.52    4.97    5.14    5.39
CPU: 1 GHz Pentium-III, 256 MB RAM, cc -O2