1
Research Accelerator for MultiProcessing
Dave Patterson, UC Berkeley, January 2006
+ RAMP collaborators: Arvind (MIT), Krste Asanović (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley)
2
Conventional Wisdom in Computer Architecture
• Old: multiplies are slow, memory access is fast
• New: "Memory wall": memory is slow, multiplies are fast (200 clocks to DRAM, 4 clocks for a multiply)
• Old: power is free, transistors are expensive
• New: "Power wall": power is expensive, transistors are free (can put more on a chip than you can afford to turn on)
• Old: uniprocessor performance 2X / 1.5 yrs
• New: Power Wall + Memory Wall = Brick Wall; uniprocessor performance now only 2X / 5 yrs
• Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years); more instances of simpler processors are more power efficient
3
Uniprocessor Performance (SPECint)
[Chart: performance relative to the VAX-11/780, log scale from 1 to 10,000, plotted 1978-2006, with annotated growth rates of 25%/year, 52%/year, and 20%/year]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: 20%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
4
Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10-micron PMOS, 11 mm2 chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3-micron NMOS, 60 mm2 chip
• RISC II shrinks to ~0.02 mm2 at 65 nm; a 125 mm2 chip in 0.065-micron CMOS holds the equivalent of 2312 RISC II + FPU + Icache + Dcache cores
• Processor is the new transistor?
• Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
• Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
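As a sanity check on those shrink numbers, here is a minimal back-of-envelope sketch (assuming area scales with the square of the feature size; all inputs are the slide's own figures):

```python
# Back-of-envelope: scale the RISC II die from 3-micron NMOS to 65 nm,
# assuming area shrinks with the square of the feature size.

risc2_area_mm2 = 60.0     # RISC II die area at 3 microns (slide figure)
old_feature_um = 3.0
new_feature_um = 0.065    # 65 nm

scaled_area = risc2_area_mm2 * (new_feature_um / old_feature_um) ** 2
print(f"RISC II at 65 nm: ~{scaled_area:.3f} mm^2")    # ~0.028 mm^2 (slide: ~0.02)

# Bare cores on a 125 mm^2 die; the slide's 2312 count also budgets
# an FPU and I/D caches per core, so it is roughly half this figure.
print(f"Bare RISC II cores per 125 mm^2: ~{125 / scaled_area:.0f}")  # ~4400
```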
5
Déjà vu all over again?
"… today's processors … are nearing an impasse as technologies approach the speed of light …"
David Mitchell, The Transputer: The Time Is Now (1989)
• The Transputer had bad timing: custom multiprocessors strove to stay ahead of uniprocessors, but procrastination was rewarded with 2X sequential performance every 1.5 years
"We are dedicating all of our future product development to multicore designs. … This is a sea change in computing."
Paul Otellini, President, Intel (2004)
• All microprocessor companies have switched to MP; procrastination is now penalized: 2X sequential performance every 5 years
• Biggest programming challenge: going from 1 to 2 CPUs
6
Problems with Sea Change
1. Algorithms, programming languages, compilers, operating systems, architectures, libraries, … are not ready for 1000 CPUs / chip
2. Software people don't start working hard until hardware arrives
  • 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
3. How do we do research in a timely fashion on 1000-CPU systems in algorithms, compilers, OS, architectures, … without waiting years between HW generations?
7
Characteristics of an Ideal Academic CS Research Supercomputer?
• Scale: hard problems at 1000 CPUs
• Cheap: fits 2006 funding of academic research
• Cheap to operate, small, low power: saves $ again
• Community: share SW, training, ideas, …
• Simplifies debugging: high SW churn rate
• Reconfigurable: test many parameters, imitate many ISAs, many organizations, …
• Credible: results translate to real computers
• Performance: run real OS and full apps, get results overnight
8
Build Academic SC from FPGAs
• Since ~25 CPUs fit in one Field Programmable Gate Array (FPGA), a 1000-CPU system needs only ~40 FPGAs
  - 16 32-bit simple "soft core" RISC CPUs at 150 MHz in 2004 (Virtex-II)
  - FPGA generations every 1.5 yrs: ~2X CPUs, ~1.2X clock rate
• HW research community does the logic design ("gate shareware") to create an out-of-the-box MPP that runs standard binaries of OS and applications
• Gateware: processors, caches, coherency, Ethernet interfaces, switches, routers, … (some free from open-source hardware)
• E.g., a 1000-processor, standard-ISA, binary-compatible, 64-bit, cache-coherent supercomputer @ 200 MHz/CPU in 2007
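A quick sketch of that sizing, plus the generational projection (the ~2X CPUs and ~1.2X clock per generation figures come from the slide; everything else is simple arithmetic):

```python
import math

# Sizing: how many FPGAs for a 1000-CPU system at ~25 soft cores per FPGA?
cpus_per_fpga = 25
target_cpus = 1000
print(f"FPGAs needed: {math.ceil(target_cpus / cpus_per_fpga)}")   # 40

# Projection: each ~1.5-year FPGA generation gives ~2X CPUs, ~1.2X clock.
cpus, clock_mhz = 16, 150    # 16 32-bit soft cores at 150 MHz (Virtex-II, 2004)
for gen in range(1, 4):
    cpus, clock_mhz = cpus * 2, clock_mhz * 1.2
    print(f"+{gen} generation(s): ~{cpus} CPUs/FPGA at ~{clock_mhz:.0f} MHz")
```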
9
Why Is RAMP Good for Research MPPs?

                       SMP                   Cluster               Simulate                RAMP
Scalability (1k CPUs)  C                     A                     A                       A
Cost (1k CPUs)         F ($40M)              C ($2-3M)             A+ ($0M)                A ($0.1-0.2M)
Cost of ownership      A                     D                     A                       A
Power/Space            D (120 kW, 12 racks)  D (120 kW, 12 racks)  A+ (0.1 kW, 0.1 racks)  A (1.5 kW, 0.3 racks)
Community              D                     A                     A                       A
Observability          D                     C                     A+                      A+
Reproducibility        B                     D                     A+                      A+
Reconfigurability      D                     C                     A+                      A+
Credibility            A+                    A+                    F                       A
Performance (clock)    A (2 GHz)             A (3 GHz)             F (0 GHz)               C (0.1-0.2 GHz)
GPA                    C                     B-                    B                       A-
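The GPA row is just an average of each column's letter grades; a minimal sketch for the RAMP column (the grade scale is an assumption, a standard 4.0 scale with A+ = 4.3):

```python
# Recompute the "GPA" row for the RAMP column from its letter grades,
# assuming a standard 4.0 scale with A+ = 4.3 (the exact scale is a guess).
scale = {"A+": 4.3, "A": 4.0, "B+": 3.3, "B": 3.0, "B-": 2.7,
         "C": 2.0, "D": 1.0, "F": 0.0}
ramp = ["A", "A", "A", "A", "A", "A+", "A+", "A+", "A", "C"]
print(f"RAMP GPA: {sum(scale[g] for g in ramp) / len(ramp):.2f}")  # ~3.9, i.e. A-
```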
10
RAMP 1 Hardware
• BEE2: Berkeley Emulation Engine 2
• Completed Dec. 2004 (14x17 inch, 22-layer PCB)
• Module: 5 Virtex-II FPGAs, 18 banks of DDR2-400 memory, 20 10GigE connectors
• Administration/maintenance ports: 10/100 Ethernet, HDMI/DVI, USB
• ~$4K in bill of materials (w/o FPGAs or DRAM)
11
Multiple-Module RAMP 1 Systems
• 8 compute modules (plus power supplies) in an 8U rack-mount chassis
• 2U single-module tray for developers
• Many topologies possible
• Disk storage: via disk emulator + Network Attached Storage
12
Quick Sanity Check
• BEE2 uses old FPGAs (Virtex-II), 4 banks of DDR2-400 per compute FPGA
• 16 32-bit Microblazes per Virtex-II FPGA, 0.75 MB of memory for caches: 32 KB direct-mapped I-cache, 16 KB direct-mapped D-cache per CPU
• Assume 150 MHz and a CPI of 1.5 (4-stage pipe)
• I$ miss rate is 0.5% for SPECint2000; D$ miss rate is 2.8% for SPECint2000, with 40% loads/stores
• BW need/CPU = 150 MHz / 1.5 CPI * 4 B * (0.5% + 40% * 2.8%) = 6.4 MB/s
• BW need/FPGA = 16 * 6.4 = 100 MB/s
• Memory BW/FPGA = 4 banks * 200 MHz * 2 (DDR) * 8 B = 12,800 MB/s
• Plenty of room left for tracing, …
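The same arithmetic as a runnable sketch (all inputs taken from the slide):

```python
# Memory-bandwidth sanity check for 16 Microblazes on one Virtex-II (slide figures).

clock_hz   = 150e6    # assumed CPU clock
cpi        = 1.5      # 4-stage pipeline
word_bytes = 4        # 32-bit accesses
i_miss     = 0.005    # I$ miss rate, SPECint2000
d_miss     = 0.028    # D$ miss rate, SPECint2000
ld_st      = 0.40     # fraction of instructions that are loads/stores

bw_cpu = (clock_hz / cpi) * word_bytes * (i_miss + ld_st * d_miss)
print(f"BW need/CPU:  {bw_cpu / 1e6:.1f} MB/s")        # ~6.5 MB/s (slide rounds to 6.4)

bw_fpga = 16 * bw_cpu
print(f"BW need/FPGA: {bw_fpga / 1e6:.0f} MB/s")       # ~100 MB/s

# Available: 4 banks of DDR2-400 = 200 MHz x 2 transfers/cycle x 8 bytes each.
mem_bw = 4 * 200e6 * 2 * 8
print(f"Memory BW/FPGA: {mem_bw / 1e6:,.0f} MB/s")     # 12,800 MB/s
```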
13
RAMP FAQ on ISAs
• Which ISA will you pick?
  - Goal is a replaceable ISA/CPU + L1 cache, with the rest of the infrastructure unchanged (L2 cache, router, memory controller, …)
• What do you want from a CPU?
  - Standard ISA (binaries, libraries, …), simple (area), 64-bit (coherency), double-precision floating point (apps)
  - Multithreading? As an option, but we want to get to 1000 independent CPUs
• When do you need it? 3Q06
• Will RAMP people port my ISA, fix my ISA?
  - Our plates are full already: Type A vs. Type B gateware; router, memory controller, cache coherency, L2 cache, disk module, and a protocol for each; integration, testing
14
Handicapping ISAs
• Got it: Power 405 (32b), SPARC v8 (32b), Xilinx Microblaze (32b)
• Very likely: SPARC v9 (64b)
• Likely: IBM Power 64b
• Probably (haven't asked): MIPS32, MIPS64
• Not likely: x86
• Even less likely: x86-64
• "We'll sue you": ARM
15
RAMP Development Plan
1. Distribute systems internally for RAMP 1 development
  - Xilinx agreed to pay for production of a set of modules for the initial contributing developers and the first full RAMP system
  - Others could be available if we can recover costs
2. Release a publicly available out-of-the-box MPP emulator
  - Based on a standard ISA (IBM Power, Sun SPARC, …) for binary compatibility
  - Complete OS/libraries
  - Locally modify RAMP as desired
3. Design the next-generation platform for RAMP 2
  - Based on 65 nm FPGAs (2 generations later than Virtex-II)
  - Pending results from RAMP 1, Xilinx will cover hardware costs for the initial set of RAMP 2 machines
  - Find a 3rd party to build and distribute systems (at near-cost); open-source the RAMP gateware and software
  - Hope RAMP 3, 4, … become self-sustaining
• NSF/CRI proposal pending to help support the effort: 2 full-time staff (one HW/gateware, one OS/software); look for grad-student support at the 6 RAMP universities from industrial donations
16
RAMP Milestones 2006
• Red (SU): goal: start; target: 1Q06; CPUs: 8 32b Power hard cores; details: transactional memory SMP
• Blue (Cal): goal: scale; target: 3Q06; CPUs: 1024 32b Microblaze soft cores; details: cluster, MPI
• White 1.0/2.0/3.0/4.0: goal: features; targets: 2Q06/3Q06/4Q06/1Q07; CPUs: 64 hard PPC, then 128? soft 32b, then 64? soft 64b, then multiple ISAs; details: cache coherent, shared address, deterministic, debug/monitor, commercial ISA
17
The Stone Soup of Architecture Research Platforms
• I/O: Patterson
• Monitoring: Kozyrakis
• Net Switch: Oskin
• Coherence: Hoe
• Cache: Asanović
• PPC: Arvind
• x86: Lu
• Glue support: Chiou
• Hardware: Wawrzynek
18
Gateware Design Framework
• Insight: almost every large building block fits inside an FPGA today; what doesn't is between chips in a real design
• Supports both cycle-accurate emulation of detailed, parameterized machine models and rapid functional-only emulation
• Carefully accounts for target clock cycles
• Units can be written in any hardware design language (will work with Verilog, VHDL, Bluespec, C, …)
• RAMP Design Language (RDL) describes the plumbing that connects the units
19
Gateware Design Framework
• A design is composed of units that send messages over channels via ports
• Units (10,000+ gates): CPU + L1 cache, DRAM controller, …
• Channels (~ FIFOs): lossless, point-to-point, unidirectional, in-order message delivery, …
• [Diagram: a sending unit's port "DataOut" (signals DataOut, __DataOut_READY, __DataOut_WRITE) feeds a channel that delivers to a receiving unit's port "DataIn" (signals DataIn, __DataIn_READ, __DataIn_READY)]
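To make those channel semantics concrete, here is a minimal Python sketch of the READY/WRITE-style handshake described above (the class and method names are illustrative, not actual RDL):

```python
from collections import deque

class Channel:
    """Lossless, point-to-point, unidirectional, in-order message delivery
    between a sending unit's port and a receiving unit's port."""
    def __init__(self, capacity=4):
        self.fifo = deque()
        self.capacity = capacity

    # Sender side (cf. __DataOut_READY / __DataOut_WRITE)
    def ready_to_write(self):
        return len(self.fifo) < self.capacity

    def write(self, msg):
        assert self.ready_to_write(), "sender must check READY before WRITE"
        self.fifo.append(msg)

    # Receiver side (cf. __DataIn_READY / __DataIn_READ)
    def ready_to_read(self):
        return len(self.fifo) > 0

    def read(self):
        assert self.ready_to_read(), "receiver must check READY before READ"
        return self.fifo.popleft()   # in-order: strict FIFO discipline

# Usage: the sending unit pushes only when the channel is ready; the
# receiving unit pulls messages in the order they were sent.
ch = Channel()
if ch.ready_to_write():
    ch.write("cache-line-fill")
if ch.ready_to_read():
    print(ch.read())    # -> "cache-line-fill"
```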
20
Status
• Submitted NSF proposal August 2005
• Biweekly teleconferences (since June '05)
• IBM and Sun donating a commercial-ISA, simple, industrial-strength CPU + FPU
• Technical report, RDL document
• RAMP 1 / RDL short course and board distribution in Berkeley for 40 people @ 6 schools, Jan '06
• FPGA workshop @ HPCA 2/06, @ ISCA 6/06
• ramp.eecs.berkeley.edu
21
RAMP Uses (Internal)
• Internet-in-a-Box: Patterson
• TCC: Kozyrakis
• Dataflow: Oskin
• Reliable MP: Hoe
• 1M-way MT: Asanović
• Bluespec: Arvind
• x86: Lu
• Net-uP: Chiou
• BEE: Wawrzynek
22
Multiprocessing Watering Hole
• Killer app: all CS research and industrial advanced development
• RAMP attracts many communities to a shared artifact; cross-disciplinary interactions accelerate innovation in multiprocessing
• RAMP as the next standard research platform? (e.g., VAX/BSD Unix in the 1980s)
• Gathered around RAMP: parallel file systems, thread scheduling, multiprocessor switch design, fault insertion to check dependability, data center in a box, Internet in a box, dataflow languages/computers, security enhancements, router design, compiling to FPGAs, parallel languages
23
Supporters (wrote letters to NSF)
• Gordon Bell (Microsoft), Ivo Bolsens (Xilinx CTO), Norm Jouppi (HP Labs), Bill Kramer (NERSC/LBL), Craig Mundie (MS CTO), G. Papadopoulos (Sun CTO), Justin Rattner (Intel CTO), Ivan Sutherland (Sun Fellow), Chuck Thacker (Microsoft), Kees Vissers (Xilinx)
• Doug Burger (Texas), Bill Dally (Stanford), Carl Ebeling (Washington), Susan Eggers (Washington), Steve Keckler (Texas), Greg Morrisett (Harvard), Scott Shenker (Berkeley), Ion Stoica (Berkeley), Kathy Yelick (Berkeley)
RAMP Participants: Arvind (MIT), Krste Asanović (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley)
24
Conclusions
• RAMP as a system-level time machine: preview the computers of the future to accelerate HW/SW generations
  - Trace anything, reproduce everything, tape out every day
  - FTP a new supercomputer overnight and boot it in the morning
  - Clone it to check results (as fast in Berkeley as in Boston?)
  - Emulate a massive multiprocessor, a data center, or a distributed computer
• Carpe diem
  - Systems researchers (HW & SW) need the capability
  - FPGA technology is ready today, and getting better every year
  - Stand on shoulders vs. toes: standardize on a design framework and on the multi-year Berkeley effort on FPGA platforms (Berkeley Emulation Engine, BEE2)
  - Architecture researchers get the opportunity to immediately aid colleagues via gateware (as SW researchers have done in the past)
• "Multiprocessor research watering hole": accelerate research in multiprocessing via a standard research platform; hasten the sea change from sequential to parallel computing
25
Backup Slides
26
UT FAST
• 1 MHz to 100 MHz, cycle-accurate, full-system, multiprocessor simulator
  - Well, not quite that fast right now, but we are using an embedded 300 MHz PowerPC 405 to simplify the design
• x86: boots Linux and Windows, targeting 80486- to Pentium M-like designs
  - Heavily modified Bochs; supports instruction trace and rollback
• Working on a "superscalar" model; have a straight-pipeline 486 model with TLBs and caches
• Statistics gathered in hardware: very little if any probe effect
• Work started on tools to semi-automate microarchitectural and ISA-level exploration
  - Orthogonality of models makes both simpler
Derek Chiou, UTexas
27
Example: Transactional Memory
• Processors/memory hierarchy that support transactional memory
• Hardware/software infrastructure for performance monitoring and profiling; will be general for any type of event
• Transactional coherence protocol
Christos Kozyrakis, Stanford
28
Example: PROTOFLEX
• Hardware/software cosimulation/test methodology
• Based on the FLEXUS C++ full-system multiprocessor simulator
• Can swap out individual components to hardware
• Used to create and test a non-blocking MSI invalidation-based protocol engine in hardware
James Hoe, CMU
29
Example: WaveScalar Infrastructure
• Dynamic routing switch
• Directory-based coherency scheme and engine
Mark Oskin, U. Washington
30
Example RAMP App: "Internet in a Box"
• Building blocks also serve distributed computing
• RAMP vs. clusters (Emulab, PlanetLab):
  - Scale: RAMP O(1000) vs. clusters O(100)
  - Private use: at ~$100K, every group can have one
  - Develop/debug: reproducibility, observability
  - Flexibility: modify modules (SMP, OS)
  - Heterogeneity: connect to diverse, real routers
• Explore via repeatable experiments as parameters and configurations vary, vs. observations on a single (aging) cluster that is often idiosyncratic
David Patterson, UC Berkeley
31
Why Is RAMP Attractive? Priorities for Research Parallel Computers
1a. Cost of purchase
1b. Cost of ownership (staff to administer it)
1c. Scalability (1000 much better than 100 CPUs)
4. Power/space (machine-room cooling, number of racks)
5. Community synergy (share code, …)
6. Observability (non-obtrusively measure and trace everything)
7. Reproducibility (to debug, run experiments)
8. Flexibility (change for different experiments)
9. Credibility (faithfully predicts real hardware behavior)
10. Performance (as long as experiments are not too slow)
Insight: commercial priorities are radically different from research priorities.
32
Related Approaches (1)
• Quickturn, Axis, IKOS, Thara:
  - FPGA- or special-processor-based gate-level hardware emulators
  - Synthesizable HDL is mapped to the array for cycle- and bit-accurate netlist emulation
• RAMP's emphasis is on emulating high-level architecture behaviors
  - Hardware and supporting software provide architecture-level abstractions for modeling and analysis
  - Targets architecture and software research
  - Provides a spectrum of tradeoffs between speed and accuracy/precision of emulation
• RPM at USC in the early 1990s:
  - Up to only 8 processors
  - Only the memory controller implemented with configurable logic
33
Related Approaches (2)
• Software simulators
• Clusters (standard microprocessors)
• PlanetLab (distributed environment)
• Wisconsin Wind Tunnel (used the CM-5 to simulate shared memory)
All suffer from some combination of: slowness, inaccuracy, poor scalability, unbalanced computation/communication, and target inflexibility.