TRANSCRIPT
CS 258: Parallel Computer Architecture
Lecture 1
Introduction to Parallel Architecture
January 23, 2008, Prof. John D. Kubiatowicz
Lec 1.2, 1/23/08, Kubiatowicz CS258 ©UCB Spring 2008
Computer Architecture Is … "the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logic design, and the physical implementation."
Amdahl, Blaauw, and Brooks, 1964
The Instruction Set: a Critical Interface
[Diagram: software above, hardware below, with the instruction set as the interface between them]
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels
  – Changes very slowly! (Although this is increasing)
• Is there a solid interface for multiprocessors?
  – No standard hardware interface
What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems
• Some broad issues:
  – Models of computation: PRAM? BSP? Sequential Consistency?
  – Resource allocation:
    » how large a collection?
    » how powerful are the elements?
    » how much memory?
  – Data access, communication, and synchronization
    » how do the elements cooperate and communicate?
    » how are data transmitted between processors?
    » what are the abstractions and primitives for cooperation?
  – Performance and scalability
    » how does it all translate into performance?
    » how does it scale?
Topologies of Parallel Machines
• Symmetric Multiprocessor (SMP)
  – Multiple processors in a box with shared-memory communication
  – Current multicore chips look like this
  – Every processor runs a copy of the OS
• Non-uniform shared memory with separate I/O through a host
  – Multiple processors, each with local memory, connected by a general scalable network
  – Extremely light "OS" on each node provides simple services
    » scheduling/synchronization
  – Network-accessible host for I/O
• Cluster
  – Many independent machines connected by a general network
  – Communication through messages
[Diagrams: an SMP (processors P on a shared bus with memory); a mesh of processor/memory (P/M) nodes with a network-attached host; a cluster on a general network]
Conventional Wisdom (CW) in Computer Architecture
1. Old CW: Power is free, but transistors are expensive
   • New CW is the "Power wall": Power is expensive, but transistors are "free"
     – Can put more transistors on a chip than we have the power to turn on
2. Old CW: Only concern is dynamic power
   • New CW: For desktops and servers, static power due to leakage is 40% of total power
3. Old CW: Monolithic uniprocessors are reliable internally, with errors occurring only at pins
   • New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates
Conventional Wisdom (CW) in Computer Architecture
4. Old CW: By building upon prior successes, continue raising the level of abstraction and size of HW designs
   • New CW: Wire delay, noise, cross coupling, reliability, clock jitter, design validation, … stretch the development time and cost of large designs at ≤65 nm
5. Old CW: Researchers demonstrate new architectures by building chips
   • New CW: Cost of 65 nm masks, cost of ECAD, and design time for GHz clocks ⇒ researchers no longer build believable chips
6. Old CW: Performance improves latency & bandwidth
   • New CW: BW improves > (latency improvement)²
Conventional Wisdom (CW) in Computer Architecture
7. Old CW: Multiplies are slow, but loads and stores are fast
   • New CW is the "Memory wall": Loads and stores are slow, but multiplies are fast
     – 200 clocks to DRAM, but even FP multiplies take only 4 clocks
8. Old CW: We can reveal more ILP via compilers and architecture innovation
   – Branch prediction, OOO execution, speculation, VLIW, …
   • New CW is the "ILP wall": Diminishing returns on finding more ILP
9. Old CW: 2X CPU performance every 18 months
   • New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
[Graph: performance relative to VAX-11/780 (log scale, 1 to 10000), 1978-2006, showing 25%/year, 52%/year, and ??%/year trend lines]
Uniprocessor Performance (SPECint)
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006
⇒ Sea change in chip design: multiple “cores” or processors per chip
Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
• A 125 mm² chip in 0.065 micron CMOS = 2312 copies of RISC II + FPU + Icache + Dcache
  – RISC II shrinks to ≈ 0.02 mm² at 65 nm
  – Caches via DRAM or 1-transistor SRAM (www.t-ram.com) or 3D chip stacking
  – Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
• Processor is the new transistor?
ManyCore Chips: The future is here
• "ManyCore" refers to many processors per chip
  – 64? 128? Hard to say an exact boundary
• How to program these?
  – Use 2 CPUs for video/audio
  – Use 1 for word processor, 1 for browser
  – 76 for virus checking???
• Something new is clearly needed here…
• Intel 80-core multicore chip (Feb 2007)
  – 80 simple cores
  – Two floating point engines per core
  – Mesh-like "network-on-a-chip"
  – 100 million transistors
  – 65 nm feature size

  Frequency   Voltage   Power   Bandwidth         Performance
  3.16 GHz    0.95 V    62 W    1.62 Terabits/s   1.01 Teraflops
  5.1 GHz     1.2 V     175 W   2.61 Terabits/s   1.63 Teraflops
  5.7 GHz     1.35 V    265 W   2.92 Terabits/s   1.81 Teraflops
Conventional Wisdom (CW) in Computer Architecture
10. Old CW: Increasing clock frequency is the primary method of performance improvement
    • New CW: Processor parallelism is the primary method of performance improvement
11. Old CW: Don't bother parallelizing your app; just wait and run it on a much faster sequential computer
    • New CW: Very long wait for a faster sequential CPU
      – 2X uniprocessor performance takes 5 years?
      – End of the La-Z-Boy Programming Era
12. Old CW: Less-than-linear scaling ⇒ failure
    • New CW: Given the switch to parallel hardware, even sublinear speedups are beneficial
13. New Moore's Law is 2X processors ("cores") per chip every technology generation, but the same clock rate
    – "This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs …; instead, this … is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional solutions."
      The Parallel Computing Landscape: A Berkeley View, Dec 2006
Déjà vu all over again?
• Multiprocessors imminent in 1970s, '80s, '90s, …
• "… today's processors … are nearing an impasse as technologies approach the speed of light …"
  David Mitchell, The Transputer: The Time Is Now (1989)
• Transputer was premature
  ⇒ Custom multiprocessors strove to lead uniprocessors
  ⇒ Procrastination rewarded: 2X sequential performance / 1.5 years
• "We are dedicating all of our future product development to multicore designs. … This is a sea change in computing."
  Paul Otellini, President, Intel (2004)
• Difference is that all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apples have 2 CPUs)
  ⇒ Procrastination penalized: 2X sequential performance / 5 years
  ⇒ Biggest programming challenge: going from 1 to 2 CPUs
CS258: Information
Instructor: Prof. John D. Kubiatowicz
Office: 673 Soda Hall
Phone: 643-6817
Email: [email protected]
Office Hours: Wed 1:00-2:00 or by appt.
Class: Mon, Wed 2:30-4:00pm, 310 Soda Hall
Web page: http://www.cs/~kubitron/courses/cs258-S08/
Lectures available online by noon on the day of lecture
Email: [email protected]
Signup link on web page (as soon as it is up)
Computer Architecture Topics (252+)
• Instruction Set Architecture
• Pipelining and Instruction-Level Parallelism
  – Pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation
• Memory Hierarchy
  – L1 cache, L2 cache, DRAM
  – Coherence, bandwidth, latency
  – Interleaving, bus protocols, emerging technologies
• Input/Output and Storage
  – Disks, WORM, tape, RAID
• Addressing, protection, exception handling
• Network communication, other processors
• VLSI
Computer Architecture Topics (258)
[Diagram: processor-memory (P-M) nodes with network interfaces on an interconnection network]
• Processor-Memory-Switch
  – Topologies, routing, bandwidth, latency, reliability
  – Network interfaces
• Programming models: shared memory, message passing, data parallelism, transactional memory, checkpoint/restart
• Multiprocessors
• Networks and interconnections
• Programming models / communication styles
• Reliability / fault tolerance
• Everything in the previous slide, but more so!
What will you get out of CS258?
• In-depth understanding of the design and engineering of modern parallel computers
  – technology forces
  – programming models
  – fundamental architectural issues
    » naming, replication, communication, synchronization
  – basic design techniques
    » cache coherence, protocols, networks, pipelining, …
  – methods of evaluation
• From moderate to very large scale
• Across the hardware/software boundary
• Study of REAL parallel processors
  – Research papers, white papers
• Natural consequences??
  – Massive parallelism ⇒ Reconfigurable computing?
  – Message-passing machines ⇒ NOW ⇒ Peer-to-peer systems?
Will it be worthwhile?
• Absolutely!
  – Now, more than ever, industry is trying to figure out how to build these new multicore chips…
• The fundamental issues and solutions translate across a wide spectrum of systems
  – Crisp solutions in the context of parallel machines
• Pioneered at the thin end of the platform pyramid on the most demanding applications
  – migrate downward with time
• Understand implications for software
• Network-attached storage, MEMS, etc.?
[Platform pyramid: SuperServers → Departmental Servers → Workstations → Personal Computers]
Why Study Parallel Architecture?
Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.
Parallelism:
• Provides an alternative to a faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing
• How is instruction-level parallelism related to coarse-grained parallelism??
Is Parallel Computing Inevitable?
This was certainly not clear just a few years ago. Today, however: YES!
• Industry is desperate for solutions!
• Application demands: our insatiable need for computing cycles
• Technology trends: easier to build
• Architecture trends: better abstractions
• Current trends:
  – Today's microprocessors are multiprocessors and/or have multiprocessor support
  – Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!…
Why Me? The Alewife Multiprocessor
• Cache-coherent shared memory
  – Partially in software!
• User-level message passing
• Rapid context switching
• Asynchronous network
• One node per board
Lecture style
• 1-minute review
• 20-minute lecture/discussion
• 5-minute administrative matters
• 25-minute lecture/discussion
• 5-minute break (water, stretch)
• 25-minute lecture/discussion
• Instructor will come to class early & stay after to answer questions
[Graph: attention vs. time, fading around the 20-minute mark and restored by the break and "In Conclusion, …"]
Course Methodology
• Study existing designs through research papers
  – We will read about a number of real multiprocessors
  – We will discuss network router designs
  – We will discuss cache coherence protocols
  – Etc.
• High-level goal:
  – Understand past solutions so that…
  – We can make proposals for ManyCore designs
• This is a critical point in parallel architecture design
  – Industry is really looking for suggestions
  – Perhaps you can make them listen to your ideas???
Research Paper Reading
• As graduate students, you are now researchers.
• Most information of importance to you will be in research papers.
• Ability to rapidly scan and understand research papers is key to your success.
• So: you will read lots of papers in this course!
  – Quick 1-paragraph summaries will be due in class
  – Students will take turns discussing papers
• Papers will be scanned and put on the web page.
TextBook: Two leaders in the field
Text: Parallel Computer Architecture: A Hardware/Software Approach, by David Culler & Jaswinder Singh
Covers a range of topics; we will not necessarily cover them in order.
How will grading work?
No TA this term!
Rough breakdown:
• 20% Paper summaries/presentations
• 30% One midterm
• 40% Research project (work in pairs)
  – meet 3 times with me to see progress
  – give oral presentation
  – give poster session
  – written report like a conference paper
  – 6 weeks of full-time work for 2 people
  – Opportunity to do "research in the small" to help make the transition from good student to research colleague
• 10% Class participation
Application Trends
• Application demand for performance fuels advances in hardware, which enable new applications, which…
  – This cycle drives exponential increase in microprocessor performance
  – Drives parallel architecture harder
    » most demanding applications
  [Cycle: More Performance ⇄ New Applications]
• Programmers willing to work really hard to improve high-end applications
• Need incremental scalability:
  – Need a range of system performance with progressively increasing cost
Metrics of Performance
[Diagram: system levels (Application, Programming Language, Compiler, ISA, Datapath/Control, Function Units, Transistors/Wires/Pins), each with its natural metric:]
  Application: answers per month, operations per second
  Compiler: (millions of) instructions per second (MIPS); (millions of) (FP) operations per second (MFLOP/s)
  Datapath/Control: megabytes per second
  Function units: cycles per second (clock rate)
And what about: programmability, reliability, energy?
Speedup
• Speedup (p processors) = Time (1 processor) / Time (p processors)
• Common mistake:
  – Compare parallel program on 1 processor to parallel program on p processors
• Wrong!:
  – Should compare uniprocessor program on 1 processor to parallel program on p processors
• Why? Keeps you honest
  – It is easy to parallelize overhead.
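The baseline distinction can be made concrete with a short sketch (Python; the timing numbers are purely illustrative, not from the slides):

```python
def speedup(t_baseline, t_parallel):
    """Speedup = Time(baseline on 1 processor) / Time(parallel program on p processors)."""
    return t_baseline / t_parallel

# Hypothetical measurements (illustrative only):
t_best_serial = 10.0   # tuned uniprocessor program
t_parallel_1p = 15.0   # parallel program run on 1 processor (carries all the overhead)
t_parallel_8p = 2.5    # parallel program on 8 processors

honest = speedup(t_best_serial, t_parallel_8p)    # 4.0: correct baseline
inflated = speedup(t_parallel_1p, t_parallel_8p)  # 6.0: the common mistake
```

The second number looks better only because the 1-processor run of the parallel program already pays the parallelization overhead, which is exactly what "it is easy to parallelize overhead" warns about.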
Amdahl's Law
Speedup due to enhancement E:
  Speedup(E) = (ExTime w/o E) / (ExTime w/ E) = (Performance w/ E) / (Performance w/o E)
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.
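Working that assumption out gives the standard Amdahl's Law formula, Speedup = 1 / ((1 − F) + F/S); a minimal sketch:

```python
def amdahl_speedup(fraction_enhanced, factor):
    """Overall speedup when a fraction F of execution time is accelerated
    by a factor S, per Amdahl's Law: Speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / factor)

# Even an arbitrarily large speedup of half the task at most doubles performance:
print(amdahl_speedup(0.5, 1e9))   # ~2.0
```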
Amdahl's Law for parallel programs?

  ExTime_par = ExTime_ser × [(1 − Fraction_par) + Fraction_par / p] + ExTime_overhead(p, stuff)

  Speedup_parallel = 1 / [(1 − Fraction_par) + Fraction_par / p + ExTime_overhead(p, stuff) / ExTime_ser]

Best you could ever hope to do:

  Speedup_maximum = 1 / (1 − Fraction_par)
Worse: Overhead may kill your performance!
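The effect of the overhead term can be sketched numerically (Python; the numbers are illustrative assumptions, not from the slides):

```python
def parallel_speedup(frac_par, p, t_ser, t_overhead):
    """Speedup of a program whose parallelizable fraction frac_par runs on
    p processors, with a fixed parallelization overhead t_overhead added
    to the parallel execution time."""
    t_par = t_ser * ((1.0 - frac_par) + frac_par / p) + t_overhead
    return t_ser / t_par

# 90% parallel code on 16 processors:
print(parallel_speedup(0.9, 16, t_ser=100.0, t_overhead=0.0))   # 6.4
print(parallel_speedup(0.9, 16, t_ser=100.0, t_overhead=50.0))  # ~1.52: overhead dominates
```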
Where is Parallel Arch Going?
Old view: divergent architectures, no predictable pattern of growth.
[Diagram: application software and system software atop competing architectures: SIMD, message passing, shared memory, dataflow, systolic arrays]
• Uncertainty of direction paralyzed parallel software development!
Today
• Extension of "computer architecture" to support communication and cooperation
  – Instruction Set Architecture plus Communication Architecture
• Defines
  – Critical abstractions, boundaries, and primitives (interfaces)
  – Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries, and OS are important bridges today
• Still not enough standardization!
  – Nothing close to the stable x86 instruction set!
P.S. Parallel Revolution May Fail
• John Hennessy, President, Stanford University, 1/07: "…when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry."
  "A Conversation with Hennessy & Patterson," ACM Queue Magazine, 4:10, 1/07.
• 100% failure rate of parallel computer companies
  – Convex, Encore, Inmos (Transputer), MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Thinking Machines, …
• What if IT goes from a growth industry to a replacement industry?
  – If SW can't effectively use 32, 64, … cores per chip ⇒ SW no faster on new computer ⇒ only buy when the computer wears out
  [Graph: millions of PCs sold per year, 1985-2015]
Can programmers handle parallelism?
• Historically: humans not as good at parallel programming as they would like to think!
  – Need a good model to think of the machine
  – Architects pushed on instruction-level parallelism really hard, because it is "transparent"
• Can the compiler extract parallelism?
  – Sometimes
• How do programmers manage parallelism??
  – Language to express parallelism?
  – How to schedule a varying number of processors?
• Is communication explicit (message passing) or implicit (shared memory)?
  – Are there any ordering constraints on communication?
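The explicit/implicit distinction can be illustrated with a toy Python sketch (threads standing in for processors; all names here are hypothetical, chosen for illustration):

```python
import threading
import queue

# Explicit communication (message passing): data moves only via send/receive.
def producer(chan):
    chan.put(42)               # explicit send

def consumer(chan, result):
    result.append(chan.get())  # explicit receive (blocks until data arrives)

chan, result = queue.Queue(), []
t1 = threading.Thread(target=producer, args=(chan,))
t2 = threading.Thread(target=consumer, args=(chan, result))
t1.start(); t2.start(); t1.join(); t2.join()

# Implicit communication (shared memory): threads read/write a common location;
# ordering constraints must be imposed with synchronization (here, a lock).
shared = {"x": 0}
lock = threading.Lock()

def incr():
    with lock:
        shared["x"] += 1

threads = [threading.Thread(target=incr) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# result == [42], shared["x"] == 4
```

In the first half, communication and synchronization are one operation (the receive cannot complete before the send); in the second, communication happens through ordinary loads and stores, and correctness depends on the separate lock.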
Is it obvious that more cores ⇒ more performance?
• AMBER molecular dynamics simulation program
• Starting point was vector code for the Cray-1
• 145 MFLOPS on the Cray90; 406 for the final version on a 128-processor Paragon; 891 on a 128-processor Cray T3D
Granularity
• Is communication fine- or coarse-grained?
  – Small messages vs. big messages
• Is parallelism fine- or coarse-grained?
  – Small tasks (frequent synchronization) vs. big tasks
• If hardware handles fine-grained parallelism, then it is easier to get incremental scalability
• Fine-grained communication and parallelism are harder than coarse-grained:
  – Harder to build with low overhead
  – Custom communication architectures often needed
• Ultimate coarse-grained communication:
  – GIMPS (Great Internet Mersenne Prime Search)
  – Communication once a month
Current Commercial Computing targets
• Relies on parallelism for the high end
  – Computational power determines the scale of business that can be handled
• Databases, online transaction processing, decision support, data mining, data warehousing, …
• Google, Yahoo, …
• TPC benchmarks (TPC-C order entry, TPC-D decision support)
  – Explicit scaling criteria provided
  – Size of enterprise scales with size of system
  – Problem size not fixed as p increases
  – Throughput is the performance measure (transactions per minute, or tpm)
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines are a mainstay in many industries
  – Petroleum (reservoir analysis)
  – Automotive (crash simulation, drag analysis, combustion efficiency)
  – Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
  – Computer-aided design
  – Pharmaceuticals (molecular modeling)
  – Visualization
    » in all of the above
    » entertainment (films like Toy Story)
    » architecture (walk-throughs and rendering)
  – Financial modeling (yield and derivative analysis)
  – etc.
Can anyone afford high-end MPPs???
• ASCI (Accelerated Strategic Computing Initiative)
• ASCI White: built by IBM
  – 12.3 TeraOps, 8192 processors (RS/6000)
  – 6 TB of RAM, 160 TB disk
  – 2 basketball courts in size
  – Program it??? Message passing
Need a new class of applications
• Handheld devices with ManyCore processors!
  – Great potential, right?
• Human-interface applications very important: "The laptop/handheld is the computer"
  – '07: HP sold more laptops than desktops
  – 1B+ cell phones/yr, increasing in function
  – Otellini demoed a "Universal Communicator" (combination cell phone, PC, and video device)
  – Apple iPhone
• User wants increasing performance with weeks or months of battery power
Applications: Speech and Image Processing
[Graph: computational demand, 1 MIPS to 10 GIPS, 1980-1995: sub-band speech coding; CELP speech coding; speaker verification; 200-word isolated speech recognition; telephone number recognition; ISDN-CD stereo receiver; CIF video; 1,000-word continuous speech recognition; 5,000-word continuous speech recognition; HDTV receiver]
• Also CAD, databases, …
Compelling Laptop/Handheld Apps
• Meeting Diarist
  – Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
• Teleconference speaker identifier, speech helper
  – L/Hs used for a teleconference; identifies who is speaking; "closed caption" hint of what is being said
Compelling Laptop/Handheld Apps
• Health Coach
  – Since the laptop/handheld is always with you: record images of all meals, weigh the plate before and after, analyze calories consumed so far
    » "What if I order a pizza for my next meal? A salad?"
  – Since the laptop/handheld is always with you: record the amount of exercise so far, show how your body would look if you maintain this exercise and diet pattern for the next 3 months
    » "What would I look like if I regularly ran less? Further?"
• Face Recognizer / Name Whisperer
  – Laptop/handheld scans faces, matches against an image database, whispers the name in your ear (relies on Content-Based Image Retrieval)
Content-Based Image Retrieval (Kurt Keutzer)
[Diagram: query by example → similarity metric over an image database (1000's of images) → candidate results → final result, with a relevance feedback loop]
• Built around key characteristics of personal databases
  – Very large number of pictures (>5K)
  – Non-labeled images
  – Many pictures of few people
  – Complex pictures including people, events, places, and objects
See "Porting MapReduce to a GPU," Thursday 11:30
What About…?
• Many other applications might make sense
• If only…
  – Parallel programming is already really hard
  – Who is going to write these apps???
• A domain expert is not necessarily qualified to write complex parallel programs!
Need a Fresh Approach to Parallelism
• Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism
  – Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, …
  – Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis
• Tried to learn from successes in high-performance computing (LBNL) and parallel embedded (BWRC)
• Led to the "Berkeley View" tech report and the new Parallel Computing Laboratory ("Par Lab")
• Goal: productive, efficient, correct programming of 100+ cores & scale as cores double every 2 years (!)
Why Target 100+ Cores?
• 5-year research program aimed 8+ years out
• Multicore: 2X / 2 yrs ⇒ ≈ 64 cores in 8 years
• Manycore: 8X to 16X multicore
[Graph: cores per chip vs. year, 2003-2015, log scale 1 to 512, doubling each generation; annotated: Automatic Parallelization, Thread-Level Speculation]
4 Themes of View 2.0 / Par Lab
1. Applications
   – Compelling apps drive the top-down research agenda
2. Identify common computational patterns
   – Breaking through disciplinary boundaries
3. Developing parallel software with productivity, efficiency, and correctness
   – 2 layers + Coordination & Composition Language + autotuning
4. OS and architecture
   – Composable primitives, not packaged solutions
   – Deconstruction, fast barrier synchronization, partitions
Par Lab Research Overview
Easy to write correct programs that run efficiently on manycore
[Diagram of the Par Lab stack:
  Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser — built on Motifs/Dwarfs
  Productivity layer: Composition & Coordination Language (C&CL), parallel libraries, parallel frameworks, sketching
  Efficiency layer: C&CL compiler/interpreter, efficiency languages, autotuners, schedulers, communication & synch. primitives, efficiency-language compilers, legacy code
  Correctness (spanning both layers): static verification, type systems, directed testing, dynamic checking, debugging with replay
  OS & architecture: legacy OS, OS libraries & services, hypervisor, multicore/GPGPU, RAMP Manycore]
"Motifs" Popularity (Red Hot → Blue Cool)
• How do compelling apps relate to the 13 motifs/dwarfs?
[Table: the 13 motifs vs. application domains (Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser), cells color-coded by popularity]
The 13 motifs:
1. Finite State Machine
2. Combinational
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Programming
9. N-Body
10. MapReduce
11. Backtrack / Branch & Bound
12. Graphical Models
13. Unstructured Grid
Developing Parallel Software
• 2 types of programmers ⇒ 2 layers
• Efficiency layer (10% of today's programmers)
  – Expert programmers build frameworks & libraries, hypervisors, …
  – "Bare metal" efficiency possible at the efficiency layer
• Productivity layer (90% of today's programmers)
  – Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
  – Frameworks & libraries composed to form app frameworks
• Effective composition techniques allow the efficiency programmers to be highly leveraged
  ⇒ Create a language for Composition and Coordination (C&C)
Architectural Trends
• The greatest trend across VLSI generations is the increase in parallelism
  – Up to 1985: bit-level parallelism: 4-bit → 8-bit → 16-bit
    » slows after 32-bit
    » adoption of 64-bit now under way, 128-bit far off (not a performance issue)
    » great inflection point when a 32-bit micro and cache fit on a chip
  – Mid '80s to mid '90s: instruction-level parallelism
    » pipelining and simple instruction sets, + compiler advances (RISC)
    » on-chip caches and functional units ⇒ superscalar execution
    » greater sophistication: out-of-order execution, speculation, prediction
      • to deal with control transfer and latency problems
  – Next step: ManyCore
  – Also: thread-level parallelism? Bit-level parallelism?
Can ILP go any farther?
[Graphs: fraction of total cycles (%) vs. number of instructions issued (0 to 6+), and speedup vs. instructions issued per cycle (0 to 15)]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – real caches and non-zero miss latencies
[Graph: number of processors in fully configured commercial shared-memory systems over time]
Thread-Level Parallelism "on board"
• A micro on a chip makes it natural to connect many to shared memory
  – dominates the server and enterprise market, moving down to the desktop
• Alternative: many PCs sharing one complicated pipe
• Faster processors began to saturate the bus; then bus technology advanced
  – today, a range of sizes for bus-based systems, from desktop to large servers
[Diagram: processors (Proc) sharing a bus to MEM]
What about Storage Trends?
• Divergence between memory capacity and speed is even more pronounced
  – Capacity increased 1000x from 1980-95; speed only 2x
  – Gigabit DRAM by c. 2000, but the gap with processor speed is much greater
• Larger memories are slower, while processors get faster
  – Need to transfer more data in parallel
  – Need deeper cache hierarchies
  – How to organize caches?
• Parallelism increases the effective size of each level of the hierarchy without increasing access time
• Parallelism and locality within memory systems too
  – New designs fetch many bits within the memory chip, followed by fast pipelined transfer across a narrower interface
  – Buffer caches hold the most recently accessed data
  – Processor in memory?
• Disks too: parallel disks plus caching
Summary: Why Parallel Architecture?
• Desperate situation for hardware manufacturers!
  – Economics, technology, architecture, application demand
  – Perhaps you can help this situation!
• Increasingly central and mainstream
• Parallelism exploited at many levels
  – Instruction-level parallelism
  – Multiprocessor servers
  – Large-scale multiprocessors ("MPPs")
• Focus of this class: the multiprocessor level of parallelism
• Same story from the memory-system perspective
  – Increase bandwidth, reduce average latency with many local memories
• A spectrum of parallel architectures makes sense
  – Different cost, performance, and scalability