
CS 258: Parallel Computer Architecture

Lecture 1

Introduction to Parallel Architecture

January 23, 2008
Prof. John D. Kubiatowicz


Computer Architecture Is …
“… the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.”
Amdahl, Blaauw, and Brooks, 1964


The Instruction Set: a Critical Interface
[Figure: the instruction set as the boundary between software and hardware]
• Properties of a good abstraction:
– Lasts through many generations (portability)
– Used in many different ways (generality)
– Provides convenient functionality to higher levels
– Permits an efficient implementation at lower levels
– Changes very slowly! (Although this rate is increasing)
• Is there a solid interface for multiprocessors?
– No standard hardware interface


What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems
• Some broad issues:
– Models of computation: PRAM? BSP? Sequential Consistency?
– Resource Allocation:
» how large a collection?
» how powerful are the elements?
» how much memory?
– Data access, Communication and Synchronization:
» how do the elements cooperate and communicate?
» how are data transmitted between processors?
» what are the abstractions and primitives for cooperation?
– Performance and Scalability:
» how does it all translate into performance?
» how does it scale?


Topologies of Parallel Machines
• Symmetric Multiprocessor
– Multiple processors in a box with shared-memory communication
– Current multicore chips are like this
– Every processor runs a copy of the OS
• Non-uniform shared memory with separate I/O through a host
– Multiple processors, each with local memory, connected by a general scalable network
– Extremely light “OS” on each node provides simple services
» scheduling/synchronization
– Network-accessible host for I/O
• Cluster
– Many independent machines connected with a general network
– Communication through messages
[Figures: an SMP with processors (P) on a shared bus with memory; a grid of processor/memory (P/M) nodes with an attached host; independent machines connected by a network]

Conventional Wisdom (CW) in Computer Architecture
1. Old CW: Power is free, but transistors are expensive
• New CW is the “Power wall”: Power is expensive, but transistors are “free”
– Can put more transistors on a chip than we have the power to turn on
2. Old CW: Only concern is dynamic power
• New CW: For desktops and servers, static power due to leakage is 40% of total power
3. Old CW: Monolithic uniprocessors are reliable internally, with errors occurring only at pins
• New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates


Conventional Wisdom (CW) in Computer Architecture
4. Old CW: By building upon prior successes, continue raising the level of abstraction and size of HW designs
• New CW: Wire delay, noise, cross coupling, reliability, clock jitter, design validation, … stretch the development time and cost of large designs at ≤65 nm
5. Old CW: Researchers demonstrate new architectures by building chips
• New CW: Cost of 65 nm masks, cost of ECAD, and design time for GHz clocks ⇒ researchers no longer build believable chips
6. Old CW: Performance improves latency & bandwidth
• New CW: BW improves > (latency improvement)²


Conventional Wisdom (CW) in Computer Architecture
7. Old CW: Multiplies are slow, but loads and stores are fast
• New CW is the “Memory wall”: Loads and stores are slow, but multiplies are fast
– 200 clocks to DRAM, but even FP multiplies take only 4 clocks
8. Old CW: We can reveal more ILP via compilers and architecture innovation
– Branch prediction, OOO execution, speculation, VLIW, …
• New CW is the “ILP wall”: Diminishing returns on finding more ILP
9. Old CW: 2X CPU performance every 18 months
• New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall


Uniprocessor Performance (SPECint)
[Figure: performance relative to the VAX-11/780, log scale 1 to 10,000, for 1978–2006; growth segments of 25%/year, 52%/year, and ??%/year, with a 3X gap marked]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006
⇒ Sea change in chip design: multiple “cores” or processors per chip


Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
• 125 mm² chip at 0.065 micron CMOS = 2312 copies of RISC II + FPU + Icache + Dcache
– RISC II shrinks to ≈ 0.02 mm² at 65 nm
– Caches via DRAM or 1-transistor SRAM (www.t-ram.com) or 3D chip stacking
– Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
• Processor is the new transistor?


ManyCore Chips: The future is here
• “ManyCore” refers to many processors per chip
– 64? 128? Hard to say an exact boundary
• How to program these?
– Use 2 CPUs for video/audio
– Use 1 for word processor, 1 for browser
– 76 for virus checking???
• Something new is clearly needed here…
• Intel 80-core multicore chip (Feb 2007)
– 80 simple cores
– Two floating point engines per core
– Mesh-like “network-on-a-chip”
– 100 million transistors
– 65 nm feature size

Frequency   Voltage   Power   Bandwidth         Performance
3.16 GHz    0.95 V    62 W    1.62 Terabits/s   1.01 Teraflops
5.1 GHz     1.2 V     175 W   2.61 Terabits/s   1.63 Teraflops
5.7 GHz     1.35 V    265 W   2.92 Terabits/s   1.81 Teraflops


Conventional Wisdom (CW) in Computer Architecture
10. Old CW: Increasing clock frequency is the primary method of performance improvement
• New CW: Processor parallelism is the primary method of performance improvement
11. Old CW: Don’t bother parallelizing your app, just wait and run it on a much faster sequential computer
• New CW: Very long wait for a faster sequential CPU
– 2X uniprocessor performance takes 5 years?
– End of the La-Z-Boy Programming Era
12. Old CW: Less than linear scaling ⇒ failure
• New CW: Given the switch to parallel hardware, even sublinear speedups are beneficial
13. New Moore’s Law is 2X processors (“cores”) per chip every technology generation, but same clock rate
– “This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs …; instead, this … is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional solutions.”
The Parallel Computing Landscape: A Berkeley View, Dec 2006


Déjà vu all over again?
• Multiprocessors imminent in 1970s, ‘80s, ‘90s, …
• “… today’s processors … are nearing an impasse as technologies approach the speed of light …”
David Mitchell, The Transputer: The Time Is Now (1989)
• Transputer was premature
⇒ Custom multiprocessors strove to lead uniprocessors
⇒ Procrastination rewarded: 2X sequential performance / 1.5 years
• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”
Paul Otellini, President, Intel (2004)
• Difference is that all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apples have 2 CPUs)
⇒ Procrastination penalized: 2X sequential performance / 5 years
⇒ Biggest programming challenge: going from 1 to 2 CPUs


CS258: Information
Instructor: Prof John D. Kubiatowicz
Office: 673 Soda Hall
Phone: 643-6817
Email: [email protected]
Office Hours: Wed 1:00 - 2:00 or by appt.
Class: Mon, Wed 2:30-4:00pm, 310 Soda Hall
Web page: http://www.cs/~kubitron/courses/cs258-S08/
Lectures available online before noon on the day of lecture
Email: [email protected]
Signup link on web page (as soon as it is up)


Computer Architecture Topics (252+)
[Figure: layered view of uniprocessor architecture topics]
• Instruction Set Architecture: addressing, protection, exception handling
• Pipelining and Instruction Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation
• Memory Hierarchy: L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; interleaving, bus protocols, emerging technologies
• Input/Output and Storage: disks, WORM, tape; RAID
• Network communication with other processors
• VLSI


Computer Architecture Topics (258)
[Figure: processor/memory (P/M) nodes attached to an interconnection network through network interfaces]
• Multiprocessors
• Networks and Interconnections: topologies, routing, bandwidth, latency, reliability; network interfaces
• Programming Models / Communication Styles: shared memory, message passing, data parallelism, transactional memory, checkpoint/restart
• Processor-Memory-Switch organization
• Reliability / Fault Tolerance
• Everything in the previous slide, but more so!


What will you get out of CS258?
• In-depth understanding of the design and engineering of modern parallel computers
– technology forces
– programming models
– fundamental architectural issues
» naming, replication, communication, synchronization
– basic design techniques
» cache coherence, protocols, networks, pipelining, …
– methods of evaluation
• From moderate to very large scale
• Across the hardware/software boundary
• Study of REAL parallel processors
– Research papers, white papers
• Natural consequences??
– Massive Parallelism ⇒ Reconfigurable computing?
– Message Passing Machines ⇒ NOW ⇒ Peer-to-peer systems?


Will it be worthwhile?
• Absolutely!
– Now, more than ever, industry is trying to figure out how to build these new multicore chips…
• The fundamental issues and solutions translate across a wide spectrum of systems
– Crisp solutions in the context of parallel machines
• Pioneered at the thin end of the platform pyramid on the most demanding applications
– migrate downward with time
[Figure: platform pyramid, from SuperServers at the top through Departmental Servers and Workstations down to Personal Computers]
• Understand implications for software
• Network-attached storage, MEMs, etc?


Why Study Parallel Architecture?
Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.
Parallelism:
• Provides an alternative to a faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing
• How is instruction-level parallelism related to coarse-grained parallelism??


Is Parallel Computing Inevitable?
This was certainly not clear just a few years ago. Today, however: YES!
• Industry is desperate for solutions!
• Application demands: our insatiable need for computing cycles
• Technology trends: easier to build
• Architecture trends: better abstractions
• Current trends:
– Today’s microprocessors are multiprocessors and/or have multiprocessor support
– Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!…


Why Me? The Alewife Multiprocessor
• Cache-coherent Shared Memory
– Partially in Software!
• User-level Message-Passing
• Rapid Context-Switching
• Asynchronous network
• One node per board


Lecture Style
• 1-Minute Review
• 20-Minute Lecture/Discussion
• 5-Minute Administrative Matters
• 25-Minute Lecture/Discussion
• 5-Minute Break (water, stretch)
• 25-Minute Lecture/Discussion
• Instructor will come to class early & stay after to answer questions
[Figure: attention vs. time, dipping around the 20-minute mark, recovering after the break, ending with “In Conclusion, ...”]


Course Methodology
• Study existing designs through research papers
– We will read about a number of real multiprocessors
– We will discuss network router designs
– We will discuss cache coherence protocols
– Etc…
• High-level goal:
– Understand past solutions so that…
– We can make proposals for ManyCore designs
• This is a critical point in parallel architecture design
– Industry is really looking for suggestions
– Perhaps you can make them listen to your ideas???


Research Paper Reading
• As graduate students, you are now researchers.
• Most information of importance to you will be in research papers.
• The ability to rapidly scan and understand research papers is key to your success.
• So: you will read lots of papers in this course!
– Quick 1-paragraph summaries will be due in class
– Students will take turns discussing papers
• Papers will be scanned and put on the web page.


Textbook: Two Leaders in the Field
Text: Parallel Computer Architecture: A Hardware/Software Approach, by David Culler & Jaswinder Pal Singh
Covers a range of topics; we will not necessarily cover them in order.


How will grading work?
No TA this term!
Rough breakdown:
• 20% Paper Summaries/Presentations
• 30% One Midterm
• 40% Research Project (work in pairs)
– meet 3 times with me to show progress
– give an oral presentation
– give a poster session
– written report like a conference paper
– 6 weeks of full-time work for 2 people
– Opportunity to do “research in the small” to help make the transition from good student to research colleague
• 10% Class Participation


Application Trends
• Application demand for performance fuels advances in hardware, which enables new applications, which…
– This cycle drives exponential increase in microprocessor performance
– Drives parallel architecture harder
» most demanding applications
[Figure: feedback cycle between More Performance and New Applications]
• Programmers willing to work really hard to improve high-end applications
• Need incremental scalability:
– Need a range of system performance with progressively increasing cost


Metrics of Performance
[Figure: system layers from Application, Programming Language, Compiler, and ISA down to Datapath/Control, Function Units, and Transistors/Wires/Pins, with the natural metric at each level]
• Answers per month; operations per second (application level)
• (Millions of) instructions per second: MIPS; (millions of) (FP) operations per second: MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)
• And what about: Programmability, Reliability, Energy?


Speedup

    Speedup(p processors) = Time(1 processor) / Time(p processors)

• Common mistake:
– Compare the parallel program on 1 processor to the parallel program on p processors
• Wrong!
– Should compare the uniprocessor program on 1 processor to the parallel program on p processors
• Why? It keeps you honest:
– It is easy to parallelize overhead.
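The definition is easy to get wrong in practice. Below is a minimal, illustrative Python sketch (the function names are hypothetical, not from the course) of measuring speedup against the honest baseline:

    import time

    def measured_speedup(sequential_fn, parallel_fn, p, *args):
        """Speedup(p) = Time(uniprocessor program) / Time(parallel program on p).

        sequential_fn: the best *sequential* implementation (the honest baseline).
        parallel_fn:   the parallel implementation, taking a processor count.
        """
        t0 = time.perf_counter()
        sequential_fn(*args)
        t_seq = time.perf_counter() - t0

        t0 = time.perf_counter()
        parallel_fn(p, *args)
        t_par = time.perf_counter() - t0

        return t_seq / t_par

    # The common mistake is using parallel_fn(1, ...) as the baseline: any
    # synchronization or partitioning overhead then appears in *both* timings
    # and inflates the reported speedup.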


Amdahl's Law
Speedup due to enhancement E:

    Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected; then

    Speedup(E) = 1 / ( (1 - F) + F/S )
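As a quick worked example (illustrative numbers, not from the slides), accelerating 80% of a task by 10X yields well under 10X overall:

    def amdahl_speedup(f, s):
        """Amdahl's Law: overall speedup when a fraction f of execution
        time is accelerated by a factor s and the rest is unchanged."""
        return 1.0 / ((1.0 - f) + f / s)

    print(amdahl_speedup(0.80, 10))    # ~3.57x, despite the 10x enhancement
    print(amdahl_speedup(0.80, 1e9))   # ~5x: capped by the untouched 20%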


Amdahl’s Law for parallel programs?

    Speedup_parallel = 1 / [ (1 - Fraction_par) + Fraction_par/p + ExTime_overhead(p, stuff)/ExTime_ser ]

where the parallel execution time is

    ExTime_par = ExTime_ser × [ (1 - Fraction_par) + Fraction_par/p ] + ExTime_overhead(p, stuff)

Best you could ever hope to do:

    Speedup_maximum = 1 / (1 - Fraction_par)

Worse: overhead may kill your performance!
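To see how the overhead term can dominate as p grows, here is a small illustrative model; the 95% parallel fraction and per-processor barrier cost are made-up assumptions:

    def parallel_speedup(frac_par, p, t_ser, overhead_fn):
        """Amdahl's Law with an explicit parallelization-overhead term.

        frac_par:    fraction of sequential execution time that parallelizes
        p:           number of processors
        t_ser:       sequential execution time
        overhead_fn: overhead as a function of p (assumed model, e.g. barriers)
        """
        t_par = t_ser * ((1 - frac_par) + frac_par / p) + overhead_fn(p)
        return t_ser / t_par

    # Hypothetical: 95% parallel code, a 100 s sequential run, and a barrier
    # cost of 0.05 s per processor.
    for p in (4, 16, 64, 256):
        s = parallel_speedup(0.95, p, 100.0, lambda n: 0.05 * n)
        print(f"p={p:4d}: speedup = {s:5.1f}")
    # Speedup rises, peaks, then *falls* as overhead outgrows the Fraction_par/p
    # savings -- the "overhead may kill your performance" point above.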


Where is Parallel Arch Going?
Old view: divergent architectures, no predictable pattern of growth.
[Figure: application software and system software struggling to target divergent architectures: SIMD, message passing, shared memory, dataflow, systolic arrays]
• Uncertainty of direction paralyzed parallel software development!


Today
• Extension of “computer architecture” to support communication and cooperation
– Instruction Set Architecture plus Communication Architecture
• Defines:
– Critical abstractions, boundaries, and primitives (interfaces)
– Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries and OS are important bridges today
• Still not enough standardization!
– Nothing close to the stable x86 instruction set!


P.S. The Parallel Revolution May Fail
• John Hennessy, President, Stanford University, 1/07: “…when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry.”
“A Conversation with Hennessy & Patterson,” ACM Queue Magazine, 4:10, 1/07.
• 100% failure rate of Parallel Computer Companies
– Convex, Encore, Inmos (Transputer), MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Thinking Machines, …
• What if IT goes from a growth industry to a replacement industry?
– If SW can’t effectively use 32, 64, … cores per chip ⇒ SW no faster on a new computer ⇒ only buy when the computer wears out
[Figure: millions of PCs sold per year, 1985–2015, scale 0 to 300]


Can programmers handle parallelism?
• Historically, humans are not as good at parallel programming as they would like to think!
– Need a good model to think of the machine
– Architects pushed on instruction-level parallelism really hard, because it is “transparent”
• Can the compiler extract parallelism?
– Sometimes
• How do programmers manage parallelism??
– Language to express parallelism?
– How to schedule a varying number of processors?
• Is communication explicit (message-passing) or implicit (shared memory)?
– Are there any ordering constraints on communication?


Is it obvious that more cores ⇒ more performance?
• AMBER molecular dynamics simulation program
• Starting point was vector code for the Cray-1
• 145 MFLOPS on a Cray90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D


Granularity:
• Is communication fine-grained or coarse-grained?
– Small messages vs. big messages
• Is parallelism fine-grained or coarse-grained?
– Small tasks (frequent synchronization) vs. big tasks
• If hardware handles fine-grained parallelism, then it is easier to get incremental scalability
• Fine-grained communication and parallelism are harder than coarse-grained:
– Harder to build with low overhead
– Custom communication architectures often needed (see the cost sketch below)
• Ultimate coarse-grained communication:
– GIMPS (Great Internet Mersenne Prime Search)
– Communication once a month
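A simple, standard cost model (illustrative numbers, not from the slides) shows why fine-grained communication demands low per-message overhead: each message pays a fixed cost, so many small messages pay it many times:

    def transfer_time(total_bytes, msg_size, overhead_s, bandwidth_bps):
        """Time to move total_bytes in messages of msg_size bytes.

        Each message pays a fixed software/hardware overhead plus its
        serialization time on the link (a simplified, assumed cost model).
        """
        n_msgs = -(-total_bytes // msg_size)  # ceiling division
        return n_msgs * overhead_s + total_bytes / bandwidth_bps

    # Hypothetical numbers: 10 MB moved over a 1 GB/s link with 1 us
    # per-message overhead.
    MB, us = 1 << 20, 1e-6
    print(transfer_time(10 * MB, 64, 1 * us, 1e9))         # fine-grained: ~0.17 s
    print(transfer_time(10 * MB, 64 * 1024, 1 * us, 1e9))  # coarse-grained: ~0.011 s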


Current Commercial Computing Targets
• Relies on parallelism for the high end
– Computational power determines the scale of business that can be handled
• Databases, online transaction processing, decision support, data mining, data warehousing …
• Google, Yahoo, …
• TPC benchmarks (TPC-C order entry, TPC-D decision support)
– Explicit scaling criteria provided
– Size of enterprise scales with size of system
– Problem size not fixed as p increases
– Throughput is the performance measure (transactions per minute, or tpm)

Scientific Computing Demand
[Figure-only slide]

Engineering Computing Demand
• Large parallel machines are a mainstay in many industries:
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion efficiency)
– Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
» in all of the above
» entertainment (films like Toy Story)
» architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.


Can anyone afford high-end MPPs???
• ASCI (Accelerated Strategic Computing Initiative)
• ASCI White: built by IBM
– 12.3 TeraOps, 8192 processors (RS/6000)
– 6 TB of RAM, 160 TB of disk
– 2 basketball courts in size
– Program it??? Message passing


Need a New Class of Applications
• Handheld devices with ManyCore processors!
– Great potential, right?
• Human interface applications very important: “The laptop/handheld is the computer”
– ’07: HP laptop sales > desktop sales
– 1B+ cell phones/yr, increasing in function
– Otellini demoed a “Universal Communicator” (combination cell phone, PC, and video device)
– Apple iPhone
• User wants increasing performance, and weeks or months of battery power


Applications: Speech and Image Processing
[Figure: computational demand (1 MIPS to 10 GIPS) vs. year (1980–1995) for applications from sub-band speech coding, telephone number recognition, and CELP speech coding through isolated (200-word) and continuous (1,000- and 5,000-word) speech recognition, speaker verification, ISDN-CD stereo reception, CIF video, and HDTV reception]
• Also CAD, databases, …


Compelling Laptop/Handheld Apps
• Meeting Diarist
– Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
• Teleconference speaker identifier, speech helper
– L/Hs used for teleconference, identify who is speaking, give a “closed caption” hint of what is being said


Compelling Laptop/Handheld Apps
• Health Coach
– Since the laptop/handheld is always with you: record images of all meals, weigh the plate before and after, analyze calories consumed so far
» “What if I order a pizza for my next meal? A salad?”
– Since the laptop/handheld is always with you: record the amount of exercise so far, show how your body would look if you maintained this exercise and diet pattern for the next 3 months
» “What would I look like if I regularly ran less? Further?”
• Face Recognizer/Name Whisperer
– Laptop/handheld scans faces, matches them against an image database, whispers the name in your ear (relies on Content-Based Image Retrieval)


Content-Based Image Retrieval (Kurt Keutzer)
[Figure: query by example against an image database of 1000’s of images; a similarity metric produces candidate results, refined through relevance feedback into a final result]
• Built around key characteristics of personal databases:
– Very large number of pictures (>5K)
– Non-labeled images
– Many pictures of few people
– Complex pictures including people, events, places, and objects
See “Porting MapReduce to a GPU”, Thursday 11:30


What About….?
• Many other applications might make sense
• If only…
– Parallel programming is already really hard
– Who is going to write these apps???
• A domain expert is not necessarily qualified to write complex parallel programs!


Need a Fresh Approach to Parallelism
• Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism
– Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, …
– Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis
• Tried to learn from successes in high performance computing (LBNL) and parallel embedded computing (BWRC)
• Led to the “Berkeley View” tech report and the new Parallel Computing Laboratory (“Par Lab”)
• Goal: productive, efficient, correct programming of 100+ cores & scale as the number of cores doubles every 2 years (!)


Why Target 100+ Cores?
• 5-year research program aims 8+ years out
• Multicore: 2X cores / 2 yrs ⇒ ≈ 64 cores in 8 years
• Manycore: 8X to 16X multicore
[Figure: projected cores per chip, 2003–2015, log scale 1 to 1000: multicore doubling through 1, 2, 4, 8, 16, 32, 64 vs. manycore reaching 64, 128, 256, 512; the single-core path is labeled “Automatic Parallelization, Thread Level Speculation”]
(A quick sanity check of the doubling arithmetic appears below.)
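The core-count projection is plain compound doubling; here is a two-line illustrative check, anchored at the figure's 2003 starting point of 1 core:

    # Cores double every 2 years: cores(t) = cores_0 * 2**((t - t0) / 2).
    for year in range(2003, 2016, 2):
        print(year, 1 * 2 ** ((year - 2003) // 2))
    # Prints 2003:1, 2005:2, 2007:4, ..., 2015:64. Equivalently, from ~4 cores
    # in 2007, four more doublings over 8 years gives 64 ("~64 cores in 8
    # years"), and manycore at 8-16X that reaches the 512 shown in the figure.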


4 Themes of View 2.0 / Par Lab
1. Applications
– Compelling apps drive the top-down research agenda
2. Identify Common Computational Patterns
– Breaking through disciplinary boundaries
3. Developing Parallel Software with Productivity, Efficiency, and Correctness
– 2 layers + Coordination & Composition Language + autotuning
4. OS and Architecture
– Composable primitives, not packaged solutions
– Deconstruction, fast barrier synchronization, partitions

Par Lab Research Overview
Goal: easy to write correct programs that run efficiently on manycore.
[Figure: the Par Lab stack. Applications (personal health, image retrieval, hearing/music, speech, parallel browser) and motifs/dwarfs at the top. A Productivity Layer with a Composition & Coordination Language (C&CL) and its compiler/interpreter, parallel libraries, parallel frameworks, and sketching; an Efficiency Layer with efficiency languages and compilers, schedulers, communication & synchronization primitives, autotuners, and type systems; correctness support spanning both layers via static verification, dynamic checking, directed testing, and debugging with replay; legacy code and a legacy OS also supported. Below: OS libraries & services, a hypervisor, and hardware (multicore/GPGPU and RAMP Manycore).]


“Motifs” Popularity (Red Hot → Blue Cool)
• How do compelling apps relate to the 13 motifs/dwarfs?
[Table: rows are the 13 motifs below; columns score their importance in Embedded, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, and Browser workloads, color-coded from red (hot) to blue (cool)]
1. Finite State Machines
2. Combinational
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Programming
9. N-Body
10. MapReduce
11. Backtrack / Branch & Bound
12. Graphical Models
13. Unstructured Grid


Developing Parallel Software
• 2 types of programmers ⇒ 2 layers
• Efficiency Layer (10% of today’s programmers)
– Expert programmers build frameworks & libraries, hypervisors, …
– “Bare metal” efficiency possible at the Efficiency Layer
• Productivity Layer (90% of today’s programmers)
– Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
– Frameworks & libraries are composed to form app frameworks
• Effective composition techniques allow the efficiency programmers to be highly leveraged
⇒ Create a language for Composition and Coordination (C&C)


Architectural Trends
• Greatest trend across VLSI generations is the increase in parallelism
– Up to 1985: bit-level parallelism: 4-bit → 8-bit → 16-bit
» slows after 32 bits
» adoption of 64-bit now under way, 128-bit far off (not a performance issue)
» great inflection point when a 32-bit micro and cache fit on a chip
– Mid 80s to mid 90s: instruction-level parallelism
» pipelining and simple instruction sets, plus compiler advances (RISC)
» on-chip caches and functional units ⇒ superscalar execution
» greater sophistication: out-of-order execution, speculation, prediction
• to deal with control transfer and latency problems
– Next step: ManyCore
– Also: thread-level parallelism? Bit-level parallelism?


Can ILP go any farther?
[Figure: left, fraction of total cycles (%) vs. number of instructions issued per cycle (0 to 6+); right, speedup (0 to 3) vs. instructions issued per cycle (0 to 15)]
• Assumes infinite resources and fetch bandwidth, perfect branch prediction and renaming
– but real caches and non-zero miss latencies

Thread-Level Parallelism “on board”
[Figure: four processors (Proc) sharing memory (MEM) over a bus; also, number of processors in fully configured commercial shared-memory systems over time]
• A micro on a chip makes it natural to connect many to shared memory
– dominates the server and enterprise market, moving down to the desktop
• Alternative: many PCs sharing one complicated pipe
• Faster processors began to saturate the bus, then bus technology advanced
– today, a range of sizes for bus-based systems, from desktop to large servers


What about Storage Trends?
• Divergence between memory capacity and speed is even more pronounced
– Capacity increased by 1000X from 1980–95, speed only 2X
– Gigabit DRAM by c. 2000, but the gap with processor speed grew much greater
• Larger memories are slower, while processors get faster
– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?
• Parallelism increases the effective size of each level of the hierarchy, without increasing access time
• Parallelism and locality within memory systems too
– New designs fetch many bits within the memory chip, then follow with a fast pipelined transfer across a narrower interface
– Buffer caches hold the most recently accessed data
– Processor in memory?
• Disks too: parallel disks plus caching


Summary: Why Parallel Architecture?
• Desperate situation for hardware manufacturers!
– Economics, technology, architecture, application demand
– Perhaps you can help this situation!
• Increasingly central and mainstream
• Parallelism exploited at many levels
– Instruction-level parallelism
– Multiprocessor servers
– Large-scale multiprocessors (“MPPs”)
• Focus of this class: the multiprocessor level of parallelism
• Same story from the memory system perspective
– Increase bandwidth, reduce average latency with many local memories
• A spectrum of parallel architectures makes sense
– Different cost, performance, and scalability