asymmetric chip multi-core: the future chip multiprocessorfrankel/altcomfeb09/... · 25th american...

42
Uri Weiser EE Technion Alternative Computing Day February 9 th , 2009 Asymmetric Chip Multi-Core: The Future Chip Multiprocessor 1

Upload: others

Post on 06-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Uri WeiserEE Technion

Alternative Computing Day February 9th, 2009

Asymmetric Chip Multi-Core:The Future Chip Multiprocessor

1

Page 2: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Sailing Basicswind

Buoy

2

Page 3: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Sailing - wind shiftwind

Buoy

3

a

a

a

Page 4: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

25th American Cap – September 23rd 1983Liberty I against Australia II

7 rounds race, Score status: 3:0 to Liberty I, 4th round started

4

wind

Buoy

Start line

Page 5: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Sailing competitiongetting first

Boat 1

Boat 2

wind

Buoy

- What is the Strategy of Boat 1?

- What is the Strategy of Boat 2?

Do not follow Invent5

Page 6: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

OutlineI selected the other way, but not as far as alternative computing

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

6

Page 7: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Cores implicationsCMP is ubiquitous since Paul Otellini announced the P-Due-2 in June 2006

Reasons: Power wall

- Single Core Performance/power trend - Process technology

Open questions

– How many Cores? 10s 1000s? Implications!

– Cache architecture?

– Are the cores Symmetric (aka Homogenous) or Asymmetric?

– If Asymmetric

– Same ISA

– Different ISA extensions

– Application specific accelerators

– Task/threads scheduling

7

Page 8: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

The outside world

NUMA implicationssecond boat chose NUMA in 2007

Reasons: Integration and Performance

Implications

– Synchronize Memory Controllers (MC)

– Future Multiple MC on die

Open issues/opportunities

– All memory traffic runs through the CPU

– Task/Threads assignments and scheduling

Core$Mem

Core

Core

Core

Core $ Mem

Core

Core

Core

8

Page 9: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Outline

Today's environment– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

9

Page 10: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

The Computer runs multiple programs

CPUs & Caches

Memory

Mass storage

?

Fast I/Oe.g. camera, Accelerators

Slow I/Oe.g. human interface

Display

Communication

Computation and “scratch pad” The “outside” world

Computation enginese.g. Graphics

engine, Media, Communication

engines,

10

Page 11: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Flying machinesAre they all the same?

11

Page 12: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

12

Flying machinesAre they all the same?

Page 13: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

13

Flying machinesAre they all the same?

Page 14: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Requirements and program behavior

QoS– Application’s response Latency + Performance

– Differential services in multiple applications environment

– Priority of Programs

– Priority of Data

– Lossless or loss consent (e.g. Music, video, communication…)

– Power constrains

– Reliability

14

Page 15: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Parallelism

– Threads level parallelism– Many/Huge number of threads 10s or 1000s

– Data level parallelism vector|SIMD

– Instruction level parallelism ILP

– Pipelining (one dimension/few dimensions graphics/systolic algorithms)

Requirements and program behavior

Remember Amdahl law15

Page 16: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

– Memory BW

– Commutation vs. memory access e.g. G Byte/Flop, M

0.05 Byte/Flop

– Streaming e.g. Media, communication, stock exchange data

– Computation behavior FP, integer, branches, loops

– Locality of program/data impact on caches, TLBs, memory

– Sharing

– Foot print Program/data

Requirements and program behavior (cont’)

16

Page 17: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

Memory Wall

NUMA implications

Scheduling

Conclusions

17

Page 18: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Gain

Frequency

CMP implication:Same ISA/ISA extensions/Application Specific Accelerators

Analog Circuit Paradigm

GBWP = Gain Bandwidth Product = constant @ a given technology

BW1

Gain1

BW2Gain2

e.g. Gain1*BW1= Gain2*BW2

18

Page 19: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Gain

Frequency

Analog Circuit Paradigm (cont.)

BW1 BWn Gain

19

Page 20: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Application domainSame ISA/ISA extensions/Application Specific Accelerators

Performance

Apps range

General purpose

Application specific – Accelerators (not our generic ISA)

Application specific engines achieve higher

performance/power than general purpose engines

20

Page 21: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Application domainSame ISA/ISA extensions/Application Specific Accelerators

Performance

Apps range

Original

ISA

general purpose engines with ISA extensions

ISA extension A ISA extension A

21

Page 22: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

The memory Wall

Many execution engines more

performancemore memory accesses

How to reduce the memory latency

performance impact?

– Caches

–Threads

22

Page 23: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

The memory Wallthe Cache solution

The “MultiCore” machines solution use cache

as a memory access “filter”

Implications:

– Number of cores: Cache area limits the number of cores

– BW: Reduces memory access Bandwidth

– Coherency: In flat memory structure cache needs coherency mechanism

– Cache Access time (see example: Nahalal)

23

Core Mem

Core

Core

Core

M

C

Inte

rcon

ne

ct

$

Page 24: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Aerial view of Nahalal cooperative village

Example: Nahalal; Shared $ vs. Private $

Overview of Nahalal cache organization

P0 P1

P2

P3P4P5

P6

P7

P0 P1

P2

P3P4P5

P6

P7S$

P$

P$

P$

P$P$

P$

P$

P$

CPU0

CP

U1

CP

U5

CPU3CPU7

CPU2

CPU6 CPU4

CPU0

CP

U1

CP

U5

CPU3CPU7

CPU2

CPU6 CPU4

P$

S$

P$ P$

P$P$

P$

P$

P$

24

Page 25: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Nahalal Cache Performance vs. CIM

27.41% improvement in average cache access time– 41.1% in apache

L2 hit time

0

5

10

15

20

25

30

35

40

45

50

equake fma3d barnes water apache zeus specjbb

Cy

cle

s CIM

NAHALAL

# c

ycl

es

Average Cache hit access Time

(cycles)

3.9% 8.57%

40.53%

41.1%

29.06%29.35%39.4%

CPU0 CPU1 CPU3CPU2

CP40 CPU5 CPU7CPU6

CPU0 CPU1 CPU3CPU2

CPU4 CPU5 CPU7CPU6

Bank0 Bank1 Bank2 Bank3

Bank4 Bank5 Bank6 Bank7

CIM

25

Page 26: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

The memory Wallthe Threads solution

The “MultiTHreaded” machines solution

Hide memory latency behind many many threads

When 1000s threads exist NO CACHE is valuable/needed

Implications:

– Applications: 1000s of concurrent threads

– BW: Major impact on Memory access Bandwidth

– Copies: Need to keep many threads’ RFs and states

– Exception handling: complex structure

Page 27: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

27

Page 28: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

CPU

Priva

te C

ach

e

Priva

te C

ach

e

Sh

are

d C

ach

e

Priva

te C

ach

e

NAHLAL Cache structure Asymmetric NUMA Architecture

CPU CPUCPU

Bridge

“Sh

are

d M

em

ory

Priva

te M

em

ory

Priva

te M

em

ory

CPUs clusters

CPU CPU CPU CPU CPU CPU

28

Priva

te C

ach

e

Page 29: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Memory

NUMA implicationa platform view

CPUs

IOsAccelerators

Communication

Storage

External devices

Bridge

NUMA

29

Page 30: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Memory

Platform NUMA

CPUs

IOsAccelerators

Communication

Storage

External devices

Bridge

Platform NUMA

• Can we improve platform data movements

• Reduce unnecessary traffic

• Efficient on-loading

Memory (aka M3)

30

Page 31: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

NUMA implications

Scheduling

Conclusions

31

Page 32: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Task Scheduling

Current platform tasks’ scheduling is being performed by OS based on:

– Available resources

– Ready tasks

Are there other dimensions?

– Scheduling based on thread’s essence

– Can we impact scheduling based on “The data content”?*

*Commex Technologies Patent pending 32

Page 33: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

33

ACCMP - Asymmetric Cluster CMP

Big cores and small cores same ISA

Serial phases execute on large core

Parallel phases execute on all cores

βaSerial

Parallel

Page 34: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

34

ACCMP - Analysis

ACCMP Performance Vs. Power

Symmetric Upper

Bound

a =1, β =4

a =0.33, β =6

1

2

3

4

5

0 5 10 15 20 25

Relative Power

Rel

ati

ve

Per

form

an

ce

Symmetric Upper Bound

Asymmetric (a=1, β=4)

Asymmetric (a=0.33, β=6)

1

112

1 2

Perf .

1

ACCMP

a

P Pv k nk a

a a

ACCMP delivers more

performance per unit

power!

Page 35: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

35

Multiple Applications Scheduling

App-A App-B

t

Observation: Serial threads are the bottleneck of applications

Current OS grants serial threads the same priority as parallel threads, resulting in lower system throughput and unfairness

Solution: Boost priority of serial threads

ACCMP: Serial threads will be granted more powerful cores

Page 36: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Benchmark Throughput

36

∞:1

8:1

4:1

2:1

1:1

1:2

1:4

1:8

1:∞

∞:1 8:1 4:1 2:1 1:1 1:2 1:4 1:8 1:∞Sy

nth

eti

c B

en

chm

ark

"A"

Synthetic Benchmark "B"Benchmark “B” Serial to Parallel ratio

Ben

ch

mark

“A

” S

eri

al

to P

ara

llel ra

tio

Page 37: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Application’s Data’s requirements

Differential services

– Some data is latency non-tolerant(e.g. banking transactions)

– Some data is BW non-tolerant(e.g. Video)

– Some data is more “valuable”…

Why not to handle data according to its “value” (in addition its processing requirement)?

37

Page 38: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Data Content Aware (DCA) Architecture*

Task assignment based on Data Content

DCA

CA

System without DCA System with DCA

38*Commex Technologies Patent pending

Page 39: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Latency

620 640770810 850 905 900 900 900 940 940 940 940 940

580666

780840940

1200

1517

1900

2740

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

0 500000 1000000 1500000 2000000

Late

ncy [

µsec]

Load [PPS]

Commex Classification

Symmetric

*Test does not include UDP zero copy

No Service

W/O Content

Aware

39

Page 40: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Latency vs. load

40

Page 41: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Conclusions

The environment is changing

– Many open issues interesting research topics

– Core homogeneity?

– Computation assignment and scheduling?– Thread essence base?

– Data Content base?

– Cache or not Cache?

– Cache structure?

– …

41

Thanks:Research is related to the work of: Zvika Guz, Tomer Morad, Leonid Azriel, Commex-Technologies

Page 42: Asymmetric Chip Multi-Core: The Future Chip Multiprocessorfrankel/AltComFeb09/... · 25th American Cap – September 23rd 1983 Liberty I against Australia II 7 rounds race, Score

Thank you

42