www.huawei.com HUAWEI TECHNOLOGIES Co., Ltd. Smart Memory Bill Lynch Sailesh Kumar and Team

Upload: gerald-robertson

Post on 19-Jan-2016


TRANSCRIPT

Page 1: Www.huawei.com HUAWEI TECHNOLOGIES Co., Ltd. Smart Memory Bill Lynch Sailesh Kumar and Team

www.huawei.com

HUAWEI TECHNOLOGIES Co., Ltd.

Smart Memory

Bill Lynch

Sailesh Kumar

and Team

Page 2:

Overview

High Performance Packet Processing

Challenges

Solution – Smart Memory

Smart Memory Architecture

Page 3:

Packet Processing Workload Challenges

Sequential memory references
› For lookups (L2, L3, L4, and L7)
› Finite automata traversal

Read-modify-write
› Statistics, counters, token buckets, mutexes, etc.

Pointer and linked-list management
› Buffer management, packet queues, etc.

Traditional implementations use
› Commodity memory to store data
› NPs and ASICs to process data in memory

[Figure: two rows of processors (P) attached to four commodity memory blocks]

Performance Barriers:
1. Memory and chip I/O bandwidth
2. Memory latency
3. Locks for atomic access

Tons of memory references and minimal compute

Page 4:

Illustration of Performance Barrier I

[Figure: processors (P) connected through an interconnection network to memory blocks that hold an IP lookup tree; tree edges are labeled 0/1 and nodes carry the prefixes P1 to P5]

Requires several transactions between memory and processors

High lookup delay due to high interconnect and memory latency

Low IPC: need more processors

More latency in interconnect
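To make the transaction count concrete, here is a minimal sketch (not the presenters' implementation) of a processor-driven longest-prefix match over a binary lookup tree. The `RemoteMemory` class, its `read` method and the two example prefixes are hypothetical; each `read` stands for one round trip across the interconnect.

```python
class RemoteMemory:
    """Hypothetical commodity memory: every read is one round trip."""

    def __init__(self):
        self.nodes = [{"child": [None, None], "next_hop": None}]  # node 0 is the root
        self.reads = 0

    def insert(self, prefix_bits, next_hop):
        """Control-plane insert of a prefix given as a string of '0'/'1' (not counted)."""
        idx = 0
        for bit in prefix_bits:
            b = int(bit)
            if self.nodes[idx]["child"][b] is None:
                self.nodes.append({"child": [None, None], "next_hop": None})
                self.nodes[idx]["child"][b] = len(self.nodes) - 1
            idx = self.nodes[idx]["child"][b]
        self.nodes[idx]["next_hop"] = next_hop

    def read(self, idx):
        """Data-plane read of one tree node; counts as one memory transaction."""
        self.reads += 1
        return self.nodes[idx]


def lookup(mem, addr_bits):
    """Longest-prefix match driven by the processor, one read per tree level."""
    best, idx = None, 0
    for bit in addr_bits:
        node = mem.read(idx)
        if node["next_hop"] is not None:
            best = node["next_hop"]
        idx = node["child"][int(bit)]
        if idx is None:
            return best
    node = mem.read(idx)  # node reached after consuming every address bit
    return node["next_hop"] if node["next_hop"] is not None else best


mem = RemoteMemory()
mem.insert("0", "P1")
mem.insert("0101", "P2")
hop = lookup(mem, "01011111")
print(hop, mem.reads)
```

In this toy model a single lookup that matches a 4-bit prefix already costs five dependent reads, and each one pays the full interconnect and memory latency before the next can be issued.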

Page 5:

Illustration of Performance Barrier II

Lookups are read-only, so relatively easy
Linked lists, counters, policers, etc. are read-modify-write
Requires a per-memory-address lock in multi-core systems

[Figure: processors (P) connected through an interconnection network to memory blocks]

Enqueue/Dequeue:
1. Lock free-list
2. Get free node
3. Unlock free-list
4. Lock list tail
5. Read list tail
6. Link free node
7. Update list tail
8. Unlock list tail

Counters:
1. Lock counter
2. Read counter
3. Write counter
4. Unlock counter

Locks often kept in memory
› Requires another transaction
› Adds significant latency

Single queue or single counter operations are extremely slow
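The counter sequence above can be sketched as a transaction count. This is an illustrative model only: the `Memory` class, the address names and the `in_memory_increment` helper are hypothetical, and lock contention (which would add retries) is ignored.

```python
class Memory:
    """Hypothetical memory model that counts transactions on the interconnect."""

    def __init__(self):
        self.cells = {}
        self.transactions = 0

    def access(self, addr, value=None):
        """One transaction: a write when value is given, otherwise a read."""
        self.transactions += 1
        if value is not None:
            self.cells[addr] = value
        return self.cells.get(addr, 0)


def locked_increment(mem, counter_addr, lock_addr):
    """Processor-side increment: lock, read, write, unlock."""
    mem.access(lock_addr, 1)             # lock counter (lock word lives in memory)
    value = mem.access(counter_addr)     # read counter
    mem.access(counter_addr, value + 1)  # write counter
    mem.access(lock_addr, 0)             # unlock counter


def in_memory_increment(mem, counter_addr):
    """Increment performed next to the data: one command from the processor."""
    mem.transactions += 1
    mem.cells[counter_addr] = mem.cells.get(counter_addr, 0) + 1


conventional = Memory()
locked_increment(conventional, "ctr", "ctr.lock")
smart = Memory()
in_memory_increment(smart, "ctr")
print(conventional.transactions, smart.transactions)
```

Under these assumptions the locked update costs four serialized transactions per increment, against one when the read-modify-write happens beside the data. Since updates to a single counter cannot overlap, that per-update cost bounds its throughput.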

Page 6:

Overview

High Performance Packet Processing Challenges
› Memory bandwidth and latency
› Limited I/O bandwidth

Solution – Smart Memory
› Attach simple compute with data
› Attach lock with data
› Enable local memory communication

Smart Memory Architecture
› Hybrid memory – eDRAM + DDR3-DRAM
› Serial I/O

Page 7:

Introduction to Smart Memory

What is the real problem?
› Compute occurs far away from data
› Lock acquire/release occurs far from data

Fortunately, the compute for packet processing jobs is very modest!

Solution: Make memory smarter by:
› Keeping compute close to data
› Managing lock close to data
› Enabling local communication

[Figure: before, processors (P) reach memory blocks through an interconnection network; after, a compute element is attached to each memory block]

Page 8:

Introduction to Smart Memory

What is the real problem?
› Compute occurs far away from data
› Lock acquire/release occurs far from data

Fortunately, the compute for packet processing jobs is very modest!

[Figure: before, processors (P) reach memory blocks through an interconnection network; after, a compute element is attached to each memory block]

Smart Memory Advantages (get more out of fewer transactions!)
1. Lower I/O bandwidth
2. Lower processing latency
3. Higher IPC
4. Significantly higher single counter/queue performance
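As a rough illustration of keeping both compute and lock next to the data, the sketch below models a memory block that runs a whole linked-list enqueue or dequeue for a single processor command. The `SmartMemoryBlock` class and its fields are hypothetical and are not taken from the slides; the lock is an ordinary `threading.Lock` standing in for a lock managed inside the block.

```python
import threading


class SmartMemoryBlock:
    """Hypothetical memory block with a small engine and a lock held next to the data."""

    def __init__(self, num_nodes):
        self.free_list = list(range(num_nodes))  # indices of unused nodes
        self.next = [None] * num_nodes           # per-node link pointer
        self.data = [None] * num_nodes           # per-node payload
        self.head = None
        self.tail = None
        self.lock = threading.Lock()             # never crosses the interconnect
        self.commands = 0                        # commands received from processors

    def enqueue(self, item):
        """Whole enqueue sequence runs locally for one processor command."""
        self.commands += 1
        with self.lock:
            if not self.free_list:
                return False                     # no free node available
            node = self.free_list.pop()          # get free node
            self.data[node], self.next[node] = item, None
            if self.tail is None:
                self.head = node                 # queue was empty
            else:
                self.next[self.tail] = node      # link free node behind the tail
            self.tail = node                     # update list tail
            return True

    def dequeue(self):
        """Whole dequeue sequence runs locally for one processor command."""
        self.commands += 1
        with self.lock:
            if self.head is None:
                return None
            node = self.head
            self.head = self.next[node]
            if self.head is None:
                self.tail = None
            self.free_list.append(node)          # return node to the free list
            return self.data[node]


block = SmartMemoryBlock(num_nodes=4)
for packet in ("a", "b", "c"):
    block.enqueue(packet)
order = [block.dequeue() for _ in range(3)]
print(order, block.commands)
```

Each queue operation costs the processor one command here. The free-list and tail-pointer accesses, and the lock around them, all stay inside the block.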

Page 9:

Overview

High Performance Packet Processing Challenges
› Memory bandwidth and latency
› Limited I/O bandwidth

Solution – Smart Memory
› Attach simple compute with data
› Attach lock with data
› Enable local memory communication

Smart Memory Architecture
› Hybrid memory – eDRAM + DDR3-DRAM
› Serial chip I/O

Page 10:

Smart Memory Capacity and Bandwidth @100G

[Figure: chart of memory bandwidth (billion accesses / packet, as labeled on the slide; axis ticks from .15 to 40) against memory capacity (MB; axis ticks from 2 to 512+). Applications plotted: Basic Layer 2, Layer 2 forwarding, FIB (algorithmic), ACL (algorithmic), Statistics/Counter, Packet Buffer, Queuing/Scheduling, DPI (string), DPI (regex), Video Buffer]

Page 11:

Smart Memory Capacity and Bandwidth @100G

[Figure: the same chart of memory bandwidth (billion accesses / packet, as labeled on the slide; axis ticks from .15 to 40) against memory capacity (MB; axis ticks from 2 to 512+), with the applications Basic Layer 2, Layer 2 forwarding, FIB (algorithmic), ACL (algorithmic), Statistics/Counter, Packet Buffer, Queuing/Scheduling, DPI (string), DPI (regex) and Video Buffer, now overlaid with two regions labeled "64 banks eDRAM" and "8 channels of DDR3-DRAM"]

Smart Memory uses intelligent algorithms to split the data structures
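For a rough sense of scale at 100G, the sketch below works out the packet rate for minimum-size Ethernet frames and multiplies it by a per-packet access count. The 64-byte frame and 20 bytes of preamble plus inter-frame gap are standard Ethernet figures; the accesses-per-packet values are illustrative assumptions, not numbers read from the chart.

```python
LINE_RATE_BPS = 100e9      # 100G line rate, from the slide title
MIN_FRAME_BYTES = 64       # minimum Ethernet frame
OVERHEAD_BYTES = 20        # 8-byte preamble + 12-byte inter-frame gap


def packets_per_second(line_rate_bps, frame_bytes):
    """Packet rate on the wire for back-to-back frames of the given size."""
    return line_rate_bps / ((frame_bytes + OVERHEAD_BYTES) * 8)


def accesses_per_second(accesses_per_packet, pps):
    """Memory access rate needed to sustain the packet rate."""
    return accesses_per_packet * pps


pps = packets_per_second(LINE_RATE_BPS, MIN_FRAME_BYTES)
# Per-packet access counts below are assumed values for illustration only.
for name, per_packet in (("counter update", 2), ("tree lookup", 8)):
    rate = accesses_per_second(per_packet, pps)
    print(f"{name}: {rate / 1e9:.2f} billion accesses/s")
```

At roughly 148.8 million packets per second, even a handful of accesses per packet lands in the billions of accesses per second.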

Page 12:

Smart Memory High Level Architecture

[Figure: a packet processor complex (4 x 4 grid of processors P) and DDR3 DRAM with a DRAM SM engine, connected to a Smart Memory complex of 16 eDRAM blocks, each paired with an SM engine]

Local interconnect: provides local communication between smart memory blocks

Global interconnect: provides fair communication between processors and smart memory

Page 13:

Smart Memory High Level Architecture

[Figure: the same packet processor complex and Smart Memory complex, annotated with a FIB lookup: a FIB key is sent to an SM engine, the engine issues reads to eDRAM and DDR3 DRAM, data comes back to the engine, and the result is returned]

Split tables into eDRAM and DRAM

Computation occurs close to memory, reducing latency

Requires fewer memory transactions
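The FIB flow in the figure can be sketched as follows. This is a toy model under stated assumptions: the `SMEngine` class, the binary lookup tree and the split point `EDRAM_LEVELS` are all hypothetical, since the slides do not say how tables are actually divided between eDRAM and DRAM.

```python
EDRAM_LEVELS = 2  # hypothetical split point: tree levels 0..1 live in eDRAM


class SMEngine:
    """Hypothetical SM engine walking a binary lookup tree split across two memories."""

    def __init__(self):
        self.nodes = [{"child": [None, None], "result": None, "depth": 0}]
        self.reads = {"eDRAM": 0, "DDR3": 0}
        self.requests = 0  # transactions seen on the global interconnect

    def add_route(self, prefix_bits, result):
        """Install a prefix given as a string of '0'/'1'."""
        idx = 0
        for bit in prefix_bits:
            b = int(bit)
            if self.nodes[idx]["child"][b] is None:
                depth = self.nodes[idx]["depth"] + 1
                self.nodes.append({"child": [None, None], "result": None, "depth": depth})
                self.nodes[idx]["child"][b] = len(self.nodes) - 1
            idx = self.nodes[idx]["child"][b]
        self.nodes[idx]["result"] = result

    def _read(self, idx):
        """Local read; shallow nodes come from eDRAM, deeper ones from DDR3."""
        node = self.nodes[idx]
        tier = "eDRAM" if node["depth"] < EDRAM_LEVELS else "DDR3"
        self.reads[tier] += 1
        return node

    def fib_lookup(self, key_bits):
        """One request in (FIB key), one response out (result)."""
        self.requests += 1
        best, idx = None, 0
        for bit in key_bits:
            node = self._read(idx)
            if node["result"] is not None:
                best = node["result"]
            idx = node["child"][int(bit)]
            if idx is None:
                return best
        node = self._read(idx)
        return node["result"] if node["result"] is not None else best


engine = SMEngine()
engine.add_route("0", "P1")
engine.add_route("0101", "P2")
result = engine.fib_lookup("01011111")
print(result, engine.requests, engine.reads)
```

The processor sees a single request and response, while the five node reads stay local to the engine: two served from eDRAM and three from DDR3 in this example.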

Page 14:

I/O Technology Choice in Smart Memory

Based on MoSys data

- Bandwidth, latency and I/O bandwidth gap is growing

- On-chip bandwidth is much higher than memory I/O

Smart Memory uses serial I/O

- 4X the throughput of RLDRAM and QDR

- 3X fewer pins than DDR3 and DDR4

- 2.5X lower I/O power

Smart Memory reduces the chip I/O bandwidth significantly. How to further optimize it?

Page 15:

High Speed Line Card with Smart Memory

[Figure: block diagrams of two line cards. Traditional Line Card: PHYs and MAC, four NPs each with TCAM, SRAM and DDR3, two TMs with SRAM, a FIC to the switch fabric, and a CPU with DDR3 control plane memory; annotated "DDR3 memory: 10+ DIMM, 900+ pins". Line Card with SM: PHYs and MAC, one NP, four SM chips, a FIC to the switch fabric, and a CPU with DDR3]

         Traditional Line Card    Line Card with SM
Power    540+ W                   212- W
Cost     5600+ $                  2520 $
Area     472+ cm2                 148 cm2

2-3 times
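The "2-3 times" annotation can be checked against the printed figures with a few lines of arithmetic. The dictionary keys are illustrative names, and the "+" and "-" qualifiers on the slide are dropped, so the ratios are approximate.

```python
# Figures as printed on the slide; "+" and "-" qualifiers are dropped for the arithmetic.
traditional = {"power_w": 540, "cost_usd": 5600, "area_cm2": 472}
with_sm = {"power_w": 212, "cost_usd": 2520, "area_cm2": 148}

# Ratio of the traditional line card to the line card with SM, per metric.
ratios = {key: traditional[key] / with_sm[key] for key in traditional}
for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
```

This gives roughly 2.5x for power, 2.2x for cost and 3.2x for area.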

Page 16:

Concluding Remarks

Packet Processing Bottlenecks
› Data away from compute
› I/O and memory bandwidth

Smart Memory
› Keep compute close to data
› Keep locking close to data
› Provide inter-memory connect

Advantages
› Reduced chip I/O bandwidth
› High performance and low latency
› Feature rich, flexible and programmable
› Lower cost
› One chip for several functions

Page 17:

Thank You

www.huawei.com