
©2016 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications are subject to change without notice. All information is provided on an “AS IS” basis without warranties of any kind. Statements regarding products, including regarding their features, availability, functionality, or compatibility, are provided for informational purposes only and do not modify the warranty, if any, applicable to any product. Drawings may not be to scale. Micron, the Micron logo, and all other Micron trademarks are the property of Micron Technology, Inc. All other trademarks are the property of their respective owners.


Hybrid Threaded Processing for Sparse Data Kernels


Tony Brewer

Chief Architect, Advanced Computing Solutions

May 9, 2018

Distribution Statement “A”: Approved for Public Release, Distribution Unlimited

“This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.”


The Challenge

Sparse data sets that greatly exceed a processor’s cache size are a challenge for most systems

– Processors are typically optimized for a high cache hit rate (>90%)
   Low cache hit rate results in idle cores
– Memory accesses are cache line size (64B)
   Sparse data sets result in memory accesses where the majority of accessed data is not used
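
The cost of fetching a full cache line for a sparse access can be put in numbers. The sketch below assumes that each sparse access needs a single 8-byte element; that figure is an illustrative assumption, since the slide only states the 64B cache line size. The function name `useful_fraction` is hypothetical.

```python
# Share of transferred bytes that the program actually consumes when a
# sparse access touches one small element per memory request.
# ASSUMPTION: 8 useful bytes per access (for example one 64-bit value).

def useful_fraction(useful_bytes: int, transfer_bytes: int) -> float:
    """Return the fraction of a memory transfer that is useful data."""
    if transfer_bytes < useful_bytes:
        raise ValueError("transfer must be at least as large as the useful data")
    return useful_bytes / transfer_bytes


if __name__ == "__main__":
    for size in (8, 16, 32, 64):
        print(f"{size:2d}B transfer: {useful_fraction(8, size):.1%} of bytes used")
```

Under this assumption a 64B cache line fill delivers 12.5% useful data, while an 8B transfer delivers 100%.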


Hybrid Threading Processor (HTP)

RISC-V ISA (RV64G)
– Extensions for thread and message management

High thread count barrel processor
– Similar to Cray’s MTA architecture
– One instruction per thread per scheduling interval (avoids register hazard checking)

Event driven processor
– Pause for memory response
– Pause for thread join
– Pause for message reception

Efficient memory usage
– Memory access size 8, 16, 32 or 64B

Software managed coherency
– Small cache per thread
– Atomics performed at memory

User space only
– Host processor provides system support

Standard GCC compiler
– Runtime provides access to new instructions
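
Why a high thread count helps can be illustrated with a toy model of a barrel processor. The sketch below is not the HTP design: the round-robin policy, the fixed memory latency, and the parameters `mem_every` and `mem_latency` are all assumptions chosen for illustration. It issues at most one instruction per cycle, lets each thread have only one instruction in flight, and pauses a thread while it waits for a memory response.

```python
from collections import deque


def simulate(num_threads: int, instrs_per_thread: int,
             mem_every: int, mem_latency: int) -> float:
    """Return the fraction of cycles in which the core issued an instruction.

    ASSUMPTIONS (illustrative only): every `mem_every`-th instruction of a
    thread is a memory access that pauses the thread for `mem_latency` cycles.
    """
    ready = deque(range(num_threads))        # thread ids that can issue
    remaining = [instrs_per_thread] * num_threads
    paused = {}                              # thread id -> cycle when it wakes
    cycle = issued = 0
    while ready or paused:
        # Event: memory responses arrive and wake the waiting threads.
        for tid in [t for t, wake in paused.items() if wake <= cycle]:
            del paused[tid]
            ready.append(tid)
        if ready:
            tid = ready.popleft()            # round-robin pick
            remaining[tid] -= 1
            issued += 1
            executed = instrs_per_thread - remaining[tid]
            if remaining[tid] > 0:
                if executed % mem_every == 0:
                    paused[tid] = cycle + mem_latency   # pause for memory
                else:
                    ready.append(tid)        # back of the round-robin queue
        cycle += 1
    return issued / cycle


if __name__ == "__main__":
    for threads in (1, 8, 64):
        util = simulate(threads, instrs_per_thread=100,
                        mem_every=4, mem_latency=100)
        print(f"{threads:3d} threads: core utilization {util:.2f}")
```

In this model a single thread leaves the core idle for most of each memory wait, while many threads fill those cycles with instructions from other threads.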


New RISC-V Instructions

Thread Management

– Thread Create, Return, Join

Message Management Instructions

– Message Send, Broadcast, Receive, Listen

Non-Cached Loads and Stores

– Integer and Float


System Architecture

[Figure: system architecture block diagram. Labels: two Hybrid Threading Processors (HTP); Network on Chip (NOC); two memory blocks, each with SRAM Cache, Atomic Operations, DRAM Ctrl with DRAM, and SCM Ctrl with SCM (Storage Class Memory).]



System Architecture

[Figure: the same system architecture block diagram, annotated with chiplet boundaries. Labels: Compute Near Memory (CNM) Chiplet; Memory Controller (MC) Chiplet; Hybrid Threading Processors (HTP); Network on Chip (NOC); SRAM Cache; Atomic Operations; DRAM Ctrl with DRAM; SCM Ctrl with SCM.]



Modeled Configurations

Focused on two primary configurations:

– 1x1 CNM Config.
   1 CNM Chiplet
   2 MC Chiplets
   4 GDDR6 Memories
– 2x2 CNM Config.
   4 CNM Chiplets
   8 MC Chiplets
   16 GDDR6 Memories
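
The component counts in both configurations follow the same ratios, which can be written down as a small calculation. The sketch below assumes that "NxM" denotes a grid of N by M CNM chiplets, which is consistent with the two listed configurations; applying it to other grid sizes would be an extrapolation. The function and constant names are hypothetical.

```python
# Ratios read off the two configurations on the slide:
#   1 CNM -> 2 MC -> 4 GDDR6   and   4 CNM -> 8 MC -> 16 GDDR6
MC_CHIPLETS_PER_CNM = 2
GDDR6_PER_MC_CHIPLET = 2


def component_counts(rows: int, cols: int) -> dict:
    """Return chiplet and memory counts for a rows x cols CNM grid.

    ASSUMPTION: "NxM" means a grid of N by M CNM chiplets.
    """
    cnm = rows * cols
    mc = cnm * MC_CHIPLETS_PER_CNM
    return {
        "cnm_chiplets": cnm,
        "mc_chiplets": mc,
        "gddr6_memories": mc * GDDR6_PER_MC_CHIPLET,
    }


if __name__ == "__main__":
    for shape in ((1, 1), (2, 2)):
        print(shape, component_counts(*shape))
```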

[Figure: chiplet layouts of the two modeled configurations, "1x1 CNM Configuration" and "2x2 CNM Configuration". Labels: Compute Near Memory Chiplet (HTP and NOC blocks); Memory Controller Chiplet (MC); CPI; GDDR6 DRAM; Interposer.]



Performance and Power

Simulator models functionality and performance

– Clocked simulation model

– Models functionality

– Models data paths, arbitration and queueing

Power Estimation Methodology

– Break design into major components: NOC, HTP, MC, GDDR6

– Identify all RAM structures, ALUs, I/O, long signal runs (NOC), etc.

– Determine power through foundry power estimation tools

– Determine application activity factors through simulation

– Determine power for each chiplet and total solution power
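
The roll-up described by these steps can be sketched as a sum over components, where each component contributes a static part plus a dynamic part scaled by its activity factor. This is a common first-order power model and an assumption here; the slides do not give the formula. All numeric values below are placeholders for illustration, not data from the presentation, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    count: int          # number of instances in the configuration
    static_w: float     # always-on power per instance (placeholder value)
    dynamic_w: float    # power per instance at 100% activity (placeholder)
    activity: float     # activity factor in [0, 1], would come from simulation

    def power_w(self) -> float:
        """First-order estimate: static power plus activity-scaled dynamic power."""
        return self.count * (self.static_w + self.activity * self.dynamic_w)


def total_power_w(components) -> float:
    """Sum the estimates of all components into a total solution power."""
    return sum(c.power_w() for c in components)


if __name__ == "__main__":
    # PLACEHOLDER numbers, chosen only to show the calculation.
    design = [
        Component("NOC", count=1, static_w=0.5, dynamic_w=2.0, activity=0.3),
        Component("HTP", count=8, static_w=0.2, dynamic_w=1.0, activity=0.6),
        Component("MC", count=2, static_w=0.4, dynamic_w=1.5, activity=0.5),
        Component("GDDR6", count=4, static_w=1.0, dynamic_w=3.0, activity=0.4),
    ]
    for c in design:
        print(f"{c.name:6s} {c.power_w():6.2f} W")
    print(f"total  {total_power_w(design):6.2f} W")
```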


Graph Spectral Clustering

GRAPH COMMUNITY DETECTION USING SPECTRAL METHODS

Community detection using spectral methods uses linear algebra to compute eigenvalues for the adjacency matrix associated with a graph. The lowest eigenvalues can be used to partition the graph.

Sparse data structures store the graph vertices, edges and properties.

Profile on an X86 system

Overhead   Symbol
13.82%     [.] svd_ATxb
13.36%     [.] svd_ATxb2
10.70%     [.] svd_Axb
10.01%     [.] substruct
 9.93%     [.] svd_Axb2
 6.59%     [.] _IO_vfscanf
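
The idea of partitioning a graph from its low end of the spectrum can be shown on a tiny example. The sketch below is not the profiled application: it uses the graph Laplacian (degree matrix minus adjacency matrix) and plain power iteration, a standard textbook formulation swapped in for illustration, and it stores the graph as sparse adjacency lists. The function name `fiedler_partition` is hypothetical.

```python
def fiedler_partition(num_vertices, edges, iterations=500):
    """Split an undirected graph in two using the eigenvector of the
    second-lowest Laplacian eigenvalue (the Fiedler vector).

    ASSUMPTION: connected graph; power iteration on (shift*I - L) with the
    constant vector projected out stands in for a real eigensolver.
    """
    # Sparse storage: only the edges that exist are kept.
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    degree = [len(neighbors) for neighbors in adj]
    shift = 2 * max(degree)   # upper bound on the largest Laplacian eigenvalue

    x = [float(i) for i in range(num_vertices)]   # deterministic start vector
    for _ in range(iterations):
        mean = sum(x) / num_vertices
        x = [xi - mean for xi in x]               # remove the eigenvalue-0 part
        # Sparse matrix-vector product y = (shift*I - L) x, with L = D - A.
        y = [(shift - degree[i]) * x[i] + sum(x[j] for j in adj[i])
             for i in range(num_vertices)]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]

    part_a = {i for i in range(num_vertices) if x[i] < 0}
    part_b = set(range(num_vertices)) - part_a
    return part_a, part_b


if __name__ == "__main__":
    # Two triangles joined by the single edge (2, 3).
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    print(fiedler_partition(6, edges))
```

The inner loop is a sparse matrix-vector product with irregular reads of `x[j]`, which is the kind of memory access pattern the earlier slides describe as hard on cache-based processors.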




SENSITIVITY ANALYSIS

Sensitivity analysis to determine optimal configuration

[Figure: two plots of Million Edges per Second (0 to 25) versus Edges per Vertex (0 to 200). Panel "Memory Access Size": series 8B, 16B, 32B, 64B. Panel "Total Thread Count": series 1024, 512, 256, 128.]

Other parameters
– Clock rate
– Cores per HTP processor




THREAD STATE MONITORING

Provides insights into run time dynamics of application

[Figure: "1x1 1.2 GHz 70ms". Thread Count (0 to 900) versus Simulation Time (ns) (0.00E+00 to 6.00E+07), with series Idle, RdyToRun, PausedMem, PausedEvent.]



Graph Spectral Clustering

COMPARISON TO REFERENCE PLATFORMS

Platform                                    Time (sec)   Power (watts)   Energy (joules)   Energy vs. 1x1
Haswell, 8 threads, 1 socket                13.5         140             1890              27.4x
Nvidia K80 (Host + 1 GPU)                   5.0 [1]      340             1703              24.7x
Nvidia DGX-1 (Host + P100), 1 P100 time     3.95 [1]     no power info (system was at a cloud provider)
NOC 1x1 Config, 1.2 GHz, simulated          2.90         23.8            69                1.0x
NOC 2x2 Config, 1.2 GHz, simulated          0.814        90.9            74                1.07x

[Figure: chiplet diagrams of the simulated 1x1 and 2x2 configurations. Labels: Host, HIF, PCIe, CIPI, MC, GDDR6 DRAM, NoC Edge, NoC Hub, HTF Cluster, HTP.]

Notes:

1. Times reported for GPUs do not include time to copy graph from host to GPU (120 sec).
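
The energy figures in the comparison follow from time multiplied by power, and the ratios match energy relative to the simulated 1x1 configuration. The sketch below recomputes them from the reported times and powers; the dictionary keys and function names are hypothetical, and small differences from the slide (for example 1700 versus the reported 1703 joules for the K80) come from the rounding of the published inputs.

```python
# (time in seconds, power in watts) as reported in the comparison.
RESULTS = {
    "haswell": (13.5, 140.0),
    "k80": (5.0, 340.0),
    "noc_1x1": (2.90, 23.8),
    "noc_2x2": (0.814, 90.9),
}


def energy_joules(platform: str) -> float:
    """Energy = run time * average power."""
    seconds, watts = RESULTS[platform]
    return seconds * watts


def relative_energy(platform: str, baseline: str = "noc_1x1") -> float:
    """Energy of a platform divided by the energy of the baseline."""
    return energy_joules(platform) / energy_joules(baseline)


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name:8s} {energy_joules(name):8.1f} J "
              f"{relative_energy(name):6.2f}x")
```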
