TRANSCRIPT
©2016 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications are subject to
change without notice. All information is provided on an “AS IS” basis without warranties of any kind.
Statements regarding products, including regarding their features, availability, functionality, or compatibility,
are provided for informational purposes only and do not modify the warranty, if any, applicable to any
product. Drawings may not be to scale. Micron, the Micron logo, and all other Micron trademarks are the
property of Micron Technology, Inc. All other trademarks are the property of their respective owners.
Hybrid Threaded Processing
for Sparse Data Kernels
Tony Brewer
Chief Architect, Advanced Computing Solutions
May 9, 2018
Distribution Statement “A”: Approved for Public Release, Distribution Unlimited
“This research was, in part, funded by the U.S. Government. The views and conclusions
contained in this document are those of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of the U.S. Government.”
© 2016 Micron Technology, Inc. |
The Challenge
Sparse data sets that greatly exceed a processor’s cache size are a challenge for most systems
– Processors are typically optimized for a high cache hit rate (>90%)
  Low cache hit rate results in idle cores
– Memory accesses are cache line size (64B)
  Sparse data sets result in memory accesses where the majority of accessed data is not used
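The cache-line point above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 64B line size comes from the slide, while the function name and the choice of 8, 16, 32 and 64 useful bytes are assumptions made for the example.

```python
CACHE_LINE_BYTES = 64  # cache line size stated on the slide


def useful_fraction(bytes_used: int, bytes_fetched: int = CACHE_LINE_BYTES) -> float:
    """Return the fraction of fetched bytes that the kernel actually consumes."""
    if not 0 < bytes_used <= bytes_fetched:
        raise ValueError("bytes_used must be in the range 1..bytes_fetched")
    return bytes_used / bytes_fetched


if __name__ == "__main__":
    # A sparse kernel that touches one 8-byte value per line wastes most of the fetch.
    for used in (8, 16, 32, 64):
        print(f"{used:2d} B used of {CACHE_LINE_BYTES} B fetched: "
              f"{useful_fraction(used):.1%} useful")
```

Under these assumptions, an access pattern that needs a single 8-byte element per cache line uses one eighth of the memory bandwidth it consumes.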
Hybrid Threading Processor (HTP)
RISC-V ISA (RV64G)
– Extensions for thread and message management
High thread count barrel processor
– Similar to Cray’s MTA architecture
– One instruction per thread per scheduling interval (avoids register hazard checking)
Event-driven processor
– Pause for memory response
– Pause for thread join
– Pause for message reception
Efficient memory usage
– Memory access size 8, 16, 32 or 64B
Software-managed coherency
– Small cache per thread
– Atomics performed at memory
User space only
– Host processor provides system support
Standard GCC compiler
– Runtime provides access to new instructions
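The barrel-processor idea above (one instruction per thread per scheduling interval, with threads pausing for memory responses) can be illustrated with a toy cycle-level model. This is a sketch under stated assumptions, not a model of the HTP: the function name, the memory latency, and the number of instructions between loads are all hypothetical values chosen for the example.

```python
def barrel_utilization(num_threads, mem_latency, instrs_between_loads, cycles=10_000):
    """Fraction of cycles in which a round-robin barrel core issues an instruction.

    Each cycle the core issues at most one instruction, taken from the next
    ready thread in round-robin order. Every `instrs_between_loads`-th
    instruction of a thread is a load that pauses that thread for
    `mem_latency` cycles (hypothetical parameters).
    """
    ready_at = [0] * num_threads    # first cycle at which each thread may issue again
    since_load = [0] * num_threads  # instructions issued since the thread's last load
    issued = 0
    next_thread = 0                 # round-robin pointer

    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] > cycle:
                continue            # thread is paused waiting for memory
            issued += 1
            since_load[t] += 1
            if since_load[t] == instrs_between_loads:
                since_load[t] = 0
                ready_at[t] = cycle + mem_latency  # pause for memory response
            else:
                ready_at[t] = cycle + 1
            next_thread = (t + 1) % num_threads
            break                   # at most one instruction per cycle

    return issued / cycles


if __name__ == "__main__":
    for threads in (1, 8, 64):
        util = barrel_utilization(threads, mem_latency=100, instrs_between_loads=4)
        print(f"{threads:3d} threads: core busy {util:.1%} of cycles")
```

In this toy model a single thread leaves the core idle for most cycles, while a larger pool of threads hides much of the memory latency, which is the motivation for a high thread count.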
New RISC-V Instructions
Thread Management
– Thread Create, Return, Join
Message Management Instructions
– Message Send, Broadcast, Receive, Listen
Non-Cached Loads and Stores
– Integer and Float
System Architecture
[Figure: block diagram. Labels: two Hybrid Threading Processor (HTP) blocks; Network on Chip (NOC); two memory-side groups, each with DRAM Ctrl, DRAM, SCM Ctrl, SCM (Storage Class Memory), SRAM Cache and Atomic Operations.]
System Architecture
[Figure: block diagram. Labels: two Hybrid Threading Processor (HTP) blocks; Network on Chip (NOC); two memory-side groups, each with DRAM Ctrl, DRAM, SCM Ctrl, SCM, SRAM Cache and Atomic Operations; annotations marking the Memory Controller (MC) Chiplet and the Compute Near Memory (CNM) Chiplet.]
Modeled Configurations
Focused on two primary configurations:
– 1x1 CNM Config.
  1 CNM Chiplet
  2 MC Chiplets
  4 GDDR6 Memories
– 2x2 CNM Config.
  4 CNM Chiplets
  8 MC Chiplets
  16 GDDR6 Memories
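Both configurations listed above follow the same per-chiplet ratio: 2 MC chiplets and 4 GDDR6 memories for each CNM chiplet. The sketch below encodes that ratio. The function and constant names are hypothetical, and applying the ratio to grid sizes other than 1x1 and 2x2 is an assumption, not something the slides state.

```python
# Ratios read off the two configurations on the slide.
MC_CHIPLETS_PER_CNM = 2
GDDR6_PER_CNM = 4


def component_counts(rows: int, cols: int) -> dict:
    """Component counts for a rows x cols grid of CNM chiplets."""
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be positive")
    cnm = rows * cols
    return {
        "cnm_chiplets": cnm,
        "mc_chiplets": cnm * MC_CHIPLETS_PER_CNM,
        "gddr6_memories": cnm * GDDR6_PER_CNM,
    }


if __name__ == "__main__":
    for rows, cols in ((1, 1), (2, 2)):
        print(f"{rows}x{cols}: {component_counts(rows, cols)}")
```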
[Figures: 1x1 CNM Configuration and 2x2 CNM Configuration. Labels: NOC, HTP, MC, CPI, GDDR6 DRAM, Interposer, Memory Controller Chiplet, Compute Near Memory Chiplet.]
Performance and Power
Simulator models functionality and performance
– Clocked simulation model
– Models functionality
– Models data paths, arbitration and queueing
Power Estimation Methodology
– Break design into major components: NOC, HTP, MC, GDDR6
– Identify all RAM structures, ALUs, I/O, long signal runs (NOC), etc.
– Determine power through foundry power estimation tools
– Determine application activity factors through simulation
– Determine power for each chiplet and total solution power
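The methodology above combines per-component power figures with activity factors taken from simulation. The sketch below shows one way such a roll-up could be written. The linear model (idle power plus activity times additional active power) is an assumption, and every number in the example is a placeholder chosen for illustration, not a value from the presentation or from any foundry tool.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    count: int           # number of instances in the configuration
    idle_watts: float    # power with no activity (placeholder value)
    active_watts: float  # additional power at 100% activity (placeholder value)
    activity: float      # activity factor in [0, 1], e.g. from simulation

    def power(self) -> float:
        """Estimated power for all instances under the assumed linear model."""
        return self.count * (self.idle_watts + self.activity * self.active_watts)


def total_power(components) -> float:
    """Sum the estimated power of every component group."""
    return sum(c.power() for c in components)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    design = [
        Component("NOC", 1, 0.5, 2.0, 0.4),
        Component("HTP", 8, 0.1, 0.5, 0.6),
        Component("MC", 2, 0.3, 1.0, 0.5),
        Component("GDDR6", 4, 0.5, 2.5, 0.5),
    ]
    for c in design:
        print(f"{c.name:6s} {c.power():6.2f} W")
    print(f"total  {total_power(design):6.2f} W")
```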
Graph Spectral Clustering
Community detection using spectral methods uses linear algebra to compute
eigenvalues of the adjacency matrix associated with a graph. The lowest
eigenvalues can be used to partition the graph.
Sparse data structures store the graph vertices, edges and properties.
GRAPH COMMUNITY DETECTION USING SPECTRAL METHODS
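To illustrate spectral partitioning, the sketch below splits a small graph by the sign of an approximate Fiedler vector. Note the substitution: the slide refers to eigenvalues of the adjacency matrix, whereas this example uses the graph Laplacian L = D - A, a common textbook formulation, so it is an assumption rather than the method used in the presentation. The eigenvector is found with shifted power iteration in pure Python; the function name and the test graph are hypothetical, and the code assumes a connected graph with at least one edge.

```python
def fiedler_partition(num_vertices, edges, iterations=500):
    """Split an undirected graph by the sign of an approximate Fiedler vector."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    degree = [len(nbrs) for nbrs in adj]
    shift = 2 * max(degree)  # upper bound on the largest Laplacian eigenvalue

    # Deterministic start vector, orthogonal to the constant vector.
    mean = (num_vertices - 1) / 2
    x = [i - mean for i in range(num_vertices)]

    for _ in range(iterations):
        # y = (shift*I - L) x, where (L x)[i] = degree[i]*x[i] - sum of x over neighbours.
        y = [(shift - degree[i]) * x[i] + sum(x[j] for j in adj[i])
             for i in range(num_vertices)]
        # Remove the constant component (eigenvalue 0 of L), then normalise.
        avg = sum(y) / num_vertices
        y = [val - avg for val in y]
        norm = sum(val * val for val in y) ** 0.5
        x = [val / norm for val in y]

    part_a = {i for i in range(num_vertices) if x[i] < 0}
    part_b = set(range(num_vertices)) - part_a
    return part_a, part_b


if __name__ == "__main__":
    # Two triangles joined by the single edge (2, 3).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    print(fiedler_partition(6, edges))
```

On the two-triangle example the split falls on the bridging edge, separating one triangle from the other.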
Profile on an X86 system
Overhead Symbol
13.82% [.] svd_ATxb
13.36% [.] svd_ATxb2
10.70% [.] svd_Axb
10.01% [.] substruct
9.93% [.] svd_Axb2
6.59% [.] _IO_vfscanf
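The symbol names in the profile suggest sparse matrix-vector products with the matrix and with its transpose; that reading of the names is an assumption. The sketch below shows both products over a compressed sparse row (CSR) layout, with hypothetical function names, to make the irregular access pattern visible: the first gathers from scattered positions of the input vector, the second scatters updates into the output vector.

```python
def csr_matvec(indptr, indices, data, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]  # gather: irregular read of x
    return y


def csr_matvec_transpose(indptr, indices, data, x, num_cols):
    """Compute y = A^T x for a matrix A stored in CSR form."""
    y = [0.0] * num_cols
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            y[indices[k]] += data[k] * x[row]  # scatter: irregular update of y
    return y


if __name__ == "__main__":
    # Adjacency matrix of the undirected path graph 0 - 1 - 2.
    indptr = [0, 1, 3, 4]
    indices = [1, 0, 2, 1]
    data = [1.0, 1.0, 1.0, 1.0]
    x = [1.0, 2.0, 3.0]
    print(csr_matvec(indptr, indices, data, x))
    print(csr_matvec_transpose(indptr, indices, data, x, num_cols=3))
```

Each inner-loop step touches one element of `x` or `y` at an index determined by the graph, so most of each fetched cache line goes unused when the graph is large and sparse.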
Graph Spectral Clustering
Sensitivity analysis to determine
optimal configuration
SENSITIVITY ANALYSIS
[Chart: Memory Access Size. Million edges per second (0 to 25) versus edges per vertex (0 to 200); series: 8B, 16B, 32B, 64B.]
[Chart: Total Thread Count. Million edges per second (0 to 25) versus edges per vertex (0 to 200); series: 1024, 512, 256, 128.]
Other parameters
– Clock rate
– Cores per HTP processor
Graph Spectral Clustering
Provides insights into run time dynamics of application
THREAD STATE MONITORING
[Chart: 1x1, 1.2 GHz, 70 ms. Thread count (0 to 900) versus simulation time in ns (0.00E+00 to 6.00E+07); series: Idle, RdyToRun, PausedMem, PausedEvent.]
Graph Spectral Clustering
COMPARISON TO REFERENCE PLATFORMS
Platform                                   Time          Power       Energy        Energy ratio
Haswell, 8 threads, 1 socket               13.5 Sec      140 Watts   1890 Joules   27.4x
Nvidia K80 (Host + 1 GPU)                  5.0 Sec [1]   340 Watts   1703 Joules   24.7x
Nvidia DGX-1 (Host + P100), 1 P100 time    3.95 Sec [1]  (Note – System was at a cloud provider, no power info)
NOC 1x1 Config, 1.2 GHz, simulated         2.90 Sec      23.8 Watts  69 Joules     1.0x
NOC 2x2 Config, 1.2 GHz, simulated         0.814 Sec     90.9 Watts  74 Joules     1.07x
[Figures: block diagrams of the simulated 1x1 and 2x2 configurations. Labels: Host, HIF, PCIe, GDDR6 DRAM, CIPI, MC, NoC Edge, NoC Hub, HTF Cluster, HTP.]
Notes:
1. Times reported for GPUs do not include time to copy graph from host to GPU (120 sec).
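The energy figures in the comparison are consistent with run time multiplied by power, and the "x" values with energy relative to the 1x1 configuration. The sketch below recomputes them from the times and powers on the slide. The dictionary keys and function names are hypothetical, the DGX-1 entry is omitted because no power figure was reported, and, as the note says, the GPU times exclude the host-to-GPU copy.

```python
# (run time in seconds, power in watts) as listed on the slide.
PLATFORMS = {
    "Haswell (8 threads, 1 socket)": (13.5, 140.0),
    "Nvidia K80 (Host + 1 GPU)": (5.0, 340.0),
    "NOC 1x1 (simulated)": (2.90, 23.8),
    "NOC 2x2 (simulated)": (0.814, 90.9),
}


def energy_joules(seconds: float, watts: float) -> float:
    """Energy consumed over the run, assuming constant power."""
    return seconds * watts


def relative_energy(baseline: str = "NOC 1x1 (simulated)") -> dict:
    """Energy of every platform divided by the energy of the baseline."""
    base = energy_joules(*PLATFORMS[baseline])
    return {name: energy_joules(*tw) / base for name, tw in PLATFORMS.items()}


if __name__ == "__main__":
    ratios = relative_energy()
    for name, (seconds, watts) in PLATFORMS.items():
        print(f"{name:32s} {energy_joules(seconds, watts):8.1f} J  {ratios[name]:5.2f}x")
```

The K80 product comes out slightly below the 1703 Joules shown on the slide, presumably because the displayed time is rounded.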