TRANSCRIPT
Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures
MARCO CERIANI† SIMONE SECCHI∗ ANTONINO TUMEO‡ ORESTE VILLA‡ GIANLUCA PALERMO†
December 6, 2013 CARL 2013 1
†Politecnico di Milano - DEI, 20133, Milano, Italy. {mceriani,gpalermo}@elet.polimi.it
∗Università degli Studi di Cagliari - DIEE, 09123, Cagliari, Italy. [email protected]
‡Pacific Northwest National Laboratory, Richland, WA. [email protected]
New generation of irregular HPC applications
- Complex Networks
- Community Detection
- Bioinformatics
- Knowledge Discovery
- Semantic Databases
- Language Understanding
- Pattern Recognition
- Big Science
Characteristics of Emerging Irregular Applications
- Use pointer- or linked-list-based data structures
  - Graphs, unbalanced trees, unstructured grids
  - Fine-grained data accesses
- Very large datasets
  - Far more than is currently available on single cluster nodes
  - Very difficult to partition without generating load imbalance
- Very poor spatial and temporal locality
  - Unpredictable network and memory accesses
  - Memory- and network-bandwidth limited!
- Large amounts of parallelism (e.g., each vertex, each edge in the graph)
  - But irregularity in control flow, e.g. if (vertex == x) z; else k;
Objective
- We aim to design a full-system architecture for irregular applications starting from off-the-shelf cores
  - Big datasets imply a multi-node architecture
- We do it by:
  - Introducing custom hardware and software components that optimize the architecture for executing multi-node irregular applications
  - Employing an FPGA prototype to validate the approach
Supporting Irregular Applications
[Diagram: three pillars of support: Fast Context Switching, Fine-grain Global Address Space, Hardware Synchronization]
- Fast context switching: tolerates latencies
- Fine-grain global address space: removes partitioning requirements, simplifies code development
- Hardware synchronization: increases performance with synchronization-intensive workloads
Why a prototype?
- Hardware components designed at the register-transfer level
  - Stronger validation than a simulator
  - Captures primary performance issues
  - Exposes hardware implementation challenges
- Higher speed than a simulation infrastructure
  - Allows faster iterations between hardware and software
- Software layer can be co-developed and evaluated with the hardware
Node Architecture Overview
- MicroBlaze processors
  - Connected to private scratchpads
  - All access a shared external DDR3 memory
- Internal interconnection: AXI
- External interconnection: Aurora
- Three custom hardware components:
  - GMAS: Global Memory Access Scheduler
  - GNI: Global Network Interface
  - GSYNC: Global SYNChronization module
- Support for lightweight software multithreading
Programming model
- Global address space: a shared-memory programming model on top of a distributed-memory machine
  - The developer allocates and frees memory areas in the global address space by using standard memory allocation primitives
- The Application Programming Interface (API) provides:
  - Extended malloc and free primitives that support allocation in the shared global memory space and in the node-local memory space
  - POSIX-like thread management: thread creation, join, yield
  - Synchronization routines: lock, spinning lock, unlock, barrier
- Applications are developed with a Single Program Multiple Data (SPMD) approach
  - Each thread executes the same code on different elements of the dataset
- In the current prototype, thread contexts are stored in private scratchpads and do not migrate
  - Potential load imbalance, but faster context switching
  - Alternative approach: storing contexts in the global address space and prefetching them into the scratchpads
Quad-Board Prototyping Platform
- 4 Xilinx Virtex-6 ML605 boards (Virtex-6 LX240T devices)
- Xilinx ISE Embedded Design Suite 13.4
- Prototyped a quad-node system
GMAS
- One GMAS per core
- Forwards memory operations from the cores to the memories
- Enables scrambled global address space support
- Hosts Load/Store Queues (LSQs) for long-latency memory operations
- Provides thread IDs to the core
- Provides the interface to the GSYNC
GMAS Operation
- When a core emits a memory operation, the GMAS descrambles it and verifies its destination
- If the destination is local (local memories, or the local portion of the global address space):
  - The operation is forwarded directly to the destination memory
- If the destination is remote:
  - The request is sent to the GNI
  - The information about the memory operation is saved in the LSQ block, and the pending bit is set
  - A canary value is sent to the core, setting the redo bit
  - An interrupt is triggered, starting a context switch
- When the reply to the remote reference comes back:
  - The pending bit is reset, allowing the source thread to be scheduled again
  - When the thread is scheduled, it re-executes the memory operation and the redo bit is reset
GNI
- One GNI per node
- Interfaces AXI with the network (Aurora)
- Translates the internal network protocol to the external network protocol, and vice versa
- Each packet contains a header with the source node plus the original AXI transaction
- The destination GNI translates the incoming transaction, executes the memory operation, and sends back the result
GSYNC
- One GSYNC per node
- Implements a lock table of configurable size
  - Each GSYNC stores the locks for the addresses on its own node
  - Direct mapping: multiple addresses share the same lock (aliasing)
- When a core writes to the lock register of the GMAS:
  - A load is sent to the GSYNC, addressing the related lock bit
  - The GSYNC handles the load as a bit swap, and returns the current value in the slot
  - Locks not taken are retried in software
- When a core writes to the unlock register of the GMAS:
  - A store with value 0 is sent to the GSYNC, addressing the related lock bit
- Remote GSYNCs are accessed through the GNI as normal remote memory operations
Experimental setup
- 4 nodes
- From 1 to 32 MicroBlazes per node
- From 1 to 4 threads per MicroBlaze
- 512 MB per node: 32 MB as local memory, the rest exposed in the global address space, for a total of 1920 MB
- Scrambling granularity: 8 bytes; GSYNC lock table: 8196 entries
- Bandwidth: 1.5 Gbps (500 Mbps per channel), 1/3 overhead for headers (1 Gbps effective)
- Frequency: 100 MHz
- Delays:
  - Context switch: 232 cycles (41 ISR launch, 65 save context, 20 launch scheduler, 50 load context, 24 interrupt reset, 50 exit ISR)
  - Round trip for a remote memory reference: 403 cycles
- Applications:
  - Pointer chasing
  - Breadth-First Search (BFS)
Experimental results - Pointer Chasing
- Bandwidth utilization increases with the number of cores
- Bandwidth utilization also increases with the number of threads
  - However, the system saturates with 3 threads
- Utilization decreases with 3 and 4 threads at 32 cores, with respect to 16 cores, because of higher contention on the internal interconnection
Experimental results - BFS
- 100,000 vertices, 80 neighbors on average, 3,998,706 traversed edges
- Throughput increases with the number of cores
  - Biggest increase from 4 to 8 cores
- Increasing the number of threads from 1 to 3 increases performance
- However, with 4 threads performance decreases
  - Increased contention on the GSYNC for the locks (BFS is synchronization intensive)
Conclusions
- Presented the set of hardware and software components that enable efficient execution of irregular applications on a many-core, multi-node system, starting from off-the-shelf cores:
  - Support for a global address space and long-latency remote memory operations (GMAS)
  - Fine-grained hardware synchronization (GSYNC)
  - Integrated network interface (GNI)
  - Fast software multithreading (with hardware-supported scheduling)
- Introduced an FPGA prototype of the proposed design
- Validated the prototype with two typical irregular kernels
  - Scaling in bandwidth utilization and performance when increasing cores and threads