isca final presentation - queuing model
TRANSCRIPT
HSA QUEUING MODELHAKAN PERSSON, SENIOR PRINCIPAL ENGINEER, ARM
HSA QUEUEING, MOTIVATION
© Copyright 2014 HSA Foundation. All Rights Reserved
MOTIVATION (TODAY’S PICTURE)
Application OS GPU
Transfer buffer to GPU Copy/Map
Memory
Queue Job
Schedule JobStart Job
Finish JobSchedule
ApplicationGet Buffer
Copy/MapMemory
HSA QUEUEING: REQUIREMENTS
© Copyright 2014 HSA Foundation. All Rights Reserved
REQUIREMENTS
Three key technologies are used to build the user mode queueing mechanism
Shared Virtual Memory System Coherency Signaling
AQL (Architected Queueing Language) enables any agent enqueue tasks
SHARED VIRTUAL MEMORY
© Copyright 2014 HSA Foundation. All Rights Reserved
PHYSICAL MEMORY
SHARED VIRTUAL MEMORY (TODAY) Multiple Virtual memory address spaces
CPU0 GPU
VIRTUAL MEMORY1
PHYSICAL MEMORY
VA1->PA1 VA2->PA1
VIRTUAL MEMORY2
© Copyright 2014 HSA Foundation. All Rights Reserved
PHYSICAL MEMORY
SHARED VIRTUAL MEMORY (HSA) Common Virtual Memory for all HSA agents
CPU0 GPU
VIRTUAL MEMORY
PHYSICAL MEMORY
VA->PA VA->PA
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
Advantages No mapping tricks, no copying back-and-forth between different PA
addresses Send pointers (not data) back and forth between HSA agents.
Implications Common Page Tables (and common interpretation of architectural semantics
such as shareability, protection, etc). Common mechanisms for address translation (and servicing address
translation faults) Concept of a process address space (PASID) to allow multiple, per process
virtual address spaces within the system.
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
Specifics Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems. HSA agents may reserve VA ranges for internal use via system
software. All HSA agents other than the host unit must use the lowest privilege
level If present, read/write access flags for page tables must be
maintained by all agents. Read/write permissions apply to all HSA agents, equally.
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING THERE …
Application OS GPU
Transfer buffer to GPU Copy/Map
Memory
Queue Job
Schedule JobStart Job
Finish JobSchedule
ApplicationGet Buffer
Copy/MapMemory
CACHE COHERENCY
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (1/3)
Data accesses to global memory segment from all HSA Agents shall be coherent without the need for explicit cache maintenance.
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (2/3)
Advantages Composability Reduced SW complexity when communicating between agents Lower barrier to entry when porting software
Implications Hardware coherency support between all HSA agents Can take many forms
Stand alone Snoop Filters / Directories Combined L3/Filters Snoop-based systems (no filter) Etc …
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (3/3)
Specifics No requirement for instruction memory accesses to be coherent Only applies to the Primary memory type. No requirement for HSA agents to maintain coherency to any
memory location where the HSA agents do not specify the same memory attributes
Read-only image data is required to remain static during the execution of an HSA kernel.
No double mapping (via different attributes) in order to modify. Must remain static
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING CLOSER …
Application OS GPU
Transfer buffer to GPU Copy/Map
Memory
Queue Job
Schedule JobStart Job
Finish JobSchedule
ApplicationGet Buffer
Copy/MapMemory
SIGNALING
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (1/3)
HSA agents support the ability to use signaling objects All creation/destruction signaling objects occurs via HSA
runtime APIs From an HSA Agent you can directly access signaling objects.
Signaling a signal object (this will wake up HSA agents waiting upon the object)
Query current object Wait on the current object (various conditions supported).
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (2/3)
Advantages Enables asynchronous events between HSA agents,
without involving the kernel Common idiom for work offload Low power waiting
Implications Runtime support required Commonly implemented on top of cache coherency flows
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (3/3)
Specifics Only supported within a PASID Supported wait conditions are =, !=, < and >= Wait operations may return sporadically (no guarantee against false
positives) Programmer must test.
Wait operations have a maximum duration before returning. The HSAIL atomic operations are supported on signal objects. Signal objects are opaque
Must use dedicated HSAIL/HSA runtime operations
© Copyright 2014 HSA Foundation. All Rights Reserved
ALMOST THERE…
Application OS GPU
Transfer buffer to GPU Copy/Map
Memory
Queue Job
Schedule JobStart Job
Finish JobSchedule
ApplicationGet Buffer
Copy/MapMemory
USER MODE QUEUING
© Copyright 2014 HSA Foundation. All Rights Reserved
ONE BLOCK LEFT
Application OS GPU
Transfer buffer to GPU Copy/Map
Memory
Queue Job
Schedule JobStart Job
Finish JobSchedule
ApplicationGet Buffer
Copy/MapMemory
© Copyright 2014 HSA Foundation. All Rights Reserved
USER MODE QUEUEING (1/3)
User mode Queueing Enables user space applications to directly, without OS
intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.
Queues are created/destroyed via calls to the HSA runtime.
One (or many) agents enqueue packets, a single agent dequeues packets.
Requires coherency and shared virtual memory.
© Copyright 2014 HSA Foundation. All Rights Reserved
USER MODE QUEUEING (2/3)
Advantages Avoid involving the kernel/driver when dispatching work for an Agent. Lower latency job dispatch enables finer granularity of offload Standard memory protection mechanisms may be used to protect communication with
the consuming agent.
Implications Packet formats/fields are Architected – standard across vendors!
Guaranteed backward compatibility Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signaling) More on this later……
© Copyright 2014 HSA Foundation. All Rights Reserved
SUCCESS!
Application OS GPU
Transfer buffer to GPU Copy/Map
Memory
Queue Job
Schedule JobStart Job
Finish JobSchedule
ApplicationGet Buffer
Copy/MapMemory
© Copyright 2014 HSA Foundation. All Rights Reserved
SUCCESS!
Application OS GPU
Queue Job
Start Job
Finish Job
ARCHITECTED QUEUEING LANGUAGE, QUEUES
© Copyright 2014 HSA Foundation. All Rights Reserved
ARCHITECTED QUEUEING LANGUAGE HSA Queues look just like standard shared
memory queues, supporting multi-producer, single-consumer
Single producer variant defined with some optimizations possible.
Queues consist of storage, read/write indices, ID, etc.
Queues are created/destroyed via calls to the HSA runtime
“Packets” are placed in queues directly from user mode, via an architected protocol
Packet format is architected
Producer Producer
Consumer
Read Index
Write Index
Storage in coherent, shared memory
Packets
© Copyright 2014 HSA Foundation. All Rights Reserved
ARCHITECTED QUEUING LANGUAGE Packets are read and dispatched for execution from the queue in order, but
may complete in any order. There is no guarantee that more than one packet will be processed in parallel at a
time
There may be many queues. A single agent may also consume from several queues.
Any HSA agent may enqueue packets CPUs GPUs Other accelerators
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTUREOffset (bytes) Size (bytes) Field Notes
0 4 queueType Differentiate different queues
4 4 queueFeatures Indicate supported features
8 8 baseAddress Pointer to packet array
16 16 doorbellSignal HSA signaling object handle
24 4 size Packet array cardinality
28 4 queueId Unique per process
32 8 serviceQueue Queue for callback services
intrinsic 8 writeIndex Packet array write index
intrinsic 8 readIndex Packet array read index
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE VARIANTS
queueType and queueFeatures together define queue semantics and capabilities
Two queueType values defined, other values reserved: MULTI – queue supports multiple producers SINGLE – queue supports single producer
queueFeatures is a bitfield indicating capabilities DISPATCH (bit 0) if set then queue supports DISPATCH packets AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets All other bits are reserved and must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE DETAILS
Queue doorbells are HSA signaling objects with restrictions Created as part of the queue – lifetime tied to queue object Atomic read-modify-write not allowed
size field value must be aligned to a power of 2
serviceQueue can be used by HSA kernel for callback services Provided by application when queue is created Can be mapped to HSA runtime provided serviceQueue, an application serviced
queue, or NULL if no serviceQueue required
© Copyright 2014 HSA Foundation. All Rights Reserved
READ/WRITE INDICES
readIndex and writeIndex properties are part of the queue, but not visible in the queue structure Accessed through HSA runtime API and HSAIL operations HSA runtime/HSAIL operations defined to
Read readIndex or writeIndex property Write readIndex or writeIndex property Add constant to writeIndex property (returns previous writeIndex value) CAS on writeIndex property
readIndex & writeIndex operations treated as atomic in memory model relaxed, acquire, release and acquire-release variants defined as applicable
readIndex and writeIndex never wrap
PacketID – the index of a particular packet Uniquely identifies each packet of a queue
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET ENQUEUE
Packet enqueue follows a few simple steps: Reserve space
Multiple packets can be reserved at a time Write packet to queue Mark packet as valid
Producer no longer allowed to modify packet Consumer is allowed to start processing packet
Notify consumer of packet through the queue doorbell Multiple packets can be notified at a time Doorbell signal should be signaled with last packetID notified
On small machine model the lower 32 bits of the packetID are used
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET RESERVATION
Two flows envisaged Atomic add writeIndex with number of packets to reserve
Producer must wait until packetID < readIndex + size before writing to packet Queue can be sized so that wait is unlikely (or impossible) Suitable when many threads use one queue
Check queue not full first, then use atomic CAS to update writeIndex Can be inefficient if many threads use the same queue Allows different failure model if queue is congested
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE OPTIMIZATIONS
Queue behavior is loosely defined to allow optimizations
Some potential producer behavior optimizations: Keep local copy of readIndex, update when required For single producer queues:
Keep local copy of writeIndex Use store operation rather than add/cas atomic to update writeIndex
Some potential consumer behavior optimizations: Use packet format field to determine whether a packet has been submitted rather than writeIndex property Speculatively read multiple packets from the queue Not update readIndex for each packet processed Rely on value used for doorbellSignal to notify new packets
Especially useful for single producer queues
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packetuint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full. uint64_t rdIdx;do { rdIdx = hsa_queue_load_read_index_relaxed(q);} while (packetID >= (rdIdx + q->size));
// calculate indexuint32_t arrayIdx = packetID & (q->size-1);
// copy over the packet, the format field is INVALIDq->baseAddress[arrayIdx] = pkt;
// Update format field with release semanticsq->baseAddress[index].hdr.format.store(DISPATCH, std::memory_order_release);
// ring doorbell, with release semantics (could also amortize over multiple packets)hsa_signal_send_relaxed(q->doorbellSignal, packetID);
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL CONSUMER ALGORITHM// Get location of next packetuint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// calculate the index uint32_t arrayIdx = readIndex & (q->size-1);
// spin while empty (could also perform low-power wait on doorbell)while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// copy over the packetpkt = q->baseAddress[arrayIdx];
// set the format field to invalidq->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using HSA intrinsichsa_queue_store_read_index_relaxed(q, readIndex+1);
// Now process <pkt>!
ARCHITECTED QUEUEING LANGUAGE, PACKETS
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKETS
Packets come in three main types with architected layouts Always reserved & Invalid
Do not contain any valid tasks and are not processed (queue will not progress) Dispatch
Specifies kernel execution over a grid Agent Dispatch
Specifies a single function to perform with a set of parameters Barrier
Used for task dependencies
© Copyright 2014 HSA Foundation. All Rights Reserved
COMMON PACKET HEADERStart Offset
(Bytes)Format Field Name Description
0 uint16_t
format:8Contains the packet type (Always reserved, Invalid, Dispatch, Agent Dispatch, and Barrier). Other values are reserved and should not be used.
barrier:1If set then processing of packet will only begin when all preceding packets are complete.
acquireFenceScope:2
Determines the scope and type of the memory fence operation applied before the packet enters the active phase.Must be 0 for Barrier Packets.
releaseFenceScope:2Determines the scope and type of the memory fence operation applied after kernel completion but before the packet is completed.
reserved:3 Must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
DISPATCH PACKETStart
Offset (Bytes)
Format Field Name Description
0 uint16_t header Packet header
2 uint16_tdimensions:2 Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
reserved:14 Must be 0.4 uint16_t workgroupSize.x x dimension of work-group (measured in work-items).6 uint16_t workgroupSize.y y dimension of work-group (measured in work-items).8 uint16_t workgroupSize.z z dimension of work-group (measured in work-items).10 uint16_t reserved2 Must be 0.12 uint32_t gridSize.x x dimension of grid (measured in work-items).16 uint32_t gridSize.y y dimension of grid (measured in work-items).20 uint32_t gridSize.z z dimension of grid (measured in work-items).24 uint32_t privateSegmentSizeBytes Total size in bytes of private memory allocation request (per work-item).28 uint32_t groupSegmentSizeBytes Total size in bytes of group memory allocation request (per work-group).
32 uint64_t kernelObjectAddress Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.
40 uint64_t kernargAddress Address of memory containing kernel arguments.48 uint64_t reserved3 Must be 0.56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT DISPATCH PACKETStart Offset
(Bytes)Format Field Name Description
0 uint16_t header Packet header
2 uint16_t type
The function to be performed by the destination Agent. The type value is split into the following ranges: 0x0000:0x3FFF – Vendor specific 0x4000:0x7FFF – HSA runtime 0x8000:0xFFFF – User registered function
4 uint32_t reserved2 Must be 0.8 uint64_t returnLocation Pointer to location to store the function return value in.16 uint64_t arg[0]
64-bit direct or indirect arguments.24 uint64_t arg[1]32 uint64_t arg[2]40 uint64_t arg[3]48 uint64_t reserved3 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
© Copyright 2014 HSA Foundation. All Rights Reserved
BARRIER PACKET
Used for specifying dependences between packets
HSA agent will not launch any further packets from this queue until the barrier packet signal conditions are met
Used for specifying dependences on packets dispatched from any queue. Execution phase completes only when all of the dependent signals (up to five) have
been signaled (with the value of 0). Or if an error has occurred in one of the packets upon which we have a dependence.
© Copyright 2014 HSA Foundation. All Rights Reserved
BARRIER PACKETStart Offset
(Bytes)Format Field Name Description
0 uint16_t header Packet header, see 2.8.1 Packet header (p. 16).
2 uint16_t reserved2 Must be 0.
4 uint32_t reserved3 Must be 0.
8 uint64_t depSignal0
Address of dependent signaling objects to be evaluated by the packet processor.
16 uint64_t depSignal1
24 uint64_t depSignal2
32 uint64_t depSignal3
40 uint64_t depSignal4
48 uint64_t reserved4 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
© Copyright 2014 HSA Foundation. All Rights Reserved
DEPENDENCES
A user may never assume more than one packet is being executed by an HSA agent at a time.
Implications: Packets can’t poll on shared memory values which will be set by packets issued from
other queues, unless the user has ensured the proper ordering. To ensure all previous packets from a queue have been completed, use the Barrier bit. To ensure specific packets from any queue have completed, use the Barrier packet.
HSA QUEUEING, PACKET EXECUTION
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET EXECUTION
Launch phase Initiated when launch conditions are met
All preceding packets in the queue must have exited launch phase If the barrier bit in the packet header is set, then all preceding packets in the queue must have
exited completion phase Includes memory acquire fence
Active phase Execute the packet Barrier packets remain in Active phase until conditions are met.
Completion phase First step is memory release fence – make results visible. completionSignal field is then signaled with a decrementing atomic.
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET EXECUTION – BARRIER BIT
Pkt1Launch
Pkt2Launch
Pkt1Execute
Pkt2Execute
Pkt1Complete
Pkt3Launch (barrier=1)
Pkt2Complete
Pkt3Execute
Time
Pkt3 launches whenall packets in the queue have completed.
© Copyright 2014 HSA Foundation. All Rights Reserved
PUTTING IT ALL TOGETHER (FFT)Packet 1
Packet 2
Packet 3
Packet 4
Packet 5
Packet 6Barrier Barrier
X[0]
X[1]
X[2]
X[3]
X[4]
X[5]
X[6]
X[7]
Time
© Copyright 2014 HSA Foundation. All Rights Reserved
PUTTING IT ALL TOGETHERAQL Pseudo Code
// Send the packets to do the first stage. aql_dispatch(pkt1);aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we// know packets 1 & 2 will be complete before 3 and 4 are // launched. aql_dispatch_with _barrier_bit(pkt3); aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5// & 6) aql_dispatch_with_barrier_bit(pkt5); aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete)aql_dispatch_with_barrier_bit(finish_pkt);
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET EXECUTION – BARRIER PACKET
Barrier T2Q2
T1Q1
Signal Xinit to 1
depSignal0
completionSignal
Time
Decrements signal X
BarrierLaunch
T1Launch
BarrierExecute
T1Execute
BarrierComplete
T1Complete
T2Launch
T2Execute
T2Complete
Barrier completes when signal X signalled with 0
T2 launches once barrier complete
© Copyright 2014 HSA Foundation. All Rights Reserved
DEPTH FIRST CHILD TASK EXECUTION
Consider two generations of child tasks Task T submits tasks T.1 & T.2 Task T.1 submits tasks T.1.1 & T.1.2 Task T.2 submits tasks T.2.1 & T.2.2
Desired outcome Depth first child task execution
I.e. T T1 T.1.1 T.1.2 T.2 T.2.1 T.2.2 T passed signal (allComplete) to decrement when all tasks are complete (T and its
children etc)
T
T.2.2T.1.2T.1.2T.1.1
T.1 T.2
© Copyright 2014 HSA Foundation. All Rights Reserved
HOW TO DO THIS WITH HSA QUEUES?
Use a separate user mode queue for each recursion level Task T submits to queue Q1 Tasks T.1 & T.2 submits tasks to queue Q2 Queues could be passed in as parameters to task T
Depth first requires ordering of T.1, T.2 and their children Use additional signal object (childrenComplete) to track completion of the children of
T.1 & T.2 childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2
© Copyright 2014 HSA Foundation. All Rights Reserved
A PICTURE SAYS MORE THAN 1000 WORDS
T
T.2.2T.1.2T.1.2T.1.1
T.1 T.2 T.1 Barrier T.2 BarrierQ1
Wait on childrenComplete
Signal allComplete
T.1.1 T.1.2 T.2.1 T.2.2Q2
SUMMARY
© Copyright 2014 HSA Foundation. All Rights Reserved
© Copyright 2014 HSA Foundation. All Rights Reserved
KEY HSA TECHNOLOGIES HSA combines several mechanisms to enable low overhead task
dispatch Shared Virtual Memory System Coherency Signaling AQL
User mode queues – from any compatible agent Architected packet format
Rich dependency mechanism Flexible and efficient signaling of completion
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved