Source: dspcad.umd.edu/papers/bhat2014x3.pdf
TRANSCRIPT
Vectorization and Mapping of Software Defined Radio Applications on GPU Platforms
GPU Summit at the UMD/NVIDIA CUDA Center for Excellence, College Park MD, October 27, 2014
Shuvra S. Bhattacharyya Professor, ECE and UMIACS
University of Maryland at College Park [email protected] , http://www.ece.umd.edu/~ssb
With Contributions from G. Zaki, W. Plishker, C. Clancy, and J. Kuykendall
2
Outline
• Introduction
• Contribution: Novel Vectorization and Mapping Workflow.
• Evaluation
• Summary
4
DSPCAD Methodologies: Computer-Aided Design (CAD) for Digital Signal Processing (DSP) Systems
Platforms: FPGA, programmable DSP, GPU, microcontroller
Applications and tasks [Bhattacharyya 2013]:
• Image: medical, computer vision, feature detection, etc.
• Video: coding, compression, etc. (e.g., color processing → prediction → transformation & quantization → entropy coding)
• Audio: sample rate conversion, speech, etc.
[Diagram: example processing chains —
imaging device → data preprocessing → image reconstruction → post reconstruction → advanced image analysis → image visualization;
audio device → data preprocessing → feature extraction → data postprocessing;
wireless communication systems: source encoding → channel encoding → digital modulation → D/A conversion → RF back-end]
5
Motivation
• Diversity of platforms:
– ASIC, FPGA, DSP, GPU, GPP
• Complex application environments
– GNU Radio
• Exposing parallelism
– Task, data, and pipeline
• Difficult mapping problem
• Multi-objective (throughput, latency)
6
Background: GNU Radio
• A software development framework that provides software defined radio (SDR) developers with a rich library and a customized runtime engine for designing and testing radio applications [Blossom 2004]
7
DSP-oriented Dataflow Models of Computation
• Application is modeled as a directed graph
– Nodes (actors) represent functions of arbitrary complexity
– Edges represent communication channels between functions
– Nodes produce and consume data from edges
– Edges buffer data (logically) in a FIFO (first-in, first-out) fashion
• Data-driven execution model
– An actor can execute whenever it has sufficient data on its input edges.
– The order in which actors execute is not part of the specification.
– The order is typically determined by the compiler, the hardware, or both.
• Iterative execution
– Body of loop to be iterated a large or infinite number of times
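The data-driven rule above can be made concrete with a small sketch (my own illustration, not part of any tool described here): a two-actor SDF edge where A produces 2 tokens per firing and B consumes 3, executed purely by checking token availability.

```python
def run_one_period(prod, cons, q):
    """Fire a two-actor SDF edge A -> B by the data-driven rule:
    B fires whenever the edge holds at least `cons` tokens; otherwise
    A fires. Stops after one minimal period (q[v] firings of each
    actor v). Returns the firing sequence."""
    tokens = 0                      # tokens buffered (FIFO) on the edge
    fired = {"A": 0, "B": 0}
    seq = []
    while fired["A"] < q["A"] or fired["B"] < q["B"]:
        if tokens >= cons and fired["B"] < q["B"]:
            tokens -= cons          # B consumes `cons` tokens
            fired["B"] += 1
            seq.append("B")
        else:
            tokens += prod          # A produces `prod` tokens
            fired["A"] += 1
            seq.append("A")
    return seq

# A produces 2, B consumes 3; one minimal period fires A 3 times, B twice.
print(run_one_period(2, 3, {"A": 3, "B": 2}))  # ['A', 'A', 'B', 'A', 'B']
```

Note that after one period the edge returns to 0 buffered tokens, which is exactly the "no net change" condition for sample rate consistency.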
8
DSP-oriented Dataflow Graphs
• Vertices (actors) represent computational modules
• Edges represent FIFO buffers
• Edges may have delays, implemented as initial tokens
• Tokens are produced and consumed on edges
• Different models have different rules for production (SDF → fixed, CSDF → periodic, BDF → dynamic)
[Figure: three-actor dataflow graph X → Y → Z with edges e1 and e2; edge e1 carries a delay of 5 tokens, and each edge ek is annotated with its production rate pk,i and consumption rate ck,i for the i-th firing]
9
Dataflow Production and Consumption Rates
[Figure: the same X → Y → Z graph, annotated with the production rate pk,i and consumption rate ck,i of each edge ek on the i-th firing]
10
Dataflow Graph Scheduling
• Assigning actors to processors, and ordering actor subsets that share common processors
• Here, a “processor” means a hardware resource for actor execution on which assigned actors are time-multiplexed
• Scheduling objectives include:
– Exploiting parallelism
– Buffer management
– Minimizing power/energy consumption
11
Background: Contemporary Architectures
• Graphics processing units (GPUs)
• Vector operations in general purpose processors (GPPs)
12
Primary Contribution
A novel workflow for scheduling SDF graphs while taking into account:
– Actor execution times
– Efficient vectorization
– Heterogeneous multiprocessors
Demonstration system:
– Applications described in a domain specific language
– Systematic integration of precompiled libraries
– Targeted to architectures consisting of GPPs and GPUs
13
Previous Work
• Automatic SIMDization [Hormati 2010]: based on the StreamIt compiler.
• Hierarchical models for SDR [Lin 2007]: targeted toward special architectures.
• Multiprocessor scheduling [Stuijk 2007]: formulated toward special objectives.
• Vectorization [Ritz 1992]: single-processor block processing optimization.
15
DIF-GR-GPU Workflow
• Start from a model-based application description.
• Use tools to optimize scheduling, assignment.
• Generate an accelerated system.
[Workflow diagram: the application graph plus throughput/latency constraints enter the Dataflow Scheduler, which emits a dataflow graph; the Multiprocessor Scheduler combines it with a platform description and actor profiles to produce a mapping and ordering schedule; the GNU Radio engine, drawing on GNU Radio's library of actor implementations, generates the final implementation]
16
Workflow Goals
• Adequately exploit all sources of parallelism in order to utilize the underlying architecture.
• Sources of parallelism:
– Data parallelism (production and consumption rates in SDF)
– Task parallelism (implicit in the dataflow graph)
– Pipeline parallelism (looped schedules)
[Figure: data parallelism illustrated by an edge A → B with production rate 100 and consumption rate 1; task parallelism illustrated by an actor A feeding independent actors B and C]
17
SDF Scheduling Preliminaries
• An SDF graph G = (V,E) has a valid (periodic) schedule if it is deadlock-free and is sample rate consistent (i.e., it has a periodic schedule that fires each actor at least once and produces no net change in the number of tokens on each edge).
• For each actor v in a consistent SDF graph, there is a unique repetition count q(v), which gives the number of times that v must be executed in a minimal valid schedule.
Example: A → B with production rate 2 and consumption rate 3
q(A) = 3, q(B) = 2
Some possible schedules: (1) AABAB (2) AAABB
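The repetition counts can be computed directly from the balance equations q(src) × prod = q(dst) × cons. A minimal sketch (my own helper, not part of the workflow tools), assuming a connected, consistent graph:

```python
from fractions import Fraction
from functools import reduce
from math import gcd, lcm

def repetition_counts(edges, root):
    """Solve the SDF balance equations q[src]*prod = q[dst]*cons by
    propagating rational firing rates outward from `root`, then scaling
    to the smallest positive integer vector. Assumes a connected graph.

    edges: list of (src, prod, dst, cons) tuples."""
    rate = {root: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
            elif src in rate and dst in rate and rate[src] * prod != rate[dst] * cons:
                raise ValueError("graph is not sample rate consistent")
    # Scale rational rates to the smallest positive integer vector.
    scale = lcm(*(r.denominator for r in rate.values()))
    q = {a: int(r * scale) for a, r in rate.items()}
    g = reduce(gcd, q.values())
    return {a: v // g for a, v in q.items()}

# The slide's example: A produces 2 tokens per firing, B consumes 3.
print(repetition_counts([("A", 2, "B", 3)], "A"))  # {'A': 3, 'B': 2}
```

The same helper handles chains, e.g. adding an edge B →(1,2)→ C yields q(A) = 3, q(B) = 2, q(C) = 1.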
18
DIF-GR-GPU: Dataflow Scheduler
• Objective: optimize exploitation of data and pipeline parallelism → higher throughput.
• Flat schedule: executes an SDF graph as a cascade of distinct loops with no inter-actor nesting of loops.
• Vectorization of a schedule S: a unique positive integer B, called the blocking factor of S, such that S invokes each actor v exactly B × q(v) times.
[Figure: an original SDF graph and its corresponding BPDAG for B = 10; vectorized firing counts (10, 10, 20, 20, 60) illustrate data parallelism, and the partitioning of the vectorized graph into stages illustrates pipeline parallelism]
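Given the repetition vector, a flat schedule with blocking factor B is one loop per actor in topological order, firing actor v a total of B × q(v) times. A minimal sketch (hypothetical helper, my own names):

```python
def flat_schedule(topo_order, q, B):
    """Return a flat (non-nested) vectorized schedule: one loop per
    actor, with loop counts scaled by the blocking factor B so that
    each actor v is invoked exactly B * q[v] times."""
    return [(v, B * q[v]) for v in topo_order]

# For A -> B with q(A) = 3, q(B) = 2 and blocking factor B = 10,
# the flat schedule fires A 30 times, then B 20 times.
print(flat_schedule(["A", "B"], {"A": 3, "B": 2}, 10))  # [('A', 30), ('B', 20)]
```

Larger loop counts hand each actor invocation more data, which favors GPU vectorization, at the cost of added latency per graph iteration.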
19
DIF-GR-GPU Workflow
• Start from a model-based application description
• Use tools to optimize scheduling, assignment.
• Generate an accelerated system.
20
Heterogeneous Multiprocessor Scheduler
• Objective: utilize the multiprocessors available in the platform.
• Architecture description: the platform is described by a set P of processors and a set B of all-to-all communication buses.
• Execution times depend on the blocking factor.
• Every processor is assumed to have access to a shared memory.
[Figure: task parallelism on the target platform — GPPs 0 through N paired with GPUs GPU0 through GPU N, all connected by a communication bus]
21
Scheduler Inputs
• Architecture description: set P of processors and a set B of communication buses.
• Application description: The application model (input BPDAG) consists of a set T of tasks, and a set E of edges.
• Task and edge profiles: These profiles are described by two functions:
- RTP(t ∈ T, p ∈ P) → R defines the execution time of task t on processor p,
- REB(e ∈ E, b ∈ B) → R defines the execution time of edge e on bus b.
• Dependency analysis: task t1 is said to be dependent on task t2 if there is a directed path that starts at t1 and ends at t2. If no such path exists between t1 and t2 in either direction, they are called parallel tasks. A similar concept applies to edges.
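The parallel-task test reduces to directed reachability in the task graph. A minimal sketch (my own code, names hypothetical):

```python
def parallel_tasks(edges, t1, t2):
    """Return True if t1 and t2 are parallel tasks, i.e., no directed
    path connects them in either direction; only such pairs need an
    explicit ordering decision when mapped to the same processor."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)

    def reachable(src, dst):
        # Iterative depth-first search from src.
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(succ.get(n, ()))
        return False

    return not reachable(t1, t2) and not reachable(t2, t1)

# Diamond graph: A -> B, A -> C, B -> D, C -> D.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(parallel_tasks(edges, "B", "C"))  # True  (no path between B and C)
print(parallel_tasks(edges, "A", "D"))  # False (A reaches D)
```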
22
Multiprocessor Scheduler
• The basic scheduler functionality is to:
– Map every task to a given processor.
– Order the execution of parallel actors assigned to the same processor.
– "Zero" out the communication cost of collocated dependent actors.
• The scheduler objective: minimize the latency L_B of B graph iterations.
23
MLP Formulation
• Why?
– Offline analysis of SDF graphs.
– Coarse-grain nature of SDF graphs.
– The solver gives a bound on the distance from the optimal solution.
• Basic variables:
– Mapping: XT[t, p] = 1 if task t is assigned to processor p; XT[t, p] = 0 otherwise.
– Ordering: for all parallel tasks t1, t2 that are assigned to the same processor, YT[t1, t2] = 1 if t1 is ordered before t2; YT[t1, t2] = 0 otherwise.
– Running time: RT[t] = actual (platform-dependent) execution time of task t, depending on its mapping.
– Start time: ST[t] = start time for execution of task t.
24
MLP formulation (continued)
• Constraints:
– Assignment
– Dataflow dependency
– Zero-cost communication
• Objective: Minimize M
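The constraint formulas were images on the slide and did not survive transcription. A plausible reconstruction from the variables defined on the preceding slide is sketched below; the communication-time symbol CT[e] and the exact form of each inequality are my assumptions, not the authors' published formulation:

```latex
% Assignment: every task is mapped to exactly one processor.
\sum_{p \in P} XT[t, p] = 1 \qquad \forall\, t \in T
% Dataflow dependency: a task may start only after each predecessor
% and the communication on the connecting edge have completed.
ST[t_2] \ge ST[t_1] + RT[t_1] + CT[e] \qquad \forall\, e = (t_1, t_2) \in E
% Zero-cost communication: CT[e] = 0 whenever t_1 and t_2 are
% assigned to the same processor (collocated dependent tasks).
% Objective: minimize the makespan M, where
M \ge ST[t] + RT[t] \qquad \forall\, t \in T
```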
26
DIF-GR-GPU Workflow
• Start from a model-based application description using the dataflow interchange format (DIF)
• Use tools to optimize scheduling, assignment.
• Generate an accelerated system.
27
GPU Interface: GRGPU [Plishker 2011]
[Toolflow diagram: a GPU kernel written in CUDA (.cu) passes through the CUDA synthesizer (nvcc) to produce a C++ GNU Radio block (.cc) that invokes the GPU kernel via device_work(); libtool links the block against the CUDA libraries (libcudart, libcutil) into a standalone Python package]
Example flowgraph: source → H2D → FIR (GPU accelerated) → D2H → sink
28
MP-Sched Benchmark (GNU Radio)
[Figure: MP-Sched benchmark graph — a SRC actor feeds a grid of FIR filters arranged by number of pipelines (rows) and number of stages (columns), merging into a SNK actor]
29
Realization of a 2x5 MP-Sched Graph
Application: 2×5 MP-Sched graph
Platform: 1 GPP (Intel Xeon CPU, 3 GHz), 1 GPU (NVIDIA GTX 260), connected by a PCI bus
Blocking factor B: 2048

Amount of improvement over:  All GPP  |  All GPU
Analytical                      55%   |    19%
Empirical                       39%   |    21%
30
Latency vs. Throughput Trade-offs
[Figure: latency per iteration (microseconds, primary axis) and overall latency of B iterations (milliseconds, secondary axis) versus blocking factor B from 1K to 64K; each point is an optimized assignment for that blocking factor]
31
MLP Solver Running Time
• Problem written in MathProg.
• Solved using the IBM ILOG CPLEX optimizer on an Intel Core 2 Duo processor at 3 GHz.
33
Summary
• Diversity of platforms:
– ASIC, FPGA, DSP, GPU, GPP
• Complex application environments
– GNU Radio
• Exposing parallelism
– Task, data, and pipeline
• Difficult mapping problem
• Multi-objective (throughput, latency)
34
To Probe Further …
• G. Zaki, W. Plishker, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Vectorization and mapping of software defined radio applications on heterogeneous multi-processor platforms. In Proceedings of the IEEE Workshop on Signal Processing Systems, pages 31-36, Beirut, Lebanon, October 2011.
• G. Zaki, W. Plishker, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Integration of dataflow-based heterogeneous multiprocessor scheduling techniques in GNU radio. Journal of Signal Processing Systems, 70(2):177-191, February 2013. DOI:10.1007/s11265-012-0696-0.
35
To Probe Even Further
• Foreword by S. Y. Kung
• Part 1: Applications
• Part 2: Architectures
• Part 3: Programming and Simulation Tools
• Part 4: Design Methods
First edition, 2010; second edition, 2013
36
Acknowledgements
• This research was supported in part by the Laboratory for Telecommunications Sciences.
• For more details on this project, other projects in the Maryland DSPCAD Research Group, and associated publications: http://www.ece.umd.edu/DSPCAD/home/dspcad.htm
37
References 1
• [Bhattacharyya 2013] S. S. Bhattacharyya, E. Deprettere, R. Leupers, and J. Takala, editors. Handbook of Signal Processing Systems. Springer, second edition, 2013. ISBN: 978-1-4614-6858-5 (Print); 978-1-4614-6859-2 (Online).
• [Plishker 2011] W. Plishker, G. Zaki, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Applying graphics processor acceleration in a software defined radio prototyping environment. In Proceedings of the International Symposium on Rapid System Prototyping, pages 67-73, Karlsruhe, Germany, May 2011.
• [Hormati 2010] A. H. Hormati, Y. Choi, M. Woh, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. MacroSS: macro-SIMDization of streaming applications. In Symposium on Architectural Support for Programming Languages and Operating Systems, pages 285-296, 2010.
• [Lin 2007] Y. Lin, M. Kudlur, S. Mahlke, and T. Mudge. Hierarchical coarse-grained stream compilation for software defined radio. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems, pages 115-124, 2007.
38
References 2
• [Stuijk 2007] S. Stuijk, T. Basten, M. C. W. Geilen, and H. Corporaal. Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In Proceedings of the Design Automation Conference, 2007.
• [Blossom 2004] E. Blossom. GNU radio: tools for exploring the radio frequency spectrum. Linux Journal, June 2004.
• [Ritz 1992] S. Ritz, M. Pankert, and H. Meyr. High level software synthesis for signal processing systems. In Proceedings of the International Conference on Application Specific Array Processors, August 1992.