Source: dspcad.umd.edu/papers/bhat2014x3.pdf
TRANSCRIPT
Vectorization and Mapping of Software Defined Radio Applications on GPU Platforms
GPU Summit at the UMD/NVIDIA CUDA Center for Excellence, College Park MD, October 27, 2014
Shuvra S. Bhattacharyya Professor, ECE and UMIACS
University of Maryland at College Park [email protected] , http://www.ece.umd.edu/~ssb
With Contributions from G. Zaki, W. Plishker, C. Clancy, and J. Kuykendall
2
Outline
• Introduction
• Contribution: Novel Vectorization and Mapping Workflow.
• Evaluation
• Summary
4
DSPCAD Methodologies: Computer-Aided Design (CAD) for Digital Signal Processing (DSP) Systems
Platforms: FPGA, programmable DSP, GPU, microcontroller
Applications and tasks [Bhattacharyya 2013]:
• Image: medical, computer vision, feature detection, etc.
• Video: coding, compression, etc. (e.g., color processing → prediction → transformation & quantization → entropy coding)
• Audio: sample rate conversion, speech, etc.
[Diagram: example processing chains —
imaging device → data preprocessing → image reconstruction → post reconstruction → advanced image analysis → image visualization;
audio device → data preprocessing → feature extraction → data postprocessing;
wireless communication systems: source encoding → channel encoding → digital modulation → D/A conversion → RF back-end]
5
Motivation
• Diversity of platforms:
– ASIC, FPGA, DSP, GPU, GPP
• Complex application environments
– GNU Radio
• Exposing parallelism
– Task, data, and pipeline
• Difficult mapping problem
• Multi-objective (throughput, latency)
6
Background: GNU Radio
• A software development framework that provides software defined radio (SDR) developers with a rich library and a customized runtime engine for designing and testing radio applications [Blossom 2004]
7
DSP-oriented Dataflow Models of Computation
• Application is modeled as a directed graph
– Nodes (actors) represent functions of arbitrary complexity
– Edges represent communication channels between functions
– Nodes produce and consume data from edges
– Edges buffer data (logically) in a FIFO (first-in, first-out) fashion
• Data-driven execution model
– An actor can execute whenever it has sufficient data on its input edges.
– The order in which actors execute is not part of the specification.
– The order is typically determined by the compiler, the hardware, or both.
• Iterative execution
– Body of loop to be iterated a large or infinite number of times
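The data-driven rule above can be made concrete with a small sketch (my own illustration, not part of any tool described here): a two-actor SDF edge where A produces 2 tokens per firing and B consumes 3, executed purely by checking token availability.

```python
def run_one_period(prod, cons, q):
    """Fire a two-actor SDF edge A -> B by the data-driven rule:
    B fires whenever the edge holds at least `cons` tokens; otherwise
    A fires. Stops after one minimal period (q[v] firings of each
    actor v). Returns the firing sequence."""
    tokens = 0                      # tokens buffered (FIFO) on the edge
    fired = {"A": 0, "B": 0}
    seq = []
    while fired["A"] < q["A"] or fired["B"] < q["B"]:
        if tokens >= cons and fired["B"] < q["B"]:
            tokens -= cons          # B consumes `cons` tokens
            fired["B"] += 1
            seq.append("B")
        else:
            tokens += prod          # A produces `prod` tokens
            fired["A"] += 1
            seq.append("A")
    return seq

# A produces 2, B consumes 3; one minimal period fires A 3 times, B twice.
print(run_one_period(2, 3, {"A": 3, "B": 2}))  # ['A', 'A', 'B', 'A', 'B']
```

Note that after one period the edge returns to 0 buffered tokens, which is exactly the "no net change" condition for sample rate consistency.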
8
DSP-oriented Dataflow Graphs
• Vertices (actors) represent computational modules
• Edges represent FIFO buffers
• Edges may have delays, implemented as initial tokens
• Tokens are produced and consumed on edges
• Different models have different rules for production (SDF → fixed, CSDF → periodic, BDF → dynamic)
[Figure: three-actor dataflow graph X → Y → Z with edges e1 and e2; edge e1 carries a delay of 5 tokens, and each edge ek is annotated with its production rate pk,i and consumption rate ck,i for the i-th firing]
9
Dataflow Production and Consumption Rates
[Figure: the same X → Y → Z graph, annotated with the production rate pk,i and consumption rate ck,i of each edge ek on the i-th firing]
10
Dataflow Graph Scheduling
• Assigning actors to processors, and ordering actor subsets that share common processors
• Here, a “processor” means a hardware resource for actor execution on which assigned actors are time-multiplexed
• Scheduling objectives include:
– Exploiting parallelism
– Buffer management
– Minimizing power/energy consumption
11
Background: Contemporary Architectures
• Graphics processing units (GPUs)
• Vector operations in general purpose processors (GPPs)
12
Primary Contribution
A novel workflow for scheduling SDF graphs while taking into account:
– Actor execution times
– Efficient vectorization
– Heterogeneous multiprocessors
Demonstration system:
– Applications described in a domain specific language
– Systematic integration of precompiled libraries
– Targeted to architectures consisting of GPPs and GPUs
13
Previous Work
• Automatic SIMDization [Hormati 2010]: based on the StreamIt compiler.
• Hierarchical models for SDR [Lin 2007]: targeted toward special architectures.
• Multiprocessor scheduling [Stuijk 2007]: formulated toward special objectives.
• Vectorization [Ritz 1992]: single-processor block processing optimization.
15
DIF-GR-GPU Workflow
• Start from a model-based application description.
• Use tools to optimize scheduling, assignment.
• Generate an accelerated system.
[Workflow diagram: the application graph plus throughput/latency constraints enter the Dataflow Scheduler, which emits a dataflow graph; the Multiprocessor Scheduler combines it with a platform description and actor profiles to produce a mapping and ordering schedule; the GNU Radio engine, drawing on GNU Radio's library of actor implementations, generates the final implementation]
16
Workflow Goals
• Adequately exploit all sources of parallelism in order to utilize the underlying architecture.
• Sources of parallelism:
– Data parallelism (production and consumption rates in SDF)
– Task parallelism (implicit in the dataflow graph)
– Pipeline parallelism (looped schedules)
[Figure: data parallelism illustrated by an edge A → B with production rate 100 and consumption rate 1; task parallelism illustrated by an actor A feeding independent actors B and C]
17
SDF Scheduling Preliminaries
• An SDF graph G = (V,E) has a valid (periodic) schedule if it is deadlock-free and is sample rate consistent (i.e., it has a periodic schedule that fires each actor at least once and produces no net change in the number of tokens on each edge).
• For each actor v in a consistent SDF graph, there is a unique repetition count q(v), which gives the number of times that v must be executed in a minimal valid schedule.
Example: A → B with production rate 2 and consumption rate 3
q(A) = 3, q(B) = 2
Some possible schedules: (1) AABAB (2) AAABB
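The repetition counts can be computed directly from the balance equations q(src) × prod = q(dst) × cons. A minimal sketch (my own helper, not part of the workflow tools), assuming a connected, consistent graph:

```python
from fractions import Fraction
from functools import reduce
from math import gcd, lcm

def repetition_counts(edges, root):
    """Solve the SDF balance equations q[src]*prod = q[dst]*cons by
    propagating rational firing rates outward from `root`, then scaling
    to the smallest positive integer vector. Assumes a connected graph.

    edges: list of (src, prod, dst, cons) tuples."""
    rate = {root: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
            elif src in rate and dst in rate and rate[src] * prod != rate[dst] * cons:
                raise ValueError("graph is not sample rate consistent")
    # Scale rational rates to the smallest positive integer vector.
    scale = lcm(*(r.denominator for r in rate.values()))
    q = {a: int(r * scale) for a, r in rate.items()}
    g = reduce(gcd, q.values())
    return {a: v // g for a, v in q.items()}

# The slide's example: A produces 2 tokens per firing, B consumes 3.
print(repetition_counts([("A", 2, "B", 3)], "A"))  # {'A': 3, 'B': 2}
```

The same helper handles chains, e.g. adding an edge B →(1,2)→ C yields q(A) = 3, q(B) = 2, q(C) = 1.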
18
DIF-GR-GPU: Dataflow Scheduler
• Objective: optimize exploitation of data and pipeline parallelism → higher throughput.
• Flat schedule: executes an SDF graph as a cascade of distinct loops with no inter-actor nesting of loops.
• Vectorization of a schedule S: a unique positive integer B, called the blocking factor of S, such that S invokes each actor v exactly B × q(v) times.
[Figure: an original SDF graph and its corresponding BPDAG for B = 10; vectorized firing counts (10, 10, 20, 20, 60) illustrate data parallelism, and the partitioning of the vectorized graph into stages illustrates pipeline parallelism]
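Given the repetition vector, a flat schedule with blocking factor B is one loop per actor in topological order, firing actor v a total of B × q(v) times. A minimal sketch (hypothetical helper, my own names):

```python
def flat_schedule(topo_order, q, B):
    """Return a flat (non-nested) vectorized schedule: one loop per
    actor, with loop counts scaled by the blocking factor B so that
    each actor v is invoked exactly B * q[v] times."""
    return [(v, B * q[v]) for v in topo_order]

# For A -> B with q(A) = 3, q(B) = 2 and blocking factor B = 10,
# the flat schedule fires A 30 times, then B 20 times.
print(flat_schedule(["A", "B"], {"A": 3, "B": 2}, 10))  # [('A', 30), ('B', 20)]
```

Larger loop counts hand each actor invocation more data, which favors GPU vectorization, at the cost of added latency per graph iteration.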
19
DIF-GR-GPU Workflow
• Start from a model-based application description
• Use tools to optimize scheduling, assignment.
• Generate an accelerated system.
20
Heterogeneous Multiprocessor Scheduler
• Objective: utilize the multiprocessors available in the platform.
• Architecture description: the platform is described by a set P of processors and a set B of all-to-all communication buses.
• Execution times depend on the blocking factor.
• Every processor is assumed to have access to a shared memory.
[Figure: task parallelism on the target platform — GPPs 0 through N paired with GPUs GPU0 through GPU N, all connected by a communication bus]
21
Scheduler Inputs
• Architecture description: set P of processors and a set B of communication buses.
• Application description: The application model (input BPDAG) consists of a set T of tasks, and a set E of edges.
• Task and edge profiles: These profiles are described by two functions:
- RTP(t ∈ T, p ∈ P) → R defines the execution time of task t on processor p,
- REB(e ∈ E, b ∈ B) → R defines the execution time of edge e on bus b.
• Dependency analysis: task t1 is said to be dependent on task t2 if there is a directed path that starts at t1 and ends at t2. If no such path exists between t1 and t2 in either direction, they are called parallel tasks. A similar concept applies to edges.
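The parallel-task test reduces to directed reachability in the task graph. A minimal sketch (my own code, names hypothetical):

```python
def parallel_tasks(edges, t1, t2):
    """Return True if t1 and t2 are parallel tasks, i.e., no directed
    path connects them in either direction; only such pairs need an
    explicit ordering decision when mapped to the same processor."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)

    def reachable(src, dst):
        # Iterative depth-first search from src.
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(succ.get(n, ()))
        return False

    return not reachable(t1, t2) and not reachable(t2, t1)

# Diamond graph: A -> B, A -> C, B -> D, C -> D.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(parallel_tasks(edges, "B", "C"))  # True  (no path between B and C)
print(parallel_tasks(edges, "A", "D"))  # False (A reaches D)
```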
22
Multiprocessor Scheduler
• The basic scheduler functionality is to:
– Map every task to a given processor.
– Order the execution of parallel actors assigned to the same processor.
– "Zero" out the communication cost of collocated dependent actors.
• The scheduler objective: minimize the latency L_B of B graph iterations.
23
MLP Formulation
• Why?
– Offline analysis of SDF graphs.
– Coarse-grain nature of SDF graphs.
– The solver gives a bound on the distance from the optimal solution.
• Basic variables:
– Mapping: XT[t, p] = 1 if task t is assigned to processor p; XT[t, p] = 0 otherwise.
– Ordering: for all parallel tasks t1, t2 that are assigned to the same processor, YT[t1, t2] = 1 if t1 is ordered before t2; YT[t1, t2] = 0 otherwise.
– Running time: RT[t] = actual (platform-dependent) execution time of task t, depending on its mapping.
– Start time: ST[t] = start time for execution of task t.
24
MLP formulation (continued)
• Constraints:
– Assignment
– Dataflow dependency
– Zero-cost communication
• Objective: Minimize M
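The constraint formulas were images on the slide and did not survive transcription. A plausible reconstruction from the variables defined on the preceding slide is sketched below; the communication-time symbol CT[e] and the exact form of each inequality are my assumptions, not the authors' published formulation:

```latex
% Assignment: every task is mapped to exactly one processor.
\sum_{p \in P} XT[t, p] = 1 \qquad \forall\, t \in T
% Dataflow dependency: a task may start only after each predecessor
% and the communication on the connecting edge have completed.
ST[t_2] \ge ST[t_1] + RT[t_1] + CT[e] \qquad \forall\, e = (t_1, t_2) \in E
% Zero-cost communication: CT[e] = 0 whenever t_1 and t_2 are
% assigned to the same processor (collocated dependent tasks).
% Objective: minimize the makespan M, where
M \ge ST[t] + RT[t] \qquad \forall\, t \in T
```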
26
DIF-GR-GPU Workflow
• Start from a model-based application description using the dataflow interchange format (DIF)
• Use tools to optimize scheduling, assignment.
• Generate an accelerated system.
27
GPU Interface: GRGPU [Plishker 2011]
[Toolflow diagram: a GPU kernel written in CUDA (.cu) passes through the CUDA synthesizer (nvcc) to produce a C++ GNU Radio block (.cc) that invokes the GPU kernel via device_work(); libtool links the block against the CUDA libraries (libcudart, libcutil) into a standalone Python package]
Example flowgraph: source → H2D → FIR (GPU accelerated) → D2H → sink
28
MP-Sched Benchmark (GNU Radio)
[Figure: MP-Sched benchmark graph — a SRC actor feeds a grid of FIR filters arranged by number of pipelines (rows) and number of stages (columns), merging into a SNK actor]
29
Realization of a 2x5 MP-Sched Graph
Application: 2×5 MP-Sched graph
Platform: 1 GPP (Intel Xeon CPU, 3 GHz), 1 GPU (NVIDIA GTX 260), connected by a PCI bus
Blocking factor B: 2048

Amount of improvement over:  All GPP  |  All GPU
Analytical                      55%   |    19%
Empirical                       39%   |    21%
30
Latency vs. Throughput Trade-offs
[Figure: latency per iteration (microseconds, primary axis) and overall latency of B iterations (milliseconds, secondary axis) versus blocking factor B from 1K to 64K; each point is an optimized assignment for that blocking factor]
31
MLP Solver Running Time
• Problem written in MathProg.
• Solved using the IBM ILOG CPLEX optimizer on an Intel Core 2 Duo processor at 3 GHz.
33
Summary
• Diversity of platforms:
– ASIC, FPGA, DSP, GPU, GPP
• Complex application environments
– GNU Radio
• Exposing parallelism
– Task, data, and pipeline
• Difficult mapping problem
• Multi-objective (throughput, latency)
34
To Probe Further …
• G. Zaki, W. Plishker, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Vectorization and mapping of software defined radio applications on heterogeneous multi-processor platforms. In Proceedings of the IEEE Workshop on Signal Processing Systems, pages 31-36, Beirut, Lebanon, October 2011.
• G. Zaki, W. Plishker, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Integration of dataflow-based heterogeneous multiprocessor scheduling techniques in GNU radio. Journal of Signal Processing Systems, 70(2):177-191, February 2013. DOI:10.1007/s11265-012-0696-0.
35
To Probe Even Further
• Foreword by S. Y. Kung
• Part 1: Applications
• Part 2: Architectures
• Part 3: Programming and Simulation Tools
• Part 4: Design Methods
First edition, 2010; second edition, 2013
36
Acknowledgements
• This research was supported in part by the Laboratory for Telecommunications Sciences.
• For more details on this project, other projects in the Maryland DSPCAD Research Group, and associated publications: http://www.ece.umd.edu/DSPCAD/home/dspcad.htm
37
References 1
• [Bhattacharyya 2013] S. S. Bhattacharyya, E. Deprettere, R. Leupers, and J. Takala, editors. Handbook of Signal Processing Systems. Springer, second edition, 2013. ISBN: 978-1-4614-6858-5 (Print); 978-1-4614-6859-2 (Online).
• [Plishker 2011] W. Plishker, G. Zaki, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Applying graphics processor acceleration in a software defined radio prototyping environment. In Proceedings of the International Symposium on Rapid System Prototyping, pages 67-73, Karlsruhe, Germany, May 2011.
• [Hormati 2010] A. H. Hormati, Y. Choi, M. Woh, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. MacroSS: macro-SIMDization of streaming applications. In Symposium on Architectural Support for Programming Languages and Operating Systems, pages 285-296, 2010.
• [Lin 2007] Y. Lin, M. Kudlur, S. Mahlke, and T. Mudge. Hierarchical coarse-grained stream compilation for software defined radio. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems, pages 115-124, 2007.
38
References 2
• [Stuijk 2007] S. Stuijk, T. Basten, M. C. W. Geilen, and H. Corporaal. Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In Proceedings of the Design Automation Conference, 2007.
• [Blossom 2004] E. Blossom. GNU radio: tools for exploring the radio frequency spectrum. Linux Journal, June 2004.
• [Ritz 1992] S. Ritz, M. Pankert, and H. Meyr. High level software synthesis for signal processing systems. In Proceedings of the International Conference on Application Specific Array Processors, August 1992.