ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at Even Lower Power – HSA

Sensor Integration and Improved User Experiences at Lower Power. Phil Rogers, President, HSA Foundation; Corporate Fellow, AMD

Upload: hsa-foundation

Post on 19-Jun-2015


DESCRIPTION

HSA is a new computing platform architecture being standardized by the HSA Foundation, whose founding members are AMD, ARM, Imagination, TI, MediaTek, Samsung and Qualcomm. HSA is intended to make heterogeneous programming widespread by making purpose-built architectures as easy to program as modern CPUs. We start with the GPU, the most widely deployed companion processor to the CPU and one that especially complements the CPU in low-power and performance workloads. This requires some hardware architecture changes that we have been working on for some time (in particular those that enable user mode scheduling, a unified address space, unified shared memory, compute context switching, etc.) and that we have encapsulated into the spec currently under review by the HSA Foundation. In short, HSA codifies the hardware architecture changes needed to enable mainstream programmers to develop heterogeneous applications with the same facility as CPU-only applications, by seamlessly integrating the sequential programming capability of the CPU with the parallel compute capability of the GPU. We describe the software stacks needed for HSA and the benefits that accrue to both developers and end users, and present our vision of how HSA will help unify the ecosystems of the smartphone and tablet platforms and bring them closer to that of the traditional PC market. We analyze several examples that arise in applications and present data to validate the performance-per-watt benefit of HSA.

TRANSCRIPT

Page 1: ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at Even Lower Power – HSA

Sensor Integration and Improved User Experiences at Lower Power

Phil Rogers

President, HSA Foundation

Corporate Fellow, AMD

Page 2

SENSOR INTEGRATION CHALLENGES

Sensors on all new devices, going up in resolution and data rates

Extract meaning from video, audio and location

Mix of scalar and parallel workloads

Process locally or in the cloud

Power must keep going down

© Copyright 2012 HSA Foundation. All Rights Reserved. 2

Page 3

ABUNDANT PARALLEL WORKLOADS

Biometric Recognition

Secure, fast, accurate: face, voice, fingerprints

Beyond HD Experiences

Streaming media, new codecs, 3D, transcode, audio

Augmented Reality

Superimpose graphics, audio, and other digital information as a virtual overlay

AV Content Management

Searching, indexing and tagging of video & audio; multimedia data mining

Natural UI & Gestures

Touch, gesture, and voice

Content Everywhere

Content from any source to any display seamlessly

Page 4

A NEW ERA OF PROCESSOR PERFORMANCE

Single-Core Era

[Chart: Single-thread Performance vs. Time, marked "we are here"]

Enabled by: Moore’s Law, Voltage Scaling

Constrained by: Power, Complexity

Multi-Core Era

[Chart: Throughput Performance vs. Time (# of processors), marked "we are here"]

Enabled by: Moore’s Law, SMP architecture

Constrained by: Power, Parallel SW, Scalability

Heterogeneous Systems Era

[Chart: Modern Application Performance vs. Time (Data-parallel exploitation), marked "we are here"]

Enabled by: Abundant data parallelism, Power efficient GPUs

Temporarily Constrained by: Programming models, Comm. overhead

Assembly C/C++ Java … pthreads OpenMP TBB Shader CUDA OpenCL

C++ and Java

Page 5

EVOLUTION OF HETEROGENEOUS COMPUTING

[Chart: Architecture Maturity & Programmer Accessibility (Poor to Excellent) across three eras]

Proprietary Drivers Era (2002 - 2008)

Graphics & Proprietary Driver-based APIs

“Adventurous” programmers

Exploit early programmable “shader cores” in the GPU

Make your program look like “graphics” to the GPU

CUDA™, Brook+, etc

Standards Drivers Era (2009 - 2011)

OpenCL™, DirectCompute Driver-based APIs

Expert programmers

C and C++ subsets

Compute centric APIs, data types

Multiple address spaces with explicit data movement

Specialized work queue based structures

Kernel mode dispatch

Architected Era (2012 - 2020)

Heterogeneous System Architecture

GPU Peer Processor

Mainstream programmers

Full C++

GPU as a co-processor

Unified coherent address space

Task parallel runtimes

Nested Data Parallel programs

User mode dispatch

Pre-emption and context switching

Page 6

HSA FEATURE ROADMAP

System Integration

GPU compute context switch

GPU graphics pre-emption

Quality of Service

Architectural Integration

Unified Address Space for CPU and GPU

Fully coherent memory between CPU & GPU

GPU uses pageable system memory via CPU pointers

Optimized Platforms

Bi-Directional Power Mgmt between CPU and GPU

GPU Compute C++ support

User mode scheduling

Physical Integration

Integrate CPU & GPU in silicon

Unified Memory Controller

Common Manufacturing Technology

Page 7

HETEROGENEOUS SYSTEM ARCHITECTURE – AN OPEN PLATFORM

Open Architecture, published specifications

HSAIL virtual ISA

HSA memory model

HSA system architecture

ISA agnostic for both CPU and GPU

HSA Foundation formed in June 2012

Inviting partners to join us, in all areas

Hardware companies

Operating Systems

Tools and Middleware

Applications

Page 8

HSA INTERMEDIATE LAYER - HSAIL

HSAIL is a virtual ISA for parallel programs

Finalized to ISA by a JIT compiler or “Finalizer”

ISA independent by design

Explicitly parallel

Designed for data parallel programming

Support for exceptions, virtual functions, and other high level language features

Syscall methods

GPU code can call directly to system services, IO, printf, etc

Debugging support

Page 9

HSA MEMORY MODEL

Designed to be compatible with C++11, Java and .NET Memory Models

Relaxed consistency memory model for parallel compute performance

Loads and stores can be re-ordered by the finalizer

Visibility controlled by:

Load.Acquire

Store.Release

Barriers

Page 10

THE HSA FOUNDATION

Promoters

Founders

Contributors

Supporters

Associates

Page 11

HSA SOFTWARE STACKS

APPLICATIONS, SYSTEM SOFTWARE AND PROGRAMMING MODELS

Page 12

[Diagram: two software stacks side by side, both running on Hardware - APUs, CPUs, GPUs]

Driver Stack

Apps

Domain Libraries

OpenCL™ 1.x, DX Runtimes, User Mode Drivers

Graphics Kernel Mode Driver

HSA Software Stack

Apps

HSA Domain Libraries

Task Queuing Libraries

HSA Runtime

HSA JIT

HSA Kernel Mode Driver

Legend: User mode component, Kernel mode component, Components contributed by third parties

Page 13

HSA SOLUTION STACK

[Diagram: layered solution stack]

Layers: Application SW, Drivers, Differentiated HW

Components: Application; Domain Specific Libs (Bolt, OpenCV™, … many others); HSA Runtime; OpenCL™ Runtime; DirectX Runtime; Other Runtime; HSAIL; HSA Finalizer; GPU ISA; Legacy Drivers; HSA Software; Knl Driver; Ctl

Hardware: CPU(s), GPU(s), Other Accelerators

Page 14

OPENCL™ AND HSA

HSA is an optimized platform architecture for OpenCL™

Not an alternative to OpenCL™

OpenCL™ on HSA will benefit from

Avoidance of wasteful copies

Low latency dispatch

Improved memory model

Pointers shared between CPU and GPU

HSA also exposes a lower level programming interface, for those that want the ultimate in control and performance

Can be exposed through OpenCL extensions

Page 15

MAKE GPUS EASIER TO PROGRAM: PRIMARY PROGRAMMING MODELS

• Microsoft C++AMP

• Address large population of Windows developers

• Integrated in Visual Studio and WinRT

• Microsoft Community Promise License for open platform use

• Java acceleration

• Aparapi on OpenCL today

• Project Sumatra to add HSA support in an OpenJDK for Java 8

• Driving to have Sumatra absorbed into Java 9

Page 16

AMD’S OPEN SOURCE COMMITMENT TO HSA

We will open source our Linux execution and compilation stack

Jump start the ecosystem

Allow a single shared implementation where appropriate

Enable university research in all areas

Component Name         AMD Specific   Rationale
HSA Bolt Library       No             Enable understanding and debug
HSAIL Code Generator   No             Enable research
LLVM Contributions     No             Industry and academic collaboration
HSA Assembler          No             Enable understanding and debug
HSA Runtime            No             Standardize on a single runtime
HSA Finalizer          Yes            Enable research and debug
HSA Kernel Driver      Yes            For inclusion in Linux distros

Page 17

ACCELERATED WORKLOADS

CLIENT AND SERVER EXAMPLES

Page 18

HAAR FACE DETECTION

CORNERSTONE TECHNOLOGY FOR COMPUTER VISION

Page 19

LOOKING FOR FACES ONE PLACE AT A TIME

Quick HD Calculations

Search square = 21 x 21

Pixels = 1920 x 1080 = 2,073,600

Search squares = 1900 x 1060 = ~2 Million
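The arithmetic on this slide can be checked with a short sketch. It assumes the 21 x 21 search square is slid one pixel at a time over the frame, which is what the 1900 x 1060 figure implies; the function name `count_search_squares` is made up for illustration.

```python
def count_search_squares(width, height, window=21):
    """Count the positions a window x window square can occupy in a frame.

    Assumes a stride of one pixel in both directions (an assumption that
    matches the slide's 1900 x 1060 figure).
    """
    if width < window or height < window:
        return 0
    return (width - window + 1) * (height - window + 1)


hd_pixels = 1920 * 1080
hd_squares = count_search_squares(1920, 1080)

print(hd_pixels)   # 2073600
print(hd_squares)  # 2014000, i.e. roughly 2 million
```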

Page 20

LOOKING FOR DIFFERENT SIZE FACES – BY SCALING THE VIDEO FRAME

More HD Calculations

70% scaling in H and V

Total Pixels = 4.07 Million

Search squares = 3.8 Million
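A rough way to reproduce these totals is to sum pixels and search squares over every level of the scaled image pyramid. The sketch below assumes each level is 70% of the previous one in both directions, that dimensions are truncated to whole pixels, and that scaling stops once a 21 x 21 square no longer fits; the slide does not state its rounding or stopping rule, so the results only approximate its figures. The function name `pyramid_totals` is hypothetical.

```python
def pyramid_totals(width, height, scale_num=7, scale_den=10, window=21):
    """Sum pixels and search squares over a pyramid of scaled frames.

    Level k has dimensions width * (7/10)**k by height * (7/10)**k,
    truncated to integers. Integer arithmetic avoids float rounding.
    """
    total_pixels = 0
    total_squares = 0
    level = 0
    while True:
        w = width * scale_num ** level // scale_den ** level
        h = height * scale_num ** level // scale_den ** level
        if w < window or h < window:
            break  # the search square no longer fits
        total_pixels += w * h
        total_squares += (w - window + 1) * (h - window + 1)
        level += 1
    return total_pixels, total_squares


pixels, squares = pyramid_totals(1920, 1080)
print(f"{pixels / 1e6:.2f} million pixels, {squares / 1e6:.2f} million search squares")
```

Under these assumptions the totals come out close to the slide's 4.07 million pixels and 3.8 million search squares; the small differences come from the rounding choices.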

Page 21

HAAR CASCADE STAGES

[Diagram: Stage N and Stage N+1, each evaluating a group of features (Feature k, l, m, p, q, r). After each stage: Face still possible? Yes: continue to the next stage. No: REJECT FRAME]
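The early-out structure of a cascade can be sketched as follows. This is a toy illustration, not the detector used in the talk: the features are arbitrary callables, a stage passes when the summed feature responses reach a threshold, and the names `run_cascade` and `toy_stages` are invented for the example.

```python
def run_cascade(window, stages):
    """Evaluate cascade stages in order and stop at the first failure.

    stages is a list of (features, threshold) pairs; each feature is a
    callable that maps the window to a number. Returns a tuple
    (face_still_possible, number_of_stages_evaluated).
    """
    for depth, (features, threshold) in enumerate(stages, start=1):
        score = sum(feature(window) for feature in features)
        if score < threshold:
            return False, depth  # early out: reject this search square
    return True, len(stages)


# Toy two-stage cascade over a four-value "window" (purely illustrative).
toy_stages = [
    ([lambda w: w[0] - w[1]], 0.0),
    ([lambda w: w[2] - w[3], lambda w: w[0] - w[3]], 1.0),
]

print(run_cascade([5, 1, 4, 2], toy_stages))  # (True, 2)
print(run_cascade([1, 5, 4, 2], toy_stages))  # (False, 1)
```

Most search squares are rejected after only a few stages, so the work per square varies widely; that irregularity is what makes the choice of processor for each stage matter.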

Page 22

22 CASCADE STAGES, EARLY OUT BETWEEN EACH

[Diagram: STAGE 1, STAGE 2, … STAGE 21, STAGE 22; an early out after any stage gives NO FACE; passing all stages gives FACE CONFIRMED]

Final HD Calculations

Search squares = 3.8 million

Average features per square = 124

Calculations per feature = 100

Calculations per frame = 47 GCalcs

Calculation Rate

30 frames/sec = 1.4 TCalcs/second

60 frames/sec = 2.8 TCalcs/second

…and this only gets front-facing faces
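The figures above follow from straightforward multiplication, reproduced here as a quick check. The variable names are chosen for illustration; the inputs are the slide's own numbers.

```python
search_squares = 3.8e6        # from the scaled-frame calculation
features_per_square = 124     # average features per square
calcs_per_feature = 100       # calculations per feature

calcs_per_frame = search_squares * features_per_square * calcs_per_feature


def calcs_per_second(frames_per_second):
    """Calculation rate needed to process every frame at the given rate."""
    return calcs_per_frame * frames_per_second


print(f"{calcs_per_frame / 1e9:.0f} GCalcs per frame")         # 47 GCalcs per frame
print(f"{calcs_per_second(30) / 1e12:.1f} TCalcs/s at 30 fps")  # 1.4 TCalcs/s at 30 fps
print(f"{calcs_per_second(60) / 1e12:.1f} TCalcs/s at 60 fps")  # 2.8 TCalcs/s at 60 fps
```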

Page 23

CASCADE DEPTH ANALYSIS

[Chart: Cascade Depth, axis 0 to 25; legend bins 0-5, 5-10, 10-15, 15-20, 20-25]

Page 24

PROCESSING TIME/STAGE

[Chart: Time (ms), 0 to 100, per Cascade Stage (1, 2, 3, 4, 5, 6, 7, 8, 9-22); series: GPU, CPU; “Trinity” A10-4600M (6CU@497MHz, 4 cores@2700MHz)]

AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)

Page 25

PERFORMANCE CPU-VS-GPU

[Chart: Images/Sec, 0 to 12, by Number of Cascade Stages on GPU (0, 1, 2, 3, 4, 5, 6, 7, 8, 22); series: CPU, HSA, GPU; “Trinity” A10-4600M (6CU@497MHz, 4 cores@2700MHz)]

AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)

Page 26

HAAR SOLUTION – RUN DIFFERENT CASCADES ON GPU AND CPU

By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload

+2.5x INCREASED PERFORMANCE

-2.5x DECREASED ENERGY PER FRAME

Page 27

ACCELERATING B+TREE SEARCHES

CLOUD SERVER WORKLOAD

Page 28

B+TREE SEARCHES

B+Trees are used by enterprise DB applications

SQL: SQLite, MySQL, Oracle, and others

No-SQL: CouchDB, Tokyo Cabinet, and others

Audio search, video copy detection

DB Size: Can be many times larger than GPU memory!

B+Trees are a fundamental data structure

Used to reduce memory & disk access to locate a key

Can support index- and range-based queries

Can be updated efficiently

A simple B+Tree linking the keys 1-7. The linked list (red) allows rapid in-order traversal.
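A minimal in-memory B+Tree sketch shows how a key lookup descends through a few inner nodes and how the leaf linked list serves range queries. It is an illustrative toy under stated assumptions: it is built once from non-empty sorted keys, does not support updates, and uses `order` loosely as the maximum number of keys per leaf and children per inner node. All class and function names are hypothetical.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # link to the next leaf for in-order traversal


class Inner:
    def __init__(self, separators, children):
        # separators[i] is the smallest key stored under children[i + 1]
        self.separators = separators
        self.children = children


def bulk_load(sorted_keys, order=4):
    """Build a B+Tree bottom-up from a non-empty list of sorted keys."""
    leaves = [Leaf(sorted_keys[i:i + order])
              for i in range(0, len(sorted_keys), order)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    # Each entry is (smallest key in the subtree, node).
    level = [(leaf.keys[0], leaf) for leaf in leaves]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), order):
            group = level[i:i + order]
            node = Inner([key for key, _ in group[1:]],
                         [child for _, child in group])
            parents.append((group[0][0], node))
        level = parents
    return level[0][1]


def find_leaf(root, key):
    """Descend from the root to the leaf that would hold key."""
    node = root
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.separators, key)]
    return node


def contains(root, key):
    return key in find_leaf(root, key).keys


def range_query(root, lo, hi):
    """Return all keys in [lo, hi] by walking the leaf linked list."""
    leaf = find_leaf(root, lo)
    found = []
    while leaf is not None:
        for key in leaf.keys:
            if key > hi:
                return found
            if key >= lo:
                found.append(key)
        leaf = leaf.next
    return found


tree = bulk_load([1, 2, 3, 4, 5, 6, 7], order=2)
print(contains(tree, 4))        # True
print(contains(tree, 8))        # False
print(range_query(tree, 3, 6))  # [3, 4, 5, 6]
```

Each lookup is independent of the others, which is why a large batch of queries can be searched in parallel while updates stay sequential.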

Page 29

PARALLEL B+TREE SEARCHES ON HSA

HSA increases performance versus multi-threaded CPU, even for tree structures that reside in pinned host memory.

[Chart: Millions of Queries Per Second (MQPS), 0 to 80, by B+Tree “order” (4, 8, 16, 32, 64, 128); series: CPU, APU]

Platform                    Size < 3 GB   Size > 3 GB
dGPU (memory size = 3GB)    ✓             ✗
HSA                         ✓             ✓

With HSA, DB can be larger than GPU memory, and can be shared with CPU.

HSA lets us move compute to data

Parallel search can move to GPU

Sequential updates can remain on CPU

M. Daga and M. Nutter, “Exploiting Coarse-Grained Parallelism in B+Tree Searches on an APU”, accepted at the Second Workshop on Irregular Applications: Algorithms and Architectures (IA3), November 2012.

AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM

Page 30

ACCELERATING SUFFIX ARRAY CONSTRUCTION

CLOUD SERVER WORKLOAD

Page 31

SUFFIX ARRAYS

Suffix Arrays are used to accelerate in-memory cloud workloads

Full text index search

Lossless data compression

Bio-informatics

Suffix Arrays are a fundamental data structure

Designed for efficient searching of a large text

Quickly locate every occurrence of a substring S in a text T

Suffix Array Construction contains parallel & sequential steps

Sorting, ideal for GPU

Nested recursion, ideal for CPU
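To make the data structure concrete, here is a deliberately naive sketch: it builds the suffix array by sorting all suffixes directly and then locates every occurrence of a substring with two binary searches. This is not the construction algorithm discussed in the talk; it is only meant to show what a suffix array contains and how it is queried. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction that compares whole suffixes; fine for short
    strings, far too slow for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return every start position of pattern in text, in increasing order."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
print(find_occurrences(text, sa, "na"))   # [2, 4]
```

The sort dominates construction time, which is why the sorting steps are the natural candidates for GPU offload.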

Page 32

ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA

By offloading data parallel computations to GPU, HSA increases performance and reduces energy for Suffix Array Construction versus single threaded execution.

+5.8x INCREASED PERFORMANCE

-5x DECREASED ENERGY

By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without penalty of intermediate copies.

Skew Algorithm for Compute SA

Merge Sort::GPU

Radix Sort::GPU

Compute SA::CPU

Lexical Rank::CPU

Radix Sort::GPU

M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, submitted to Principles and Practice of Parallel Programming (PPoPP’13), February 2013.

AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM

Page 33

VASCULAR IMAGE ENHANCEMENT

EMBEDDED MEDICAL WORKLOAD

Page 34

VASCULATURE IMAGE ENHANCEMENT

[Figure: Input Image and Output Image]

Automatic Image Enhancement

Reduces expertise required to analyze the image

Reduces time to diagnosis

Vasculature Image Enhancement:

GPU::Gaussian Convolution

Intermediate copied to CPU

GPU::Hessian Matrix Compute

Intermediate copied to CPU

GPU::Eigen Decomposition

Intermediate copied to CPU

GPU::Vascular Network Compute

CPU::Analysis

Page 35

IMPROVED PERFORMANCE AND ENERGY

HSA increases performance and reduces energy versus single threaded execution by offloading data parallel compute to GPU

+14x INCREASED PERFORMANCE

-14x DECREASED ENERGY

AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)

Page 36

EASE OF PROGRAMMING

CODE COMPLEXITY VS. PERFORMANCE

Page 37

LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS

(Exemplary ISV “Hessian” Kernel)

[Chart: lines of code (LOC, 0 to 350) and Performance (0 to 35.00) for each programming model: Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, HSA Bolt. LOC is broken down into Init, Compile, Copy, Launch, Algorithm and Copy-back]

AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.

Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta

Page 38

THE HSA FUTURE

Highly productive programmers

+ Scalable performance

+ Power efficiency

= AMAZING USER EXPERIENCES

Page 39

THE HSA FOUNDATION

Promoters

Founders

Contributors

Supporters

Associates

Page 40

THANK YOU!


www.hsafoundation.com