TRANSCRIPT
Copyright © 2015 Advanced Micro Devices 1
Harris Gasparakis, Ph.D.
12 May 2015
Understanding Adaptive Machine Learning Vision
Algorithms and Implementing them on GPUs and
Heterogeneous Platforms
• Machine Learning (ML)
• Constrained optimization problems
• Heterogeneous computing
• OpenCL 2.0, HSA
• Synthesis
• OpenCL programming tips for ML
• Conclusions
Agenda
Can you find an algorithm to describe an object, and detect it?
Why Machine Learning (ML)?
Can you find an algorithm to describe an object, and detect it?
Sometimes Not Needed…
Can you find an algorithm to describe an object, and detect it?
Most Often Indispensable!
Vidit Jain and Erik Learned-Miller.
FDDB: A Benchmark for Face Detection in Unconstrained Settings.
• Learn from examples!
• Model the universe using functions with (possibly many) parameters “w”
that you learn from training data
• “x” is a (multi-dimensional) function of the image data
• Pixel patches
• A priori Features
• Features in a learned dictionary (basis)
• PCA
• Sparse coding/LASSO/LARS
• DNN
• “y” is our value judgment on the data
• Object category
• Object identity, etc.
Formalism
Tune parameters “w” to best explain the “N” observations (𝑦𝑛, 𝑥𝑛)
Machine learning typically involves constrained functional minimization
• Bias/variance
• Overcompleteness/sparsity
• How much learning is too much?
• N = ? |w| = ?
• Graphical models/subspace updates
Formalism
E(w) = Σ_{n=1}^{N} D(y_n, x_n; w) + λ C(w) + ⋯
It is a Jungle of Minima!
Start with an initial guess:
w_0
Iteratively improve it:
w_t = w_{t−1} + δw_t
Local minima, with basins of attraction
• Second order methods:
δw_t = −H⁻¹(w_{t−1}) g(w_{t−1})
• First order methods:
δw_t = −κ g(w_{t−1})
• Tweaks:
Line minimization, momentum, heat, homotopy, multiresolution
• Modern first order methods (AdaGrad, AdaDelta, etc.):
δw_t = −H(g_{1:t}) g(w_{t−1})
History
• Start from a pool of multiple initial conditions and multiple update
rules (“configurations”)
• Explore them simultaneously (GPU threads)
• On each update step, reason about the progress of each (CPU threads)
• Eliminate configurations that are:
• Dead ends
• In the same basin of attraction
• Replace them with other random configurations
• Give preference to configurations that progress the most
What if?
Let’s Explore the Jungle!
Some Dead Ends…
Reinitialize them!
Continue Exploring…
One Visitor Per Attractor is Enough…
• CPU as adaptive GPU supervisor
• GPU computes an ensemble of updates
• CPU reasons about the ensemble of updates
• Coalesce if in the same basin of attraction
• Prune or “kick” if trapped in local minimum
• Test and rank according to generalization error
• Is it practical?
The Master Adaptive Strategy
Know Thy (HSA) Hardware!
[Diagram: CPU and HSA iGPU sharing physical memory through a unified
(bidirectionally coherent, pageable) virtual memory (hUMA); each side has its
own scheduler and L2, with cache-coherent (CC) per-core L1 caches on the CPU
and L1/LDS per compute unit on the GPU.]
Heterogeneous System Architecture (exposed via OpenCL 2.0)
[Diagram: the same CPU/HSA iGPU system, now highlighting hQ (heterogeneous
queuing): dynamic parallelism, context switching, preemption, and concurrent
execution.]
Know Thy (HSA) Hardware!
• Scope
• Thread, workgroup, device, all HSA devices
• Semantics
• Acquire: require that memory writes of other threads within the
scope become visible to the current thread
• Release: writes of the current thread become visible to other threads in
the current scope
C++11 Atomics/OpenCL 2.0 Atomics
• Initialize the pool of w_t as fine-grain SVM with atomics enabled:
clSVMAlloc (…, CL_MEM_READ_WRITE |
CL_MEM_SVM_FINE_GRAIN_BUFFER |
CL_MEM_SVM_ATOMICS, …);
• CPU waits for the GPU to finish an iteration:
done = std::atomic_load_explicit (…,
std::memory_order_acquire );
• GPU kernel “signals” when done with an iteration:
atomic_store_explicit ( (global atomic_int *)(…), …,
memory_order_release,
memory_scope_all_svm_devices );
C++11 Atomics/OpenCL 2.0 Atomics
• The optimal partitioning of the problem into threads may be non-obvious
• Depends a lot on cache line size
• Do not incur memory latency multiple times; align threads with
cache lines.
OpenCL Tips
Copyright © 2015 Advanced Micro Devices 25
• X_{0,0}, X_{0,1}, …, X_{0,15}, …, X_{0,127}
• X_{1,0}, X_{1,1}, …, X_{1,15}, …, X_{1,127}
• X_{2,0}, X_{2,1}, …, X_{2,15}, …, X_{2,127}
• …
• X_{N−1,0}, X_{N−1,1}, …, X_{N−1,15}, …, X_{N−1,127}
K=2 Means, N=10000, in F=128 dims
• M_{0,0}, M_{0,1}, …, M_{0,15}, …, M_{0,127}
• M_{1,0}, M_{1,1}, …, M_{1,15}, …, M_{1,127}
• The optimal partitioning of the problem into threads may be non-obvious
• Depends a lot on cache line size
• Depends a lot on L2 size (and, for virtual memory, on page size)
• Don’t jump around virtual pages
• Ensure you stay within L2
Know Thy Hardware!
[Diagram, programmer’s view: Input in device/main memory → Kernel 1 →
intermediate in device/main memory → Kernel 2 → Output in device/main memory,
all through virtual memory.]
Programmer’s View
[Diagram, ideal physical view: Input streams from device/main memory through
L2 into Kernel 1; the intermediate stays in L2 for Kernel 2; only the Output
goes back out to device/main memory.]
Ideal Physical View
[Diagram: as above, but the intermediate between Kernel 1 and Kernel 2 spills
through L2 out to device/main memory and is read back in again, which is the
traffic to avoid.]
Be Mindful of Your L2
• Producer/consumer paradigm:
• GPU: number-crunching producer
• CPU: supervises the GPU to global convergence
• Mediated via C++11 platform atomics
• Very easy to transition to OpenCL right NOW!
• Replace all malloc code with:
clSVMAlloc and clEnqueueSVMMap (if needed)
• That’s it! No need to change any CPU code, and you can start
writing kernels!
Conclusions
• Ready for prime time in real time!
• Detection
• Recognition
• Tracking
• Real-time learning
Conclusions
The information presented in this document is for informational purposes only and may contain technical inaccuracies,
omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not
limited to product and roadmap changes, component and motherboard version changes, new model and/or product
releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the
like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right
to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify
any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO
RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN
NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES
ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES.
ATTRIBUTION
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks
of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes
only and may be trademarks of their respective owners. OpenCL is a trademark of Apple Inc. used by permission by
Khronos.
Disclaimer & Attribution