
MOTOROLA and the Stylized M Logo are trademarks or registered trademarks of Motorola Trademark Holdings, LLC. All other trademarks are the property of their respective owners. © 2010 Motorola Mobility, Inc. All rights reserved.

Hardware-accelerated computing and rendering with Node.JS (node-webcl, node-webgl, node-glfw, node-image)

Mikaël Bourges-Sévenier, Motorola Mobility September 6, 2012


Content

§ Motivation
  § General-purpose computing on GPU
  § Why Node.JS?
  § Notes on multi-threading

§ Overview of my Node.JS modules
  § Architecture
  § Installation
  § Examples / Demos

§ Understanding WebCL / OpenCL
  § OpenCL features
  § OpenCL model
  § WebCL API
  § “Hello World” code walkthrough

§ Perspectives

2012-09-06 © 2012 Motorola Mobility, Inc.


Motivation: General-Purpose Computing on GPU

§ More and more data to process
  § Signal & image processing
  § Data mining, pattern matching, statistics…
  § Machine intelligence
  § Financial analysis
  § Physics engines, ray-tracing…

§ CPUs tend to have up to 16 cores
  § General purpose
  § Launch a thread per hardware execution context

§ GPUs have 100s of cores
  § Persistent threads
  § Launch a workgroup per hardware execution context
  § Designed for data-parallel computations
  § Originally developed for 3D vector graphics
  § More transistors devoted to processing than to caching and control

[Figure: CPU block diagram (Control, ALUs, Cache, DRAM) next to GPU block diagram (DRAM). Source: David Luebke, The democratization of Parallel Computing, SC07]


Motivation: CPU-GPU Systems-on-a-Chip (SOCs)


AMD Trinity

Intel Ivy Bridge

Nvidia Kepler


Motivation: Why Node.JS?

§ V8 JavaScript engine, as in Chrome browsers, cross-platform

§ Fast prototyping
  § For app development
  § Great to test features before adding them to browsers

§ Modular, easily extensible
  § Tons of great modules
  § Great for developing/testing/maintaining modular apps
  § Great for developing new features

§ For server-side applications (no GUI, or web browser as GUI)
  § With lots of data to process (great for OpenCL)

§ For client-side development
  § Same JS code running on Node.JS and browsers
  § Faster than browsers (fewer layers)



Notes on multi-threading

§ JavaScript has no threading support: it is an event-based language
§ Node.JS is an implementation of the Reactor pattern
§ Operations run in worker threads
§ For (large) data-intensive tasks that can be parallelized, a GPU offers more processing power than a CPU


[Figure: Node.JS event loop (single thread). Callbacks are registered with the event loop; long operations are deferred to worker threads; responses come back via callbacks and are sent to clients. Threads are handled by node.js internally.]
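The register-callback / respond-via-callback cycle can be observed with any asynchronous Node.JS call. The sketch below is only an illustration and is not taken from the talk: it uses the built-in `crypto.pbkdf2`, whose work runs on Node's internal thread pool, and the `order` array is a hypothetical helper used to record what happens when.

```javascript
// Sketch: the event-loop thread registers a callback and keeps going;
// the key derivation itself runs on one of Node's internal worker threads.
const crypto = require('crypto');

const order = []; // hypothetical log of events, for illustration only

order.push('register callback');
crypto.pbkdf2('secret', 'salt', 1000, 32, 'sha256', (err, key) => {
  if (err) throw err;
  order.push('response via callback');
  console.log(order.join(' -> '), '(derived ' + key.length + ' bytes)');
});

// This line runs before the callback fires: the event loop is not blocked.
order.push('event loop continues');
```

The callback always runs after the synchronous code has finished, which is why long operations do not stall the single event-loop thread.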


OVERVIEW OF MY NODE.JS MODULES



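My Node.JS modules

[Figure: architecture diagram. The App runs on node.js (V8) and uses four Node.JS modules: node-webcl, node-webgl, node-glfw and node-image. Native libraries and hardware underneath: OpenCL, OpenGL, GLFW, AntTweakBar, FreeImage. Legend: native/hardware, Node.JS module, required dependency, optional dependency.]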



node-webgl

- Started from the Blue Lava demo for WebOS: http://minimason.no.de/
- Emulates WebGL 1.x API
- Uses desktop OpenGL
- Requires node-glfw
- Implements DOM document
- Implements DOM mouse & key events
- Implements HTML <Image>



node-glfw

- Wrapper around GLFW, a cross-platform library for opening a window, creating an OpenGL context, and managing input
- Relies on GLEW to get the right OpenGL extensions
- Over time, added AntTweakBar to get a simple menu system



node-image

- Wraps FreeImage
- Optimized native buffers for node-webcl and node-webgl



node-webcl

- Implements WebCL API
- Implements WebCL - WebGL interop
- Tracks WebCL spec changes on a weekly basis


Installation

§ Requires Node.JS >= 0.7.x (due to TypedArrays)
  § Mac OSX 10.7, Microsoft Windows 7, Ubuntu 10.10+

§ Make sure node-gyp is installed

§ Get latest OpenCL 1.1+ / OpenGL 2+ drivers for your GPU
  § Intel OpenCL SDK, AMD APP SDK, Nvidia CUDA 4.x SDK

§ Install latest native libraries in your library and include paths
  § GLFW http://www.glfw.org/
  § GLEW http://glew.sourceforge.net/
  § FreeImage http://freeimage.sourceforge.net/
  § AntTweakBar http://www.antisphere.com/Wiki/tools:anttweakbar

§ Install node-webcl or node-webgl
  § npm install node-webcl will also install node-webgl
  § npm install node-webgl will also install node-glfw

§ For examples, also install: npm install optimist


Usage


for node-webgl

    WebGL = require('node-webgl');
    Image = WebGL.Image;
    document = WebGL.document();
    window = document;
    canvas = document.createElement("my_canvas");
    gl = canvas.getContext("experimental-webgl");

for node-webcl

    WebCL = require('node-webcl');


Test your installation (node-webgl)

§ GL = Graphics Library

§ cd node-webgl

§ node examples/lightgl/shadowmap.js

§ node test/cube.js

[Screenshots: examples/lightgl/shadowmap.js; test/lesson08.js (node-image); test/cube.js (AntTweakBar)]


Test your installation (node-webcl)

§ CL = Computing Language

§ cd node-webcl

§ node examples/DeviceQuery.js

§ node examples/VectorAdd.js

[Screenshots: test/image.js (PBO); examples/sine.js (FBO); examples/apple/qjulia/qjulia.js (PBO, AntTweakBar)]


UNDERSTANDING WEBCL & OPENCL



What is WebCL?

§ WebCL brings parallel computing to the Web through a secure JavaScript binding to OpenCL 1.1 (2011)
  § Open standard, royalty-free
  § Platform independent
  § Device independent
  § Being standardized by Khronos

§ First public working draft April 2012
  § http://www.khronos.org/webcl/



OpenCL overview

§ OpenCL framework has 2 parts
  1. Host API: C-based, cross-platform, object-oriented
     • Commands to send/receive data and control execution on devices
  2. Kernels
     • Run on devices
     • Use a subset of C99 and extensions
     • Vector extensions (<type>N)
     • No recursion, no function pointers
     • No dynamic memory (malloc, free…), no standard libc methods (memcpy…)
     • Kernels are akin to shaders in WebGL

§ Features
  § Well-defined numerical accuracy both for integers and floats
  § Rich set of built-in functions (as in GLSL, and more)
     • But no random method
  § Close to the hardware
     • Allow control over memory use
     • Allow control over thread scheduling



OpenCL Device Model

§ A host is connected to one or more compute devices

§ Compute device
  § A collection of one or more compute units (~ cores)
  § A compute unit is composed of one or more processing elements (~ threads)
  § Processing elements execute code as SIMD or SPMD

[Figure: a host (PC) connected to compute devices (GPU, CPU, DSP, FPGA…); each compute device contains compute units (cores), each made of processing elements (threads).]


examples/DeviceQuery.js

§ Queries parameters of all OpenCL devices attached to your computer

§ Example: on a MacBook Pro early 2011, OSX 10.8

Found 2 devices
---------------------------------
Device: Intel(R) Core(TM) i7-2720QM CPU @ 2.20GHz
---------------------------------
  DEVICE_NAME: Intel(R) Core(TM) i7-2720QM CPU @ 2.20GHz
  DEVICE_VENDOR: Intel
  DRIVER_VERSION: 1.1
  DEVICE_VERSION: OpenCL 1.2
  DEVICE_PROFILE: FULL_PROFILE
  DEVICE_OPENCL_C_VERSION: OpenCL C 1.2
  DEVICE_TYPE: cpu
  DEVICE_MAX_COMPUTE_UNITS: 8
  DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
  DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1 / 1
  DEVICE_MAX_WORK_GROUP_SIZE: 1024
  DEVICE_MAX_CLOCK_FREQUENCY: 2200 MHz
---------------------------------
Device: ATI Radeon HD 6750M
---------------------------------
  DEVICE_NAME: ATI Radeon HD 6750M
  DEVICE_VENDOR: AMD
  DRIVER_VERSION: 1.0
  DEVICE_VERSION: OpenCL 1.1
  DEVICE_PROFILE: FULL_PROFILE
  DEVICE_OPENCL_C_VERSION: OpenCL C 1.1
  DEVICE_TYPE: gpu
  DEVICE_MAX_COMPUTE_UNITS: 6
  DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
  DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 1024
  DEVICE_MAX_WORK_GROUP_SIZE: 1024
  DEVICE_MAX_CLOCK_FREQUENCY: 600 MHz



§ CPU: 4 cores, hyperthreaded => 8 compute units; up to 1024 threads in 1D, at 2.2 GHz
§ GPU: 6 compute units; up to 1024 threads but in 3 dimensions, at 600 MHz


OpenCL Execution Model

§ Kernel
  § Basic unit of executable code (~ DLL entry point)
  § Data-parallel or task-parallel

§ Program
  § Collection of kernels and functions called by kernels
  § Analogous to a dynamic library (DLL)

§ Command Queue
  § Control operations on OpenCL objects (memory transfers, kernel execution, synchronization)
  § Commands queued in order
  § Execution in-order or out-of-order
  § Applications may use multiple command-queues per device

§ Work-item
  § An execution of a kernel by a processing element (~ thread)

§ Work-group
  § A collection of work-items that execute on a single compute unit (~ core)

[Figure: a context containing a GPU and a CPU, each with its own command queue.]


OpenCL Work-group 2D analogy

§ # work-items = # pixels
§ # work-groups = # tiles
§ Work-group size = tileW * tileH
§ All threads in a workgroup run synchronously

[Figure: an image split into tiles; "Global" spans the whole image, "Local" spans one tile.]
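The pixel/tile analogy can be worked through in plain JavaScript. This is only a sketch: the image and tile sizes are made-up values, and `decompose` is a hypothetical helper that mirrors how a global id relates to a work-group id and a local id when the image size is a multiple of the tile size.

```javascript
// Hypothetical helper: split a 2D global id into work-group (tile) id and local id.
function decompose(globalId, localSize) {
  return {
    group: globalId.map((g, d) => Math.floor(g / localSize[d])),
    local: globalId.map((g, d) => g % localSize[d])
  };
}

const imageSize = [64, 32]; // width, height in pixels = global work size (assumed values)
const tileSize = [16, 8];   // tileW, tileH = local work size (assumed values)

const numWorkItems = imageSize[0] * imageSize[1];  // one work-item per pixel
const numWorkGroups =
    (imageSize[0] / tileSize[0]) * (imageSize[1] / tileSize[1]); // one work-group per tile
const workGroupSize = tileSize[0] * tileSize[1];   // tileW * tileH

// Pixel (37, 13) sits in tile (2, 1), at position (5, 5) inside that tile.
const where = decompose([37, 13], tileSize);
console.log(numWorkItems, numWorkGroups, workGroupSize, where);
```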


OpenCL Kernel

§ Defined on an N-dimensional computation domain

§ A kernel is executed at each point of the computation domain

// In JavaScript
function multiply(a, b, n) {
  var c = [];
  for (var i = 0; i < n; ++i)
    c[i] = a[i] * b[i];
  return c;
}

// In OpenCL C99
/**
 * @param a, b, c are buffers in global memory
 * @param n number of elements in a, b, and c
 */
__kernel void multiply(__global const float *a,
                       __global const float *b,
                       __global float *c,
                       unsigned int n)
{
  unsigned int tid = get_global_id(0);  // thread number
  if (tid >= n) return;                 // make sure we don't go past the end of the buffers
  c[tid] = a[tid] * b[tid];
}
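To make "executed at each point of the computation domain" concrete, the following sketch emulates a 1D launch on the CPU. It is not how an OpenCL runtime is implemented: `runNDRange` is a hypothetical stand-in that simply calls the kernel once per work-item, sequentially, and `multiplyKernel` is a JavaScript transcription of the kernel in which `tid` plays the role of `get_global_id(0)`.

```javascript
// Hypothetical stand-in for the OpenCL runtime: one kernel call per work-item.
function runNDRange(kernel, globalSize, args) {
  for (let gid = 0; gid < globalSize; ++gid) {
    kernel(gid, ...args);
  }
}

// JavaScript transcription of the multiply kernel.
function multiplyKernel(tid, a, b, c, n) {
  if (tid >= n) return; // surplus work-items do nothing
  c[tid] = a[tid] * b[tid];
}

const n = 5;
const a = Float32Array.from([1, 2, 3, 4, 5]);
const b = Float32Array.from([2, 2, 2, 2, 2]);
const c = new Float32Array(n);

// Launch 8 work-items for 5 elements: the extra 3 hit the guard and return.
runNDRange(multiplyKernel, 8, [a, b, c, n]);
console.log(Array.from(c)); // [ 2, 4, 6, 8, 10 ]
```

On a real device the work-items run in parallel rather than in a loop, which is why the kernel body must not depend on the order in which ids are visited.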


OpenCL Memory Model

§ On host
  § CPU RAM

§ On compute device
  § Global memory = GPU RAM
  § Constant memory = cached global memory
  § Texture memory = cached global memory optimized for streaming reads
  § Local memory = high-speed memory shared among work-items of a work-group (~ L1 cache)
  § Private memory = registers of a work-item, very fast memory

§ Memory management is explicit
  § App must move data host ➞ global ➞ local and back

[Figure: memory hierarchy. Each work-item (1…M) has private memory; each workgroup (1…N) has local memory; all workgroups share the compute device's global memory with constant and texture caches; the host and its host memory reach the device through command queues and API calls.]


WebCL API

[Figure: class diagram grouped into platform layer, compiler layer and runtime layer. Classes shown: WebCL, WebCLPlatform, WebCLExtension, WebCLDevice, WebCLContext, WebCLProgram, WebCLKernel, CommandQueue, Event, Sampler, and the abstract WebCLMemoryObject with WebCLBuffer and WebCLImage.]

§ Same OO model as OpenCL with JS classes
§ WebCL is global object


“HELLO WORLD” CODE WALKTHROUGH



WebCL sequence (host side)

§ Create context
§ Compile kernels
§ Setup command-queues
§ Setup kernels arguments
§ Execute commands
§ Read results

[Figure: flow chart of the host-side steps, spanning the platform layer, the compiler and the runtime layer]
  1. Select platform
  2. Select device
  3. Create context
  4. Load and compile kernels on devices
  5. Create buffers to store data on devices
  6. Update kernels arguments
  7. Create command queues for each device
  8. Send data to devices using their command queues
  9. Send commands to devices using their command queues
  10. Get data from devices using their command queues
  11. Release resources





WebCL sequence (host side) [1/6]

// create the OpenCL context
try {
  clContext = WebCL.createContext({
    deviceType: WebCL.DEVICE_TYPE_GPU
  });
} catch (err) {
  throw "Error: Failed to create context! " + err;
}

var devices = clContext.getInfo(WebCL.CONTEXT_DEVICES);
if (!devices) {
  throw "Error: Failed to retrieve compute devices for context!";
}
var clDevice = devices[0];  // first device, used when building the program





WebCL sequence (host side) [2/6]

<script id="multiply_script" type="x-webcl">
__kernel void multiply(__global const float *a,
                       __global const float *b,
                       __global float *c,
                       unsigned int n)
{
  unsigned int tid = get_global_id(0);  // thread number
  if (tid >= n) return;                 // make sure we don't go past the end of the buffers
  c[tid] = a[tid] * b[tid];
}
</script>

// Create the compute program from the source buffer (text)
clProgram = clContext.createProgram(getSource("multiply_script"));

// Build the program executable
try {
  clProgram.build(clDevice, '-cl-fast-relaxed-math -DDEBUG=1');
} catch (err) {
  throw "Error: Failed to build program executable!\n"
      + clProgram.getBuildInfo(clDevice, WebCL.PROGRAM_BUILD_LOG);
}

clKernel = clProgram.createKernel("multiply");





WebCL sequence (host side) [3/6]

var BUFFER_SIZE = 10;
// The kernel takes float buffers, so the host arrays are 32-bit floats
var A = new Float32Array(BUFFER_SIZE);
var B = new Float32Array(BUFFER_SIZE);

// store data in A and B …

var size = BUFFER_SIZE * Float32Array.BYTES_PER_ELEMENT;  // size in bytes

// Create device buffers for A and B (host contents are written in step 4/6)
var aBuffer = clContext.createBuffer(WebCL.MEM_READ_ONLY, size);
var bBuffer = clContext.createBuffer(WebCL.MEM_READ_ONLY, size);

// Create buffer for C to read results
var cBuffer = clContext.createBuffer(WebCL.MEM_WRITE_ONLY, size);

// Set kernel args
clKernel.setArg(0, aBuffer);
clKernel.setArg(1, bBuffer);
clKernel.setArg(2, cBuffer);
clKernel.setArg(3, BUFFER_SIZE, WebCL.type.UINT);




WebCL sequence (host side) [4/6]

// Create command queue
clQueue = clContext.createCommandQueue(devices[0]);

// enqueue buffers
clQueue.enqueueWriteBuffer(aBuffer, false, 0, size, A);
clQueue.enqueueWriteBuffer(bBuffer, false, 0, size, B);

__kernel void multiply(__global const float *a,
                       __global const float *b,
                       __global float *c,
                       unsigned int n);





WebCL sequence (host side) [5/6]

// Execute (enqueue) kernel
clQueue.enqueueNDRangeKernel(clKernel,
    null,            // global work offset
    [BUFFER_SIZE],   // global work size
    [2]);            // local work size

Note: Use local work size = [] or null (default) to let the driver choose the best values.
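When an explicit local work size is given, OpenCL 1.x expects the global work size to be a multiple of it. The slide's values (10 and 2) already satisfy this. For other sizes, a common approach is to round the global size up and rely on the kernel's `if (tid >= n) return;` guard for the padding work-items. The helper name `roundUp` below is hypothetical, a minimal sketch of that arithmetic.

```javascript
// Hypothetical helper: smallest multiple of localSize that is >= count.
function roundUp(count, localSize) {
  return Math.ceil(count / localSize) * localSize;
}

const BUFFER_SIZE = 10;
const localSize = 4; // assumed value, chosen so that 10 is not a multiple of it

// 12 work-items are launched; the last 2 are stopped by the kernel's guard.
const globalSize = roundUp(BUFFER_SIZE, localSize);
console.log(globalSize); // 12
```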





WebCL sequence (host side) [6/6]

// get results and block while getting them
// (the kernel writes floats, so read them back into a Float32Array)
var C = new Float32Array(BUFFER_SIZE);
clQueue.enqueueReadBuffer(cBuffer,
    true,   // blocking call
    0, size, C);
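Once the results are back on the host, it can be worth comparing them with a plain JavaScript reference, especially since the program was built with `-cl-fast-relaxed-math`. The sketch below assumes the buffers hold 32-bit floats, as the kernel signature declares. `firstMismatch` is a hypothetical helper, and the arrays are stand-in data rather than values read from a device.

```javascript
// Hypothetical helper: index of the first element of C that differs from
// A[i] * B[i] by more than a relative tolerance, or -1 if all match.
function firstMismatch(A, B, C, tolerance) {
  for (let i = 0; i < C.length; ++i) {
    const expected = Math.fround(A[i] * B[i]); // round to 32-bit float
    const scale = Math.max(1, Math.abs(expected));
    if (Math.abs(C[i] - expected) > tolerance * scale) return i;
  }
  return -1;
}

// Stand-in data; in the walkthrough, C would come from enqueueReadBuffer.
const A = Float32Array.from([1.5, 2, 3]);
const B = Float32Array.from([2, 4, 0.5]);
const C = Float32Array.from([3, 8, 1.5]);

console.log(firstMismatch(A, B, C, 1e-5)); // -1
```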


Example: Matrix multiplication

§ “Hello World of CL”
§ C = A x B
§ N x N matrices

[Figure: matrices A and B multiplied into C]


Example: Matrix multiplication

§ Optimization
  § N x N matrices
  § C divided into m x m tiles
  § With
     • m = N / P
     • P = # threads per workgroup (16)

[Figure: matrices A, B and C divided into tiles]
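The tiling idea can be sketched on the CPU in plain JavaScript. This is an illustration of the loop structure only, not the talk's OpenCL kernel: the function names are hypothetical, `T` is an assumed tile edge length, and N is assumed to be a multiple of T. On a GPU, each tile of C would typically map to one work-group, with the matching blocks of A and B staged in local memory.

```javascript
// Reference: straightforward triple loop over row-major N x N matrices.
function matmulNaive(A, B, N) {
  const C = new Float64Array(N * N);
  for (let i = 0; i < N; ++i) {
    for (let j = 0; j < N; ++j) {
      let sum = 0;
      for (let k = 0; k < N; ++k) sum += A[i * N + k] * B[k * N + j];
      C[i * N + j] = sum;
    }
  }
  return C;
}

// Tiled variant: C is processed tile by tile (T x T), accumulating partial
// products from the matching T x T blocks of A and B. Requires N % T === 0.
function matmulTiled(A, B, N, T) {
  const C = new Float64Array(N * N);
  for (let ti = 0; ti < N; ti += T) {
    for (let tj = 0; tj < N; tj += T) {
      for (let tk = 0; tk < N; tk += T) {
        for (let i = ti; i < ti + T; ++i) {
          for (let j = tj; j < tj + T; ++j) {
            let sum = C[i * N + j];
            for (let k = tk; k < tk + T; ++k) sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
          }
        }
      }
    }
  }
  return C;
}

const N = 8, T = 4;
const A = Float64Array.from({ length: N * N }, (_, i) => i % 7);
const B = Float64Array.from({ length: N * N }, (_, i) => (i * 3) % 5);
const same = matmulTiled(A, B, N, T).every((v, i) => v === matmulNaive(A, B, N)[i]);
console.log('tiled result matches naive result:', same);
```

Both versions add the products for a given element of C in the same order, so the results agree exactly; the tiled version only changes which data is touched together.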


Example: Comparison with sequential

§ MacBook Pro (early 2011), OSX 10.8
§ CPU: Intel Core i7, 2.2GHz, 4 cores
§ GPU: AMD Radeon HD 6750M, 1 GB, 480 SPU, 600 MHz, 576 GFLOPS

[Chart: speedup factor (y-axis, 0 to 250) for sizes 128, 256, 512, 1024 and 2048 (x-axis); series: OpenMP, CL (CPU), CL (GPU), CL (GPU opt)]


WEBCL – WEBGL INTEROP.



WebCL / WebGL interop

§ WebCL context created from WebGL context

§ Configure shared CL objects from GL counterparts

§ Sync GL and CL
  § Flush GL, acquire GL object
  § Execute CL
  § Release CL object, flush CL

§ Vertex arrays, textures, render-buffers can be shared with CL

[Figure: flow chart]
  Initialization: Initialize WebGL ➞ Initialize WebCL ➞ Configure shared CL-GL data
  Rendering loop (per frame): Set kernels args ➞ Enqueue commands ➞ Execute kernels ➞ Update scene ➞ Render scene


WebCL / WebGL interop

// Create WebGL context
var gl = canvas.getContext("experimental-webgl");

// Init GL …

// create the OpenCL context
try {
  clContext = WebCL.createContext({
    deviceType: WebCL.DEVICE_TYPE_GPU,
    shareGroup: gl
  });
} catch (err) {
  throw "Error: Failed to create context! " + err;
}


WebCL / WebGL interop (texture)

// Create OpenGL texture object
gl.activeTexture(gl.TEXTURE0);
glTexture = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, glTexture);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, TextureWidth, TextureHeight, 0,
              gl.RGBA, gl.UNSIGNED_BYTE, null);
gl.bindTexture(gl.TEXTURE_2D, null);

// Create the compute program from the source buffer (text)
clProgram = clContext.createProgram(getSource("multiply_script"));

// Build the program executable
try {
  clProgram.build(clDevice, '-cl-fast-relaxed-math -DDEBUG=1');
} catch (err) {
  throw "Error: Failed to build program executable!\n"
      + clProgram.getBuildInfo(clDevice, WebCL.PROGRAM_BUILD_LOG);
}

clKernel = clProgram.createKernel("multiply");


Demo: GL Texture update with CL

§ Based on Evgeny Demidov's 2D ink droplet
§ WebGL ~26 fps, WebCL ~124 fps


WebCL / WebGL interop (vbo)

// create buffer object
glVBO = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, glVBO);

// initialize buffer object
var sizeInBytes = mesh_width * mesh_height * 4 * Float32Array.BYTES_PER_ELEMENT;
gl.bufferData(gl.ARRAY_BUFFER, sizeInBytes, gl.DYNAMIC_DRAW);

// create OpenCL buffer from GL VBO
clVBO = clContext.createFromGLBuffer(WebCL.MEM_WRITE_ONLY, glVBO);

// set kernel args values
clKernel.setArg(0, clVBO);
clKernel.setArg(1, mesh_width, WebCL.type.UINT);
clKernel.setArg(2, mesh_height, WebCL.type.UINT);


Demo: VBO update with CL


WebCL/WebGL interop (host side)

// Sync GL and acquire buffer from GL
gl.flush();
clQueue.enqueueAcquireGLObjects(clTexture);

// Set global and local work sizes for kernel
var local = null;
var global = [ TextureWidth, TextureHeight ];

try {
  clQueue.enqueueNDRangeKernel(clKernel, null, global, local);
} catch (err) {
  throw "Failed to enqueue kernel! " + err;
}

// Release GL texture
clQueue.enqueueReleaseGLObjects(clTexture);
clQueue.flush();


Perspectives

§ WebCL and Node.JS are a match made in heaven
  § Node.JS can process lots of events
  § WebCL can process lots of data using many devices

§ WebCL enables GPGPU applications in Web browsers
  § Careful usage of architecture can lead to impressive speedup
  § With WebGL interoperability, rich graphics Web applications are now possible

§ DRAFT WebCL specification
  § Quite stable JavaScript host API
  § Focusing on more security and robustness


WebCL Open process and Resources

§ Khronos open process to engage Web community
  § Public specification drafts, mailing lists, forums
  § http://www.khronos.org/webcl/
  § [email protected]

§ Nokia open source prototype for Firefox in May 2011 (LGPL)
  § http://webcl.nokiaresearch.com

§ Samsung open source prototype for WebKit in July 2011 (BSD)
  § http://code.google.com/p/webcl/

§ Motorola open source prototype for NodeJS in March 2012 (BSD)
  § https://github.com/Motorola-Mobility/node-webcl
  § All demos in this talk were made with node-webcl / node-webgl


Start learning Now!

§ OpenCL Programming Guide - The “Red Book” of OpenCL
  § http://www.amazon.com/OpenCL-Programming-Guide-Aaftab-Munshi/dp/0321749642

§ OpenCL in Action
  § http://www.amazon.com/OpenCL-Action-Accelerate-Graphics-Computations/dp/1617290173/

§ Heterogeneous Computing with OpenCL
  § http://www.amazon.com/Heterogeneous-Computing-with-OpenCL-ebook/dp/B005JRHYUS

§ The OpenCL Programming Book
  § http://www.fixstars.com/en/opencl/book/


Thank you!
