MOTOROLA and the Stylized M Logo are trademarks or registered trademarks of Motorola Trademark Holdings, LLC. All other trademarks are the property of their respective owners. © 2010 Motorola Mobility, Inc. All rights reserved.

Hardware-accelerated computing and rendering with Node.JS (node-webcl, node-webgl, node-glfw, node-image)

Mikaël Bourges-Sévenier, Motorola Mobility September 6, 2012

Content

- Motivation
  - General-purpose computing on GPU
  - Why Node.JS?
  - Notes on multi-threading
- Overview of my Node.JS modules
  - Architecture
  - Installation
  - Examples / Demos
- Understanding WebCL / OpenCL
  - OpenCL features
  - OpenCL model
  - WebCL API
  - "Hello World" code walkthrough
- Perspectives

2012-09-06 © 2012 Motorola Mobility, Inc.

Motivation: General-Purpose Computing on GPU

- More and more data to process
  - Signal & image processing
  - Data mining, pattern matching, statistics…
  - Machine intelligence
  - Financial analysis
  - Physics engines, ray-tracing…
- CPUs tend to have up to 16 cores
  - General purpose
  - Launch a thread per hardware execution context
- GPUs have 100s of cores
  - Persistent threads
  - Launch a workgroup per hardware execution context
  - Designed for data-parallel computations
  - Originally developed for 3D vector graphics
  - More transistors devoted to processing than to caching and control

[Figure: CPU vs. GPU layout (Control, ALUs, Cache, DRAM). Source: David Luebke, "The Democratization of Parallel Computing", SC07]

Motivation: CPU-GPU Systems-on-a-Chip (SOCs)

[Figure: AMD Trinity, Intel Ivy Bridge, Nvidia Kepler]

Motivation: Why Node.JS?

- V8 JavaScript engine, as in Chrome browsers; cross-platform
- Fast prototyping
  - For app development
  - Great to test features before adding them to browsers
- Modular, easily extensible
  - Tons of great modules
  - Great for developing/testing/maintaining modular apps
  - Great for developing new features
- For server-side applications (no GUI, or web browser as GUI)
  - With lots of data to process (great for OpenCL)
- For client-side development
  - Same JS code running on Node.JS and browsers
  - Faster than browsers (fewer layers)

Notes on multi-threading

- JavaScript has no threading support: it is an event-based language
- Node.JS is an implementation of the Reactor pattern
  - Operations run in worker threads
- For (large) data-intensive tasks that can be parallelized, a GPU offers more processing power than a CPU

[Figure: Node.JS event loop (single thread). Callbacks are registered with the loop; long operations are deferred to worker threads, which node.js handles internally; responses come back via callbacks and are sent to clients.]
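A minimal sketch of the deferral described above, using only Node.js built-ins. crypto.pbkdf2 is an assumed stand-in for a long operation (it does not come from the talk): Node runs it on an internal worker thread and delivers the result through a callback, while the single event-loop thread carries on.

```javascript
// Sketch only: crypto.pbkdf2 stands in for any long-running operation.
const nodeCrypto = require('crypto');

const order = [];

order.push('register callback');

// The key derivation runs on a worker thread managed by Node.js;
// the callback is queued on the event loop once the work is done.
nodeCrypto.pbkdf2('secret', 'salt', 100000, 64, 'sha512', (err, key) => {
  if (err) throw err;
  order.push('response via callback');
  console.log(order.join(' -> '));
});

// This line runs immediately: the event-loop thread is not blocked.
order.push('event loop keeps running');
```

The callback always fires after the synchronous code has finished, which is why the last push above is recorded before the response.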

OVERVIEW OF MY NODE.JS MODULES

My Node.JS modules

[Figure: architecture diagram. An App runs on node.js (V8) and uses the Node.JS modules node-webcl, node-webgl, node-glfw, and node-image, which sit on top of the native libraries and hardware APIs OpenCL, OpenGL, GLFW, AntTweakBar, and FreeImage. Legend: native/hardware, Node.JS module, required dependency, optional dependency.]

My Node.JS modules: node-webgl

- Started from the Blue Lava demo for WebOS (http://minimason.no.de/)
- Emulates the WebGL 1.x API
- Uses desktop OpenGL
- Requires node-glfw
- Implements DOM document
- Implements DOM mouse & key events
- Implements HTML <Image>

My Node.JS modules: node-glfw

- Wrapper around GLFW, a cross-platform library for opening a window, creating an OpenGL context, and managing input
- Relies on GLEW to get the right OpenGL extensions
- Over time, added AntTweakBar to get a simple menu system

My Node.JS modules: node-image

- Wraps FreeImage
- Optimized native buffers for node-webcl and node-webgl

My Node.JS modules: node-webcl

- Implements the WebCL API
- Implements WebCL - WebGL interop
- Follows the WebCL spec on a weekly basis

Installation

- Requires Node.JS >= 0.7.x (due to TypedArrays)
  - Mac OSX 10.7, Microsoft Windows 7, Ubuntu 10.10+
- Make sure node-gyp is installed
- Get the latest OpenCL 1.1+ / OpenGL 2+ drivers for your GPU
  - Intel OpenCL SDK, AMD APP SDK, Nvidia CUDA 4.x SDK
- Install the latest native libraries in your library and include paths
  - GLFW http://www.glfw.org/
  - GLEW http://glew.sourceforge.net/
  - FreeImage http://freeimage.sourceforge.net/
  - AntTweakBar http://www.antisphere.com/Wiki/tools:anttweakbar
- Install node-webcl or node-webgl
  - npm install node-webcl will also install node-webgl
  - npm install node-webgl will also install node-glfw
- For examples, also install: npm install optimist

Usage

For node-webgl:

    // Expose browser-like globals (Image, document, window)
    WebGL = require('node-webgl');
    Image = WebGL.Image;
    document = WebGL.document();
    window = document;
    canvas = document.createElement("my_canvas");
    gl = canvas.getContext("experimental-webgl");

For node-webcl:

    WebCL = require('node-webcl');

Test your installation (node-webgl)

- GL = Graphics Library
- cd node-webgl
- node examples/lightgl/shadowmap.js
- node test/cube.js

[Screenshots: examples/lightgl/shadowmap.js; test/lesson08.js (node-image); test/cube.js (AntTweakBar)]

Test your installation (node-webcl)

- CL = Computing Language
- cd node-webcl
- node examples/DeviceQuery.js
- node examples/VectorAdd.js

[Screenshots: test/image.js (PBO); examples/sine.js (FBO); examples/apple/qjulia/qjulia.js (PBO, AntTweakBar)]

UNDERSTANDING WEBCL & OPENCL

What is WebCL?

- WebCL brings parallel computing to the Web through a secure JavaScript binding to OpenCL 1.1 (2011)
  - Open standard, royalty-free
  - Platform independent
  - Device independent
  - Being standardized by Khronos
- First public working draft: April 2012
  - http://www.khronos.org/webcl/

OpenCL overview

- The OpenCL framework has 2 parts
  1. Host API: C-based, cross-platform, object-oriented
     - Commands to send/receive data and control execution on devices
  2. Kernels
     - Run on devices
     - Use a subset of C99 plus extensions
     - Vector extensions (<type>N)
     - No recursion, no function pointers
     - No dynamic memory (malloc, free…), no standard libc methods (memcpy…)
     - Kernels are akin to shaders in WebGL
- Features
  - Well-defined numerical accuracy for both integers and floats
  - Rich set of built-in functions (as in GLSL, and more)
    - But no random method
  - Close to the hardware
    - Allows control over memory use
    - Allows control over thread scheduling

OpenCL Device Model

- A host is connected to one or more compute devices
- Compute device
  - A collection of one or more compute units (~ cores)
  - A compute unit is composed of one or more processing elements (~ threads)
  - Processing elements execute code as SIMD or SPMD

[Figure: OpenCL device model. A host (PC) is connected to compute devices (GPU, CPU, DSP, FPGA…); each compute device contains compute units (cores), each made of processing elements (threads).]

examples/DeviceQuery.js

- Queries parameters of all OpenCL devices attached to your computer
- Example: on a MacBook Pro (early 2011), OSX 10.8

Found 2 devices

Parameter                         Device 1                                    Device 2
DEVICE_NAME                       Intel(R) Core(TM) i7-2720QM CPU @ 2.20GHz   ATI Radeon HD 6750M
DEVICE_VENDOR                     Intel                                       AMD
DRIVER_VERSION                    1.1                                         1.0
DEVICE_VERSION                    OpenCL 1.2                                  OpenCL 1.1
DEVICE_PROFILE                    FULL_PROFILE                                FULL_PROFILE
DEVICE_OPENCL_C_VERSION           OpenCL C 1.2                                OpenCL C 1.1
DEVICE_TYPE                       cpu                                         gpu
DEVICE_MAX_COMPUTE_UNITS          8                                           6
DEVICE_MAX_WORK_ITEM_DIMENSIONS   3                                           3
DEVICE_MAX_WORK_ITEM_SIZES        1024 / 1 / 1                                1024 / 1024 / 1024
DEVICE_MAX_WORK_GROUP_SIZE        1024                                        1024
DEVICE_MAX_CLOCK_FREQUENCY        2200 MHz                                    600 MHz

DeviceQuery

Reading the output for the same MacBook Pro (early 2011), OSX 10.8:

- CPU (Intel Core i7-2720QM): 4 cores, hyperthreaded => 8 compute units; up to 1024 threads in 1D, at 2.2 GHz
- GPU (ATI Radeon HD 6750M): 6 compute units; up to 1024 threads, but in 3 dimensions, at 600 MHz

OpenCL Execution Model

- Kernel
  - Basic unit of executable code (~ DLL entry point)
  - Data-parallel or task-parallel
- Program
  - Collection of kernels and functions called by kernels
  - Analogous to a dynamic library (DLL)
- Command queue
  - Controls operations on OpenCL objects (memory transfers, kernel execution, synchronization)
  - Commands are queued in order
  - Execution is in-order or out-of-order
  - Applications may use multiple command queues per device
- Work-item
  - An execution of a kernel by a processing element (~ thread)
- Work-group
  - A collection of work-items that execute on a single compute unit (~ core)

[Figure: a context containing a GPU and a CPU, each with its own command queue]

OpenCL Work-group 2D analogy

- # work-items = # pixels
- # work-groups = # tiles
- Work-group size = tileW * tileH
- All threads in a workgroup run synchronously

[Figure: an image (global domain) divided into tiles (local domains)]
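The pixel/tile analogy can be worked through in plain JavaScript. This is a sketch with made-up sizes: imageW, imageH, tileW, tileH, and the locate helper are illustrative names, not part of WebCL. It shows how a global coordinate splits into a work-group index and a local index, which is what get_group_id() and get_local_id() return inside an OpenCL kernel.

```javascript
// Hypothetical sizes, for illustration only.
const imageW = 8, imageH = 4;   // global size: one work-item per pixel
const tileW = 4, tileH = 2;     // local size: one work-group per tile

// Maps a global (pixel) coordinate to its work-group and local coordinates.
function locate(gx, gy) {
  return {
    group: [Math.floor(gx / tileW), Math.floor(gy / tileH)],
    local: [gx % tileW, gy % tileH],
  };
}

const workItems = imageW * imageH;                        // # pixels
const workGroups = (imageW / tileW) * (imageH / tileH);   // # tiles
const workGroupSize = tileW * tileH;                      // work-items per tile

console.log(workItems, workGroups, workGroupSize);
console.log(locate(5, 3));
```

With these sizes there are 32 work-items in 4 work-groups of 8, and pixel (5, 3) falls in tile (1, 1) at local position (1, 1).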

OpenCL Kernel

- Defined on an N-dimensional computation domain
- A kernel is executed at each point of the computation domain

    // In JavaScript
    function multiply(a, b, n) {
      var c = [];
      for (var i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
      return c;
    }

    // In OpenCL C99
    /**
     * @param a, b, c are buffers in global memory
     * @param n number of elements in a, b, and c
     */
    __kernel void multiply(__global const float *a,
                           __global const float *b,
                           __global float *c,
                           unsigned int n)
    {
      unsigned int tid = get_global_id(0);  // thread number
      if (tid >= n) return;                 // make sure we don't run past the end of the buffers
      c[tid] = a[tid] * b[tid];
    }
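To make "executed at each point of the computation domain" concrete, here is a sequential emulation in plain JavaScript. It is a sketch, not how a device runs kernels: runNDRange1D and multiplyKernel are hypothetical names, and the loop visits work-items one after another, whereas a GPU runs them in parallel.

```javascript
// Sequential stand-in for launching a kernel over a 1D domain:
// the kernel body is called once per global id.
function runNDRange1D(kernel, globalSize, args) {
  for (let gid = 0; gid < globalSize; ++gid) {
    kernel(gid, ...args);
  }
}

// JavaScript transcription of the OpenCL "multiply" kernel;
// gid plays the role of get_global_id(0).
function multiplyKernel(gid, a, b, c, n) {
  if (gid >= n) return;          // extra work-items do nothing
  c[gid] = a[gid] * b[gid];
}

const n = 5;
const a = Float32Array.from([1, 2, 3, 4, 5]);
const b = Float32Array.from([10, 20, 30, 40, 50]);
const c = new Float32Array(n);

// Launch 8 work-items for 5 elements: the bounds check protects the buffers.
runNDRange1D(multiplyKernel, 8, [a, b, c, n]);
console.log(Array.from(c));
```

Launching more work-items than elements is deliberate here: it shows why the kernel keeps its bounds check.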

OpenCL Memory Model

- On host
  - CPU RAM
- On compute device
  - Global memory = GPU RAM
  - Constant memory = cached global memory
  - Texture memory = cached global memory optimized for streaming reads
  - Local memory = high-speed memory shared among work-items of a work-group (~ L1 cache)
  - Private memory = registers of a work-item, very fast memory
- Memory management is explicit
  - App must move data host ➞ global ➞ local and back

[Figure: OpenCL memory hierarchy. On the compute device, each work-item (1..M) has private memory, each workgroup (1..N) has local memory, and all workgroups share global memory with constant and texture caches. The host, with its host memory, drives the device through command queues and API calls.]
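The explicit host ➞ global ➞ local ➞ back movement can be emulated in plain JavaScript. This is only a sketch under stated assumptions: sumPerWorkGroup is a hypothetical function, typed arrays stand in for device memory, and the loops run sequentially where a device would run work-items in parallel (with a barrier between staging and reading).

```javascript
// Emulates one kernel launch that sums each work-group's slice of the input.
function sumPerWorkGroup(hostData, localSize) {
  // host -> global: copy the input into the (emulated) device buffer
  const globalMem = Float32Array.from(hostData);
  const numGroups = Math.ceil(globalMem.length / localSize);
  const partialSums = new Float32Array(numGroups);   // results, in global memory

  for (let group = 0; group < numGroups; ++group) {
    // global -> local: each work-item stages one element into the group's scratch memory
    const localMem = new Float32Array(localSize);
    for (let lid = 0; lid < localSize; ++lid) {
      const gid = group * localSize + lid;
      localMem[lid] = gid < globalMem.length ? globalMem[gid] : 0;
    }
    // local -> private: accumulate in a variable that plays the role of a register
    let acc = 0;
    for (let lid = 0; lid < localSize; ++lid) acc += localMem[lid];
    // private -> global: publish the group's result
    partialSums[group] = acc;
  }

  // global -> host: read the results back
  return Array.from(partialSums);
}

console.log(sumPerWorkGroup([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 4));
```

Ten elements with a local size of 4 give three work-groups; the last one is padded with zeros.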

WebCL API

- Same OO model as OpenCL, with JS classes
- WebCL is a global object

[Figure: WebCL class diagram spanning the platform layer, compiler layer, and runtime layer. Classes shown: WebCL, WebCLPlatform, WebCLExtension, WebCLDevice, WebCLContext, WebCLProgram, WebCLKernel, CommandQueue, Event, Sampler, and WebCLMemoryObject {abstract} with WebCLBuffer and WebCLImage.]

“HELLO WORLD” CODE WALKTHROUGH

WebCL sequence (host side)

- Create context
- Compile kernels
- Set up command queues
- Set up kernel arguments
- Execute commands
- Read results

[Figure: flowchart of the host-side sequence across the platform layer, compiler, and runtime layer: select platform; select device; create context; load and compile kernels on devices; create buffers to store data on devices; update kernel arguments; create command queues for each device; send data to devices using their command queues; send commands to devices using their command queues; get data from devices using their command queues; release resources.]

WebCL sequence (host side) [1/6]

    // create the OpenCL context
    try {
      clContext = WebCL.createContext({
        deviceType: WebCL.DEVICE_TYPE_GPU
      });
    } catch (err) {
      throw "Error: Failed to create context! " + err;
    }

    var devices = clContext.getInfo(WebCL.CONTEXT_DEVICES);
    if (!devices) {
      throw "Error: Failed to retrieve compute devices for context!";
    }

WebCL sequence (host side) [2/6]

    <script id="multiply_script" type="x-webcl">
    __kernel void multiply(__global const float *a,
                           __global const float *b,
                           __global float *c,
                           unsigned int n)
    {
      unsigned int tid = get_global_id(0);  // thread number
      if (tid >= n) return;                 // make sure we don't run past the end of the buffers
      c[tid] = a[tid] * b[tid];
    }
    </script>

    // Create the compute program from the source buffer (text).
    // getSource() is a helper, not shown on the slide, that returns the text of the <script> element.
    clProgram = clContext.createProgram(getSource("multiply_script"));

    // Build the program executable.
    // clDevice is the target device (not defined on the slide), e.g. one entry of the context's device list.
    try {
      clProgram.build(clDevice, '-cl-fast-relaxed-math -DDEBUG=1');
    } catch (err) {
      throw "Error: Failed to build program executable!\n"
          + clProgram.getBuildInfo(clDevice, WebCL.PROGRAM_BUILD_LOG);
    }

    clKernel = clProgram.createKernel("multiply");

WebCL sequence (host side) [3/6]

    var BUFFER_SIZE = 10;
    // Float32Array matches the float buffers declared by the multiply kernel
    var A = new Float32Array(BUFFER_SIZE);
    var B = new Float32Array(BUFFER_SIZE);

    // store data in A and B
    …

    var size = BUFFER_SIZE * Float32Array.BYTES_PER_ELEMENT; // size in bytes

    // Create device buffers for A and B (host contents are written to them through the command queue)
    var aBuffer = clContext.createBuffer(WebCL.MEM_READ_ONLY, size);
    var bBuffer = clContext.createBuffer(WebCL.MEM_READ_ONLY, size);

    // Create buffer for C to read results
    var cBuffer = clContext.createBuffer(WebCL.MEM_WRITE_ONLY, size);
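The "size in bytes" passed to createBuffer can be checked without any OpenCL device. A minimal sketch in plain JavaScript: Float32Array is used here because the multiply kernel declares float buffers (any 4-byte element type gives the same size), and the fill values are made up for illustration.

```javascript
const BUFFER_SIZE = 10;                       // number of elements
const A = new Float32Array(BUFFER_SIZE);
const B = new Float32Array(BUFFER_SIZE);

// Store some illustrative data in A and B.
for (let i = 0; i < BUFFER_SIZE; ++i) {
  A[i] = i;
  B[i] = 2 * i;
}

// Size in bytes = element count * bytes per element.
const size = BUFFER_SIZE * Float32Array.BYTES_PER_ELEMENT;

// A typed array reports the same figure through byteLength.
console.log(size, A.byteLength, B.byteLength);
```

Device buffers are sized in bytes, not elements, so passing BUFFER_SIZE directly would allocate a buffer four times too small.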

WebCL sequence (host side) [4/6]

    // Kernel signature, for reference:
    // __kernel void multiply(__global const float *a, __global const float *b,
    //                        __global float *c, unsigned int n);

    // Set kernel args
    clKernel.setArg(0, aBuffer);
    clKernel.setArg(1, bBuffer);
    clKernel.setArg(2, cBuffer);
    clKernel.setArg(3, BUFFER_SIZE, WebCL.type.UINT);

    // Create command queue
    clQueue = clContext.createCommandQueue(devices[0]);

    // Enqueue buffer writes (false = non-blocking)
    clQueue.enqueueWriteBuffer(aBuffer, false, 0, size, A);
    clQueue.enqueueWriteBuffer(bBuffer, false, 0, size, B);

WebCL sequence (host side) [5/6]

    // Execute (enqueue) kernel
    clQueue.enqueueNDRangeKernel(clKernel,
                                 null,           // global work offset
                                 [BUFFER_SIZE],  // global work size
                                 [2]);           // local work size

Note: Use local work size = [] or null (default) to let the driver choose the best values.
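When a local work size is given explicitly, OpenCL 1.x requires the global work size to be evenly divisible by it. A common way to satisfy this is to round the global size up and let the kernel's bounds check (if(tid >= n) return;) ignore the padding work-items. The sketch below is plain JavaScript; roundUp is a hypothetical helper name, not a WebCL function.

```javascript
// Rounds globalSize up to the next multiple of localSize.
function roundUp(globalSize, localSize) {
  const remainder = globalSize % localSize;
  return remainder === 0 ? globalSize : globalSize + localSize - remainder;
}

console.log(roundUp(10, 2));     // already a multiple of the local size
console.log(roundUp(10, 4));     // padded with 2 idle work-items
console.log(roundUp(1000, 16));  // padded with 8 idle work-items
```

With the values on the slide (10 elements, local size 2) no padding is needed; with a local size of 4 the launch would use 12 work-items, two of which return immediately.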

WebCL sequence (host side) [6/6]

    // get results and block while getting them
    // Float32Array matches the float output buffer of the multiply kernel
    var C = new Float32Array(BUFFER_SIZE);
    clQueue.enqueueReadBuffer(cBuffer,
                              true,  // blocking call
                              0, size, C);

Example: Matrix multiplication

- "Hello World of CL"
- C = A x B
- N x N matrices

[Figure: matrices A and B and their product C]

Example: Matrix multiplication

- Optimization
  - N x N matrices
  - C divided into m x m tiles
  - With
    - m = N / P
    - P = # threads per workgroup (16)

[Figure: matrices A, B, and C divided into tiles]
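The talk does not show its matrix kernels, so the following is only a sketch of the two strategies in plain JavaScript, run sequentially. matMulNaive computes each element of C the way a single work-item would; matMulTiled processes C tile by tile and stages tiles of A and B into small scratch arrays that stand in for local memory. Function names and the tile edge T are illustrative, and N is assumed to be a multiple of T.

```javascript
// Naive version: one "work-item" per element of C (row-major N x N matrices).
function matMulNaive(A, B, N) {
  const C = new Float32Array(N * N);
  for (let row = 0; row < N; ++row) {
    for (let col = 0; col < N; ++col) {
      let sum = 0;
      for (let k = 0; k < N; ++k) sum += A[row * N + k] * B[k * N + col];
      C[row * N + col] = sum;
    }
  }
  return C;
}

// Tiled version: each T x T tile of C is one "work-group".
// tileA and tileB stand in for local memory shared by that group.
function matMulTiled(A, B, N, T) {
  const C = new Float32Array(N * N);
  const tileA = new Float32Array(T * T);
  const tileB = new Float32Array(T * T);
  for (let tr = 0; tr < N; tr += T) {
    for (let tc = 0; tc < N; tc += T) {
      for (let tk = 0; tk < N; tk += T) {
        // Stage one tile of A and one tile of B.
        for (let i = 0; i < T; ++i) {
          for (let j = 0; j < T; ++j) {
            tileA[i * T + j] = A[(tr + i) * N + (tk + j)];
            tileB[i * T + j] = B[(tk + i) * N + (tc + j)];
          }
        }
        // Accumulate the partial products into the tile of C.
        for (let i = 0; i < T; ++i) {
          for (let j = 0; j < T; ++j) {
            let sum = C[(tr + i) * N + (tc + j)];
            for (let k = 0; k < T; ++k) sum += tileA[i * T + k] * tileB[k * T + j];
            C[(tr + i) * N + (tc + j)] = sum;
          }
        }
      }
    }
  }
  return C;
}

// Small check: a 4 x 4 matrix of the integers 0..15, multiplied by itself.
const ramp = Float32Array.from({ length: 16 }, (_, i) => i);
const naive = matMulNaive(ramp, ramp, 4);
const tiled = matMulTiled(ramp, ramp, 4, 2);
console.log(Array.from(naive).join(',') === Array.from(tiled).join(','));
```

On a real device the benefit of tiling comes from reusing each staged element T times out of fast local memory instead of re-reading global memory; the sequential emulation only demonstrates that both versions compute the same product.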

Example: Comparison with sequential

- MacBook Pro (early 2011), OSX 10.8
  - CPU: Intel Core i7, 2.2 GHz, 4 cores
  - GPU: AMD Radeon HD 6750M, 1 GB, 480 SPU, 600 MHz, 576 GFLOPS

[Figure: chart of speedup factor (axis 0 to 250) versus matrix size (128, 256, 512, 1024, 2048) for OpenMP, CL (CPU), CL (GPU), and CL (GPU opt)]
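The 576 GFLOPS figure is consistent with the other GPU specifications if one assumes 2 floating-point operations per stream processor per cycle (one multiply-add). That factor is an assumption, not something stated in the talk; the arithmetic can be sketched as:

```javascript
// Specs quoted for the AMD Radeon HD 6750M.
const streamProcessors = 480;
const clockMHz = 600;

// Assumption (not stated in the talk): one multiply-add,
// i.e. 2 floating-point operations, per stream processor per cycle.
const flopsPerCycle = 2;

// MFLOPS -> GFLOPS
const peakGflops = (streamProcessors * clockMHz * flopsPerCycle) / 1000;
console.log(peakGflops + ' GFLOPS');
```

This is a theoretical peak; measured speedups such as those in the chart depend on memory access patterns as much as on raw arithmetic throughput.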

WEBCL – WEBGL INTEROP.

WebCL / WebGL interop

- WebCL context created from WebGL context
- Configure shared CL objects from GL counterparts
- Sync GL and CL
  - Flush GL, acquire GL object
  - Execute CL
  - Release CL object, flush CL
- Vertex arrays, textures, render-buffers can be shared with CL

[Figure: interop flow. Initialization: initialize WebGL; initialize WebCL; configure shared CL-GL data. Rendering loop (per frame): set kernel args; enqueue commands; execute kernels; update scene; render scene.]

WebCL / WebGL interop

    // Create WebGL context
    var gl = canvas.getContext("experimental-webgl");
    // Init GL
    …

    // create the OpenCL context
    try {
      clContext = WebCL.createContext({
        deviceType: WebCL.DEVICE_TYPE_GPU,
        shareGroup: gl
      });
    } catch (err) {
      throw "Error: Failed to create context! " + err;
    }

WebCL / WebGL interop (texture)

    // Create OpenGL texture object
    gl.activeTexture(gl.TEXTURE0);
    glTexture = gl.createTexture();
    gl.bindTexture(gl.TEXTURE_2D, glTexture);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, TextureWidth, TextureHeight, 0,
                  gl.RGBA, gl.UNSIGNED_BYTE, null);
    gl.bindTexture(gl.TEXTURE_2D, null);

    // Create the compute program from the source buffer (text).
    // getSource() is a helper, not shown on the slide, that returns the text of the <script> element.
    clProgram = clContext.createProgram(getSource("multiply_script"));

    // Build the program executable
    try {
      clProgram.build(clDevice, '-cl-fast-relaxed-math -DDEBUG=1');
    } catch (err) {
      throw "Error: Failed to build program executable!\n"
          + clProgram.getBuildInfo(clDevice, WebCL.PROGRAM_BUILD_LOG);
    }

    clKernel = clProgram.createKernel("multiply");

Demo: GL Texture update with CL

- Based on Evgeny Demidov's 2D ink droplet
- WebGL: ~26 fps
- WebCL: ~124 fps

WebCL / WebGL interop (vbo)

    // create buffer object
    glVBO = gl.createBuffer();
    gl.bindBuffer(gl.ARRAY_BUFFER, glVBO);

    // initialize buffer object (4 floats per mesh vertex)
    var sizeInBytes = mesh_width * mesh_height * 4 * Float32Array.BYTES_PER_ELEMENT;
    gl.bufferData(gl.ARRAY_BUFFER, sizeInBytes, gl.DYNAMIC_DRAW);

    // create OpenCL buffer from GL VBO
    clVBO = clContext.createFromGLBuffer(WebCL.MEM_WRITE_ONLY, glVBO);

    // set kernel args values
    clKernel.setArg(0, clVBO);
    clKernel.setArg(1, mesh_width, WebCL.type.UINT);
    clKernel.setArg(2, mesh_height, WebCL.type.UINT);

Demo: VBO update with CL

WebCL/WebGL interop (host side)

    // Sync GL and acquire buffer from GL
    gl.flush();
    clQueue.enqueueAcquireGLObjects(clTexture);

    // Set global and local work sizes for kernel
    var local = null;
    var global = [ TextureWidth, TextureHeight ];

    try {
      clQueue.enqueueNDRangeKernel(clKernel, null, global, local);
    } catch (err) {
      throw "Failed to enqueue kernel! " + err;
    }

    // Release GL texture
    clQueue.enqueueReleaseGLObjects(clTexture);
    clQueue.flush();
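The per-frame ordering (flush GL, acquire, execute, release, flush CL) is easy to get wrong, so it can help to wrap it in one function. The sketch below is an illustration only: runClOnSharedObject is a hypothetical helper, and glStub and queueStub are stand-ins that merely record calls, so the example runs without node-webgl or node-webcl. The method names are the ones used in the host-side code above.

```javascript
// Runs the CL part of one frame on an object shared with GL.
// gl and clQueue are expected to expose the methods used in the talk's code.
function runClOnSharedObject(gl, clQueue, clKernel, clObject, globalSize) {
  gl.flush();                                                      // 1. sync GL
  clQueue.enqueueAcquireGLObjects(clObject);                       // 2. hand the object to CL
  clQueue.enqueueNDRangeKernel(clKernel, null, globalSize, null);  // 3. execute CL
  clQueue.enqueueReleaseGLObjects(clObject);                       // 4. hand it back to GL
  clQueue.flush();                                                 // 5. submit the CL commands
}

// Stub objects that only record the call order (not the real modules).
const calls = [];
const glStub = { flush: () => calls.push('gl.flush') };
const queueStub = {
  enqueueAcquireGLObjects: () => calls.push('acquire'),
  enqueueNDRangeKernel: () => calls.push('execute'),
  enqueueReleaseGLObjects: () => calls.push('release'),
  flush: () => calls.push('cl.flush'),
};

runClOnSharedObject(glStub, queueStub, {}, {}, [512, 512]);
console.log(calls.join(' -> '));
```

In an application, the same function would be called once per frame with the real gl context and command queue, before the scene is rendered.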

Perspectives

- WebCL and Node.JS are a match made in heaven
  - Node.JS can process lots of events
  - WebCL can process lots of data using many devices
- WebCL enables GPGPU applications in Web browsers
  - Careful usage of the architecture can lead to impressive speedups
  - With WebGL interoperability, rich graphics Web applications are now possible
- DRAFT WebCL specification
  - Quite stable JavaScript host API
  - Focusing on more security and robustness

WebCL Open process and Resources

- Khronos open process to engage the Web community
  - Public specification drafts, mailing lists, forums
  - http://www.khronos.org/webcl/
  - [email protected]
- Nokia open source prototype for Firefox in May 2011 (LGPL)
  - http://webcl.nokiaresearch.com
- Samsung open source prototype for WebKit in July 2011 (BSD)
  - http://code.google.com/p/webcl/
- Motorola open source prototype for NodeJS in March 2012 (BSD)
  - https://github.com/Motorola-Mobility/node-webcl
  - All demos in this talk were made with node-webcl / node-webgl

Start learning Now!

- OpenCL Programming Guide - The "Red Book" of OpenCL
  - http://www.amazon.com/OpenCL-Programming-Guide-Aaftab-Munshi/dp/0321749642
- OpenCL in Action
  - http://www.amazon.com/OpenCL-Action-Accelerate-Graphics-Computations/dp/1617290173/
- Heterogeneous Computing with OpenCL
  - http://www.amazon.com/Heterogeneous-Computing-with-OpenCL-ebook/dp/B005JRHYUS
- The OpenCL Programming Book
  - http://www.fixstars.com/en/opencl/book/

Thank you!