
Multi-Core Architectures and Programming

OpenCL Implementation of OpenCV Functions

Julian Mueller, Patrick Schneider

Hardware/Software Co-Design
University of Erlangen-Nuremberg

August 18, 2011


Table of contents

1 OpenCL - Overview

2 OpenCL - Programming

3 Function: Image resize

4 Conclusion


Khronos

- OpenCL = Open Computing Language.
- Specified by the Khronos Group.


Model views

- Platform model.
- Execution model.
- Memory model.

Framework

- Relationships between the framework components.




Parallelism

- N-dimensional computation space (N <= 3).
- The computation space defines the number of kernel instances (work-items) that can execute in parallel.
- Example: image processing with N = 2 and a 1024 x 1024 image: one kernel instance per pixel = 1,048,576 kernel instances.
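The pixel count in the example is simply the product of the extents of the index space. A minimal sketch of that arithmetic follows; the function name totalWorkItems is hypothetical and not part of the OpenCL API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Total number of work-items in an N-dimensional index space (N <= 3):
// the product of the extents of all dimensions.
// Hypothetical helper for illustration, not an OpenCL call.
std::size_t totalWorkItems(const std::vector<std::size_t>& globalSize) {
    assert(!globalSize.empty() && globalSize.size() <= 3);
    std::size_t total = 1;
    for (std::size_t extent : globalSize) {
        total *= extent;
    }
    return total;
}
```

For the 1024 x 1024 image, totalWorkItems({1024, 1024}) gives the 1,048,576 instances quoted above.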


Initialization objects

- Setup:
  - Devices: GPU, CPU, Cell/B.E.
  - Contexts: creating the context.
  - Queues: communication between the host and the kernel instances.
- Memory:
  - Buffers: 1D memory blocks.
  - Images: 2D or 3D textures.
- Execution:
  - Programs: the actual code.
  - Kernels: threads on the hardware.
- Synchronization:
  - Events: e.g. interleaved memory accesses, run-time analysis.


Problems during implementation

- Relatively large overhead for the initialization (the initialization file, including the case distinction for 3 different kernels, is about 600 lines of code).
- The OpenCL man pages are not always informative.
- Correct creation and release of the individual objects, e.g.:
  - Source of error: allocating memory based on the global work size versus the number of pixels.
  - Source of error: respecting the maximum work-group size.
- The kernel code is a text string; it is compiled at run time.
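One reading of the two allocation pitfalls in the list above: in OpenCL 1.x the global work size has to be a multiple of the work-group size, so it is usually padded, while the image buffers are still sized by the real pixel count. The sketch below shows that bookkeeping under these assumptions; roundUp, LaunchSize, planLaunch and fitsWorkGroupLimit are hypothetical helpers, and one byte per pixel is assumed.

```cpp
#include <cassert>
#include <cstddef>

// Round `value` up to the next multiple of `multiple` (multiple > 0).
std::size_t roundUp(std::size_t value, std::size_t multiple) {
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Sizes a host would need before enqueueing a 2D kernel (hypothetical struct).
struct LaunchSize {
    std::size_t globalX;      // padded index space, dimension 0
    std::size_t globalY;      // padded index space, dimension 1
    std::size_t bufferBytes;  // destination buffer size, 1 byte per pixel assumed
};

LaunchSize planLaunch(std::size_t width, std::size_t height,
                      std::size_t localX, std::size_t localY) {
    LaunchSize plan;
    plan.globalX = roundUp(width, localX);
    plan.globalY = roundUp(height, localY);
    // Sized by the number of pixels, not by globalX * globalY.
    plan.bufferBytes = width * height;
    return plan;
}

// The work-group must not exceed the device limit
// (queried as CL_DEVICE_MAX_WORK_GROUP_SIZE on a real device).
bool fitsWorkGroupLimit(std::size_t localX, std::size_t localY,
                        std::size_t maxWorkGroupSize) {
    return localX * localY <= maxWorkGroupSize;
}
```

For a 385 x 385 image and a 16 x 16 work-group this yields a 400 x 400 index space but a buffer of only 148,225 bytes, so the kernel has to ignore the padding work-items.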



- What happens on compiler errors? Additional functions have to be implemented.

    ciErr1 = clBuildProgram(cpProgram, 0, NULL, NULL, NULL, NULL);
    printf("clBuildProgram...\n");
    if (ciErr1 != CL_SUCCESS)
    {
        cl_build_status build_status;
        ciErr1 = clGetProgramBuildInfo(cpProgram, cdDevice, CL_PROGRAM_BUILD_STATUS,
                                       sizeof(cl_build_status), &build_status, NULL);
        // if the program build fails, print the error messages
        if (build_status != CL_BUILD_SUCCESS) {
            char *build_log;
            size_t ret_val_size;
            // first call: query the size of the build log
            ciErr1 = clGetProgramBuildInfo(cpProgram, cdDevice, CL_PROGRAM_BUILD_LOG, 0,
                                           NULL, &ret_val_size);
            if (ciErr1 != CL_SUCCESS) {
                Cleanup(EXIT_FAILURE);
            }
            // second call: fetch the build log itself
            build_log = new char[ret_val_size + 1];
            ciErr1 = clGetProgramBuildInfo(cpProgram, cdDevice, CL_PROGRAM_BUILD_LOG,
                                           ret_val_size, build_log, NULL);
            if (ciErr1 != CL_SUCCESS) {
                Cleanup(EXIT_FAILURE);
            }
            // there's no information in the reference whether the string is 0-terminated or not
            build_log[ret_val_size] = '\0';
            std::cout << "BUILD LOG: " << std::endl;
            std::cout << build_log << std::endl;
            delete[] build_log;
        }
    }

    // Create the kernel
    ckKernel = clCreateKernel(cpProgram, cExecutableName, &ciErr1);
    printf("clCreateKernel (%s)...\n", cExecutableName);
    if (ciErr1 != CL_SUCCESS)
    {
        printf("Error in clCreateKernel, Line %u in file %s !!!\n\n", __LINE__, __FILE__);
        Cleanup(EXIT_FAILURE);
    }



- Kernel program code:
  - Double-precision calculations require special pragma definitions.

    // enable double datatype
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

  - Working with textures, accessing elements.

    // sampler: variable that specifies how the image data is processed
    // (x, y) stands for the coordinate vector
    c1 = read_imageui(img, sampler, (x, y)).x;




Variant 1: Sequential

- Bilinear interpolation.
- Implementation details:
  - 2 nested loops.
  - Requires the data of the source image as well as the dimensions/memory allocation of the destination image.
  - Source image: read operations only.
  - Destination image: write operations are independent of each other.
  - -> In theory, the loops parallelize well.

    // img1 is written to, so it is passed as a non-const reference
    void resize_seq(const Mat &img, Mat &img1) {
        ...
        for (int i = 0; i < size_x; i++) {
            ...
            for (int j = 0; j < size_y; j++) {
                ...
                img1.at<int>(j, i) = interpol_value;
            }
        }
    }
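The slide elides the interpolation itself. As a rough, self-contained illustration of the two-loop structure, the sketch below resizes an 8-bit grayscale image bilinearly. GrayImage and resizeSeq are hypothetical stand-ins for cv::Mat and resize_seq, and the coordinate mapping (destination corners land on source corners) is an assumption, not necessarily the mapping the authors used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simple 8-bit grayscale image stored row by row (stand-in for cv::Mat).
struct GrayImage {
    int width;
    int height;
    std::vector<unsigned char> data;

    GrayImage(int w, int h)
        : width(w), height(h), data(static_cast<std::size_t>(w) * h, 0) {}
    unsigned char at(int x, int y) const {
        return data[static_cast<std::size_t>(y) * width + x];
    }
    unsigned char& at(int x, int y) {
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// Bilinear resize with two nested loops: src is only read, and every dst
// pixel is written exactly once, independently of all other pixels.
void resizeSeq(const GrayImage& src, GrayImage& dst) {
    assert(src.width > 0 && src.height > 0);
    // Assumed mapping: destination corner pixels map onto source corner pixels.
    const float scaleX = dst.width > 1 ? float(src.width - 1) / float(dst.width - 1) : 0.0f;
    const float scaleY = dst.height > 1 ? float(src.height - 1) / float(dst.height - 1) : 0.0f;
    for (int i = 0; i < dst.width; i++) {
        const float sx = i * scaleX;
        const int x0 = std::min(static_cast<int>(sx), src.width - 1);
        const int x1 = std::min(x0 + 1, src.width - 1);
        const float fx = sx - x0;
        for (int j = 0; j < dst.height; j++) {
            const float sy = j * scaleY;
            const int y0 = std::min(static_cast<int>(sy), src.height - 1);
            const int y1 = std::min(y0 + 1, src.height - 1);
            const float fy = sy - y0;
            // Blend horizontally on the two rows, then vertically.
            const float top = (1.0f - fx) * src.at(x0, y0) + fx * src.at(x1, y0);
            const float bottom = (1.0f - fx) * src.at(x0, y1) + fx * src.at(x1, y1);
            dst.at(i, j) = static_cast<unsigned char>((1.0f - fy) * top + fy * bottom + 0.5f);
        }
    }
}
```

Because no iteration reads what another one writes, the loop body can be moved into a kernel unchanged, which is what the parallel variants do.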


Variant 2: Parallel buffer

- 1-dimensional data type.
- Relatively simple implementation, because the data structure is accessed as in C/C++.
- Problem: GPU hardware memory: poor alignment, depending on the image size.

    // Kernel program code
    __kernel void parallel_resize(__global const unsigned char *img, __global unsigned char *img1,
                                  int img_h, int img_w, int img1_h, int img1_w)
    {
        // get index into global data array
        int i = get_global_id(0);
        int j = get_global_id(1);
        ...
        img1[j * img1_w + i] = (unsigned char) interpol_value;
    }
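To make the index arithmetic of the buffer kernel concrete, the following sketch emulates the 2D index space on the host. It assumes that the global size may be padded beyond the image, so each work-item first checks its indices; nearest-neighbour sampling stands in for the elided interpolation, and resizeWorkItem and runIndexSpace are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Body of one work-item; (i, j) play the role of get_global_id(0) and
// get_global_id(1). Nearest-neighbour sampling replaces the elided
// bilinear interpolation to keep the sketch short.
void resizeWorkItem(const unsigned char* img, unsigned char* img1,
                    int img_h, int img_w, int img1_h, int img1_w,
                    int i, int j) {
    if (i >= img1_w || j >= img1_h) {
        return;  // padding work-items must not write outside the buffer
    }
    const int srcX = i * img_w / img1_w;
    const int srcY = j * img_h / img1_h;
    img1[j * img1_w + i] = img[srcY * img_w + srcX];  // row-major flat index
}

// Host-side emulation: visit every index of a (possibly padded) 2D index space.
void runIndexSpace(const std::vector<unsigned char>& img,
                   std::vector<unsigned char>& img1,
                   int img_h, int img_w, int img1_h, int img1_w,
                   int globalX, int globalY) {
    for (int j = 0; j < globalY; j++) {
        for (int i = 0; i < globalX; i++) {
            resizeWorkItem(img.data(), img1.data(), img_h, img_w, img1_h, img1_w, i, j);
        }
    }
}
```

Without the bounds check, a work-item with j equal to img1_h would write past the last row of the destination buffer.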


Variant 3: Parallel image

- 2-dimensional data type (texture object).
- Advantages over the buffer:
  - Alignment.
  - Special instructions (hardware filtering).
- Variant 3a/b: without/with use of hardware filtering.

    // Kernel program code
    // image arguments take an access qualifier instead of __global
    __kernel void parallel_resize(__read_only image2d_t img, __global unsigned char *img1,
                                  int img1_h, int img1_w)
    {
        ....
    }

    // Initialization on the host
    ....
    img2Dformat = {CL_R, CL_UNSIGNED_INT8};
    img = clCreateImage2D(cxGPUContext, CL_MEM_READ_ONLY, &img2Dformat,
                          (size_t) img_w, (size_t) img_h, 0, NULL, &ciErr1);
    ...


Comparison of variants 3a/b

    // Kernel program code Parallel-Image, variant a
    __kernel void parallel_resize(__read_only image2d_t img, __global unsigned char *img1,
                                  int img1_h, int img1_w)
    {
        ....
        // Access to 4 elements is necessary.
        c1 = read_imageui(img, sampler_a, (floor_x, floor_y)).x;
        c2 = read_imageui(img, sampler_a, (floor_x, ceil_y)).x;
        c3 = read_imageui(img, sampler_a, (ceil_x, floor_y)).x;
        c4 = read_imageui(img, sampler_a, (ceil_x, ceil_y)).x;
        ...
    }

    // Kernel program code Parallel-Image, variant b
    __kernel void parallel_resize(__read_only image2d_t img, __global unsigned char *img1,
                                  int img1_h, int img1_w)
    {
        ...
        // Single access, the hardware takes care of everything else.
        c = read_imageui(img, sampler_b, (x, y)).x;
        ...
    }
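The difference between the two variants can be emulated on the host. The sketch below, with hypothetical names Texture, fetch, sampleManual and sampleLinearFilter, blends four texels by hand as in variant a and then expresses a linearly filtered lookup as in variant b through the same blend. It assumes clamp-to-edge addressing and the OpenCL convention for unnormalized coordinates, where texel centers sit at +0.5. Note that the OpenCL 1.x specification defines linear filtering for read_imagef; for read_imageui only the nearest filter has defined results, so a real variant b may need a normalized image format.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Minimal single-channel texture (stand-in for an image2d_t).
struct Texture {
    int width;
    int height;
    std::vector<float> texels;  // row-major, width * height values
};

// One texel read with clamp-to-edge addressing.
float fetch(const Texture& t, int x, int y) {
    x = std::min(std::max(x, 0), t.width - 1);
    y = std::min(std::max(y, 0), t.height - 1);
    return t.texels[y * t.width + x];
}

// Variant a: four explicit reads, blended in the kernel.
float sampleManual(const Texture& t, float x, float y) {
    const int floorX = static_cast<int>(std::floor(x));
    const int floorY = static_cast<int>(std::floor(y));
    const int ceilX = floorX + 1;
    const int ceilY = floorY + 1;
    const float fx = x - floorX;
    const float fy = y - floorY;
    const float c1 = fetch(t, floorX, floorY);
    const float c2 = fetch(t, floorX, ceilY);
    const float c3 = fetch(t, ceilX, floorY);
    const float c4 = fetch(t, ceilX, ceilY);
    return (1 - fx) * (1 - fy) * c1 + (1 - fx) * fy * c2
         + fx * (1 - fy) * c3 + fx * fy * c4;
}

// Variant b, emulated: a single lookup whose blend the hardware performs.
// With unnormalized coordinates the filter works on (u - 0.5, v - 0.5).
float sampleLinearFilter(const Texture& t, float u, float v) {
    return sampleManual(t, u - 0.5f, v - 0.5f);
}
```

The kernel in variant b therefore shrinks to one read, at the cost of having to account for the half-texel offset when computing the coordinates.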


Performance

- The total run times of the different implementations differ only within statistical fluctuation.
- Reason: the image resize accounts for only a small part of the actual computational effort.
- The run times of the image resize algorithm are therefore considered separately.


Performance

[Figure: Fermi, Lenna.png, work-group size = 16x16. Comparison of CPU vs. GPU variants (cpu, buffer, image, image_hw_filter). x-axis: resolution in pixels, 27x27 to 385x385; y-axis: time in ms, 0 to 9.]

[Figure: Comparison of GPU variants (buffer, image, image_hw_filter). x-axis: resolution in pixels, 27x27 to 385x385; y-axis: time in ms, 0 to 0.05.]


Performance

[Figure: x-axis: resolution in pixels, 27x27 to 385x385; y-axis: time in ms, 0 to 10.]

[Figure: x-axis: resolution in pixels, 27x27 to 385x385; y-axis: time in ms, 0 to 0.04.]


Performance

[Figure: Fermi, 10 MP image. x-axis: resolution in pixels, 32x48 to 2352x3529; y-axis: time in ms, 0 to 600.]

[Figure: Fermi, 10 MP image. x-axis: resolution in pixels, 32x48 to 2352x3529; y-axis: time in ms, 0 to 1.6.]




Results

- Image resize function: a scenario that parallelizes well.
- Success when only the pure computation time is considered.
- Problems:
  - Initialization overhead of approx. 1 s (one-time).
  - Communication and allocation overhead.
- Only worthwhile for large images.
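The trade-off in the list above can be written as a break-even condition. This is a sketch with assumed symbols, none of which appear on the slides: n is the number of resize calls, T_init the one-time initialization overhead (about 1 s according to the slide), T_comm the per-call communication and allocation overhead, and T_kernel and T_cpu the per-call computation times on GPU and CPU.

```latex
T_{\text{init}} + n\,\bigl(T_{\text{comm}} + T_{\text{kernel}}\bigr) < n\,T_{\text{cpu}}
\quad\Longleftrightarrow\quad
n > \frac{T_{\text{init}}}{T_{\text{cpu}} - T_{\text{comm}} - T_{\text{kernel}}},
\qquad \text{provided } T_{\text{cpu}} > T_{\text{comm}} + T_{\text{kernel}}.
```

For small images the denominator can become negative, in which case the GPU version never pays off, which is consistent with the last bullet.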


...

Thank you for your attention



