Parallel Image Processing - Purdue Engineering


Parallel Image Processing

Lei Cao and Yan Wang

2013-04-19


Image size: 4753 × 3168 pixels (× 3 RGB channels)

LOMO effect: the image boundaries are made darker (vignette) and contrast is added to the red channel.
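A minimal sketch of the two LOMO steps, for illustration only. The slides do not give the formulas, so the contrast curve (a linear stretch around 128), the vignette falloff (quadratic in the normalized distance from the center), and all function and parameter names here are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a float to the 0..255 range of an 8-bit channel, with rounding. */
static unsigned char clamp_u8(float v) {
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (unsigned char)(v + 0.5f);
}

/* Assumed red-channel contrast: stretch values around mid-gray (128). */
static unsigned char red_tune(unsigned char r, float gain) {
    return clamp_u8(128.0f + gain * ((float)r - 128.0f));
}

/* Assumed vignette factor: 1.0 at the image center, falling to
 * (1 - strength) at the four corners. */
static float vignette_factor(int x, int y, int w, int h, float strength) {
    float cx = 0.5f * (float)(w - 1), cy = 0.5f * (float)(h - 1);
    float dx = ((float)x - cx) / cx, dy = ((float)y - cy) / cy;
    float d2 = 0.5f * (dx * dx + dy * dy);   /* 0 at center, 1 at corners */
    return 1.0f - strength * d2;
}

/* Apply red tune, then vignette, in place to an interleaved RGB buffer
 * (3 bytes per pixel, row-major). */
void lomo(unsigned char *rgb, int w, int h, float gain, float strength) {
    assert(rgb != NULL && w > 1 && h > 1);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            unsigned char *p = rgb + 3 * ((size_t)y * (size_t)w + (size_t)x);
            float f = vignette_factor(x, y, w, h, strength);
            p[0] = clamp_u8(f * (float)red_tune(p[0], gain));
            p[1] = clamp_u8(f * (float)p[1]);
            p[2] = clamp_u8(f * (float)p[2]);
        }
    }
}
```

Every pixel is processed independently, which is what makes both steps straightforward to parallelize.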


Tilt-shift effect: certain regions are put out of focus by Gaussian filtering.

Outline

Algorithm & Serial Code Profiling

OpenMP Parallelization

MPI Parallelization

Performance

Future Work


Algorithm


Tilt-shift LOMO

Serial Program Profiling

Share of serial run time by stage (values listed in legend order):

Stage         Time
Read Image      4%
Tilt Shift     71%
Red Tune        8%
Vignette       15%
Write Image     2%
Others          0%

LOMO: the Red Tune and Vignette stages.

Tested on Purdue’s Hansen cluster (Dell compute nodes with four 12-core AMD Opteron 6176 processors). Compiler: intel/12.0.084. Images are read and written with the JPEG library.

Gaussian Filtering: Center-Weighted Average

Pixel window (3 × 3):

255  135    2
221  127   18
 94  100   33

Weights:

a1  a2  a3
a4  a5  a6
a7  a8  a9

$$p(\text{center}) = \frac{\sum_{i} p(i)\, a(i)}{\sum_{i} a(i)}$$

Image size: 4753 × 3168 (× 3 RGB)
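The center-weighted average can be sketched as below. The slides name the weights only as a1 to a9. The usage note assumes the common 1-2-1 / 2-4-2 / 1-2-1 approximation of a Gaussian, and the clamp-to-edge border handling in `filter3x3` is likewise an assumption, not the authors' stated choice.

```c
#include <assert.h>
#include <stddef.h>

/* Center-weighted average of a 3x3 window:
 * p(center) = sum_i p(i) * a(i) / sum_i a(i). */
double weighted_center(const unsigned char p[9], const double a[9]) {
    double num = 0.0, den = 0.0;
    for (int i = 0; i < 9; ++i) {
        num += (double)p[i] * a[i];
        den += a[i];
    }
    assert(den != 0.0);
    return num / den;
}

/* Filter one channel of a w x h image. Neighbors that fall outside the
 * image are clamped to the nearest edge pixel (assumed border policy). */
void filter3x3(const unsigned char *src, unsigned char *dst,
               int w, int h, const double a[9]) {
    assert(src != NULL && dst != NULL && w > 0 && h > 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            unsigned char win[9];
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0) yy = 0;
                    if (yy >= h) yy = h - 1;
                    if (xx < 0) xx = 0;
                    if (xx >= w) xx = w - 1;
                    win[(dy + 1) * 3 + (dx + 1)] =
                        src[(size_t)yy * (size_t)w + (size_t)xx];
                }
            }
            dst[(size_t)y * (size_t)w + (size_t)x] =
                (unsigned char)(weighted_center(win, a) + 0.5);
        }
    }
}
```

With the assumed weights {1, 2, 1, 2, 4, 2, 1, 2, 1}, the window from the slide gives a weighted sum of 1840 over a weight total of 16, so the new center value is 115.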

Gaussian Filter:

[Figure sequence: the 3 × 3 weight window (a1 to a9) moves across the image one position at a time, covering in turn the pixel rows (255 135 2), (221 127 18), (94 100 33), then (221 127 18), (94 100 33), (32 93 105), then (94 100 33), (32 93 105), (19 18 17).]

Less computation vs. more computation: different regions of the image need different amounts of filtering work.

Load imbalance: handled with dynamic scheduling in OpenMP.
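A hedged sketch of how dynamic scheduling addresses this kind of imbalance. The per-row blur radius (`row_radius`), the focus band, and the chunk size of 4 are hypothetical. A horizontal box blur stands in for the Gaussian filter to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-row blur radius: 0 inside the focus band
 * [focus_lo, focus_hi], growing with distance from it, capped at max_radius. */
int row_radius(int y, int focus_lo, int focus_hi, int max_radius) {
    int d = 0;
    if (y < focus_lo) d = focus_lo - y;
    else if (y > focus_hi) d = y - focus_hi;
    return d < max_radius ? d : max_radius;
}

/* Blur each row of a single-channel w x h image with its own radius.
 * Rows far from the focus band cost more than rows inside it, so rows are
 * handed out to threads on demand instead of in equal static blocks. */
void tilt_shift_rows(const unsigned char *src, unsigned char *dst,
                     int w, int h, int focus_lo, int focus_hi, int max_radius) {
    assert(src != NULL && dst != NULL && w > 0 && h > 0);
    #pragma omp parallel for schedule(dynamic, 4)
    for (int y = 0; y < h; ++y) {
        int r = row_radius(y, focus_lo, focus_hi, max_radius);
        for (int x = 0; x < w; ++x) {
            int sum = 0, n = 0;
            for (int k = -r; k <= r; ++k) {
                int xx = x + k;
                if (xx < 0 || xx >= w) continue;   /* skip pixels off the row */
                sum += src[(size_t)y * (size_t)w + (size_t)xx];
                ++n;
            }
            /* n >= 1 because k = 0 is always inside the row. */
            dst[(size_t)y * (size_t)w + (size_t)x] =
                (unsigned char)((sum + n / 2) / n);
        }
    }
}
```

Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result. With `schedule(static)`, a thread that receives mostly out-of-focus rows would finish last while the others sit idle.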

OpenMP Parallelization

Dynamic scheduling

nowait

Loop coalescing: transform the 2D image matrix into a 1D array
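The last two items can be sketched together. This is an illustration under assumptions: the two per-pixel operations, the chunk size of 1024, and the function names are made up for the example. The `nowait` clause is valid here only because the two loops write to different arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Map a coalesced 1D index back to 2D coordinates (row-major layout). */
static void index_to_xy(long i, int w, int *x, int *y) {
    *y = (int)(i / w);
    *x = (int)(i % w);
}

/* Two independent passes over two single-channel w x h planes. */
void two_passes(unsigned char *a, unsigned char *b, int w, int h) {
    assert(a != NULL && b != NULL && w > 0 && h > 0);
    long n = (long)w * (long)h;   /* coalesced iteration space: one loop of w*h */
    #pragma omp parallel
    {
        /* nowait removes the implicit barrier at the end of this loop, so a
         * thread that runs out of work moves straight on to the second loop. */
        #pragma omp for schedule(dynamic, 1024) nowait
        for (long i = 0; i < n; ++i) {
            a[i] = (unsigned char)(255 - a[i]);        /* invert plane a */
        }

        #pragma omp for schedule(dynamic, 1024)
        for (long i = 0; i < n; ++i) {
            int x, y;
            index_to_xy(i, w, &x, &y);                 /* recover the 2D position */
            (void)x;
            if (y % 2 == 1) b[i] = (unsigned char)(b[i] / 2);   /* halve odd rows */
        }
    }
}
```

Coalescing gives the scheduler w × h small iterations to distribute instead of h row-sized ones. OpenMP 3.0 and later can also do this with the `collapse(2)` clause on a nested loop.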


Performance

Share of run time by stage (values listed in legend order):

Stage         Serial   8-core OpenMP
Read Image      4%       16%
Tilt Shift     71%       44%
Red Tune        8%        6%
Vignette       15%       26%
Write Image     2%        6%
Others          0%        2%

Serial compiler: intel/12.0.084. Parallel compiler: openmpi/1.4.4_intel-12.0.084; 8 cores are used.
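A rough consistency check on these shares, under two assumptions that the slides do not state: the percentages follow the legend order (Read Image is 4% of the serial run and 16% of the 8-core OpenMP run), and the absolute time spent reading the image is the same in both runs. With $T_{\text{serial}}$ and $T_{8}$ as the total run times:

$$0.04\,T_{\text{serial}} \approx 0.16\,T_{8} \quad\Longrightarrow\quad S_{8} = \frac{T_{\text{serial}}}{T_{8}} \approx \frac{0.16}{0.04} = 4$$

Because the shares are rounded to whole percentages, this is only an estimate of roughly 4× on 8 cores, which sits within the 0 to 4.5 range of the speedup chart.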

Dynamic Scheduling

Share of run time by stage on 8 cores (values listed in legend order):

Stage         Dynamic   Static
Read Image     17%       16%
Tilt Shift     39%       44%
Red Tune        7%        6%
Vignette       28%       26%
Write Image     7%        6%
Others          2%        2%

nowait

Share of run time by stage on 8 cores (values listed in legend order):

Stage         w/ nowait   w/o nowait
Read Image     17%         16%
Tilt Shift     39%         44%
Red Tune        7%          6%
Vignette       28%         26%
Write Image     7%          6%
Others          2%          2%

Speedup

[Bar chart: speedup (axis from 0 to 4.5) for four variants: Original, Dynamic w/o nowait, Nowait w/o dynamic, Dynamic & nowait. Series label: Time. Labels on the chart: +15%, +2%, +15%. 8 cores are used.]

Speedup of OpenMP and MPI

[Line chart: Speedup (0 to 25) versus No. of procs (0 to 25) for three curves: Ideal, OpenMP, MPI.]

OpenMP compiler: openmpi/1.4.4_intel-12.0.084; MPI compiler: mpich2-intel/1.4.4_intel-12.0.084

MPI vs. OpenMP

Share of run time by stage (values listed in legend order):

Stage         8-core MPI   8-core OpenMP
Read Image      9%          16%
Tilt Shift     52%          44%
Red Tune        6%           6%
Vignette       25%          26%
Write Image     6%           6%
Others          2%           2%

MPI: each process creates a large array, which is inefficient.
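If the inefficiency comes from every process allocating an array the size of the whole image (one reading of the note above, not confirmed by the slides), a possible remedy is to give each rank only its own band of rows plus a few halo rows for the filter. The helper below is a hypothetical sketch of that bookkeeping in plain C. In an MPI program, `rank` and `nprocs` would come from `MPI_Comm_rank` and `MPI_Comm_size`.

```c
#include <assert.h>

/* Rows [begin, end) are owned by one rank. Rows [load_begin, load_end) are
 * the ones it must hold in memory so a filter of the given radius can read
 * neighbors across band borders. */
typedef struct {
    int begin, end;
    int load_begin, load_end;
} RowBand;

/* Split `height` rows as evenly as possible over `nprocs` ranks; the first
 * (height % nprocs) ranks receive one extra row. */
RowBand row_band(int rank, int nprocs, int height, int radius) {
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    assert(height >= 0 && radius >= 0);
    int base = height / nprocs, extra = height % nprocs;
    RowBand b;
    b.begin = rank * base + (rank < extra ? rank : extra);
    b.end = b.begin + base + (rank < extra ? 1 : 0);
    b.load_begin = b.begin - radius < 0 ? 0 : b.begin - radius;
    b.load_end = b.end + radius > height ? height : b.end + radius;
    return b;
}
```

Assuming the 3168 dimension is the row count, 8 ranks and a radius of 1 give each rank 396 rows of its own and at most 398 rows in memory, instead of all 3168.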

Future Work

Other image sizes (smaller or larger)

Find other aspects to improve the speedup

Try different chunk sizes in dynamic scheduling


Thanks!
