christian steinle, university of heidelberg, institute of computer engineering1 l1 hough tracking on...

18
Christian Steinle, University of Heidelberg, Institute of Comp uter Engineering 1 L1 Hough Tracking on a Sony Playstation III Christian Steinle, Andreas Kugel, Reinhard Männer Computer Engineering, University of Heidelberg • Contents Motivation for the Playstation implementation Algorithm adaption Job restrictions Algorithm implementation Status

Upload: solomon-york

Post on 17-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

1

L1 Hough Tracking on a Sony Playstation III

Christian Steinle, Andreas Kugel, Reinhard MännerComputer Engineering, University of Heidelberg

• Contents

– Motivation for the Playstation implementation– Algorithm adaption– Job restrictions– Algorithm implementation– Status

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

2

• Hough Space: 1 dimension for 1 parameter of a track– bending 1/Pz, angle (Px/Pz) and Py/Pz)– Hough Space dimensions: 95 x 31 (cells) x 191 (layers)

– Detector slice with constant angle corresponds to one 2D Hough histogram

– Detector slices are overlapping (multiple scattering)

Z

X

Y

Py/Pz

Px/P z

1 /Pz

Motivation for the Playstation implementation

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

3

Motivation for the Playstation implementation

• L1 Hough Tracking algorithm development in HDL for FPGA HDL descriptions for all critical functional units

• Synthesis result of a single histogram layer implemented with registers shows the requirement of about 30.000 logic cells

50% of FPGA (XC4VLX60: 59,904) registers used for layer

• Conclusion: Develop a multi-chip solution

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

4

Motivation for the Playstation implementation

• Conclusion: Use the CellBE of a Sony Playstation III as cheap and

flexible rapid prototyping system

Multi-chip algorithm Multi-core platform

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

5

Algorithm adaption

• Conclusion: Two parallelism levels:

multiple processors work on input hit data in parallel vector capable ALUs process multiple layers in parallel

Creation of input data packages as so-called jobs

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

6

Job restrictions

1. Balance workload for the processors

• Number of histogram layers: 191• Number of processors for a Sony Playstation III: 6

– Cell BE: 8 but 1 disabled and 1 reserved for OS

yersForJobnumberOfLa32:n1Restrictio

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

7

Job restrictions

2. Optimal use of the processor‘s vector capable ALU

• Each histogram cell contains a 5 bit signature (1bit/detector)

• The vector capable ALU is 128 bit 128 bit / 5 bit / signature = 25 signatures in parallel

• Speed gain by using an implicite type instead of bit manipulation

unsigned char (8 bit) is ALU and memory applicable Each histogram cell contains actually 3 unused bits 128 bit / 8 bit / signature = 16 signatures in parallel

• Random access requires the composing of the processing ALU vector with 16 elements from the memory

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

8

Job restrictions

• Hough transformed Hit contains γmin and γmax

Hit is inserted into γmax - γmin + 1 consecutive layers

• Speed gain by optimizing the histogram memory structure while using the consecutive histogram layer entries for a single hit

Direct memory access on up to 16 consecutive layers in parallel

Each hit in a job is systolically processed just once in parallel A single hit can contribute to more than one job (γmax - γmin >

16, γmax of hit > γmax of actual job) yersForJobnumberOfLa16:n2Restrictio

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

9

Job restrictions

3. Optimal use of the processor‘s local storage

• Local storage contains program code, static data, heap and stack

• Conclusion: Used memory for job = codingTable + histogram + input data Available memory = stackPointer – heapPointer -

securityRegion ForJobusedMemoryemoryavailableM:n3Restrictio

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

10

Algorithm implementation

• Configuration part on the PPU

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

11

Algorithm implementation

• Configuration part on the SPU

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

12

Algorithm implementation

• Different versions of the job creation are possible

– CODEVERSION 1:• Create all jobs with a loop over all hits before the SPU processing

– CODEVERSION 2:• Create all jobs with a loop over all jobs before the SPU processing

– CODEVERSION 3:• Create the jobs parallely pipelined with the SPU processing

– CODEVERSION 4:• Create the jobs for dedicated SPUs when they are ready

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

13

Algorithm implementation

• Processing part one on the PPU

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

14

Algorithm implementation

• Processing part two on the PPU

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

15

Algorithm implementation

• Processing part three on the PPU

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

16

Algorithm implementation

• Different versions of the histogram memory usage

– MEMORYVERSION 1:• Use vector type for memory and computation

– MEMORYVERSION 2:• Use smallest type for a single histogram cell in the memory and

build vector type for computation

– MEMORYVERSION 3:• Use smallest type for histogram cells in parallel layers and build

vector type for computation

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

17

Algorithm implementation

• Proccessing part on the SPU

Christian Steinle, University of Heidelberg, Institute of Computer Engineering

18

Status

• Algorithm adaption is implemented

• Hough processing has to be implemented:– Actual step in scalar version: peakfinding2D– Missing steps in scalar version: peakfinding3D– Actual step in vector version: diagonalization– Missing steps in vector version: peakfinding2D, peakfinding3D

• Measure the timing

• Compare the timing with– the timing of the C++ framework implementation– the estimated timing of the FGPA implementation