sherman braganza prof. miriam leeser reconfigurable ... · fpga and gpu specifics y verification...
TRANSCRIPT
![Page 1: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/1.jpg)
Sherman BraganzaProf. Miriam Leeser
ReConfigurable LaboratoryNortheastern University
Boston, MA
![Page 2: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/2.jpg)
OutlineIntroduction
MotivationOptical Quadrature MicroscopyPhase unwrapping
AlgorithmsMinimum LP norm phase unwrapping
PlatformsReconfigurable Hardware and Graphics Processors
ImplementationFPGA and GPU specificsVerification details
ResultsPerformancePowerCost
Conclusions and Future Work
![Page 3: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/3.jpg)
Motivation – Why Bother With Phase Unwrapping?
Used in phase based imaging applications
IFSAR, OQM microscopyHigh quality results are computationally expensive
Only difficult in 2D or higherIntegrating gradients with noisy dataResidues and path dependency
Wrapped embryo image
0.1
0.3
‐0.1
‐0.3
0.1
0.3
‐0.1
‐0.2
No residues Residues
![Page 4: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/4.jpg)
Algorithms – Which One Do We Choose?
Many phase unwrapping algorithms Goldstein’s, Flynns, Quality maps, Mask Cuts, multi-grid, PCG, Minimum LP norm (Ghiglia and Pritt, “Two Dimensional Phase Unwrapping”, Wiley, NY, 1998.
We need: High quality (performance is secondary)Abilitity to handle noisy data
Choose Minimum LP Norm algorithm: Has the highest computational cost
a) Software embryo unwrap Using matlab ‘unwrap’
b) Software embryo unwrap Using Minimum LP Norm
![Page 5: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/5.jpg)
Breaking Down Minimum LP NormMinimizes existence of differences between measured data and calculated dataIterates Preconditioned Conjugate Gradient (PCG)
94% of total computation timeAlso iterativeTwo steps to PCG
Preconditioner (2D DCT, Poisson calculation and 2D IDCT)Conjugate Gradient
![Page 6: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/6.jpg)
Platforms – Which Accelerator Is Best For Phase Unwrapping?
FPGAsFine grained controlHighly parallelLimited program memory
Floating point?High implementation cost
Xilinx Virtex II Pro architecturehttp://www.xilinx.com/
![Page 7: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/7.jpg)
Platforms ‐ GPUs
G80 Architecture [nvidia.com/cuda]
![Page 8: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/8.jpg)
Platform ComparisonFPGAs GPUs
•Absolute control: Can specific custom bit-widths/architectures to optimally suit application
•Need to fit application to architecture
•Can have fast processor-processor communication
•Multiprocessor-multiprocessor communication is slow
•Low clock frequency •Higher frequency
•High degree of implementation freedom => higher implementation effort. VHDL.
•Relatively straightforward to develop for. Uses standard C syntax
•Small program space. High reprogramming time
•Relatively large program space. Low reprogramming time.
![Page 9: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/9.jpg)
Platform Description
Software unwrap execution time
Platform specifications
• FPGA and GPU on different platforms 4 years apart• Effects of Moore’s Law
• Machine 3 in the Results: Cost section has a Virtex 5 and two Core2Quads
![Page 10: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/10.jpg)
Implementation: Preconditioning On An FPGA
Need to account for bitwidth Minimum of 28 bit needed – Use 24 bit + block exponent
Implement a 2D 1024x512 DCT/IDCT using 1D row/column decompositionImplement a streaming floating point kernel to solve discretizedPoisson equation
27 bit software unwrap 28 bit software unwrap
![Page 11: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/11.jpg)
Minimum LP Norm On A GPU NVIDIA provides 2D FFT kernel
Use to compute 2D DCTCan use CUDA to implement floating point solver
Few accuracy issuesNo area constraints on GPU
Why not implement whole algorithm?Multiple kernels, each computing one CG or LP
norm stepOne host to accelerator transfer per unwrap
![Page 12: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/12.jpg)
Verifying Our ImplementationsLook at residue counts as algorithm progresses
Less than 0.1% differenceVisual inspection: Glass bead gives worst case results
Software unwrap GPU unwrap FPGA unwrap
![Page 13: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/13.jpg)
Verifying Our ImplementationsDifferences between software and accelerated version
GPU vs. Software FPGA vs. Software
![Page 14: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/14.jpg)
Results: FPGAImplemented preconditioner in hardware and measured algorithm speedupMaximum speedup assuming zero preconditioning calculation time :3.9x
We get 2.35x on a V2P70, 3.69x on a V5 (projected)
![Page 15: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/15.jpg)
Results: GPUImplemented entire LP norm kernel on GPU and measured algorithm speedup Speedups for all sections except disk IO5.24x algorithm speedup. 6.86x without disk IO
![Page 16: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/16.jpg)
Results: FPGAs vs. GPUs
Preconditioning onlySimilar platform generation. Projected FPGA results.Includes FPGA data transfer, not GPU
Buses? Currently use PCI-X for FPGA, PCI-E for GPU
![Page 17: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/17.jpg)
Results: PowerGPU power consumption increases significantlyFPGA power decreases
Power consumption (W)
![Page 18: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/18.jpg)
CostMachine 3 includes an AlphaData board with a Xilinx Virtex 5 FPGA platform and two Core2QuadsPerformance is given by 1/Texec
Proportional to FLOPs
Machine 2 $2200Machine 3 $10000
![Page 19: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/19.jpg)
Performance To Watt‐Dollars• Metric to include all parameters
![Page 20: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/20.jpg)
Conclusions And Future WorkFor phase unwrapping GPUs provide higher performance
Higher power consumptionFPGAs have low power consumption
High reprogramming timeOQM: GPUs are the best fit. Cost effective and faster:
Images already on processorFPGAs have a much stronger appeal in the embedded domain
Future WorkExperiment with new GPUs (GTX 280) and platforms (Cell, Larrabee, 4x2 multicore)Multi-FPGA implementation
![Page 21: Sherman Braganza Prof. Miriam Leeser ReConfigurable ... · FPGA and GPU specifics y Verification details y Results y Performance y Power y Cost y Conclusions and Future Work. Motivation](https://reader033.vdocuments.us/reader033/viewer/2022060606/605c553eeccee6493a6fdb28/html5/thumbnails/21.jpg)
Thank You!
Any Questions?
Sherman Braganza ([email protected])Miriam Leeser ([email protected])
Northeastern University ReConfigurable Laboratoryhttp://www.ece.neu.edu/groups/rcl