Compilation and Parallelization Techniques with Tool Support to Realize Sequence Alignment Algorithm on FPGA and Multicore

Sunita Chandrasekaran¹ Oscar Hernandez²
Douglas Leslie Maskell¹ Barbara Chapman²
Van Bui²
¹ Nanyang Technological University, Singapore
² University of Houston, HPCTools, Texas, USA
[Figure: Speed-up (log scale) vs. number of threads (2–128) for 100, 600, and 1000 sequences, compared against the ideal speed-up]
Outline

• Challenge
• Application – Bioinformatics
• Proposed Idea
• Tool Support
• Tuning Methodology
• Scheduling
• Execution and Tuning Model
• Conclusion and Future Work
Challenge

• Reconfigurable Computing – customizing a computational fabric for specific applications, e.g. the FPGA (Field Programmable Gate Array)
• Reconfigurable Computing combined with HPC is a reality
• Fills the gap between hardware and software
• FPGA-based accelerators involve massive parallelism and extensible hardware optimizations
• Portions of the application can be run on reprogrammable hardware

It is important to identify the hot spots in the application to determine which portions should run in software and which in hardware.

This paper presents a tuning methodology that identifies the bottlenecks in the program using a parallelizing compiler together with static and dynamic analysis tools.
Application

Bioinformatics – Multiple Sequence Alignment (MSA): arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity.

Areas of research in Bioinformatics:
• Sequence Alignment – Local (internal small stretches of similarity, the S-W algorithm) and Global (end-to-end alignment, the N-W algorithm)
• Gene Structure Prediction – classification and identification of genes
• Phylogenetic Tree – constructed based on the distances between the sequences
• Protein Folding – 2D and 3D structure
Smith-Waterman Algorithm

• Finds similar subsequences of two sequences
• Implemented by large bioinformatics organizations
• A dynamic programming algorithm used to compute the local alignment of a pair of sequences
• Impractical for large inputs due to its time and space complexity
• Progressive alignment is the widely used heuristic: compute a distance value between each pair of sequences, build a phylogenetic tree, then perform pairwise alignment of the various profiles
• Hardware implementations of the algorithm exploit opportunities for parallelism and further accelerate the execution
Proposed Idea

• Efficient C code implementation of the MSA
• Preprocessing steps and parallel processing approaches
• Profiling to determine the performance bottlenecks, identifying the areas of the code that can benefit from parallelization
• High-level optimizations performed to obtain a better speed-up and improve the CPI
• These include pipelining, data prefetching, data locality, avoiding resource contention, and support for parallelization of the main kernel
OpenUH Compiler Infrastructure – Tool Support

Source code with OpenMP directives flows through:
• Front-end (C/C++ and Fortran 77/90)
• IPA (Inter-Procedural Analyzer)
• LNO (Loop Nest Optimizer)
• WOPT (global scalar optimizer)
• IR-to-source translation (whirl2c and whirl2f) – emits source code with OMP library calls, compiled to executables by native compilers
• Backend – object code linked against a portable OpenMP runtime library to produce executables
The OpenUH Compiler

• Based on the Open64 compiler
• A suite of optimizing compiler tools for Linux/Intel IA-64 systems and IA-32 (source-to-source)
• First release open-sourced by SGI, available for researchers/developers in the community
• Multiple languages and multiple targets: C, C++, and Fortran 77/90; OpenMP 2.0 support (University of Houston, Tsinghua University, PathScale)
OpenUH/Open64 includes the Dragon Analysis Tool, which provides:
• Call Graph
• Flow Graph
• Array Regions
• Data Dependence Analysis
TAU – a profiling toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, or Python
Tuning Methodology

• Bottlenecks in the program are identified with hardware performance counters

Measurements on the unoptimized code:
• Useful instruction count = 7.63E+9
• No-op (NOP) operations = 44% (moving this portion to the reconfigurable platform would be inefficient)
• Branch mispredictions = 75% (these stall the pipeline and waste resources)
• Cycles per instruction (CPI) = 0.3178 (instructions are stalling)
Goal: reduce total cycles, stalls, no-ops, and conditionals; hoist loop-invariant code outside loops; improve memory locality

• Used the OpenMP software parallel programming paradigm and pragmas to parallelize the code
• Identified the dependencies in the program with the Dragon tool
• Control-flow and data-flow graphs used to distinguish between regions
• Aggressive privatization applied to most of the arrays
• Fine-grained locks defined to access shared arrays
• Hot spots of the application identified
OpenMP pseudo code of the main kernel (msap):

```c
msap() {
  #pragma omp parallel private(..) firstprivate(..)
  {
    #pragma omp for
    for (...)
      initialize array of locks

    #pragma omp for nowait
    for (...) {
      for (...) {
        for (...) {
          computations()
          for (...) {
            computations()
          }
          /* update to shared data */
          omp_set_lock()
          updates to shared data
          omp_unset_lock()
        }
      }
    }
  }
}
```
Results obtained after performing the optimizations:
• Useful instruction count = 8.40E+9
• NOP operations = 24%
• Branch mispredictions = 59%
• Cycles per instruction = 0.28 (lower, hence higher performance)

| Parameter | Unoptimized | Optimized |
| --- | --- | --- |
| Useful instruction count | 7.63E+9 | 8.40E+9 |
| NOP operations | 44% | 24% |
| Branch mispredictions | 75% | 59% |
| Cycles/instruction (CPI) | 0.3178 | 0.28 |

CPI improved by 11.89%, branch mispredictions were reduced by 21.33%, and NOP instructions were reduced by 45.45%.
Scheduling

Static Scheduling
• Reduced synchronization/communication overhead
• Unevenly sized tasks
• Load imbalance and idle processors, leading to wasted resources
• The result matrix is triangular, so an even work distribution is not achieved and there is no ideal speed-up
Dynamic Scheduling
• Offers flexibility
• As the parallel loop is executed, the number of iterations each thread performs is determined dynamically
• The loop is divided into chunks of h iterations, with a chunk size of 1 or x% of the h iterations
• ~80% of the ideal speed-up achieved
[Figure: Static scheduling vs. dynamic scheduling of the triangular matrix]
Execution and Tuning Model
Conclusion and Future Work

• The multithreaded application achieves 78% of the ideal speed-up with dynamic scheduling and 128 threads on a 1000-sequence protein data set.
• Looking at translating OpenMP to Impulse-C, a tool for mainstream embedded programmers seeking high performance through FPGA co-processing.
• Plan to address the lack of tools and techniques for turn-key mapping of algorithms to hybrid CPU-FPGA systems by developing an OpenUH add-on module to perform this mapping automatically.