The Java profiler based on byte code analysis and instrumentation for many-core hardware accelerators
Marcin Pietroń 1,2, Michał Karwatowski 1,2, Kazimierz Wiatr 1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków
2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków
RUC 17-18.09.2015 Kraków
2. Agenda
GPU acceleration
Code analysis and instrumentation
Experiments
Results
Conclusion and future work
3. GPUs as modern hardware accelerators
Computing power (over 1 TFLOPS)
Availability
High parallelism (SIMT architecture)
High level programming tools (CUDA, OpenCL)
4. GPU hardware accelerators
Many algorithms from different domains have been implemented on GPUs:
Linear algebra (e.g. cuBLAS, CULA)
Deep learning, neural networks, machine learning algorithms (e.g. SVM)
Computational intelligence (e.g. genetic, memetic algorithms)
Data and text mining
5. Code analysis
Implementation should be preceded by appropriate analysis
Analysis can be automated
Static analysis for finding hidden parallelism (Banerjee test, Range Test, Omega Test) and for determining data reuse and distribution
Profiling as dynamic analysis
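Dependence tests such as those named above decide whether two array accesses in a loop can touch the same element. As a rough illustration only, the sketch below implements the much simpler GCD test, which is not one of the tests listed on the slide. The class and method names are hypothetical. The test ignores loop bounds, so a `true` result only means a dependence cannot be ruled out.

```java
// Hypothetical helper illustrating the classic GCD dependence test.
class GcdDependenceTest {

    static int gcd(int a, int b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // For accesses x[a1*i + b1] and x[a2*j + b2], the equation
    // a1*i - a2*j = b2 - b1 has an integer solution only if
    // gcd(a1, a2) divides (b2 - b1). Loop bounds are ignored.
    static boolean mayDepend(int a1, int b1, int a2, int b2) {
        int g = gcd(a1, a2);
        if (g == 0) {
            return b1 == b2; // both subscripts are constants
        }
        return (b2 - b1) % g == 0;
    }
}
```

For example, a write to `x[2*i]` and a read of `x[2*i + 1]` can never overlap, so `mayDepend(2, 0, 2, 1)` is `false` and the loop iterations are independent with respect to that pair.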
6. Byte code analysis and instrumentation
Just-in-time byte code analysis
Appropriate instrumentation for profiling and static analysis
Results of analysis and profiling can be used for implementation
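One way to get at byte code at class load time in Java is the standard `java.lang.instrument` API. The slides do not say which mechanism or byte code library the profiler uses, so the sketch below is only an assumption about how such a hook could look. The class name `LoadTimeHook` and the `java/` filter are hypothetical, and the sketch records class names instead of rewriting any byte code.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical transformer: the class name and the "java/" filter are illustrative.
class LoadTimeHook implements ClassFileTransformer {

    // Names of the classes offered to the hook, kept so the effect is observable.
    private final List<String> seen =
            Collections.synchronizedList(new ArrayList<String>());

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        // A profiler would parse classfileBuffer here and insert monitoring
        // instructions around array loads and stores (not done in this sketch).
        if (className != null && !className.startsWith("java/")) {
            seen.add(className);
        }
        return null; // null means: leave the class bytes unchanged
    }

    List<String> seen() {
        return seen;
    }

    // Would be called from an agent's premain(String, Instrumentation).
    static LoadTimeHook register(Instrumentation inst) {
        LoadTimeHook hook = new LoadTimeHook();
        inst.addTransformer(hook);
        return hook;
    }
}
```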
7. System architecture
8. Byte code instrumentation
instrumenting array data read instructions
instrumenting array data write instructions
instrumenting array data read and write instructions for counting number of accesses and standard deviation,
instrumenting single variables read and write for counting number of accesses.
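The counters listed above can be pictured as a small recorder that the inserted instructions call on every access. The sketch below is a guess at such a recorder, not the authors' implementation. In particular, the slide does not say what the standard deviation is taken over, so computing it over the accessed indices is an assumption, and `AccessStats` and its methods are hypothetical names.

```java
// Hypothetical recorder for the statistics that instrumented code would report.
class AccessStats {
    private long reads;
    private long writes;
    // Running sums for the standard deviation of accessed indices (assumption).
    private double sum;
    private double sumSq;

    void recordRead(int index) {
        reads++;
        add(index);
    }

    void recordWrite(int index) {
        writes++;
        add(index);
    }

    private void add(int index) {
        sum += index;
        sumSq += (double) index * index;
    }

    long reads() { return reads; }
    long writes() { return writes; }
    long total() { return reads + writes; }

    // Population standard deviation of all accessed indices.
    double indexStdDev() {
        long n = total();
        if (n == 0) {
            return 0.0;
        }
        double mean = sum / n;
        return Math.sqrt(Math.max(0.0, sumSq / n - mean * mean));
    }
}

class AccessStatsDemo {
    // Runs a small loop with the recorder calls written out by hand,
    // in the places where instrumentation would insert them.
    static AccessStats run() {
        int[] data = new int[100];
        int[] copy = new int[100];
        AccessStats stats = new AccessStats();
        for (int i = 1; i < 100; i++) {
            stats.recordWrite(i);      // before the array store
            data[i] = 100;
            stats.recordRead(i - 1);   // before the array load
            copy[i] = data[i - 1];
        }
        return stats;
    }
}
```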
9. Byte code instrumentation
Original loop:

    for (int i = 1; i < 100; i++) {
        test_1[i] = 100;
        test_2[i] = test_1[i-1] + 10;
    }

Instrumented loop:

    for (int i = 1; i < 100; i++) {
        test_1[i] = 100;
        test_1_mon[i] = i;
        test_2[i] = test_1[i-1] + 10;
        if (test_1_mon[i-1] < i) {
            dist_vectors[i-1] = i-test_1_mon[i];
        }
    }
Byte code of the instrumented loop (excerpt):

    27: iconst_1
    28: istore %6
    30: iload %6
    32: bipush 100
    34: if_icmpge #93
    37: aload_1
    38: iload %6
    40: bipush 100
    42: iastore
    43: aload_3
    44: iload %6
    46: iload %6
    48: iastore
    ...
    70: if_icmpge #87
    73: aload %5
    75: iload %6
    77: iconst_1
    78: isub
    79: iload %6
    81: aload_3
    82: iload %6
    84: iaload
    85: isub
    86: iastore
    87: iinc %6 1
    90: goto #30
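The monitor-array idea can be tried out as a small standalone program. The sketch below is a variation rather than a copy of the slide code: it looks up the writer of the element that is actually read (index `i-1`) and marks never-written elements with `-1`. Both choices, and all names in the block, are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical demo of recording dependence distances with a monitor array.
class DistanceVectorDemo {
    static final int N = 100;

    // For each element test1[k] read inside the loop, returns the distance
    // (in iterations) between the iteration that wrote it and the iteration
    // that read it, or 0 if the element was never written by the loop.
    static int[] run() {
        int[] test1 = new int[N];
        int[] test2 = new int[N];
        int[] lastWriter = new int[N];   // plays the role of test_1_mon
        int[] distVectors = new int[N];
        Arrays.fill(lastWriter, -1);     // -1 marks "not written yet" (assumption)

        for (int i = 1; i < N; i++) {
            test1[i] = 100;
            lastWriter[i] = i;           // iteration i wrote test1[i]
            test2[i] = test1[i - 1] + 10;
            int writer = lastWriter[i - 1];  // producer of the element just read
            if (writer >= 0 && writer < i) {
                distVectors[i - 1] = i - writer;
            }
        }
        return distVectors;
    }
}
```

In this sketch every element written by the loop is read one iteration later, so the recorded distance is 1 for indices 1 to 98. A constant distance like this indicates data reused between neighbouring iterations.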
10. GPU implementation rules
if data is reused between iterations (that is, between threads), it should be transferred to shared memory,
data reused only within a single iteration should be transferred to local memory (registers),
data that is reused, read-only and accessed irregularly should be allocated in texture memory,
11. GPU implementation rules
constant values shared by all threads should be written to constant memory,
data that is accessed only once, but in a non-coalesced pattern, should first be copied as a group into shared memory in a coalesced manner and then read from shared memory for further computation.
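The five rules above amount to a small decision procedure over the profiling results. The sketch below writes them as one Java method. The field names, the order in which the rules are checked and the `GLOBAL` fallback are assumptions of this sketch and do not come from the slides.

```java
// Hypothetical encoding of the memory placement rules as a decision procedure.
class MemoryRules {

    enum MemorySpace { CONSTANT, TEXTURE, SHARED, REGISTERS, SHARED_STAGED, GLOBAL }

    // Hypothetical summary of what profiling reports for one array or variable.
    static class AccessProfile {
        boolean constantForAllThreads;
        boolean reusedAcrossIterations;
        boolean reusedWithinIteration;
        boolean readOnly;
        boolean regularAccess;
        boolean coalescedAccess;
    }

    // Rule order is an assumption; the slides do not define priorities.
    static MemorySpace choose(AccessProfile p) {
        if (p.constantForAllThreads) {
            return MemorySpace.CONSTANT;
        }
        boolean reused = p.reusedAcrossIterations || p.reusedWithinIteration;
        if (reused && p.readOnly && !p.regularAccess) {
            return MemorySpace.TEXTURE;
        }
        if (p.reusedAcrossIterations) {
            return MemorySpace.SHARED;
        }
        if (p.reusedWithinIteration) {
            return MemorySpace.REGISTERS;
        }
        if (!p.coalescedAccess) {
            // Single, non-coalesced access: stage through shared memory.
            return MemorySpace.SHARED_STAGED;
        }
        return MemorySpace.GLOBAL; // fallback not covered by the rules (assumption)
    }
}
```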
12. JCuda generation
Implementation can be done manually or in a partly automated way
The rules generate some parallel code patterns
13. Experimental results

| Matrix size | GPU time [ms] | CPU time (MKL BLAS) [ms] |
|---|---|---|
| 256×256 | 0.4 | 6 |
| 512×512 | 4.3 | 24 |
| 1024×1024 | 34 | 158 |
| 2048×2048 | 285 | 956 |
| 4096×4096 | 2817 | 990 |
14. Conclusions and future work
Preceding the implementation with source code analysis helps to adapt an algorithm to the GPU
Automated generation of parallel GPU code saves a lot of time
Based on byte code, therefore portable
Code generation in our system needs further optimization (memory access patterns)
15. Questions?