![Page 1: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/1.jpg)
AutoTVM & Device Fleet
![Page 2: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/2.jpg)
Learning to Optimize Tensor Programs
High-level data flow graph and optimizations
Hardware
Frameworks
![Page 4: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/4.jpg)
Machine Learning based Program Optimizer
Learning to Optimize Tensor Programs
High-level data flow graph and optimizations
Hardware
Frameworks
![Page 5: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/5.jpg)
Machine Learning based Program Optimizer
Learning to Optimize Tensor Programs
High-level data flow graph and optimizations
Learning to generate optimized programs for new operator workloads and hardware
Hardware
Frameworks
![Page 6: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/6.jpg)
Search over Possible Program Transformations
Hardware
Loop Transformations Thread Bindings Cache Locality
Thread Cooperation Tensorization Latency Hiding
C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))
Compute Description
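To make the compute description concrete, here is a minimal pure-Python sketch (not TVM code) of what it computes, C[y][x] = sum over k of A[k][y] * B[k][x], next to one loop transformation (tiling the y and x loops). The function names, the reduction extent `l`, and the tile size are illustrative assumptions. The only point is that a transformed loop nest computes the same result as the naive one.

```python
def matmul_naive(A, B, m, n, l):
    """C[y][x] = sum_k A[k][y] * B[k][x]; A has shape (l, m), B has shape (l, n)."""
    C = [[0] * n for _ in range(m)]
    for y in range(m):
        for x in range(n):
            for k in range(l):
                C[y][x] += A[k][y] * B[k][x]
    return C


def matmul_tiled(A, B, m, n, l, tile=4):
    """Same computation with the y and x loops split into tiles (a loop transformation)."""
    C = [[0] * n for _ in range(m)]
    for yo in range(0, m, tile):
        for xo in range(0, n, tile):
            for k in range(l):
                for y in range(yo, min(yo + tile, m)):
                    for x in range(xo, min(xo + tile, n)):
                        C[y][x] += A[k][y] * B[k][x]
    return C


m, n, l = 8, 8, 8
A = [[(3 * k + y) % 7 for y in range(m)] for k in range(l)]
B = [[(5 * k + x) % 11 for x in range(n)] for k in range(l)]
same = matmul_naive(A, B, m, n, l) == matmul_tiled(A, B, m, n, l)
```

Both versions are valid schedules for the same description; which one runs faster depends on the hardware, and that choice is what the search has to decide.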
![Page 8: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/8.jpg)
Search over Possible Program Transformations
Hardware
Loop Transformations Thread Bindings Cache Locality
Thread Cooperation Tensorization Latency Hiding
Billions of possible optimization choices
C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))
Compute Description
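The size of such a search space can be estimated by multiplying the number of choices per tunable knob. The sketch below counts the ways to split a loop extent into nested tile factors. The knob names and extents are hypothetical and are not taken from an actual AutoTVM template.

```python
from math import prod


def count_splits(extent, parts):
    """Number of ways to write `extent` as an ordered product of `parts` positive integers."""
    if parts == 1:
        return 1  # the remaining extent is the last factor
    return sum(
        count_splits(extent // d, parts - 1)
        for d in range(1, extent + 1)
        if extent % d == 0
    )


# Hypothetical knobs for a single operator (illustration only).
knobs = {
    "tile_y: 512 split into 4 parts": count_splits(512, 4),
    "tile_x: 512 split into 4 parts": count_splits(512, 4),
    "tile_k: 512 split into 3 parts": count_splits(512, 3),
    "unroll depth": 3,
    "explicit unroll on/off": 2,
}
space_size = prod(knobs.values())
```

Even this small, made-up set of five knobs gives roughly 16 million configurations. Each additional knob multiplies the total, which is how real spaces can reach the billions.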
![Page 9: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/9.jpg)
Learning-based Program Optimizer
Program Optimizer, Program, Code Generator
![Page 10: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/10.jpg)
Learning-based Program Optimizer
Runtime Measurements
Program Optimizer, Program, Code Generator
![Page 11: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/11.jpg)
Learning-based Program Optimizer
Runtime Measurements
High experiment cost: each trial costs ~1 second
Program Optimizer, Program, Code Generator
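A runtime measurement amounts to running a candidate program a few times and timing it. The sketch below uses a hypothetical `measure` helper and a stand-in workload rather than a real compiled tensor program, followed by a back-of-envelope estimate of what ~1 second per trial means for a space of one billion candidates.

```python
import time


def measure(run, repeats=3):
    """Run a candidate `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def candidate():
    # Stand-in for a compiled tensor program.
    return sum(i * i for i in range(10_000))


elapsed = measure(candidate)

# Back-of-envelope: exhaustive search at ~1 second per trial.
seconds_per_trial = 1.0
trials = 1_000_000_000
years = trials * seconds_per_trial / (3600 * 24 * 365)
```

At that rate, measuring every one of a billion candidates would take about three decades, so only a small fraction of the space can ever be measured.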
![Page 12: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/12.jpg)
Learning-based Program Optimizer
Program Optimizer, Program, Code Generator
![Page 13: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/13.jpg)
Learning-based Program Optimizer
Program Optimizer, Program, Code Generator
Cost Model
![Page 14: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/14.jpg)
Learning-based Program Optimizer
Need reliable cost model per hardware
Program Optimizer, Program, Code Generator
Cost Model
![Page 15: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/15.jpg)
Learning-based Program Optimizer
Program Optimizer, Program, Code Generator
![Page 16: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/16.jpg)
Learning-based Program Optimizer
Program Optimizer, Program, Code Generator
Training data D
![Page 17: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/17.jpg)
Learning-based Program Optimizer
Program Optimizer, Program, Code Generator
Training data D
Learning
Statistical Cost Model
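The loop on this slide (measure, collect training data D, learn a statistical cost model, use it to pick the next candidates) can be sketched as follows. Everything here is a simplification under stated assumptions: `measure` is a made-up cost function standing in for hardware, the cost model is a 1-nearest-neighbour predictor rather than the model used in the talk, and candidate selection is purely greedy.

```python
import random


def measure(cfg):
    """Hypothetical stand-in for a hardware measurement (lower is better)."""
    ty, tx = cfg
    return abs(ty - 16) + abs(tx - 32) + 1.0


def predict(cfg, data):
    """1-nearest-neighbour cost model: reuse the cost of the closest measured config."""
    nearest = min(data, key=lambda c: (c[0] - cfg[0]) ** 2 + (c[1] - cfg[1]) ** 2)
    return data[nearest]


def tune(space, rounds=5, batch=4, seed=0):
    rng = random.Random(seed)
    data = {}  # training data D: config -> measured cost
    for cfg in rng.sample(space, batch):  # bootstrap with random trials
        data[cfg] = measure(cfg)
    for _ in range(rounds):
        unmeasured = [c for c in space if c not in data]
        if not unmeasured:
            break
        # Rank unmeasured candidates by predicted cost and measure the most promising.
        ranked = sorted(unmeasured, key=lambda c: predict(c, data))
        for cfg in ranked[:batch]:
            data[cfg] = measure(cfg)  # each new measurement extends D
    best = min(data, key=data.get)
    return best, data


sizes = (1, 2, 4, 8, 16, 32, 64)
space = [(ty, tx) for ty in sizes for tx in sizes]
best, data = tune(space)
```

With these settings the sketch measures 24 of the 49 configurations. A practical tuner would also need some exploration so that it does not get stuck near its first good guess.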
![Page 18: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/18.jpg)
Learning-based Program Optimizer
Unique Problem Characteristics
• Relatively low experiment cost
• Domain-specific problem structure
• Large quantity of similar tasks
Program Optimizer, Program, Code Generator
Training data D
Learning
Statistical Cost Model
![Page 19: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/19.jpg)
Program-aware Cost Modeling
High-Level Configuration
![Page 20: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/20.jpg)
Program-aware Cost Modeling
High-Level Configuration
for y in range(8):
  for x in range(8):
    C[y][x] = 0
    for k in range(8):
      C[y][x] += A[k][y] * B[k][x]
Low-level Abstract Syntax Tree (shared between tasks)
![Page 21: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/21.jpg)
Program-aware Cost Modeling
High-Level Configuration
for y in range(8):
  for x in range(8):
    C[y][x] = 0
    for k in range(8):
      C[y][x] += A[k][y] * B[k][x]
Low-level Abstract Syntax Tree (shared between tasks)
statistical features
loop   touched memory (C, A, B)   outer loop length
y      64, 64, 64                 1
x      8, 8, 64                   8
k      1, 8, 8                    64
Boosted Tree Ensembles
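The touched-memory and outer-loop-length values on the slide can be reproduced from the loop nest with a few lines of Python. This is a minimal sketch that assumes each buffer dimension is indexed by a single loop variable, which holds for this example. The function and variable names are illustrative and not part of AutoTVM.

```python
from math import prod

loops = [("y", 8), ("x", 8), ("k", 8)]  # outermost to innermost
accesses = {"C": ("y", "x"), "A": ("k", "y"), "B": ("k", "x")}


def loop_features(loops, accesses):
    extent = dict(loops)
    order = [name for name, _ in loops]
    features = {}
    for depth, name in enumerate(order):
        inner = set(order[depth:])  # this loop and everything nested inside it
        # Elements of each buffer touched during one full execution of this loop:
        # index variables bound by outer loops are fixed, inner ones range freely.
        touched = {
            buf: prod(extent[v] for v in index if v in inner)
            for buf, index in accesses.items()
        }
        # How many times this loop is entered: product of the enclosing loop extents.
        outer_length = prod(extent[v] for v in order[:depth])
        features[name] = {"touched": touched, "outer_loop_length": outer_length}
    return features


features = loop_features(loops, accesses)
```

Because these features are computed on the low-level loop structure rather than on the high-level configuration, they keep the same meaning across different tasks.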
![Page 22: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/22.jpg)
Program-aware Cost Modeling
High-Level Configuration
for y in range(8):
  for x in range(8):
    C[y][x] = 0
    for k in range(8):
      C[y][x] += A[k][y] * B[k][x]
Low-level Abstract Syntax Tree (shared between tasks)
for, for, for
context vec of y, context vec of x, context vec of k
soft scatter
+
final embedding
TreeGRU
statistical features
loop   touched memory (C, A, B)   outer loop length
y      64, 64, 64                 1
x      8, 8, 64                   8
k      1, 8, 8                    64
Boosted Tree Ensembles
![Page 23: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/23.jpg)
Effectiveness of ML-based Model
[Plot: x-axis 0 to 800; y-axis 0.00 to 1.50]
![Page 24: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/24.jpg)
Effectiveness of ML-based Model
[Plot: One Conv2D Layer of ResNet18 on Titan X. x-axis 0 to 800; y-axis 0.00 to 1.50]
![Page 25: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/25.jpg)
Effectiveness of ML-based Model
[Plot: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials (0 to 800); y-axis 0.00 to 1.50]
![Page 26: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/26.jpg)
Effectiveness of ML-based Model
[Plot: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials (0 to 800); y-axis: Relative Speedup (0.00 to 1.50). Series: Baseline: cuDNN]
![Page 27: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/27.jpg)
Effectiveness of ML-based Model
[Plot: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials (0 to 800); y-axis: Relative Speedup (0.00 to 1.50). Series: Baseline: cuDNN; TVM: Random Search]
![Page 28: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/28.jpg)
Effectiveness of ML-based Model
[Plot: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials (0 to 800); y-axis: Relative Speedup (0.00 to 1.50). Series: Baseline: cuDNN; TVM: Random Search; TVM: ML-based Model]
![Page 29: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/29.jpg)
Transfer Learning Among Different Workloads
Historical Optimization Tasks
![Page 30: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/30.jpg)
Transfer Learning Among Different Workloads
Historical Optimization Tasks
Domain Invariant Program Representations
![Page 31: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/31.jpg)
Transfer Learning Among Different Workloads
Historical Optimization Tasks
Domain Invariant Program Representations
Transferable Models to speed up new tasks
![Page 33: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/33.jpg)
NVIDIA GPU Optimization (GTX 1080 Ti)
[Bar chart: Latency (ms), axis 0 to 7. Models: ResNet-50, MobileNet, VGG-19, Inception V3, DenseNet-121. Series: MXNet + TensorRT 4.0; AutoTVM]
![Page 34: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/34.jpg)
AMD GPU Optimization (Vega FE)
[Bar chart: Latency (ms), axis 0 to 7. Models: ResNet-50, MobileNet, DenseNet-121. Series: MIOpen; AutoTVM]
![Page 35: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/35.jpg)
Bonus (INT8, GTX 1080)
[Bar chart: Latency (ms), axis 0E+00 to 1.6E-04. Workloads: 1-7-512-512-1, 4-7-512-512-1, 1-7-512-512-3, 4-7-512-512-3, 1-14-256-256-1, 4-14-256-256-1. Series: cuDNN; AutoTVM]
![Page 36: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/36.jpg)
High Level: Scaling Automatic Performance Profiling
Fleet Tracker
![Page 37: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/37.jpg)
Low Level: Portable RPC Tracker + Server
Resource Manager (Tracker)
Shared cluster of heterogeneous devices
• Nvidia GPU Server: RPC RT, CUDA tasks
• Android Phone: RPC RT, OpenCL tasks
• AMD GPU Server: RPC RT, ROCm tasks
• Zynq FPGA board: RPC RT, JIT driver
• Raspberry Pi: RPC RT, ARM tasks
Optimization Service: RPC client, cross compiler
Resource token
Resource Allocation
RPC Session Data Path
Red modules can be reconfigured remotely in each session
Running optimization services
Prioritizer
Workload 1
Workload 2
Workload 3
ML-based cost model
…
Hardware bitstream
![Page 38: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/38.jpg)
RPC Communication Flow
Client Tracker
Client Device
upload code
run code
return run time
Device
![Page 39: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/39.jpg)
RPC Communication Flow
Client Tracker
Client Device
upload code
run code
return run time
Device
device free
![Page 41: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/41.jpg)
RPC Communication Flow
Client Tracker
request device
Client Device
upload code
run code
return run time
Device
device free
![Page 44: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/44.jpg)
RPC Communication Flow
Client Tracker
request device
return handle
Client Device
upload code
run code
return run time
Device
device free
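The message sequence on these slides (device free, request device, return handle, upload code, run code, return run time) can be mimicked with a small in-process simulation. This is only a sketch of the protocol's shape. There is no networking, and the class and method names are invented for illustration rather than taken from TVM's RPC API.

```python
from collections import deque


class Device:
    def __init__(self, name):
        self.name = name
        self.code = None

    def upload(self, code):
        self.code = code  # "upload code"

    def run(self):
        # "run code" then "return run time"; the callable returns a made-up run time.
        return self.code()


class Tracker:
    def __init__(self):
        self.free = deque()

    def device_free(self, device):
        self.free.append(device)  # device -> tracker: "device free"

    def request_device(self):
        # client -> tracker: "request device"; tracker -> client: "return handle"
        return self.free.popleft() if self.free else None


def client_measure(tracker, code):
    device = tracker.request_device()
    if device is None:
        return None  # no device available; a real client would wait or retry
    try:
        device.upload(code)
        return device.run()
    finally:
        tracker.device_free(device)  # the device becomes available again


tracker = Tracker()
tracker.device_free(Device("rasp-pi-0"))
run_time = client_measure(tracker, lambda: 0.0123)
```

In this sketch the tracker only hands out handles. Code upload and execution happen directly between client and device, matching the split between the Client/Tracker and Client/Device lanes on the slide.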
![Page 45: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/45.jpg)
Model to Tuned Implementation
Model → operator extraction → Bag of Operators → AutoTVM tuning → Tuned Model
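The "bag of operators" step can be pictured as collapsing a model's layers into a set of unique operator workloads, each of which is then tuned once. The sketch below uses a hypothetical layer list and a placeholder `tune` function. It does not call the real AutoTVM task-extraction or tuning APIs.

```python
from collections import Counter

# Hypothetical model: a list of (operator, workload parameters) layers.
model = [
    ("conv2d", (1, 64, 56, 56, 64, 3)),
    ("conv2d", (1, 64, 56, 56, 64, 3)),
    ("conv2d", (1, 128, 28, 28, 128, 3)),
    ("dense", (1, 512, 1000)),
]


def extract_tasks(model):
    """Collapse layers into a bag of unique operator workloads with occurrence counts."""
    return Counter(model)


def tune(task):
    """Placeholder for tuning one workload; returns a made-up record."""
    op, params = task
    return {"op": op, "params": params, "config": "best-found"}


tasks = extract_tasks(model)
tuned = {task: tune(task) for task in tasks}
```

Identical layers share one tuning task, so the tuning cost scales with the number of distinct workloads rather than with the depth of the model.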
![Page 46: AutoTVM & Device FleetLearning to Optimize Tensor Programs High-level data flow graph and optimizations Hardware Frameworks. Learning to Optimize Tensor Programs High-level data flow](https://reader033.vdocuments.us/reader033/viewer/2022052519/5f17f3d77a3731406e304cc5/html5/thumbnails/46.jpg)
Next: Autoscheduler, Lianmin @ 16:30
Handcrafted Schedule Templates
• conv2d, x86
• conv2d, GPU, winograd
• conv2d, ARM, spatial packing