![Page 2: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/2.jpg)
Outline
• Overview
 – Potential applications
 – Intermediate Representation
 – Probabilistic model
• Statistical Code Completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in Program Analysis
• Conclusion
![Page 3: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/3.jpg)
Marriage of ML and PL
SLANG: code completion [PLDI 14]
PL translation [Onward 14]
JSNice: type prediction [POPL 15]
More applications:
Program bug detection
Program invariant inference
PL design: probabilistic PL
Binary analysis
........
![Page 4: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/4.jpg)
Intermediate Representation
• Sequences
• Trees
• Graphical models
• Feature vectors
Other IRs in PL research: AST, CFG, CDG, DDG, PDG, SSA, CPS, ...
![Page 5: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/5.jpg)
Extract program representation with Program Analysis
• SLANG: alias and typestate analysis
• JSNice: scope and alias analysis, type analysis
......
• Other applications:
 – Use type inference to obtain training labels
 – Use a SAT solver to check path conditions
....
![Page 6: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/6.jpg)
What's the suitable probabilistic model?
N-gram language models [PLDI 14]
Probabilistic context-free grammars [ICSE 12]
Neural networks
Support vector machines
Conditional random fields [POPL 15]
......
As with the IR, the choice depends on the application.
![Page 7: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/7.jpg)
ML for PL
[Picture from Martin Vechev's slides]
![Page 8: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/8.jpg)
Outline
• Overview
 – Potential applications
 – Intermediate Representation
 – Probabilistic model
• Statistical Code Completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in Program Analysis
• Conclusions
![Page 9: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/9.jpg)
Statistical Code Completion
[SLANG, V. Raychev et al., PLDI 14]
Key insight: Regularities in code are similar to regularities in natural language
![Page 10: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/10.jpg)
Techniques in SLANG
• IR: sequences (sentences)
• Program analysis: typestate analysis, alias analysis
• Trained models: neural network, N-gram language model
• Some smoothing techniques
![Page 11: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/11.jpg)
N-gram language model
The probability of each word is conditioned only on the previous n-1 words.
Training is achieved by counting n-grams.
The time spent on each word encountered in training is constant, so training is usually fast.
Other models used: recurrent neural networks (RNNs). An RNN can learn dependencies beyond the previous few words, but is usually slower.
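The counting-based training described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not SLANG's implementation: the function names (`train_ngram`, `probability`, `complete`) and the toy corpus of API-call "sentences" are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_ngram(sentences, n=3):
    """Count n-grams: one counter update per word, so training is linear in corpus size."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so the first real word also has n-1 words of context.
        padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            counts[context][padded[i]] += 1
    return counts

def probability(counts, context, word):
    """Maximum-likelihood estimate of P(word | previous n-1 words), without smoothing."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())

def complete(counts, context, k=1):
    """Return the k most frequent next words for the given context."""
    seen = counts.get(tuple(context), Counter())
    return [word for word, _ in seen.most_common(k)]

# Hypothetical training data: each "sentence" is a sequence of API calls.
corpus = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]
model = train_ngram(corpus, n=2)
print(probability(model, ["open"], "read"))  # 2 of the 3 calls after "open" are "read"
print(complete(model, ["open"]))
```

An unseen pair such as ("close", "open") gets probability zero here, which is the gap that the smoothing techniques mentioned for SLANG are meant to fill.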
![Page 12: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/12.jpg)
Outline
• Overview
 – Potential applications
 – Intermediate Representation
 – Probabilistic model
• Statistical Code Completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in Program Analysis
• Conclusion
![Page 13: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/13.jpg)
Learning to Recognize Functions in Binary Code
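[Tiffany Bao et al., USENIX Security 14]
Release binaries are usually stripped of function information (symbols), and compiler optimizations such as gcc -O3 change the instruction patterns that mark function starts.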
Can we automatically and accurately recover function information from binaries?
![Page 14: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/14.jpg)
Example: GCC

    #include <stdio.h>

    int fac(int x)
    {
        if (x == 1)
            return 1;
        else
            return x * fac(x - 1);
    }

    void main(int argc, char **argv)
    {
        printf("%d", fac(10));
    }
![Page 15: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/15.jpg)
Example: GCC
• default -O0

    08048443 <main>:
      push %ebp
      mov %esp,%ebp
      and $0xfffffff0,%esp
      sub $0x10,%esp
      …

    0804841c <fac>:
      push %ebp
      mov %esp,%ebp
      sub $0x18,%esp
      cmpl $0x1,0x8(%ebp)
      jne 804842f <fac+0x13>
      mov $0x1,%eax
      …
![Page 16: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/16.jpg)
• -O1

    0804841c <fac>:
      push %ebx
      sub $0x18,%esp
      mov 0x20(%esp),%ebx
      mov $0x1,%eax
      cmp $0x1,%ebx
      …

• -O2

    08048330 <main>:
      mov $0x1,%edx
      mov $0xa,%eax
      lea 0x0(%esi),%esi
      …
      push %ebp
      mov %esp,%ebp
      and $0xfffffff0,%esp
      sub $0x10,%esp
      …
![Page 17: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/17.jpg)
ByteWeight
A machine learning + program analysis approach to function identification
Training:
• Create a model of function start patterns using supervised learning
Usage:
– Use the trained model to match function starts in stripped binaries (Function Start Identification)
– Use program analysis to identify all bytes associated with a function (Function Identification)
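The training and function-start matching steps can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not ByteWeight's implementation: patterns are raw byte prefixes stored in a flat dictionary, the maximum pattern length (`MAX_LEN`) and the 0.5 threshold are illustrative defaults, the names `train`, `score`, and `function_starts` are hypothetical, and the program-analysis step that recovers the remaining bytes of each function is omitted.

```python
from collections import defaultdict

MAX_LEN = 4  # assumed maximum pattern length in bytes (illustrative only)

def train(samples, max_len=MAX_LEN):
    """samples: iterable of (code_bytes, set_of_function_start_offsets).

    Returns {prefix: weight}, where weight is the fraction of occurrences
    of that byte prefix in the training data that begin a function.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for code, starts in samples:
        for off in range(len(code)):
            for k in range(1, max_len + 1):
                prefix = bytes(code[off:off + k])
                if len(prefix) < k:  # ran past the end of the code
                    break
                total[prefix] += 1
                if off in starts:
                    hits[prefix] += 1
    return {p: hits[p] / total[p] for p in total}

def score(weights, code, off, max_len=MAX_LEN):
    """Weight of the longest known prefix starting at `off` (0.0 if none)."""
    best = 0.0
    for k in range(1, max_len + 1):
        prefix = bytes(code[off:off + k])
        if len(prefix) < k or prefix not in weights:
            break
        best = weights[prefix]
    return best

def function_starts(weights, code, threshold=0.5):
    """Offsets whose longest matching pattern has weight >= threshold."""
    return [off for off in range(len(code))
            if score(weights, code, off) >= threshold]

# x86 bytes: 55 = push %ebp, 89 e5 = mov %esp,%ebp, 90 = nop, c3 = ret
training = [
    (bytes.fromhex("5589e590c35589e5c3"), {0, 5}),
    (bytes.fromhex("5589e555c3"), {0}),  # the second 0x55 is not a function start
]
weights = train(training)
stripped = bytes.fromhex("905589e5c355c3")
print(weights[b"\x55"])                    # 0.75: 3 of 4 occurrences start a function
print(function_starts(weights, stripped))  # [1]
```

In this toy data a lone `push %ebp` is ambiguous (weight 0.75), but the longer patterns settle it: `55 89 e5 c3` was always a function start, while `55 c3` never was, so only offset 1 is reported.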
![Page 18: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/18.jpg)
[Picture from Bao's slides]
![Page 19: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/19.jpg)
Outline
• Overview
 – Potential applications
 – Intermediate Representation
 – Probabilistic model
• Statistical Code Completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in Program Analysis
• Conclusions
![Page 20: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/20.jpg)
Problems of Program Analysis
• Programs have unbounded behaviors
• Program analysis
 – Analyze all behaviors
 – Run for a finite time
• In finite time, observe only finite behaviors
• Need to generalize
![Page 21: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/21.jpg)
Generalization in Program Analysis
• Abstract interpretation: widening operator
• CEGAR: interpolants
• Parameter tuning of tools (flow sensitivity, path sensitivity, etc.)
• Lots of folk knowledge, heuristics, ...
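As an illustration of the first bullet, here is a minimal Python sketch of the standard widening operator on the interval domain. The helper names (`join`, `widen`, `analyze_loop`) are hypothetical, and the analyzed loop (`i = init; while ...: i += step`, with the loop guard ignored) is a toy assumption chosen to keep the example short.

```python
import math

def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    """Interval widening: any bound that grew jumps straight to infinity."""
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)

def analyze_loop(init, step, max_iters=100):
    """Iterate the abstract loop body until the interval at the loop head is stable.

    Returns (interval, number_of_iterations).
    """
    state = (init, init)
    for iteration in range(1, max_iters + 1):
        after_body = (state[0] + step, state[1] + step)  # effect of i += step
        candidate = join(state, after_body)
        widened = widen(state, candidate)
        if widened == state:
            return state, iteration
        state = widened
    raise RuntimeError("no fixpoint reached")

print(analyze_loop(0, 1))  # ((0, inf), 2)
```

With plain `join` the chain (0,0), (0,1), (0,2), ... never stabilizes. Widening generalizes from one observed growth step to "the upper bound is unbounded", which terminates in two iterations at the cost of precision.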
![Page 22: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/22.jpg)
Generalization in Machine Learning
• “It’s all about generalization”
• A famous concept in computational learning theory
– Complexity and Feasibility of learning
• Learn a function from observations
• Hope that the function generalizes
![Page 23: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/23.jpg)
Bias-Variance Tradeoffs in Program Analysis
[Sharma, Nori, Aiken, POPL 14]
• Model the generalization process
 – Probably Approximately Correct (PAC) model
 – Bias: empirical error of the best available hypothesis
 – Variance: O(VC dimension)
• Explain known observations with this model
• Use this model to obtain better tools (in ASTREE, the Yogi project, ...)
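For reference, the bias and variance terms can be related through one textbook form of the VC generalization bound (Vapnik's). It is not taken from the slides, and the symbols are introduced here for illustration: $m$ is the number of observations, $d$ the VC dimension of the hypothesis class $\mathcal{H}$ with $d < m$, $\delta$ the allowed failure probability, $R$ the true error, and $\hat{R}_m$ the empirical error.

```latex
% With probability at least 1 - \delta, for every hypothesis h in \mathcal{H}:
R(h) \;\le\; \hat{R}_m(h)
  \;+\; \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) + \ln\frac{4}{\delta}}{m}}
```

When $h$ is the best available hypothesis, the first term plays the role of the bias, and the second term, which grows with $d$, plays the role of the variance. A more expressive analysis (larger $d$) can lower the first term while raising the second.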
![Page 24: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/24.jpg)
Outline
• Overview
 – Potential applications
 – Intermediate Representation
 – Probabilistic model
• Statistical Code Completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in Program Analysis
• Conclusions
![Page 25: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/25.jpg)
Combine ML and PL Research
• Already lots of work in: POPL, PLDI, SOSP, OSDI, USENIX Security, ...
• Many applications and theories remain to be found.
• Combination with other fields: systems, security, ...
![Page 26: Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b0e7f8b9ab05998d73a/html5/thumbnails/26.jpg)
Thank you!