
Page 1: Deep Learning for Programming

Deep Learning for Programming

Prof. Wray Buntine

http://Bayesian-Models.org

Dept. of Data Sc. & AI, Monash University

College of Eng. & Comp. Sc., Vin University

2021-08-17

Page 2: Deep Learning for Programming

Or Thoughts ... From an Old Guy

⇐= ME (before shaving and haircut)


Page 4: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

A Catalogue of Deep Learning Ideas

Conclusion


Page 5: Deep Learning for Programming

Context

from MIT Technology Review, Karen Hao, 03/11/2020

Page 6: Deep Learning for Programming

Information Retrieval: Images

image search gives ...

Page 7: Deep Learning for Programming

Information Retrieval becomes ML

Information Retrieval (IR): from a single query (image, text), retrieve “best” matching documents, e.g. a text/image search engine.

- this is like learning from just one example!
- called zero/one-shot learning in Deep Neural Networks, especially for images
- state-of-the-art in text IR is hybrid IR and transformer-based language models!
  - see the SIGIR 2018 tutorial by Xu, He and Li
  - see QGenHyb by Ma, Korotkov, Yang, Hall, McDonald, EACL 2021
- to my knowledge IR techniques and earlier one-shot researchers did not intersect!


Page 10: Deep Learning for Programming

Context, cont.

- Deep Learning successes:
  - machine translation, speech understanding
  - bio-informatics (e.g., 3-D protein folding)
  - object/person tracking in video
  } representation learning!
- Machine Learning in the guise of Deep Learning is now happening at a whirlwind pace:
  - a phase shift happened a few years ago,
  - huge teams making rapid advances,
  - results/methods already outdated and superseded at their point of publication.
- How can we employ these techniques in traditional computer science?


Page 13: Deep Learning for Programming

Software Systems adding AI/ML

- source data needs quality management, versioning, etc. ⇐= the biggest problem
- AI system components are entangled in complex ways
- ML knowledge is required of developers to harness the ML tools
- whole new testing and development frameworks are needed
- explainability, fairness, transparency, fault tolerance, ...

30 years ago a similar problem was faced in AI when integrating/developing expert systems with traditional software:

- machine learning was proposed as the solution!
- even then the problem of data quality management was recognised


Page 15: Deep Learning for Programming

AI/ML Dev. Tasks

from “Software Engineering for Machine Learning: A Case Study,” Amershi et al., ICSE 2019


Page 16: Deep Learning for Programming

AI/ML Dev. Pipeline

from “Software Engineering for Machine Learning: A Case Study,” Amershi et al., ICSE 2019

different to traditional agile or other methodologies!

near identical to the data science dev. pipeline from 2013, and they’ve had huge trouble integrating into the career/HR environment


Page 18: Deep Learning for Programming

Programming Languages for AI/ML

- support for building & testing ML models
- libraries for:
  - streaming data
  - hyper-parameter optimisation
  - probability components
  - matrices and tensors
  - auto-differentiation (a small example follows this list)
  - model components (convolutions, transformers, etc.)
- probabilistic programming
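The list above names auto-differentiation as one of the core library facilities. As a minimal illustration (assuming PyTorch as the example library, which is not named on this slide):

```python
# Reverse-mode auto-differentiation in a few lines: define a computation,
# then ask the library for the gradient with respect to its input.
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x      # y = x^3 + 2x
y.backward()            # reverse-mode autodiff
print(x.grad)           # dy/dx = 3x^2 + 2 = 14.0 at x = 2
```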

Page 19: Deep Learning for Programming

AI/ML for Programming Languages

- debugging, testing and code reviews
- comment checking and generation
- code search and matching
- auto-generating software
- smart compilation
- auto-generating distributed and/or low-level code

Page 20: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

A Catalogue of Deep Learning Ideas

Conclusion


Page 21: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

Old ML

Deep ML Reminders

New (Deep) ML

A Catalogue of Deep Learning Ideas

Conclusion

Page 22: Deep Learning for Programming

What did We Learn from Old ML?

- how to do exponential family, linear and partitioning models well
  e.g., XGBoost, online LDA, Gaussian mixtures, logistic regression
- plus how to augment them with fancy mathematical tricks
  e.g., kernels, non-parametric Bayesian (i.e., infinite vectors)
  i.e., cases where learning admits computational simplifications
- models as first-class objects
  e.g., Bayesian & Markov networks, neural networks
- statistical ML theory
  e.g., bias-variance tradeoff, variational methods, ensembles
  e.g., Bayesian MCMC as “gold standard” learning


Page 25: Deep Learning for Programming

What did We Learn, cont.?

- cost functions (using statistical ML theory)
  e.g., regularisation, metrics and divergences
  i.e., the theory of objective functions for deep NNs
- paradigms
  e.g., online learning, active learning, transfer learning
- learning on networks
  - deterministic networks
  - probabilistic networks


Page 27: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

Old ML

Deep ML Reminders

New (Deep) ML

A Catalogue of Deep Learning Ideas

Conclusion

Page 28: Deep Learning for Programming

A Non-trivial Network

see playground.tensorflow.org


Page 29: Deep Learning for Programming

Deep Representations

from https://thedatascientist.com/what-deep-learning-is-and-isnt/

Observation: different layers of the network “learn” alternative representations.

Page 30: Deep Learning for Programming

Deep Architectures

from “Memory-Based Network for Scene Graph with Unbalanced Relations”, MM’20, Wang et al.

Complex systems are built from neural modules.

Page 31: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

Old ML

Deep ML Reminders

New (Deep) ML

A Catalogue of Deep Learning Ideas

Conclusion

Page 32: Deep Learning for Programming

How Does Deep Learning Innovate?

- Model/spec-driven black-box algorithms ease the workload of developers.
  - machine learning without statistics!
- Porting down to GPUs or multi-core allows real speed.
- Learning representations and discovering higher-order concepts.
  - convolutions, structures, sequences, ...
  - high capacity makes them very flexible in fitting and does implicit parallel search
- Self-supervision
  i.e., pre-training networks with fabricated but useful tasks
- Allows “modelling in the large”:
  - multi-task learning, imitation learning, meta-learning
  - components like convolutions, structures, sequences, ...


Page 37: Deep Learning for Programming

Why are Deep Neural Nets Successful?

from Stanford Uni. NLP course, CS224N/Ling284 2020

Page 38: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

A Catalogue of Deep Learning Ideas

Conclusion

19 / 42

Page 39: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

A Catalogue of Deep Learning Ideas

Ensembling

Data Augmentation

Self-Supervision

Probabilistic Programming

Conclusion

Page 40: Deep Learning for Programming

Ensembling

- Have a set of (x, y) data $\mathrm{Data}$ and network parameters $\theta \in \Theta$.
- The Bayesian posterior prediction of $y$ given a new test point $x$:
  $$\int_{\theta \in \Theta} \underbrace{p(y \mid x, \theta)}_{\text{prediction using parameters } \theta} \cdot \underbrace{p(\theta \mid \mathrm{Data})}_{\text{posterior on parameters } \theta} \, d\theta$$
- The gold-standard algorithm is complex MCMC.
- The simple ensemble says to train a small set of “good but different” models $\hat\Theta$ and pool them together:
  $$\frac{1}{|\hat\Theta|} \sum_{\theta \in \hat\Theta} p(y \mid x, \theta)$$
- First done as a student’s masters project at the University of Sydney in 1988.
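To make the pooled prediction concrete, here is a minimal sketch (not from the slides) that trains a few small networks from different random seeds and averages their predictive probabilities; the toy dataset and the scikit-learn model are illustrative assumptions.

```python
# A minimal deep-ensemble sketch: train "good but different" models by
# varying the random seed, then pool their predictions by simple averaging.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_test, y_test = make_moons(n_samples=200, noise=0.25, random_state=1)

ensemble = []
for seed in range(5):                      # a small set of models, Theta-hat
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                          random_state=seed)  # different initialisation per member
    ensemble.append(model.fit(X, y))

# Pooled prediction: (1/|Theta-hat|) * sum over theta of p(y | x, theta)
probs = np.mean([m.predict_proba(X_test) for m in ensemble], axis=0)
print("ensemble accuracy:", np.mean(probs.argmax(axis=1) == y_test))
```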


Page 42: Deep Learning for Programming

The (Frequentist) Laws of Learning

- The First Law of Learning for model family $H$ (Geman, Bienenstock & Doursat, 1992):
  $$\text{Mean-Squared-Error}(H) = \text{Bias}(H)^2 + \text{Variance}(H) + \text{IrreducibleError}$$
- The Second Law of Learning for an ensemble $H$ of models (Ueda & Nakano, 1996):
  $$\text{Mean-Squared-Error}(H) = \text{Bias}(H)^2 + \frac{1}{|H|}\text{Variance}(H) + \left(1 - \frac{1}{|H|}\right)\text{Covariance}(H) + \text{IrreducibleError}$$
- Therefore:
  - elements of the ensemble should have lower bias and variance
  - elements of the ensemble should be de-correlated
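As a sanity check on the Second Law (my own toy simulation, not from the slides), the decomposition can be verified numerically: simulate correlated, biased estimators, average them, and compare the ensemble's mean-squared error with the right-hand side of the formula (irreducible error is taken as zero here).

```python
# Numerical check of MSE = Bias^2 + Var/|H| + (1 - 1/|H|)*Cov for an
# ensemble of M correlated estimators, averaged over many "training sets".
import numpy as np

rng = np.random.default_rng(0)
true_value = 2.0
M, trials = 5, 20000

shared = rng.normal(0.0, 0.5, size=(trials, 1))    # correlated part
own = rng.normal(0.0, 0.8, size=(trials, M))       # member-specific part
members = true_value + 0.3 + shared + own          # +0.3 is a common bias

ensemble = members.mean(axis=1)
mse = np.mean((ensemble - true_value) ** 2)

bias2 = (members.mean() - true_value) ** 2
var = members.var(axis=0).mean()                   # average member variance
cov = np.cov(members, rowvar=False)[~np.eye(M, dtype=bool)].mean()

print("ensemble MSE      :", round(mse, 4))
print("decomposition RHS :", round(bias2 + var / M + (1 - 1 / M) * cov, 4))
```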


Page 45: Deep Learning for Programming

Understanding the Second Law

[figure: random versus quasi-random points]

- estimating $\int_{\theta \in \Theta} p(y \mid x, \theta) \cdot p(\theta \mid \mathrm{Data}) \, d\theta$
- the random points clump “statistically”
- the quasi-random points are de-correlated with little clumping
  e.g., determinantal point processes (like some adversarial training)
  e.g., Stein variational gradient descent
- the ensemble $\sum_{\theta \in \hat\Theta} p(y \mid x, \theta)$ works better using quasi-random points
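A small illustration of why de-correlated points help when estimating an integral (my own toy, not from the slides; the integrand is arbitrary): plain Monte Carlo points clump, while a low-discrepancy (quasi-random) sequence spreads out and typically gives a smaller error for the same number of points.

```python
# Estimate the integral of f over [0, 1] with random points versus a
# van der Corput low-discrepancy sequence, and compare absolute errors.
import numpy as np

def van_der_corput(n, base=2):
    """First n points of the base-`base` van der Corput sequence."""
    points = []
    for i in range(1, n + 1):
        q, denom, x = i, 1.0, 0.0
        while q > 0:
            denom *= base
            q, digit = divmod(q, base)
            x += digit / denom
        points.append(x)
    return np.array(points)

f = lambda x: np.exp(-3 * x) * np.sin(7 * x)        # any smooth integrand
exact = f(np.linspace(0, 1, 200001)).mean()         # fine-grid reference value

n = 256
rng = np.random.default_rng(0)
mc = f(rng.uniform(size=n)).mean()                  # random points clump
qmc = f(van_der_corput(n)).mean()                   # de-correlated points

print("Monte Carlo error :", abs(mc - exact))
print("quasi-random error:", abs(qmc - exact))
```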


Page 47: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

A Catalogue of Deep Learning Ideas

Ensembling

Data Augmentation

Self-Supervision

Probabilistic Programming

Conclusion

Page 48: Deep Learning for Programming

Data Augmentation

from Ahmad, Muhammad and Baik, PLoS ONE 2017

IDEA: classified data is hard to get, so let’s generate new data!

Page 49: Deep Learning for Programming

Data Augmentation: MNIST

- technique first used in zipcode recognition for the US Post Office
- images can be rotated, shifted, thinned, etc. (see the sketch below)
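A minimal sketch of this kind of augmentation (my own illustration; the crude synthetic "digit" and the parameter ranges are assumptions): each image yields several randomly rotated and shifted variants that keep the original label.

```python
# MNIST-style augmentation: small rotations and translations that should
# leave the digit's identity invariant.
import numpy as np
from scipy.ndimage import rotate, shift

def augment(image, label, n_variants=4, rng=None):
    rng = rng or np.random.default_rng()
    variants = [(image, label)]
    for _ in range(n_variants):
        angle = rng.uniform(-15, 15)                 # degrees
        dx, dy = rng.uniform(-2, 2, size=2)          # pixels
        new = rotate(image, angle, reshape=False, mode="nearest")
        new = shift(new, (dy, dx), mode="nearest")
        variants.append((new, label))
    return variants

# usage with a crude synthetic 28x28 "digit"
fake_digit = np.zeros((28, 28))
fake_digit[8:20, 12:16] = 1.0
augmented = augment(fake_digit, label=1, rng=np.random.default_rng(0))
print(len(augmented), "training examples from one original image")
```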

Page 50: Deep Learning for Programming

Data Augmentation: Theory

- We need to identify an invariant in our data that we want to hold.
  e.g. for text, we could replace synonyms, back-translate, etc.
- We need an algorithm to apply changes to the data reflecting the invariant.
- Probabilistic model, for the augmentation distribution $\mathrm{Aug}$:
  $$p(y_i \mid x_i, \theta) \qquad \text{standard data likelihood}$$
  $$p(y_i \mid x_i, \theta, \mathrm{Aug}) = \int_{x'} p(y_i \mid x', \theta)\, p(x' \mid x_i, \mathrm{Aug}) \, dx' \qquad \text{augmented data likelihood}$$
  (the augmentation acts like a convolution)
- Developed as vicinal risk minimisation by Chapelle et al. 2001.


Page 52: Deep Learning for Programming

Data Augmentation: Implementation

- Probabilistic model, for the augmentation distribution $\mathrm{Aug}$:
  $$p(y_i \mid x_i, \theta, \mathrm{Aug}) = \int_{x'} p(y_i \mid x', \theta)\, p(x' \mid x_i, \mathrm{Aug}) \, dx' \qquad \text{augmented data likelihood}$$
  $$\approx \sum_{x' \in E(x_i)} p(y_i \mid x', \theta) \qquad \text{ensembled approximation}$$
  using an ensemble of augmentations $E(x_i)$ generated with $p(x' \mid x_i, \mathrm{Aug})$
- At learning time we train stochastically with one or more augmentations.
- At inference time we should also do augmentations.
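A sketch of the inference-time side of this (test-time augmentation); `model.predict_proba` and `augment_fn` are placeholders for whatever classifier and augmentation routine are in use, so this is an assumption about the interface rather than a prescribed API.

```python
# Approximate p(y | x, theta, Aug) by averaging the model's predictive
# probabilities over an ensemble E(x) of augmented views of the input.
import numpy as np

def predict_with_augmentation(model, x, augment_fn, n_aug=8, rng=None):
    rng = rng or np.random.default_rng()
    views = [x] + [augment_fn(x, rng) for _ in range(n_aug)]       # E(x)
    probs = np.stack([model.predict_proba(v[None, ...])[0] for v in views])
    return probs.mean(axis=0)       # pooled class probabilities
```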

Page 53: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

A Catalogue of Deep Learning Ideas

Ensembling

Data Augmentation

Self-Supervision

Probabilistic Programming

Conclusion

Page 54: Deep Learning for Programming

Encoding, Decoding and Embeddings

from “Representation Learning on Graphs: Methods and Applications”, Hamilton, Ying and Leskovec, 2017

Our operational framework: every piece of input structure has two embeddings, contextual and non-contextual.

Page 55: Deep Learning for Programming

Self-supervision

- to do pre-training, we need a task for learning
- embedding methods (CBOW, Word2Vec, etc.) are an early precursor
- pre-training should build lower-level features useful for subsequent target classes

Self-supervision: an artificial task created for the purpose of learning a network that is useful as a pre-trained network.

- called “self”-supervised since the task is created automatically


Page 57: Deep Learning for Programming

Self-supervision, cont.

Self-supervision: an artificial task created for pre-training a network.

- examples for image recognition:
  - colour a B/W image
  - fill in missing patches (“image in-painting”)
  - object classification from a very broad image class
- examples for text classification:
  - predict missing words
  - forwards and backwards ordering
- generally, the self-supervision task should be (1) richer and more refined than, or (2) similar to, subsequent target tasks


Page 59: Deep Learning for Programming

Pseudo-likelihood and Self-supervision

A pseudo-likelihood (Besag 1975) is an approximation to the joint probability distribution of a collection of random variables using univariate conditionals:

$$p(X \mid M, \Theta) = \prod_{i=1}^{I} p(x_i \mid x_1, \ldots, x_{i-1}, M, \Theta) \qquad \text{(product rule of probability)}$$

$$= p(x_1 \mid M, \Theta)\, p(x_2 \mid x_1, M, \Theta) \cdots \underbrace{p(x_I \mid x_1, \ldots, x_{I-1}, M, \Theta)}_{\text{repeat this term}}$$

$$\tilde p(X \mid M, \Theta) \equiv \prod_{i=1}^{I} \underbrace{p(x_i \mid X_{-i}, M, \Theta)}_{\text{each term is a single prediction}} \qquad \text{where } X_{-i} = X \setminus \{x_i\}$$

Page 60: Deep Learning for Programming

Pseudo-likelihood, cont.

p̃(X |M,Θ) ≡I∏

i=1p(xi |X−i ,M,Θ) where X−i = X \ {xi}

I it is easily computed because the individual conditionals can usually be easily

marginalised.

I maximising pseudo-likelihood is known to be consistent with maximising likelihood

in the limit of infinite data.

I Pseudo-likelihood can be viewed as a simplified theoretical view of self-supervision.

I But an alternative view is of representation learning.30 / 42

Page 61: Deep Learning for Programming

Self-supervision with Natural Language

- stochastic pseudo-likelihood training on
  $$\mathrm{PLL}(W; \theta) = \sum_{t=1}^{|W|} \log p(w_t \mid W_{-t}, \theta)$$
- may add other prediction objectives about sentence ordering, etc.
- when the model is multi-layer transformers, this is the basis of the biggest revolution in NLP history (BERT, XLNet, ...)

figure from Salazar, Liang et al., ACL 2020
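A short sketch of scoring a sentence by pseudo-log-likelihood with a masked language model, in the spirit of Salazar et al. (ACL 2020); the choice of `bert-base-uncased` and the Hugging Face interface are illustrative assumptions, not something fixed by the slides.

```python
# PLL scoring: mask each position in turn and sum the log-probability the
# masked language model assigns to the true token, i.e. log p(w_t | W_{-t}).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for t in range(1, len(ids) - 1):                # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[t] = tokenizer.mask_token_id         # hide w_t, keep W_{-t}
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, t]
        total += torch.log_softmax(logits, dim=-1)[ids[t]].item()
    return total

print(pseudo_log_likelihood("deep learning helps programming tools"))
```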

Page 62: Deep Learning for Programming

Graph Neural Networks

from “Representation Learning on Graphs: Methods and Applications”, Hamilton, Ying and Leskovec, 2017

- techniques similar to word embeddings have been developed for graphs
- nodes in graphs also have extensive side information
  - a “hospital patient” node in a graph may have their socio-economic data attached
- pseudo-likelihood style self-supervision applies equally well
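As a concrete picture of how node embeddings pick up graph context, here is a minimal single message-passing layer (my own simplified GraphSAGE-style sketch with mean aggregation; real graph-NN libraries differ in the details).

```python
# One graph-NN layer: each node's new embedding mixes its own features with
# the mean of its neighbours' features, then applies a ReLU.
import numpy as np

def gnn_layer(node_features, adjacency, w_self, w_neigh):
    """node_features: (N, d); adjacency: (N, N) of 0/1; w_*: (d, d_out)."""
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    neighbour_mean = (adjacency @ node_features) / degree        # aggregate
    return np.maximum(node_features @ w_self + neighbour_mean @ w_neigh, 0.0)

# toy graph: 4 nodes on a path 0-1-2-3, 3 input features, 2 output features
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
X = rng.normal(size=(4, 3))
H = gnn_layer(X, A, rng.normal(size=(3, 2)), rng.normal(size=(3, 2)))
print(H.shape)   # (4, 2) contextual node embeddings
```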

Page 63: Deep Learning for Programming

Summary

- Arbitrary graph structures with embedded text can be modelled using this framework.
- Self-supervision works, but:
  - it pays to be clever about the pseudo-likelihood prediction tasks
- These self-supervision encoder-decoder networks form ideal pre-trained networks for subsequent prediction tasks.
  - this can be done for written code and comments

Page 64: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

A Catalogue of Deep Learning Ideas

Ensembling

Data Augmentation

Self-Supervision

Probabilistic Programming

Conclusion

Page 65: Deep Learning for Programming

Bayes infer. Using Gibbs Sampling

BUGS, Spiegelhalter, Thomas, Best, Gilks, 1996

Modelling language:

model {
  # model priors
  beta0 ~ dnorm(0, 0.001)
  beta1 ~ dnorm(0, 0.001)
  tau ~ dgamma(0.1, 0.1)
  sigma <- 1/sqrt(tau)
  # data model, linear regression
  for (i in 1:n) {
    mu[i] <- beta0 + beta1*x[i]
    y[i] ~ dnorm(mu[i], tau)
  }
}

- Simple Bayesian linear regression using the Gaussian model $\vec{y} = \beta_0 + \beta_1 \vec{x}$.
- All constants, parameters and data are defined in the language.
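To show the kind of algorithm BUGS generates from that declaration, here is a hand-rolled Gibbs sampler for the same model (my own sketch; the synthetic data are an assumption, the hyperparameters mirror the priors above, and the conjugate conditionals are derived by hand).

```python
# Gibbs sampling for Bayesian linear regression:
#   y_i ~ Normal(beta0 + beta1*x_i, 1/tau),  beta* ~ Normal(0, precision 0.001),
#   tau ~ Gamma(0.1, 0.1)  (shape, rate)
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-2, 2, n)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, n)          # synthetic data

beta0, beta1, tau = 0.0, 0.0, 1.0
prior_prec, a0, b0 = 0.001, 0.1, 0.1
samples = []
for it in range(5000):
    # beta0 | beta1, tau, data  (Normal conditional)
    prec = prior_prec + tau * n
    beta0 = rng.normal(tau * np.sum(y - beta1 * x) / prec, 1 / np.sqrt(prec))
    # beta1 | beta0, tau, data  (Normal conditional)
    prec = prior_prec + tau * np.sum(x ** 2)
    beta1 = rng.normal(tau * np.sum(x * (y - beta0)) / prec, 1 / np.sqrt(prec))
    # tau | beta0, beta1, data  (Gamma conditional; numpy's scale = 1/rate)
    resid = y - beta0 - beta1 * x
    tau = rng.gamma(a0 + n / 2, 1 / (b0 + 0.5 * np.sum(resid ** 2)))
    if it >= 1000:                                  # discard burn-in
        samples.append((beta0, beta1, 1 / np.sqrt(tau)))

b0_mean, b1_mean, sigma_mean = np.mean(samples, axis=0)
print(f"posterior means: beta0={b0_mean:.2f} beta1={b1_mean:.2f} sigma={sigma_mean:.2f}")
```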


Page 67: Deep Learning for Programming

Probabilistic Programming

- used to define probability models for automatic algorithm generation
- provides statistical constructs that users can directly call and use
- predominantly declarative: focus on what needs to be done over how
- can provide support for a plethora of operations:
  - Gibbs, Hamiltonian Monte Carlo and other samplers
  - variational algorithms
  - black-box optimisation

this section largely drawn from the PhD thesis of Dr. Sachith Seneviratne

Page 68: Deep Learning for Programming

Probabilistic Programming: Examples

BUGS uses a directed acyclic graph (DAG) to represent a model (1995)

AutoBayes uses Schema to compose models (1998)

Stan uses Hamiltonian Monte Carlo and targets continuous sampling (2012)

Edward CPU/GPU support with various inference schemes (2016)

Google TensorFlow Probability scale performance across CPUs, GPUs, and TPU,

within TensorFlow

Initially developed independently of Deep Learning, but has similar goals, and similar

techniques, and there is now convergence.

36 / 42


Page 70: Deep Learning for Programming

Outline

Context

Old ML and Stats versus New ML

A Catalogue of Deep Learning Ideas

Conclusion


Page 71: Deep Learning for Programming

Example Uses


Page 72: Deep Learning for Programming

Code is Different to Natural Language

- NLP is “natural” and extremely hard to parse, with new or erroneous tokens appearing
  - the majority of deep nets treat it like a sequence, not a structure
- code is easy to parse with precise structure, except for
  - the language buried in comments,
  - and variable names
- we even have auxiliary information like modification time, author, etc.

Code has rich structures and auxiliary information as well as embedded text.

Page 73: Deep Learning for Programming

Pre-training for Code

- use graphs to represent the structure of code, model with graph NNs
- harness standard language models for the comments
- harness standard tokenisation models for variable names (WordPiece, etc.)
- develop clever self-supervision tasks (see the sketch after this list):
  - predict comments
  - predict control flow
  - predict expressions
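One possible flavour of such a task (my own sketch, not one prescribed by the slides): mask identifier tokens in a snippet and ask a model to predict them, which turns any corpus of code into (masked source, target) training pairs.

```python
# Build identifier-masking self-supervision examples from a Python snippet
# using the standard-library tokenizer; keywords are left unmasked.
import io
import keyword
import tokenize

def identifier_masking_examples(source: str, mask="<MASK>"):
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    skip = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER)
    names = [t for t in tokens
             if t.type == tokenize.NAME and not keyword.iskeyword(t.string)]
    examples = []
    for target in names:
        masked = [mask if t is target else t.string
                  for t in tokens if t.type not in skip]
        examples.append((" ".join(masked), target.string))
    return examples

snippet = "def area(radius):\n    return 3.14159 * radius * radius\n"
for masked_src, answer in identifier_masking_examples(snippet)[:2]:
    print(answer, "<-", masked_src)
```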

Page 74: Deep Learning for Programming

Variants in Probabilistic Programming

Sachith Seneviratne’s PhD thesis, 2020

- high-level code generated from models
- variant algorithms (part Gibbs, part variational, different probability formulations)
- variant implementations (different parts in multi-core, GPU)
- automated testing harness to “race” techniques
  - a la “Using machine learning to focus iterative optimization,” Agakov, Bonilla, et al., Int. Symp. on Code Gen. and Optim., 2006

This needs the code generation and automated testing infrastructure to make it work.


Page 76: Deep Learning for Programming

Another Application

generating clickbait for developers

who can tell me where this image is grabbed from?


Page 78: Deep Learning for Programming

Questions?
