
Page 1:

DeepCoder: Learning to Write Programs

Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, Daniel Tarlow

Presented by Tambet Matiisen
ASE seminar, 20.02.2018

Page 2:

The Task

Inductive Program Synthesis (IPS) - given a set of input-output pairs, generate the (shortest) program that converts the inputs into the outputs.
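For instance (a toy pair in the style of the paper's examples; the concrete numbers are illustrative), a synthesizer given input = [-17, -3, 4, 11, 0, -5, 1, 7] and output = [-12, -20, -68] should recover a program like the one below, shown as plain Python with the corresponding DSL operations in comments:

```python
# Hypothetical IPS instance: a program consistent with the pair
#   input  = [-17, -3, 4, 11, 0, -5, 1, 7]
#   output = [-12, -20, -68]
def program(a):
    b = [x for x in a if x < 0]   # FILTER (<0)
    c = [4 * x for x in b]        # MAP (*4)
    d = sorted(c)                 # SORT
    return d[::-1]                # REVERSE

print(program([-17, -3, 4, 11, 0, -5, 1, 7]))  # -> [-12, -20, -68]
```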

Page 3:

The Idea

● Use a neural network to predict, from the input-output pairs, which functions were used in the program.

● Use those predictions to prioritize an exhaustive search.

Page 4:

The Method

1. Define Domain Specific Language (DSL)
2. Generate programs and input-output examples
3. Train neural network to predict functions from input-output pairs
4. Perform program search using classical methods

Page 5:

The Method

1. Define Domain Specific Language (DSL)
2. Generate programs and input-output examples
3. Train neural network to predict functions from input-output pairs
4. Perform program search using classical methods

Page 6:

Domain Specific Language (DSL)
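The slide shows the paper's DSL, a first-order language over integers and integer arrays with C = 34 functions in total. As a rough stand-in, here is a small illustrative subset rendered as Python equivalents (the names follow the paper; the argument conventions are assumptions):

```python
# An illustrative subset of the DSL, as Python equivalents:
DSL = {
    "HEAD":    lambda xs: xs[0],
    "LAST":    lambda xs: xs[-1],
    "TAKE":    lambda n, xs: xs[:n],
    "DROP":    lambda n, xs: xs[n:],
    "REVERSE": lambda xs: xs[::-1],
    "SORT":    lambda xs: sorted(xs),
    "SUM":     lambda xs: sum(xs),
    "MAP":     lambda f, xs: [f(x) for x in xs],
    "FILTER":  lambda f, xs: [x for x in xs if f(x)],
    "ZIPWITH": lambda f, xs, ys: [f(x, y) for x, y in zip(xs, ys)],
}

# e.g. MAP (*4) (FILTER (<0) a):
print(DSL["MAP"](lambda x: 4 * x, DSL["FILTER"](lambda x: x < 0, [-17, 4, -5])))
# -> [-68, -20]
```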

Page 7:

The Method

1. Define Domain Specific Language (DSL)
2. Generate programs and input-output examples
3. Train neural network to predict functions from input-output pairs
4. Perform program search using classical methods

Page 8:

Generate programs and input-output examples

● Enumerate programs in the DSL.

○ Prune those for which a shorter equivalent program exists (e.g. one with an unused variable).

● Generate inputs and outputs.

○ Constrain outputs to a predetermined range ([-256, 255]) and propagate the constraints back through the program to obtain valid ranges for the inputs. If a range is empty, discard the program.

○ Pick inputs from the pre-computed valid ranges and execute the program to obtain the output values.

● Generate attributes.

○ Use a binary vector to represent which functions were used in the program (see the sketch after this list).
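A minimal sketch of the example and attribute generation, assuming a toy function list and skipping the constraint-propagation step described above:

```python
import random

# FUNCS stands in for all 34 DSL functions.
FUNCS = ["HEAD", "LAST", "TAKE", "DROP", "REVERSE",
         "SORT", "SUM", "MAP", "FILTER", "ZIPWITH"]

def attributes(used):
    """Binary vector marking which DSL functions the program uses."""
    return [1 if f in used else 0 for f in FUNCS]

def make_example(program, low=-256, high=255, length=6):
    """Pick inputs from a valid range and execute the program."""
    xs = [random.randint(low, high) for _ in range(length)]
    return xs, program(xs)

print(attributes({"FILTER", "MAP", "SORT", "REVERSE"}))
# -> [0, 0, 0, 0, 1, 1, 0, 1, 1, 0]
```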

Page 9:

The Method

1. Define Domain Specific Language (DSL)
2. Generate programs and input-output examples
3. Train neural network to predict functions from input-output pairs
4. Perform program search using classical methods

Page 10:

The Network Architecture

“For the encoder we use a simple feed-forward architecture. First, we represent the input and output types (singleton or array) by a one-hot-encoding, and we pad the inputs and outputs to a maximum length L with a special NULL value. Second, each integer in the inputs and in the output is mapped to a learned embedding vector of size E = 20. (The range of integers is restricted to a finite range and each embedding is parametrized individually.) Third, for each input-output example separately, we concatenate the embeddings of the input types, the inputs, the output type, and the output into a single (fixed-length) vector, and pass this vector through H = 3 hidden layers containing K = 256 sigmoid units each. The third hidden layer thus provides an encoding of each individual input-output example. Finally, for input-output examples in a set generated from the same program, we pool these representations together by simple arithmetic averaging.”
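Read as a forward pass, the quoted architecture fits in a few lines of NumPy. This is a sketch under stated assumptions (random stand-in weights, a dedicated NULL embedding row, exactly one array-typed input per example), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
L, E, K, H = 6, 20, 256, 3                  # dimensions from the quote
NULL = 512                                  # extra embedding row for padding
emb = rng.normal(size=(513, E))             # rows 0..511 = integers -256..255

def embed_padded(values):
    """Pad to length L with NULL, then look up and flatten embeddings."""
    idx = [v + 256 for v in values] + [NULL] * (L - len(values))
    return emb[idx].reshape(-1)             # L * E numbers

def type_one_hot(is_array):
    return np.array([0.0, 1.0]) if is_array else np.array([1.0, 0.0])

in_dim = 2 * (2 + L * E)                    # types + input + output = 244
Ws = [rng.normal(size=(K, in_dim))] + [rng.normal(size=(K, K)) for _ in range(H - 1)]
bs = [np.zeros(K) for _ in range(H)]

def encode_example(inp, out):
    """Encode one (array-typed) input-output example into a 256-dim vector."""
    x = np.concatenate([type_one_hot(True), embed_padded(inp),
                        type_one_hot(True), embed_padded(out)])
    for W, b in zip(Ws, bs):                # H = 3 sigmoid layers
        x = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return x

# Pool the examples of one program by simple arithmetic averaging:
examples = [([-17, -3, 4], [-12, -20, -68]), ([1, 2, 3], [])]
z = np.mean([encode_example(i, o) for i, o in examples], axis=0)
print(z.shape)                              # (256,)
```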

Page 11:

Network Inputs

“we represent the input and output types (singleton or array) by a one-hot-encoding”

integer → 1 0
array   → 0 1

Page 12:

“each integer in the inputs and in the output is mapped to a learned embedding vector of size E = 20”

NB! Equivalent to multiplying a one-hot vector with a weight matrix, just faster (checked numerically below).

[Figure: learned embedding table, one row of E = 20 learned values for each integer from -256 to 255]
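The equivalence in the note can be checked numerically; the matrix W below is a random stand-in for the learned embedding table:

```python
import numpy as np

num_ints, E = 512, 20                  # integers -256..255, embedding size
W = np.random.randn(num_ints, E)       # stand-in for the learned matrix

v = -17
idx = v + 256                          # shift into row index 0..511
one_hot = np.zeros(num_ints)
one_hot[idx] = 1.0

# Same vector either way; the lookup just skips the multiplication.
assert np.allclose(one_hot @ W, W[idx])
```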

Page 13:

“we pad the inputs and outputs to a maximum length L with a special NULL value”

Array [-17, -3, 4], padded to L = 6: [-17, -3, 4, NULL, NULL, NULL]

[Figure: the embedding row for each element of the padded array]
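A one-line padding helper, with NULL as a sentinel that (by assumption) gets its own embedding row:

```python
L = 6

def pad(xs, L=L, NULL="NULL"):
    """Pad a list to the fixed maximum length with the NULL sentinel."""
    return xs + [NULL] * (L - len(xs))

print(pad([-17, -3, 4]))   # [-17, -3, 4, 'NULL', 'NULL', 'NULL']
```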

Page 14:

“for each input-output example separately, we concatenate the embeddings of the input types, the inputs, the output type, and the output into a single (fixed-length) vector”

[Figure: the type one-hots (integer → 1 0, array → 0 1) concatenated with the embeddings of the padded input and output into one fixed-length vector]
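For the dimensions on these slides, the length of the concatenated vector works out as follows (assuming one array-typed input and one array-typed output; more inputs would add more blocks):

```python
L, E = 6, 20                     # max length, embedding size
type_oh = 2                      # one-hot over {integer, array}
vec_len = 2 * (type_oh + L * E)  # input block + output block
print(vec_len)                   # 244
```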

Page 15:

“... and pass this vector through H = 3 hidden layers containing K = 256 sigmoid units each”

[Figure: feed-forward network with an input layer x, bias inputs, learned weights w(1), w(2), w(3) and biases b(1), b(2), b(3), and hidden layers h(1), h(2), h(3); the third hidden layer is the encoding or representation (a vector of length 256)]

Page 16:

“... and pass this vector through H = 3 hidden layers containing K = 256 sigmoid units each”

Matrix notation:

h(k) = σ(W(k) · h(k−1) + b(k))

where h(k) is a 256×1 vector, W(k) a 256×256 matrix (learned), and b(k) a 256×1 vector (learned).

Page 17:

“Finally, for input-output examples in a set generated from the same program, we pool these representations together by simple arithmetic averaging.”

sample 1:    0.3  1.8 -0.8  1.3 …
sample 2:    1.1 -0.8 -0.1  2.1 …
sample 3:   -0.2  1.4  0.3 -0.1 …
avg-pooled:  0.4  0.8 -0.2  1.1 …

(each row is a vector of length 256; only the first four values are shown)
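The same pooling, reproduced with the slide's numbers:

```python
import numpy as np

# Only the first 4 of the 256 dimensions are shown.
encodings = np.array([
    [ 0.3,  1.8, -0.8,  1.3],   # sample 1
    [ 1.1, -0.8, -0.1,  2.1],   # sample 2
    [-0.2,  1.4,  0.3, -0.1],   # sample 3
])
print(encodings.mean(axis=0))   # [ 0.4  0.8 -0.2  1.1]
```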

Page 18:

“We use a decoder that pre-multiplies the encoding of input-output examples by a learned CxK matrix, where C = 34 is the number of functions in our DSL...”

[Figure: output layer: the pooled values z and a bias input are combined through learned weights w(4) and biases b(4) into the probabilities of functions (a vector of length 34)]

Page 19:

“... and treats the resulting C numbers as log-unnormalized probabilities (logits) of each function appearing in the source code.”
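A sketch of the decoder under the natural reading of the quote: sigmoids turn the 34 logits into independent per-function probabilities (multi-label, since several functions can appear in one program). Random values stand in for the learned parameters:

```python
import numpy as np

C, K = 34, 256
W_dec = np.random.randn(C, K)     # learned C x K matrix
b_dec = np.zeros(C)               # learned bias (per the slide's figure)
z = np.random.randn(K)            # pooled encoding from the encoder

logits = W_dec @ z + b_dec        # "log-unnormalized probabilities"
probs = 1.0 / (1.0 + np.exp(-logits))
print(probs.shape)                # (34,)
```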

Page 20:

The Final Architecture

Page 21:

“We use the negative cross entropy loss to train the neural network”

Minimizing this loss is equivalent to maximizing the log-likelihood of the true attribute vector; minimization is simply the convention that optimization libraries implement (a sketch of the loss follows).
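A sketch of that loss for the multi-label attribute setting (the sum reduction over functions is an assumption):

```python
import numpy as np

def xent_loss(probs, attrs, eps=1e-9):
    """Cross-entropy between predicted probabilities and binary attributes."""
    probs = np.clip(probs, eps, 1 - eps)
    return -np.sum(attrs * np.log(probs) + (1 - attrs) * np.log(1 - probs))

attrs = np.array([1, 0, 0, 1])            # e.g. FILTER and SORT were used
probs = np.array([0.9, 0.2, 0.1, 0.7])    # network outputs
print(xent_loss(probs, attrs))            # low when predictions match
```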

Page 22:

Training using gradient descent

Take the derivative of the loss function with respect to each weight to see how the loss changes when you change that weight, then change the weight so that the loss decreases.

In matrix notation:

W ← W − α · ∂L/∂W

Here α is a learning rate that must be manually tuned.
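A minimal numerical illustration of this update rule on a toy quadratic loss (not the authors' training code):

```python
import numpy as np

# Minimize L(w) = ||w - t||^2, whose gradient is dL/dw = 2 (w - t).
t = np.array([1.0, -2.0])
w = np.zeros(2)
alpha = 0.1                    # learning rate, manually tuned
for _ in range(100):
    grad = 2 * (w - t)         # derivative of the loss w.r.t. the weights
    w = w - alpha * grad       # the update rule from the slide
print(w)                       # converges to t = [ 1. -2.]
```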

Page 23:

Demo

Page 24:

The Method

1. Define Domain Specific Language (DSL)
2. Generate programs and input-output examples
3. Train neural network to predict functions from input-output pairs
4. Perform program search using classical methods

Page 25:

Program search

● Depth-first search (DFS) - consider the functions ordered by their predicted probabilities from the neural network.

● “Sort and add” enumeration - maintains a set of active functions and performs DFS with the active function set only. Whenever the search fails, the next most probable function (or several) are added to the active set and the search restarts with this larger active set (see the sketch after this list).

● Sketch - SMT-based program synthesis tool. Sketch can utilize the neural network predictions in a Sort and add scheme as described above, as the possibilities for each function hole can be restricted to the current active set.

● λ2 - can be used in our framework using a Sort and add scheme as described above by choosing the library of functions according to the neural network predictions.
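A toy sketch of the “sort and add” loop; `run` (a DSL interpreter) is assumed, and programs are simplified to plain function sequences without arguments or lambdas:

```python
from itertools import product

def sort_and_add(funcs_by_prob, examples, run, max_len, start=3):
    """funcs_by_prob: functions sorted by predicted probability (descending)."""
    k = start
    while k <= len(funcs_by_prob):
        active = funcs_by_prob[:k]           # most probable functions first
        for length in range(1, max_len + 1): # DFS over the active set only
            for prog in product(active, repeat=length):
                if all(run(prog, xs) == ys for xs, ys in examples):
                    return prog              # first consistent program found
        k += 1                               # search failed: widen the set
    return None                              # exhausted the whole DSL
```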

Page 26:

The Experiments

Page 27:

Programs of length T=3

● The test set was guaranteed to be semantically disjoint from all programs on which the neural network was trained (we have ensured that all test programs behave differently from all programs used during training on at least one input).

● As a baseline, we also ran all search procedures using a simple prior as function probabilities, computed from their global incidence in the program corpus.

Page 28:

Programs of length T=5

Page 29:

Example predictions

Page 30:

Embeddings

[Figure]

Page 31:

Confusion Matrix (T=3)

[Figure]

Page 32:

Confusion Matrix (T=5)

[Figure]

Page 33:

Alternative Model

● Encoder: GRU-based RNN
● Decoder: RNN trained to predict the entire program token-by-token
● Beam search was used to explore likely programs predicted by the RNN

It only led to a solution comparable with the other techniques when searching for programs of length T ≤ 2, where the search space size is very small (on the order of 10^3).

We do not rule out that a more sophisticated RNN decoder or training procedure could possibly be more successful.

Page 34:

Thank you!
[email protected]