
StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

Ziyu Yao

In collaboration with Prof. Daniel S. Weld (UW), Dr. Wei-Peng Chen (Fujitsu Lab), Prof. Huan Sun (OSU).

The Web Conference 2018, May 25th

The Ohio State University

Mapping between natural language and programming language


Question: “how to clone or copy a python list?”

Code: “new_list = copy.copy(old_list)”

e.g., automated code search/annotation/generation.


Challenges

- Lack of large-scale datasets for model development, i.e., pairs of <natural language question, code snippet>.


- And datasets are important:



This work: StaQC

StaQC: the largest dataset to date of ~148K Python and ~120K SQL “how-to-do-it”* Question-Code pairs!

* “how-to-do-it” questions [Souza et al., 2014; Delfim et al., 2016]: the questioner provides a scenario and asks how to implement it.

Example:

Question: “how to clone or copy a python list?”

Code: “new_list = copy.copy(old_list)”



This work: StaQC



Continuously growing in size and diversity



Diversity of StaQC

- Containing multiple code solutions to the same question.
  - Question: "How to limit a number to be within a specified range? (Python)"
  - 4 code solutions in StaQC:

(1)
    def clamp(n, minn, maxn):
        return max(min(maxn, n), minn)

(2)
    clamp = lambda n, minn, maxn: max(min(maxn, n), minn)

(3)
    n = minn if n < minn else maxn if n > maxn else n

(4)
    def clamp(n, minn, maxn):
        if n < minn:
            return minn
        elif n > maxn:
            return maxn
        else:
            return n
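As a quick check that the four answers really are interchangeable, the sketch below wraps each one as a function and compares them on a few inputs. The names `clamp_v1` to `clamp_v4`, `ALL_VARIANTS`, and `variants_agree` are illustrative and not part of StaQC; answer (3) is an expression in the original, so it is wrapped in a function here.

```python
# Illustrative wrappers around the four answers shown above.

def clamp_v1(n, minn, maxn):
    return max(min(maxn, n), minn)


clamp_v2 = lambda n, minn, maxn: max(min(maxn, n), minn)


def clamp_v3(n, minn, maxn):
    # Answer (3) is a bare conditional expression; wrapped for comparison.
    n = minn if n < minn else maxn if n > maxn else n
    return n


def clamp_v4(n, minn, maxn):
    if n < minn:
        return minn
    elif n > maxn:
        return maxn
    else:
        return n


ALL_VARIANTS = [clamp_v1, clamp_v2, clamp_v3, clamp_v4]


def variants_agree(n, minn, maxn):
    """Return True when every variant produces the same clamped value."""
    results = {f(n, minn, maxn) for f in ALL_VARIANTS}
    return len(results) == 1
```

For example, `variants_agree(42, 0, 10)` compares all four on an input above the range.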



Diversity of StaQC

- Containing different questions asking for semantically similar code solutions.






Critical for Model Robustness:

1. Natural language variation.

2. Different implementations of the same functionality in a programming language.

This work: StaQC




A better source for constructing models mapping between NL and PL






Example

[Figure: a Stack Overflow question and its accepted answer post, which contains three code blocks (Code block 1, Code block 2, Code block 3).]

Should we pair <Question, Code block n>?

Is Code block n a “standalone” solution to the question?

“Standalone” code solution

By looking at Code block n, can you solve the problem?

No! (It shows the usage, but with no details of the function.)



Example: Question-Code pairs

[Figure: the question paired with each of the three code blocks. Two blocks are marked as standalone solutions (☑); one is marked as not a standalone solution (✖) because it only shows usage, with no details of the function.]



Previous methods: heuristics based

- “Select All”: Taking all code snippets in the answer post as code solutions. [Allamanis et al., 2015][Zilberstein and Yahav, 2016]
  - Low precision
- “Select First”: Taking only the first code snippet in the answer post as a code solution, or considering only answer posts containing exactly one code snippet. [Iyer et al., 2016]
  - Low recall

[Figure: the snippets selected by each heuristic, compared with the ground truth.]
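The precision/recall trade-off of the two heuristics can be illustrated with a minimal sketch. The function names and the three-snippet ground truth below are hypothetical, chosen only to show why selecting everything hurts precision and selecting only the first snippet hurts recall.

```python
def select_all(num_snippets):
    """Select-All: label every snippet in the answer post as a solution."""
    return [True] * num_snippets


def select_first(num_snippets):
    """Select-First: label only the first snippet as a solution."""
    return [i == 0 for i in range(num_snippets)]


def precision_recall(predicted, truth):
    """Compute (precision, recall) for boolean label lists of equal length."""
    true_positives = sum(p and t for p, t in zip(predicted, truth))
    predicted_positives = sum(predicted)
    actual_positives = sum(truth)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# Hypothetical ground truth for one post with three snippets,
# where the second snippet is only a usage demo.
truth = [True, False, True]

all_precision, all_recall = precision_recall(select_all(3), truth)
first_precision, first_recall = precision_recall(select_first(3), truth)
```

On this toy post, Select-All reaches full recall but includes the usage demo, while Select-First is precise but misses the second real solution.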



Our solution: A systematic framework

- Binary classification formulation:
  - Input: A question on Stack Overflow and its accepted answer post with multiple code snippets.
  - Output: A binary label for each code snippet on whether it is a standalone solution to the question.



A bi-view formulation

[Figure: an answer post viewed as interleaving text blocks (𝑺𝟏, 𝑺𝟐, 𝑺𝟑) and code blocks (𝑪𝟏, 𝑪𝟐, 𝑪𝟑).]



Text-based view: contextual hints

[Figure: the text blocks (𝑺𝟏, 𝑺𝟐, 𝑺𝟑) around the code blocks (𝑪𝟏, 𝑪𝟐, 𝑪𝟑) give contextual hints; two of the code blocks are marked as more likely to be a code solution.]



Code-based view: semantics of code content

[Figure: the code blocks (𝑪𝟏, 𝑪𝟐, 𝑪𝟑) judged by their content; one is possibly a usage demo, and two are more likely to be a solution.]



Formulation for each code snippet

Predict a “solution or not” label for a code snippet (here 𝑪𝟐) based on:
1. Textual context (text view): 𝑺𝟐, 𝑺𝟑.
2. Code content (code view): 𝑪𝟐.
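One way to picture this formulation is to turn an answer post into one classification instance per code snippet, each carrying the question, the text block before the snippet, the snippet itself, and the text block after it. This is a minimal sketch: `SnippetInstance`, `build_instances`, and the use of an empty string for missing context are assumptions made for illustration, not the authors' data format.

```python
from dataclasses import dataclass


@dataclass
class SnippetInstance:
    question: str      # the Stack Overflow question
    text_before: str   # text view: the text block preceding the snippet
    code: str          # code view: the snippet content
    text_after: str    # text view: the text block following the snippet


def build_instances(question, blocks):
    """Build one instance per code block.

    `blocks` is a list of (kind, content) pairs in post order, where kind
    is "text" or "code". Missing context is padded with "" (an assumption).
    """
    instances = []
    for idx, (kind, content) in enumerate(blocks):
        if kind != "code":
            continue
        has_before = idx > 0 and blocks[idx - 1][0] == "text"
        has_after = idx + 1 < len(blocks) and blocks[idx + 1][0] == "text"
        before = blocks[idx - 1][1] if has_before else ""
        after = blocks[idx + 1][1] if has_after else ""
        instances.append(SnippetInstance(question, before, content, after))
    return instances


# Placeholder post mirroring the S1, C1, S2, C2, S3, C3 layout.
post = [("text", "S1"), ("code", "C1"), ("text", "S2"),
        ("code", "C2"), ("text", "S3"), ("code", "C3")]
instances = build_instances("q", post)
```

Here `instances[1]` pairs C2 with S2 and S3, matching the example on the slide.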



Bi-View Hierarchical Neural Network (BiV-HNN)

[Figure: BiV-HNN architecture with three stages: token-level encoder, block-level encoder, and code label prediction. Components shown: Bi-GRU encoders over a sequence of word tokens and a sequence of code tokens, concat and feedforward layers, a block-level Bi-GRU, and a Softmax output. 𝑞: question.]



Token-level encoder for text blocks

[Figure: the BiV-HNN diagram, focusing on the token-level encoder (Bi-GRU) for text blocks.]

Token-level encoder for code blocks

[Figure: the BiV-HNN diagram, focusing on the token-level encoder (Bi-GRU) for code blocks.]

Block-level encoder

[Figure: the BiV-HNN diagram, focusing on the block-level encoder.]

Code label prediction

[Figure: the BiV-HNN diagram, focusing on the Softmax layer for code label prediction.]

Experimental Setup

- Manually annotating “solution or not” label on code snippets in one answer post.
  - Python and SQL domain.
  - Four undergraduates with substantial Cohen’s kappa agreement.
- Training/validation/test split: 60% - 20% - 20%.

                                     Python   SQL
# of Question-Code pairs             4,884    3,637
% of positive Question-Code pairs    44%      57%
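For readers unfamiliar with the agreement measure, the following is a minimal sketch of Cohen's kappa for two annotators, together with a 60/20/20 split. The slides do not say how agreement among four annotators was aggregated or how the split was drawn, so the function names, the toy labels, and the seeded shuffle are assumptions for illustration only.

```python
import random


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def split_60_20_20(items, seed=0):
    """Shuffle and split items into train/validation/test (60/20/20)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end, valid_end = n * 6 // 10, n * 8 // 10
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]


# Toy "solution or not" labels from two hypothetical annotators.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohen_kappa(annotator_a, annotator_b)
train, valid, test = split_60_20_20(range(10))
```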



Main Results

- Heuristic methods: Select-First, Select-All.
- Feature engineering based methods:
  - Logistic Regression (LR), Support Vector Machine (SVM).
  - Features: text-based (uni-/bi-grams, the connectives, etc.) and code-based (code tokens, etc.).

Comparison on F1:

               Python   SQL
Select-First   0.607    0.613
Select-All     0.642    0.737
LR             0.766    0.846
SVM            0.753    0.850
BiV-HNN        0.841    0.888



Research questions for understanding BiV-HNN

Q1: text view, code view, or bi-view?

Q2: hierarchical structure or flat structure?

Q3: block-level encoder: sequential or feedforward?



Q1: text view, code view, or bi-view?

[Figure: three model variants compared: Text-HNN (text view), Code-HNN (code view), and BiV-HNN (bi-view).]

Comparison on F1:

           Python   SQL
Text-HNN   0.771    0.840
Code-HNN   0.812    0.851
BiV-HNN    0.841    0.888



Model combination

- Text-HNN, Code-HNN and BiV-HNN are observed to be complementary to each other.
  - On the Python validation set, 60%~70% of mistakes made by one model can be corrected by the other two models.
- Model combination: the label of a code snippet is predicted only when the three models agree on it.
- Model combination on the testing set:
  - Python: ~70% of code snippets are labeled with 0.92 F1.
  - SQL: ~80% of code snippets are labeled with 0.94 F1.
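The agreement rule can be sketched in a few lines. The function names and toy predictions below are hypothetical; `None` is used here to mark a snippet that is left unlabeled because the three models disagree.

```python
def combine_predictions(text_pred, code_pred, biview_pred):
    """Keep a label only where all three models agree; otherwise None."""
    combined = []
    for t, c, b in zip(text_pred, code_pred, biview_pred):
        combined.append(t if t == c == b else None)
    return combined


def coverage(combined):
    """Fraction of snippets that received a label."""
    return sum(label is not None for label in combined) / len(combined)


# Toy predictions (1 = solution, 0 = not a solution) for four snippets.
text_hnn = [1, 0, 1, 1]
code_hnn = [1, 0, 0, 1]
biv_hnn = [1, 0, 1, 0]

combined = combine_predictions(text_hnn, code_hnn, biv_hnn)
labeled_fraction = coverage(combined)
```

This trades coverage for precision: only the snippets on which the models agree are labeled, which is consistent with the ~70% and ~80% labeled fractions reported above.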



Systematically mined StaQC

[Figure: a new or unannotated post is fed to the model combination, which yields the systematically mined StaQC pairs.]

StaQC: Systematically mined Question-Code pairs




A better source for constructing models mapping between natural language and programming language



StaQC: A better source for downstream tasks

- Code Retrieval as an exemplar downstream task (SQL domain).
- Neural Network model CODENN. [Iyer et al., 2016]
  - CODENN(Original): trained on ~26K heuristically collected QC pairs.
  - CODENN(StaQC): trained on ~120K systematically mined QC pairs.

[Figure: Mean Reciprocal Rank of the two models; ~6% absolute gain (conservative).]
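Mean Reciprocal Rank, the metric on the plot's axis, averages the reciprocal of the rank at which the correct code snippet is retrieved for each question. A minimal sketch follows; the function name and the toy ranks are illustrative and do not reproduce the evaluation protocol of the experiment.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where each rank is the 1-based position of the
    correct code snippet in the retrieved list for one question."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example: the correct snippet is ranked 1st, 2nd, and 4th.
toy_mrr = mean_reciprocal_rank([1, 2, 4])
```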

StaQC: Systematically mined Question-Code pairs


Data and code are available at: https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset




Thank you! Questions?


