StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow
Ziyu Yao
In collaboration with Prof. Daniel S. Weld (UW), Dr. Wei-Peng Chen (Fujitsu Lab), Prof. Huan Sun (OSU).
The Web Conference 2018, May 25th
The Ohio State University
Mapping between natural language and programming language
Question: “how to clone or copy a python list?”
Code: “new_list = copy.copy(old_list)”
e.g., automated code search/annotation/generation.
Ziyu YaoA Systematically Mined Question-Code
Dataset from Stack Overflow
Challenges
- Lack of large-scale datasets for model development, i.e., pairs of <natural language question, code snippet>.
- And datasets are important.
This work: StaQC
StaQC: the largest dataset to date of ~148K Python and ~120K SQL “how-to-do-it”* Question-Code pairs!
* “how-to-do-it” questions [Souza et al., 2014; Delfim et al., 2016]: the questioner provides a scenario and asks how to implement it.
Example:
Question: “how to clone or copy a python list?”
Code: “new_list = copy.copy(old_list)”
Continuously growing in size and diversity
Diversity of StaQC
- Containing multiple code solutions to the same question.
  - Question: “How to limit a number to be within a specified range?” (Python)
  - 4 code solutions in StaQC:

(1) def clamp(n, minn, maxn):
        return max(min(maxn, n), minn)

(2) clamp = lambda n, minn, maxn: max(min(maxn, n), minn)

(3) n = minn if n < minn else maxn if n > maxn else n

(4) def clamp(n, minn, maxn):
        if n < minn:
            return minn
        elif n > maxn:
            return maxn
        else:
            return n
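To make the "multiple solutions, same semantics" point concrete, here is a minimal sketch that checks the four clamp solutions against each other. The function names and the `all_agree` helper are made up for this illustration, solution (3) is wrapped in a function so it can be called, and the check assumes `minn <= maxn` (outside that range the variants are not guaranteed to agree).

```python
# The four StaQC solutions, renamed so they can coexist in one module.
def clamp_max_min(n, minn, maxn):
    return max(min(maxn, n), minn)


clamp_lambda = lambda n, minn, maxn: max(min(maxn, n), minn)


def clamp_inline(n, minn, maxn):
    # Solution (3) is an expression in the slide; wrapped here for testing.
    return minn if n < minn else maxn if n > maxn else n


def clamp_branches(n, minn, maxn):
    if n < minn:
        return minn
    elif n > maxn:
        return maxn
    else:
        return n


SOLUTIONS = [clamp_max_min, clamp_lambda, clamp_inline, clamp_branches]


def all_agree(n, minn, maxn):
    """Return True if every solution gives the same result (assumes minn <= maxn)."""
    results = {solution(n, minn, maxn) for solution in SOLUTIONS}
    return len(results) == 1
```

A model trained on such pairs sees one question mapped to several surface forms of the same behavior.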
- Containing different questions asking for semantically similar code solutions.
Critical for Model Robustness:
1. Natural language variation.
2. Different implementations of the same functionality in a programming language.
This work: StaQC
StaQC: the largest dataset to date of ~148K Python and ~120K SQL “how-to-do-it”* Question-Code pairs!
Continuously growing in size and diversity
A better source for constructing models mapping between natural language (NL) and programming language (PL)
Example
[Figure: a Stack Overflow question and its accepted answer post, which contains Code block 1, Code block 2, and Code block 3.]
Should we pair <Question, Code block n>?
Is Code block n a “standalone” solution to the question?
“Standalone” code solution
By looking at Code block n, can you solve the problem?
No! (It shows the usage, but with no details of the function.)
Example: Question-Code pairs
[Figure: the question and its accepted answer post, with two code blocks marked "☑ Standalone solution!" and one marked "✖ Not a standalone solution! (Showing usage, no details of the function)".]
Previous methods: heuristics based
- “Select All”: Taking all code snippets in the answer post as code solutions. [Allamanis et al., 2015][Zilberstein and Yahav, 2016]
  - Low precision
- “Select First”: Taking only the first code snippet in the answer post as a code solution, or considering only answer posts containing exactly one code snippet. [Iyer et al., 2016]
  - Low recall
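A minimal sketch of why the two heuristics trade precision against recall. The three-block post and its ground-truth labels below are hypothetical toy data, the function names are invented for this example, and only the "first snippet" variant of Select First is shown.

```python
def select_all(code_blocks):
    """Select-All heuristic: every code block is treated as a solution."""
    return [True] * len(code_blocks)


def select_first(code_blocks):
    """Select-First heuristic: only the first code block is treated as a solution."""
    return [index == 0 for index in range(len(code_blocks))]


def precision_recall(predicted, truth):
    """Precision and recall of predicted positive labels against ground truth."""
    true_positives = sum(p and t for p, t in zip(predicted, truth))
    predicted_positives = sum(predicted)
    actual_positives = sum(truth)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# Hypothetical answer post: two standalone solutions followed by a usage demo.
toy_blocks = ["solution A", "solution B", "usage demo"]
toy_truth = [True, True, False]

all_precision, all_recall = precision_recall(select_all(toy_blocks), toy_truth)
first_precision, first_recall = precision_recall(select_first(toy_blocks), toy_truth)
```

On this toy post, Select-All keeps the usage demo (lower precision), while Select-First drops the second solution (lower recall).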
Our solution: A systematic framework
- Binary classification formulation:
  - Input: A question on Stack Overflow and its accepted answer post with multiple code snippets.
  - Output: A binary label for each code snippet on whether it is a standalone solution to the question.
A bi-view formulation
Interleaving text and code blocks: S1, C1, S2, C2, S3, C3.
Text-based view: contextual hints
[Figure: the answer post split into text blocks S1, S2, S3 and code blocks C1, C2, C3; the text around two of the code blocks marks them as "more likely to be a code solution".]
Code-based view: semantics of code content
[Figure: code blocks C1, C2, C3 judged by their content; one is "possibly a usage demo" and two are "more likely to be a solution".]
Formulation for each code snippet
Predict a “solution or not” label for a code snippet (here C2) based on:
1. Textual context (text view): S2, S3.
2. Code content (code view): C2.
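One way to picture this formulation is as a preprocessing step that turns an answer post into one example per code snippet. The sketch below is an illustration only: `SnippetExample`, `build_examples`, and the `(kind, content)` block encoding are assumptions of this example, and padding a missing neighbor with an empty string is a choice made here, not something the slides specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SnippetExample:
    question: str
    text_before: str  # text view: the text block preceding the code block
    code: str         # code view: the code block itself
    text_after: str   # text view: the text block following the code block


def build_examples(question: str, blocks: List[Tuple[str, str]]) -> List[SnippetExample]:
    """Build one example per code block from an interleaved answer post.

    `blocks` lists (kind, content) pairs in post order, kind being "text" or "code".
    """
    examples = []
    for i, (kind, content) in enumerate(blocks):
        if kind != "code":
            continue
        has_prev_text = i > 0 and blocks[i - 1][0] == "text"
        has_next_text = i + 1 < len(blocks) and blocks[i + 1][0] == "text"
        before = blocks[i - 1][1] if has_prev_text else ""
        after = blocks[i + 1][1] if has_next_text else ""
        examples.append(SnippetExample(question, before, content, after))
    return examples


# Placeholder post using the slide's block names as contents.
post = [("text", "S1"), ("code", "C1"), ("text", "S2"),
        ("code", "C2"), ("text", "S3"), ("code", "C3")]
examples = build_examples("q", post)
```

For C2 this yields the context S2 and S3, matching the slide; C3 has no following text block, so its `text_after` is empty.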
Bi-View Hierarchical Neural Network (BiV-HNN)
[Figure: BiV-HNN architecture diagram. Labeled components: token-level encoder (Bi-GRUs over sequences of word tokens and of code tokens; concat and feedforward), block-level encoder (Bi-GRU), code label prediction (Softmax); q: question.]
Token-level encoder for text blocks
Token-level encoder for code blocks
Block-level encoder
Code label prediction
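The data flow named on these slides (token-level Bi-GRUs, concat and feedforward, a block-level Bi-GRU, then softmax) can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: the weights are random and untrained, biases are omitted, the token embedding is a toy hash-seeded vector, and the exact wiring of the question into the code encoding is an assumption of this sketch. All class and function names are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def embed(token, dim):
    # Toy deterministic embedding seeded by the token string (assumption).
    rng = random.Random(token)
    return [rng.uniform(-1, 1) for _ in range(dim)]


class GRUCell:
    """Standard GRU update with random weights; bias terms omitted for brevity."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrh"}
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], rh))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


class BiGRU:
    def __init__(self, in_dim, hid_dim, rng):
        self.fwd = GRUCell(in_dim, hid_dim, rng)
        self.bwd = GRUCell(in_dim, hid_dim, rng)

    def encode(self, seq):
        hf = [0.0] * self.fwd.hid_dim
        for x in seq:
            hf = self.fwd.step(x, hf)
        hb = [0.0] * self.bwd.hid_dim
        for x in reversed(seq):
            hb = self.bwd.step(x, hb)
        return hf + hb  # concatenate final forward and backward states


class BiVHNNSketch:
    def __init__(self, emb=8, hid=4, seed=0):
        rng = random.Random(seed)
        self.emb = emb
        self.text_enc = BiGRU(emb, hid, rng)          # token level: text blocks, question
        self.code_enc = BiGRU(emb, hid, rng)          # token level: code block
        self.ff = rand_matrix(2 * hid, 4 * hid, rng)  # feedforward over concat(code, question)
        self.block_enc = BiGRU(2 * hid, hid, rng)     # block level
        self.out = rand_matrix(2, 2 * hid, rng)       # two labels: solution or not

    def _encode_tokens(self, encoder, text):
        return encoder.encode([embed(t, self.emb) for t in text.split()])

    def predict(self, question, text_before, code, text_after):
        q = self._encode_tokens(self.text_enc, question)
        s_prev = self._encode_tokens(self.text_enc, text_before)
        s_next = self._encode_tokens(self.text_enc, text_after)
        c = self._encode_tokens(self.code_enc, code)
        c_q = [math.tanh(v) for v in matvec(self.ff, c + q)]
        block = self.block_enc.encode([s_prev, c_q, s_next])
        return softmax(matvec(self.out, block))
```

Calling `BiVHNNSketch().predict(question, text_before, code, text_after)` returns two probabilities; with untrained weights they carry no meaning beyond showing the shape of the computation.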
Experimental Setup
- Manually annotating “solution or not” labels on code snippets in one answer post.
  - Python and SQL domains.
  - Four undergraduates with substantial Cohen’s kappa agreement.
- Training/validation/test split: 60% - 20% - 20%.

                                     Python   SQL
# of Question-Code pairs             4,884    3,637
% of positive Question-Code pairs    44%      57%
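For readers unfamiliar with the agreement measure, here is a small sketch of Cohen's kappa for two annotators. The label lists are made-up toy data, not the StaQC annotations, and the slides do not say how the four annotators were paired, so this only illustrates the statistic itself. On the commonly used Landis and Koch scale, values from 0.61 to 0.80 are called "substantial".

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[k] * count_b[k] for k in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labels (1 = standalone solution, 0 = not); hypothetical data.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
```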
Main Results
- Heuristic methods: Select-First, Select-All.
- Feature engineering based methods:
  - Logistic Regression (LR), Support Vector Machine (SVM).
  - Features: text-based (uni-/bi-grams, connectives, etc.) and code-based (code tokens, etc.).

                Python   SQL
Select-First    0.607    0.613
Select-All      0.642    0.737
LR              0.766    0.846
SVM             0.753    0.850
BiV-HNN         0.841    0.888
(comparison on F1)
Research questions for understanding BiV-HNN
Q1: text view, code view, or bi-view?
Q2: hierarchical structure or flat structure?
Q3: block-level encoder: sequential or feedforward?
Q1: text view, code view, or bi-view?
[Figure: architecture diagrams of Text-HNN (text view), Code-HNN (code view), and BiV-HNN (bi-view).]

            Python   SQL
Text-HNN    0.771    0.840
Code-HNN    0.812    0.851
BiV-HNN     0.841    0.888
(comparison on F1)
Model combination
- Text-HNN, Code-HNN and BiV-HNN are observed to be complementary to each other.
  - On the Python validation set, 60%~70% of the mistakes made by one model can be corrected by the other two models.
- Model combination: the label of a code snippet is predicted only when the three models agree on it.
- Model combination on the testing set:
  - Python: ~70% of code snippets are labeled with 0.92 F1.
  - SQL: ~80% of code snippets are labeled with 0.94 F1.
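The agreement rule can be sketched in a few lines. The function names and the prediction dictionary below are hypothetical; the per-model predictions are toy values rather than outputs of the actual models.

```python
def combine(text_pred, code_pred, biv_pred):
    """Return the shared label if all three models agree, otherwise None."""
    if text_pred == code_pred == biv_pred:
        return text_pred
    return None


def label_snippets(predictions):
    """Keep only snippets on which Text-HNN, Code-HNN and BiV-HNN agree.

    `predictions` maps a snippet id to a (text, code, bi-view) label triple.
    """
    labeled = {}
    for snippet_id, (text_pred, code_pred, biv_pred) in predictions.items():
        label = combine(text_pred, code_pred, biv_pred)
        if label is not None:
            labeled[snippet_id] = label
    return labeled


# Toy predictions (1 = standalone solution, 0 = not); hypothetical values.
toy_predictions = {"a": (1, 1, 1), "b": (1, 0, 1), "c": (0, 0, 0)}
toy_labeled = label_snippets(toy_predictions)
coverage = len(toy_labeled) / len(toy_predictions)
```

Snippets without unanimous agreement are left unlabeled, which trades coverage for label quality, as in the ~70% and ~80% coverage figures above.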
Systematically mined StaQC
[Figure: pipeline from a new or unannotated post through the model combination.]
StaQC: Systematically mined Question-Code pairs
StaQC: the largest dataset to date of ~148K Python and ~120K SQL “how-to-do-it”* Question-Code pairs!
Continuously growing in size and diversity
A better source for constructing models mapping between natural language and programming language
StaQC: A better source for downstream tasks
- Code Retrieval as an exemplar downstream task (SQL domain).
- Neural network model CODENN. [Iyer et al., 2016]
  - CODENN (Original): trained on ~26K heuristically collected QC pairs.
  - CODENN (StaQC): trained on ~120K systematically mined QC pairs.
[Figure: Mean Reciprocal Rank of CODENN (Original) vs. CODENN (StaQC); ~6% absolute gain (conservative).]
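For reference, Mean Reciprocal Rank averages 1/rank of the correct code snippet over all queries. The sketch below uses invented helper names and toy candidate lists; it is not the evaluation script used in the talk.

```python
def rank_of(correct, ranked_candidates):
    """1-based position of the correct snippet in a ranked candidate list."""
    return ranked_candidates.index(correct) + 1


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; ranks are 1-based."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Toy retrieval results: (correct snippet, ranked candidates) per query.
toy_queries = [
    ("c1", ["c1", "c2", "c3", "c4"]),
    ("c2", ["c1", "c2", "c3", "c4"]),
    ("c4", ["c1", "c2", "c3", "c4"]),
]
toy_ranks = [rank_of(correct, ranked) for correct, ranked in toy_queries]
toy_mrr = mean_reciprocal_rank(toy_ranks)
```

Here the ranks are 1, 2 and 4, so the MRR is (1 + 1/2 + 1/4) / 3.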
Data and code are available at: https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset
Thank you! Questions?