StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

Ziyu Yao

In collaboration with Prof. Daniel S. Weld (UW), Dr. Wei-Peng Chen (Fujitsu Lab), Prof. Huan Sun (OSU).

The Web Conference 2018, May 25th

The Ohio State University

Mapping between natural language and programming language

Question: “how to clone or copy a python list?”

Code: “new_list = copy.copy(old_list)”

Applications: e.g., automated code search, annotation, and generation.
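As a minimal sketch of what the example answer does: `copy.copy` returns a shallow copy, so the outer list is new while nested objects stay shared. The nested data and the `deep_list` variable are illustrative assumptions, not part of the slide.

```python
import copy

# Illustrative data (an assumption, not taken from the slide).
old_list = [[1, 2], [3, 4]]

new_list = copy.copy(old_list)        # shallow copy: new outer list, shared inner lists
deep_list = copy.deepcopy(old_list)   # deep copy: inner lists are duplicated as well

new_list.append([5, 6])   # changes only the copy's outer list
old_list[0].append(99)    # visible through new_list, but not through deep_list
```

For a flat list of immutable items the two kinds of copy behave the same; the difference only shows up with nested mutable objects.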

Challenges

• Lack of large-scale datasets for model development, i.e., pairs of <natural language question, code snippet>.

• And datasets are important.

This work: StaQC

StaQC: the largest dataset to date of ~148K Python and ~120K SQL “how-to-do-it”* Question-Code pairs!

* “how-to-do-it” questions [Souza et al., 2014; Delfim et al., 2016]: the questioner provides a scenario and asks how to implement it.

Example:

Question: “how to clone or copy a python list?”

Code: “new_list = copy.copy(old_list)”

Continuously growing in size and diversity

Diversity of StaQC

• Containing multiple code solutions to the same question.
  ◦ Question: “How to limit a number to be within a specified range? (Python)”
  ◦ 4 code solutions in StaQC:

(1)
    def clamp(n, minn, maxn):
        return max(min(maxn, n), minn)

(2)
    clamp = lambda n, minn, maxn: max(min(maxn, n), minn)

(3)
    n = minn if n < minn else maxn if n > maxn else n

(4)
    def clamp(n, minn, maxn):
        if n < minn:
            return minn
        elif n > maxn:
            return maxn
        else:
            return n
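The four solutions above look different but compute the same thing. A quick way to check this is to run them side by side; the names `clamp_v1` to `clamp_v4` and `all_agree` are introduced here only for the comparison, and the bare conditional expression is wrapped in a function so it can be called.

```python
def clamp_v1(n, minn, maxn):
    return max(min(maxn, n), minn)

clamp_v2 = lambda n, minn, maxn: max(min(maxn, n), minn)

def clamp_v3(n, minn, maxn):
    # The conditional-expression solution, wrapped in a function for testing.
    n = minn if n < minn else maxn if n > maxn else n
    return n

def clamp_v4(n, minn, maxn):
    if n < minn:
        return minn
    elif n > maxn:
        return maxn
    else:
        return n

def all_agree(low, high, values):
    """Return True if all four variants give the same result on every value."""
    return all(
        clamp_v1(v, low, high) == clamp_v2(v, low, high)
        == clamp_v3(v, low, high) == clamp_v4(v, low, high)
        for v in values
    )

agree = all_agree(0, 10, range(-5, 16))
```

This is the kind of implementation diversity a model trained on the dataset would see for a single question.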

Diversity of StaQC

• Containing different questions asking for semantically similar code solutions.

Critical for Model Robustness:

1. Natural language variation.

2. Different implementations to do the same thing in programming language.

This work: StaQC

A better source for constructing models mapping between natural language (NL) and programming language (PL)

Example

[Screenshot: a Stack Overflow question and its accepted answer post, which contains Code block 1, Code block 2, and Code block 3.]

Should we pair <Question, Code block n>?

Is Code block n a “standalone” solution to the question?

“Standalone” code solution

By looking at Code block n, can you solve the problem?

No! (It shows the usage, but with no details of the function.)

Example: Question-Code pairs

[Screenshot: the question and its accepted answer post, with each code block annotated.]

• Not a standalone solution! (Showing usage, no details of the function)

• Standalone solution!

• Standalone solution! ☑

Previous methods: heuristics based

• “Select All”: Taking all code snippets in the answer post as code solutions. [Allamanis et al., 2015][Zilberstein and Yahav, 2016]
  ◦ Low precision

• “Select First”: Taking only the first code snippet in the answer post as a code solution, or considering only answer posts containing exactly one code snippet. [Iyer et al., 2016]
  ◦ Low recall
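The trade-off between the two heuristics can be seen on a single toy answer post. In this sketch the function names and the gold labels are hypothetical (three snippets, of which the first and third are standalone solutions); it only illustrates why one heuristic loses precision and the other loses recall.

```python
def select_all(num_snippets):
    """'Select All': every snippet in the answer post is taken as a solution."""
    return [True] * num_snippets

def select_first(num_snippets):
    """'Select First': only the first snippet is taken as a solution."""
    return [i == 0 for i in range(num_snippets)]

def precision_recall(predicted, gold):
    true_pos = sum(1 for p, g in zip(predicted, gold) if p and g)
    predicted_pos = sum(predicted)
    gold_pos = sum(gold)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall

# Hypothetical ground truth for one answer post with three code snippets.
gold = [True, False, True]

all_precision, all_recall = precision_recall(select_all(len(gold)), gold)
first_precision, first_recall = precision_recall(select_first(len(gold)), gold)
```

On this toy post, Select All keeps every true solution but also the non-solution, while Select First is correct on what it picks but misses the third snippet.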

Our solution: A systematic framework

• Binary classification formulation:
  Input: A question on Stack Overflow and its accepted answer post with multiple code snippets.
  Output: A binary label for each code snippet on whether it is a standalone solution to the question.

A bi-view formulation

An answer post consists of interleaving text and code blocks: S1, C1, S2, C2, S3, C3 (Si: text block, Ci: code block).

Text-based view: contextual hints

[Figure: the answer post with text blocks S1, S2, S3 and code blocks C1, C2, C3. Based on the surrounding text, two of the code blocks are marked “more likely to be a code solution”.]

Code-based view: semantics of code content

[Figure: the same answer post. Based on code content, one code block is marked “possibly a usage demo” and two are marked “more likely to be a solution”.]

Formulation for each code snippet

Predict a “solution or not” label for a code snippet (here C2) based on:

1. Textual context (text view): S2, S3.

2. Code content (code view): C2.
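The per-snippet formulation can be sketched as a small data-preparation step: walk the interleaved blocks of an answer post and, for each code block, collect the question, the text block before it, and the text block after it. `SnippetInput` and `build_inputs` are hypothetical names used only for this illustration, and the block contents are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SnippetInput:
    question: str
    text_before: Optional[str]   # text view: preceding text block, if any
    code: str                    # code view: the snippet itself
    text_after: Optional[str]    # text view: following text block, if any

def build_inputs(question: str, blocks: List[Tuple[str, str]]) -> List[SnippetInput]:
    """blocks is a list of (kind, content) pairs with kind in {"text", "code"}."""
    inputs = []
    for i, (kind, content) in enumerate(blocks):
        if kind != "code":
            continue
        before = blocks[i - 1][1] if i > 0 and blocks[i - 1][0] == "text" else None
        after = (blocks[i + 1][1]
                 if i + 1 < len(blocks) and blocks[i + 1][0] == "text" else None)
        inputs.append(SnippetInput(question, before, content, after))
    return inputs

# Placeholder post with the same layout as on the slides: S1, C1, S2, C2, S3, C3.
blocks = [("text", "S1"), ("code", "C1"), ("text", "S2"),
          ("code", "C2"), ("text", "S3"), ("code", "C3")]
inputs = build_inputs("q", blocks)
```

For C2 this yields the context S2 and S3, matching the slide; the last snippet simply has no following text block.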

Bi-View Hierarchical Neural Network (BiV-HNN)

[Architecture figure; q denotes the question. The slides walk through it from input to output:]

1. Token-level encoder for text blocks: a Bi-GRU over the sequence of word tokens in each text block.

2. Token-level encoder for code blocks: Bi-GRUs over the sequence of code tokens in the code block and the sequence of word tokens in the question, followed by concatenation and a feedforward layer.

3. Block-level encoder: a Bi-GRU over the encoded blocks.

4. Code label prediction: Softmax.
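The data flow through the layers named on these slides can be traced with a tiny pure-Python forward pass. This is a heavily simplified sketch, not the authors' implementation: the weights are random and untrained, biases and any attention mechanism are omitted, and the dimensions, the `embed` stand-in, and all class and function names are assumptions made for illustration.

```python
import math
import random

DIM = 4  # toy embedding/hidden size (assumption)

def embed(token):
    # Deterministic stand-in for a learned token embedding.
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

class GRU:
    """One-direction GRU with untrained random weights (biases omitted)."""
    def __init__(self, rng, in_dim, hid_dim):
        self.hid_dim = hid_dim
        # One (input, recurrent) weight pair per gate: update z, reset r, candidate h.
        self.params = {g: (rand_matrix(rng, hid_dim, in_dim),
                           rand_matrix(rng, hid_dim, hid_dim)) for g in "zrh"}

    def run(self, inputs):
        (wz, uz), (wr, ur), (wh, uh) = (self.params[g] for g in "zrh")
        h = [0.0] * self.hid_dim
        for x in inputs:
            z = [sigmoid(a + b) for a, b in zip(matvec(wz, x), matvec(uz, h))]
            r = [sigmoid(a + b) for a, b in zip(matvec(wr, x), matvec(ur, h))]
            gated = [ri * hi for ri, hi in zip(r, h)]
            cand = [math.tanh(a + b) for a, b in zip(matvec(wh, x), matvec(uh, gated))]
            h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        return h

class BiGRU:
    """Concatenates the final states of a forward and a backward GRU."""
    def __init__(self, rng, in_dim, hid_dim):
        self.forward = GRU(rng, in_dim, hid_dim)
        self.backward = GRU(rng, in_dim, hid_dim)

    def encode(self, inputs):
        return self.forward.run(inputs) + self.backward.run(inputs[::-1])

rng = random.Random(0)
text_encoder = BiGRU(rng, DIM, DIM)          # token level: text blocks
question_encoder = BiGRU(rng, DIM, DIM)      # token level: question
code_encoder = BiGRU(rng, DIM, DIM)          # token level: code block
merge = rand_matrix(rng, 2 * DIM, 4 * DIM)   # feedforward over concat(question, code)
block_encoder = BiGRU(rng, 2 * DIM, DIM)     # block level
output = rand_matrix(rng, 2, 2 * DIM)        # two classes: solution / not a solution

def tokens(text):
    return [embed(t) for t in text.split()]

def predict(text_before, question, code, text_after):
    s_prev = text_encoder.encode(tokens(text_before))
    s_next = text_encoder.encode(tokens(text_after))
    q = question_encoder.encode(tokens(question))
    c = code_encoder.encode(tokens(code))
    code_view = [math.tanh(a) for a in matvec(merge, q + c)]
    summary = block_encoder.encode([s_prev, code_view, s_next])
    logits = matvec(output, summary)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]   # numerically stable softmax
    return [e / sum(exps) for e in exps]

probs = predict("You can use a small helper :", "how to limit a number to a range",
                "def clamp ( n , lo , hi ) : return max ( min ( hi , n ) , lo )", "")
```

Because nothing is trained, the two output probabilities carry no meaning; the point is only the hierarchy of token-level encoders feeding a block-level encoder and a softmax.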

Experimental Setup

• Manually annotating the “solution or not” label on code snippets in one answer post.
  ◦ Python and SQL domain.
  ◦ Four undergraduates with substantial agreement (Cohen’s kappa).

• Training/validation/test split: 60% - 20% - 20%.

                                     Python   SQL
# of Question-Code pairs             4,884    3,637
% of positive Question-Code pairs    44%      57%
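For readers unfamiliar with the agreement measure, here is a minimal sketch of Cohen's kappa for two annotators. The labels below are made up for illustration, and the slides do not say how agreement was aggregated across the four annotators, so the pairwise form is an assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumes the expected agreement is below 1 (otherwise kappa is undefined).
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(count_a[label] * count_b[label]
                   for label in set(count_a) | set(count_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical "solution or not" labels (1 = solution) from two annotators.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Kappa discounts the agreement that two annotators would reach by chance, which matters here because the positive rate is close to one half in both domains.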

Main Results

• Heuristic methods: Select-First, Select-All.

• Feature engineering based methods:
  ◦ Logistic Regression (LR), Support Vector Machine (SVM).
  ◦ Features: text-based (uni-/bi-grams, the connectives, etc.) and code-based (code tokens, etc.).

(comparison on F1)
               Python   SQL
Select-First   0.607    0.613
Select-All     0.642    0.737
LR             0.766    0.846
SVM            0.753    0.850
BiV-HNN        0.841    0.888
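The comparison metric can be computed as follows. This is a generic sketch of F1 on the positive (“solution”) class, which is an assumption about how the reported scores were computed; the labels are invented for the example.

```python
def f1_score(predicted, gold):
    """F1 of the positive class: harmonic mean of precision and recall."""
    true_pos = sum(1 for p, g in zip(predicted, gold) if p and g)
    false_pos = sum(1 for p, g in zip(predicted, gold) if p and not g)
    false_neg = sum(1 for p, g in zip(predicted, gold) if not p and g)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels for five code snippets.
gold = [True, True, False, True, False]
predicted = [True, False, False, True, True]
score = f1_score(predicted, gold)
```

F1 penalizes both the over-selection of Select-All and the under-selection of Select-First, which is why it is a reasonable single number for this table.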


Research questions for understanding BiV-HNN

Q1: text view, code view, or bi-view?

Q2: hierarchical structure or flat structure?

Q3: block-level encoder: sequential or feedforward?


Q1: text view, code view, or bi-view?

[Figures: architectures of Text-HNN (text view), Code-HNN (code view), and BiV-HNN (bi-view).]

(comparison on F1)
           Python   SQL
Text-HNN   0.771    0.840
Code-HNN   0.812    0.851
BiV-HNN    0.841    0.888

Model combination

• Text-HNN, Code-HNN and BiV-HNN are observed to be complementary to each other.
  ◦ On the Python validation set, 60%~70% of mistakes made by one model can be corrected by the other two models.

• Model combination: the label of a code snippet is predicted only when the three models agree on it.

• Model combination on the testing set:
  ◦ Python: ~70% of code snippets are labeled with 0.92 F1.
  ◦ SQL: ~80% of code snippets are labeled with 0.94 F1.
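The agreement rule can be written in a few lines. In this sketch the function name `combine` and the example predictions are hypothetical; a snippet gets a label only when all three models output the same one, and is left unlabeled otherwise.

```python
from typing import Optional

def combine(text_pred: bool, code_pred: bool, biv_pred: bool) -> Optional[bool]:
    """Return the shared label if the three models agree, otherwise None."""
    if text_pred == code_pred == biv_pred:
        return text_pred
    return None

# Hypothetical (Text-HNN, Code-HNN, BiV-HNN) predictions for four snippets.
predictions = [
    (True, True, True),
    (True, False, True),
    (False, False, False),
    (False, True, True),
]
combined = [combine(*p) for p in predictions]

# Fraction of snippets that receive a label at all.
coverage = sum(label is not None for label in combined) / len(combined)
```

This trades coverage for precision: fewer snippets are labeled, but the labels that remain are the ones all three views support.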

Systematically mined StaQC

[Diagram: a new or unannotated post is labeled by the model combination to produce Question-Code pairs for StaQC.]


StaQC: A better source for downstream tasks

• Code Retrieval as an exemplar downstream task (SQL domain).

• Neural Network model CODENN. [Iyer et al., 2016]
  ◦ CODENN(Original): trained on ~26K heuristically collected QC pairs.
  ◦ CODENN(StaQC): trained on ~120K systematically mined QC pairs.

[Figure, y-axis: Mean Reciprocal Rank. ~6% absolute gain (conservative).]
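Mean Reciprocal Rank, the retrieval metric on this slide, averages the inverse rank of the correct code snippet over all queries. The sketch below uses invented rankings and hypothetical function names; it does not reproduce the evaluation protocol of the talk.

```python
def reciprocal_rank(ranked_candidates, correct):
    """1 / rank of the correct candidate, or 0.0 if it was not retrieved."""
    for position, candidate in enumerate(ranked_candidates, start=1):
        if candidate == correct:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    """queries is a list of (ranked_candidates, correct_candidate) pairs."""
    return sum(reciprocal_rank(ranked, correct) for ranked, correct in queries) / len(queries)

# Hypothetical rankings: the correct snippet "a" is ranked 1st, 2nd and 4th.
queries = [
    (["a", "b", "c"], "a"),
    (["b", "a", "c"], "a"),
    (["c", "b", "d", "a"], "a"),
]
mrr = mean_reciprocal_rank(queries)
```

Higher is better, and the maximum of 1.0 means the correct snippet is always ranked first.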


StaQC: Systematically mined Question-Code pairs


StaQC: the largest dataset to date of ~148K Python and ~120K SQL “how-to-do-it”* Question-Code pairs!

Continuously growing in size and diversity

A better source for constructing models mapping between natural language and programming language

Data and code are available at: https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset


Thank you! Questions?
