![Page 1: Data driven student feedback for Programming Intensive MOOCs · Coursera’s ML course [Moocshop, 2013] # Students. Intro CS . 1K . 10K . 20M . Efficient index for “code phrases”](https://reader034.vdocuments.us/reader034/viewer/2022042919/5f61d4ac5c3a6a0b570bdb6d/html5/thumbnails/1.jpg)
Jonathan Huang Stanford University
Data driven student feedback for Programming Intensive MOOCs
Towards global scale CS education
Leonidas Guibas Chris Piech Andy Nguyen
Steve Jobs Stanford, 2005
“all of my working-class parents' savings were being spent on my college tuition… …the minute I dropped out [of college] I could stop taking the required classes that didn't interest me, and begin dropping in on the ones that looked interesting.”
Course selection is better online
MOOC = Massive Open Online Courses
Untapped potential
CS does not count towards high school math/science
requirements in 36 of 50 states
[Figure: two charts. First: dollar amounts in thousands ($0 to $40) by academic year, 1973-74 through 2013-14. Second: counts in thousands (0 to 2,000) for 2011, 2016, and 2021.]
*** source: www.code.org, www.collegeboard.org
400 students 100,000 students
Stanford ML-class
10 TAs 2,500 TAs (???)
Stanford ML-class
Ease of global scale feedback on a spectrum
Short Response
Long Response
Multiple choice
Essay questions
Proofs
Today: Programming Assignments
Feedback for Coding Assignments: Easy?
    function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    %GRADIENTDESCENT Performs gradient descent to learn theta
    %   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
    %   taking num_iters gradient steps with learning rate alpha
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta-alpha*1/m*(X'*(X*theta-y));
        J_history(iter) = computeCost(X, y, theta);
    end
Linear Regression submission (Homework 1) for Coursera’s ML class
Test Inputs → Test Outputs → Correct / Incorrect?
The “but it works!!” solution
    function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    m = length(y);
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        hypo = X*theta;
        newMat = hypo - y;
        trans1 = (X(:,1));
        trans1 = trans1';
        newMat1 = trans1 * newMat;
        temp1 = sum(newMat1);
        temp1 = (temp1 *alpha)/m;
        A = [temp1];
        theta(1) = theta(1) - A;
        trans2 = (X(:,2))';
        newMat2 = trans2*newMat;
        temp2 = sum(newMat2);
        temp2 = (temp2 *alpha)/m;
        B = [temp2];
        theta(2)= theta(2) - B;
        J_history(iter) = computeCost(X, y, theta);
    end
    theta(1) = theta(1);
    theta(2)= theta(2);
Why??
Correctness Efficiency
Style Elegance
Better: theta = theta-(alpha/m) *X'*(X*theta-y)
Good Good Poor Poor
Let’s do this:
for a class of 100,000, and in real time,
But… can’t require too much instructor effort to make this work for:
• New programming problems
• New courses
• New programming languages
We now have massive datasets…
Visualization of 40,000 implementations of linear regression submitted to Coursera’s ML course
[Moocshop, 2013]
[Figure: axis "# Students" with marks Intro CS, 1K, 10K, 20M.]
Codewebs Engine
• Efficient index for “code phrases” of a MOOC dataset
• Shared structure discovery amongst many student submissions
• Applications such as bug finding (w/o execution) and MOOC-scale feedback
• Results on real MOOC with > 1 million submissions
First, an example application
Correct:
    function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    %GRADIENTDESCENT Performs gradient descent to learn theta
    %   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
    %   taking num_iters gradient steps with learning rate alpha
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta-alpha*1/m*(X'*(X*theta-y));
        J_history(iter) = computeCost(X, y, theta);
    end
Incorrect:
    function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    %GRADIENTDESCENT Performs gradient descent to learn theta
    %   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
    %   taking num_iters gradient steps with learning rate alpha
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta-alpha*1/m*sum(X'*(X*theta-y));
        J_history(iter) = computeCost(X, y, theta);
    end
Dear Lisa Simpson, consider the dimension of the expression X'*(X*theta-y) and what happens after you call sum on it…
Syntax based approach: Attach this message to every submission containing that exact expression (covers 99 submissions)
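The dimension argument in the hint can be checked numerically. The sketch below is a plain-Python illustration, not course code: the helpers `transpose`, `matvec`, `gradient` and the toy data are assumptions made for the example. It shows that `X'*(X*theta-y)` has one entry per parameter, and that wrapping it in `sum` collapses it to a single scalar, so every component of theta receives the same update.

```python
def transpose(M):
    # Rows become columns.
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def gradient(X, y, theta):
    # Unscaled gradient X' * (X*theta - y): one entry per parameter.
    residual = [p - t for p, t in zip(matvec(X, theta), y)]
    return matvec(transpose(X), residual)

def step_correct(X, y, theta, alpha):
    m = len(y)
    g = gradient(X, y, theta)
    return [t - alpha / m * gi for t, gi in zip(theta, g)]

def step_extraneous_sum(X, y, theta, alpha):
    # The bug: summing the gradient vector yields one scalar,
    # which is then subtracted from every component of theta.
    m = len(y)
    s = sum(gradient(X, y, theta))
    return [t - alpha / m * s for t in theta]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # toy design matrix with intercept column
y = [2.0, 4.0, 6.0]
theta = [0.0, 0.0]

grad = gradient(X, y, theta)
good = step_correct(X, y, theta, 0.1)
bad = step_extraneous_sum(X, y, theta, 0.1)
print(len(grad), good, bad)
```

With the buggy step both parameters move by the same amount, which is why a single message about dimensions applies to every variant of this mistake.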
The extraneous sum bug takes many forms…
    theta = theta-alpha*1/m*sum(X'*(X*theta-y));
    theta = theta-alpha*1/m*sum(((theta'*X')'-y)'*X);
    theta = theta-alpha*1/m*sum(transpose(X*theta-y)*X);
    …
(Easier) Output based approach: Attach message to everyone who matched extraneous sum bug in unit test output (covers 1091 submissions)
Codewebs approach to feedback
    theta = theta-alpha*1/m*sum(X'*(X*theta-y));
Step 1: Find equivalent ways of writing buggy expression using Codewebs engine
Step 2: Write a thoughtful/meaningful hint or explanation
Step 3: Propagate feedback message to any submission containing equivalent expression
# submissions covered by single message: Output based 1091, Codewebs 1208, Combined 1604
~47% improvement over just using an output based
feedback system!!
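The ~47% figure follows directly from the coverage counts on the slide. The short calculation below reproduces it; the overlap estimate additionally assumes that "Combined" means the union of the two sets of covered submissions, which the slide does not state explicitly.

```python
output_based = 1091   # submissions covered by the output based approach
codewebs = 1208       # submissions covered by the Codewebs approach
combined = 1604       # submissions covered when both are used

# Relative gain of the combined system over output based feedback alone.
improvement = (combined - output_based) / output_based

# Assumption: combined = |A union B|, so inclusion-exclusion gives the overlap.
overlap = output_based + codewebs - combined
only_codewebs = combined - output_based

print(f"{improvement:.1%}", overlap, only_codewebs)
```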
Abstract syntax tree representations
    function A = warmUpExercise()
      A = [];
      A = eye(5);
    endfunction
AST for A = eye(5):
    ASSIGN
      IDENT (A)
      INDEX_EXP
        IDENT (eye)
        ARGUMENT_LIST
          CONST (5)
ASTs ignore:
• Whitespace
• Comments
• …
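The same property can be seen with any parser. As a small illustration, the sketch below uses Python's standard `ast` module (not the Octave parser used for the course data) to show that two spellings differing only in whitespace and comments yield the same tree. Node names in Python's AST differ from the ones on the slide.

```python
import ast

plain = "A = eye(5)"
noisy = "A   =   eye( 5 )   # identity matrix"

# ast.dump omits line and column attributes by default, so only
# the tree structure is compared.
same_tree = ast.dump(ast.parse(plain)) == ast.dump(ast.parse(noisy))

print(same_tree)
print(ast.dump(ast.parse(plain).body[0]))
```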
Indexing documents by phrases
    term/phrase   document list
    best          {1,3}
    blue          {2,4,6}
    bright        {7,8,10,11,12}
    heat          {1,5,13}
    kernel        {2,5,6,9,56}
    sky           {1,2}
    submarine     {2,3,4}
    woes          {10,19,38}
    yellow        {2,4}
The bright and blue butterfly hangs on the breeze…
We all something something yellow submarine…
“blue sky” “yellow submarine”
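A minimal sketch of how the table above answers a phrase query: look up each word's document list and intersect them. The function name `candidates` is an assumption for illustration. Note that intersection only yields candidate documents; a real phrase search would still need to check that the words are adjacent.

```python
# Inverted index copied from the table on the slide.
index = {
    "best": {1, 3}, "blue": {2, 4, 6}, "bright": {7, 8, 10, 11, 12},
    "heat": {1, 5, 13}, "kernel": {2, 5, 6, 9, 56}, "sky": {1, 2},
    "submarine": {2, 3, 4}, "woes": {10, 19, 38}, "yellow": {2, 4},
}

def candidates(phrase):
    # Documents that contain every word of the phrase.
    postings = [index.get(word, set()) for word in phrase.split()]
    return set.intersection(*postings) if postings else set()

print(candidates("blue sky"))
print(candidates("yellow submarine"))
```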
What basic queries should an AST search engine support?
Code Phrases
Subtrees and subforests of an AST
    BINARY_EXP (*)
      POSTFIX (')
        IDENT (X)
      BINARY_EXP (-)
        IDENT (y)
        BINARY_EXP (*)
          IDENT (X)
          IDENT (theta)
Code Phrases
Context within a larger subtree
    BINARY_EXP (*)
      POSTFIX (')
        IDENT (X)
      BINARY_EXP (-)
        IDENT (y)
        [replacement site]
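To make the two kinds of code phrase concrete, the sketch below enumerates every complete subtree of the example AST and, for each proper subtree, the context left behind when that subtree is cut out and replaced by a hole. The tuple encoding, the `HOLE` marker, and the function names are assumptions for illustration; subforests are not covered, and child order follows the drawing on the slide.

```python
def node(label, *children):
    # A tree is a (label, children) pair; children is a tuple of trees.
    return (label, children)

HOLE = node("<replacement site>")

def subtrees(tree):
    """Yield every complete subtree, root first."""
    yield tree
    for child in tree[1]:
        yield from subtrees(child)

def contexts(tree):
    """Yield (subtree, tree with that subtree replaced by HOLE) pairs."""
    label, children = tree
    for i, child in enumerate(children):
        yield child, (label, children[:i] + (HOLE,) + children[i + 1:])
        for sub, inner in contexts(child):
            yield sub, (label, children[:i] + (inner,) + children[i + 1:])

x_theta = node("BINARY_EXP (*)", node("IDENT (X)"), node("IDENT (theta)"))
tree = node("BINARY_EXP (*)",
            node("POSTFIX (')", node("IDENT (X)")),
            node("BINARY_EXP (-)", node("IDENT (y)"), x_theta))

all_subtrees = list(subtrees(tree))
all_contexts = list(contexts(tree))
print(len(all_subtrees), len(all_contexts))
```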
The Codewebs index
    Code phrase hash                           AST list
    2ccf02adb1cbabfb347d3b5d0a05b249855a7583   {1,3}
    b3bc37a318c2b895b3e644a12cfc6ebcfa5a06bd   {2,4,6}
    b3353c96e2cee8ee6c3ba260e037a93ca0ba3a5e   {7,8,10,11,12}
    2c01571626bf01338c8cdb15cf9d844d65f04645   {1,5,13}
    313f48d3f5888afc5d5aa28ab1393d94661edd31   {2,5,6,9,56}
    61d4bfccaa97cca2004102a297cc5acd281ea3a9   {1,2}
    467b4d400aab42d3bf96a119c4620e74d6fe57b3   {2,3,4}
    1ae95f6fa24bc25871cdc55cb472abdd68db93de   {10,19,38}
10 print “hello” 20 goto 10
10 for i=1:10 20 x = x+1 30 end
    def buildIndex():
        for A in ASTs:
            for every code phrase x contained in A:   (subtrees, subforests, contexts)
                Compute hashcode h[x]
                Insert A at h[x]
Very expensive, esp. for large ASTs!
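A runnable version of the naive loop above might look as follows. This is a sketch under stated assumptions: trees are encoded as `(label, children)` tuples, only complete subtrees are indexed (subforests and contexts are left out), and SHA-1 over a string serialization stands in for whatever hash the engine actually uses. Each subtree is re-serialized from scratch, which is the repeated work the slide calls expensive.

```python
import hashlib
from collections import defaultdict

def serialize(tree):
    # Unambiguous string form of a (label, children) tree.
    label, children = tree
    return "(" + label + "".join(serialize(c) for c in children) + ")"

def phrase_hash(tree):
    # Assumption: SHA-1 as an illustrative hash function.
    return hashlib.sha1(serialize(tree).encode("utf-8")).hexdigest()

def subtrees(tree):
    yield tree
    for child in tree[1]:
        yield from subtrees(child)

def build_index(asts):
    """Map each code phrase hash to the set of AST ids containing it."""
    index = defaultdict(set)
    for ast_id, tree in asts.items():
        for phrase in subtrees(tree):
            index[phrase_hash(phrase)].add(ast_id)
    return index

x_theta = ("BINARY_EXP (*)", (("IDENT (X)", ()), ("IDENT (theta)", ())))
asts = {
    1: ("BINARY_EXP (-)", (x_theta, ("IDENT (y)", ()))),
    2: x_theta,
}
index = build_index(asts)
print(sorted(index[phrase_hash(x_theta)]))
```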
Hashing Code Phrases
    BINARY_EXP (-)
      IDENT (y)
      BINARY_EXP (*)
        IDENT (X)
        IDENT (theta)
Node list: BINARY_EXP (-), IDENT (y), BINARY_EXP (*), IDENT (X), IDENT (theta)
Step 1. Create postorder listing of nodes.
Step 2. Hash postorder list via:
Recycling hash computations
Observation: Can hash sublist of postorder to get hash of code phrases!
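The observation rests on a property of postorder traversal: the nodes of any subtree occupy a contiguous run of the postorder list. The sketch below records, for each node, the slice of the list that covers its subtree. The tuple encoding and the function name `postorder_with_spans` are assumptions for illustration.

```python
def postorder_with_spans(tree):
    """Return (labels, spans): labels in postorder, plus for every node the
    half-open slice [start, end) of labels that covers its subtree."""
    labels, spans = [], []

    def visit(node):
        start = len(labels)
        for child in node[1]:
            visit(child)
        labels.append(node[0])
        spans.append((node[0], start, len(labels)))

    visit(tree)
    return labels, spans

tree = ("BINARY_EXP (-)", (
    ("IDENT (y)", ()),
    ("BINARY_EXP (*)", (("IDENT (X)", ()), ("IDENT (theta)", ()))),
))

labels, spans = postorder_with_spans(tree)
for label, start, end in spans:
    print(label, labels[start:end])
```

Because each subtree is a slice, hashing a subtree reduces to hashing a sublist of one shared list.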
Idea of DP: Store prefix hashes and prime powers for all
O(n) in time and space
After precomputation, can get any other hash in constant time!
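The exact hash formula is not preserved in this transcript, but "prefix hashes and prime powers" matches the standard polynomial rolling hash. The sketch below assumes that scheme: the base `P`, the modulus `MOD`, and the `token_id` mapping are illustrative choices, not values from the talk. After an O(n) pass, the hash of any sublist is obtained in constant time.

```python
import zlib

P = 1_000_003            # assumed prime base
MOD = (1 << 61) - 1      # assumed large prime modulus

def token_id(token):
    # Stable positive integer for a node label.
    return zlib.crc32(token.encode("utf-8")) + 1

def precompute(tokens):
    """O(n) pass: prefix[k] is the hash of tokens[:k], power[k] is P**k mod MOD."""
    prefix, power = [0], [1]
    for tok in tokens:
        prefix.append((prefix[-1] * P + token_id(tok)) % MOD)
        power.append((power[-1] * P) % MOD)
    return prefix, power

def sublist_hash(prefix, power, i, j):
    """Hash of tokens[i:j] in constant time from the precomputed tables."""
    return (prefix[j] - prefix[i] * power[j - i]) % MOD

def direct_hash(tokens):
    """Reference implementation: hash a list from scratch."""
    h = 0
    for tok in tokens:
        h = (h * P + token_id(tok)) % MOD
    return h

tokens = ["IDENT (y)", "IDENT (X)", "IDENT (theta)", "BINARY_EXP (*)", "BINARY_EXP (-)"]
prefix, power = precompute(tokens)
print(sublist_hash(prefix, power, 1, 4) == direct_hash(tokens[1:4]))
```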
Indexing is fast in practice
[Figure: Time for indexing 1000 ASTs. x-axis: average AST size (# nodes), 0 to 500; y-axis: run time (seconds), 0 to 3.]
Application: Statistical Bug Finding
    function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    %GRADIENTDESCENT Performs gradient descent to learn theta
    %   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
    %   taking num_iters gradient steps with learning rate alpha
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta-alpha*1/m*sum(X'*(X*theta-y));
        J_history(iter) = computeCost(X, y, theta);
    end
83% of ASTs containing this code phrase were buggy!
Solution: X'*(X*theta-y); vs. sum(X'*(X*theta-y));
Is sum(X'*(X*theta-y)) likely to be a bug?
Query Index → Fail Fail Fail Pass Fail Fail
83% of ASTs containing this code phrase were buggy!
Compute bug probability for all subforests, return smallest bugs found
Many ways to formulate probabilistic bug localization (we compute probability on local contexts)
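The 83% figure is the fraction of failing submissions among those whose AST contains the queried code phrase (5 of the 6 shown). A minimal sketch of that lookup follows; the dictionary-based index, the `outcomes` table, and the function name are assumptions for illustration, and the engine's actual probability model over local contexts is not reproduced here.

```python
def bug_probability(index, outcomes, phrase):
    """Fraction of ASTs containing `phrase` that failed the unit tests."""
    ast_ids = index.get(phrase, set())
    if not ast_ids:
        return None  # phrase never seen, no estimate
    failed = sum(1 for ast_id in ast_ids if not outcomes[ast_id])
    return failed / len(ast_ids)

# Toy data mirroring the slide: six matching ASTs, five of which fail.
index = {"sum(X'*(X*theta-y))": {1, 2, 3, 4, 5, 6}}
outcomes = {1: False, 2: False, 3: False, 4: True, 5: False, 6: False}  # True = pass

p = bug_probability(index, outcomes, "sum(X'*(X*theta-y))")
print(f"{p:.0%}")
```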
Bug Detection Accuracy
[Figure: scatter plot. x-axis: bug detection F-score (Codewebs), 0 to 1; y-axis: bug detection F-score (Baseline), 0 to 1. Labeled points: neural net training with backpropagation, logistic regression objective, linear regression with gradient descent.]
Each point represents a single coding problem. Bubble size = Average # nodes per submitted AST
More than one way to skin a cat…
Canonicalization: apply semantics-preserving transformation rules to ASTs to increase matching probability
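As an illustration of a hand-written rule of this kind, the sketch below uses Python's `ast` module to put the operands of every `+` into a fixed order, so that `Y+Z` and `Z+Y` produce the same tree. The class and function names are assumptions, and the rule treats `+` as commutative, which holds for numbers and matrices but not for every Python type.

```python
import ast

class SortAddOperands(ast.NodeTransformer):
    """Rewrite a + b so that operands appear in a fixed (sorted) order."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # canonicalize children first
        if isinstance(node.op, ast.Add):
            node.left, node.right = sorted([node.left, node.right], key=ast.dump)
        return node

def canonical(expr):
    # Parse an expression, apply the rule, and return a comparable dump.
    tree = ast.parse(expr, mode="eval")
    return ast.dump(SortAddOperands().visit(tree))

print(canonical("X*(Y+Z)") == canonical("X*(Z+Y)"))
print(canonical("X*(Y+Z)") == canonical("(Y+Z)*X"))
```

The second comparison fails because no rule for multiplication was written, which hints at how many rules a purely hand-written approach would need.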
Difficulties with Canonicalization
1. Impossible to predict all ways of writing the same thing
    X*1*(Y+Z)
    X*[1;1]'*[Y;Z]
    X*ones(1,2)*[Y;Z]
    X*repmat(1,2,1)'*[Y;Z]
    X*ones(1,length([Y;Z]))*[Y;Z]
    X*transpose([1;1])*[Y;Z]
    X*repmat(1,size([Y;Z],1),1)'*[Y;Z]

    X*(Y+Z)
    X*(Z+Y)
    (Y+Z)*X
    X*Z+X*Y
    Z*X+X*Y
    Z*X+Y*X
2. Canonicalization rules not typically generalizable across languages
Codewebs Approach
Use data to determine canonicalization rules
Customize rules to each assignment
Don’t need to be perfect, we’re not building a compiler!
Here’s the idea.
    def residual(X, theta, y):
        hypothesis = X * theta
        solution = hypothesis - y
        return solution

    def residual(X, theta, y):
        hypothesis = (theta' * X')'
        solution = hypothesis - y
        return solution
Counter Example
Agreement can be context dependent
    def foo():
        solution = solveProblem()
        print(solution)
        return solution

    def foo():
        solution = solveProblem()
        print(!solution)
        return solution
Join on context
Query Index
Query Index
= ?
Fail Fail Pass Pass Fail Pass
Fail Fail Pass Pass Fail Pass
100% probability of equivalence!
** fine print: need to account for sample size in general
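One way to read the comparison above: for every context in which both code phrases occur, check whether the surrounding submission passes or fails with each phrase, then measure how often the outcomes agree. The sketch below assumes that reading. The `pseudo` parameter is an assumed add-k smoothing, used here only to illustrate the "sample size" caveat; the talk does not say how the correction is done.

```python
def equivalence_estimate(outcomes_a, outcomes_b, pseudo=0):
    """Agreement rate of two code phrases over the contexts they share.

    outcomes_a / outcomes_b map a context id to True (pass) or False (fail).
    pseudo adds that many imaginary agreements and disagreements, which
    pulls the estimate toward 0.5 when few contexts are shared.
    """
    shared = outcomes_a.keys() & outcomes_b.keys()
    if not shared and pseudo == 0:
        return None
    agree = sum(1 for ctx in shared if outcomes_a[ctx] == outcomes_b[ctx])
    return (agree + pseudo) / (len(shared) + 2 * pseudo)

# Toy data mirroring the slide: identical outcomes in six shared contexts.
row = [False, False, True, True, False, True]
phrase_a = {f"ctx{i}": ok for i, ok in enumerate(row)}
phrase_b = {f"ctx{i}": ok for i, ok in enumerate(row)}

raw = equivalence_estimate(phrase_a, phrase_b)
smoothed = equivalence_estimate(phrase_a, phrase_b, pseudo=1)
print(raw, smoothed)
```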
Workflow
Human provides:
    theta = theta-alpha*1/m*(X'*(X*theta-y));
“alphaOverM” “prediction” “residual”
{m}:
    length (y)
    size (X, 1)
    size (y, 1)
    rows (X)
    m
    rows (y)
    length (X)
    length (x (:, 1))
    size (X) (1)
{alphaOverM}:
    alpha * (1 ./ {m})
    alpha * (1 / {m})
    alpha * 1 ./ {m}
    .01 / {m}
    alpha * {m} ^ -1
    alpha .* (1 ./ {m})
    1 / {m} * alpha
    alpha ./ {m}
    alpha .* (1 / {m})
    alpha * inv ({m})
    1 .* alpha ./ {m}
    alpha * pinv ({m})
    alpha / {m}
    alpha .* 1 / {m}
    1 * alpha / {m}
{hypothesis}:
    (theta' * X')'
    (X * theta)
    theta(1) + theta (2) * X (:, 2)
    (X * theta (:))
    [X] * theta
    sum(X.*repmat(theta',{m},1), 2)
    …
{residual}:
    (theta' * X' - y')'
    (X * theta - y)
    ({hypothesis} - y)
    ({hypothesis}' - y')'
    [{hypothesis} - y]
    sum({hypothesis} - y, 2)
    …
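The classes above nest: members of {residual} and {alphaOverM} are themselves written in terms of {hypothesis} and {m}. The sketch below shows how such nested classes could collapse different spellings to one label by repeated substitution. It works on whitespace-stripped strings purely for illustration (the engine operates on ASTs), uses only a few members from each class, and leaves out the bare `m` member because plain string replacement would match it inside other names.

```python
# A small, hand-picked subset of the equivalence classes listed above.
CLASSES = {
    "{m}": ["length(y)", "size(X,1)", "rows(X)"],
    "{hypothesis}": ["(theta'*X')'", "(X*theta)"],
    "{residual}": ["({hypothesis}-y)"],
    "{alphaOverM}": ["alpha*(1/{m})", "alpha/{m}"],
}

def canonicalize(expr):
    """Replace class members with their label until nothing changes."""
    text = expr.replace(" ", "")
    changed = True
    while changed:
        changed = False
        for label, members in CLASSES.items():
            # Try longer members first so a short one cannot split a long one.
            for member in sorted(members, key=len, reverse=True):
                if member in text:
                    text = text.replace(member, label)
                    changed = True
    return text

print(canonicalize("alpha / length(y)"))
print(canonicalize("((theta' * X')' - y)"))
print(canonicalize("((X * theta) - y)"))
```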
Canonicalization improves bug detection accuracy
[Figure: F-score (0.65 to 0.95) vs. # unique ASTs considered, without canonicalization and with canonicalization. Higher is better.]
How many submissions can we give feedback to with fixed effort?
[Figure: # submissions covered (out of 40,000), 0 to 25,000, vs. # equivalence classes (0, 1, 10, 19), with 25 ASTs marked and with 200 ASTs marked. Annotations: "Canonicalization, 25 marked ASTs" and "No Canonicalization, 200 marked ASTs".]
If we can find shared structure, we can facilitate feedback
In education: ASTs, proofs, essays, architecture, poems…
ASTs: “gradient”, “residual”, “learning rate”
Proofs: “base case”, “inductive hypothesis”
Essays: “apartheid”, “Afrikaner Calvinism”, “elections in South Africa”
Data is revolutionizing many fields
And it can revolutionize education too!
Thank you!! http://www.stanford.edu/~jhuang11 @jonathanhuang11