data driven student feedback for programming intensive moocs · coursera’s ml course [moocshop,...

Jonathan Huang Stanford University Data driven student feedback for Programming Intensive MOOCs Towards global scale CS education Leonidas Guibas Chris Piech Andy Nguyen


Page 1: Data driven student feedback for Programming Intensive MOOCs · Coursera’s ML course [Moocshop, 2013] # Students. Intro CS . 1K . 10K . 20M . Efficient index for “code phrases”

Jonathan Huang Stanford University

Data driven student feedback for Programming Intensive MOOCs

Towards global scale CS education

Leonidas Guibas Chris Piech Andy Nguyen


Steve Jobs Stanford, 2005

“all of my working-class parents' savings were being spent on my college tuition… …the minute I dropped out [of college] I could stop taking the required classes that didn't interest me, and begin dropping in on the ones that looked interesting.”


Course selection is better online

MOOC = Massive Open Online Course


Untapped potential

CS does not count towards high school math/science requirements in 36 of 50 states

[Figure: chart with y-axis from $0 to $40 (thousands) and x-axis in academic years from 73-74 to 13-14]

*** source: www.code.org, www.collegeboard.org

[Figure: chart with y-axis from 0 to 2000 (thousands) and x-axis years 2011, 2016, 2021]


400 students 100,000 students

Stanford ML-class


10 TAs 2,500 TAs (???)

Stanford ML-class


Ease of global scale feedback on a spectrum

Short Response

Long Response

Multiple choice

Essay questions

Proofs

Today: Programming Assignments


function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

Feedback for Coding Assignments: Easy?


Test Inputs

Test Outputs

Correct / Incorrect?

Linear Regression submission (Homework 1) for Coursera’s ML class


The “but it works!!” solution

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    hypo = X*theta;
    newMat = hypo - y;
    trans1 = (X(:,1));
    trans1 = trans1';
    newMat1 = trans1 * newMat;
    temp1 = sum(newMat1);
    temp1 = (temp1 *alpha)/m;
    A = [temp1];
    theta(1) = theta(1) - A;
    trans2 = (X(:,2))';
    newMat2 = trans2*newMat;
    temp2 = sum(newMat2);
    temp2 = (temp2 *alpha)/m;
    B = [temp2];
    theta(2)= theta(2) - B;
    J_history(iter) = computeCost(X, y, theta);
end
theta(1) = theta(1);
theta(2)= theta(2);

Why??

Correctness: Good
Efficiency: Good
Style: Poor
Elegance: Poor

Better: theta = theta-(alpha/m) *X'*(X*theta-y)


Let’s do this: for a class of 100,000, and in real time

But… can’t require too much instructor effort to make this work for:

New programming problems

New courses

New programming languages


We now have massive datasets…

Visualization of 40,000 implementations of linear regression submitted to Coursera’s ML course

[Moocshop, 2013]

[Figure: axis “# Students”; labels: Intro CS, 1K, 10K, 20M]


Codewebs Engine

Efficient index for “code phrases” of a MOOC dataset

Shared structure discovery amongst many student submissions

Applications such as bug finding (w/o execution) and MOOC-scale feedback

Results on real MOOC with > 1 million submissions


First, an example application

Correct:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

Incorrect:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*sum(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end


Dear Lisa Simpson, consider the dimension of the expression X'*(X*theta-y) and what happens after you call sum on it…

Syntax based approach:

Attach this message to every submission containing that exact expression (covers 99 submissions)
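To make the hint about dimensions concrete, here is a small pure-Python sketch. The toy data and the helper names (`transpose`, `matvec`, `gradient`, `step_correct`, `step_buggy`) are illustrative assumptions, not course code. The correct update subtracts one gradient entry per parameter; the buggy update first collapses that vector to a single number with sum, so every parameter receives the same correction.

```python
def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def gradient(X, y, theta):
    """Analogue of X' * (X*theta - y): one entry per parameter."""
    residual = [p - t for p, t in zip(matvec(X, theta), y)]
    return matvec(transpose(X), residual)

def step_correct(X, y, theta, alpha):
    m = len(y)
    g = gradient(X, y, theta)
    return [t - alpha / m * gi for t, gi in zip(theta, g)]

def step_buggy(X, y, theta, alpha):
    m = len(y)
    s = sum(gradient(X, y, theta))  # extraneous sum: vector becomes a scalar
    return [t - alpha / m * s for t in theta]

# Toy data (made up for illustration): 3 examples, 2 parameters.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
theta = [0.0, 0.0]

g = gradient(X, y, theta)
good = step_correct(X, y, theta, 0.1)
bad = step_buggy(X, y, theta, 0.1)
print(g, good, bad)
```

In this toy run the gradient has two distinct entries, while the buggy step moves both parameters by the same amount.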


The extraneous sum bug takes many forms…

theta = theta-alpha*1/m*sum(X'*(X*theta-y));

theta = theta-alpha*1/m*sum(((theta'*X')'-y)'*X);

theta = theta-alpha*1/m*sum(transpose(X*theta-y)*X);

(Easier) Output based approach:

Attach message to every submission that matched the extraneous sum bug in unit test output (covers 1091 submissions)


Codewebs approach to feedback

theta = theta-alpha*1/m*sum(X'*(X*theta-y));

Step 1: Find equivalent ways of writing buggy expression using Codewebs engine
Step 2: Write a thoughtful/meaningful hint or explanation
Step 3: Propagate feedback message to any submission containing equivalent expression

# submissions covered by single message

Output based    1091
Codewebs        1208
Combined        1604


~47% improvement over just using an output-based feedback system (1604 vs. 1091 submissions covered)!!
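The three steps can be sketched as a lookup followed by a union. This is a minimal sketch, assuming a hypothetical `index` that maps an expression string to the set of submission ids containing it; the ids and the `propagate` helper are made up for illustration and are not the Codewebs API. The last line only reproduces the arithmetic behind the ~47% figure from the counts on the slide.

```python
def propagate(message, equivalent_exprs, index):
    """Attach one message to every submission containing any equivalent expression."""
    covered = set()
    for expr in equivalent_exprs:
        covered |= index.get(expr, set())
    return {submission_id: message for submission_id in sorted(covered)}

# Hypothetical toy index: expression -> ids of submissions containing it.
index = {
    "sum(X'*(X*theta-y))": {1, 2},
    "sum(transpose(X*theta-y)*X)": {3},
    "X'*(X*theta-y)": {4, 5},
}

hint = "Consider the dimension of X'*(X*theta-y) and what happens after you call sum on it."
equivalents = ["sum(X'*(X*theta-y))", "sum(transpose(X*theta-y)*X)"]  # Step 1 (assumed given)
feedback = propagate(hint, equivalents, index)                        # Steps 2 and 3

# Coverage counts from the slide: output based 1091, combined 1604.
improvement = (1604 - 1091) / 1091
print(sorted(feedback), round(improvement * 100))
```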


Abstract syntax tree representations

ASTs ignore:

• Whitespace
• Comments
• …

function A = warmUpExercise()
    A = [];
    A = eye(5);
endfunction

AST (for the statement A = eye(5)):

ASSIGN
    IDENT (A)
    INDEX_EXP
        IDENT (eye)
        ARGUMENT_LIST
            CONST (5)
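The same property can be checked with Python's standard `ast` module. This is only an analogy: the slides use Octave ASTs, and Python parses `eye(5)` as a call rather than an index expression. Two sources that differ only in spacing and comments produce the same tree dump.

```python
import ast

plain = "A = eye(5)"
noisy = "A   =   eye( 5 )   # identity matrix"

# ast.dump omits line/column attributes by default, so layout does not matter.
same_tree = ast.dump(ast.parse(plain)) == ast.dump(ast.parse(noisy))

stmt = ast.parse(plain).body[0]
print(same_tree, type(stmt).__name__, stmt.targets[0].id, stmt.value.func.id)
```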


Indexing documents by phrases

term/phrase    document list
best           {1,3}
blue           {2,4,6}
bright         {7,8,10,11,12}
heat           {1,5,13}
kernel         {2,5,6,9,56}
sky            {1,2}
submarine      {2,3,4}
woes           {10,19,38}
yellow         {2,4}

The bright and blue butterfly hangs on the breeze…

We all something something yellow submarine…

“blue sky” “yellow submarine”
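A minimal sketch of how such an index answers the two example queries, using the posting lists from the table. The `candidates` helper is an illustrative name, and it only intersects posting lists; a real phrase search would also need to check that the words are adjacent.

```python
index = {
    "best": {1, 3},
    "blue": {2, 4, 6},
    "bright": {7, 8, 10, 11, 12},
    "heat": {1, 5, 13},
    "kernel": {2, 5, 6, 9, 56},
    "sky": {1, 2},
    "submarine": {2, 3, 4},
    "woes": {10, 19, 38},
    "yellow": {2, 4},
}

def candidates(phrase, index):
    """Documents containing every term of the phrase (adjacency is not checked)."""
    postings = [index.get(term, set()) for term in phrase.split()]
    return set.intersection(*postings) if postings else set()

blue_sky = candidates("blue sky", index)
yellow_submarine = candidates("yellow submarine", index)
print(blue_sky, yellow_submarine)
```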


What basic queries should an AST search engine support?


Code Phrases

Subtrees and subforests of an AST

BINARY_EXP (*)
    POSTFIX (')
        IDENT (X)
    BINARY_EXP (-)
        IDENT (y)
        BINARY_EXP (*)
            IDENT (X)
            IDENT (theta)


Code Phrases

Context within a larger subtree

BINARY_EXP (*)
    POSTFIX (')
        IDENT (X)
    BINARY_EXP (-)
        IDENT (y)
        BINARY_EXP (*)
            IDENT (X)
            IDENT (theta)


Code Phrases

Context within a larger subtree

BINARY_EXP (*)
    POSTFIX (')
        IDENT (X)
    BINARY_EXP (-)
        IDENT (y)
        [replacement site]
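Subtrees and contexts can be enumerated with two short recursive generators. This sketch assumes a tree encoded as nested tuples `(label, child, ...)` and a made-up `HOLE` marker for the replacement site; subforests are left out for brevity, and none of these names come from Codewebs itself.

```python
HOLE = ("HOLE",)  # marks the replacement site inside a context

tree = ("BINARY_EXP (*)",
        ("POSTFIX (')", ("IDENT (X)",)),
        ("BINARY_EXP (-)",
         ("IDENT (y)",),
         ("BINARY_EXP (*)", ("IDENT (X)",), ("IDENT (theta)",))))

def subtrees(node):
    """Yield the node itself and every subtree below it."""
    yield node
    for child in node[1:]:
        yield from subtrees(child)

def contexts(node):
    """Yield (context, removed) pairs: the tree with one proper subtree cut out."""
    for i, child in enumerate(node[1:], start=1):
        yield node[:i] + (HOLE,) + node[i + 1:], child
        for inner_context, removed in contexts(child):
            yield node[:i] + (inner_context,) + node[i + 1:], removed

all_subtrees = list(subtrees(tree))
all_contexts = list(contexts(tree))
print(len(all_subtrees), len(all_contexts))
```

One of the generated contexts is the tree drawn above, with the hole where the inner product used to be.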


The Codewebs index

Code phrase hash                            AST list
2ccf02adb1cbabfb347d3b5d0a05b249855a7583    {1,3}
b3bc37a318c2b895b3e644a12cfc6ebcfa5a06bd    {2,4,6}
b3353c96e2cee8ee6c3ba260e037a93ca0ba3a5e    {7,8,10,11,12}
2c01571626bf01338c8cdb15cf9d844d65f04645    {1,5,13}
313f48d3f5888afc5d5aa28ab1393d94661edd31    {2,5,6,9,56}
61d4bfccaa97cca2004102a297cc5acd281ea3a9    {1,2}
467b4d400aab42d3bf96a119c4620e74d6fe57b3    {2,3,4}
1ae95f6fa24bc25871cdc55cb472abdd68db93de    {10,19,38}

10 print "hello"
20 goto 10

10 for i=1:10
20 x = x+1
30 end


def buildIndex():
    for A in ASTs:
        for every code phrase x contained in A:    (subtrees, subforests, contexts)
            Compute hashcode h[x]
            Insert A at h[x]

Very expensive, esp. for large ASTs!
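A runnable version of the naive loop, restricted to subtrees, might look like the following. It is a sketch under assumptions: trees are nested tuples, `phrase_hash` uses SHA-1 of the tuple's `repr` purely for illustration (the slides do not say which hash Codewebs uses), and the two toy ASTs are made up. Serializing every subtree from scratch is exactly the cost the slide warns about.

```python
import hashlib
from collections import defaultdict

def subtrees(node):
    """Yield the node itself and every subtree below it."""
    yield node
    for child in node[1:]:
        yield from subtrees(child)

def phrase_hash(node):
    """Hash a code phrase by serializing it from scratch (the expensive part)."""
    return hashlib.sha1(repr(node).encode("utf-8")).hexdigest()

def build_index(asts):
    """Map each code phrase hash to the set of AST ids containing that phrase."""
    index = defaultdict(set)
    for ast_id, root in asts.items():
        for phrase in subtrees(root):
            index[phrase_hash(phrase)].add(ast_id)
    return index

# Toy ASTs (hypothetical): AST 2 is a subtree of AST 1.
product = ("BINARY_EXP (*)", ("IDENT (X)",), ("IDENT (theta)",))
asts = {
    1: ("BINARY_EXP (-)", product, ("IDENT (y)",)),
    2: product,
}
index = build_index(asts)
print(index[phrase_hash(product)])
```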


Hashing Code Phrases

BINARY_EXP (-)
    IDENT (y)
    BINARY_EXP (*)
        IDENT (X)
        IDENT (theta)

Node list: BINARY_EXP (-), IDENT (y), BINARY_EXP (*), IDENT (X), IDENT (theta)

Step 1. Create postorder listing of nodes.

Step 2. Hash postorder list via:


Recycling hash computations

BINARY_EXP (-)
    IDENT (y)
    BINARY_EXP (*)
        IDENT (X)
        IDENT (theta)

Node list: BINARY_EXP (-), IDENT (y), BINARY_EXP (*), IDENT (X), IDENT (theta)

Observation: Can hash sublist of postorder to get hash of code phrases!


Idea of DP: Store prefix hashes and prime powers for all

O(n) in time and space


After precomputation, can get any other hash in constant time!
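The hash formula itself did not survive in this transcript, so the following is a sketch of one standard way to get this behaviour: a polynomial rolling hash with stored prefix hashes and powers of a prime. The constants `MOD` and `BASE`, the `token_code` helper, and the exact formula are assumptions, not necessarily what Codewebs uses. In a postorder listing every subtree occupies a contiguous sublist, so its hash can be read off in constant time after a linear precomputation.

```python
import hashlib

MOD = (1 << 61) - 1  # large prime modulus (assumed)
BASE = 1_000_003     # prime base (assumed)

def token_code(token):
    """Stable integer code for a node label."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % MOD

def precompute(tokens):
    """O(n) time and space: prefix hashes and powers of BASE."""
    prefix = [0] * (len(tokens) + 1)
    power = [1] * (len(tokens) + 1)
    for i, tok in enumerate(tokens):
        prefix[i + 1] = (prefix[i] * BASE + token_code(tok)) % MOD
        power[i + 1] = (power[i] * BASE) % MOD
    return prefix, power

def sublist_hash(prefix, power, i, j):
    """Hash of tokens[i:j] in constant time."""
    return (prefix[j] - prefix[i] * power[j - i]) % MOD

def direct_hash(tokens):
    """Reference: hash a list from scratch."""
    h = 0
    for tok in tokens:
        h = (h * BASE + token_code(tok)) % MOD
    return h

# Postorder listing of the tree BINARY_EXP (-) with children IDENT (y), BINARY_EXP (*).
postorder = ["IDENT (y)", "IDENT (X)", "IDENT (theta)", "BINARY_EXP (*)", "BINARY_EXP (-)"]
prefix, power = precompute(postorder)
product_hash = sublist_hash(prefix, power, 1, 4)  # the BINARY_EXP (*) subtree
print(product_hash == direct_hash(postorder[1:4]))
```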


Indexing is fast in practice

[Figure: Time for indexing 1000 ASTs. x-axis: average AST size (# nodes), 0 to 500; y-axis: run time (seconds), 0 to 3]


Application: Statistical Bug Finding

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*sum(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

83% of ASTs containing this code phrase were buggy!

sum(X'*(X*theta-y)) vs. solution: X'*(X*theta-y);


Query Index

Fail Fail Fail Pass Fail Fail

83% of ASTs containing this code phrase were buggy!

Is sum(X'*(X*theta-y)) likely to be a bug?



Compute bug probability for all subforests, return smallest bugs found

Many ways to formulate probabilistic bug localization (we compute probability on local contexts)
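The simplest reading of this query is a failure rate over the index hits. The sketch below uses the six outcomes shown on the slide for the 83% figure; the `threshold`, the node counts, and the extra toy phrases are made-up assumptions, and the slides note that the actual system computes probabilities on local contexts rather than this plain ratio.

```python
def bug_probability(outcomes):
    """Fraction of submissions containing a code phrase that failed the unit tests."""
    if not outcomes:
        return None
    return sum(1 for o in outcomes if o == "Fail") / len(outcomes)

def smallest_suspect(phrases, threshold=0.8):
    """Return the smallest phrase whose failure rate reaches the (assumed) threshold."""
    suspects = []
    for text, size, outcomes in phrases:
        p = bug_probability(outcomes)
        if p is not None and p >= threshold:
            suspects.append((size, text))
    return min(suspects)[1] if suspects else None

slide_outcomes = ["Fail", "Fail", "Fail", "Pass", "Fail", "Fail"]
p = bug_probability(slide_outcomes)

# Hypothetical phrases as (text, number of AST nodes, outcomes of containing submissions).
phrases = [
    ("theta-alpha*1/m*sum(X'*(X*theta-y))", 15, slide_outcomes),
    ("sum(X'*(X*theta-y))", 9, slide_outcomes),
    ("X'*(X*theta-y)", 8, ["Pass", "Pass", "Fail", "Pass"]),
]
suspect = smallest_suspect(phrases)
print(round(p * 100), suspect)
```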


Bug Detection Accuracy

[Figure: scatter plot of bug detection F-score, Codewebs (x-axis, 0 to 1) vs. baseline (y-axis, 0 to 1). Each point represents a single coding problem. Bubble size = average # nodes per submitted AST. Labeled points: neural net training with backpropagation; logistic regression objective; linear regression with gradient descent]


More than one way to skin a cat…

Canonicalization: apply semantic preserving transformation rules to ASTs to increase matching probability
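As a small illustration of one such rule, the sketch below sorts the operands of commutative operators so that X*(Y+Z) and X*(Z+Y) map to the same tree. The tuple encoding and the `COMMUTATIVE` set are assumptions made for this example; multiplication is deliberately left out because matrix products do not commute.

```python
COMMUTATIVE = {"+"}  # assumed rule set; "*" is excluded because matrix products do not commute

def canonicalize(node):
    """Recursively sort the operands of commutative operators."""
    if isinstance(node, str):
        return node
    op, *args = node
    args = [canonicalize(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)
    return (op, *args)

a = ("*", "X", ("+", "Y", "Z"))  # X*(Y+Z)
b = ("*", "X", ("+", "Z", "Y"))  # X*(Z+Y)
c = ("*", ("+", "Y", "Z"), "X")  # (Y+Z)*X

print(canonicalize(a) == canonicalize(b), canonicalize(a) == canonicalize(c))
```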


Difficulties with Canonicalization

1. Impossible to predict all ways of writing the same thing
2. Canonicalization rules not typically generalizable across languages

X*(Y+Z)
X*(Z+Y)
(Y+Z)*X
X*Z+X*Y
Z*X+X*Y
Z*X+Y*X

X*1*(Y+Z)
X*[1;1]'*[Y;Z]
X*ones(1,2)*[Y;Z]
X*repmat(1,2,1)'*[Y;Z]
X*ones(1,length([Y;Z]))*[Y;Z]
X*transpose([1;1])*[Y;Z]
X*repmat(1,size([Y;Z],1),1)'*[Y;Z]


Codewebs Approach

Use data to determine canonicalization rules

Customize rules to each assignment

Don’t need to be perfect, we’re not building a compiler!


Here’s the idea.

def residual(X, theta, y):
    hypothesis = X * theta
    solution = hypothesis - y
    return solution

def residual(X, theta, y):
    hypothesis = (theta' * X')'
    solution = hypothesis - y
    return solution


Counter Example

Agreement can be context dependent

def foo():
    solution = solveProblem()
    print(solution)
    return solution

def foo():
    solution = solveProblem()
    print(!solution)
    return solution



Join on context

Query Index

Query Index

= ?


Fail Fail Pass Pass Fail Pass

Fail Fail Pass Pass Fail Pass

100% probability of equivalence! ** fine print: need to account for sample size in general
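The join on context can be sketched as comparing outcomes context by context. In this minimal sketch each phrase maps a context id to whether the submission passed; the context ids, the `equivalence_score` helper, and the `min_samples` cutoff are assumptions standing in for the fine print about sample size. The six outcomes are the ones shown on the slide.

```python
def equivalence_score(outcomes_a, outcomes_b, min_samples=5):
    """Fraction of shared contexts in which two code phrases lead to the same outcome.

    Returns None when too few contexts are shared to say anything (assumed cutoff).
    """
    shared = outcomes_a.keys() & outcomes_b.keys()
    if len(shared) < min_samples:
        return None
    agree = sum(1 for ctx in shared if outcomes_a[ctx] == outcomes_b[ctx])
    return agree / len(shared)

# Outcomes from the slide, keyed by hypothetical context ids 1..6 (True = Pass).
slide = [False, False, True, True, False, True]
phrase_a = {ctx: passed for ctx, passed in enumerate(slide, start=1)}
phrase_b = {ctx: passed for ctx, passed in enumerate(slide, start=1)}

score = equivalence_score(phrase_a, phrase_b)
too_few = equivalence_score({1: True}, {1: True})
print(score, too_few)
```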


Workflow

theta = theta-alpha*1/m*(X'*(X*theta-y));

Human provides:

“alphaOverM” “prediction” “residual”


{m}:
length (y)
size (X, 1)
size (y, 1)
rows (X)
m
rows (y)
length (X)
length (x (:, 1))
size (X) (1)

{alphaOverM}:
alpha * (1 ./ {m})
alpha * (1 / {m})
alpha * 1 ./ {m}
.01 / {m}
alpha * {m} ^ -1
alpha .* (1 ./ {m})
1 / {m} * alpha
alpha ./ {m}
alpha .* (1 / {m})
alpha * inv ({m})
1 .* alpha ./ {m}
alpha * pinv ({m})
alpha / {m}
alpha .* 1 / {m}
1 * alpha / {m}

{hypothesis}:
(theta' * X')'
(X * theta)
theta(1) + theta (2) * X (:, 2)
(X * theta (:))
[X] * theta
sum(X.*repmat(theta',{m},1), 2)
…

{residual}:
(theta' * X' - y')'
(X * theta - y)
({hypothesis} - y)
({hypothesis}' - y')'
[{hypothesis} - y]
sum({hypothesis} - y, 2)
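Once such classes exist, rewriting can proceed bottom-up: inner variants are replaced by their class name first, which lets outer variants match. The sketch below does this on whitespace-stripped strings with a handful of the variants listed above; the real system works on ASTs, and the `CLASSES` table and `canonicalize` helper are illustrative assumptions.

```python
# A few variants per class, written without whitespace; inner classes come first.
CLASSES = [
    ("{m}", ["length(y)", "size(X,1)", "size(y,1)", "rows(X)", "rows(y)"]),
    ("{alphaOverM}", ["alpha*(1/{m})", "alpha/{m}", "alpha./{m}", "1/{m}*alpha"]),
]

def canonicalize(expr):
    """Replace every known variant with its class name, innermost classes first."""
    text = "".join(expr.split())  # ignore whitespace
    for name, variants in CLASSES:
        for variant in sorted(variants, key=len, reverse=True):
            text = text.replace(variant, name)
    return text

first = canonicalize("alpha * (1 / length (y))")
second = canonicalize("alpha ./ size (X, 1)")
third = canonicalize("theta - alpha / rows (X) * (X' * (X * theta - y))")
print(first, second, third)
```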


Canonicalization improves bug detection accuracy

[Figure: F-score (y-axis, 0.65 to 0.95; higher is better) vs. # unique ASTs considered (x-axis), without canonicalization and with canonicalization]


How many submissions can we give feedback to with fixed effort?

[Figure: # submissions covered (out of 40,000; y-axis, 0 to 25,000) vs. # equivalence classes (x-axis ticks: 0, 1, 10, 19). Series: canonicalization with 25 ASTs marked; no canonicalization with 200 ASTs marked]


If we can find shared structure, we can facilitate feedback

In education: ASTs, proofs, essays, architecture, poems…

“gradient” “residual”

“learning rate”

“base case”

“inductive hypothesis”

“apartheid”

“Afrikaner Calvinism”

“elections in South Africa”


Data is revolutionizing many fields

And it can revolutionize education too!


Thank you!!

http://www.stanford.edu/~jhuang11

@jonathanhuang11