data driven student feedback for programming intensive moocs · coursera’s ml course [moocshop,...

Jonathan Huang Stanford University

Data driven student feedback for Programming Intensive MOOCs

Towards global scale CS education

Leonidas Guibas Chris Piech Andy Nguyen

Steve Jobs Stanford, 2005

“all of my working-class parents' savings were being spent on my college tuition… …the minute I dropped out [of college] I could stop taking the required classes that didn't interest me, and begin dropping in on the ones that looked interesting.”

Course selection is better online

MOOC = Massive Open Online Courses

Untapped potential

CS does not count towards high school math/science requirements in 36 of 50 states

[Chart: vertical axis in thousands of dollars ($0 to $40); horizontal axis by academic year, 73-74 through 13-14]

*** source: www.code.org, www.collegeboard.org

[Chart: vertical axis in thousands (0 to 2000); horizontal axis years 2011, 2016, 2021]

400 students 100,000 students

Stanford ML-class

10 TAs 2,500 TAs (???)

Stanford ML-class

Ease of global scale feedback on a spectrum

Short Response

Long Response

Multiple choice

Essay questions

Proofs

Today: Programming Assignments

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

Feedback for Coding Assignments: Easy?


Test Inputs

Correct / Incorrect ? Test Outputs

Linear Regression submission (Homework 1) for Coursera’s ML class

The “but it works!!” solution

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    hypo = X*theta;
    newMat = hypo - y;
    trans1 = (X(:,1));
    trans1 = trans1';
    newMat1 = trans1 * newMat;
    temp1 = sum(newMat1);
    temp1 = (temp1 *alpha)/m;
    A = [temp1];
    theta(1) = theta(1) - A;
    trans2 = (X(:,2))' ;
    newMat2 = trans2*newMat;
    temp2 = sum(newMat2);
    temp2 = (temp2 *alpha)/m;
    B = [temp2];
    theta(2)= theta(2) - B;
    J_history(iter) = computeCost(X, y, theta);
end
theta(1) = theta(1);
theta(2)= theta(2);

Why??

Correctness Efficiency

Style Elegance

Better: theta = theta-(alpha/m) *X'*(X*theta-y)

Good Good Poor Poor

Let’s do this: for a class of 100,000, and in real time,

But… can’t require too much instructor effort to make this work for:

New programming problems

New courses

New programming languages

We now have massive datasets…

Visualization of 40,000 implementations of linear regression submitted to Coursera’s ML course

[Moocshop, 2013]

[Figure: axis “# Students”; labels: Intro CS, 1K, 10K, 20M]

Codewebs Engine

• Efficient index for “code phrases” of a MOOC dataset
• Shared structure discovery amongst many student submissions
• Applications such as bug finding (w/o execution) and MOOC-scale feedback
• Results on real MOOC with > 1 million submissions

First, an example application

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*sum(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

Correct

Incorrect


Dear Lisa Simpson, consider the dimension of the expression:

X'*(X*theta-y) and what happens after you call sum on it…
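The dimension hint can be made concrete with a small sketch. This is an illustration only: plain Python lists stand in for the Octave matrices, and the function names `gradient`, `step_correct`, and `step_extraneous_sum` are hypothetical, not taken from the course materials. The expression X'*(X*theta-y) has one entry per parameter; calling sum on it collapses those entries into a single number, so every parameter then receives the same update.

```python
def gradient(X, y, theta):
    """Return X' * (X*theta - y) as a list with one entry per parameter."""
    residual = [sum(x * t for x, t in zip(row, theta)) - yi
                for row, yi in zip(X, y)]
    return [sum(row[j] * r for row, r in zip(X, residual))
            for j in range(len(theta))]


def step_correct(X, y, theta, alpha):
    """One gradient step: each parameter uses its own gradient entry."""
    m = len(y)
    g = gradient(X, y, theta)
    return [t - alpha / m * gj for t, gj in zip(theta, g)]


def step_extraneous_sum(X, y, theta, alpha):
    """The buggy step: sum() collapses the gradient to one scalar,
    which is then subtracted from every parameter alike."""
    m = len(y)
    s = sum(gradient(X, y, theta))
    return [t - alpha / m * s for t in theta]


# Toy data (made up for illustration): first column is the intercept term.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
theta0 = [0.0, 0.0]

good = step_correct(X, y, theta0, 0.1)
bad = step_extraneous_sum(X, y, theta0, 0.1)
print("correct step:", good)
print("with extraneous sum:", bad)
```

In the buggy version both parameters move by the same amount, which is the behavior the hint asks the student to notice.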

Syntax based approach:

Attach this message to everyone containing that exact expression

(covers 99 submissions)

The extraneous sum bug takes many forms…

theta = theta-alpha*1/m*sum(X'*(X*theta-y));

theta = theta-alpha*1/m*sum(((theta'*X')'-y)'*X);

theta = theta-alpha*1/m*sum(transpose(X*theta-y)*X);

…

(Easier) Output based approach:

Attach message to everyone who matched extraneous sum bug in unit test output

(covers 1091 submissions)

Codewebs approach to feedback

Step 1: Find equivalent ways of writing buggy expression using Codewebs engine
Step 2: Write a thoughtful/meaningful hint or explanation
Step 3: Propagate feedback message to any submission containing equivalent expression

# submissions covered by single message:

Output based 1091
Codewebs 1208
Combined 1604


~47% improvement over just using an output based feedback system!!

Abstract syntax tree representations

ASTs ignore:

• Whitespace
• Comments
• …

function A = warmUpExercise()
    A = [];
    A = eye(5);
endfunction

ASSIGN
    IDENT (A)
    INDEX_EXP
        IDENT (eye)
        ARGUMENT_LIST
            CONST (5)
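As a rough illustration of why ASTs ignore whitespace and comments, the sketch below uses Python's standard `ast` module rather than an Octave parser (the slides work with Octave code, so this is an analogy, not the Codewebs parser). Two spellings of the same assignment produce identical trees.

```python
import ast

# Two spellings of the same statement; only whitespace and a comment differ.
source_a = "A = eye(5)  # identity matrix\n"
source_b = "A   =   eye( 5 )\n"

tree_a = ast.parse(source_a)
tree_b = ast.parse(source_b)

# ast.dump omits line and column attributes by default, so equal dumps
# mean the two trees have the same structure.
same_tree = ast.dump(tree_a) == ast.dump(tree_b)
print("same AST:", same_tree)

# Walk the tree for the assignment: target, called function, argument.
assign = tree_a.body[0]
call = assign.value
print(type(assign).__name__, assign.targets[0].id, call.func.id, call.args[0].value)
```

The node names differ from the ASSIGN / IDENT / INDEX_EXP labels on the slide, since each parser has its own grammar, but the shape is comparable: an assignment node with an identifier on the left and a call with one constant argument on the right.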

Indexing documents by phrases

term/phrase    document list
best           {1,3}
blue           {2,4,6}
bright         {7,8,10,11,12}
heat           {1,5,13}
kernel         {2,5,6,9,56}
sky            {1,2}
submarine      {2,3,4}
woes           {10,19,38}
yellow         {2,4}

The bright and blue butterfly hangs on the breeze…

We all something something yellow submarine…

“blue sky” “yellow submarine”
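The lookup behind the two example queries can be sketched as a set intersection over the table above. This is a minimal sketch: the function name `candidate_documents` is hypothetical, and intersecting document lists only yields documents containing every term; a true phrase query would also need word positions, which the table does not show.

```python
# The term -> document-list table from the slide.
index = {
    "best": {1, 3},
    "blue": {2, 4, 6},
    "bright": {7, 8, 10, 11, 12},
    "heat": {1, 5, 13},
    "kernel": {2, 5, 6, 9, 56},
    "sky": {1, 2},
    "submarine": {2, 3, 4},
    "woes": {10, 19, 38},
    "yellow": {2, 4},
}


def candidate_documents(index, phrase):
    """Return the documents that contain every term of the phrase."""
    doc_sets = [index.get(term, set()) for term in phrase.split()]
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)


blue_sky = candidate_documents(index, "blue sky")
yellow_submarine = candidate_documents(index, "yellow submarine")
print(blue_sky, yellow_submarine)
```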


What basic queries should an AST search engine support?

Code Phrases

Subtrees and subforests of an AST

BINARY_EXP (*)
    POSTFIX (')
        IDENT (X)
    BINARY_EXP (-)
        IDENT (y)
        BINARY_EXP (*)
            IDENT (X)
            IDENT (theta)

Code Phrases

Context within a larger subtree

BINARY_EXP (*)


IDENT (X) IDENT (y) BINARY_EXP (*)

IDENT (X) IDENT (theta)

Code Phrases

Context within a larger subtree

BINARY_EXP (*)


IDENT (X) IDENT (y)

replacement site

The Codewebs index

Code phrase hash                            AST list
2ccf02adb1cbabfb347d3b5d0a05b249855a7583    {1,3}
b3bc37a318c2b895b3e644a12cfc6ebcfa5a06bd    {2,4,6}
b3353c96e2cee8ee6c3ba260e037a93ca0ba3a5e    {7,8,10,11,12}
2c01571626bf01338c8cdb15cf9d844d65f04645    {1,5,13}
313f48d3f5888afc5d5aa28ab1393d94661edd31    {2,5,6,9,56}
61d4bfccaa97cca2004102a297cc5acd281ea3a9    {1,2}
467b4d400aab42d3bf96a119c4620e74d6fe57b3    {2,3,4}
1ae95f6fa24bc25871cdc55cb472abdd68db93de    {10,19,38}

10 print “hello” 20 goto 10

10 for i=1:10 20 x = x+1 30 end


def buildIndex():
    for A in ASTs:
        for every code phrase x contained in A:    (subtrees, subforests, contexts)
            Compute hashcode h[x]
            Insert A at h[x]

Very expensive, esp. for large ASTs!
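The pseudocode can be turned into a small runnable sketch under several assumptions: an AST is a hypothetical `(label, children)` tuple, only complete subtrees are indexed (subforests and contexts are left out for brevity), and SHA-1 over a serialized string is used as the hash purely for illustration. Re-serializing every subtree from scratch is what makes this naive version expensive for large ASTs.

```python
import hashlib
from collections import defaultdict


def serialize(node):
    """Turn a (label, children) tuple into a canonical string."""
    label, children = node
    return label + "(" + ",".join(serialize(c) for c in children) + ")"


def subtrees(node):
    """Yield the node itself and every subtree below it."""
    yield node
    for child in node[1]:
        yield from subtrees(child)


def phrase_hash(node):
    """Hash one code phrase (here: one complete subtree)."""
    return hashlib.sha1(serialize(node).encode("utf-8")).hexdigest()


def build_index(asts):
    """Map each code phrase hash to the set of AST ids containing it."""
    index = defaultdict(set)
    for ast_id, root in asts.items():
        for phrase in subtrees(root):
            index[phrase_hash(phrase)].add(ast_id)
    return index


# Toy ASTs, made up for illustration.
ident_y = ("IDENT y", [])
x_theta = ("BINARY_EXP *", [("IDENT X", []), ("IDENT theta", [])])
residual = ("BINARY_EXP -", [x_theta, ident_y])

index = build_index({1: residual, 2: x_theta, 3: ident_y})
print(sorted(index[phrase_hash(x_theta)]))
```

Querying the index with the hash of a phrase returns every AST that contains it, which is the lookup the later bug-finding slides rely on.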

Hashing Code Phrases

BINARY_EXP (-)
    IDENT (y)
    BINARY_EXP (*)
        IDENT (X)
        IDENT (theta)

BINARY_EXP (-)

IDENT (y)

BINARY_EXP (*)

IDENT (X)

IDENT (theta)

Step 1. Create postorder listing of nodes.

Step 2. Hash postorder list via:

Recycling hash computations


Observation: Can hash sublist of postorder to get hash of code phrases!




Idea of DP: Store prefix hashes and prime powers for all

O(n) in time and space




After precomputation, can get any other hash in constant time!
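The hash formula itself did not survive on the slide, so the sketch below assumes a standard polynomial rolling hash; the constants `BASE` and `MOD`, the helper `token_code`, and the class name `PrefixHasher` are all assumptions made for illustration. It shows the dynamic-programming idea: after one O(n) pass that stores prefix hashes and powers of the base, the hash of any contiguous sublist of the postorder listing follows in constant time. A complete subtree always occupies a contiguous run of its tree's postorder listing, which is why sublist hashes are useful here.

```python
import hashlib

BASE = 31              # assumed prime base
MOD = (1 << 61) - 1    # assumed modulus (a Mersenne prime)


def token_code(token):
    """Deterministic integer code for a node label."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % MOD


def direct_hash(tokens):
    """Polynomial hash of a token list, computed from scratch."""
    h = 0
    for tok in tokens:
        h = (h * BASE + token_code(tok)) % MOD
    return h


class PrefixHasher:
    """Stores prefix hashes and powers of BASE: O(n) time and space."""

    def __init__(self, tokens):
        self.prefix = [0]
        self.power = [1]
        for tok in tokens:
            self.prefix.append((self.prefix[-1] * BASE + token_code(tok)) % MOD)
            self.power.append(self.power[-1] * BASE % MOD)

    def sublist_hash(self, i, j):
        """Hash of tokens[i:j] in constant time."""
        return (self.prefix[j] - self.prefix[i] * self.power[j - i]) % MOD


# Postorder listing of a small tree for y and X*theta joined by "-".
postorder = ["IDENT(y)", "IDENT(X)", "IDENT(theta)", "BINARY_EXP(*)", "BINARY_EXP(-)"]
hasher = PrefixHasher(postorder)

# The subtree X*theta occupies positions 1..3 of the postorder listing.
print(hasher.sublist_hash(1, 4) == direct_hash(postorder[1:4]))
```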

Indexing is fast in practice

[Plot: Time for indexing 1000 ASTs. Horizontal axis: average AST size (# nodes), 0 to 500. Vertical axis: run time (seconds), 0 to 3.]


Application: Statistical Bug Finding


Is sum(X'*(X*theta-y)) likely to be a bug?

Query Index

Fail Fail Fail Pass Fail Fail

83% of ASTs containing this code phrase were buggy!

Query: sum(X'*(X*theta-y)) vs. Solution: X'*(X*theta-y);


Compute bug probability for all subforests, return smallest bugs found

Many ways to formulate probabilistic bug localization (we compute probability on local contexts)
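A minimal sketch of the simplest such formulation: the fraction of ASTs containing a code phrase that fail the unit tests. The function name `bug_probability` and the toy data are assumptions for illustration; the slides note that the actual system computes probabilities on local contexts, which this sketch does not attempt.

```python
def bug_probability(index, passed, phrase):
    """Fraction of ASTs containing `phrase` that failed the unit tests.

    index  : dict mapping a code phrase to the set of AST ids containing it
    passed : dict mapping an AST id to True (pass) or False (fail)
    Returns None when the phrase does not occur in any AST.
    """
    ast_ids = index.get(phrase, set())
    if not ast_ids:
        return None
    failures = sum(1 for ast_id in ast_ids if not passed[ast_id])
    return failures / len(ast_ids)


# Toy data mirroring the slide: six ASTs contain the phrase, five fail.
index = {"sum(X'*(X*theta-y))": {1, 2, 3, 4, 5, 6}}
passed = {1: False, 2: False, 3: False, 4: True, 5: False, 6: False}

p = bug_probability(index, passed, "sum(X'*(X*theta-y))")
print(round(p, 2))
```

Five failures out of six matches the 83% figure on the slide. With so few samples the estimate is noisy, so a practical version would presumably weigh the sample size as well.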

Bug Detection Accuracy

[Scatter plot: bug detection F-score (Codewebs) on the horizontal axis against bug detection F-score (Baseline) on the vertical axis, both from 0 to 1. Each point represents a single coding problem. Bubble size = average # nodes per submitted AST. Labeled points: neural net training with backpropagation; logistic regression objective; linear regression with gradient descent.]

More than one way to skin a cat…

Canonicalization: apply semantic preserving transformation rules to ASTs to increase matching probability

X*1*(Y+Z)

X*[1;1]'*[Y;Z]

X*ones(1,2)*[Y;Z]

X*repmat(1,2,1)'*[Y;Z]

X*ones(1,length([Y;Z]))*[Y;Z]

X*transpose([1;1])*[Y;Z]

X*repmat(1,size([Y;Z],1),1)'*[Y;Z]

X*(Y+Z)

X*(Z+Y)

(Y+Z)*X

X*Z+X*Y

Z*X+X*Y

Z*X+Y*X

Difficulties with Canonicalization

1. Impossible to predict all ways of writing the same thing

2. Canonicalization rules not typically generalizable across languages

Codewebs Approach

Use data to determine canonicalization rules

Customize rules to each assignment

Don’t need to be perfect, we’re not building a compiler!

Here’s the idea.

def residual(X, theta, y):
    hypothesis = X * theta
    solution = hypothesis - y
    return solution

def residual(X, theta, y):
    hypothesis = (theta' * X')'
    solution = hypothesis - y
    return solution

Counter Example

Agreement can be context dependent

def foo():
    solution = solveProblem()
    print(solution)
    return solution

def foo():
    solution = solveProblem()
    print(!solution)
    return solution

Join on context

Query Index

Query Index

= ?

Fail Fail Pass Pass Fail Pass

Fail Fail Pass Pass Fail Pass

100% probability of equivalence! ** fine print: need to account for sample size in general
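One simple reading of the join, sketched under assumptions: each code phrase is recorded with the unit-test outcome it produced in each surrounding context, the two records are joined on the contexts they share, and the fraction of shared contexts with the same outcome is reported. The function name `equivalence_probability` and the context labels are hypothetical, and the sample-size correction from the fine print is left out.

```python
def equivalence_probability(outcomes_a, outcomes_b):
    """Join two phrases on shared contexts and measure agreement.

    outcomes_a, outcomes_b : dicts mapping a context id to True (pass)
    or False (fail) for the submission with that phrase in that context.
    Returns None when the phrases share no context.
    """
    shared = outcomes_a.keys() & outcomes_b.keys()
    if not shared:
        return None
    agreements = sum(1 for c in shared if outcomes_a[c] == outcomes_b[c])
    return agreements / len(shared)


# Toy data mirroring the slide: Fail Fail Pass Pass Fail Pass for both.
pattern = [False, False, True, True, False, True]
phrase_a = {"ctx%d" % i: ok for i, ok in enumerate(pattern)}
phrase_b = {"ctx%d" % i: ok for i, ok in enumerate(pattern)}

prob = equivalence_probability(phrase_a, phrase_b)
print(prob)
```

For the counter example above, the two print statements would share contexts but could disagree on outcomes, which lowers the estimate for that pair.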

Workflow

theta = theta-alpha*1/m*(X'*(X*theta-y));

“alphaOverM” “prediction”

“residual”

Human provides:

{m}

length (y)    size (X, 1)    size (y, 1)
rows (X)    m    rows (y)
length (X)    length (x (:, 1))    size (X) (1)

{residual}

(theta' * X' - y')'
(X * theta - y)
({hypothesis} - y)
({hypothesis}' - y')'
[{hypothesis} - y]
sum({hypothesis} - y, 2)
…

{alphaOverM}

alpha * (1 ./ {m})    alpha * (1 / {m})    alpha * 1 ./ {m}
.01 / {m}    alpha * {m} ^ -1    alpha .* (1 ./ {m})
1 / {m} * alpha    alpha ./ {m}    alpha .* (1 / {m})
alpha * inv ({m})    1 .* alpha ./ {m}    alpha * pinv ({m})
alpha / {m}    alpha .* 1 / {m}    1 * alpha / {m}

{hypothesis}

(theta' * X')'
(X * theta)
theta(1) + theta (2) * X (:, 2)
(X * theta (:))
[X] * theta
sum(X.*repmat(theta',{m},1), 2)
…
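How such equivalence classes could be applied can be sketched with plain string rewriting. This is a deliberately crude stand-in: the Codewebs engine operates on ASTs, whereas the hypothetical `canonicalize` function below strips spaces and replaces known class members with their placeholder, inner classes first so that outer classes can refer to them. The class members are a small subset of the lists above.

```python
def canonicalize(expression, classes):
    """Replace known equivalent sub-expressions with class placeholders.

    classes is an ordered list of (name, members); earlier classes are
    rewritten first so later members may mention their placeholders.
    """
    text = expression.replace(" ", "")
    for name, members in classes:
        # Try longer members first so a short one never splits a long one.
        for member in sorted(members, key=len, reverse=True):
            text = text.replace(member.replace(" ", ""), "{" + name + "}")
    return text


# A small subset of the equivalence classes listed above.
classes = [
    ("m", ["length (y)", "size (X, 1)", "rows (X)"]),
    ("hypothesis", ["(theta' * X')'", "(X * theta)"]),
    ("residual", ["({hypothesis}' - y')'", "({hypothesis} - y)"]),
]

submission_a = "X' * ((X * theta) - y) / length (y)"
submission_b = "X' * ((theta' * X')' - y) / size (X, 1)"

canon_a = canonicalize(submission_a, classes)
canon_b = canonicalize(submission_b, classes)
print(canon_a)
print(canon_b)
```

Both submissions reduce to the same canonical string, so a feedback message or bug statistic attached to one form would also reach the other.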

Canonicalization improves bug detection accuracy

[Plot: F-score (0.65 to 0.95) against # unique ASTs considered; series: without canonicalization, with canonicalization. Higher is better.]

How many submissions can we give feedback to with fixed effort?

[Plot: # submissions covered (out of 40,000), 0 to 25000, against # equivalence classes (0, 1, 10, 19); series: with 25 ASTs marked, with 200 ASTs marked. Annotations: Canonicalization, 25 marked ASTs; No Canonicalization, 200 marked ASTs.]

If we can find shared structure, we can facilitate feedback

In education: ASTs, proofs, essays, architecture, poems…

“gradient” “residual”

“learning rate”

“base case”

“inductive hypothesis”

“apartheid”

“Afrikaner Calvinism”

“elections in South Africa”

Data is revolutionizing many fields

And it can revolutionize education too!

Thank you!! [email protected] http://www.stanford.edu/~jhuang11 @jonathanhuang11

