Transcript
Page 1: Naive application of Machine Learning to Software Development

Naive application of Machine Learning to Software Development

Page 2: Naive application of Machine Learning to Software Development

Naive application of Machine Learning to Software Development

or... what developers don't tell :)

Page 5: Naive application of Machine Learning to Software Development

What and why

42 Coffee Cups:

completely distributed development team

Hard facts about how software is done

LOTS OF THEM

Page 8: Naive application of Machine Learning to Software Development

What and why

Facts → ??? → Profit

Page 10: Naive application of Machine Learning to Software Development

What and why

??? = toy problem:

take a ticket and predict how long it will take to close it

Bonus: learn scikit-learn :)

Page 15: Naive application of Machine Learning to Software Development

Install scikit-learn

● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++

● pip install -U scikit-learn

Page 16: Naive application of Machine Learning to Software Development

Data: closed tickets

    import urllib2

    # Trac query: export the tickets as CSV with the columns we need
    url = 'https://code.djangoproject.com/query?format=csv' + \
          '&col=id&col=time&col=changetime&col=reporter' + \
          '&col=summary&col=status&col=owner&col=type' + \
          '&col=component&order=priority'

    tickets = urllib2.urlopen(url).read()
    open('2012-10-09.csv', 'w').write(tickets)
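The slide code is Python 2 (urllib2). A minimal Python 3 sketch of the same query URL follows. It only builds the URL; the download is left commented out. The helper name build_query_url and the constants are illustrative and are not taken from the talk's repository.

```python
from urllib.parse import urlencode

# Base URL and column list copied from the slide.
BASE = "https://code.djangoproject.com/query"
COLUMNS = ["id", "time", "changetime", "reporter", "summary",
           "status", "owner", "type", "component"]


def build_query_url(base=BASE, columns=COLUMNS):
    """Return the Trac CSV export URL with one col= parameter per column."""
    params = [("format", "csv")]
    params += [("col", name) for name in columns]
    params.append(("order", "priority"))
    # A list of pairs keeps the parameters in the order given.
    return base + "?" + urlencode(params)


url = build_query_url()

# To actually download (network access required, not run here):
# from urllib.request import urlopen
# with urlopen(url) as response, open("tickets.csv", "wb") as out:
#     out.write(response.read())
```

Building the query from a list of pairs avoids the hand-concatenated string, where a stray quote or backslash is easy to miss.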

Page 17: Naive application of Machine Learning to Software Development

Data: closed tickets

id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
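A small sketch of reading rows like the two above with the standard csv module. The sample text is copied from the slide; parse_tickets and the seconds_to_changetime field are illustrative names. Note that changetime is the last modification time, which is presumably why the next slides scrape the real closing date from the ticket page.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# The header and the two rows shown on the slide.
SAMPLE = "\n".join([
    "id,time,changetime,reporter,summary,status,owner,type,component",
    "1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,"
    "Create architecture for anonymous sessions,closed,jacob,"
    "enhancement,Core (Other)",
    "2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,"
    "Calendar popup - next/previous month links close the popup in Safari,"
    "closed,jacob,defect,contrib.admin",
])


def parse_tickets(text):
    """Parse the CSV export into dicts with datetime fields."""
    tickets = []
    for record in csv.DictReader(io.StringIO(text)):
        record["time"] = datetime.strptime(record["time"], TIME_FORMAT)
        record["changetime"] = datetime.strptime(record["changetime"],
                                                 TIME_FORMAT)
        delta = record["changetime"] - record["time"]
        record["seconds_to_changetime"] = delta.total_seconds()
        tickets.append(record)
    return tickets


tickets = parse_tickets(SAMPLE)
```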

Page 18: Naive application of Machine Learning to Software Development

Data: closed date and description

    import urllib2
    from bs4 import BeautifulSoup

    def get_data(ticket):
        # fetch and parse the ticket page
        url = 'https://code.djangoproject.com/ticket/%s' % ticket
        ticket_html = urllib2.urlopen(url)
        bs = BeautifulSoup(ticket_html)

Page 19: Naive application of Machine Learning to Software Development

Data: closed date and description

        # get closing date
        d = bs.find_all('div', 'date')[0]
        p = list(d.children)[3]
        href = p.find('a')['href']
        close_time_str = urlparse.parse_qs(href)['/timeline?from'][0]
        close_time = datetime.datetime.strptime(
            close_time_str[:-6],  # drop the last 6 characters
            '%Y-%m-%dT%H:%M:%S')
        # ... more black magic, see code
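A Python 3 sketch of the timestamp extraction step on its own. The example href is hypothetical; it assumes the timeline link carries an ISO timestamp with a 6-character UTC offset, which is what the [:-6] slice on the slide suggests. Splitting off the query string first makes the key simply 'from' rather than '/timeline?from'.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlsplit


def parse_close_time(href):
    """Extract the 'from' timestamp of a timeline link as a naive datetime."""
    query = urlsplit(href).query          # part after the '?'
    stamp = parse_qs(query)["from"][0]    # e.g. 2012-05-20T08:12:37-07:00
    # Drop the trailing 6 characters (assumed to be the UTC offset),
    # as the slide code does, and parse the rest.
    return datetime.strptime(stamp[:-6], "%Y-%m-%dT%H:%M:%S")


# Hypothetical link, shaped like the ones the slide code expects.
example_href = "/timeline?from=2012-05-20T08%3A12%3A37-07%3A00&precision=second"
close_time = parse_close_time(example_href)
```

Discarding the offset treats every timestamp as local time, which is good enough when the target is measured in days.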

Page 20: Naive application of Machine Learning to Software Development

Data: closed date and description

    def get_data(ticket):
        [...]
        # get description and return
        de = bs.find_all('div', 'description')[0]
        return close_time, de.text

Page 21: Naive application of Machine Learning to Software Development

Data: closed date and description

    import csv

    tickets_file = csv.reader(open('2012-10-09.csv'))
    output = csv.writer(open('2012-10-09.close.csv', 'w'))

    # add the scraped closing date and description to every ticket row
    for id, time, changetime, reporter, summary, \
            status, owner, type, component in tickets_file:
        closetime, descr = get_data(id)
        row = [id, time, changetime, closetime, reporter,
               summary, status, owner, type, component,
               descr.encode('utf-8')]
        output.writerow(row)

Page 22: Naive application of Machine Learning to Software Development

Scoring: Train/Test set split

cross_validation.train_test_split

    (tickets_train, tickets_test,
     times_train, times_test) = cross_validation.train_test_split(
        tickets, times,
        test_size=0.2,
        random_state=0)
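In current scikit-learn releases the cross_validation module no longer exists; train_test_split lives in sklearn.model_selection. A minimal sketch of the same 80/20 split, using made-up data in place of the ticket features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real feature matrix and target vector.
rng = np.random.RandomState(0)
tickets = rng.rand(100, 2)          # 100 tickets, 2 features each
times = rng.rand(100) * 1e7         # time to close, in seconds

# Hold out 20% of the tickets for testing; random_state makes it repeatable.
tickets_train, tickets_test, times_train, times_test = train_test_split(
    tickets, times, test_size=0.2, random_state=0)
```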

Page 23: Naive application of Machine Learning to Software Development

Scoring: Mean squared error

sklearn.metrics.mean_squared_error

    train_error = metrics.mean_squared_error(
        times_train, times_train_predict)
    test_error = metrics.mean_squared_error(
        times_test, times_test_predict)

Page 24: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

    for number, created, ... in tickets_file:
        row = []
        created = dt.datetime.strptime(created, time_format)
        closetime = dt.datetime.strptime(closetime, time_format)
        time_to_fix = closetime - created
        row.append(float(number))
        tickets.append(row)
        times.append(total_seconds(time_to_fix))

Page 25: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

    import numpy as np
    from sklearn import preprocessing

    scaler = preprocessing.Scaler().fit(np.array(tickets))
    tickets = scaler.transform(tickets)
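preprocessing.Scaler is the old name; current scikit-learn releases call it StandardScaler. A minimal sketch on four made-up ticket numbers, showing that the scaled column ends up with zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature (the ticket number) for four made-up tickets.
tickets = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler().fit(tickets)   # learns mean and std per column
scaled = scaler.transform(tickets)       # (x - mean) / std
```

SVMs are sensitive to feature scale, so a raw ticket number in the tens of thousands would otherwise dominate any feature added later.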

Page 26: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

    clf = SVR()
    clf.fit(tickets_train, times_train)
    times_train_predict = clf.predict(tickets_train)
    times_test_predict = clf.predict(tickets_test)

Page 27: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

    train_error = metrics.mean_squared_error(
        times_train, times_train_predict)
    test_error = metrics.mean_squared_error(
        times_test, times_test_predict)

    # root of the mean squared error, converted from seconds to days
    print 'Train error: %.1f\n Test error: %.2f' % (
        math.sqrt(train_error) / (24 * 3600),
        math.sqrt(test_error) / (24 * 3600))
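The reported "error" is therefore the root mean squared error expressed in days. A worked sketch of that arithmetic in plain Python, with made-up numbers; rmse_days is an illustrative helper, not part of the talk's code:

```python
import math

SECONDS_PER_DAY = 24 * 3600


def rmse_days(actual_seconds, predicted_seconds):
    """Root mean squared error of predictions in seconds, reported in days."""
    if len(actual_seconds) != len(predicted_seconds):
        raise ValueError("inputs must have the same length")
    squared = [(a - p) ** 2
               for a, p in zip(actual_seconds, predicted_seconds)]
    mse = sum(squared) / len(squared)
    return math.sqrt(mse) / SECONDS_PER_DAY


# Two made-up tickets: closed after 10 and 20 days, predicted 13 and 16.
actual = [10 * SECONDS_PER_DAY, 20 * SECONDS_PER_DAY]
predicted = [13 * SECONDS_PER_DAY, 16 * SECONDS_PER_DAY]
error = rmse_days(actual, predicted)   # sqrt((3**2 + 4**2) / 2) days
```

Because the errors are squared before averaging, a few tickets that stay open for years weigh heavily on this number.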

Page 28: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

Train error: 363.4

Test error: 361.41

Page 29: Naive application of Machine Learning to Software Development

Finding best parameters

SVM C controls regularization: larger C leads to

● closer fit to the train data
● with the risk of overfitting

Page 30: Naive application of Machine Learning to Software Development

Finding best parameters

    Cs = np.logspace(-1, 10, 10)
    for c in Cs:
        learn(c)

Page 31: Naive application of Machine Learning to Software Development

Finding best parameters

C                Train error   Test error
0.1              363.4         361.41
1.71             363.4         361.41
27.8             363.4         361.39
464.2            363.2         361.17
7742.6           362.5         360.41
129155.0         362.1         360.00
2154434.7        362.0         359.82
35938136.6       361.7         359.60
599484250.3      361.5         359.36
10000000000.0    361.1         358.91

Page 33: Naive application of Machine Learning to Software Development

Finding best parameters

sklearn.grid_search.GridSearchCV

bonus: it can run in parallel

    clf = GridSearchCV(estimator=SVR(),
                       param_grid=dict(C=np.logspace(-1, 10, 10)),
                       n_jobs=-1)
    clf.fit(tickets_train, times_train)

Train error: 361.1 Test error: 358.91

Best C: 1.0e+10
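In current scikit-learn releases GridSearchCV is imported from sklearn.model_selection rather than sklearn.grid_search. A minimal sketch of the same search over C, on synthetic data and with a much smaller grid so it runs quickly; the data, the grid and the cv and scoring choices are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression problem standing in for the ticket data.
rng = np.random.RandomState(0)
X = rng.rand(60, 1)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(60)

# Small logarithmic grid over the regularization parameter C.
param_grid = {"C": np.logspace(-1, 2, 4)}

search = GridSearchCV(
    estimator=SVR(),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # pick C by cross-validated MSE
    cv=3,
    n_jobs=1)                          # n_jobs=-1 would use all cores
search.fit(X, y)

best_c = search.best_params_["C"]
```

After fitting, search.predict uses the estimator refitted with the best C, so it can replace the plain SVR in the scoring code.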

Page 34: Naive application of Machine Learning to Software Development

Fun #2: creation date?

    row = []
    row.append(float(number))
    row.append(float(time.mktime(created.timetuple())))
    tickets.append(row)

Page 35: Naive application of Machine Learning to Software Development

Fun #2: creation date?

Train error: 360.6 Test error: 358.39

Best C: 1.0e+10

Page 36: Naive application of Machine Learning to Software Development

String vectorizer and Tfidf transform

    from sklearn.feature_extraction.text import (
        CountVectorizer, TfidfTransformer)

Page 37: Naive application of Machine Learning to Software Development

String vectorizer and Tfidf transform

    reporters = []
    for number, ... in tickets_file:
        [...]
        reporters.append(reporter)

Page 38: Naive application of Machine Learning to Software Development

String vectorizer and Tfidf transform

CountVectorizer().fit_transform(reporters) ->
TfidfTransformer().fit_transform( … ) ->
hstack((tickets, … ))

note: TF-IDF matrix is sparse!

Page 39: Naive application of Machine Learning to Software Development

String vectorizer and Tfidf transform

    import scipy.sparse as sp

    tickets = sp.hstack((
        tickets,
        TfidfTransformer().fit_transform(
            CountVectorizer().fit_transform(reporters))))

    # remember to re-scale!
    scaler = preprocessing.Scaler(with_mean=False).fit(tickets)
    tickets = scaler.transform(tickets)
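A self-contained sketch of this feature-stacking step with current scikit-learn names (StandardScaler instead of Scaler). The three tickets and reporter names are made up; with_mean=False is kept because centering would turn the sparse matrix dense.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler

# Made-up data: one numeric feature and one reporter per ticket.
numeric = np.array([[1.0], [2.0], [3.0]])
reporters = ["adrian", "anonymous", "adrian"]

# Words -> counts -> TF-IDF weights (a sparse matrix).
counts = CountVectorizer().fit_transform(reporters)
tfidf = TfidfTransformer().fit_transform(counts)

# Put the numeric column next to the TF-IDF columns, keeping it sparse.
features = sp.hstack((sp.csr_matrix(numeric), tfidf)).tocsr()

# Re-scale without centering so the result stays sparse.
scaled = StandardScaler(with_mean=False).fit_transform(features)
```

Each distinct reporter becomes its own column, so the feature matrix grows with the vocabulary; that matters even more for the n-gram features used on summaries and descriptions.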

Page 40: Naive application of Machine Learning to Software Development

Fun #3: reporter

Train error: 338.7 Test error: 353.38

Best C: 1.8e+07

Page 41: Naive application of Machine Learning to Software Development

Fun #4: subject

    subjects = []
    for number, created, ... in tickets_file:
        [...]
        subjects.append(summary)
    [...]
    tickets = sp.hstack((tickets,
        TfidfTransformer().fit_transform(
            CountVectorizer(ngram_range=(1, 3)
                            ).fit_transform(subjects))))

Page 43: Naive application of Machine Learning to Software Development

Fun #4: subject

Train error: 21.0 Test error: 331.79

Best C: 1.0e+10

Page 44: Naive application of Machine Learning to Software Development

Different SVM kernels

    def learn(kernel='rbf', param_grid=None, verbose=False):
        [...]
        clf = GridSearchCV(
            estimator=SVR(kernel=kernel, verbose=verbose),
            param_grid=param_grid,
            n_jobs=-1)
        [...]
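The elided parts of learn are not shown on the slide. The following is only a guess at its overall shape, written against current scikit-learn and run on synthetic data: split, grid-search C for one kernel, report train and test RMSE. The signature, the default grid and the return value are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR


def learn(X, y, kernel="rbf", param_grid=None):
    """Grid-search C for one kernel; return (train RMSE, test RMSE, best C)."""
    if param_grid is None:
        param_grid = {"C": np.logspace(-1, 2, 4)}
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    search = GridSearchCV(SVR(kernel=kernel), param_grid, cv=3)
    search.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, search.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
    return train_rmse, test_rmse, search.best_params_["C"]


# Synthetic, non-linear target so the two kernels have something to differ on.
rng = np.random.RandomState(0)
X = rng.rand(80, 2)
y = np.sin(4 * X[:, 0]) + 0.1 * rng.randn(80)

results = {kernel: learn(X, y, kernel=kernel) for kernel in ("rbf", "linear")}
```

Comparing the train and test RMSE side by side for each kernel is what exposes a model that memorizes the training set.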

Page 45: Naive application of Machine Learning to Software Development

Different SVM kernels

RBF

Train error: 21.0 Test error: 331.79

Best C: 1.0e+10

Linear

Train error: 343.1 Test error: 355.56

Best C: 1.0e+02

Page 46: Naive application of Machine Learning to Software Development

Fun #5: account for the Component

    components = []
    for number, .. component, ... in tickets_file:
        [...]
        components.append(component)
    [...]
    tickets = sp.hstack((tickets,
        TfidfTransformer().fit_transform(
            CountVectorizer().fit_transform(components))))

Page 47: Naive application of Machine Learning to Software Development

Fun #5: account for the Component

RBF

Train error: 18.9 Test error: 327.79

Best C: 1.0e+10

Linear

Train error: 342.2 Test error: 354.89

Best C: 1.0e+02

Page 48: Naive application of Machine Learning to Software Development

Fun #6: ticket Description

    descriptions = []
    for number, ... description in tickets_file:
        [...]
        descriptions.append(description)
    [...]
    tickets = sp.hstack((tickets,
        TfidfTransformer().fit_transform(
            CountVectorizer(ngram_range=(1, 3)
                            ).fit_transform(descriptions))))

Page 49: Naive application of Machine Learning to Software Development

Fun #6: ticket Description

RBF

Train error: 10.8 Test error: 328.44

Best C: 1.0e+10

Linear

Train error: 14.0 Test error: 331.52

Best C: 3.2e+03

Page 52: Naive application of Machine Learning to Software Development

Conclusions

● All steps of a simple machine learning algorithm

● scikit-learn

● data explicitly available in tickets is NOT ENOUGH to predict closing date

Page 53: Naive application of Machine Learning to Software Development

Developers, what are you hiding?

:)

Page 54: Naive application of Machine Learning to Software Development

Questions?

Source code and dataset available at
https://github.com/42/django-trac-learning.git

Contacts:
● @akhavr
● http://42coffeecups.com/

