introduction to machine learning with python and scikit-learn
DESCRIPTION
PyATL talk about machine learning. Provides both an intro to machine learning and how to do it with Python. Includes simple examples with code and results.
TRANSCRIPT
Introduction to Machine Learning with Python and scikit-learn
Python Atlanta
Nov. 14th 2013
Matt [email protected]
Slide #2 Intro to Machine Learning with Python [email protected]
Machine Learning (ML):
• Finding patterns in data
• Modeling patterns
• Use models to make predictions
ML can be easy*
• You already have ML applications!
• You can start applying ML methods now with Python & scikit-learn
• Theoretical knowledge of ML not needed (initially)*
*Gaining more background, theory, and experience will help
Simple Example
Simple Model
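import numpy as np
from sklearn.linear_model import LinearRegression

# Assumes data.npz stores its arrays under the keys 'x' and 'y'
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

# scikit-learn expects 2-D inputs, so add a feature axis
model = LinearRegression()
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])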
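The file data.npz is not distributed with these slides, so here is a minimal sketch of the same fit/predict steps on synthetic data. The slope, intercept, noise level, and random seed are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for data.npz (assumed: a noisy straight line)
rng = np.random.RandomState(0)
x = rng.uniform(0, 200, size=100)
y = 3.0 * x + 10.0 + rng.normal(scale=5.0, size=100)

# scikit-learn expects a 2-D input of shape (n_samples, n_features)
model = LinearRegression()
model.fit(x[:, np.newaxis], y)

# Predict on an evenly spaced grid, as on the slide
x_test = np.linspace(0, 200)
y_test = model.predict(x_test[:, np.newaxis])

# The fitted slope and intercept should land near the assumed 3 and 10
print(model.coef_[0], model.intercept_)
```

Because the data were generated from a known line, the fitted coefficients give a quick sanity check that the model recovered the pattern.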
Variance/Bias Trade Off
• Need models that can adapt to relationships in our data
• Highly adaptable models can over-fit and will not generalize
• Regularization – Common strategy to address variance/bias trade off
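The trade-off in the bullets above can be sketched by fitting the same flexible model with and without regularization. This is not code from the talk: the toy sine data, the degree-9 polynomial, the Ridge penalty, and alpha=0.01 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed toy data: a noisy sine wave sampled at only 15 points
rng = np.random.RandomState(0)
x_train = rng.uniform(0, 1, size=15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=15)

# Noise-free points from the same curve, used to check generalization
x_new = np.linspace(0, 1, 200)
y_new = np.sin(2 * np.pi * x_new)

def polynomial_model(regressor, degree=9):
    """Polynomial features (degree 9 by default) feeding a linear regressor."""
    return make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        regressor,
    )

models = {
    'no regularization': polynomial_model(LinearRegression()),
    'ridge (alpha=0.01)': polynomial_model(Ridge(alpha=0.01)),
}

errors = {}
for name, model in models.items():
    model.fit(x_train[:, np.newaxis], y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train[:, np.newaxis]))
    new_mse = mean_squared_error(y_new, model.predict(x_new[:, np.newaxis]))
    errors[name] = (train_mse, new_mse)
    print(f"{name}: train MSE={train_mse:.4f}, new-data MSE={new_mse:.4f}")
```

The unregularized fit cannot have a higher training error than the ridge fit, since it minimizes exactly that quantity. Whether it does worse on new data depends on the sample, but with this few points a highly adaptable model typically does.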
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumes data.npz stores its arrays under the keys 'x' and 'y'
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

model = Pipeline([
    ('standardize', StandardScaler()),
    # C is the regularization term highlighted on the slide
    ('svr', SVR(kernel='rbf', verbose=0, C=5e6, epsilon=20)),
])
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])
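To see what the regularization term does, the sketch below fits the same pipeline with a small and a large C on synthetic data. The sine-wave data, the two C values, and epsilon=0.1 are assumptions for illustration, not the values behind the talk's plots. In scikit-learn's SVR a smaller C means stronger regularization.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumed toy data: a noisy sine wave
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 10, size=80))
y = np.sin(x) + rng.normal(scale=0.1, size=80)

def fit_svr(C):
    """Fit the standardize + RBF SVR pipeline for a given C."""
    model = Pipeline([
        ('standardize', StandardScaler()),
        ('svr', SVR(kernel='rbf', C=C, epsilon=0.1)),
    ])
    model.fit(x[:, np.newaxis], y)
    return model

# Small C: heavily regularized, nearly flat fit. Large C: follows the data.
train_mse = {}
for C in (0.01, 100.0):
    model = fit_svr(C)
    train_mse[C] = mean_squared_error(y, model.predict(x[:, np.newaxis]))
    print(f"C={C}: train MSE={train_mse[C]:.4f}")
```

Choosing C is the variance/bias trade-off in practice: too small and the model cannot adapt to the data, too large and it starts chasing noise.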
Supervised Learning
Modeling the relationship between inputs and outputs
[Table: one row per sample, with an input column X and an output column Y]
Multiple Inputs
[Table: one row per sample, with input columns X1, X2, X3, …, Xn and an output column Y]
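The layout on this slide (one row per sample, one column per input) is the array shape scikit-learn estimators expect. A minimal sketch, assuming nine samples, three inputs, and a made-up noise-free linear relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X has one row per sample and one column per input: (n_samples, n_features)
rng = np.random.RandomState(0)
n_samples, n_features = 9, 3
X = rng.uniform(0, 10, size=(n_samples, n_features))

# Assumed relationship between inputs and output, chosen for illustration
true_weights = np.array([2.0, -1.0, 0.5])
y = X @ true_weights + 4.0

model = LinearRegression().fit(X, y)

print(X.shape, y.shape)               # (9, 3) (9,)
print(model.coef_, model.intercept_)  # close to [2, -1, 0.5] and 4
```

With a single input, the same shape rule is why the earlier examples add a feature axis with np.newaxis.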
Example: Image Classification
• Classify handwritten digits with ML models
• Each input is an entire image
• Output is digit in the image
import numpy as np
from sklearn.ensemble import RandomForestClassifier

with np.load('train.npz') as data:
    pixels_train = data['pixels']
    labels_train = data['labels']
with np.load('test.npz') as data:
    pixels_test = data['pixels']

# Flatten each image into a 1-D feature vector
X_train = pixels_train.reshape(pixels_train.shape[0], -1)
X_test = pixels_test.reshape(pixels_test.shape[0], -1)

model = RandomForestClassifier(n_estimators=50)
model.fit(X_train, labels_train)
labels_test = model.predict(X_test)
Trains on 50,000 images in roughly 20 seconds.
96% accurate!
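The train.npz and test.npz files are not available here, so the sketch below substitutes the small 8×8 digits dataset bundled with scikit-learn. The dataset swap, the 75/25 split, and the random seeds are assumptions. This data is far smaller than the 50,000 images above, so its timing and accuracy are not comparable to the talk's figures.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bundled stand-in dataset: 1797 grayscale images of 8x8 pixels
digits = load_digits()
pixels, labels = digits.images, digits.target

# Flatten each image into a 64-element feature vector
X = pixels.reshape(pixels.shape[0], -1)

# Hold out a quarter of the images to measure accuracy on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Scoring on held-out images rather than the training images is what makes the accuracy figure a fair estimate of how the model generalizes.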
Kaggle Data Science Competition
• Given 6 million training questions labeled with tags
• Predict the tags for 2 million unlabeled test questions
www.users.globalnet.co.uk/~slocks/instructions.html
stackoverflow.com/questions/895371/bubble-sort-homework
Predicting the tags of Stack Overflow questions with machine learning
Text Classification Overview
Raw Posts → Feature Extraction & Selection → Vector Space → Model Selection & Training → Machine Learning Model
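The stages in this overview map naturally onto a scikit-learn Pipeline. The sketch below is an assumption-laden toy, not the competition entry: the four titles and their tags are invented, each post gets a single tag (the real task allows several), and CountVectorizer plus LogisticRegression are stand-ins for whatever feature extraction and model the talk used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical raw posts and tags, invented for illustration
titles = [
    "Need help with Java homework",
    "Java null pointer exception in loop",
    "Why is my Python list sorted slowly",
    "Python dictionary lookup by key",
]
tags = ["java", "java", "python", "python"]

model = Pipeline([
    # Raw posts -> vector space (term counts, English stop words dropped)
    ('features', CountVectorizer(stop_words='english')),
    # Vector space -> machine learning model
    ('classifier', LogisticRegression()),
])
model.fit(titles, tags)

predicted = model.predict(["Help with a Java loop"])
print(predicted)
```

Keeping both stages in one Pipeline means new posts pass through exactly the same feature extraction as the training posts.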
Term Frequency Feature Extraction
Characterize text by the frequency of specific words in each text entry
Example Title:
“Why is processing a sorted array faster than processing an array that is not sorted?”
Term Frequencies
why         1
processing  2
sorted      2
array       2
faster      1
Ignore common words (i.e., stop words)
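The counting step can be sketched in a few lines of plain Python. The stop-word set here is a hypothetical, deliberately tiny list covering only this one title. A real stop-word list is much longer and may drop other words as well.

```python
import re
from collections import Counter

title = ("Why is processing a sorted array faster than "
         "processing an array that is not sorted?")

# Hypothetical stop-word list, just large enough for this title
stop_words = {"is", "a", "an", "than", "that", "not"}

# Lowercase, split into words, and drop the stop words before counting
tokens = re.findall(r"[a-z]+", title.lower())
term_frequencies = Counter(t for t in tokens if t not in stop_words)

print(dict(term_frequencies))
# {'why': 1, 'processing': 2, 'sorted': 2, 'array': 2, 'faster': 1}
```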
Frequency of key terms is anticipated to be correlated with the tags of the question
         why  processing  sorted  array  faster  need  help  java  homework
Title 1    1           2       2      2       1     0     0     0         0
Title 2    0           0       0      0       0     1     1     1         1
Title 3    0           0       1      1       0     0     1     0         1
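A term-frequency matrix like this one can be built with scikit-learn's CountVectorizer. In the sketch below the vocabulary is fixed to the nine terms on the slide so the columns come out in the same order. Only the text of Title 1 appears in the talk, so the texts for Title 2 and Title 3 are hypothetical sentences written to reproduce their rows.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fixed vocabulary, in the column order used on the slide
vocabulary = ["why", "processing", "sorted", "array", "faster",
              "need", "help", "java", "homework"]

titles = [
    "Why is processing a sorted array faster than "
    "processing an array that is not sorted?",
    "Need help with Java homework",     # hypothetical text for Title 2
    "Help with sorted array homework",  # hypothetical text for Title 3
]

# Words outside the fixed vocabulary are ignored, like stop words
vectorizer = CountVectorizer(vocabulary=vocabulary)
term_matrix = vectorizer.fit_transform(titles).toarray()

print(term_matrix)
# [[1 2 2 2 1 0 0 0 0]
#  [0 0 0 0 0 1 1 1 1]
#  [0 0 1 1 0 0 1 0 1]]
```

Each row is the vector-space representation of one title, ready to be passed to a model's fit or predict method.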
Example Model Coefficients
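The figure for this slide did not survive in the transcript. As a rough illustration of what model coefficients look like for text, the sketch below fits a linear classifier for a single "java" tag and lists the weight learned for each term. The titles, labels, and the choice of LogisticRegression are assumptions, not the talk's model or data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical titles, labeled 1 if tagged "java" and 0 otherwise
titles = [
    "Need help with Java homework",
    "Java loop never terminates",
    "Sorted array is faster to process",
    "Sorting an array in place",
]
is_java = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
model = LogisticRegression().fit(X, is_java)

# vocabulary_ maps each term to its column, so it also indexes coef_
coefficients = {term: model.coef_[0][col]
                for term, col in vectorizer.vocabulary_.items()}

# Negative weights argue against the tag, positive weights for it
for term, weight in sorted(coefficients.items(), key=lambda item: item[1]):
    print(f"{term:>10s} {weight:+.3f}")
```

Terms seen only in "java" titles end up with positive weights and terms seen only in the other titles with negative ones, which is the kind of correlation between term frequency and tags the previous slides anticipate.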
ML can be easy*
• You already have ML problems!
• You can start applying ML methods now with Python & scikit-learn
• Theoretical knowledge of ML not needed (initially)*
scikit-learn.org
github.com/scikit-learn
Check out: liveramp.com/careers
Helping companies use their marketing data to delight customers
Opportunities
• Backend Engineers
• Data Scientists
• Full-Stack Engineers
Tools
• Java
• Hadoop (Map/Reduce)
• Ruby
Build and work with large distributed systems that process massive data sets.