Building ML Pipelines
qplum_team qplum.co [email protected]
What do ML Pipelines Look Like?
TRAINING DATA -> AWESOME ML TECHNIQUE -> MODEL
TESTING DATA -> MODEL -> PREDICTIONS
Let’s build one now!
UserID Pet Children Salary
1 cat 4 90
2 dog 6 24
3 dog 3 44
4 fish 3 27
5 cat 2 32
6 dog 3 59
7 cat 5 36
8 fish 4 27
Predict the salary from the kind of pets and the number of children a person has
You may need to:
1. Binarize/normalize data
2. Remove noise
3. Reduce dimensionality of data
4. Make features from raw data
…before you get to train your model!
Cat  Dog  Fish  Children (normalized)  Salary
1    0    0      0.21                  90
0    1    0      1.88                  24
0    1    0     -0.63                  44
0    0    1     -0.63                  27
1    0    0     -1.46                  32
0    1    0     -0.63                  59
1    0    0      1.04                  36
0    0    1      0.21                  27
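The transformed table above can be reproduced with a few lines of scikit-learn. This is a minimal sketch, not necessarily how the slide's numbers were produced: it assumes one-hot encoding for the Pet column and z-score standardization (population standard deviation) for the Children column, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Raw data from the slide.
raw = pd.DataFrame({
    "Pet": ["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"],
    "Children": [4, 6, 3, 3, 2, 3, 5, 4],
    "Salary": [90, 24, 44, 27, 32, 59, 36, 27],
})

# Binarize Pet (columns ordered cat, dog, fish) and standardize Children.
preprocess = ColumnTransformer(
    [
        ("pet", OneHotEncoder(), ["Pet"]),
        ("children", StandardScaler(), ["Children"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(raw[["Pet", "Children"]])  # features
y = raw["Salary"].to_numpy()                            # target
print(X.round(2))
```

Keeping both transformations in one `ColumnTransformer` means the same fitted encoding and scaling can later be applied unchanged to the testing data.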
Training Set (X, Y) -> Neural Net -> Model
But is this enough?
No ML Pipeline is complete without Cross-validation and Hyper-parameter optimization
So how does our ML Pipeline look now?
RAW DATA -> PRE-PROCESSED DATA -> EXTRACT FEATURES -> TRAINING DATA
TRAINING DATA -> AWESOME ML TECHNIQUE with PARAMETERS 1 … K … N -> BEST MODEL
TESTING DATA -> BEST MODEL -> PREDICTIONS
What does ‘best’ model mean?
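The slide leaves this as a question. One common reading, assumed here, is that the best model is the parameter setting with the highest mean cross-validated score. The sketch below spells that out by hand; the synthetic data, the `Ridge` estimator and the candidate values are illustrative choices, not taken from the talk.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

candidate_alphas = [0.01, 1.0, 100.0]  # "PARAMETERS 1 … K … N" in the diagram
mean_scores = {}
for alpha in candidate_alphas:
    # 5-fold cross-validation; each fold is scored with R^2 (Ridge's default).
    fold_scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)
    mean_scores[alpha] = fold_scores.mean()

# "Best" = highest mean score across folds (an assumption, see above).
best_alpha = max(mean_scores, key=mean_scores.get)
best_model = Ridge(alpha=best_alpha).fit(X, y)  # refit on all training data
print(best_alpha, round(mean_scores[best_alpha], 3))
```

A different scoring metric, or a different way of aggregating fold scores, can change which candidate wins.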
ML Pipeline in Code
Series of transformations
Transformations might involve making models
Models can be used to transform or predict
Grid-search on Parameters
>>> from sklearn.decomposition import PCA
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> # Chain a dimensionality-reduction step and a classifier
>>> clf = Pipeline([('reduce_dim', PCA()), ('svm', SVC())])
>>> # Step parameters are addressed as <step name>__<parameter name>
>>> clf.set_params(svm__C=10)
Pipeline(steps=[('reduce_dim', PCA()), ('svm', SVC(C=10))])
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)
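The snippet above only constructs the search. A minimal end-to-end sketch follows, assuming the digits dataset bundled with scikit-learn (an illustrative choice, not the data used in the talk) and 3-fold cross-validation.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small slice of the digits dataset (64 features per sample) to keep the search fast.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

clf = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])
params = dict(reduce_dim__n_components=[2, 5, 10], svm__C=[0.1, 10, 100])

# 3 x 3 parameter combinations, each evaluated with 3-fold cross-validation.
grid_search = GridSearchCV(clf, param_grid=params, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)            # winning combination
print(round(grid_search.best_score_, 3))   # its mean cross-validated accuracy
predictions = grid_search.predict(X[:5])   # uses the best pipeline, refit on all of X
```

Because the whole pipeline sits inside the search, PCA is refit on each training fold, so no information from the held-out fold leaks into the transformation.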
Extra features:
Configurable data sources
Customized scoring metrics (average, median of results, etc.)
Customize cross-validation based on nature of data
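As a rough sketch of customized scoring, the example below plugs a custom per-fold metric into cross-validation and then aggregates the fold results by both average and median. The metric (median absolute error), the estimator and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, median_absolute_error
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only.
X, y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

# Lower error is better, so greater_is_better=False makes the scorer return
# the negated error (scikit-learn always maximizes scores).
scorer = make_scorer(median_absolute_error, greater_is_better=False)

fold_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=6, scoring=scorer)

mean_score = np.mean(fold_scores)      # average of the per-fold results
median_score = np.median(fold_scores)  # median is less sensitive to one bad fold
print(fold_scores.round(2), round(mean_score, 2), round(median_score, 2))
```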
How do you cross-validate on time-series data?
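One common answer, offered here as a sketch rather than as the speakers' own method, is forward chaining: always train on the past and test on the period that follows, so the model never sees the future. scikit-learn provides this as `TimeSeriesSplit`; the toy data below is illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations, already in time order

splitter = TimeSeriesSplit(n_splits=3)
splits = list(splitter.split(X))
for train_idx, test_idx in splits:
    # Every training index precedes every test index.
    print("train:", train_idx, "test:", test_idx)
```

The splitter can be passed as the `cv` argument of `GridSearchCV` or `cross_val_score` in place of the default shuffled-agnostic k-fold scheme.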
Why use ML pipelines?
DRY
Libraries with ML Pipelines
scikit-learn, pandas and Scikit-Mapper
Spark MLlib
Write your own!
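To give a feel for the "write your own" option, here is a deliberately tiny pipeline built with NumPy only. All class names (`Standardize`, `LeastSquares`, `SimplePipeline`) are hypothetical, and the design simply mirrors the fit/transform/predict convention; it is a sketch, not a replacement for the libraries above.

```python
import numpy as np


class Standardize:
    """Transformation step: zero mean and unit variance per column."""

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        self.std_[self.std_ == 0] = 1.0  # avoid dividing by zero on constant columns
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_


class LeastSquares:
    """Final step: ordinary least squares with an intercept."""

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([X, np.ones(len(X))])
        return A @ self.coef_


class SimplePipeline:
    """Run every transformation in order, then fit or predict with the last step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            X = step.transform(X)  # reuse statistics learned during fit
        return self.steps[-1].predict(X)


# Toy data with an exactly linear target, for illustration only.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 35.0], [4.0, 41.0]])
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 1.0

model = SimplePipeline([Standardize(), LeastSquares()]).fit(X, y)
predictions = model.predict(X)
```

Defining the steps once and reusing them for training and prediction is the DRY benefit mentioned earlier.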
References
1. https://github.com/paulgb/sklearn-pandas
2. http://www.slideshare.net/jeykottalam/pipelines-ampcamp
3. http://scikit-learn.org/stable/modules/pipeline.html
EXTRA
UserID Pet Children Salary
1 cat 4 90
2 dog 6 24
3 dog 3 44
4 fish 3 27
5 cat 2 32
6 dog 3 59
7 cat 5 36
8 fish 4 27
Need to binarize the Pet column. Might also want to normalize the Children column.
Is Pet a Cat?  Is Pet a Dog?  Is Pet a Fish?  Normalized number of children
1              0              0                0.21
0              1              0                1.88
0              1              0               -0.63
0              0              1               -0.63
1              0              0               -1.46
0              1              0               -0.63
1              0              0                1.04
0              0              1                0.21