
Page 1: Machine Learning 101

Machine Learning 101

Talha Obaid

Page 2: Machine Learning 101

About me

• Email Security @ Symantec
• Doing Data Science to fight Spam and Malware
• Organizer for Python Data Science Group Singapore
• Monthly regular meet-ups, for over a year
• http://meetup.com/pydata-sg – >1.8K members
• https://www.facebook.com/groups/pydatasg/ – >1K members
• https://twitter.com/pydatasg
• https://engineers.sg/organizations/118 – talks recorded and uploaded
• Previously with CENSAM @ MIT
• Co-founded startup(s)
• NUS Alumni

• Some questions
• How many of you have heard about Machine Learning, or ML?
• How many of you know how to do ML?
• How many of you earn a living doing ML?

• What this talk offers
• Getting a foot in the door
• Grossly oversimplifying things
• How to learn ML from literature
• Relate to ML terms when they are thrown at you
• Types of ML
• Learning ML models and their coding (scikit-learn, and why?)
• Linear Regression
• Logistic Regression
• Clustering
• Lessons from practical ML

@ObaidTal

Page 3: Machine Learning 101

Some terminology

• Data Science
• Data Analytics
• Business Analytics
• Artificial Intelligence
• Machine Learning

Ref. Tuan Q. Phan

Page 4: Machine Learning 101

What is Data?

• Available data (internal)
• Health records
• Organization
• University
• …

• Available data (external)
• www.data.gov.sg
• Publicly available corpuses

• Quality of data
• Trustworthy or not

• Missing data
• A huge challenge in the scientific community

• Other jargon
• Tiny Data: data from sensors
• Big Data: data on a massive scale
• Fast Data: hash-based lookup

@ObaidTal

Page 5: Machine Learning 101

Machine Learning, defined

• "A field of study that gives computers the ability to learn without being explicitly programmed"
– Arthur Samuel (1959)
• Samuel wrote a program to play checkers
• Eventually his program learned to play better

@ObaidTal

Ref: http://infolab.stanford.edu/pub/voy/museum/samuel.html

Page 6: Machine Learning 101

When did we all start with Machine Learning?

• Take a look at the following (outputs) and guess the '?':
• 1, 2, 3, 4, 5, 6, ?, …, ?
• 2, 4, 6, 8, 10, 12, ?, …, ?
• 3, 6, 9, 12, ?, …, ?
• 1, 3, 9, 27, ?, …, ?
• 4, 7, 10, 13, ?, …, ?

• So how can I represent the above?
• Input -> [box] -> Output
• X -> [box] -> Y
• Call this box f()
• Output = f(Input) … in maths
• Y = f(X)

• Answers (assuming input is 1, 2, 3, 4, 5, 6, …)
• Y = X
• Y = 2*X
• Y = 3*X + 0
• Y = 3^(X-1)
• Y = 3*X + 1

In school… Really, how? How to find '…, ?' – A: an equation (single variable)

@ObaidTal

Page 7: Machine Learning 101

Linear Regression – statistical term

Y = mx + b … from the last example, b = ? and m = ?  A: b = 1, m = 3

[Figure: scatter of '+' shaped points with a fitted line; x-axis: Input, y-axis: Output. Suppose this line is Y = 3x + 1.]

Assume that this line is 'surrounded' by '+' shaped points, which we had, i.e. outputs 4, 7, 10, 13, ?, …, ? (Y) having inputs 1, 2, 3, 4, 5, 6, … (x).

The line Y = 3x + 1 kind of 'fits' these points, so as to find out '…, ?'. A small sketch of recovering m and b from the points follows below.
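To make this concrete, here is a minimal sketch, not from the slides, that recovers m and b from the example points with NumPy's least-squares polyfit:

import numpy as np

# Example from the slides: inputs 1..6, outputs following Y = 3x + 1
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([4, 7, 10, 13, 16, 19])

# Fit a degree-1 polynomial (a line) by least squares
m, b = np.polyfit(x, y, deg=1)
print(m, b)  # -> 3.0, 1.0

# Use the learned line to predict the '…, ?' values
print(m * np.array([7, 8]) + b)  # -> [22. 25.]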

Page 8: Machine Learning 101

Where are we headed…

x -> [?] -> Y … since x and Y are already known, what we actually solve for is m and b in Y = mx + b.

So far there is only a single input variable 'x'; however, there can be more than one. Let's sum up all the previous equations:

Y = x + 2x + 3x + (3x + 1) + 3^(x-1)

Now let's assume the 'x' in each term is a different variable:

Y = x1 + 2*x2 + 3*x3 + (3*x4 + 1) + 3^(x5-1)

How to fit a line between the '+' shaped points? A: the distance formula – making sure each '+' is closest to the line, or vice versa.

x -> [mx + b] -> Y

A multi-feature version of this fit is sketched below. Next, let's move to different types of Machine Learning…

https://www.ltcconline.net/greenl/courses/154/factor/circle.htm
http://machinelearningmastery.com/basic-concepts-in-machine-learning/

Page 9: Machine Learning 101

Supervised Learning

• Providing the output and a dataset (input), to come up with the answer, i.e. a model.
• In the literature, "The Boston housing prices" example is a "regression problem", i.e. predicting a continuous-valued variable as the outcome.
• A "classification problem" is when the variable we are trying to predict is discrete, e.g. the spam problem, where the output is either 0 or 1.
• The features, or input dataset variables, can always be more than one, i.e. a graph with multiple dimensions.
• Code the model with what the right answer (i.e. Y) is, train it with a number of input sets (i.e. x), and ask the algorithm, or model, to replicate the same.

Types of Machine Learning (1)

Ref. Andrew Y. Ng.

Page 10: Machine Learning 101

Unsupervised Learning

• Data is given, and structure must be inferred
• Clustering is one example of it
• Deep Learning is also considered here
• Examples include finding clusters in:
– Gene data
– Image processing, grouping pixels together
– Social network analysis
– Lots of people talking, extracting the voice of a single person, considering the voices of others as noise – the cocktail party problem
– Text processing
• Independent Component Analysis (ICA) algorithm

Types of Machine Learning (2)

Ref. Andrew Y. Ng.

Page 11: Machine Learning 101

Reinforcement Learning

• A sequence of decisions is made over time
• Example: flying an autonomous helicopter

• Reward function
• Specify what you want to get done
• Specify good behavior and bad behavior in the reward function
• The learning algorithm will decide how to maximize good behavior and minimize bad behavior
• (A toy reward-maximizing sketch follows below.)

Types of Machine Learning (3)

Ref. Andrew Y. Ng.
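To ground the reward-function idea, here is a minimal, hypothetical sketch, not from the talk, of an epsilon-greedy learner maximizing reward on a two-action problem; the reward values are made up:

import random

# Hypothetical rewards: action 1 ("good behavior") pays more on average
def reward(action):
    return random.gauss(1.0 if action == 1 else 0.2, 0.1)

estimates, counts = [0.0, 0.0], [0, 0]
for step in range(1000):
    # Explore occasionally; otherwise pick the action with the best estimate
    a = random.randrange(2) if random.random() < 0.1 else max((0, 1), key=lambda i: estimates[i])
    r = reward(a)
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]  # running average of reward

print(estimates)  # the learner settles on action 1, the higher-reward behavior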

Page 12: Machine Learning 101

Getting ready – some more terms…

• The dataset/input is also called the training set, or observations
• The predictor is called a hypothesis for historical reasons; it is also called a classifier, estimator, or predictor
• Boston housing price problem (we'll see more of it)
• We will train/learn and then predict price
• Features, or input variables, sit on the right side of Y = mx + b, i.e. x
• Price, i.e. Y, is the output or target variable of Y = mx + b
• The linear equation Y = mx + b can be written as a predictor, where m is the slope and b is the intercept
• Cost function – which Y = mx + b is better (we will see more of it)

@ObaidTal

Let's get coding…

• To remember
• Will expand on

Page 13: Machine Learning 101

Popular Machine Learning Toolkits – Introduction

Project       Language       Highlight
R             R              A language for statistical analysis and ML
Octave        Octave         A language to simulate Matlab for numerical computations
Scikit-learn  Python         Documentation, examples, tutorials available; general purpose with a simple API
TensorFlow    Py bindings    A library for numerical computation using data flow graphs
Orange        Python         General-purpose ML package
PyBrain       Python         Neural networks, unsupervised learning
MLlib         Python/Scala   Apache's new library, based within Spark
Mahout        Java           Apache's framework based on Hadoop
Weka          Java           General-purpose ML package
GoLearn       Go             Machine Learning by Go
Shogun        C++            User interfaces to various languages

Page 14: Machine Learning 101

Machine Learning Kit – which to choose

• Factors to consider
• Language
• Performance (run speed)
• Scalability

• We choose scikit-learn
• Language: Python
• Performance (run speed): good enough
• Scalability: not critical, and we can switch to MLlib in Spark for massive data
• Well documented, enough algorithms, clean API, robust, fast implementation, easy usage

Ref. T. Obaid & H. Zhang

Scikit-learn – Machine Learning in Python

• Simple and efficient tools for data mining and data analysis
• Accessible to everybody, and reusable in various contexts
• Built on NumPy, SciPy, and matplotlib
• Open source, commercially usable – BSD license

Page 15: Machine Learning 101

Scikit-learn – Examples

• A lot of sample code is in the source folder: scikit-learn-0.16.1/examples
• Boston housing prices (we will work with this example dataset)
• Will try features one by one (we test only 3 of them in this session; please try more)
• Excerpt of the data (how our data actually looks):

1.CRIM   2.ZN  3.INDUS  4.CHAS  5.NOX  6.RM   7.AGE  8.DIS   9.RAD  10.TAX  11.PTRATIO  12.B    13.LSTAT  14.MEDV
0.00632  18    2.31     0       0.538  6.575  65.2   4.09    1      296     15.3        396.9   4.98      24
0.02731  0     7.07     0       0.469  6.421  78.9   4.9671  2      242     17.8        396.9   9.14      21.6
0.02729  0     7.07     0       0.469  7.185  61.1   4.9671  2      242     17.8        392.83  4.03      34.7
0.03237  0     2.18     0       0.458  6.998  45.8   6.0622  3      222     18.7        394.63  2.94      33.4
0.06905  0     2.18     0       0.458  7.147  54.2   6.0622  3      222     18.7        396.9   5.33      36.2

Details about each feature of this data are coming next…

• To remember
• Please explore…

Page 16: Machine Learning 101

Features of Boston housing prices

1. CRIM     per capita crime rate by town
2. ZN       proportion of residential land zoned for lots over 25,000 sq. ft.
3. INDUS    proportion of non-retail business acres per town
4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX      nitric oxides concentration (parts per 10 million)
6. RM       average number of rooms per dwelling
7. AGE      proportion of owner-occupied units built prior to 1940
8. DIS      weighted distances to five Boston employment centers
9. RAD      index of accessibility to radial highways
10. TAX     full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B       1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT   % lower status of the population
14. MEDV    median value of owner-occupied homes in $1000s

Features and their details. Which of these features are significant:
• All of them?
• A few of them?
• Another one, not among them?
Let's observe these…

Page 17: Machine Learning 101

Scikit-learn – Demo code for Boston house prices. Try it!

import matplotlib.pyplot as plt  # for plotting
import numpy as np  # for matrix/array operations
from sklearn import datasets, linear_model  # classifier

boston = datasets.load_boston()
boston_X = boston.data[:, np.newaxis]
boston_X_temp = boston_X[:, :, 12]  # indexes: for LSTAT it's 12, for PTRATIO it's 10, for RM it's 5 - trying each one by one
boston_X_train = boston_X_temp[:]
boston_y_train = boston.target[:]

regr = linear_model.LinearRegression()  # estimator
regr.fit(boston_X_train, boston_y_train)  # train parameters

fig, ax = plt.subplots()
ax.scatter(boston_X_train, boston_y_train, color='black')  # we can predict boston_X_test
ax.plot(boston_X_train, regr.predict(boston_X_train), color='green', linewidth=3)  # to predict
ax.set_xlabel(boston.feature_names[12])  # indexes: for LSTAT it's 12, for PTRATIO it's 10, for RM it's 5
ax.set_ylabel('Predicted')

plt.show()

Ref. T. Obaid & H. Zhang

• Important…
• Good feature?
• Not so good feature?
• Comments

Page 18: Machine Learning 101

Scikit-learn – Demo result for Boston house prices

• Parameters: (coefficient, -0.95692593), (intercept, 34.7411998746244)
• Feature: % lower status of the population
• y = -0.95692593 * LSTAT + 34.7411998746244
• Looks good!

1st try, with LSTAT: % lower status of the population

Page 19: Machine Learning 101

Demo result, contd.

• Parameters: (coefficient, -2.1571753), (intercept, 62.3446274748)
• Feature: pupil-teacher ratio by town
• y = -2.1571753 * PTRATIO + 62.3446274748
• Doesn't look good!

2nd try, with PTRATIO: pupil-teacher ratio by town

Page 20: Machine Learning 101

Demo result, contd.

• Parameters: (coefficient, 9.126359), (intercept, -34.7856369115583)
• Feature: average number of rooms per dwelling
• y = 9.126359 * RM - 34.7856369115583
• Looks good!

3rd try, with RM: average number of rooms per dwelling

Page 21: Machine Learning 101

Cost function – the lower the cost, the better the model

Predicted = -0.95692593 * LSTAT + 34.7411998746244

Real   LSTAT  Predicted    Difference    Square
...    ...    ...          ...           ...
18.3   14.1   21.24854426  2.948544262   8.693913263
21.2   12.92  22.37771686  1.177716859   1.387017
17.5   15.1   20.29161833  2.791618332   7.793132909
16.8   14.33  21.0284513   4.228451298   17.87980038
22.4   9.67   25.48772613  3.087726132   9.534052663
20.6   9.08   26.05231243  5.45231243    29.72771084
23.9   5.64   29.34413763  5.444137629   29.63863453
22     6.48   28.54031985  6.540319848   42.77578372
11.9   7.88   27.20062355  15.30062355   234.1090809
                           Total:        19478.69458
                           Total/2:      9739.347291

Predicted = 9.126359 * RM - 34.7856369115583

Real   RM     Predicted    Difference    Square
...    ...    ...          ...           ...
18.3   5.794  18.09248713  -0.207512866  0.043061589
21.2   6.019  20.14591791  -1.054082091  1.111089054
17.5   5.569  16.03905636  -1.460943641  2.134356321
16.8   6.027  20.21892878  3.418928781   11.68907401
22.4   6.593  25.38444798  2.984447975   8.906929718
20.6   6.12   21.06768017  0.467680168   0.21872474
23.9   6.976  28.87984347  4.979843472   24.79884101
22     6.794  27.21884613  5.218846134   27.23635497
11.9   6.03   20.24630786  8.346307858   69.66085487
                           Total:        22062.73306
                           Total/2:      11031.36653

Least-squares cost function, over n samples:

J(m, b) = (1/2) * sum over i = 1..n of (predicted_i - real_i)^2

Comment: here the summation is nothing but a for loop, as: for (i = 1; i <= n; i++). A short computation sketch follows below.

How well are we doing – compare the good ones:
• 1 good feature VS
• another good feature

Ref. Andrew Y. Ng.
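As a minimal sketch, the cost in the tables above can be computed like this; the three sample rows are taken from the LSTAT table:

import numpy as np

def least_squares_cost(x, y_real, slope, intercept):
    # Half the sum of squared differences, as in the Total/2 rows above
    predicted = slope * x + intercept
    return 0.5 * np.sum((predicted - y_real) ** 2)

x = np.array([14.1, 12.92, 15.1])   # LSTAT values
y = np.array([18.3, 21.2, 17.5])    # real prices
print(least_squares_cost(x, y, -0.95692593, 34.7411998746244))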

Page 22: Machine Learning 101

Over-fitting and Under-fitting

The good model is… the "Just right!" model – why?

• Under-fitting – high bias: the model does not match the data, and the cost is too high
• Just right is what we need
• Over-fitting – high variance: happens mostly when too many features are used or the model is too complex
• The model should learn, not memorize
• (A sketch contrasting the three cases follows below.)

http://i.imgur.com/W0qejU0.png
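Here is a hedged sketch, not from the slides, of the same idea with NumPy: fitting polynomials of increasing degree to a handful of noisy points; training error keeps falling, but a degree that is too high merely memorizes the noise:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 3 * x + 1 + rng.normal(scale=0.3, size=10)  # noisy line

for degree in (0, 1, 9):  # under-fit, just right, over-fit
    coeffs = np.polyfit(x, y, degree)
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(train_error, 4))
# Degree 9 drives training error to ~0 by memorizing the noise,
# but it would predict poorly on new points (high variance).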

Page 23: Machine Learning 101

Scikit-learn – Usage

The model program skeleton would look something like…

from sklearn import linear_model

X = [[...]]  # source data, shape (n_samples, n_features)
Y = [...]    # target values, shape (n_samples,)

clf = linear_model.LinearRegression()  # estimator, or classifier

clf = clf.fit(X, Y)  # learn parameters from existing data

Test = [[...]]  # same shape as X

clf.predict(Test)  # predict the target for the data in Test

Ref. T. Obaid & H. Zhang

• Important
1. Model
2. Fit
3. Predict

• Comments

Page 24: Machine Learning 101

Observations from the code

• There is always a fit function call, i.e. learning/training on X to give Y.
• Likewise there is a predict function call: given X only, it pops out Y.
• The pandas library (pd) can alternatively be used for a relatively simpler display of the data.
• The train_test_split function call serves an important purpose, as it shuffles the dataset so we don't have selection bias: if, for instance, the data is ordered by price ascending and halved into training and testing, the training data would contain only the cheaper houses. A usage sketch follows below.

• To remember
• Subtleties
• Probable issue
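A minimal sketch of that call, assuming the Boston data loaded as in the earlier demo (note that train_test_split lived in sklearn.cross_validation in older releases, and load_boston has been removed from recent scikit-learn versions):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

boston = datasets.load_boston()
# Shuffles by default, so an ordered dataset doesn't bias the split
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.5, random_state=0)

regr = LinearRegression().fit(X_train, y_train)
print(regr.score(X_test, y_test))  # R^2 on the held-out half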

Page 25: Machine Learning 101

Scikit-learn – Test Data

• Scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification, and the Boston house prices dataset for regression.
• boston (Boston house prices), iris (iris flowers), mlcomp (20 newsgroups), svmlight_file/s, diabetes, lfw_pairs (labeled faces), sample_image/s (china and flower), digits (0-9 handwriting), lfw_people (labeled people), linnerud (for multivariate regression)
• scipy.misc.lena()

• Load test data… Try the others!

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

A subset of the learning datasets – we just saw Boston housing prices

• … Seen so far
• Ahead…
• Please explore…

Page 26: Machine Learning 101

Scikit-learn – Main Algorithms

• Supervised learning (most have both a classifier and a regressor)
• Linear models: LinearRegression, Lasso, Ridge, LogisticRegression, SGD
• SVM: LinearSVC, SVC, SVR
• Naïve Bayes: GaussianNB, MultinomialNB, BernoulliNB
• Decision trees: DecisionTree (an optimized version of CART)
• Ensemble methods: RandomForest, AdaBoost, GradientBoosting (GBDT)

• Unsupervised learning
• Clustering: KMeans (k-means++, mini-batch), DBSCAN
• Manifold learning (dimension reduction): MDS, Isomap, LocallyLinearEmbedding

• Whole list of algorithms: http://scikit-learn.org/stable/modules/classes.html

A subset of the supported algorithms – we just saw LinearRegression

• … Seen so far
• Ahead…

Page 27: Machine Learning 101

Logistic (Classification) Regression

• Regression is when our labels y can take any real (continuous) value. Examples include:
• Predicting the stock market.
• Predicting sales.
• Detecting the age of a person from a picture.

• Classification is when our labels y can only take a finite set of values (categories). Examples include:
• Handwritten digit recognition: x is an image with a handwritten digit, y is a digit between 0 and 9.
• Spam filtering: x is an e-mail, and y is 0 or 1 depending on whether that e-mail is spam or not.

Linear (Regression) vs Logistic (Classification) – a small sketch contrasting the two follows below.
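A hedged side-by-side sketch, not from the slides, of the two output types, using scikit-learn's toy digits data (predicting digit labels with linear regression is deliberately naive, just to show the shape of the outputs):

from sklearn.datasets import load_digits
from sklearn.linear_model import LinearRegression, LogisticRegression

digits = load_digits()
X, y = digits.data, digits.target

# Regression: predictions are continuous values
print(LinearRegression().fit(X, y).predict(X[:3]))  # floats, not category labels

# Classification: predictions come from a finite set of labels (0-9)
print(LogisticRegression(max_iter=5000).fit(X, y).predict(X[:3]))  # e.g. [0 1 2]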

Page 28: Machine Learning 101

Linear (Regression) vs Logistic (Classification)
Classification (finite output values) vs Regression (continuous output values)

Page 29: Machine Learning 101

Logistic Regression – with the IRIS example

• Categorical output instead of continuous output
• Will use the IRIS dataset – to classify 3 species of plants
• Number of instances: 150 (50 in each of three classes)
• Number of attributes: 4 numeric, predictive attributes, and the class
• Attribute/feature information:
• sepal length in cm (will use this)
• sepal width in cm (will use this)
• petal length in cm
• petal width in cm

• Classes, i.e. target:
• Iris-Setosa
• Iris-Versicolour
• Iris-Virginica

IRIS is a database of flower classes… it bears a little bit of botany:

Setosa, Versicolour, Virginica
• The petal is the colored part of the flower
• The sepal is the green leaf below the petal

Page 30: Machine Learning 101

Let’sgocode…Tryit!

import pandas as pdimport numpy as npimport matplotlib.pyplot as pltfrom sklearn.datasets import load_irisfrom sklearn.linear_model import LogisticRegressioniris = load_iris()print "--- Keys ---\n", iris.keys()print "--- Shape ---\n", iris.data.shapeprint "--- Feature Names ---\n", iris.feature_namesprint "--- Description ---\n", iris.DESCRprint "--- Target --- \n", iris.targetiri = pd.DataFrame(iris.data)print "--- Panda Head ---\n", iri.head()

iri.columns = iris.feature_namesprint "--- Panda Columns ---\n", iri.head()logreg = LogisticRegression(C=1e5)X = iris.data[:, :2] # we only take the first two features.Y = iris.targetprint "--- X ---\n", Xprint "--- y ---\n", Y# we create an instance of Neighbors Classifier and fit the data.logreg.fit(X, Y) # again, the infamous fit method

Part1 Part2

• Preparation• Important• Debug

Page 31: Machine Learning 101

A little bit more… Try it!

Part 3:

# Plotting
h = .02  # step size in the mesh
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Prediction
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

Part 4:

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()

• Plotting
• Important
• Debug

Page 32: Machine Learning 101

Classification – output. Two features, thus plotted in a 2D plane.

Page 33: Machine Learning 101

Clustering

• Unsupervised learning
• Output unknown
• Grouping observations

K-Means

• One of the most popular "clustering" algorithms.
• Stores k centroids that it uses to define clusters.
• A point belongs to a cluster if it is closer to that cluster's centroid than to any other centroid.
• Finds the best centroids by alternating between:
• assigning data points to clusters based on the current centroids
• choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.

Primitive clustering, e.g.: take the input data sorted, 34, 43, 49, 58, 70, 81, 89, 101, 116, 121, 131, 145, and group neighbors whose gaps stay within a threshold (<= 11, <= 12, <= 15); different thresholds yield coarser or finer clusters. A sketch of this gap-threshold idea follows below.
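A minimal sketch of that primitive gap-threshold clustering, assuming the sorted data from the slide:

def gap_cluster(sorted_data, threshold):
    # Start a new cluster whenever the gap to the previous point exceeds the threshold
    clusters = [[sorted_data[0]]]
    for prev, cur in zip(sorted_data, sorted_data[1:]):
        if cur - prev <= threshold:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    return clusters

data = [34, 43, 49, 58, 70, 81, 89, 101, 116, 121, 131, 145]
for t in (11, 12, 15):
    print(t, gap_cluster(data, t))  # smaller thresholds give more, tighter clusters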

Page 34: Machine Learning 101

Clustering applied to IRIS data

• We used the same IRIS data as in the logistic regression demo, but changed two things:
• Added a feature, i.e. three features for clustering; that's why the output is a 3D plot
• Removed the output, to demonstrate unsupervised learning

Three features, thus plotted in 3D space

Page 35: Machine Learning 101

Let’scode Tryit!

import numpy as npimport matplotlib.pyplot as pltfrom mpl_toolkits.mplot3d import Axes3Dfrom sklearn.cluster import KMeansfrom sklearn import datasetsnp.random.seed(5)iris = datasets.load_iris()X = iris.data # No used of Y hereest = KMeans() #Wetrybeforehandtheno.ofclusters,canbeevenmore,defaultis8

est.fit(X) #NOTICE!,noYhere,“Unsupervised”,Yay!labels = est.labels_fig = plt.figure(1, figsize=(4, 3))plt.clf()ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)plt.cla()ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(np.float))ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(np.float))ax.w_xaxis.set_ticklabels([])ax.w_yaxis.set_ticklabels([])ax.w_zaxis.set_ticklabels([])ax.set_xlabel('Petal width')ax.set_ylabel('Sepal length')ax.set_zlabel('Petal length')plt.show()

Part1 Part2

• Preparation/Plotting• Important• Debug

Page 36: Machine Learning 101

Lessons learned!

• The dataset on which the model is executed here is available and well formatted, which is not always the case.
• Data acquisition and preparation come prior to feature extraction.
• Extracting the interesting features, "numerifying" them (converting to numbers, if not already), and later normalizing them comes prior to running the model on them.
• Features, or data columns, can be categorical or inferential variables, or can cause a singularity problem; these affect the performance of the data model and hence the residual cost.
• Selection of the model, linear or logistic, and observing the cost to select appropriate features, can also be achieved using R; a gold-standard p-value (the probability of incorrectly rejecting a true null hypothesis) would be ~0.05, which can still mean at least a 23% (and typically close to 50%) chance of a false positive.
• Cross-validation (CV) is done by running test and training a few times and measuring the difference. A confusion matrix also provides visibility into how many predictions are right and wrong. A sketch of both follows below.

@ObaidTal

From real-life Machine Learning
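A hedged sketch of those two checks, reusing the iris classifier from earlier (import paths are for recent scikit-learn versions):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

iris = load_iris()
clf = LogisticRegression(max_iter=5000)

# Cross-validation: fit/test on several splits and look at the spread of scores
print(cross_val_score(clf, iris.data, iris.target, cv=5))

# Confusion matrix: rows are true classes, columns are predicted classes
pred = clf.fit(iris.data, iris.target).predict(iris.data)
print(confusion_matrix(iris.target, pred))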

Page 37: Machine Learning 101

Lessons learned!

• If the data is a time series and there is missing data within the time window, then we can apply interpolation or extrapolation. Interpolation works well for archived data, whereas extrapolation works for live data. (A small interpolation sketch follows below.)
• Before applying any regression, it often happens that we may have to cluster the data first and then apply regression over it. This helps control outliers, if any, which may impact the model's performance. Outliers are not always noise in the data.
• Selection bias happens when we train the model on data that is not a true representation of the real occurrences. For instance, dissecting the housing prices ordered ascending, and training over the first half, would skip the higher-valued homes. To avoid it, the data should be shuffled to achieve an even distribution.
• The curse of dimensionality arises when we are challenged with too many features. To deal with it, carefully remove the non-significant features, including dependent, categorical, or composite features, where applicable.

…Continued

@ObaidTal
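A minimal pandas sketch of filling a missing time-series value by interpolation; the series and timestamps are made up:

import pandas as pd
import numpy as np

# Hypothetical hourly readings with one gap
ts = pd.Series([1.0, 2.0, np.nan, 4.0],
               index=pd.date_range("2017-01-20", periods=4, freq="H"))

print(ts.interpolate(method="time"))  # fills the gap with 3.0 - suitable for archived data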

Page 38: Machine Learning 101

References

• Stanford's CS229 by Prof. Andrew Y. Ng – highly recommended!
• https://www.youtube.com/watch?v=UzxYlbK2c7E

• Scikit-learn tutorial
• http://scikit-learn.org/stable/
• http://scikit-learn.org/stable/install.html

• http://www.shogun-toolbox.org/page/features/
• http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/

Page 39: Machine Learning 101

References, continued…

• http://www-bcf.usc.edu/~gareth/ISL/ – highly recommended!
• http://bigdataexaminer.com/uncategorized/how-to-run-linear-regression-in-python-scikit-learn/
• http://ipython-books.github.io/featured-04/
• http://stanford.edu/~cpiech/cs221/handouts/kmeans.html
• http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
• http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values

Page 40: Machine Learning 101

Thank you!

Talha Obaid
• linkedin.com/in/talhaobaid
• twitter.com/ObaidTal
• github.com/TalhaObaid
• [email protected]