data science lecture 01 2017-10-10 https://michalis.co/dsc17/ michael mathioudakis lyon, france



Material for ‘Data Science’ Course at GI-INSA, 2017 by Michael Mathioudakis is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

2017.10.10 michalis.co/dsc17 2

today

• course logistics
• what is data science?
• the data science pipeline
• frequencies and probabilities
• using probabilities
• software and platforms
• common prediction tasks

course logistics

language: English
but let me know if you need clarifications or a French script

course website: https://michalis.co/dsc17/

contact email: [email protected]
subject: dsc17 [your topic]

office hours: Monday 8h30-10h, by appointment

course logistics

sessions: 6 lectures, 3 TPs (lab sessions), 1 exam

lectures: methods for data science

TPs: once every two lectures, programming in python
jupyter notebook, scientific python (scipy) stack, scikit-learn

exam: closed books, 1h50

course logistics

prerequisites
familiarity with probabilities, statistics & programming

material
lectures: self-contained; it is enough to attend the class & understand the slides
TPs: you are expected to consult online documentation for python + software tools

course logistics

references

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics, second edition, 2008.

Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.

Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. Boca Raton, FL: CRC Press; 2014.

about the instructor

name: ‘michael’
other: ‘michalis’, ‘michail’

maître de conférences (associate professor)
teaching: GI INSA
research lab: LIRIS, Data Science Pole

previously… finland, canada, greece

about you

your name?

your plans?


data science

courses & degrees: coursera.org

books: amazon.com

news: news.google.com

jobs: linkedin.com

software: github.com

what is data science?

data science

some terms

big data
statistical methods
machine learning
cloud computing

data science

why now?

data from: physics, chemistry, medicine, biology, city traffic, social media, web usage, health sensors, customer data, purchases, financial data, industrial processes, etc…

what’s new
more data, therefore more opportunity to extract insights
better technology

what’s not new
we still want to extract insights, and use them to make predictions and take decisions
the ‘logic’ we use in our methods is the same (statistics / probability)

extract insights via…
pen & paper
automated analysis on computers

data science

more terms

‘learning’

what do we learn?
a description of the data
a ‘model’ that tells us how the data are distributed

why?
to make predictions (or inferences / guesses) and decisions
(not only about the future)

example

data
the patient’s temperature has just exceeded 40C
we supply the medicine and observe their temperature change after 2 hours

learning: data → model

ok, we ‘learned’ - then what?
predict what happens to temperature if we supply 200mg?
decide minimum dose to be certain to achieve at least 2C temp drop?
we can do both with the model, without the data

this example is a case of regression

example

prediction task: digit recognition
given a handwritten digit, we want the computer to predict (guess) what number it represents
useful for postal & shipping businesses

classification

let’s say we are given manually labeled data
how would we create a model?
how will we use a model for a decision task?

example

data: customer information
[figure: income ($) vs. yearly purchases ($), with customers grouped by age: <35, 35-50, >50]

prediction task: what will the next point be?

clustering, density estimation

how would we use a model?

your turn

give an example of prediction & decision tasks that would be of interest with the following datasets

supermarket

data for the past 5 years, every day
what products were on the shelves & at what price
who bought what

prediction task(s)?
decision task(s)?

traffic

data for the past 5 years, every hour
sensors measured how many cars passed through big intersections in Lyon
weather conditions in Lyon

prediction task(s)?
decision task(s)?

data science

more terms

‘machine’ learning

why do we use machines?
to make learning automated and efficient

big data
complex models

example: language

task: complete the sentence

language is complex
basic rules (syntax and grammar) do not suffice for good predictions
requires complex models

data
millions / billions of sentences / queries
user features
session attributes


data science pipeline

[diagram: data and model candidates → learning (inference / training / fitting) → model → prediction → decision]

probability: how much we believe that something is true


probability

how much we believe that something (a ‘proposition’) is true GIVEN the information at hand
the GIVEN part is VERY IMPORTANT!

0: no chance
1: certain

probability

a ball drops out of the box

proposition: it is green

what is the probability that the proposition is true GIVEN that there are 100 balls, 40 of them green?

related term: ‘frequency’

frequency and probability

your chocolate factory is experimenting with a new recipe for brownies

data: we inspected 100 brownies, 12 were labeled as bitter

relative frequency of bitter brownies? 0.12 or 12%

what is the probability that the next brownie is bitter?

frequency and probability

your chocolate factory is experimenting with a new recipe for brownies

data: we inspected 2 brownies, 1 was labeled as bitter

relative frequency of bitter brownies? 0.50 or 50%

what is the probability that the next brownie is bitter?

frequency and probability

your chocolate factory is experimenting with a new recipe for brownies

we inspected N brownies, n were labeled as bitter

relative frequency of bitter brownies? f = n / N

p: probability that the next brownie is bitter?

if N is large, then it makes sense that p ≅ f
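The relation between f and p can be illustrated with a small simulation. This is only a sketch and not part of the original slides: the helper name `relative_frequency` and the assumed "true" bitterness probability of 0.12 are chosen here for illustration.

```python
import random


def relative_frequency(labels):
    """Return f = n / N, the fraction of inspected brownies labeled bitter."""
    n = sum(1 for is_bitter in labels if is_bitter)
    return n / len(labels)


# The two datasets from the slides: 12 bitter out of 100, and 1 bitter out of 2.
f_100 = relative_frequency([True] * 12 + [False] * 88)
f_2 = relative_frequency([True, False])
print(f_100, f_2)

# Hypothetical simulation: assume each brownie is bitter with probability 0.12
# and observe how f behaves as N grows.
rng = random.Random(0)
ASSUMED_P = 0.12
for N in (2, 100, 10_000, 100_000):
    sample = [rng.random() < ASSUMED_P for _ in range(N)]
    print(N, relative_frequency(sample))
```

With N = 2 the relative frequency can only be 0, 0.5 or 1, which is why it is a poor stand-in for p; for large N it settles close to the assumed value.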

probability rules

given information I

sum rule
p(A | I) + p(not A | I) = 1

product rule
p(A and B | I) = p(A | B and I) × p(B | I)
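Both rules can be checked by counting. The sketch below assumes a hypothetical box of 100 balls described by two properties (green or not, large or not); the counts and the helper `prob` are invented for illustration, with the box contents playing the role of the information I.

```python
from fractions import Fraction

# Hypothetical box of 100 balls, counted by (is_green, is_large).
counts = {
    (True, True): 10,
    (True, False): 30,
    (False, True): 20,
    (False, False): 40,
}


def prob(event, given=lambda ball: True):
    """p(event | given), computed by counting balls; both arguments are predicates."""
    in_given = sum(c for ball, c in counts.items() if given(ball))
    in_both = sum(c for ball, c in counts.items() if given(ball) and event(ball))
    return Fraction(in_both, in_given)


def is_green(ball):
    return ball[0]


def is_large(ball):
    return ball[1]


# Sum rule: p(A | I) + p(not A | I) = 1
sum_rule = prob(is_green) + prob(lambda ball: not is_green(ball))

# Product rule: p(A and B | I) = p(A | B and I) x p(B | I)
lhs = prob(lambda ball: is_green(ball) and is_large(ball))
rhs = prob(is_green, given=is_large) * prob(is_large)

print(sum_rule, lhs, rhs)
```

Exact fractions are used instead of floats so that the two sides of each rule can be compared for equality.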

Bayes’ rule

it follows from the product rule

p(A | B and I) = p(B | A and I) × p(A | I) / p(B | I)

homework: explain how

fun fact: Bayes never wrote it
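One possible line of argument for the homework, sketched here under the assumption that p(B | I) > 0: the proposition ‘A and B’ is the same as ‘B and A’, so the product rule can be written in two ways and the two right-hand sides set equal.

```latex
\begin{align}
p(A \text{ and } B \mid I) &= p(A \mid B \text{ and } I)\, p(B \mid I) \\
p(B \text{ and } A \mid I) &= p(B \mid A \text{ and } I)\, p(A \mid I) \\
\Rightarrow\quad p(A \mid B \text{ and } I)\, p(B \mid I) &= p(B \mid A \text{ and } I)\, p(A \mid I) \\
\Rightarrow\quad p(A \mid B \text{ and } I) &= \frac{p(B \mid A \text{ and } I)\, p(A \mid I)}{p(B \mid I)}
\end{align}
```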

assigning probabilities

principle of indifference

if propositions A1, A2, A3, …, An are mutually exclusive and exhaustive, and no other information is given, then we assign p(Ai) = 1/n
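Applied to the box of balls from the earlier slide, the principle can be sketched as follows. The labeling of the balls (numbers 0 to 99, the first 40 green) and the function name `indifference` are assumptions made for illustration; adding up the probabilities of mutually exclusive propositions is taken here as a consequence of the probability rules.

```python
from fractions import Fraction


def indifference(propositions):
    """Assign p(Ai) = 1/n to n mutually exclusive, exhaustive propositions."""
    n = len(propositions)
    return {a: Fraction(1, n) for a in propositions}


# Hypothetical labeling of the box: proposition i is "ball i drops out",
# and balls 0..39 are the green ones.
balls = list(range(100))
p = indifference(balls)

# "The ball is green" is true exactly when one of balls 0..39 drops out.
p_green = sum(p[ball] for ball in balls if ball < 40)
print(p_green)
```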


data science pipeline

[diagram: data and model candidates → learning → model → prediction → decision]

probability: assign probabilities to propositions

probability :: prediction

what is the probability that a dose of 300mg drops temperature more than 2C?

probability of temp. drop > 2C, given dose = 300mg and model (see figure)

p(temp. drop > 2C | dose = 300mg ; model)

proposition: temp. drop > 2C
given information: dose = 300mg ; model

the value for this probability is provided by the model!
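To make "provided by the model" concrete, here is a minimal sketch with an invented model. The lecture's figure is not available in this transcript, so the linear-plus-Gaussian-noise form, the constants `SLOPE` and `NOISE_SD`, and both function names are assumptions. The second function illustrates the decision task from the earlier example, with "certain" relaxed to a chosen confidence level.

```python
from statistics import NormalDist

# Hypothetical model, NOT the one in the lecture figure:
# temperature drop (in C) ~ Normal(mean = SLOPE * dose_mg, sd = NOISE_SD).
SLOPE = 0.01     # assumed: degrees C of drop per mg
NOISE_SD = 0.5   # assumed: spread of the drop around its mean, in C


def prob_drop_exceeds(threshold_c, dose_mg):
    """p(temp. drop > threshold | dose ; model) under the assumed model."""
    model = NormalDist(mu=SLOPE * dose_mg, sigma=NOISE_SD)
    return 1.0 - model.cdf(threshold_c)


def min_dose_for(threshold_c, confidence, doses_mg=range(0, 1001, 10)):
    """Smallest candidate dose whose probability of success reaches `confidence`."""
    for dose in doses_mg:
        if prob_drop_exceeds(threshold_c, dose) >= confidence:
            return dose
    return None  # no candidate dose is enough


p_300 = prob_drop_exceeds(2.0, 300)   # prediction
dose_95 = min_dose_for(2.0, 0.95)     # decision
print(p_300, dose_95)
```

Note that neither function touches the original observations: once the model is fixed, prediction and decision need only the model.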

data science pipeline

[diagram: data and model candidates → learning → model → prediction → decision]

probability: assign probabilities to models

probability :: learning

consider models where temperature drops exponentially with dose
drop: dose^(-k) and error up to ε

model candidates M1, M2, …: k from -5 to +5 and ε from -2 to +2

what is the probability that the right model is M1 / M2 / …?
probability of model(k, ε), given data

probability :: learning

p(model(k, ε) | data ; k in [-5, +5], ε in [-2, +2])
in short: p(M | D ; I)

from Bayes’ rule, this is proportional to
p(data | M ; I) × p(M | I)
likelihood × prior

we choose the model of maximum probability p(M | D ; I)
(do we have to?)
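The computation "posterior ∝ likelihood × prior, then pick the maximum" can be sketched on a grid of candidate models. Everything specific below is an assumption made for illustration and differs from the slide's model family: the data points, the linear form drop = k × dose, the Gaussian error with fixed `NOISE_SD`, and the grid step of 0.1.

```python
import math

# Hypothetical observations: (dose in units of 100 mg, observed temperature drop in C).
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]

# Assumed family of model candidates: drop = k * dose + Gaussian noise (sd = NOISE_SD).
NOISE_SD = 0.5
candidates = [k / 10 for k in range(-50, 51)]   # k from -5 to +5 in steps of 0.1


def log_likelihood(k):
    """log p(data | M_k ; I) for independent Gaussian errors."""
    total = 0.0
    for dose, drop in data:
        residual = drop - k * dose
        total += -0.5 * (residual / NOISE_SD) ** 2
        total -= math.log(NOISE_SD * math.sqrt(2 * math.pi))
    return total


# Prior p(M | I): uniform over the candidates (principle of indifference).
log_prior = -math.log(len(candidates))

# Unnormalized log posterior, then normalize so the probabilities sum to 1.
log_post = {k: log_likelihood(k) + log_prior for k in candidates}
max_log = max(log_post.values())
weights = {k: math.exp(v - max_log) for k, v in log_post.items()}
norm = sum(weights.values())
posterior = {k: w / norm for k, w in weights.items()}

# The model of maximum probability p(M | D ; I).
k_map = max(posterior, key=posterior.get)
print(k_map, posterior[k_map])
```

The dictionary `posterior` holds a probability for every candidate, not only the winner, which is the starting point for the question "do we have to?" choose a single model.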


data science pipeline – the Bayesian way

[diagram: data and model candidates → learning → prediction → decision, with probability]

fun fact: it took centuries to arrive at this pipeline

data science pipeline – in practice

[diagram: data and model candidates → learning → model → prediction → decision, with the parts labeled 1st to 6th]

in this course…

[diagram: data and model candidates → learning → model → prediction → decision]


platforms and software

scientific python
scipy.org: libraries for scientific computing in python

R and MATLAB
www.r-project.org
www.mathworks.com/products/matlab.html

scikit-learn

python ML library on top of scipy stack
many general ML algorithms
standardized pipeline
ideal for fast prototyping on moderate datasets

deep learning :: tensorflow

deep learning library based on user-defined computation graphs, for out-of-python optimization

deep learning :: other

torch (torch.ch)
open source machine learning library, scientific framework, programming language (Lua)
used by Facebook Research

theano (http://deeplearning.net/software/theano/)
deep learning with efficient numerical operations

microsoft cognitive toolkit (cntk) (https://cntk.ai/)
tensorflow alternative

keras
simpler tensorflow, theano, cntk in python

cloud :: google

Cloud ML Engine
basically offers the ML pipeline with Deep Learning models implemented in Tensorflow

other services
trained models for other applications: speech, video or image tagging, translation
https://cloud.google.com/products/machine-learning/

pricing: about 0.5$ per hour

cloud :: other

amazon aws
classification and regression, with logistic and linear regression

microsoft azure ‘cortana intelligence’
ML pipeline

apache spark

machine learning algorithms on top of Spark
iterative optimization


regression

build model that provides dp(Y=y | X=x ; Model M) for real-valued Y

X, Y: parts of the data (‘features’)

regression methods differ in the set of model candidates they consider
each method has corresponding algorithm(s) to search for best model

some regression methods

linear regression: line + error
segmented regression: k segments + errors
polynomial regression: curve + error

p(M | data ; I) ∝ p(data | M ; I) × p(M | I)
this is where methods differ
each model comes with its own
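For the "line + error" case, a minimal sketch of the search for the best model is ordinary least squares. Assuming Gaussian errors of fixed size and a flat prior over lines, the least-squares line is also the line of maximum posterior probability. The data values and function names below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sd(xs, ys, intercept, slope):
    """Spread of the errors around the fitted line (the '+ error' part)."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5


# Hypothetical data: X = dose (mg), Y = temperature drop (C).
doses = [100, 150, 200, 250, 300, 350]
drops = [0.9, 1.6, 2.1, 2.4, 3.1, 3.4]

intercept, slope = fit_line(doses, drops)
sigma = residual_sd(doses, drops, intercept, slope)
predicted_200 = intercept + slope * 200
print(intercept, slope, sigma, predicted_200)
```

The fitted line gives the center of the prediction for a new X, and `sigma` describes how far Y is expected to fall from that center.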

classification

build model that provides p(Y=y | X=x ; Model M) for categorically-valued Y

classification methods differ in the set of model candidates they consider
each method has corresponding algorithm(s) to search for best model

what is X and Y for digit recognition?
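As one illustration of a model that provides p(Y=y | X=x), the sketch below uses a single real-valued feature and a Gaussian per class, combined with Bayes' rule. This particular choice of model candidates, the training values, and the function names are assumptions for illustration; real digit recognition would use many features per image.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical labeled data: one real-valued feature x per example, label y in {"0", "1"}.
training = {
    "0": [4.8, 5.1, 5.5, 4.9, 5.2],
    "1": [2.0, 2.4, 1.9, 2.2, 2.5],
}


def fit(training):
    """Learn, per class, a prior p(Y=y) and a Gaussian for p(X=x | Y=y)."""
    total = sum(len(xs) for xs in training.values())
    return {
        label: (len(xs) / total, NormalDist(mean(xs), stdev(xs)))
        for label, xs in training.items()
    }


def predict_proba(model, x):
    """p(Y=y | X=x ; model) for every label y, via Bayes' rule."""
    joint = {label: prior * dist.pdf(x) for label, (prior, dist) in model.items()}
    evidence = sum(joint.values())
    return {label: value / evidence for label, value in joint.items()}


model = fit(training)
probs = predict_proba(model, 2.3)
predicted_label = max(probs, key=probs.get)
print(probs, predicted_label)
```

Returning the full dictionary of probabilities, rather than only the most probable label, leaves room for a later decision step that weighs the cost of each kind of mistake.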

supervised and unsupervised learning

regression and classification are cases of ‘supervised’ learning
build model that provides p(Y=y | X=x ; Model M)
Y: some data features, X: other data features

‘unsupervised’ learning
build model that provides p(X=x, Y=y ; Model M)

unsupervised learning

build model that provides dp(X=x, Y=y ; Model M)

find structure in the data

[figure: data plotted on axes X and Y]
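A very simple way to estimate a joint distribution over X and Y is a two-dimensional histogram: divide the plane into cells and use the relative frequency of each cell. The sketch below does this for invented customer data; the values, bin widths and function names are assumptions for illustration, and many other density estimators exist.

```python
from collections import Counter

# Hypothetical customer data: (income in $1000s, yearly purchases in $1000s).
customers = [
    (22, 1.5), (25, 2.0), (28, 1.8), (31, 2.2),
    (62, 6.5), (65, 7.0), (68, 6.8), (71, 7.4),
]

INCOME_BIN = 20     # assumed bin width for income
PURCHASE_BIN = 2.5  # assumed bin width for purchases


def cell_of(income, purchases):
    """Map a point to its grid cell (bin indices along X and Y)."""
    return (int(income // INCOME_BIN), int(purchases // PURCHASE_BIN))


cell_counts = Counter(cell_of(x, y) for x, y in customers)


def joint_probability(income, purchases):
    """Estimate the probability of the cell containing (income, purchases)."""
    return cell_counts[cell_of(income, purchases)] / len(customers)


p_dense = joint_probability(26, 1.9)   # a point among the low-income customers
p_empty = joint_probability(45, 4.0)   # a point in a region with no customers
print(p_dense, p_empty)
```

Cells with high estimated probability correspond to groups of similar customers, which is one sense of "structure in the data".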

the end

ML pipeline

[diagram: data and model candidates → learning → model → prediction → decision, with probability]