data science · data science physics chemistry medicine biology city traffic social media web usage...
Material for ‘Data Science’ Course at GI-INSA, 2017 by Michael Mathioudakis is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
2017.10.10 michalis.co/dsc17 2
today
• course logistics
• what is data science?
• the data science pipeline
• frequencies and probabilities
• using probabilities
• software and platforms
• common prediction tasks
course logistics
language: english
but let me know if you need clarifications or a French script
course website: https://michalis.co/dsc17/
contact email: [email protected]
subject: dsc17 [your topic]
office hours: Monday 8h30-10h, by appointment
course logistics
sessions: 6 lectures, 3 TPs (lab sessions), 1 exam
lectures: methods for data science
TPs: once every two lectures, programming in python
jupyter notebook, scientific python (scipy) stack, scikit-learn
exam: closed books, 1h50
course logistics
prerequisites: familiarity with probabilities, statistics & programming
material
lectures: self-contained
it is enough to attend the class & understand the slides
TPs: you are expected to consult online documentation for python + software tools
course logistics
references
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer series in statistics, Second edition, 2008.
Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.
Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian data analysis. Boca Raton, FL: CRC press; 2014.
about the instructor
name: ‘michael’; other: ‘michalis’, ‘michail’
maître de conférences (associate professor)
teaching: GI INSA
research lab: LIRIS, Data Science Pole
previously… finland, canada, greece
data science
some terms
data science
big data
statistical methods
machine learning
cloud computing
data science
data from:
physics, chemistry, medicine, biology
city traffic, social media, web usage
health sensors
customer data, purchases
financial data, industrial processes
etc…
why now?
what’s new
more data, therefore more opportunity to extract insights
better technology
what’s not new
we still want to extract insights, and use them to make predictions, take decisions
the ‘logic’ we use in our methods is the same (statistics / probability)
extract insights via…
pen & paper
automated analysis on computers
‘learning’
what do we learn?
a description of the data
a ‘model’ that tells us how the data are distributed
why?
to make predictions (or inferences / guesses) and decisions
(not only about the future)
example
data
the patient’s temperature has just exceeded 40 °C
we supply the medicine and observe their temperature change after 2 hours
learning
ok, we ‘learned’ - then what?
predict what happens to temperature if we supply 200 mg?
decide minimum dose to be certain to achieve at least 2 °C temp drop?
model
can do with the model without the data
this example is a case of regression
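The data, learning, model, prediction and decision steps of this example can be sketched in a few lines of Python. The measurements below are made up for illustration (they are not from the course), and a straight line fitted by least squares is only one possible choice of model candidates.

```python
# Hypothetical measurements: dose in mg, temperature drop in C after 2 hours.
doses = [50, 100, 150, 200, 250, 300]
drops = [0.4, 1.1, 1.4, 2.1, 2.4, 3.1]

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# learning: data -> model (here the model is just the pair a, b)
a, b = fit_line(doses, drops)

def predict_drop(dose):
    """Prediction: expected temperature drop for a dose, using only the model."""
    return a + b * dose

def dose_for_drop(target_drop):
    """Decision aid: dose at which the fitted line reaches the target drop."""
    return (target_drop - a) / b

print(f"predicted drop at 200 mg: {predict_drop(200):.2f} C")
print(f"dose for an expected 2 C drop: {dose_for_drop(2.0):.0f} mg")
```

Once `a` and `b` are known, the raw data are no longer needed for prediction. Note that the fitted line only gives the expected drop; being "certain" of a 2 C drop also requires a description of the error around the line.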
example
prediction task: digit recognition
given a handwritten digit, we want the computer to predict (guess) what number it represents
useful for postal & shipping businesses
classification
let’s say we are given manually labeled data
how would we create a model?
how will we use a model for decision task?
example
data: customer information
[figure: scatter plot with axes income ($) and yearly purchases ($); points grouped by age: <35, 35-50, >50]
prediction task: what will the next point be?
clustering, density estimation
how would we use a model?
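As one illustration of the clustering mentioned on this slide, here is a minimal k-means sketch in plain Python. The customer values are invented, the choice of two clusters is an assumption, and k-means is only one of many clustering methods; the slide does not name a specific one.

```python
import math

# Hypothetical customers: (income in $1000s, yearly purchases in $1000s).
customers = [
    (20, 2), (22, 3), (25, 2.5), (24, 3.5),      # lower income, low purchases
    (60, 10), (62, 12), (65, 11), (58, 9.5),     # higher income, high purchases
]

def closest(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[closest(p, centroids)].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
    return centroids

# Start from two of the data points, an arbitrary choice.
centroids = k_means(customers, [customers[0], customers[-1]])
labels = [closest(p, centroids) for p in customers]
print(centroids)
print(labels)
```

One way to use such a model: a new customer can be assigned to the nearest centroid and treated like the other customers of that group.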
your turn
give an example of prediction & decision tasks that would be of interest with the following datasets
supermarket
data for the past 5 years, every day
what products were on the shelves & at what price
who bought what
prediction task(s)? decision task(s)?
traffic
data for the past 5 years, every hour
sensors measured how many cars passed through big intersections in Lyon
weather conditions in Lyon
prediction task(s)? decision task(s)?
‘machine’ learning
why do we use machines?
to make learning automated and efficient
big data, complex models
example: language
task: complete the sentence
language is complex
basic rules (syntax and grammar) do not suffice for good predictions
requires complex models
data: millions / billions of sentences / queries
user features, session attributes
data science pipeline
data → model → prediction → decision
learning: inference / training / fitting
model candidates
probability: how much we believe that something is true
probability
how much we believe that something (a ‘proposition’) is true GIVEN the information at hand
VERY IMPORTANT!
0: no chance, 1: certain
probability
a ball drops out of the box
proposition: it is green
what is the probability that the proposition is true GIVEN that there are 100 balls, 40 of them green?
related term: ‘frequency’
frequency and probability
your chocolate factory is experimenting with a new recipe for brownies
data: we inspected 100 brownies, 12 were labeled as bitter
relative frequency of bitter brownies? 0.12 or 12%
what is the probability that the next brownie is bitter?
frequency and probability
your chocolate factory is experimenting with a new recipe for brownies
data: we inspected 2 brownies, 1 was labeled as bitter
relative frequency of bitter brownies? 0.50 or 50%
what is the probability that the next brownie is bitter?
frequency and probability
your chocolate factory is experimenting with a new recipe for brownies
we inspected N brownies, n were labeled as bitter
relative frequency of bitter brownies? f = n / N
p: probability that the next brownie is bitter?
if N is large, then it makes sense that p ≅ f
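A small simulation can make the "N is large" condition concrete. It is a sketch under an artificial assumption: we pretend to know a 'true' probability of 0.12 (in practice it is unknown) and watch how the relative frequency f behaves for small and large N.

```python
import random

def relative_frequency(true_p, n_inspected, rng):
    """Inspect n_inspected simulated brownies; return the fraction labeled bitter."""
    bitter = sum(rng.random() < true_p for _ in range(n_inspected))
    return bitter / n_inspected

rng = random.Random(0)   # fixed seed so the run is repeatable
true_p = 0.12            # assumed 'true' probability, unknown in practice

for n_inspected in (2, 100, 10_000, 100_000):
    f = relative_frequency(true_p, n_inspected, rng)
    print(f"N = {n_inspected:>7}: f = {f:.4f}")
```

With N = 2 the only possible values of f are 0, 0.5 and 1, so f says little about p; for large N it settles close to the assumed value.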
probability rules
given information I
sum rule: p(A | I) + p(not A | I) = 1
product rule: p(A and B | I) = p(A | B and I) × p(B | I)
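The two rules can be checked numerically on a small example. The box below is hypothetical: it keeps the 100 balls with 40 green ones from the earlier slide, and the split by size is invented so that there is a second proposition B.

```python
from fractions import Fraction

# Hypothetical box of 100 balls, counted by colour and size.
counts = {
    ("green", "large"): 10,
    ("green", "small"): 30,
    ("other", "large"): 20,
    ("other", "small"): 40,
}
total = sum(counts.values())

def is_green(ball):          # proposition A
    return ball[0] == "green"

def is_large(ball):          # proposition B
    return ball[1] == "large"

def p(event):
    """Probability that a ball drawn at random satisfies the event."""
    return Fraction(sum(c for ball, c in counts.items() if event(ball)), total)

def p_given(event, condition):
    """Conditional probability, by counting only balls that meet the condition."""
    matching = sum(c for ball, c in counts.items() if condition(ball))
    both = sum(c for ball, c in counts.items() if condition(ball) and event(ball))
    return Fraction(both, matching)

# sum rule: p(A | I) + p(not A | I) = 1
assert p(is_green) + p(lambda ball: not is_green(ball)) == 1

# product rule: p(A and B | I) = p(A | B and I) x p(B | I)
p_both = p(lambda ball: is_green(ball) and is_large(ball))
assert p_both == p_given(is_green, is_large) * p(is_large)

print(p(is_green), p_both, p_given(is_green, is_large))
```

Exact fractions are used so that the equalities hold without rounding error.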
bayes rule
it follows from the product rule
p(A | B and I) = p(B | A and I) × p(A | I) / p(B | I)
homework: explain how
fun fact: Bayes never wrote it
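One possible line of argument for the homework, as a sketch: the proposition "A and B" is the same as "B and A", so the product rule can be written in both orders, and the two right-hand sides must be equal. The only extra assumption is that p(B | I) is not zero.

```latex
\begin{align}
p(A \text{ and } B \mid I) &= p(A \mid B \text{ and } I)\, p(B \mid I) \\
p(B \text{ and } A \mid I) &= p(B \mid A \text{ and } I)\, p(A \mid I)
\end{align}
Equating the right-hand sides and dividing by $p(B \mid I) \neq 0$:
\begin{equation}
p(A \mid B \text{ and } I) = \frac{p(B \mid A \text{ and } I)\, p(A \mid I)}{p(B \mid I)}
\end{equation}
```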
assigning probabilities
principle of indifference
if propositions A1, A2, A3, …, An are mutually exclusive and exhaustive and no other information is given
then we assign p(Ai) = 1/n
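Applied to the earlier box of 100 balls with 40 green ones, the principle gives the answer most people guess intuitively. The short sketch below spells out the reasoning; treating "ball i dropped out" as the n propositions is our reading of that example.

```python
from fractions import Fraction

n_balls = 100
n_green = 40

# Propositions A1..A100: "ball i is the one that dropped out".
# They are mutually exclusive and exhaustive, so indifference gives 1/n each.
p_each = Fraction(1, n_balls)

# "It is green" is true for exactly 40 of these exclusive propositions,
# so their probabilities add up.
p_green = n_green * p_each
print(p_green)          # 2/5
```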
data science pipeline
data → model → prediction → decision
learning
model candidates
probability: assign probabilities to propositions
probability :: prediction
what is the probability that a dose of 300 mg drops temperature more than 2 °C?
p(temp. drop > 2 °C | dose = 300 mg ; model)
proposition: temp. drop > 2 °C
given information: dose = 300 mg and model (see figure)
the value for this probability is provided by the model!
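To show how a model can provide this value, here is a sketch with a made-up model: the temperature drop is assumed to be normally distributed around a line through the origin. The slope, the noise level and the 95% threshold are all assumptions for illustration; the model in the slide's figure is not reproduced here.

```python
import math

# Hypothetical model: drop ~ Normal(mean = slope * dose, sd = noise_sd).
slope = 0.01      # C per mg, assumed
noise_sd = 0.5    # C, assumed

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def p_drop_exceeds(threshold, dose):
    """p(temp. drop > threshold | dose ; model)."""
    return 1.0 - normal_cdf(threshold, slope * dose, noise_sd)

# prediction: probability of the proposition "drop > 2 C" at 300 mg
print(f"{p_drop_exceeds(2.0, 300):.3f}")

# decision: smallest dose on a 10 mg grid with at least 95% probability
min_dose = next(d for d in range(0, 1001, 10) if p_drop_exceeds(2.0, d) >= 0.95)
print(min_dose)
```

The decision step replaces "certain" with a probability threshold, since under this model no finite dose makes the proposition certain.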
data science pipeline
data → model → prediction → decision
learning
model candidates
probability: assign probabilities to models
probability :: learning
model M1, model M2
consider models where temperature drops exponentially with dose
drop: dose^(-k) and error up to ε
k from -5 to +5 and ε from -2 to +2
what is the probability that the right model is M1 / M2 / …?
probability of model(k, ε) given data
probability :: learning
model M1, model M2
p(model(k, ε) | data ; k in [-5, +5], ε in [-2, +2]), in short p(M | D ; I)
from Bayes’ Rule, this is proportional to p(data | M ; I) × p(M | I)
likelihood: p(data | M ; I)
prior: p(M | I)
we choose the model of maximum probability p(M | D ; I)
(do we have to?)
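The computation "posterior ∝ likelihood × prior" can be sketched on a grid of model candidates. Everything specific below is an assumption made for illustration: the data are invented, the model family is read as drop = dose^(-k), the error is treated as Gaussian noise with a fixed standard deviation instead of a second parameter ε, and the prior is uniform over the grid.

```python
import math

# Hypothetical data: dose in units of 100 mg, observed temperature drop in C.
data = [(1, 1.1), (2, 1.5), (3, 2.2), (4, 2.7)]
noise_sd = 0.2                                 # assumed, fixed for simplicity

# Model candidates M_k: drop = dose ** (-k), for k on a grid in [-5, +5].
candidates = [round(-5 + 0.1 * i, 1) for i in range(101)]

def log_likelihood(k):
    """log p(data | M_k ; I) under Gaussian noise around dose ** (-k)."""
    total = 0.0
    for dose, drop in data:
        residual = drop - dose ** (-k)
        total += -0.5 * (residual / noise_sd) ** 2
        total -= math.log(noise_sd * math.sqrt(2 * math.pi))
    return total

# Uniform prior p(M_k | I): every candidate is equally plausible beforehand.
log_prior = -math.log(len(candidates))

# Unnormalised log posterior, then normalise so the probabilities sum to 1.
log_post = [log_likelihood(k) + log_prior for k in candidates]
peak = max(log_post)
weights = [math.exp(lp - peak) for lp in log_post]
weight_sum = sum(weights)
posterior = [w / weight_sum for w in weights]

best_k = candidates[posterior.index(max(posterior))]
print(f"most probable candidate: k = {best_k}")
```

The list `posterior` holds a probability for every candidate, not just the best one, which is what makes the slide's closing question meaningful: predictions could also be averaged over all candidates instead of using a single chosen model.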
data science pipeline
data → model → prediction → decision
learning
model candidates
probability
data science pipeline – the Bayesian way
data → prediction → decision
learning
model candidates
probability
fun fact: it took centuries to arrive at this pipeline
data science pipeline – in practice
data → model → prediction → decision
learning
model candidates
[figure: the pipeline components are annotated with the order in which they are addressed in practice: 1st, 2nd, 3rd, 4th, 5th, 6th]
in this course…
data → model → prediction → decision
learning
model candidates
scikit-learn
python ML library on top of scipy stack
many general ML algorithms
standardized pipeline
ideal for fast prototyping on moderate datasets
deep learning :: tensorflow
deep learning library based on user-defined computation graphs
for out-of-python optimization
deep learning :: other
torch (torch.ch)
open source machine learning library, scientific framework, programming language (Lua)
used by Facebook Research
theano (http://deeplearning.net/software/theano/)
deep learning with efficient numerical operations
microsoft cognitive toolkit (cntk) (https://cntk.ai/)
tensorflow alternative
keras
simpler tensorflow, theano, cntk in python
cloud :: google
Cloud ML Engine
basically offers the ML pipeline with Deep Learning models implemented in Tensorflow
other services
trained models for other applications: speech, video or image tagging, translation
https://cloud.google.com/products/machine-learning/
pricing: about 0.5 $ per hour
cloud :: other
amazon aws
classification and regression with logistic and linear regression
microsoft azure ‘cortana intelligence’
ML pipeline
apache spark
machine learning algorithms on top of Spark
iterative optimization
regression
build model that provides dp(Y=y | X=x ; Model M) for real-valued Y
X, Y: parts of the data, ‘features’
regression methods differ in the set of model candidates they consider
each method has corresponding algorithm(s) to search for best model
some regression methods
linear regression: line + error
segmented regression: k segments + errors
multinomial regression: curve + error
p(M | data ; I) ∝ p(data | M ; I) × p(M | I)
this is where methods differ
each model comes with its own
classification
build model that provides p(Y=y | X=x ; Model M) for categorically-valued Y
classification methods differ in the set of model candidates they consider
each method has corresponding algorithm(s) to search for best model
what is X and Y for digit recognition?
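A very small classifier shows what "a model that provides p(Y=y | X=x ; Model M)" can look like. This is a sketch on invented data with a single categorical feature; the conditional probabilities are simply relative frequencies, which is reasonable only when each value of x has been seen often enough.

```python
from collections import Counter, defaultdict

# Hypothetical labeled data: (x, y) = (weather, traffic level).
labeled = [
    ("rain", "heavy"), ("rain", "heavy"), ("rain", "light"),
    ("sun", "light"), ("sun", "light"), ("sun", "heavy"), ("sun", "light"),
]

# learning: count how often each label y occurs for each feature value x
counts = defaultdict(Counter)
for x, y in labeled:
    counts[x][y] += 1

def p_label(y, x):
    """Model: p(Y=y | X=x ; M), estimated as a relative frequency."""
    return counts[x][y] / sum(counts[x].values())

def classify(x):
    """Prediction: the most probable label for x under the model."""
    return max(counts[x], key=lambda y: p_label(y, x))

print(p_label("heavy", "rain"))
print(classify("sun"))
```

A feature value that never appeared in the labeled data has no counts, so this sketch cannot classify it; practical methods differ precisely in how they generalise to unseen x.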
supervised and unsupervised learning
regression and classification are cases of ‘supervised’ learning
build model that provides p(Y=y | X=x ; Model M)
X, Y: some data features, other data features
‘unsupervised’ learning
build model that provides p(X=x, Y=y ; Model M)
unsupervised learning
build model that provides dp(X=x, Y=y ; Model M)
find structure in the data
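The difference between the joint p(X=x, Y=y) and the conditional p(Y=y | X=x) can be seen on a toy table. The observations below are invented and the joint is estimated by relative frequencies, which is a simplifying assumption; the conditional is then recovered from the joint with the product rule.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical observations of two features (x, y); neither is treated as a label.
observations = [
    ("a", 0), ("a", 0), ("a", 1), ("b", 1),
    ("b", 1), ("b", 1), ("a", 0), ("b", 0),
]

pair_counts = Counter(observations)
n = len(observations)
y_values = {y for _, y in observations}

def p_joint(x, y):
    """Model: p(X=x, Y=y ; M), estimated as the relative frequency of the pair."""
    return Fraction(pair_counts[(x, y)], n)

def p_x(x):
    """Marginal p(X=x ; M): sum the joint over all observed values of y."""
    return sum(p_joint(x, y) for y in y_values)

def p_y_given_x(y, x):
    """The conditional used in supervised learning, via the product rule."""
    return p_joint(x, y) / p_x(x)

print(p_joint("a", 0), p_x("a"), p_y_given_x(0, "a"))
```

A model of the joint distribution therefore contains the information a supervised model provides, at the cost of having to describe how X itself is distributed.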