CS639: Data Management for Data Science
Lecture 2: Statistical Inference and Exploratory Data Analysis
Theodoros Rekatsinas
Announcements
• Waiting list: you will receive an invitation to register, and you have two days to reply.
• Piazza: you need to register to engage in discussions and receive announcements.
• Announcements: check the class website for updates; announcements will be posted there.
First assignment (P0)
• Create a GitHub account and clone the GitHub repository of the class.
• Deploy the class VM (instructions are in the slides of Lecture 1).
Today's Lecture
1. Quick Recap: The data science workflow
2. Statistical Inference
3. Exploratory Data Analysis
  • Activity: EDA in a Jupyter notebook
1. Quick Recap: The DS Workflow
One definition of data science
Data science is a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis and extraction of valuable knowledge and information from raw data.
Source: Techopedia
Data science workflow
https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext
What is wrong here?
Data science workflow
Data science is not (only) about hacking!
Your mindset should be “statistical thinking in the age of big data”.
2. Statistical Inference
What you will learn about in this section
1. Uncertainty and Randomness in Data
2. Modeling Data
3. Samples and Distributions
Uncertainty and Randomness
• Data represents the traces of real-world processes.
  • The collected traces correspond to a sample of those processes.
• There is randomness and uncertainty in the data collection process.
• The process that generates the data is stochastic (random).
  • Example: Let's toss a coin! What will the outcome be? Heads or tails? There are many factors that make a coin toss a stochastic process.
• The sampling process introduces uncertainty.
  • Example: Errors in sensor position due to GPS error, errors due to the angles of laser travel, etc.
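The coin-toss example can be simulated in a few lines. This is only a sketch: the fair-coin probability of 0.5, the sample sizes, and the helper names `toss_coin` and `fraction_heads` are assumptions made for illustration, not part of the lecture.

```python
import random

def toss_coin(rng, p_heads=0.5):
    """Return 'H' with probability p_heads, otherwise 'T'."""
    return "H" if rng.random() < p_heads else "T"

def fraction_heads(n_tosses, seed):
    """Toss the coin n_tosses times and return the observed fraction of heads."""
    rng = random.Random(seed)
    tosses = [toss_coin(rng) for _ in range(n_tosses)]
    return tosses.count("H") / n_tosses

# Small samples of the same process differ from one another.
small = [fraction_heads(10, seed) for seed in range(5)]
# With many tosses the observed fraction settles near the true probability.
large = fraction_heads(100_000, seed=0)

print("fraction of heads in 5 samples of 10 tosses:", small)
print("fraction of heads in 100,000 tosses:", round(large, 3))
```

Each small sample is a different trace of the same stochastic process, which is why a single sample should not be read as the process itself.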
Models
• Data represents the traces of real-world processes.
• Part of the data science process: we need to model the real world.
• A model is a function f_θ(x)
  • x: input variables (can be a vector)
  • θ: model parameters
Modeling Uncertainty and Randomness
• Data represents the traces of real-world processes.
• There is randomness and uncertainty in the data collection process.
• A model is a function f_θ(x)
  • x: input variables (can be a vector)
  • θ: model parameters
• Models should rely on probability theory to capture uncertainty and randomness!
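One common way to combine a model f_θ(x) with probability theory is to add a random noise term to a deterministic function. The sketch below assumes a linear form for f, zero-mean Gaussian noise, and hypothetical parameter values; the slides do not prescribe any of these choices.

```python
import random

def f(theta, x):
    """Deterministic part of the model: intercept theta[0] plus slope theta[1] times x."""
    return theta[0] + theta[1] * x

def observe(theta, x, noise_std, rng):
    """One noisy observation: the model output plus zero-mean Gaussian noise."""
    return f(theta, x) + rng.gauss(0.0, noise_std)

rng = random.Random(0)
theta = (1.0, 2.0)  # hypothetical parameter values
x = 3.0

# Repeated observations at the same input scatter around the model output.
samples = [observe(theta, x, noise_std=0.5, rng=rng) for _ in range(10_000)]
mean_y = sum(samples) / len(samples)

print("model output f_theta(x):", f(theta, x))
print("average of noisy observations:", round(mean_y, 2))
```

Individual observations differ from f_θ(x), but under this noise assumption their average stays close to it.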
Modeling Example
The model corresponds to a linear function.
Population and Samples
• The population is the complete set of traces/data points.
  • For example, the US population is 314 million and the world population is 7 billion.
  • All voters, all things
• A sample is a subset of the complete set (or population).
  • How we select the sample introduces biases into the data.
• Population → sample → mathematical model
Population and Samples
• Example: Emails sent by people in the CS dept. in a year.
• Method 1: 1/10 of all emails over the year, randomly chosen
• Method 2: 1/10 of people randomly chosen; all their emails over the year
• Both are reasonable sample selection methods for analysis.
• However, the estimated pdfs (probability distribution functions) of the emails sent by a person will be different for the two samples.
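The difference between the two methods can be seen on synthetic data. The sketch below assumes a hypothetical department of 200 people who each send between 10 and 2000 emails a year; these numbers are invented for illustration only.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical population: 200 people, each sending 10 to 2000 emails a year.
emails_per_person = {person: rng.randint(10, 2000) for person in range(200)}
# One record per email, labelled with its sender.
all_emails = [p for p, n in emails_per_person.items() for _ in range(n)]

# Method 1: choose 1/10 of all emails uniformly at random.
sample1 = rng.sample(all_emails, len(all_emails) // 10)
counts1 = Counter(sample1)  # emails per person as seen in the sample

# Method 2: choose 1/10 of the people and keep all of their emails.
people2 = rng.sample(list(emails_per_person), len(emails_per_person) // 10)
counts2 = {p: emails_per_person[p] for p in people2}

mean1 = sum(counts1.values()) / len(counts1)
mean2 = sum(counts2.values()) / len(counts2)

print("Method 1: people seen =", len(counts1), "mean emails per person =", round(mean1, 1))
print("Method 2: people seen =", len(counts2), "mean emails per person =", round(mean2, 1))
```

Method 1 sees many people but only a fraction of each person's emails, while Method 2 sees few people with their complete counts, so the per-person distributions estimated from the two samples differ.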
Back to Models
• Abstraction of a real-world process
• How to build a model?
• Probability distribution functions (pdfs) are the building blocks of statistical models.
Probability Distributions
• Normal, uniform, Cauchy, t-, F-, Chi-square, exponential, Weibull, lognormal, etc.
• They are known as continuous density functions.
• For a probability density function, the area under the curve (the integral of the function) is 1, so the area over any interval can be interpreted as a probability.
• Further, joint distributions, conditional distributions and many more.
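The area-under-the-curve property can be checked numerically. The sketch below uses the standard normal density and a simple trapezoidal rule; the integration limits and step count are arbitrary choices for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# Almost all of the probability mass lies within 10 standard deviations.
total_area = integrate(normal_pdf, -10.0, 10.0)
# Area over an interval: probability of landing within one standard deviation.
p_within_one_sd = integrate(normal_pdf, -1.0, 1.0)

print("area under the curve:", round(total_area, 6))
print("P(-1 <= X <= 1):", round(p_within_one_sd, 4))
```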
Fitting a Model
• Fitting a model means estimating the parameters of the model.
  • What distribution? What are the values of min, max, mean, std dev, etc.?
• It involves algorithms such as maximum likelihood estimation (MLE) and optimization methods.
• Example: y = β1 + β2 ∗ x → y = 7.2 + 4.5 ∗ x
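As a sketch of what fitting means for the example above, the code below generates synthetic data from y = 7.2 + 4.5 ∗ x with added Gaussian noise and then estimates β1 and β2 by least squares. The noise level, sample size, and the helper name `fit_line` are assumptions. Least squares is used here because, if the noise is assumed Gaussian, it gives the same estimates as MLE; the slides do not say which method produced the numbers in the example.

```python
import random

rng = random.Random(0)
true_b1, true_b2 = 7.2, 4.5  # parameter values from the example above

# Synthetic data: points on the line plus Gaussian noise (assumed noise level 1.0).
xs = [rng.uniform(0.0, 10.0) for _ in range(1_000)]
ys = [true_b1 + true_b2 * x + rng.gauss(0.0, 1.0) for x in xs]

def fit_line(xs, ys):
    """Least-squares estimates of intercept b1 and slope b2 for y = b1 + b2 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b2 = sxy / sxx
    b1 = mean_y - b2 * mean_x
    return b1, b2

b1_hat, b2_hat = fit_line(xs, ys)
print("estimated b1:", round(b1_hat, 2), "estimated b2:", round(b2_hat, 2))
```

The estimates land close to, but not exactly on, the values that generated the data, since they are computed from a finite noisy sample.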
3. Exploratory Data Analysis
What you will learn about in this section
1. Intro to Exploratory Data Analysis (EDA)
2. Activity: EDA in Jupyter
Activity
• Notebook link provided on the website.