CS639: Data Management for Data Science Lecture 2: Statistical Inference and Exploratory Data Analysis Theodoros Rekatsinas 1


CS639: Data Management for Data Science

Lecture 2: Statistical Inference and Exploratory Data Analysis

Theodoros Rekatsinas


Announcements

• Waiting list: you receive invitations to register, and you have two days to reply.

• Piazza: you need to register to engage in discussions and receive announcements.

• Announcements: update the class website; announcements will be posted there.


First assignment (P0)

• Create a GitHub account and clone the GitHub repository of the class.

• Deploy the class VM (instructions in the slides of Lecture 1).


Today’s Lecture

1. Quick Recap: The data science workflow

2. Statistical Inference

3. Exploratory Data Analysis
   • Activity: EDA in Jupyter notebook


1. Quick Recap: The DS Workflow


One definition of data science

Data science is a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis and extraction of valuable knowledge and information from raw data.

Source: Techopedia


Data science workflow

https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

What is wrong here?


Data science workflow

Data science is not (only) about hacking!


Your mind-set should be “statistical thinking in the age of big data”


2. Statistical Inference


What you will learn about in this section

1. Uncertainty and Randomness in Data

2. Modeling Data

3. Samples and Distributions


Uncertainty and Randomness

• Data represents the traces of real-world processes.
  • The collected traces correspond to a sample of those processes.

• There is randomness and uncertainty in the data collection process.

• The process that generates the data is stochastic (random).
  • Example: Let’s toss a coin! What will the outcome be? Heads or tails? There are many factors that make a coin toss a stochastic process.

• The sampling process introduces uncertainty.
  • Example: Errors in sensor position due to GPS error, errors due to the angle of laser travel, etc.
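The coin-toss example can be made concrete with a small simulation. This is only a sketch: the helper names `toss_coin` and `fraction_heads`, the fair-coin default `p_heads=0.5`, and the seeds are assumptions chosen for illustration, not part of the lecture.

```python
import random

def toss_coin(rng, p_heads=0.5):
    """Return 'H' or 'T'; p_heads is an assumed bias parameter (0.5 = fair coin)."""
    return "H" if rng.random() < p_heads else "T"

def fraction_heads(n_tosses, seed):
    """Toss a simulated coin n_tosses times and return the fraction of heads."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    heads = sum(toss_coin(rng) == "H" for _ in range(n_tosses))
    return heads / n_tosses

# Each run is one sample of the same stochastic process.
# Small samples tend to fluctuate more than large ones.
for n in (10, 100, 10_000):
    print(n, fraction_heads(n, seed=0))
```

A single toss cannot be predicted, but the fraction of heads over many tosses tends to settle near the underlying probability, which is the sense in which the collected traces are a sample of the process.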


Models

• Data represents the traces of real-world processes.

• Part of the data science process: we need to model the real world.

• A model is a function f_θ(x)
  • x: input variables (can be a vector)
  • θ: model parameters


Modeling Uncertainty and Randomness

• Data represents the traces of real-world processes.

• There is randomness and uncertainty in the data collection process.

• A model is a function f_θ(x)
  • x: input variables (can be a vector)
  • θ: model parameters

• Models should rely on probability theory to capture uncertainty and randomness!
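One common way to combine a function f_θ(x) with probability theory is to treat each observation as the function value plus random noise. The sketch below assumes a linear f, zero-mean Gaussian noise, and made-up parameter values; none of these choices come from the slides.

```python
import random

def f(theta, x):
    """Deterministic part of the model: a line with theta = (intercept, slope)."""
    intercept, slope = theta
    return intercept + slope * x

def observe(theta, x, noise_std, rng):
    """One noisy observation: f_theta(x) plus zero-mean Gaussian noise (an assumption)."""
    return f(theta, x) + rng.gauss(0.0, noise_std)

rng = random.Random(42)
theta = (1.0, 2.0)  # hypothetical parameter values, for illustration only
samples = [observe(theta, 3.0, 0.5, rng) for _ in range(5)]
print(samples)  # repeated observations at the same x scatter around f(theta, 3.0)
```

Here θ describes the systematic part of the process, while the noise term stands in for the randomness and uncertainty of data collection.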

Modeling Example

The model corresponds to a linear function.


Population and Samples

• The population is the complete set of traces/data points.
  • For example, the US population is 314 million and the world population is 7 billion.
  • All voters, all things

• A sample is a subset of the complete set (or population).
  • How we select the sample introduces biases into the data.

• Population → sample → mathematical model


Population and Samples

• Example: Emails sent by people in the CS dept. in a year.

• Method 1: 1/10 of all emails over the year, randomly chosen

• Method 2: 1/10 of people, randomly chosen; all their email over the year

• Both are reasonable sample selection methods for analysis.

• However, the estimated pdfs (probability distribution functions) of the emails sent by a person will be different for the two samples.
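The difference between the two methods can be seen on synthetic data. In this sketch the population (200 people, each sending between 10 and 2000 emails) is entirely made up, and the function names `make_population`, `method_1`, and `method_2` are hypothetical.

```python
import random
from collections import Counter

def make_population(n_people, rng):
    """Hypothetical population: each person sends between 10 and 2000 emails a year."""
    return {person: rng.randint(10, 2000) for person in range(n_people)}

def method_1(emails_per_person, rng):
    """Sample 1/10 of all emails uniformly at random; count sampled emails per sender."""
    all_emails = [p for p, n in emails_per_person.items() for _ in range(n)]
    sampled = rng.sample(all_emails, len(all_emails) // 10)
    return Counter(sampled)

def method_2(emails_per_person, rng):
    """Sample 1/10 of the people uniformly at random; keep all of their emails."""
    people = rng.sample(sorted(emails_per_person), len(emails_per_person) // 10)
    return {p: emails_per_person[p] for p in people}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rng = random.Random(0)
population = make_population(200, rng)
counts_1 = method_1(population, rng)
counts_2 = method_2(population, rng)
print("true mean emails per person:", mean(population.values()))
print("method 1, mean per observed sender:", mean(counts_1.values()))
print("method 2, mean per sampled person:", mean(counts_2.values()))
```

Under Method 1 almost every person shows up, but with only about a tenth of their emails, so per-person counts look small. Under Method 2 only a tenth of the people show up, but each with a complete yearly count. The two samples therefore give different pictures of "emails sent by a person".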


Back to Models

• Abstraction of a real-world process

• How to build a model?

• Probability distribution functions (pdfs) are building blocks of statistical models.


Probability Distributions

• Normal, uniform, Cauchy, t-, F-, Chi-square, exponential, Weibull, lognormal, etc.

• They are known as continuous density functions.

• For a probability density function, integrating the function over its whole domain gives an area under the curve of 1, which allows areas under the curve to be interpreted as probabilities.

• Beyond these, there are joint distributions, conditional distributions, and many more.
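The "area under the curve is 1" property can be checked numerically. The sketch below uses the standard normal density and a simple trapezoidal rule; the helper names and the integration range of ±10 are choices made for this illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, n=10_000):
    """Trapezoidal rule on [a, b] with n equal-width intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

total_area = integrate(normal_pdf, -10.0, 10.0)  # tails beyond ±10 are negligible
p_within_1 = integrate(normal_pdf, -1.0, 1.0)    # P(-1 <= X <= 1) for a standard normal
print(total_area, p_within_1)
```

The first integral comes out very close to 1, and the second is the familiar probability of roughly 0.68 that a standard normal value falls within one standard deviation of the mean.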


Fitting a Model

• Fitting a model means estimating the parameters of the model.
  • What distribution? What are the values of min, max, mean, std dev, etc.?

• It involves algorithms such as maximum likelihood estimation (MLE) and optimization methods.

• Example: y = β1 + β2 * x → y = 7.2 + 4.5 * x
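As a rough illustration of fitting, the sketch below generates synthetic data from y = 7.2 + 4.5 * x with added Gaussian noise and then estimates β1 and β2 by ordinary least squares. The data, the noise level, and the function name `fit_line` are assumptions for this example. Least squares is used here because, if the noise is assumed to be independent Gaussian with constant variance, it gives the same estimates as maximum likelihood.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = b1 + b2 * x; returns (b1, b2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b2 = s_xy / s_xx            # slope estimate
    b1 = y_bar - b2 * x_bar     # intercept estimate
    return b1, b2

# Synthetic data (not from the lecture): true line plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(1000)]
ys = [7.2 + 4.5 * x + rng.gauss(0.0, 1.0) for x in xs]

b1_hat, b2_hat = fit_line(xs, ys)
print(f"y = {b1_hat:.2f} + {b2_hat:.2f} * x")
```

The estimates should land close to 7.2 and 4.5 but not exactly on them, since they are computed from a finite, noisy sample.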


3. Exploratory Data Analysis


What you will learn about in this section

1. Intro to Exploratory Data Analysis (EDA)

2. Activity: EDA in Jupyter


Activity

• Notebook link provided on website.