
Post on 18-Oct-2020


CS639: Data Management for Data Science

Lecture 2: Statistical Inference and Exploratory Data Analysis

Theodoros Rekatsinas

Announcements

• Waiting list: you receive invitations to register and you have two days to reply.

• Piazza: you need to register to engage in discussions and receive announcements

• Announcements: update the class website; announcements will be posted there

First assignment (P0)

• Create a GitHub account and clone the GitHub repository of the class.

• Deploy the class VM (instructions in the slides of Lecture 1)

Today’s Lecture

1. Quick Recap: The data science workflow

2. Statistical Inference

3. Exploratory Data Analysis
  • Activity: EDA in Jupyter notebook

1. Quick Recap: The DS Workflow

Section 1

One definition of data science

Section 2

Data science is a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis and extraction of valuable knowledge and information from raw data.

Source: Techopedia

Data science workflow

https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

What is wrong here?

Data science workflow

Data science is not (only) about hacking!

Your mind-set should be “statistical thinking in the age of big-data”

2. Statistical Inference

What you will learn about in this section

1. Uncertainty and Randomness in Data

2. Modeling Data

3. Samples and Distributions

Uncertainty and Randomness

• Data represents the traces of real-world processes.
  • The collected traces correspond to a sample of those processes.

• There is randomness and uncertainty in the data collection process.

• The process that generates the data is stochastic (random).
  • Example: Let’s toss a coin! What will the outcome be? Heads or tails? There are many factors that make a coin toss a stochastic process.

• The sampling process introduces uncertainty.
  • Example: Errors in sensor position due to GPS error, errors due to the angles of laser travel, etc.
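The coin-toss example can be tried out in a few lines. The sketch below is only an illustration: it assumes a fair coin (`p_heads=0.5`), and the function name `toss_coins` and the sample sizes are made up for this example. It shows that repeated samples from the same stochastic process give different traces.

```python
import random

def toss_coins(n_tosses, p_heads=0.5, seed=None):
    """Simulate n_tosses coin flips and return the fraction that came up heads."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_tosses) if rng.random() < p_heads)
    return heads / n_tosses

# Five independent samples of 100 tosses each: same process, different traces.
fractions = [toss_coins(100, seed=s) for s in range(5)]
print(fractions)

# A much larger sample lands closer to p_heads, but some spread always remains.
large = toss_coins(100_000, seed=42)
print(large)
```

Fixing the seed makes a run reproducible, which is convenient in a notebook; dropping it gives a fresh sample every time.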

Models

• Data represents the traces of real-world processes.

• Part of the data science process: We need to model the real-world.

• A model is a function fθ(x)
  • x: input variables (can be a vector)
  • θ: model parameters

Modeling Uncertainty and Randomness

• Data represents the traces of real-world processes.

• There is randomness and uncertainty in the data collection process.

• A model is a function fθ(x)
  • x: input variables (can be a vector)
  • θ: model parameters

• Models should rely on probability theory to capture uncertainty and randomness!
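One common way to combine a function fθ(x) with probability theory is to add a random noise term to its output. The sketch below assumes a linear fθ and zero-mean Gaussian noise; both choices, the parameter values, and the names `f` and `sample_observation` are illustrative assumptions, not something the slides prescribe.

```python
import random

def f(theta, x):
    """Deterministic part of the model: a hypothetical linear f_theta(x)."""
    intercept, slope = theta
    return intercept + slope * x

def sample_observation(theta, x, sigma, rng):
    """One noisy observation: f_theta(x) plus zero-mean Gaussian noise."""
    return f(theta, x) + rng.gauss(0.0, sigma)

rng = random.Random(0)
theta = (1.0, 2.0)  # illustrative parameter values, not from the lecture

# Many observations at the same input x scatter around f_theta(x).
observations = [sample_observation(theta, 3.0, sigma=0.5, rng=rng)
                for _ in range(10_000)]
mean_obs = sum(observations) / len(observations)
print(f(theta, 3.0), mean_obs)
```

Individual observations differ from run to run, while their average stays close to fθ(x); the model separates the systematic part of the data from the random part.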

Modeling Example

The model corresponds to a linear function

Population and Samples

• Population is the complete set of traces/data points.
  • For example, the US population is 314 million and the world population is 7 billion
  • All voters, all things

• Sample is a subset of the complete set (or population).
  • How we select the sample introduces biases into the data

• Population → sample → mathematical model

Population and Samples

• Example: Emails sent by people in the CS dept. in a year.

• Method 1: 1/10 of all emails over the year randomly chosen

• Method 2: 1/10 of people randomly chosen; all their email over the year

• Both are reasonable sample selection methods for analysis.

• However, the estimated pdfs (probability distribution functions) of the emails sent by a person will be different for the two samples.
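The difference between the two methods shows up in a small simulation. The sketch below uses an entirely synthetic email log (200 hypothetical people, each sending between 10 and 500 emails); the function names and all numbers are assumptions made for illustration. It compares the average number of emails per person seen in each sample.

```python
import random
from collections import Counter

def simulate_email_log(n_people=200, seed=0):
    """Synthetic log: one entry (the sender's id) per email sent in the year."""
    rng = random.Random(seed)
    emails = []
    for person in range(n_people):
        n_sent = rng.randint(10, 500)  # made-up yearly volume per person
        emails.extend([person] * n_sent)
    return emails

def sample_emails(emails, fraction=0.1, seed=1):
    """Method 1: choose a random fraction of all emails."""
    rng = random.Random(seed)
    return rng.sample(emails, int(len(emails) * fraction))

def sample_people(emails, fraction=0.1, seed=1):
    """Method 2: choose a random fraction of people, keep all their emails."""
    rng = random.Random(seed)
    people = sorted(set(emails))
    chosen = set(rng.sample(people, int(len(people) * fraction)))
    return [sender for sender in emails if sender in chosen]

def mean_emails_per_person(sample):
    """Average number of emails per person, as seen in the sample."""
    counts = Counter(sample)
    return sum(counts.values()) / len(counts)

emails = simulate_email_log()
m1 = mean_emails_per_person(sample_emails(emails))
m2 = mean_emails_per_person(sample_people(emails))
print(m1, m2)
```

Under Method 1 each person appears with roughly a tenth of their real volume, so the per-person counts are far lower than under Method 2, which keeps every email of the selected people. Any per-person distribution estimated from a Method 1 sample has to be rescaled before it says anything about actual sending behavior.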

Back to Models

• Abstraction of a real world process

• How to build a model?

• Probability distribution functions (pdfs) are building blocks of statistical models.

Probability Distributions

• Normal, uniform, Cauchy, t-, F-, Chi-square, exponential, Weibull, lognormal, etc.

• They are known as continuous density functions

• For a probability density function, the area under the curve (the integral of the function over its whole domain) is 1, which allows areas under the curve to be interpreted as probabilities.

• Further, joint distributions, conditional distributions and many more.
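The area-under-the-curve property can be checked numerically. The sketch below takes the standard normal density as an example and integrates it with a simple trapezoidal rule; the helper names `normal_pdf` and `integrate`, the integration limits, and the step count are choices made for this illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(func, a, b, n=20_000):
    """Trapezoidal rule on n equal subintervals of [a, b]."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h

# The tails beyond +/-10 standard deviations are negligible, so this
# approximates the area under the whole curve.
total_area = integrate(normal_pdf, -10.0, 10.0)

# The area over an interval is the probability of falling in that interval.
prob_within_one_sd = integrate(normal_pdf, -1.0, 1.0)
print(total_area, prob_within_one_sd)
```

The first value comes out at approximately 1, and the second at approximately 0.68, the familiar probability of a normal variable falling within one standard deviation of its mean.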

Fitting a Model

• Fitting a model means estimating the parameters of the model.
  • What distribution, what are the values of min, max, mean, stddev, etc.

• It involves algorithms such as maximum likelihood estimation (MLE) and optimization methods.

• Example: y = β1 + β2 ∗ x → y = 7.2 + 4.5 ∗ x
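To make the fitting step concrete, the sketch below estimates β1 and β2 by ordinary least squares, which coincides with the maximum likelihood estimate when the noise is assumed Gaussian with constant variance. The data are synthetic: they are generated from the slide's example line y = 7.2 + 4.5 ∗ x with added noise, and the noise level, sample size, and the name `fit_line` are assumptions made for this illustration.

```python
import random

def fit_line(xs, ys):
    """Closed-form least-squares estimates (beta1, beta2) for y = beta1 + beta2 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta2 = sxy / sxx                 # slope
    beta1 = mean_y - beta2 * mean_x   # intercept
    return beta1, beta2

# Synthetic data: the example line from the slide plus Gaussian noise (assumed).
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(500)]
ys = [7.2 + 4.5 * x + rng.gauss(0.0, 1.0) for x in xs]

beta1_hat, beta2_hat = fit_line(xs, ys)
print(beta1_hat, beta2_hat)
```

The estimates land near, but not exactly on, 7.2 and 4.5: the fitted parameters are themselves random quantities that depend on the sample drawn.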

3. Exploratory Data Analysis

Section 3

What you will learn about in this section

1. Intro to Exploratory Data Analysis (EDA)

2. Activity: EDA in Jupyter

Activity

• Notebook link provided on website.