

1

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 3a, February 4, 2014, SAGE 3101

Preliminary Analysis, Interpretation, Detailed Analysis,

Assessment

Contents

• PDA

• Interpreting what you get back (the stats, the plots)

• Detailed analyses/fitting – a start

• How to assess/intercompare

2

Preliminary Data Analysis

• Relates to the sample v. population (for Big Data) discussion last week

• Also called Exploratory DA

– “EDA is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe will be there” (John Tukey)

• Distribution analysis and comparison, visual ‘analysis’, model testing, i.e. pretty much the things you did last Friday!

• Thus we are going to review those results

3

Patterns and Relationships

• Stepping from elementary/distribution analysis to algorithmic-based analysis

• I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines, non-parametric models

• Relations – associations between/among populations

• Outcome: model and an evaluation of its fitness for purpose

4

Models

• Assumptions are often used when considering models, e.g. as being representative of the population – since they are so often derived from a sample – this should be starting to make sense (a bit)

• Two key topics:

– N=all and the open world assumption

– Model of the thing of interest versus model of the data (data model; structural form)

• “All models are wrong but some are useful” (generally attributed to the statistician George Box)

5

Conceptual, logical and physical models

6

Applied to a database:

However our models will be mathematical, statistical, or a combination.

The concept of the model comes from the hypothesis

The implementation of the physical model comes from the data ;-)

Art or science?

• The form of the model, incorporating the hypothesis, determines a “form”

• Thus, as much art as science because it depends both on your world view and what the data is telling you (or not)

• We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc.

7

Exploring the distribution

> summary(EPI) # stats

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

32.10 48.60 59.20 58.37 67.60 93.50 68

> boxplot(EPI)

> fivenum(EPI,na.rm=TRUE)

[1] 32.1 48.6 59.2 67.6 93.5

Tukey: min, lower hinge, median, upper hinge, max
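The quartiles printed by summary() and the hinges printed by fivenum() need not agree, because summary() interpolates quantiles while fivenum() uses Tukey's hinges. A minimal sketch on a made-up vector (an assumption for illustration, not the EPI data) shows the difference:

```r
# Hypothetical toy vector, chosen so the two definitions disagree.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

hinges <- fivenum(x)                                  # Tukey: min, lower hinge, median, upper hinge, max
quarts <- quantile(x, c(0.25, 0.75), names = FALSE)   # default (type 7) quartiles, as used by summary()

print(hinges)  # 1.0 2.5 4.5 6.5 8.0
print(quarts)  # 2.75 6.25
```

The same effect is why a 1st Qu. from summary() can differ slightly from the lower hinge reported by fivenum() on real data.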

8

Stem and leaf plot

> stem(EPI) # like a histogram

The decimal point is 1 digit(s) to the right of the | (but the scale of the stem is 10, so watch carefully)

3 | 234

3 | 66889

4 | 00011112222223344444

4 | 5555677788888999

5 | 0000111111111244444

5 | 55666677778888999999

6 | 000001111111222333344444

6 | 5555666666677778888889999999

7 | 000111233333334

7 | 5567888

8 | 11

8 | 669

9 | 4

9

Histogram

> hist(EPI) # defaults

10

Distributions

• Shape

• Character

• Parameter(s)

• Which one fits?

11

12

> hist(EPI, seq(30., 95., 1.0), prob=TRUE)

> lines(density(EPI,na.rm=TRUE,bw=1.))

> rug(EPI)

or

> lines(density(EPI,na.rm=TRUE,bw="SJ"))

13

> hist(EPI, seq(30., 95., 1.0), prob=TRUE)

> lines(density(EPI,na.rm=TRUE,bw="SJ"))

Why are histograms so unsatisfying?
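One reason histograms can be unsatisfying is that their shape depends on arbitrary choices of bin width and bin origin. A minimal sketch, using a made-up six-value sample (an assumption, not course data), keeps the bin width fixed and only shifts the origin:

```r
# Hypothetical toy sample: values clustered just either side of 2, 3 and 4.
x <- c(1.9, 2.1, 2.9, 3.1, 3.9, 4.1)

# Same bin width (2), two different bin origins; plot = FALSE returns the counts only.
h0 <- hist(x, breaks = seq(0, 6, by = 2), plot = FALSE)
h1 <- hist(x, breaks = seq(1, 7, by = 2), plot = FALSE)

print(h0$counts)  # 1 4 1 : looks peaked in the middle
print(h1$counts)  # 3 3 0 : looks flat, then empty
```

Kernel density estimates, ECDFs and Q-Q plots avoid the dependence on bin edges, although a density estimate still depends on the bandwidth.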

14

> xn<-seq(30,95,1)

> qn<-dnorm(xn,mean=63, sd=5,log=FALSE)

> lines(xn,qn)

> lines(xn,.4*qn)

> ln<-dnorm(xn,mean=44, sd=5,log=FALSE)

> lines(xn,.26*ln)
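The scaled dnorm() curves above amount to a hand-tuned two-component normal mixture. As a rough sketch of the same idea, the block below uses synthetic data rather than EPI; the means (44, 63) and sd (5) are taken from the slide, while the weights 0.3 and 0.7 and the sample sizes are assumptions chosen so that the mixture is a proper density.

```r
set.seed(1)
# Synthetic stand-in for EPI (NOT the real data): 30% near 44, 70% near 63.
w <- c(0.3, 0.7)  # assumed mixture weights, chosen to sum to 1
sim <- c(rnorm(60, mean = 44, sd = 5), rnorm(140, mean = 63, sd = 5))

# Two-component normal mixture density.
mix <- function(x) {
  w[1] * dnorm(x, mean = 44, sd = 5) + w[2] * dnorm(x, mean = 63, sd = 5)
}

hist(sim, breaks = 30, prob = TRUE)  # density-scaled histogram
xn <- seq(20, 90, by = 0.5)
lines(xn, mix(xn))                   # overlay the mixture curve

# Riemann-sum check on the grid: a proper density has total area close to 1.
area <- sum(mix(xn)) * 0.5
print(area)
```

When the weights do not sum to 1, as with a visual fit by eye, the overlaid curve is no longer a density and its height cannot be compared directly with a prob=TRUE histogram.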

15

ELand ~ EPI!Landlock

> hist(ELand, seq(30., 95., 1.0), prob=TRUE); lines …

16

No surface water

17

> EPIreg<-EPI_data$EPI[EPI_data$EPI_regions=="Europe"]
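The subsetting idiom on this slide is logical indexing: a comparison produces a TRUE/FALSE vector that selects the matching elements. A small sketch with a made-up data frame (column names follow the slide; the values are hypothetical):

```r
# Hypothetical miniature of EPI_data; values are made up for illustration.
EPI_data <- data.frame(
  EPI = c(70.1, 55.3, NA, 62.8, 48.9),
  EPI_regions = c("Europe", "Asia", "Europe", "Europe", "Africa")
)

# Keep only the EPI values whose region is "Europe".
EPIreg <- EPI_data$EPI[EPI_data$EPI_regions == "Europe"]

print(EPIreg)                       # 70.1 NA 62.8
eur_mean <- mean(EPIreg, na.rm = TRUE)
print(eur_mean)                     # NAs survive the subset, so drop them explicitly
```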

18

Exploring other distributions

> summary(DALY) # stats

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

0.00 37.19 60.35 53.94 71.97 91.50 39

> fivenum(DALY,na.rm=TRUE)

[1] 0.000 36.955 60.350 72.320 91.500

19

EPI DALY

Stem and leaf plot

> stem(DALY)

The decimal point is 1 digit(s) to the right of the |

0 | 0000111244

0 | 567899

1 | 0234

1 | 56688

2 | 000123

2 | 5667889

3 | 00001134

3 | 5678899

4 | 00011223444

4 | 555799

5 | 12223344

5 | 556667788999999

6 | 0000011111222233334444

6 | 6666666677788889999

7 | 00000000223333444

7 | 66888999

8 | 1113333333

8 | 555557777777777799999

9 | 22

20

DALY

> hist(DALY, seq(0., 99., 1.0), prob=TRUE)

> lines(density(DALY, na.rm=TRUE,bw=1.))

> lines(density(DALY, na.rm=TRUE,bw="SJ"))

21

Beyond histograms

• Cumulative distribution function: probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.

> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
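To make the definition concrete, ecdf() returns a step function that can be evaluated at any x to give the fraction of the sample less than or equal to x. A minimal sketch on a made-up sample (an assumption, not EPI):

```r
x <- c(1, 2, 2, 3, 5)  # hypothetical toy sample
Fn <- ecdf(x)          # empirical CDF as a callable step function

print(Fn(2))  # 0.6 : three of the five values are <= 2
print(Fn(0))  # 0   : nothing lies at or below 0
print(Fn(5))  # 1   : everything lies at or below the maximum
```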

22

Beyond histograms

• Quantile ~ inverse cumulative distribution function – points taken at regular intervals from the CDF, e.g. 2-quantiles=median, 4-quantiles=quartiles

• Quantile-Quantile (versus default=normal dist.)

> par(pty="s")

> qqnorm(EPI); qqline(EPI)
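The link between quantiles and the CDF can be checked directly. The sketch below uses a made-up sample (an assumption); type = 1 is the quantile() option that inverts the empirical CDF exactly, while the default type interpolates.

```r
x <- 1:9  # hypothetical toy sample

q2 <- quantile(x, 0.5, names = FALSE)                 # the 2-quantile
q4 <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)  # the 4-quantiles
print(q2 == median(x))  # TRUE
print(q4)               # 3 5 7

# type = 1: the smallest sample value whose ECDF reaches p.
p <- 0.5
q_inv <- quantile(x, p, type = 1, names = FALSE)
print(ecdf(x)(q_inv) >= p)  # TRUE
```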

23

Beyond histograms

• Simulated data from t-distribution (random):

> x <- rt(250, df = 5)

> qqnorm(x); qqline(x)

24

Beyond histograms

• Q-Q plot against the generating distribution:

> x<-seq(30,95,1)

> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")

> qqline(x)
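As a hedged variation on this slide, the sketch below compares a simulated t sample with quantiles of the t distribution that generated it. The seed, axis labels and the use of abline(0, 1) are assumptions made here: qqline() defaults to normal quantiles, so a y = x line is drawn instead as the reference for a correctly specified distribution.

```r
set.seed(2)
x <- rt(250, df = 5)  # simulated heavy-tailed sample

# Theoretical t quantiles at the same plotting positions as the sorted sample.
theo <- qt(ppoints(250), df = 5)

par(pty = "s")
qq <- qqplot(theo, x, xlab = "t quantiles (df = 5)", ylab = "sample quantiles")
abline(0, 1)  # reference line y = x

# qqplot() returns the sorted pairs; their correlation is near 1 for a good match.
r <- cor(qq$x, qq$y)
print(r)
```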

25

DALY (ecdf and qqplot)

26

Weibull qqplot……..

27

Testing the fits

> shapiro.test(EPI) # null hypothesis – normal?

Shapiro-Wilk normality test

data: EPI

W = 0.9866, p-value = 0.1188

Interpretation: W and the p-value

Reject null hypothesis or not? Here, no: p-value = 0.1188 > 0.05, so we fail to reject normality for EPI.

DALY: W = 0.9365, p-value = 1.891e-07 (reject)
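The decision rule can be sketched in a few lines. The two samples below are synthetic (assumptions, not EPI or DALY), and decide() is a hypothetical helper introduced only for illustration; the last two calls apply it to the p-values quoted on the slide.

```r
set.seed(3)
norm_x <- rnorm(200, mean = 58, sd = 12)  # synthetic, roughly bell-shaped
skew_x <- rexp(200, rate = 1 / 20)        # synthetic, strongly right-skewed

sw_norm <- shapiro.test(norm_x)
sw_skew <- shapiro.test(skew_x)

# Hypothetical helper: turn a p-value into a decision at significance level alpha.
decide <- function(p, alpha = 0.05) {
  if (p < alpha) "reject normality" else "fail to reject normality"
}

print(c(W = unname(sw_norm$statistic), p = sw_norm$p.value))
print(decide(sw_skew$p.value))  # the skewed sample is rejected
print(decide(0.1188))           # EPI p-value from the slide
print(decide(1.891e-07))        # DALY p-value from the slide
```

Failing to reject does not prove normality; it only says the sample gives no strong evidence against it.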

28

Kolmogorov–Smirnov

• One-sided or two-sided:

> ks.test(EPI,seq(30.,95.,1.0))

Two-sample Kolmogorov-Smirnov test

data: EPI and seq(30, 95, 1)

D = 0.2507, p-value = 0.005451

alternative hypothesis: two-sided

Warning message:

In ks.test(EPI, seq(30, 95, 1)) :

p-value will be approximate in the presence of ties

D = distance between the ECDF of the sample (blue) and the CDF (red) for the one-sided case. The p-value is what matters: do not reject the null hypothesis if p-value > 0.05.
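Note that ks.test(EPI, seq(30.,95.,1.0)) is reported as a two-sample test: it compares EPI with an evenly spaced grid of values rather than with a named distribution. The sketch below, on synthetic samples (assumptions, not course data), contrasts a one-sample test against a fully specified normal CDF with a two-sample test, and recomputes D by hand as the largest vertical gap between two ECDFs.

```r
set.seed(4)
a <- rnorm(200, mean = 58, sd = 12)  # synthetic sample
b <- rnorm(200, mean = 70, sd = 12)  # synthetic sample with a shifted mean

# One-sample: ECDF of `a` versus a fully specified normal CDF.
ks_one <- ks.test(a, "pnorm", mean = 58, sd = 12)

# Two-sample: ECDF of `a` versus ECDF of `b`.
ks_two <- ks.test(a, b)

# D by hand for the two-sample case: evaluate both ECDFs at every observed value.
grid <- sort(c(a, b))
D <- max(abs(ecdf(a)(grid) - ecdf(b)(grid)))

print(ks_one$p.value)
print(c(D_manual = D, D_ks = unname(ks_two$statistic)))
print(ks_two$p.value)  # small: the shifted samples differ
```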

29

Variability in normal distributions

30

F-test

31

$F = S_1^2 / S_2^2$

where $S_1^2$ and $S_2^2$ are the sample variances.

The more this ratio deviates from 1, the stronger the evidence for unequal population variances.

> var.test(EPI,DALY)

F test to compare two variances

data: EPI and DALY

F = 0.2393, num df = 162, denom df = 191, p-value < 2.2e-16

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

0.1781283 0.3226470

sample estimates:

ratio of variances

0.2392948

32
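The F statistic reported by var.test() is just the ratio of the two sample variances, and the two-sided p-value comes from the F distribution with (n1 - 1, n2 - 1) degrees of freedom. A sketch on synthetic samples (assumptions; only the sample sizes mirror the degrees of freedom shown above):

```r
set.seed(5)
a <- rnorm(163, mean = 58, sd = 12)  # synthetic; n chosen to give num df = 162
b <- rnorm(192, mean = 54, sd = 25)  # synthetic; n chosen to give denom df = 191

F_manual <- var(a) / var(b)          # ratio of sample variances
vt <- var.test(a, b)

# Two-sided p-value from the F distribution.
df1 <- length(a) - 1
df2 <- length(b) - 1
p_low <- pf(F_manual, df1, df2)
p_manual <- 2 * min(p_low, 1 - p_low)

print(c(F_manual = F_manual, F_var.test = unname(vt$statistic)))
print(c(p_manual = p_manual, p_var.test = vt$p.value))
```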

T-test

33

Comparing distributions

> t.test(EPI,DALY)

Welch Two Sample t-test

data: EPI and DALY

t = 2.1361, df = 286.968, p-value = 0.03352

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

0.3478545 8.5069998

sample estimates:

mean of x mean of y

58.37055 53.94313
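t.test() defaults to the Welch form, which does not assume equal variances; that suits this comparison, since the F-test above rejected equal variances. The sketch below recomputes the Welch statistic and its degrees of freedom by hand on synthetic samples (assumptions, not EPI or DALY) and compares them with t.test().

```r
set.seed(6)
a <- rnorm(163, mean = 58, sd = 12)  # synthetic sample
b <- rnorm(192, mean = 54, sd = 25)  # synthetic sample with a larger variance

# Squared standard errors of the two sample means.
se2a <- var(a) / length(a)
se2b <- var(b) / length(b)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_manual <- (mean(a) - mean(b)) / sqrt(se2a + se2b)
df_manual <- (se2a + se2b)^2 /
  (se2a^2 / (length(a) - 1) + se2b^2 / (length(b) - 1))

tt <- t.test(a, b)  # var.equal = FALSE by default, i.e. Welch
print(c(t_manual = t_manual, t_t.test = unname(tt$statistic)))
print(c(df_manual = df_manual, df_t.test = unname(tt$parameter)))
```

The non-integer degrees of freedom in the output above (df = 286.968) are a signature of this approximation.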

34

Comparing distributions

> boxplot(EPI,DALY)

35

CDF for EPI and DALY

36

> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)

> plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)

qqplot(EPI,DALY)

37

Oops, did we forget?

38

Goal?

• Find the single most important factor in increasing the EPI in a given region

• Preceding table gives a nested conceptual model

• Examine distributions down to the leaf nodes and build up an EPI “model”

39

boxplot(ENVHEALTH,ECOSYSTEM)

40

qqplot(ENVHEALTH,ECOSYSTEM)

41

ENVHEALTH/ECOSYSTEM

> shapiro.test(ENVHEALTH)

Shapiro-Wilk normality test

data: ENVHEALTH

W = 0.9161, p-value = 1.083e-08 (reject)

> shapiro.test(ECOSYSTEM)

Shapiro-Wilk normality test

data: ECOSYSTEM

W = 0.9813, p-value = 0.02654 (reject at the 5% level, but only marginally)

42

Kolmogorov-Smirnov (KS) test

> ks.test(EPI,DALY)

Two-sample Kolmogorov-Smirnov test

data: EPI and DALY

D = 0.2331, p-value = 0.0001382

alternative hypothesis: two-sided

Warning message:

In ks.test(EPI, DALY) : p-value will be approximate in the presence of ties

43

44

How are the software installs going?

• R/Scipy (et al)/Matlab – getting comfortable?

• Data infrastructure …

• http://hyperpolyglot.org/numerical-analysis (Matlab, R, scipy/numpy) table comparison

45

Tentative assignments

• Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7) 10% (lab; individual);

• Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual);

• Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual);

• Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual);

• Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);

• Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

46

Admin info (keep/print this slide)

• Class: ITWS-4963/ITWS 6965

• Hours: 12:00pm-1:50pm Tuesday/Friday

• Location: SAGE 3101

• Instructor: Peter Fox

• Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg)

• Contact hours: Monday** 3:00-4:00pm (or by email appt)

• Contact location: Winslow 2120 (sometimes Lally 207A announced by email)

• TA: Lakshmi Chenicheri chenil@rpi.edu

• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014

– Schedule, lectures, syllabus, reading, assignments, etc.

47
