ms1b statistical data miningteh/teaching/datamining/... · 2016. 11. 28. · plenty of retailers...

50
MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford 1/1

Upload: others

Post on 31-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

MS1b Statistical Data Mining

Yee Whye TehDepartment of Statistics

Oxford

1 / 1

Page 2: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Outline

2 / 1

Page 3: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Outline

3 / 1

Page 4: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Administrivia

Lectures

- Wednesdays 1100-1200, Weeks 1-8.- Thursdays 1100-1200, Weeks 1,3,5,7.

Part C students- Practical classes: Wednesdays 1100-1200, Weeks 2,4,6,8.- Problem classes: Thursday ????-????, Weeks 2-8.

MSc students

- Practical class: Friday afternoon, Week 3.

4 / 1

Page 5: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Syllabus IPart I: Dimensionality Reduction

- Principal Components Analysis- Multidimensional Scaling- Isomap- Hierarchical clustering

Part II: Clustering- K-means- Vector Quantization- Self Organizing Maps- Mixture Models

Part II: Supervised Learning- Empirical Risk Minimization- Linear Discriminant Analysis- Quadratic Discriminant Analysis- Naive Bayes

5 / 1

Page 6: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Syllabus II

- Bayesian Methods- Logistic Regression

Part III: Supervised Learning- Nearest Neighbours, Prototype Based Methods- Classification and Regression Trees- Neural Networks (not in; lecture15)

Part IV: Ensemble Methods- Bootstrap, Bagging- Random Forests- Boosting- Dropout Neural Networks (not in)

R- Learning how to use R for Data Mining

6 / 1

Page 7: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Outline

7 / 1

Page 8: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

What is Data Mining?

Traditional Problems in Applied StatisticsWell formulated question that we would like to answer.Expensive to gathering data and/or expensive to do computation.Create specially designed experiments to collect high quality data.

Current SituationInformation Revolution

- improvements in data-storage devices (both larger and cheaper).- powerful data capturing devices (microphones, cameras, satellites).

→ lots of data with potentially valuable information available.→ Big Data....

8 / 1

Page 9: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

What is Data Mining?

I To gain insight from secondary data possibly without a specific goal inmind.

I Often working with huge datasets.I Typically many variables (up to thousands or millions).I Often, but not always many observations (dozens to millions).

9 / 1

Page 10: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Applications of Data Mining

I Pattern Recognition

- Sorting Cheques- Reading License Plates- Sorting Envelopes- Eye/ Face/ Fingerprint Recognition

Image data contain a lot of structure. Data mining usually refers to makingsense of less structured data.

10 / 1

Page 11: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Applications of Data Mining

I Business applications- Help companies intelligently find information- Credit scoring- Predict which products people are going to buy- Recommender systems- Autonomous tradingI Scientific applications- Predict cancer occurence/type and health of patients- Make sense of complex physical models

...It is just a nice name for multivariate statistics (‘minus model checking’).

11 / 1

Page 12: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

1/19/09 10:25 AMThe New York Times > Business > Your Money > What Wal-Mart Knows About Customers' Habits

Page 1 of 4http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html?pagewanted=print&position=

November 14, 2004

What Wal-Mart Knows About Customers' Habits

By CONSTANCE L. HAYS

URRICANE FRANCES was on its way, barreling across the Caribbean, threatening a direct hit onFlorida's Atlantic coast. Residents made for higher ground, but far away, in Bentonville, Ark.,

executives at Wal-Mart Stores decided that the situation offered a great opportunity for one of theirnewest data-driven weapons, something that the company calls predictive technology.

A week ahead of the storm's landfall, Linda M. Dillman, Wal-Mart's chief information officer, pressed herstaff to come up with forecasts based on what had happened when Hurricane Charley struck several weeksearlier. Backed by the trillions of bytes' worth of shopper history that is stored in Wal-Mart's computernetwork, she felt that the company could "start predicting what's going to happen, instead of waiting for itto happen," as she put it.

The experts mined the data and found that the stores would indeed need certain products - and not just theusual flashlights. "We didn't know in the past that strawberry Pop-Tarts increase in sales, like seven timestheir normal sales rate, ahead of a hurricane," Ms. Dillman said in a recent interview. "And the pre-hurricane top-selling item was beer."

Thanks to those insights, trucks filled with toaster pastries and six-packs were soon speeding downInterstate 95 toward Wal-Marts in the path of Frances. Most of the products that were stocked for thestorm sold quickly, the company said.

Such knowledge, Wal-Mart has learned, is not only power. It is profit, too.

Plenty of retailers collect data about their stores and their shoppers, and many use the information to try toimprove sales. Target Stores, for example, introduced a branded Visa card in 2001 and has used it, alongwith an arsenal of gadgetry, to gather data ever since. But Wal-Mart amasses more data about theproducts it sells and its shoppers' buying habits than anyone else, so much so that some privacy advocatesworry about potential for abuse.

With 3,600 stores in the United States and roughly 100 million customers walking through the doors eachweek, Wal-Mart has access to information about a broad slice of America - from individual SocialSecurity and driver's license numbers to geographic proclivities for Mallomars, or lipsticks, or jugs ofantifreeze. The data are gathered item by item at the checkout aisle, then recorded, mapped and updatedby store, by state, by region.

By its own count, Wal-Mart has 460 terabytes of data stored on Teradata mainframes, made by NCR, atits Bentonville headquarters. To put that in perspective, the Internet has less than half as much data,according to experts.

Information about products, and often about customers, is most often obtained at checkout scanners.Wireless hand-held units, operated by clerks and managers, gather more inventory data. In most cases,such detail is stored for indefinite lengths of time. Sometimes it is divided into categories or mappedacross computer models, and it is increasingly being used to answer discount retailing's rabbinicalquestions, like how many cashiers are needed during certain hours at a particular store.

Full NY Times article on http://snipurl.com/ac5hc.

12 / 1

Page 13: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

13 / 1

Page 14: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

NYTimes article on use of R for data mining and data analysis:snipurl.com/badgw

08/01/2010 17:00R, the Software, Finds Fans in Data Analysts - NYTimes.com

Page 1 of 2http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2&em#

R, the Software, Finds Fans in Data Analysts -NYTimes.comTo some people R is just the 18th letter of the alphabet. To others, it’s the rating on racy movies, a measureof an attic’s insulation or what pirates in movies say.

R is also the name of a popular programming language used by a growing number of data analysts insidecorporations and academia. It is becoming their lingua franca partly because data mining has entered agolden age, whether being used to set ad prices, find new drugs more quickly or fine-tune financial models.Companies as diverse as Google, Pfizer, Merck, Bank of America, the InterContinental Hotels Group andShell use it.

But R has also quickly found a following because statisticians, engineers and scientists without computerprogramming skills find it easy to use.

“R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist atGoogle, which uses the software widely. “It allows statisticians to do very intricate and complicatedanalyses without knowing the blood and guts of computing systems.”

It is also free. R is an open-source program, and its popularity reflects a shift in the type of software usedinside corporations. Open-source software is free for anyone to use and modify. I.B.M., Hewlett-Packardand Dell make billions of dollars a year selling servers that run the open-source Linux operating system,which competes with Windows from Microsoft. Most Web sites are displayed using an open-sourceapplication called Apache, and companies increasingly rely on the open-source MySQL database to storetheir critical information. Many people view the end results of all this technology via the Firefox Webbrowser, also open-source software.

R is similar to other programming languages, like C, Java and Perl, in that it helps people perform a widevariety of computing tasks by giving them access to various commands. For statisticians, however, R isparticularly useful because it contains a number of built-in mechanisms for organizing data, runningcalculations on the information and creating graphical representations of data sets.

Some people familiar with R describe it as a supercharged version of Microsoft’s Excel spreadsheetsoftware that can help illuminate data trends more clearly than is possible by entering information into rowsand columns.

What makes R so useful — and helps explain its quick acceptance — is that statisticians, engineers andscientists can improve the software’s code or write variations for specific tasks. Packages written for R addadvanced algorithms, colored and textured graphs and mining techniques to dig deeper into databases.

Close to 1,600 different packages reside on just one of the many Web sites devoted to R, and the number ofpackages has grown exponentially. One package, called BiodiversityR, offers a graphical interface aimed atmaking calculations of environmental trends easier.

Another package, called Emu, analyzes speech patterns, while GenABEL is used to study the human

14 / 1

Page 15: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Types of Data Mining

Unsupervised Learning‘Unclassified’ data from which we would like to uncover hidden ‘structure’ orgroupings

- Given detailed phone usage from many people, find interesting groups ofpeople with similar behaviour.

- Shopping habits for people using loyalty cards: find groups of ‘similar’shoppers.

- Given expression measurements of 1000s of genes for 100s of patients,find groups of functionally similar genes.

Goal: Hypothesis generation

15 / 1

Page 16: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Types of Data Mining

Supervised LearningA database of ‘classified’ examples with predefined groupings

- Given detailed phone usage of many users along with their historic churn,predict when/if people are going to change contracts again.

- Given expression measurements of 1000s of genes for 100s of patientsalong with a binary variable indicating absence or presence of a specificcancer, predict if the cancer is present for a new patient.

- Given expression measurements of 1000s of genes for 100s of patientsalong with survival length, predict survival time.

Goal: Prediction.

16 / 1

Page 17: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Outline

17 / 1

Page 18: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Notation

I Data consists of p measurements (variables/attributes) on n examples(observations/cases)

I X is a n× p-matrix with Xij := the j-th measurement for the i-th example

X =

X11 X12 . . . X1j . . . X1p

X21 X22 . . . X2j . . . X2p...

.... . .

.... . .

...Xi1 Xi2 . . . Xij . . . Xip...

.... . .

.... . .

...Xn1 Xn2 . . . Xnj . . . Xnp

18 / 1

Page 19: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Crabs Data (n = 200, p = 5)

Campbell (1974) studied rock crabs of the genus leptograpsus. One species,L. variegatus, had been split into two new species, previously grouped bycolour, orange and blue. Preserved specimens lose their colour, so it washoped that morphological differences would enable museum material to beclassified.

Data are available on 50 specimens of each sex of each species, collected onsight at Fremantle, Western Australia. Each specimen has measurements onthe width of the frontal lip FL, the rear width RW, and length along the midlineCL and the maximum width CW of the carapace, and the body depth BD in mm.

19 / 1

Page 20: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Crabs DataLooking at the crabs dataset, n = 200 measurements on p = 5 morphologicalfeatures of crabs

I ’FL’ frontal lobe size (mm)I ’RW’ rear width (mm)I ’CL’ carapace length (mm)I ’CW’ carapace width (mm)I ’BD’ body depth (mm)

Also available, the colour (’B’ or ’O’) and sex (’M’ or ’F’).

## load package MASS containing the datalibrary(MASS)## look at datacrabs

sp sex index FL RW CL CW BD1 B M 1 8.1 6.7 16.1 19.0 7.02 B M 2 8.8 7.7 18.1 20.8 7.4...

20 / 1

Page 21: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

R code

## assign predictor and class variablesCrabs <- crabs[,4:8]Crabs.class <- factor(paste(crabs[,1],crabs[,2],sep=""))

## plot data using pair plotsplot(Crabs,col=unclass(Crabs.class))

##boxplotsboxplot(Crabs)

## parallel coordinatesparcoord(Crabs)

21 / 1

Page 22: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Simple Pairwise Scatterplots

FL

6 8 10 12 14 16 18 20 20 30 40 50

1015

20

68

1012

1416

1820

RW

CL

1520

2530

3540

45

2030

4050

CW

10 15 20 15 20 25 30 35 40 45 10 15 20

1015

20

BD

22 / 1

Page 23: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Univariate Boxplots

FL RW CL CW BD

1020

3040

50

23 / 1

Page 24: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Univariate HistogramsHistogram of Frontal Lobe Size

Frontal Lobe Size (mm)

Fre

quen

cy

10 15 20

05

1015

20

Histogram of Rear Width

Rear Width (mm)

Fre

quen

cy

6 8 12 16 20

05

1015

Histogram of Carapace Length

Carapace Length (mm)

Fre

quen

cy

15 25 35 45

05

1015

20

Histogram of Carapace Width

Carapace Width (mm)

Fre

quen

cy

20 30 40 50

05

1015

20

Histogram of Body Depth

Body Depth (mm)

Fre

quen

cy

10 15 20

05

1015

2025

30

24 / 1

Page 25: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Parallel Coordinate Plots

FL RW CL CW BD

25 / 1

Page 26: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

These summary plots are helpful, but do not really help very much if thedimensionality of the data is high (a few dozen or thousands).Possible approaches for higher-dimensional problems.

I We are constrained to view data in 2 or 3 dimensionsI Look for ‘interesting’ projections of X into lower dimensionsI Hope that for large p, considering only k� p dimensions is just as

informative

26 / 1

Page 27: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Overview of PCA

I Seek to rotate data to a new basis that represents the data in a more‘interesting’ way.

I PCA considers interesting to be directions with greatest variance.I Builds up an orthogonal basis where new basis vectors are chosen to

explain the greatest variance in data, the first few PCs should representmost of the variance-covariance structure in the data, i.e. the subspacespanned by first k PCs represents the ‘best’ k-dimensional view of thedata.

27 / 1

Page 28: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Overview of PCA

I Consider a set of real-valued variables X = (X1 . . .Xp)T .

I For the 1st PC, we seek a derived variable of the form

Z1 = a11X1 + a21X2 + · · ·+ ap1Xp = XTa1

where a1i ∈ R are chosen to maximise var(Z1).I To get a well defined problem, we fix aT

1 a1 = 1.I The 1st PC attempts to capture the common variation in all variables

using a single derived variable.I The 2nd PC Z2 is chosen to be orthogonal with the 1st and is computed in

a similar way. It will have the largest variance in the remaining p− 1dimensions, etc.

28 / 1

Page 29: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Principal Components Analysis

−4 −2 0 2 4

−4

−2

02

4

X1

X2

29 / 1

Page 30: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Principal Components Analysis

−4 −2 0 2 4

−4

−2

02

4

X1

X2

30 / 1

Page 31: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Principal Components Analysis

−4 −2 0 2 4

−4

−2

02

4

PC_1

PC

_2

31 / 1

Page 32: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

How to Obtain the Coefficients?To find the 1st PC given by Z1 = XTa1

I Maximise var(Z1) = var(Xa1) = aT1 cov(X)a1 ≈ aT

1 Sa1 subject to aT1 a1 = 1

where S = n−1XTX is a p× p sample covariance matrix of the centredn× p data matrix X.

I Rewriting this as a constrained maximisation problem,

Maximise F (a1) = aT1 Sa1 − λ1

(aT

1 a1 − 1)

w.r.t. a1.

I The corresponding vector of partial derivatives yields

∂F∂a1

= 2Sa1 − 2λ1a1.

I Setting this to zero reveals the eigenvector equation, i.e. a1 must be aneigenvector of S and λ1 the corresponding eigenvector.

I Since aT1 Sa1 = λ1aT

1 a1 = λ1, the 1st PC must be the eigenvectorassociated with the largest eigenvalue of S.

32 / 1

Page 33: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

How to Obtain the Coefficients?

How about the 2nd PC?

I Proceed as before but include the additional constraint that the 2nd PCmust be orthogonal to the 1st PC

Maximise F (a2) = aT2 Sa2 − λ2

(aT

2 a2 − 1)− µ

(aT

1 a2)

w.r.t. a2

I Solving this shows that a2 must be the eigenvector of S associated withthe 2nd largest eigenvalue, and so on

I The eigenvalue decomposition of S is given by S = AΛAT where Λ is adiagonal matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0 and A is a p× porthogonal matrix whose columns are the p eigenvectors of S.

33 / 1

Page 34: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Properties of the Principal Components

I PCs are uncorrelated

cov(XTai,XTaj) ≈ aTi Saj = 0 for i 6= j.

I The total sample variance is given by

Total sample variance =

p∑i=1

sii = λ1 + . . .+ λp,

so the proportion of total variance explained by the kth PC is

λk

λ1 + λ2 + . . .+ λpk = 1, 2, . . . , p

34 / 1

Page 35: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

R code

This is what we have had before:

library(MASS)Crabs <- crabs[,4:8]Crabs.class <- factor(paste(crabs[,1],crabs[,2],sep=""))plot(Crabs,col=unclass(Crabs.class))

Now perform PCA analysis with function princomp.Alternatively, use eigen or svd instead.

Crabs.pca <- princomp(Crabs,cor=FALSE)plot(Crabs.pca)pairs(predict(Crabs.pca),col=unclass(Crabs.class))

35 / 1

Page 36: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

PCA Example 1: Original crabs data

FL

6 8 10 12 14 16 18 20 20 30 40 50

1015

20

68

1012

1416

1820

RW

CL

1520

2530

3540

45

2030

4050

CW

10 15 20 15 20 25 30 35 40 45 10 15 20

1015

20

BD

36 / 1

Page 37: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

PCA Example 1: Rotated crabs data

Comp.1

−2 −1 0 1 2 3 −1.0 −0.5 0.0 0.5 1.0

−20

−10

010

2030

−2

−1

01

23

Comp.2

Comp.3

−2

−1

01

2

−1.

0−

0.5

0.0

0.5

1.0

Comp.4

−20 −10 0 10 20 30 −2 −1 0 1 2 −0.5 0.0 0.5

−0.

50.

00.

5

Comp.5

37 / 1

Page 38: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

PCA Example 1: Crabs Data (n = 200, p = 5)

−2 −1 0 1 2 3

−2

−1

01

2

Comp.2

Com

p.3

38 / 1

Page 39: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

PCA Example 2: Yeast Cell Cycle Data (n = 384,p = 17)

Cho et al (1998) present gene expression data on the cell cycle of yeast. Theyidentify a subset of genes that can be categorised into five different phases ofthe cell-cycle. Changes in expression for the genes are measured over twocell cycles (17 time points). The data were normalised so that the expressionvalues for each gene has mean zero and unit variance across the cell cycles.

We visualise the 384 genes in the space of the first two prinicipal components.

39 / 1

Page 40: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

PCA Example 2: Yeast Cell Cycle Data (n = 384,p = 17)

−2 0 2 4

−4

−2

02

4

Comp.1

Com

p.2

40 / 1

Page 41: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Comments on the use of PCA

Emphasis on variance is where the weaknesses of PCA stem from:

I The PCs depend heavily on the units measurement. Where the datamatrix contains measurements of vastly differing orders of magnitude, thePC will be greatly biased in the direction of larger measurement. It istherefore recommended to calculate PCs from cor(X) instead of cov(X).

I Robustness to outliers is also an issue. Variance is affected by outlierstherefore so are PCs.

Although PCs are uncorrelated, scatterplots sometimes reveal structures inthe data other than linear correlation.PCA commonly used to project data X onto the first k PCs giving the ‘best’k-dimensional view of the data.PCA commonly used for lossy compression of high dimensional data.

41 / 1

Page 42: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Biplots

I When viewing projections of data matrix X into its PC space, it isinstructive to view the contribution from the original variables to the PCsthat are plot.

I Biplots overlay projection of unit vectors of the original variables into thePC space

I As PCs are linear combinations of the original variables, it isstraightforward to invert this relationship to yield the contributions of theoriginal variables to the PCs

42 / 1

Page 43: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Biplots

Biplots show us an image of the data and unit vectors of the original axes intothe projected space,

I Unit vectors of the original variables give us a common denominator tocompare how much weighting each PC gives to the original variables.

I It can be shown that cos θ (where θ is the angle that subtends twoprojected original axes) approximates the correlation between thesevariables,

However, the quality of this image depends on the proportion of varianceexplained by the PCs used

43 / 1

Page 44: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Biplot Example 1: Fisher’s Iris Data

50 sample from 3 species of iris: iris setosa,versicolor, and virginica

Each measuring the length and widths ofboth sepal and petals

Collected by E. Anderson (1935) andanalysed by R.A. Fisher (1936)

Using again function princomp and biplot.

iris1 <- irisiris1 <- iris1[,-5]biplot(princomp(iris1,cor=T))

44 / 1

Page 45: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

−0.2 −0.1 0.0 0.1 0.2

−0.

2−

0.1

0.0

0.1

0.2

Comp.1

Com

p.2

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

2425

26

27

28

29

3031

32

33

34

35

36

3738

39

40

41

42

43

44

45

46

47

48

49

50

51

52 53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

8182

8384

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125126

127

128

129

130

131

132

133134

135

136

137

138

139

140141142

143

144

145

146

147

148

149

150

−10 −5 0 5 10

−10

−5

05

10

Sepal.Length

Sepal.Width

Petal.LengthPetal.Width

45 / 1

Page 46: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Biplot Example 2: US Arrests

This data set contains statistics, in arrests per 100,000 residents for assault,murder, and rape in each of the 50 US states in 1973. Also given is thepercent of the population living in urban areas.

pairs(USArrests)usarrests.pca <- princomp(USArrests,cor=T)plot(usarrests.pca)

pairs(predict(usarrests.pca))biplot(usarrests.pca)

46 / 1

Page 47: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Pairs Plot: US Arrests

Murder

50 150 250

●● ●

●●

●●

●●

●●

●● ●

● ●

●●

● ●

●●

10 20 30 40

510

15

●● ●

●●

●●

●●

●●

5015

025

0

● ●

●●

●●

●●

●●

●Assault

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● UrbanPop

3050

7090

●●

●●

● ●

●●

●●

5 10 15

1020

3040

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

30 50 70 90

●●

●●

●●

●●

● ●

●●

●●●

●●

Rape

47 / 1

Page 48: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Biplot Example 2: US Arrests

−0.2 −0.1 0.0 0.1 0.2 0.3

−0.

2−

0.1

0.0

0.1

0.2

0.3

Comp.1

Com

p.2

AlabamaAlaska

Arizona

Arkansas

California

ColoradoConnecticut

Delaware

Florida

Georgia

Hawaii

Idaho

Illinois

Indiana Iowa

Kansas

KentuckyLouisiana

MaineMaryland

Massachusetts

Michigan

Minnesota

Mississippi

Missouri

Montana

Nebraska

Nevada

New Hampshire

New Jersey

New Mexico

New York

North Carolina

North Dakota

Ohio

Oklahoma

Oregon Pennsylvania

Rhode Island

South Carolina

South DakotaTennessee

Texas

Utah

Vermont

Virginia

Washington

West Virginia

Wisconsin

Wyoming

−5 0 5

−5

05

Murder

Assault

UrbanPop

Rape

48 / 1

Page 49: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Biplot Example 3: US State dataThis data set contains statistics like illiteracy and life expectancy on 50 USstates.

data(state) ## load state datastate <- state.x77[, 2:7] ## extract useful inforow.names(state)<-state.abbstate[1:5,] ## lets have a look

Income Illiteracy Life Exp Murder HS Grad FrostAL 3624 2.1 69.05 15.1 41.3 20AK 6315 1.5 69.31 11.3 66.7 152AZ 4530 1.8 70.55 7.8 58.1 15AR 3378 1.9 70.66 10.1 39.9 65CA 5114 1.1 71.71 10.3 62.6 20

## calculate the pc’s of the data and show biplotstate.pca <- princomp(state,cor=TRUE)biplot(state.pca, pc.biplot=TRUE, cex=0.8,font=2, expand=0.9)

49 / 1

Page 50: MS1b Statistical Data Miningteh/teaching/datamining/... · 2016. 11. 28. · Plenty of retailers collect data about their stores and their shoppers, and many use the information to

Pairs Plot: US States

−2 −1 0 1 2 3

−2−1

01

23

Comp.1

Com

p.2

AL

AK

AZ

AR

CA

COCTDE

FL

GA

HI

ID

IL

IN

IA

KS

KY

LA

ME

MD

MA

MI

MNMS

MOMT NE

NV

NH

NJ

NM

NY

NCND

OH

OK

OR

PA

RI

SC

SD

TN

TX

UT

VT

VA

WA

WV

WI

WY

−1.0 −0.5 0.0 0.5 1.0

−1.0

−0.5

0.0

0.5

1.0

Income

Illiteracy

Life Exp

Murder HS Grad

Frost

50 / 1