Principal Component Analysis Using R

R is a language and environment for statistical computing and graphics.

R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.

R can be considered as a different implementation of S.

It compiles and runs on a wide variety of platforms such as UNIX, Windows and Mac OS.

R includes:
• An effective data handling and storage facility
• A suite of operators for calculations on arrays and matrices
• A large, coherent, integrated collection of tools for data analysis
• Graphical facilities for data analysis and display, either on-screen or on hardcopy
• A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities

R provides a comprehensive set of statistical analysis techniques:
• Classical statistical tests
• Linear and nonlinear modeling
• Time-series analysis
• Classification and cluster analysis
• Spatial statistics
• Basically any statistical technique you can think of is part of a contributed package to R

Why is Principal Component Analysis used?

Principal Component Analysis (PCA) is a data dimension-reduction technique. It is a powerful tool when the data being analyzed have n variables.

PCA combines all of the variables into a smaller number of new variables while losing as little of the information in the original data as possible.

The principal components are formed as linear combinations of the original variables, chosen to preserve that information.

With Principal Component Analysis, organizations can extract hidden predictive information from large databases, identify valuable customers, predict future behaviors, and make proactive, knowledge-driven decisions.

There are four student applications.

The Graduate Admission Office wants to select two graduate students. Who should be selected?

STUDENT  GPA  GRE   PROFESSOR RATING
1.       3.2  1270  38
2.       3.9  1600  42
3.       2.9  1500  22
4.       3.0  1400  32

There are five steps in using PCA in R to select the two best graduate students from the table:
1. Enter the data in R.
2. Calculate the correlation matrix.
3. Calculate the eigenvectors and eigenvalues of the correlation matrix.
4. Choose the number of principal components to be retained.
5. Derive the new data set.
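As a cross-check on the manual steps, base R also ships `prcomp()`, which can run a correlation-based PCA in one call. The sketch below is not the method used in these slides: it assumes `scale. = TRUE`, so the variables are standardized first, and the squared standard deviations it reports should then agree with the eigenvalues of the correlation matrix.

```r
# Example data from the table above
Gpa <- c(3.2, 3.9, 2.9, 3.0)
Gre <- c(1270, 1600, 1500, 1400)
Professorrating <- c(38, 42, 22, 32)
Student <- data.frame(Gpa, Gre, Professorrating)

# scale. = TRUE standardizes each column, which makes this a
# correlation-based PCA rather than a covariance-based one
pca <- prcomp(Student, scale. = TRUE)

# Variance of each component (should match eigen(cor(Student))$values)
component_variance <- pca$sdev^2
print(component_variance)

# Proportion of variance explained by each component
print(summary(pca))
```

Note that `prcomp()` scores are computed from standardized data, so they will not equal the raw-data scores derived by hand in the slides.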

R CODE

> Gpa <- c(3.2, 3.9, 2.9, 3.0)
> Gre <- c(1270, 1600, 1500, 1400)
> Professorrating <- c(38, 42, 22, 32)
> Student <- data.frame(Gpa, Gre, Professorrating)
> Student
  Gpa  Gre Professorrating
1 3.2 1270              38
2 3.9 1600              42
3 2.9 1500              22
4 3.0 1400              32

> stud <- cor(Student)
> stud
                      Gpa          Gre Professorrating
Gpa             1.0000000  0.531991767     0.824316301
Gre             0.5319918  1.000000000    -0.009509527
Professorrating 0.8243163 -0.009509527     1.000000000

The correlation matrix measures the strength of the linear relationship between each pair of variables.
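To make one entry of the correlation matrix concrete, the following sketch recomputes the Gpa and professor-rating correlation from the Pearson definition and compares it with `cor()`. The helper names `dx`, `dy`, `r_manual` and `r_builtin` are illustrative only.

```r
# Two of the variables from the example table
Gpa <- c(3.2, 3.9, 2.9, 3.0)
Professorrating <- c(38, 42, 22, 32)

# Deviations of each value from its variable's mean
dx <- Gpa - mean(Gpa)
dy <- Professorrating - mean(Professorrating)

# Pearson correlation: sum of cross-products divided by the
# square root of the product of the sums of squares
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in equivalent
r_builtin <- cor(Gpa, Professorrating)

print(r_manual)
print(r_builtin)
```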

> eigen(stud)
$values
[1] 1.97676210 1.00866512 0.01457279

$vectors
          [,1]         [,2]       [,3]
[1,] 0.7086607 -0.003993348  0.7055382
[2,] 0.3801843 -0.840227900 -0.3866225
[3,] 0.5943568  0.542218710 -0.5939183
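For the step of choosing how many components to retain, one common approach is to look at the share of total variance carried by each eigenvalue. The sketch below uses the eigenvalues printed above; the "eigenvalue greater than 1" cutoff (often called the Kaiser rule) is a heuristic assumed here for illustration and is not stated in the slides.

```r
# Eigenvalues of the correlation matrix, copied from the output above
values <- c(1.97676210, 1.00866512, 0.01457279)

# Share of total variance explained by each component, and the running total
prop <- values / sum(values)
cumprop <- cumsum(prop)

print(round(data.frame(eigenvalue = values,
                       proportion = prop,
                       cumulative = cumprop), 4))

# Heuristic (assumption): retain components whose eigenvalue exceeds 1
n_keep <- sum(values > 1)
print(n_keep)
```

With these values the first component carries roughly 66% of the variance and the first two together more than 99%, so the third contributes very little.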

>barplot(eigen(stud)$vectors)

[Figure: bar plot of the eigenvectors, with bars labeled pc1, pc2 and pc3]

> pc1 <- 0.7086607*Gpa + 0.3801843*Gre + 0.5943568*Professorrating
> pc2 <- -0.003993348*Gpa - 0.840227900*Gre + 0.542218710*Professorrating
> pc3 <- 0.7055382*Gpa - 0.3866225*Gre - 0.5939183*Professorrating
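Typing the coefficients by hand is error-prone, so the same pc1 score can also be obtained with a matrix product. This is a sketch that follows the slides' approach of applying the eigenvector to the raw (unstandardized) data. Because `eigen()` may return an eigenvector or its negative, the sign is flipped here, as an assumption, so that the loadings are positive like those shown above.

```r
# Example data
Gpa <- c(3.2, 3.9, 2.9, 3.0)
Gre <- c(1270, 1600, 1500, 1400)
Professorrating <- c(38, 42, 22, 32)
Student <- data.frame(Gpa, Gre, Professorrating)

# First eigenvector of the correlation matrix
v1 <- eigen(cor(Student))$vectors[, 1]

# Eigenvectors are only defined up to sign; use the positive orientation
if (sum(v1) < 0) v1 <- -v1

# Score of each student on the first component: data matrix times loadings
pc1 <- as.vector(as.matrix(Student) %*% v1)
print(round(pc1, 4))
```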

Students 2 and 3 will be selected if the first principal component (pc1) is used to calculate the score.

STUDENT  GPA  GRE   PROFESSOR RATING  SCORE
1.       3.2  1270  38                507.6873
2.       3.9  1600  42                636.0216
3.       2.9  1500  22                585.4074
4.       3.0  1400  32                553.4034
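The selection can be read off the table, but it can also be sketched in R by sorting the scores. The vector below copies the SCORE column; the names `score`, `ranked` and `selected` are illustrative.

```r
# pc1 scores copied from the SCORE column of the table
score <- c(507.6873, 636.0216, 585.4074, 553.4034)
names(score) <- paste("Student", 1:4)

# Sort from highest to lowest score and keep the top two
ranked <- sort(score, decreasing = TRUE)
selected <- names(ranked)[1:2]

print(ranked)
print(selected)
```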

• PCA is limited to re-expressing the data as a linear combination of its basis vectors.
• PCA is a non-parametric method: it is independent of the user and can't be configured for specific inputs.
• Principal components are orthogonal.
• Mean and variance are assumed to be sufficient to describe the data.
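The orthogonality of the principal components can be checked numerically. As a sketch, the eigenvectors from the example are entered by hand (rounded as printed earlier), so the result is only expected to match the identity matrix up to rounding error.

```r
# Eigenvectors from the example, one per column (matrix() fills column by column)
V <- matrix(c( 0.7086607,    0.3801843,    0.5943568,
              -0.003993348, -0.840227900,  0.542218710,
               0.7055382,   -0.3866225,   -0.5939183),
            nrow = 3)

# t(V) %*% V: off-diagonal entries are dot products between different
# eigenvectors, diagonal entries are squared lengths
G <- crossprod(V)
print(round(G, 4))

# Largest deviation from the 3 x 3 identity matrix
max_dev <- max(abs(G - diag(3)))
print(max_dev)
```

Off-diagonal values near 0 indicate that the components are mutually orthogonal, and diagonal values near 1 indicate unit length.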
