Principal Component Analysis Using R
R is a language and environment for statistical computing and graphics.
R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests,
time-series analysis, classification, clustering, and others.
R can be considered as a different implementation of S.
It compiles and runs on a wide variety of platforms such as UNIX, Windows and Mac OS.
R includes:
• An effective data handling and storage facility
• A suite of operators for calculations on arrays and matrices
• A large, coherent, integrated collection of tools for data analysis
• Graphical facilities for data analysis and display, either on-screen or on hardcopy
• A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities
R provides a comprehensive set of statistical analysis techniques:
• Classical statistical tests
• Linear and nonlinear modeling
• Time-series analysis
• Classification and cluster analysis
• Spatial statistics
• Basically any statistical technique you can think of is part of a contributed package to R
Why is Principal Component Analysis used?
Principal Component Analysis (PCA) is a data dimension reduction technique. It is a powerful tool
when the data have n variables. PCA forms new variables, the principal components, as linear combinations of the original variables.
The components are chosen so that a small number of them preserves as much of the information (variance) in the data as possible.
Principal Component Analysis supports the extraction of hidden predictive information from large databases. With it, organizations can identify valuable customers, predict future behaviors, and make proactive, knowledge-driven decisions.
There are four student applications.
The Graduate Admission Office wants to select two graduate students. Who should be selected?
STUDENT   GPA   GRE    PROFESSOR RATING
1.        3.2   1270   38
2.        3.9   1600   42
3.        2.9   1500   22
4.        3.0   1400   32
PCA in R selects the two best graduate students from the table above in five steps:
1. Enter the data in R.
2. Calculate the correlation matrix.
3. Calculate the eigenvectors and eigenvalues of the correlation matrix.
4. Choose the number of principal components to be retained.
5. Derive the new data set.
R CODE
> Gpa <- c(3.2, 3.9, 2.9, 3.0)
> Gre <- c(1270, 1600, 1500, 1400)
> Professorrating <- c(38, 42, 22, 32)
> Student <- data.frame(Gpa, Gre, Professorrating)
> Student
  Gpa  Gre Professorrating
1 3.2 1270              38
2 3.9 1600              42
3 2.9 1500              22
4 3.0 1400              32
> stud <- cor(Student)
> stud
               Gpa          Gre     Prof.rat
Gpa      1.0000000  0.531991767  0.824316301
Gre      0.5319918  1.000000000 -0.009509527
Prof.rat 0.8243163 -0.009509527  1.000000000
Each entry of the correlation matrix measures the strength of the linear relationship between a pair of variables.
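To make that concrete, the sketch below recomputes a single entry of the matrix (GPA against GRE) from the usual definition of Pearson correlation, covariance divided by the product of the standard deviations, and compares it with `cor()`. The variable names `r_manual` and `r_builtin` are introduced only for this illustration.

```r
Gpa <- c(3.2, 3.9, 2.9, 3.0)
Gre <- c(1270, 1600, 1500, 1400)

# Pearson correlation from its definition:
# covariance divided by the product of the standard deviations.
r_manual <- cov(Gpa, Gre) / (sd(Gpa) * sd(Gre))

# Built-in equivalent.
r_builtin <- cor(Gpa, Gre)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree with the Gpa/Gre entry of the matrix shown above.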
> eigen(stud)
$values
[1] 1.97676210 1.00866512 0.01457279

$vectors
          [,1]         [,2]       [,3]
[1,] 0.7086607 -0.003993348  0.7055382
[2,] 0.3801843 -0.840227900 -0.3866225
[3,] 0.5943568  0.542218710 -0.5939183
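The step "choose the number of principal components to be retained" is not worked out in the slides. One common heuristic, offered here only as a sketch and not as the author's rule, is to look at the share of total variance each eigenvalue accounts for. The names `prop_var` and `cum_var` are illustrative.

```r
# Eigenvalues as printed in the eigen(stud) output.
values <- c(1.97676210, 1.00866512, 0.01457279)

# For a correlation matrix the eigenvalues sum to the number of
# variables (3 here), so each share is the fraction of total
# variance carried by that component.
prop_var <- values / sum(values)
cum_var  <- cumsum(prop_var)

print(round(rbind(proportion = prop_var, cumulative = cum_var), 4))
```

On these numbers the first component carries roughly 66% of the variance and the first two together more than 99%, so the third adds very little.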
> barplot(eigen(stud)$vectors)
[Bar plot of the eigenvectors; bars labeled pc1, pc2, pc3]
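As a cross-check that is not part of the original slides, base R's `prcomp()` can run the same decomposition in one call. The sketch assumes `scale. = TRUE`, which standardizes each column and so corresponds to working from the correlation matrix; the names `component_var` and `eigen_values` are illustrative.

```r
Student <- data.frame(
  Gpa = c(3.2, 3.9, 2.9, 3.0),
  Gre = c(1270, 1600, 1500, 1400),
  Professorrating = c(38, 42, 22, 32)
)

# scale. = TRUE standardizes each column, which corresponds to
# using the correlation matrix rather than the covariance matrix.
p <- prcomp(Student, scale. = TRUE)

# Squared component standard deviations equal the eigenvalues
# of the correlation matrix.
component_var <- p$sdev^2
eigen_values  <- eigen(cor(Student))$values

print(round(component_var, 6))
print(round(eigen_values, 6))
```

Note that `p$x` holds scores computed from the centered and scaled columns, so those numbers are not directly comparable with scores computed on the raw columns. The signs of the columns in `p$rotation` may also differ from those returned by `eigen()`, since an eigenvector is only defined up to sign.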
> pc1 <- 0.7086607*Gpa + 0.3801843*Gre + 0.5943568*Professorrating
> pc2 <- -0.003993348*Gpa - 0.840227900*Gre + 0.542218710*Professorrating
> pc3 <- 0.7055382*Gpa - 0.3866225*Gre - 0.5939183*Professorrating
Students 2 and 3 will be selected if the first component (pc1) is
used to calculate the score.
STUDENT   GPA   GRE    PROFESSOR RATING   SCORE
1.        3.2   1270   38                 507.6873
2.        3.9   1600   42                 636.0216
3.        2.9   1500   22                 585.4074
4.        3.0   1400   32                 553.4034
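The scores in the table can be reproduced with one matrix product instead of a hand-written formula. The sketch below hard-codes the first eigenvector as printed in the `eigen(stud)` output and applies it to the raw (unstandardized) columns, as the slides do. The names `w1`, `score` and `selected` are illustrative.

```r
Student <- data.frame(
  Gpa = c(3.2, 3.9, 2.9, 3.0),
  Gre = c(1270, 1600, 1500, 1400),
  Professorrating = c(38, 42, 22, 32)
)

# First eigenvector, copied from the printed eigen(stud) output.
w1 <- c(0.7086607, 0.3801843, 0.5943568)

# One matrix product gives the pc1 score for every student.
score <- as.vector(as.matrix(Student) %*% w1)

# Rank students by score, highest first, and keep the top two.
selected <- order(score, decreasing = TRUE)[1:2]

print(round(score, 4))
print(selected)
```

Hard-coding the printed eigenvector keeps the result independent of the sign convention `eigen()` happens to use on a given installation.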
• PCA is limited to re-expressing the data as a linear combination of its basis vectors.
• PCA is a non-parametric method: it is independent of the user and can't be configured for specific inputs.
• Principal components are orthogonal.
• Mean and variance are sufficient.
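The orthogonality point can be checked numerically on the student data. The sketch below is an illustration, with `V` and `G` as hypothetical names: if the eigenvectors are orthonormal, the product of the transposed eigenvector matrix with itself is the identity matrix.

```r
Student <- data.frame(
  Gpa = c(3.2, 3.9, 2.9, 3.0),
  Gre = c(1270, 1600, 1500, 1400),
  Professorrating = c(38, 42, 22, 32)
)

# Eigenvectors of the correlation matrix, one per column.
V <- eigen(cor(Student))$vectors

# crossprod(V) computes t(V) %*% V; orthonormal columns
# give the identity matrix up to rounding error.
G <- crossprod(V)

print(round(G, 6))
```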