National Research University «Higher School of Economics»
“Multidimensional Data Analysis Language R” – Course syllabus
Bachelor’s program 38.03.05 “Business informatics”
The Government of the Russian Federation
The Federal State Autonomous Institution of Higher Education “National Research University – Higher School of Economics”
Faculty of Business and Management
School of Business Informatics
Department for Management of Information Systems and Digital Infrastructure
Multidimensional Data Analysis Language R
Bachelor’s program 38.03.05 “Business informatics”
Author: S.V. Petropavlovsky, associate professor
Approved at the meeting of the
Department for Management of Information Systems and Digital Infrastructure
«___»____________ 2017
Head of Department
_______________ / E.A. Isaev /
Approved by the Academic Council of Business Informatics
«___»____________ 2017
Chairman
_______________/ A.V. Dmitriev
Moscow, 2017
The document cannot be used by other HSE departments as well as other universities and
educational institutions without permission from the course authors
1. Applicability and Normative References
The program provides the contents of the course and describes the learning outcomes,
competences and practical skills obtained upon completion of the course. It also sets pre-
requisites for taking the course and provides criteria for assessing students’ performance. The
program is designed for instructors teaching the course, teaching assistants and undergraduate
students following educational track 38.03.05 "Business Informatics", Bachelor’s level.
2. Course Objectives
The course provides a theoretical background of multivariate data analysis and aims at
developing practical skills of data mining, processing and interpretation.
3. Course Description
"Multidimensional Data Analysis Language R" is an elective course taken in the 4th
academic year of the Bachelor’s program. The course gives insight into basic as well as more
advanced methods of handling multivariate data such as dimension reduction methods, cluster
analysis and Markov Chain Monte Carlo methods. A special section is devoted to modern
approaches to evaluating, selecting and regularizing multivariate linear statistical models,
such as cross validation, bootstrapping, ridge regression and the Lasso. We also discuss
some non-linear models for classification and regression such as regression splines and
generalized additive models. The last part of the course covers the black-box methods of data
analysis. In particular, machine learning algorithms such as regression and classification trees,
support vector machines, neural networks and multivariate regression splines are introduced and
applied to spam detection and building a prototype of an on-line trading system.
The course has a practical bias, so the theoretical concepts are illustrated by applications
in various fields such as computer science, engineering, economics, finance, marketing, social
sciences etc.
Students are expected to implement the algorithms in R throughout the course (although other
tools are not excluded), so a brief introduction to R is given at the very beginning. The course
lasts two modules, is taught in English and is worth 4 credits.
4. Learning Outcomes and Competencies
At the end of the course, students should:
Be aware of:
the need, applicability and basic concepts of multivariate data analysis;
basic algorithms and techniques used in multivariate data analysis;
details of implementing the algorithms in R.
Be able:
to select the appropriate algorithms for multivariate data analysis as per the research
goals;
to collect and pre-process the raw data;
to interpret the results of the analysis and use them in the decision-making process.
Learn how to:
process the multivariate data using modern software;
present the results of the analysis.
Pre-requisites:
Programming (R is a plus but not essential), mathematics (algebra and calculus), probability
theory and statistics. Good command of English.
Competencies:
Competency | Code according to Federal standard/HSE | Descriptors: basic signs of mastering (indicators of achieving a result) | Ways and methods of teaching leading to formation and development of the competencies
Being able to explicate the scientific essence of problems in the professional field | СК-1 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments
Being able to solve problems in the professional field on the basis of analysis and synthesis | СК-Б4 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments
Being able to realize scientific and practical activities in international environment | СК-Б11 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments
Being able to control and develop the content of an enterprise and Internet resources, to control the processes of creating and using information services | ПК-13 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments
Consulting with respect to the rational choice of methods and tools for controlling the IT infrastructure of an enterprise | ПК-24 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments
Being able to use the relevant mathematical and technical tools for processing, analysis and systematization of data on the topic of research | ПК-22 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments
Being able to prepare scientific reports and presentations | ПК-23 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments
5. Role of the course in the curriculum
The course is part of the major (professional) block of disciplines. It is an elective course. The
course is based on a number of preceding disciplines:
Calculus;
Linear Algebra;
Micro- and Macroeconomics, Marketing, Finance;
Probability Theory and Statistics;
Modeling of processes and systems.
The concepts and methods provided by the current course may be helpful in studying the
subsequent courses such as:
Analysis of business processes;
Fractal analysis of market data;
Semantic informational systems;
Quantitative methods of market forecasting.
6. Course Structure and Contents
6.1. Course Structure
№ | Topic | Lectures (in-class hours) | Practice in computer labs (in-class hours) | Total in-class hours | Self-study | Total
  | 1st module | 16 | 16 | 32 | 40 | 72
1. | Introduction to R | 2 | 2 | 4 | 4 | 8
2. | Multivariate Data Handling and Visualization | 2 | 2 | 4 | 4 | 8
3. | Dimension Reduction Methods | 6 | 6 | 12 | 12 | 24
4. | Cluster Algorithms | 2 | 2 | 4 | 8 | 12
5. | Markov Chain Monte Carlo Methods | 4 | 4 | 8 | 12 | 20
  | 2nd module | 16 | 16 | 32 | 40 | 72
6. | Cross Validation and Linear Model Selection | 8 | 8 | 16 | 20 | 36
7. | Machine Learning Algorithms in Practice | 8 | 8 | 16 | 20 | 36
  | Total | 32 | 32 | 64 | 80 | 144
6.2. Syllabus
Topic 1. Introduction to R.
Data objects in R, installing and using packages. Loading data from local files and on-line
databases. Plotting data in R. Advanced graphics. Time series objects. Overview of basic
statistics in R. Major programming constructs: conditional operators, loops, functions.
Reading:
1. Core Text: [1]
2. Further Reading: [2]
Topic 2. Multivariate Data Handling and Visualization
Multivariate normal distribution. Testing multivariate normality (chi^2 QQ-plots). Scatter plots,
imposing marginal distributions. Bivariate boxplots. The convex hull of the bivariate data.
Removing outliers. The bubble and glyph plots and their interpretation. Analysis of the scatter
plot matrix.
Reading:
1. Core Text: [2]
2. Further Reading: [1]
Topic 3. Dimension Reduction Methods
Principal component analysis (PCA). Geometrical view on data. Cloud of individuals and cloud
of variables. Rotating the frame and optimal projecting. PCA through diagonalization of the
covariance matrix. PCA through the singular value decomposition. Coordinates of individuals
and variables in the reduced basis. Quality of projecting. Interpretation. Simultaneous analysis of
individuals and variables. Demonstrations in R.
Correspondence analysis (CA). Data for the CA. chi^2 tests for association between categorical
variables. Geometrical view: chi^2 metric. Row and column profiles. Implementation of the CA.
Quality of dimension reduction. Link between row and column representations. Demonstrations
in R.
Multiple CA (MCA). Data for the MCA. Indicator matrix. Distances between individuals and
categories. Implementation of the MCA. Numerical indicators of quality representation.
Demonstrations in R.
Multidimensional scaling (MDS). Data for the MDS: dissimilarity matrices. Goals of
multidimensional scaling. Computing dissimilarities: Euclidean versus non-Euclidean distances.
Classical multidimensional scaling. Metric and non-metric MDS. Goodness-of-fit measures for
the metric MDS. Shepard’s diagrams. Distance scaling. Issues of the non-metric MDS.
Interpretation of the MDS analysis. Embedding external variables.
Reading:
1. Core Text: [2,3]
2. Further Reading: [4]
Topic 4. Cluster Algorithms
Cluster algorithms. Distances between clusters of observations (linkage).
Agglomerative hierarchical clustering (AHC). Constructing an indexed hierarchy. Ward’s
algorithm. Quality of partition. Agglomeration according to inertia. Properties of the
agglomeration criterion. Impact of different linkage types on the performance of the AHC.
Direct search for partitions. K-means and K-medoids approaches.
Probabilistic clustering. Gaussian mixture model (GMM). Expectation maximization algorithm.
Clustering and principal component methods.
Reading:
1. Core Text: [2]
2. Further Reading: [3]
Topic 5. Markov Chain Monte Carlo Methods
Goals of Markov Chain Monte Carlo (MCMC). Markov processes. Properties of Markov chains
(finiteness, aperiodicity, irreducibility, ergodicity, mixing, etc). The stationary state of the chain.
Monte Carlo simulations of distributions. Inverse CDF method. Rejection sampling. The Gibbs
sampler. The Metropolis-Hastings algorithm. Issues in chain efficacy. MCMC implementation in
R and examples. Applications of MCMC: modeling S&P500 index.
Reading:
1. Core Text: [4]
2. Further Reading: [2]
Topic 6. Cross Validation and Linear Model Selection
Cross validation and bootstrapping. The idea and applications. The validation set approach.
Leave-One-Out cross validation. k-fold cross validation. Bias-variance trade-off for k-fold cross
validation. Cross validation on classification problems. Bootstrapping.
Linear model selection and regularization. Best subset selection. Stepwise selection. Choosing
the optimal model. Shrinkage methods: ridge regression, the Lasso, selecting the tuning
parameter.
Dimension reduction methods in regression. Principal components regression, partial least
squares.
Regression splines. Piecewise polynomials. Constraints and splines. The spline basis
representation. Choosing the number and locations of knots. Comparison to polynomial
regression. Smoothing splines. Choosing the smoothing parameter.
Generalized additive models. GAMs for regression and classification problems.
Reading:
1. Core Text: [5,6]
2. Further Reading: [4]
Topic 7. Machine Learning Algorithms in Practice
Types of machine learning algorithms. The limits of machine learning.
Classification using Nearest Neighbors algorithm: measuring similarity with distance, choosing
an appropriate number of neighbors, preparing data for use with k-NN. Examples of k-NN
algorithm.
Probabilistic learning using naïve Bayes approach: the basic idea, the Laplace estimator,
numerical features of the naïve Bayes approach. Examples (filtering out the spam).
Classification using decision trees and rules. Advantages and disadvantages of trees. Tree-based
classification and regression. Application area of tree-based methods. Trees versus linear models.
Divide and conquer algorithm. The 1R algorithm. The RIPPER algorithm. Boosting the accuracy
of decision trees, pruning the trees. Bagging classification. Random forests. The Gini index.
Fitting the decision trees.
Black box methods. Neural networks. Activation functions. Network topology. Training a model
on the data. Evaluating and improving model performance. Support vector machines.
Classification with hyperplanes (linearly and non-linearly separable data). Using kernels for non-
linear spaces.
Reading:
1. Core Text: [1]
2. Further Reading: [1-3]
7. Reading List
A. Core Texts
1. J. Verzani, Using R for Introductory Statistics, Second Edition, Chapman & Hall/CRC The
R Series, Taylor & Francis, 2014. URL https://books.google.ru/books?id=O86uAwAAQBAJ
2. Brian Everitt and Torsten Hothorn. An introduction to applied multivariate analysis with R.
Springer, New York, 2011. URL http://dx.doi.org/10.1007/978-1-4419-9650-3
3. F. Husson, S. Le, J. Pages, Exploratory Multivariate Analysis by Example Using R, Second
Edition, Chapman & Hall/CRC Computer Science & Data Analysis, CRC Press, 2017.
URL https://books.google.com/books?id=nLrODgAAQBAJ
4. Kim Seefeld and Ernst Linder, Statistics Using R with Biological Examples, preprint,
University of New Hampshire, 2007.
5. G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning: With
Applications in R, Springer Publishing Company, Inc., 2014.
6. Brett Lantz, Machine Learning with R, Second Edition, Packt Publishing, 2015.
B. Further reading
1. J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1): 81–106, 1986.
2. Anthony, M. & Bartlett, P. Neural Network Learning: Theoretical Foundations,
Cambridge University Press, 1999.
3. Barber, D. Bayesian reasoning and machine learning, Cambridge University
Press, 2012.
8. Assessment of student’s performance
Type of assessment | Means of assessment | 3 year, modules 1 2 3 4
Mid-term test (last week of module 1) | Test and practice assignment in computer lab | *
Final Exam | | *
8.1. Criteria of assessment
To pass the mid-term test, the students should be able to solve the problems that were discussed
in class. To pass the final exam, the students should demonstrate the knowledge of basic
concepts of multivariate data analysis and the ability to implement them in practice.
8.2. Topics suggested for the mid-term test
Computer-based problems on Topics 1-5.
8.3. Sample concept questions for final exam
1. Describe the idea behind principal component analysis.
2. Implementation of correspondence analysis.
3. Algorithm of multiple correspondence analysis.
4. Multidimensional scaling: the basic idea, data preparation, algorithm and interpretation.
5. Explain the difference between agglomerative hierarchical clustering and K-means algorithm.
6. Probabilistic clustering: the idea and implementation. Gaussian mixture model.
7. Expectation maximization algorithm.
8. Markov processes and their properties.
9. Monte Carlo simulations of distributions. Inverse CDF method.
10. The Gibbs sampler.
11. The Metropolis-Hastings algorithm.
12. The Lasso regression.
13. Describe the procedure of cross validation.
14. Applications of bootstrapping.
15. Methods of dimension reduction in regression: principal component regression, partial least
squares.
16. Regression splines.
17. Smoothing splines.
18. Generalized additive models for regression.
19. Generalized additive models for classification.
20. Nearest Neighbors algorithm for classification.
21. Probabilistic learning: naïve Bayes approach.
22. Decision trees.
23. Basic concepts of neural networks.
24. Support vector machine algorithms.
The final exam is computer-based and lasts 90 minutes. The assignment consists of one
concept question and two practice tasks.
8.3. Sample practice assignments for final exam
1. Get the distances between 10-12 Russian cities/towns. You can retrieve this information at
https://www.avtodispetcher.ru/distance/table/c172-rossiya/
• Perform classical multidimensional scaling using the command cmdscale (part of the stats
package distributed with R):
• Provide an MDS configuration of points whose pairwise Euclidean distances are
close to the original dissimilarities. Clearly state the dimensionality you use.
• Based on the computed eigenvalues, discuss the quality of representation in a low
dimensional space.
• Plot a two-dimensional map representing the towns. Compare the result with the actual
geographical distribution of the towns over the country.
• Plot the Shepard diagram and discuss it.
• Check whether the MDS configuration you obtained does restore the original distances
in a sufficiently high dimensional space.
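The steps of this MDS assignment can be sketched in R roughly as follows. This is a minimal sketch, not a model solution: it uses the built-in eurodist road distances between 21 European cities as a stand-in for the Russian city distances (an assumption made so that the example runs without downloading data). The functions cmdscale and dist and the eurodist data all ship with R; the variable names are illustrative.

```r
# Stand-in data (assumption): road distances between 21 European cities.
# Replace 'd' with a 'dist' object built from your own distance table.
d <- eurodist

# Classical MDS in two dimensions; eig = TRUE also returns the eigenvalues
fit <- cmdscale(d, k = 2, eig = TRUE)

# Quality of the low-dimensional representation, based on the eigenvalues
fit$GOF
barplot(fit$eig, ylab = "Eigenvalue", main = "Eigenvalues of classical MDS")

# Two-dimensional map of the cities
coords <- fit$points
plot(coords[, 1], coords[, 2], type = "n", asp = 1,
     xlab = "Dimension 1", ylab = "Dimension 2")
text(coords[, 1], coords[, 2], labels = rownames(coords), cex = 0.7)

# Shepard diagram: original dissimilarities against fitted distances
d_fit <- dist(coords)
plot(as.vector(d), as.vector(d_fit),
     xlab = "Original dissimilarity", ylab = "Distance in the MDS configuration")
abline(0, 1, lty = 2)

# Check in a higher dimension: keep every dimension with a positive eigenvalue
k_pos <- sum(fit$eig > 1e-8 * max(fit$eig))
fit_hi <- cmdscale(d, k = k_pos)
max_err <- max(abs(as.vector(dist(fit_hi)) - as.vector(d)))
max_err
```

Road distances are generally not Euclidean, so some eigenvalues may be negative and the original distances need not be restored exactly even when all positive eigenvalues are kept; max_err gives the size of the remaining discrepancy.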
2. Get the multivariate data. You have at least three options:
• Go to the JSE archive http://ww2.amstat.org/publications/jse/jse_data_archive.htm .
• If you are unsatisfied with the data from the previous source, or these have already been
picked by your classmates, visit https://www.census.gov/data/tables/2015/econ/asm/2015-asm.html
or, more generally, https://www.data.gov/. However, some minor research and
preprocessing may be needed here to get a meaningful and compact dataset.
• Suggest your own dataset from some other source. Free sources of data are listed here
http://guides.emich.edu/data/free-data. Some research is needed.
3. Use the FactoMineR package to conduct the principal component analysis:
(a) Plot the individuals on the plane corresponding to the first principal components (PCs).
Comment on the resulting cloud.
(b) Justify the choice of the PCs by plotting the eigenvalues. Calculate how much
of the total variability is explained by the first two PCs.
(c) Discuss the quality of the PCA representation: provide cos2 and the contributions for
each individual.
(d) Rebuild the cloud of individuals in two other PCs coordinates. Justify the choice
of those PCs and compare the cloud with that of (a).
(e) If there are categorical variables, paint the individuals with different colors according
to the categories. Draw the confidence ellipses and interpret them.
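The quantities requested in the PCA assignment (eigenvalues, share of explained variability, cos2 and contributions of individuals) can be sketched in base R as follows. This is a minimal sketch under stated assumptions: it uses prcomp and the built-in USArrests data as a stand-in dataset so that it runs without extra packages. In FactoMineR the corresponding call is PCA(X, graph = FALSE), which reports the same quantities in the $eig, $ind$cos2 and $ind$contrib components of its result.

```r
# Stand-in data (assumption): built-in USArrests, 50 individuals x 4 variables
X <- USArrests

# PCA on standardised variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# (b) Eigenvalues and percentage of variability explained by each PC
eig <- pca$sdev^2
pct <- 100 * eig / sum(eig)
barplot(eig, names.arg = paste0("PC", seq_along(eig)), ylab = "Eigenvalue")
pct12 <- sum(pct[1:2])   # share explained by the first two PCs, in percent

# (a) Cloud of individuals on the first principal plane
scores <- pca$x
plot(scores[, 1], scores[, 2], type = "n", asp = 1,
     xlab = sprintf("PC1 (%.1f%%)", pct[1]),
     ylab = sprintf("PC2 (%.1f%%)", pct[2]))
text(scores[, 1], scores[, 2], labels = rownames(scores), cex = 0.6)

# (c) Quality of representation (cos2) and contributions (in percent)
cos2 <- scores^2 / rowSums(scores^2)                         # rows sum to 1
contrib <- 100 * sweep(scores^2, 2, colSums(scores^2), "/")  # columns sum to 100

# cos2 of each individual on the plane (PC1, PC2) and its contributions
head(round(cbind(cos2 = rowSums(cos2[, 1:2]), contrib[, 1:2]), 2))
```

An individual with a cos2 close to 1 on the chosen plane is well represented there, whereas a low cos2 means that its position on the plot should be interpreted with caution.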
8.4. Sample home assignment
Assignment on Multiple Correspondence Analysis
1. Get the multivariate data. You have at least three options:
• Use the following links as an example https://www.flysfo.com/media/customer-survey-data
or https://data.qld.gov.au/dataset/customer-satisfaction-survey-2015 .
• Do your own search on the Internet using, for example, the keywords “customer satisfaction
survey dataset”.
• Compose meaningful survey data on your own. You may use templates like those at
https://www.surveymonkey.com/mp/survey-templates/ or
https://www.questionpro.com/survey-templates/ and fill them out with the answers of imaginary
individuals. Describe that imaginary survey.
Remember that:
• There MUST be at least 5 questions (=variables) with a number of answers (=categories).
• You MUST consider some of the variables as supplementary (typically, gender, age,
profession, etc.).
Clearly specify the data you have chosen in your report.
2. Use the FactoMineR package:
(a) Conduct the MCA. Visualize individuals, categories (both active and supplementary), see
Section 3.6 of [1].
(b) Provide a detailed interpretation of the obtained patterns. Focus on variability of
individuals and categories, comment on the extreme cases.
(c) Comparing the graphs for individuals and categories, study the links between certain
individuals and the categories, see pp. 159-160 of [1].
(d) Provide a table of eigenvalues, comment on the values of the largest ones and justify
the choice of principal components. Do you need to look at the PCs other than the
first two?
(e) Apply the command dimdesc for an automatic description of the dimensions by the
categorical variables or the categories.
(f) Draw the confidence ellipses around the categories and interpret the results, p.147 of [1].
See pp. 155-166 of [1] for the examples.
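The MCA steps (a) to (f) can be sketched in R as follows. This is a minimal sketch, assuming that the FactoMineR package is installed, and it uses the tea survey data shipped with that package as a stand-in for the student's own survey. The choice of supplementary variables follows the example on the package's MCA help page and is not part of this assignment.

```r
library(FactoMineR)   # must be installed beforehand: install.packages("FactoMineR")

# Stand-in data (assumption): the 'tea' survey, 300 individuals x 36 variables
data(tea)

# (a) MCA with column 19 as a supplementary quantitative variable and
# columns 20 to 36 as supplementary categorical variables
res <- MCA(tea, quanti.sup = 19, quali.sup = 20:36, graph = FALSE)

# Individuals only, then active categories only
plot(res, invisible = c("var", "quali.sup", "quanti.sup"), cex = 0.7)
plot(res, invisible = c("ind", "quali.sup", "quanti.sup"), cex = 0.8)

# (d) Eigenvalues with percentages of explained inertia
head(res$eig)

# (e) Automatic description of the first two dimensions
dimdesc(res, axes = 1:2)

# (f) Confidence ellipses around the categories of the first four variables
plotellipses(res, keepvar = 1:4)
```

With your own data, the indices passed to quanti.sup and quali.sup must be replaced by the columns you decided to treat as supplementary.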
References
[1] F. Husson, S. Lê, J. Pagès, Exploratory Multivariate Analysis by Example Using R, Second
Edition, Chapman & Hall/CRC Computer Science & Data Analysis, CRC Press, 2017.
URL https://books.google.com/books?id=nLrODgAAQBAJ
9. Grading
The final grade $O_{fin}$ is given by
$$O_{fin} = 0.7\,O_{accm} + 0.3\,O_{exam},$$
where $O_{accm}$ is the grade accumulated over two modules and $O_{exam}$ is the grade for the final
exam. The accumulated grade $O_{accm}$ is calculated as follows:
$$O_{accm} = 0.6\,O_{HA} + 0.4\,O_{MT},$$
where $O_{HA}$ and $O_{MT}$ are the grades for the home assignments and the mid-term test,
respectively.
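As a worked illustration of the grading rule (weights 0.6 and 0.4 for the accumulated grade, 0.7 and 0.3 for the final grade), the following R sketch computes both grades for a hypothetical student. The grade values and the function names are assumptions made for illustration only; the syllabus does not state a rounding rule, so none is applied.

```r
# Accumulated grade: 0.6 * home assignments + 0.4 * mid-term test
accumulated_grade <- function(o_ha, o_mt) {
  0.6 * o_ha + 0.4 * o_mt
}

# Final grade: 0.7 * accumulated grade + 0.3 * final exam
final_grade <- function(o_ha, o_mt, o_exam) {
  0.7 * accumulated_grade(o_ha, o_mt) + 0.3 * o_exam
}

# Hypothetical grades (illustrative values, not taken from the syllabus)
o_accm <- accumulated_grade(o_ha = 8, o_mt = 6)          # 0.6*8 + 0.4*6 = 7.2
o_fin  <- final_grade(o_ha = 8, o_mt = 6, o_exam = 7)    # 0.7*7.2 + 0.3*7 = 7.14
```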
10. Software and Technical Tools
R, RStudio, Python, Matlab, Microsoft Excel
11. Recommendations for instructors
In general, lectures should give insight into the concepts and ideas underlying the topic under
review. The theoretical core of presentation should be preceded and followed up by clear
examples. The lecture slides may contain pieces of (quasi) code illustrating implementation of
the algorithms in some programming language (presumably, in R). It is highly recommended to
provide students with the lecture slides prior to the lecture so that they could familiarize
themselves with the material in advance and prepare some questions. The lecturer should refer
the students for technicalities to the recommended textbooks, reviews and papers as needed
throughout the presentation.
Practice classes play a key role in delivering the course. The instructor should focus on the
implementation of data analysis algorithms on computers. The difficult tasks should be
discussed and worked out together with students. The tasks being discussed should be close to
those of the home assignment so that students can solve similar problems on their own. The students
are supposed to prepare a report on a particular home assignment and submit it to the instructor
electronically or in paper form. Some requirements for these reports may be set, e.g.:
The questions should be addressed in the same order they appear in the assignment. The
text of the question must be retained and placed before each answer. The working
language is English.
The answer to a particular question may take a form of a plot, formula etc followed by a
brief explanation and a conclusion. All conclusions must be justified numerically, i.e., by
some computed quantities, plots, etc. The answers do not need to be lengthy but they
must be convincing in mathematical and statistical sense, i.e., in terms of some
quantitative measures.
Each student must use a unique data set. It is the student’s responsibility to make sure that
no one else is using the same data. To facilitate the distribution of datasets among the
students, the instructor can create an editable shared check-in list on Google Drive or
some other cloud resource.
The deadlines for the reports should be clearly specified.
The instructor should notify the students about the penalties for late submission of the
reports.
The solutions should normally contain code in R or some other language.
It is good practice to suggest some datasets to the students for the home assignments. For example,
a great amount of market data can be found at Yahoo Finance, Google Finance, Federal Reserve
Economic Data repository http://research.stlouisfed.org/fred2/ and so on. Other possible data
sources include the JSE archive http://ww2.amstat.org/publications/jse/jse_data_archive.htm, a
huge repository at https://www.data.gov/ and a list of freely available sources at
http://guides.emich.edu/data/free-data. Remarkably, most of these data can be downloaded in R
directly by using the respective functions which should be pointed out to students.
12. Recommendations for students
When completing homework assignments, first read the lecture slides and the recommended
textbook. Then think a little and try some problems and then read and think some more. This
procedure should be iterated until the problem becomes clear. You should not spend much time
on pure reading with no practice but, at the same time, you should not tackle a problem without
understanding the underlying theory. Plan your timetable so that you do the homework shortly
after the lecture and/or practice class, while the basic ideas are still fresh in your mind.
Author of the program:
Associate professor Sergey V. Petropavlovsky