
National Research University «Higher School of Economics»

“Multidimensional Data Analysis Language R” – Course syllabus

Bachelor’s program 38.03.05 “Business informatics”


The Government of the Russian Federation

The Federal State Autonomous Institution of Higher Education “National Research University – Higher School of Economics”

Faculty of Business and Management

School of Business Informatics

Department for Management of Information Systems and Digital Infrastructure

Multidimensional Data Analysis Language R


Author: S.V. Petropavlovsky, associate professor


Approved at the meeting of the

Department for Management of Information Systems and Digital Infrastructure

«___»____________ 2017

Head of Department

_______________ / E.A. Isaev /

Approved by the Academic Council of Business Informatics

«___»____________ 2017

Chairman

_______________/ A.V. Dmitriev

Moscow, 2017

This document may not be used by other HSE departments or by other universities and educational institutions without permission from the course author.


1. Applicability and Normative References

This program sets out the contents of the course and describes the learning outcomes, competences and practical skills acquired on completion of the course. It also sets the prerequisites for taking the course and provides criteria for assessing students’ performance. The program is intended for instructors teaching the course, teaching assistants and undergraduate students following educational track 38.03.05 "Business Informatics", Bachelor’s level.

2. Course Objectives

The course provides the theoretical background of multivariate data analysis and aims to develop practical skills in data mining, processing and interpretation.

3. Course Description

"Multidimensional Data Analysis Language R" is an elective course taken in the 4th

academic year of the Bachelor’s program. The course gives insight into basic as well as more

advanced methods of handling multivariate data such as dimension reduction methods, cluster

analysis and Markov Chain Monte Carlo methods. A special section is devoted to modern

approaches for evaluating, selecting and regularizing the multivariate linear statistical models

such as cross validation, bootstrapping, the ridge and Lasso regression, etc. We also discuss

some non-linear models for classification and regression such as regression splines and

generalized additive models. The last part of the course covers the black-box methods of data

analysis. In particular, machine learning algorithms such as regression and classification trees,

support vector machines, neural networks and multivariate regression splines are introduced and

applied to spam detection and building a prototype of an on-line trading system.

The course has a practical focus, so the theoretical concepts are illustrated by applications in various fields such as computer science, engineering, economics, finance, marketing and the social sciences.

Students are expected to implement the algorithms in R throughout the course, although other tools may also be used, so a brief introduction to R is given at the very beginning. The duration of the course is two modules. The course is taught in English and is worth 4 credits.

4. Learning Outcomes and Competencies

At the end of the course, students should:

Be aware of:

the need, applicability and basic concepts of multivariate data analysis;

basic algorithms and techniques used in multivariate data analysis;

details of implementing the algorithms in R.

Be able:

to select the appropriate algorithms for multivariate data analysis as per the research

goals;

to collect and pre-process the raw data;

to interpret the results of the analysis and use them in the decision-making process.

Learn how to:

process the multivariate data using modern software;






present the results of the analysis.

Pre-requisites:

Programming (R is a plus but not essential), mathematics (algebra and calculus), probability

theory and statistics. Good command of English.

Competencies:

Competency | Code according to Federal standard/HSE | Descriptors – basic signs of mastering (indicators of achieving a result) | Ways and methods of teaching leading to formation and development of the competencies

Being able to explicate the scientific essence of problems in the professional field | СК-1 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments

Being able to solve problems in the professional field on the basis of analysis and synthesis | СК-Б4 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments

Being able to realize scientific and practical activities in international environment | СК-Б11 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments

Being able to control and develop the content of an enterprise and Internet resources, to control the processes of creating and using information services | ПК-13 | Mastering and using | Lectures, practice in computer labs, preparation of class and home assignments

Consulting with respect to the rational choice of methods and tools for controlling the IT-infrastructure of an enterprise | | |

Being able to use the relevant mathematical and technical tools for processing, analysis and systematization of data on the topic of research | | |

Being able to prepare scientific reports and presentations | | |


5. Role of the course in the curriculum

The course is part of the major (professional) block of disciplines. It is an elective course. The course builds on a number of preceding disciplines:

Calculus;

Linear Algebra;

Micro- and Macroeconomics, Marketing, Finance;

Probability Theory and Statistics;

Modeling of processes and systems.

The concepts and methods provided by the current course may be helpful in studying the

subsequent courses such as:

Analysis of business processes;

Fractal analysis of market data;

Semantic informational systems;

Quantitative methods of market forecasting.






6. Course Structure and Contents

6.1. Course Structure

№ | Topic | In-class hours: Lectures | In-class hours: Practice in computer labs | In-class hours: Total | Self-study | Total

 | 1st module | 16 | 16 | 32 | 40 | 72

1 | Introduction to R | 2 | 2 | 4 | 4 | 8

2 | Multivariate Data Handling and Visualization | 2 | 2 | 4 | 4 | 8

3 | Dimension Reduction Methods | 6 | 6 | 12 | 12 | 24

4 | Cluster Algorithms | 2 | 2 | 4 | 8 | 12

5 | Markov Chain Monte Carlo Methods | 4 | 4 | 8 | 12 | 20

 | 2nd module | 16 | 16 | 32 | 40 | 72

6 | Cross Validation and Linear Model Selection | 8 | 8 | 16 | 20 | 36

7 | Machine Learning Algorithms in Practice | 8 | 8 | 16 | 20 | 36

 | Total | 32 | 32 | 64 | 80 | 144

6.2. Syllabus

Topic 1. Introduction to R.

Data objects in R, installing and using packages. Loading data from local files and on-line

databases. Plotting data in R. Advanced graphics. Time series objects. Overview of basic

statistics in R. Major programming constructs: conditional operators, loops, functions.

Reading:

1. Core Text: [1]

2. Further Reading: [2]

Topic 2. Multivariate Data Handling and Visualization

Multivariate normal distribution. Testing multivariate normality (chi^2 QQ-plots). Scatter plots,

imposing marginal distributions. Bivariate boxplots. The convex hull of the bivariate data.

Removing outliers. The bubble and glyph plots and their interpretation. Analysis of the scatter

plot matrix.






Reading:

1. Core Text: [2]


Topic 3. Dimension Reduction Methods

Principal component analysis (PCA). Geometrical view on data. Cloud of individuals and cloud

of variables. Rotating the frame and optimal projecting. PCA through diagonalization of the

covariance matrix. PCA through the singular value decomposition. Coordinates of individuals

and variables in the reduced basis. Quality of projecting. Interpretation. Simultaneous analysis of

individuals and variables. Demonstrations in R.

Correspondence analysis (CA). Data for the CA. chi^2 tests for association between categorical

variables. Geometrical view: chi^2 metric. Row and column profiles. Implementation of the CA.

Quality of dimension reduction. Link between row and column representations. Demonstrations

in R.

Multiple CA (MCA). Data for the MCA. Indicator matrix. Distances between individuals and

categories. Implementation of the MCA. Numerical indicators of quality representation.

Demonstrations in R.

Multidimensional scaling (MDS). Data for the MDS: dissimilarity matrices. Goals of

multidimensional scaling. Computing dissimilarities: Euclidean versus non-Euclidean distances.

Classical multidimensional scaling. Metric and non-metric MDS. Goodness-of-fit measures for

the metric MDS. Shepard’s diagrams. Distance scaling. Issues of the non-metric MDS.

Interpretation of the MDS analysis. Embedding external variables.

Reading:

1. Core Text: [2,3]


Topic 4. Cluster Algorithms

Cluster algorithms. Distances between clusters of observations (linkage).






Agglomerative hierarchical clustering (AHC). Constructing an indexed hierarchy. Ward’s

algorithm. Quality of partition. Agglomeration according to inertia. Properties of the

agglomeration criterion. Impact of different linkage types on the performance of the AHC.

Direct search for partitions. K-means and K-medoids approaches.

Probabilistic clustering. Gaussian mixture model (GMM). Expectation maximization algorithm.

Clustering and principal component methods.

Reading:

1. Core Text: [2]


Topic 5. Markov Chain Monte Carlo Methods

Goals of Markov Chain Monte Carlo (MCMC). Markov processes. Properties of Markov chains

(finiteness, aperiodicity, irreducibility, ergodicity, mixing, etc). The stationary state of the chain.

Monte Carlo simulations of distributions. Inverse CDF method. Rejection sampling. The Gibbs

sampler. The Metropolis-Hastings algorithm. Issues in chain efficacy. MCMC implementation in

R and examples. Applications of MCMC: modeling S&P500 index.

Reading:

1. Core Text: [4]


Topic 6. Cross Validation and Linear Model Selection

Cross validation and bootstrapping. The idea and applications. The validation set approach.

Leave-One-Out cross validation. k-fold cross validation. Bias-variance trade-off for k-fold cross

validation. Cross validation on classification problems. Bootstrapping.

Linear model selection and regularization. Best subset selection. Stepwise selection. Choosing

the optimal model. Shrinkage methods: ridge regression, the Lasso, selecting the tuning

parameter.

Dimension reduction methods in regression. Principal components regression, partial least

squares.






Regression splines. Piecewise polynomials. Constraints and splines. The spline basis

representation. Choosing the number and locations of knots. Comparison to polynomial

regression. Smoothing splines. Choosing the smoothing parameter.

Generalized additive models. GAMs for regression and classification problems.

Reading:

1. Core Text: [5,6]


Topic 7. Machine Learning Algorithms in Practice

Types of machine learning algorithms. The limits of machine learning.

Classification using Nearest Neighbors algorithm: measuring similarity with distance, choosing

an appropriate number of neighbors, preparing data for use with k-NN. Examples of k-NN

algorithm.

Probabilistic learning using naïve Bayes approach: the basic idea, the Laplace estimator,

numerical features of the naïve Bayes approach. Examples (filtering out the spam).

Classification using decision trees and rules. Advantages and disadvantages of trees. Tree-based

classification and regression. Application area of tree-based methods. Trees versus linear models.

Divide and conquer algorithm. The 1R algorithm. The RIPPER algorithm. Boosting the accuracy

of decision trees, pruning the trees. Bagging classification. Random forests. The Gini index.

Fitting the decision trees.

Black box methods. Neural networks. Activation functions. Network topology. Training a model

on the data. Evaluating and improving model performance. Support vector machines.

Classification with hyperplanes (linearly and non-linearly separable data). Using kernels for non-linear spaces.

Reading:

1. Core Text: [1]

2. Further Reading: [1-3]






7. Reading List

A. Core Texts

1. J. Verzani, Using R for Introductory Statistics, Second Edition, Chapman & Hall/CRC The

R Series, Taylor & Francis, 2014. URL https://books.google.ru/books?id=O86uAwAAQBAJ

2. Brian Everitt and Torsten Hothorn. An introduction to applied multivariate analysis with R.

Springer, New York, 2011. URL http://dx.doi.org/10.1007/978-1-4419-9650-3

3. F. Husson, S. Lê, J. Pagès, Exploratory Multivariate Analysis by Example Using R, Second

Edition, Chapman & Hall/CRC Computer Science & Data Analysis, CRC Press, 2017.

URL https://books.google.com/books?id=nLrODgAAQBAJ

4. Kim Seefeld and Ernst Linder, Statistics Using R with Biological Examples, preprint,

University of New Hampshire, 2007.

5. G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning: With

Applications in R, Springer Publishing Company, Inc., 2014.

6. Brett Lantz, Machine Learning with R, Second Edition, Packt Publishing, 2015.

B. Further reading

1. J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1): 81–106, 1986.

2. M. Anthony and P. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, 1999.

3. D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.

8. Assessment of student’s performance

Type of assessment | Means of assessment | 3 year (modules 1 2 3 4)

Mid-term test (last week of module 1) | Test and practice assignment in computer lab | *

Final Exam | | *



8.1. Criteria of assessment

To pass the mid-term test, students should be able to solve problems of the kind discussed in class. To pass the final exam, students should demonstrate knowledge of the basic concepts of multivariate data analysis and the ability to apply them in practice.

8.2. Topics suggested for the mid-term test

Computer-based problems on Topics 1-5.

8.3. Sample concept questions for final exam

1. Describe the idea behind principal component analysis.

2. Implementation of correspondence analysis.

3. Algorithm of multiple correspondence analysis.

4. Multidimensional scaling: the basic idea, data preparation, algorithm and interpretation.

5. Explain the difference between agglomerative hierarchical clustering and K-means algorithm.

6. Probabilistic clustering: the idea and implementation. Gaussian mixture model.

7. Expectation maximization algorithm.

8. Markov processes and their properties.

9. Monte Carlo simulations of distributions. Inverse CDF method.

10. The Gibbs sampler.

11. The Metropolis-Hastings algorithm.

12. The Lasso regression.

13. Describe the procedure of cross validation.

14. Applications of bootstrapping.

15. Methods of dimension reduction in regression: principal component regression, partial least

squares.

16. Regression splines.

17. Smoothing splines.

18. Generalized additive models for regression.

19. Generalized additive models for classification.

20. Nearest Neighbors algorithm for classification.

21. Probabilistic learning: naïve Bayes approach.

22. Decision trees.

23. Basic concepts of neural networks.

24. Support vector machine algorithms.

The final exam is computer-based and lasts 90 minutes. The assignment consists of one concept question and two practice tasks.

8.3. Sample practice assignments for final exam

1. Get the distances between 10-12 Russian cities/towns. You can retrieve this information at https://www.avtodispetcher.ru/distance/table/c172-rossiya/

Perform classical multidimensional scaling using the cmdscale command (part of the stats package shipped with R):

• Provide the MDS configuration of points whose Euclidean distances are close to the original dissimilarities. Clearly state the dimensionality you use.

• Based on the computed eigenvalues, discuss the quality of representation in a low-dimensional space.

• Plot a two-dimensional map representing the towns. Compare the result with the actual geographical distribution of the towns over the country.

• Plot the Shepard diagram and discuss it.

• Check whether the MDS configuration you obtained restores the original distances in a sufficiently high-dimensional space.
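The computation requested in this assignment can be illustrated with a small sketch. The assignment itself is to be done with cmdscale in R; the Python code below (Python is among the tools listed in Section 10) only shows what classical MDS does internally: it double-centers the squared distances, takes the leading eigenpairs and scales the eigenvectors. The function names, the power-iteration eigen-solver and the four-point rectangle used as data are illustrative assumptions, not part of the assignment.

```python
import math


def double_center(dist):
    """Return B = -1/2 * J * D^2 * J for a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]


def top_eigenpairs(matrix, k, iterations=500):
    """Leading k eigenpairs of a symmetric positive semi-definite matrix.

    Uses power iteration with deflation, which is adequate for a small
    illustration but is not what a production library would use.
    """
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for m in range(k):
        # Deterministic, non-constant start vector (an arbitrary choice).
        v = [1.0 + 0.1 * ((i + m) % n) for i in range(n)]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(n))
                  for i in range(n))
        pairs.append((lam, v))
        # Deflation: remove the component that was just found.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def classical_mds(dist, k=2):
    """Return k-dimensional coordinates and the share of the trace captured."""
    n = len(dist)
    b = double_center(dist)
    pairs = top_eigenpairs(b, k)
    coords = [[math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
              for i in range(n)]
    # The trace of B equals the sum of all eigenvalues; when none of them is
    # negative, this ratio is the proportion of variation kept in k dimensions.
    trace = sum(b[i][i] for i in range(n))
    fit = sum(lam for lam, _ in pairs) / trace
    return coords, fit


# Toy data: corners of a 3 x 4 rectangle, so two dimensions suffice.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points] for p in points]
coords, fit = classical_mds(dist, k=2)
restored = [[math.hypot(a[0] - b[0], a[1] - b[1]) for b in coords]
            for a in coords]
print(fit)
```

Comparing `restored` with `dist` is the same check as the last bullet of the assignment. Road distances between real towns are not exactly Euclidean, so for the actual data some eigenvalues may be negative and the fit will generally fall below one.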

2. Get the multivariate data. You have at least three options:

• Go to the JSE archive http://ww2.amstat.org/publications/jse/jse_data_archive.htm .

• If you are unsatisfied with the data from the previous source, or the datasets there have already been taken by your classmates, visit https://www.census.gov/data/tables/2015/econ/asm/2015-asm.html or, more generally, https://www.data.gov/. However, some minor research and preprocessing may be needed here to get a meaningful and compact dataset.

• Suggest your own dataset from some other source. Free sources of data are listed at http://guides.emich.edu/data/free-data. Some research is needed.

3. Use the FactoMineR package to conduct the principal component analysis:

(a) Plot the individuals on the plane corresponding to the first principal components (PCs). Comment on the resulting cloud.

(b) Justify the choice of the PCs by plotting the eigenvalues. Calculate how much of the total variability is explained by the first two PCs.

(c) Discuss the quality of the PCA representation: provide cos2 and the contributions for each individual.

(d) Rebuild the cloud of individuals in the coordinates of two other PCs. Justify the choice of those PCs and compare the cloud with that of (a).

(e) If there are categorical variables, paint the individuals with different colors according to the categories. Draw the confidence ellipses and interpret them.
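The quantities asked for in items (b) and (c) can be sketched as follows. In the assignment they are read from the FactoMineR output; the Python sketch below only shows how they are defined, assuming equal weights for all individuals. The function names and the three-individual coordinate table are hypothetical, and the toy data have only two axes, so the squared distance to the centroid is computed over both of them.

```python
def explained_share(eigenvalues, k):
    """Share of the total variance carried by the first k principal components."""
    return sum(eigenvalues[:k]) / sum(eigenvalues)


def cos2(coords):
    """Squared cosine of each individual with each axis (quality of representation)."""
    result = []
    for row in coords:
        dist2 = sum(c * c for c in row)  # squared distance to the centroid
        result.append([c * c / dist2 if dist2 else 0.0 for c in row])
    return result


def contributions(coords):
    """Share of each axis's inertia due to each individual (equal weights assumed)."""
    n_axes = len(coords[0])
    totals = [sum(row[s] ** 2 for row in coords) for s in range(n_axes)]
    return [[row[s] ** 2 / totals[s] for s in range(n_axes)] for row in coords]


# Hypothetical principal coordinates of three individuals on two axes.
coords = [[2.0, 0.0], [-1.0, 1.0], [-1.0, -1.0]]
n = len(coords)
# With equal weights, the eigenvalue of an axis is the variance of its coordinates.
eigenvalues = [sum(row[s] ** 2 for row in coords) / n for s in range(2)]
share_first = explained_share(eigenvalues, 1)
quality = cos2(coords)
contrib = contributions(coords)
print(share_first, quality, contrib)
```

An individual with cos2 close to one on the displayed plane is well represented there, while a large contribution marks an individual that strongly shapes the axis.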

8.4. Sample home assignment

Assignment on Multiple Correspondence Analysis

1. Get the multivariate data. You have at least three options:

• Use the following links as examples: https://www.flysfo.com/media/customer-survey-data or https://data.qld.gov.au/dataset/customer-satisfaction-survey-2015 .

• Do your own search on the Internet using, for example, the keywords “customer satisfaction survey dataset”.

• Compose meaningful survey data on your own. You may use templates like those at https://www.surveymonkey.com/mp/survey-templates/ or https://www.questionpro.com/survey-templates/ and fill them out with the answers of imaginary individuals. Describe that imaginary survey.

Remember that:

• There MUST be at least 5 questions (= variables), each with a number of answers (= categories).

• You MUST treat some of the variables as supplementary (typically gender, age, profession, etc.).

Clearly specify the data you have chosen in your report.
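To make the "questions = variables, answers = categories" structure concrete, the sketch below builds the indicator matrix on which MCA operates (see Topic 3): one 0/1 column per category, one row per respondent. It is a Python illustration only; the function name and the tiny two-question survey are hypothetical, and a real submission needs at least five questions.

```python
def indicator_matrix(answers):
    """Build the 0/1 indicator table for categorical survey answers.

    `answers` is a list of dicts mapping question name -> chosen category.
    Returns (columns, rows), where columns are (question, category) pairs.
    """
    questions = sorted({q for row in answers for q in row})
    columns = [(q, cat)
               for q in questions
               for cat in sorted({row[q] for row in answers})]
    rows = [[1 if row[q] == cat else 0 for q, cat in columns]
            for row in answers]
    return columns, rows


# Hypothetical survey with two questions and three respondents.
survey = [
    {"service": "good", "price": "high"},
    {"service": "poor", "price": "high"},
    {"service": "good", "price": "low"},
]
columns, table = indicator_matrix(survey)
for label, column in zip(columns, zip(*table)):
    print(label, column)
```

Every row of the table sums to the number of questions, because each respondent picks exactly one category per question.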

2. Use the FactoMineR package:



(a) Conduct the MCA. Visualize the individuals and the categories (both active and supplementary), see Section 3.6 of [1].

(b) Provide a detailed interpretation of the obtained patterns. Focus on the variability of individuals and categories, and comment on the extreme cases.

(c) Comparing the graphs for individuals and categories, study the links between certain individuals and the categories, see pp. 159-160 of [1].

(d) Provide a table of eigenvalues, comment on the values of the largest ones and justify the choice of principal components. Do you need to look at PCs other than the first two?

(e) Apply the command dimdesc for an automatic description of the dimensions by the categorical variables or the categories.

(f) Draw the confidence ellipses around the categories and interpret the results, see p. 147 of [1].

See pp. 155-166 of [1] for examples.

References

[1] F. Husson, S. Lê, J. Pagès, Exploratory Multivariate Analysis by Example Using R, Second

Edition, Chapman & Hall/CRC Computer Science & Data Analysis, CRC Press, 2017.

URL https://books.google.com/books?id=nLrODgAAQBAJ

9. Grading

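The final grade $O_{\mathrm{fin}}$ is given by the formula

$$O_{\mathrm{fin}} = 0.7\,O_{\mathrm{accm}} + 0.3\,O_{\mathrm{exam}},$$

which combines the grade $O_{\mathrm{accm}}$ accumulated over the two modules and the grade $O_{\mathrm{exam}}$ for the final exam. The accumulated grade $O_{\mathrm{accm}}$ is calculated as follows:

$$O_{\mathrm{accm}} = 0.6\,O_{\mathrm{HA}} + 0.4\,O_{\mathrm{MT}},$$

where $O_{\mathrm{HA}}$ and $O_{\mathrm{MT}}$ are the grades for the home assignments and the mid-term test, respectively.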
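As a worked illustration of the grading rule, the sketch below reads it as two weighted sums: the accumulated grade is 0.6 times the home-assignment grade plus 0.4 times the mid-term grade, and the final grade is 0.7 times the accumulated grade plus 0.3 times the exam grade. The function names and the input grades 8, 6 and 9 are hypothetical, and no rounding is applied because the syllabus does not specify a rounding rule.

```python
def accumulated_grade(home_assignments, midterm):
    """Accumulated grade: 0.6 * home assignments + 0.4 * mid-term test."""
    return 0.6 * home_assignments + 0.4 * midterm


def final_grade(home_assignments, midterm, exam):
    """Final grade: 0.7 * accumulated grade + 0.3 * final exam."""
    return 0.7 * accumulated_grade(home_assignments, midterm) + 0.3 * exam


# Hypothetical grades: home assignments 8, mid-term test 6, final exam 9.
accumulated = accumulated_grade(8, 6)
example = final_grade(home_assignments=8, midterm=6, exam=9)
print(round(accumulated, 2), round(example, 2))
```

With these inputs the accumulated grade is 0.6 · 8 + 0.4 · 6 = 7.2 and the final grade is 0.7 · 7.2 + 0.3 · 9 = 7.74. The effective weights of the three components in the final grade are therefore 0.42 for home assignments, 0.28 for the mid-term test and 0.30 for the exam.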

10. Software and Technical Tools

R, RStudio, Python, Matlab, Microsoft Excel

11. Recommendations for instructors

In general, lectures should give insight into the concepts and ideas underlying the topic under review. The theoretical core of the presentation should be preceded and followed by clear examples. The lecture slides may contain pieces of (pseudo)code illustrating the implementation of the algorithms in some programming language (presumably R). It is highly recommended to provide students with the lecture slides before the lecture so that they can familiarize themselves with the material in advance and prepare questions. The lecturer should refer the students to the recommended textbooks, reviews and papers for technical details as needed throughout the presentation.

Practice classes play the key role in the course. The instructor should focus on the implementation of data analysis algorithms on computers. Difficult tasks should be discussed and worked out together with the students. The tasks discussed should be close to those of the home assignments so that students can solve similar problems on their own. The students are expected to prepare a report on each home assignment and submit it to the instructor electronically or in paper form. Some requirements for these reports may be set, e.g.:

• The questions should be addressed in the same order as they appear in the assignment. The text of the question must be retained and placed before each answer. The working language is English.

• The answer to a particular question may take the form of a plot, formula, etc., followed by a brief explanation and a conclusion. All conclusions must be justified numerically, i.e., by computed quantities, plots, etc. The answers do not need to be lengthy, but they must be convincing in a mathematical and statistical sense, i.e., in terms of quantitative measures.

• Each student must use a unique data set. It is the student’s responsibility to make sure that no one else is using the same data. To facilitate the distribution of datasets among the students, the instructor can create an editable shared check-in list on Google Drive or some other cloud resource.

• The deadlines for the reports should be clearly specified.

• The instructor should notify the students about the penalties for late submission of the reports.

• The solutions should normally contain code in R or some other language.

It is good practice to suggest some datasets to the students for the home assignments. For example, a large amount of market data can be found at Yahoo Finance, Google Finance, the Federal Reserve Economic Data repository http://research.stlouisfed.org/fred2/ and so on. Other possible data sources include the JSE archive http://ww2.amstat.org/publications/jse/jse_data_archive.htm, a huge repository at https://www.data.gov/ and a list of freely available sources at http://guides.emich.edu/data/free-data. Notably, most of these data can be downloaded directly into R using the respective functions, which should be pointed out to students.

12. Recommendations for students

When completing homework assignments, first read the lecture slides and the recommended textbook. Then think a little and try some problems, and then read and think some more. Repeat this procedure until the problem becomes clear. You should not spend much time on pure reading with no practice but, at the same time, you should not tackle a problem without understanding the underlying theory. Plan your timetable so that you do the homework shortly after the lecture and/or practice class, while the basic ideas are still fresh in your mind.

Author of the program:

Associate professor Sergey V. Petropavlovsky

