st790: introduction to data mining and machine …lu/st7901/lecture notes...1 principle and theory...

35
ST790: Introduction to Data Mining and Machine Learning Wenbin Lu Department of Statistics North Carolina State University Fall 2019 Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 1 / 34

Upload: others

Post on 20-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

ST790: Introduction to Data Mining and MachineLearning

Wenbin Lu

Department of StatisticsNorth Carolina State University

Fall 2019

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 1 / 34

Page 2: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Table of Contents

1 Introduction

2 Examples

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 2 / 34

Page 3: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Course Information

Course homepage:https://www.stat.ncsu.edu/people/lu/courses/ST7901/

Prerequisite:

ST701, ST702 and ST705 or equivalentProgramming language: R (recommended)

Textbooks: The Elements of Statistical Learning (electronic versionavailable at course website)

Reference Books:1 Principle and Theory for Data Mining and Machine Learning by Clark,

Forkoue, Zhang (2009)2 Pattern Recognition and Neural Networks by B. Ripley (1996)3 Learning with Kernels by Scholkopf and Smola (2000)4 Nature of Statistical Learning Theory by Vapnik (1998)

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 3 / 34

Page 4: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

What is Data Mining

the nontrivial process of identifying valid, novel, potentially useful,and ultimately understandable patterns in data

the process of extracting previously unknown, comprehensible, andactionable information from large databases and using it to makecrucial business decisions

a set of methods used in the knowledge discovery process

the process of discovering advantageous patterns in data

a decision support process where we look in large databases forunknown and unexpected patterns of information

......

DM is a process of discovering patterns and relationships in data, with anemphasis on large observational databases.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 4 / 34

Page 5: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Emerging Discipline, Why Now?

Driving Forces

explosive growth of data in a great variety of fields

revolution in biotechniques (microarray, GWAS, next generationsequencing)internet, network, search engines, digital images, multi-mediainformation

Rapidly increasing computer power

cheaper storage devices with higher capacity

faster communications; better database management systems

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 5 / 34

Page 6: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

What is Big Data

Wikipedia says,

“Big data is a broad term for data sets so large or complex thattraditional data processing applications are inadequate. ”

NSF Big Data Initiative (2012),

“from a scientific perspective, that scientists manage, analyze, visualize,and extract useful information from large, diverse, distributed and

heterogeneous data sets so as to accelerate the progress of scientificdiscovery and innovation”

Wall Street Journal (04-20-2012),

“from a business perspective, an enterprise mine all the data it collectsright across its operations to unlock golden nuggets of business

intelligence”

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 6 / 34

Page 7: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Big Data References

The Wall Street Journal, Article 11-29-2012, “Big Data is on the rise,bringing big questions”

The Wall Street Journal, Article 04-29-2012, “Big Data’s bigproblem: little talent”

McKinsey & Company, Report 05-2011, “Big Data: The next frontierfor innovation, competition, and productivity.”

Google search “Big Data” gives 0.8 billion results.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 7 / 34

Page 8: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Sizes of Big Data

The sizes of modern datasets are increasing faster than ever:

Bytes=8 bits, kilobyte= 103 bytes, Megabyte= 106 bytes,Gigabyte= 109 bytes, Terabyte= 1012 bytes,Petabyte= 1015 bytes, Exabyte= 1018 bytes,Zettabyte= 1021 bytes, Yottabyte = 1024 bytes, ...

2 Megabytes: A high resolution photograph

50 Terabytes: The contents of a large MASS Storage System

2 Petabytes: All US academic research libraries

5 Exabytes: All words ever spoken by human beings

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 8 / 34

Page 9: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Big Data Features and Challenges

Big data are featured with

massive volume, ultra-high dimension

complex structure, heterogeneous

multiple-source, multiple-type data

Big Data Challenges:

data storage, transfer [enormous memory, parallel processing].

data search, retrieval [at a high speed]

data cleaning, organization, visualization [present/display data in ameaningful way]

data analysis [gain deep insight about data]

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 9 / 34

Page 10: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Examples

Wal-Mart made > 20 million transactions daily, and constructed an11 terabyte database of customer transactions

AT&T had 100 million customers and carried on the order of 300million calls a day on its long distance network

molecular data: DNA copy-number alteration, mRNA expression,protein expression

images, network data, tweet data, ...

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 10 / 34

Page 11: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Role of Data Scientists

Data Science is an interdisciplinary field:

statistics machine Learning, computer science, mathematics, patternrecognition, signal Processing

Goal: to extract useful informationn from flood of data, tofind hidden patterns in data

feasible and fast computation. Example: Hadoop system is softwarefor distributed storage and processing of very large data sets oncomputer clusters.

develop new tools and algorithms for data analysis

provide theoretical foundations for learning algorithms

help researchers gain deeper understanding of cancer

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 11 / 34

Page 12: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Applications (I)

Business

Walmart data warehouse, credit card companies, bank data, stock data

Marketing

Given data on age, income, etc., predict the spending capacity of eachcustomer;Discover the relationship of customers’ spending behaviors;recommend products (example: diaper-beer)

Genomics

Human genome projects: DNA sequences; microarray data, SNP, nextgeneration sequence

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 12 / 34

Page 13: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 13 / 34

Page 14: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Applications (II)

Information Retrieval

search engine, text mining, document search.For example, given some key words in a document, determine its topicand content. (many words in a document, and many, many documentsavailable on the web).speech recognition, image analysis, multimedia information

Healthfare

personalized medicinecancer classification and treatment, given gene expressions and clinicalmeasurements.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 14 / 34

Page 15: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

What is Role of Statistics

Statistics machine learning plays a central role in data mining.

provide theoretical foundations for learning algorithms

give useful tools to analyze an algorithm’s statistical properties andperformance guarantee

help researchers gain deeper understanding of the approaches, designbetter algorithms, and select appropriate methods for a given problem

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 15 / 34

Page 16: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Examples of Learning Problems

Predict whether a patient, hospitalized due to a heart attack, willhave a second heart attack, based on demographic, diet and clinicalmeasurements (Regression and Classification)

Predict the price of a stock in 6 months, based on companyperformance measures and economic data (Regression)

Identify a handwritten ZIP code, from a digitized image(Classification)

Identify risk factors for prostate cancer, based on clinical, genetic anddemographic variables (Feature Selection)

Identify groups of genes with similar functions from DNA microarraydata (High Dimensional Clustering)

In fraud detection, what covariates are useful in building a model topredict the probability of being a fraud order? how to handlecovariate correlations? (Classification Problem)

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 16 / 34

Page 17: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Goals of This Course

understand different types of data mining problems

supervised, unsupervised, semi-supervised learningregression, classification, clustering, graphical modelstext mining, multimedia mining, image recognition, social networks,recommendation system, network models

learn basic statistical concepts and principles (which are favored overa black-box learning algorithm)

statistical inference on uncertainty, distribution, loss, riskmodel building, evaluation, selection, prediction, and optimality;bias-variance tradefoffparameter tuning, training error, test error, cross-validation,

learn statistical and machine learning methods for big data

SVM, PCA, lasso, boosting, tree, random forest

learn R software packages to analyze data

take into serious consideration scalable, parallel algorithms

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 17 / 34

Page 18: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

What is “Special” about this course

To combine the “art” of designing good learning algorithms and the“science” of analyzing statistical properties and performance of theapproaches

emphasize “statistical” principles behind the approaches andalgorithms

learn how to formulate a learning problem in a “statistical” framework

understand existing techniques from a “statistical” perspective; whatare the limitations and strengths? Can we do better by relaxingassumptions?

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 18 / 34

Page 19: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Terminologies

Statistics Data Mining

X predictor, covariate input, feature

Y response, outcome output

{(Xi ,Yi )}ni=1 data points, samples examples, training data

f̂ (X) model fitting machine learning

data analysis consistency, inference prediction, speedgoals confidence interval risk bound

convergence rate learning theory

Practical vs Reluctant to use Willing to use ad hocTheoretical methods without theoretical methods if they seemsConcerns justification (even if the to work well (though

justification is appearances may beactually meaningless) misleading)

Computing more and more important heavy

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 19 / 34

Page 20: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 20 / 34

Page 21: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Different Learning Problems

Typically, we collect data (Xi ,Yi ), i = 1, · · · , n.

Yi is the outcome or response variable.

Xi is the input or prediction variables.

Various Machine Learning Problems

Supervised learning (Y observed)

Unsupervised learning (Y unobserved)

Semi-supervised learning (Y partially available)

Various Statistical problems

Regression (Y quantitative, a numerical quantity)

Classification (Y qualitative; a class label)

Density estimation (no Y )

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 21 / 34

Page 22: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Introduction

Supervised Learning vs Unsupervised Learning

Supervised Learning Unsupervised Learning

Response Y observed unobserved

Major predict Y , given find interestingGoal the observed input X patterns in data

Examples linear regression clusteringnonparametric regression density estimation

classification dimension reduction

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 22 / 34

Page 23: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

Example 1: Email or Spam (Textbook Page 2)

Training data: 4, 601 email messages with known email type.

Input X: the relative frequencies of 57 of the most commonly occurringwords and punctuation marks in the email message.Outcome Y : −1 =email , + =spam

Goal: Design an automatic spam detector to predict whether amessage is email or spam. In the future, the detector can be used tofilter out the junk emails before clogging the users’ mailbox.

Supervised learning, binary classification, n > p. (data matrix n × p “longand narrow”)

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 23 / 34

Page 24: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

Table: Average percentage of words (the largest difference between spam and email

george you your hp free hpl ! our re edu removespam 0.00 2.26 1.38 0.02 0.52 0.01 0.51 0.51 0.13 0.01 0.28email 1.27 1.27 0.44 0.90 0.07 0.43 0.11 0.18 0.42 0.29 0.01

Possible classification rules:

if (george < 0.6) & (you > 1.5), then spam; otherwise email.

if (0.2 you -0.3 george)> 0, then spam; otherwise email.

Two types of decision errors:

(1) false positive: classify email to spam (filter out email)

(2) false negative: classify spam to email (email box jammed)

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 24 / 34

Page 25: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

Example 2: Prostate Cancer

Read Textbook page 3

Training data: 97 male patient with different stages of prostatecancer.

Input X: eight clinical measures: log-cancer volume (lcavol), logprostate weight (lweight), age, and the other five.

Outcome Y : the log of the level of prostate specific antigen (lpsa)

Questions of interest:

What is the relationship between lpsa and clinical measures?

Is the linear model sufficient? Nonlinear effect? Interactions?

Which clinical measures are more relevant to the prediction?

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 25 / 34

Page 26: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

ExamplesElements of Statisti al Learning Hastie, Tibshirani & Friedman 2001 Chapter 1lpsa

-1 1 3

oooooo ooo ooo ooooo oo o o oo oo ooo o ooo ooo ooo ooo oo ooo oo ooo oo o ooooooo o oooo ooo o ooo o ooooooo oooo ooo oo ooo

ooo o

o oo ooooooooooo oo ooo ooooooo oooo o oooo ooo ooo ooooo oooo ooo oo oo o oo ooo ooooooooo oo oo ooooo o oooooo oo ooo

oooo40 60 80

o o oooo ooo oooo oo o ooo oooo o oooooooo oo oo oo oo o ooo oo ooo oooo oo ooo oo oo oooo o oooo ooo oo o ooooo ooo oooo ooo oo o oo

oooooo o oooo ooooo ooo oo oo o ooo ooo ooooo ooo oooo oo oo ooo oo oooo oo oo o ooo oo o o oo oo oo oo o oooo ooo o oo ooo oo oooo oo

0.0 0.4 0.8

oooooooooooooooooooooooooooooooooooooo oooooooo ooooooooooooooo oo ooooooo oo oooooo oooo ooo oo oooo oo

oooo

oooooooooooo oo oo o oooo oo ooo oo o ooooo oo oo oooo o ooo ooo ooo o ooo ooooo oo oo o ooo o oo o o o ooo ooo oo oo oo o oooo o ooo o

6.0 7.5 9.0

oo oooooooooo oooo ooooo oo ooo ooooooooo o oooo ooo ooo oooo oo oooooo ooo o ooo oooo oooo oooooooooo oooo oooo oo

oooo

0123

45

oo oooooooooo oooo ooooo oo ooo oo oooooooooo oo oooo oo oooo oo oooooo o oo o ooo o oooo ooo oo ooo ooo oo oo oo o o oo o ooo oo

-10

12

34

oooo

o

o

oo

ooo

o

oooo

o

o

oooo

o

o

oooo

o

o

oo

o

oo

ooo

o

oooo

oooo

ooooo

o

oo

oooooo

ooooooo

o

ooooo

oo

oooo

oooo

o

o

oo

o

o

ooo

oooo

lcavol

o oo

o

o

o

oo

ooo

o

oo oo

o

o

oo

oo

o

o

oo

oo

o

o

o o

o

oo

ooo

o

oooo

oooo

ooo o

o

o

oo

ooo oo

o

oo

ooooo

o

oo

o oo

oo

oooo

oo oo

o

o

oo

o

o

oooo

ooo

o oo

o

o

o

oo

ooo

o

o oo o

o

o

oo

oo

o

o

oo

oo

o

o

oo

o

o o

o oo

o

o oo

o

o ooo

oo

ooo

o

oo

ooo o

oo

oo

ooo

o o

o

oo

ooo

oo

oooo

oo oo

o

o

oo

o

o

oo o

oo o

o

oooo

o

o

o o

ooo

o

oooo

o

o

oo

oo

o

o

oo

oo

o

o

oo

o

oo

ooo

o

ooo

o

o ooo

oo

ooo

o

oo

ooooo

o

oo

ooo

o o

o

oo

ooo

o o

oooo

ooo o

o

o

oo

o

o

ooo

oo o

o

oooo

o

o

oo

ooo

o

oooo

o

o

oooo

o

o

oooo

o

o

oo

o

oo

ooo

o

oooo

oooo

ooooo

o

oo

oooooo

oo

ooooo

o

oo

o oo

oo

oo oo

oo oo

o

o

o o

o

o

ooooooo

oooo

o

o

oo

ooo

o

oo oo

o

o

ooo

o

o

o

oo

oo

o

o

oo

o

o o

o oo

o

ooo

o

oooo

oo

ooo

o

oo

o ooooo

oo

ooo

oo

o

oo

o oo

o o

oo oo

ooo o

o

o

o o

o

o

oo o

ooo

o

ooo

o

o

o

oo

ooo

o

oooo

o

o

ooo

o

o

o

oo

oo

o

o

oo

o

oo

o oo

o

o oo

o

oooo

ooo oo

o

oo

ooo o

oo

oo

ooooo

o

oooo

o

oo

oooo

ooo o

o

o

o o

o

o

ooooooo

ooo

o

o

o

oo

ooo

o

oooo

o

o

ooo

o

o

o

oo

oo

o

o

oo

o

oo

ooo

o

o oo

o

o ooo

ooo oo

o

oo

ooo o

oo

oo

ooo

oo

o

oo

ooo

o o

oo oo

ooo o

o

o

o o

o

o

oo o

oo o

o

oooooooooo

ooooooooooooooooooooo

o

ooo

oo

o

o

ooooooooooooooooo

o

oooo

ooooooooo

oooooo

oooooooooooo

o

oooooo

oo

oo

oo oo ooo o

oooo

oo

o oo

oo oo ooo

ooo o

o

o

ooo

oo

o

o

oooo ooo

oooo

o oo

oo

o

o

oo

oo

o oooo oo

oo

ooo

ooo

oooo

ooooo oo

o

o

oo

oo ooo o lweight

oo

oooo ooo o

ooo o

oo

ooo

oooo o o

ooooo

o

o

oo o

oo

o

o

o oooo

oooo

o ooo

oo

oo

o

oo

oo

o oooo o ooo

o oo

ooo

o ooo

oo

ooo ooo

o

oo

o ooo

oo

oooooo o oooo ooooo

ooo

oo oo o o

oo o

ooo

o

ooo

oo

o

o

ooo oo

oooo

o oo o

oo

oo

o

oo

oo

oooo o o o

oo

o oo

oo

o

oooo

oo

o o oo oo

o

oo

oooo

oo

ooooooooooooooooooooooooooooooo

o

ooo

oo

o

o

ooooooo

oooooooooo

o

oooo

oooooo

ooo

oooooo

oooo

oo

ooo ooo

o

oo

oooooo

oooooooooooo

ooo

oo o

ooo oo ooo

ooo o

o

o

oo o

oo

o

o

ooo o o

oooo

o ooo

oo

oo

o

oooo

oooo o oo

oo

ooo

oo

o

oooo

oo

o oo ooo

o

ooo o o

oo o

oo

oooooooooo

ooo

ooo

ooo oo ooo

oooo

o

o

ooo

oo

o

o

o ooo ooo

oooo

oooooo

o

ooo

o

o ooo ooo

oo

ooo

ooo

ooooooo ooooo

o

oo

oooooo

34

56

oo

oooooooooo

ooo

ooo

ooo oo ooo

oooo

o

o

ooo

oo

o

o

o oooo

oooo

oooo

oo

oo

o

oo

oo

o ooo o oo

oo

ooo

oo

o

oooo

oo

o oo ooo

o

oo

o ooo

oo

4050

6070

80

o

o

o

oo

o

oo

o

ooooo

o

ooo

o

o

ooooooooooooo

o

oo

o

oo

oo

ooooo

o

o

o

ooooo

oo

oo

o

o

o

o

oooooooo

o

o

o

o

oooo

ooo

o

o

oooooo

o

ooo

oo

oo

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

o ooo o oooo

o

o o

o

oo

oo

oo o

oo

o

o

o

oooo

o

oo

oo

o

o

o

o

ooooooo o

o

o

o

o

ooo

o

ooo

o

o

oo

oo

o o

o

ooo

oo

o o

o

o

o

oo

o

oo

o

ooooo

o

ooo

o

o

oooooo oooo o o

o

o

o o

o

oo

oo

ooo

oo

o

o

o

o oooo

oo

o o

o

o

o

o

oooooooo

o

o

o

o

oooo

ooo

o

o

oo

oo

o o

o

ooo

oo

oo

ageo

o

o

oo

o

oo

o

oo ooo

o

oo

o

o

o

o ooo

ooo ooo ooo

o

o o

o

oo

oo

oooo

o

o

o

o

oooo

o

oo

o o

o

o

o

o

ooo o

ooo o

o

o

o

o

o ooo

oo o

o

o

oo

oo

oo

o

oo

o

oo

oo

o

o

o

oo

o

oo

o

ooooo

o

ooo

o

o

ooooooooooooo

o

oo

o

oo

oo

ooooo

o

o

o

ooooo

oo

oo

o

o

o

o

oooo

oooo

o

o

o

o

oooo

ooo

o

o

oo

oo

oo

o

ooo

oo

oo

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

oo oo o oooo

o

oo

o

oo

oo

oo oo

o

o

o

o

ooo o

o

oo

oo

o

o

o

o

ooo o

ooo o

o

o

o

o

o oo

o

ooo

o

o

oo

oo

o o

o

oo

o

oo

o o

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

oo ooooooo

o

o o

o

oo

oo

oo o

oo

o

o

o

o oo o

o

oo

oo

o

o

o

o

ooo oooo o

o

o

o

o

oooo

ooo

o

o

ooo

ooo

o

ooo

oo

oo

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

oo oo ooooo

o

oo

o

oo

oo

ooo

oo

o

o

o

o oo o

o

oo

oo

o

o

o

o

ooo o

oooo

o

o

o

o

o oo

o

ooo

o

o

oo

oo

o o

o

oo

o

oo

oo

oooooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

oo

o

oo

o

o

oo

o

o

o

ooo

o

o

oooo oo

o

o

o oo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

o o

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

o oo

o

o

o oo ooo

o

o

ooo

o

oo oo

o

oo

o

o

o

o

o

o

o

o

o

o

o

o oo

oo

o

o

o

oo

o o

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

ooo

o

o

o o oooo

o

o

o oo

o

o oo o

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

o o

o

o

o

oo

o o

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

oo

oo

o

oo

o

o

oo

o

o

o

oo o

o

o lbph

oooooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

oo

o

oo

o

o

oo

o

o

o

ooo

o

o

oooooo

o

o

ooo

o

oo oo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

o o

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

o oo

o

o

oo oooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

oo

o

oo

o

o

oo

o

o

o

ooo

o

o

-10

12

oo oooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

ooo

o

o

0.00.4

0.8

oooooooooooooooooooooooooooooooooooooo

o

ooooooo

o

oooooooooooooo

o

o

o

oooooo

o

o

oooo

oo

o

ooo

o

oo

o

o

ooo

o

oooooo

oooo oo ooo ooo ooooo oo o o oo oo ooo o ooo ooo ooo

o

oo oo ooo

o

o ooo oo o ooooooo

o

o

o

oo ooo o

o

o

o o oo

oo

o

oo o

o

oo

o

o

o oo

o

oo ooo o

o oo ooooooooooo oo ooo ooooooo oooo o oooo ooo

o

oo ooooo

o

ooo ooo oo oo o oo o

o

o

o

oooooo

o

o

oo oo

oo

o

oo o

o

oo

o

o

o oo

o

oooooo

o o oooo ooo oooo oo o ooo oooo o oooooooo oo oo oo

o

o o ooo oo

o

oo oooo oo ooo oo o

o

o

o

oo o ooo

o

o

oo oo

o o

o

ooo

o

oo

o

o

oo o

o

o oo o oo

oooooo o oooo ooooo ooo oo oo o ooo ooo ooooo ooo

o

ooo oo oo

o

oo oo oooo oo oo o o

o

o

o

o o o oo o

o

o

o oo o

oo

o

o oo

o

o o

o

o

oo o

o

oooo oo

svi

oooooooooooo oo oo o oooo oo ooo oo o ooooo oo oo

o

ooo o ooo

o

oo ooo o ooo ooooo

o

o

o

o o ooo o

o

o

o o o o

oo

o

oo o

o

oo

o

o

o oo

o

o o ooo o

oo oooooooooo oooo ooooo oo ooo ooooooooo o oo

o

o ooo ooo

o

ooo oo oooooo ooo

o

o

o

o oooo o

o

o

o ooo

oo

o

ooo

o

oo

o

o

ooo

o

oooooo

oo oooooooooo oooo ooooo oo ooo oo oooooooooo

o

o oooo oo

o

ooo oo oooooo o oo

o

o

o

o o oooo

o

o

o oo o

oo

o

oo o

o

oo

o

o

o o o

o

o ooo oo

oooooooooooooo

o

oo

o

ooo

o

o

o

oooo

o

o

oooooo

o

o

o

o

oo

o

o

o

o

o

o

oooo

o

o

oo

o

oooo

ooo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

ooo

o

o

oooo oo ooo ooooo

o

oo

o

o o o

o

o

o

o oo

o

o

o

oo ooo

o

o

o

o

o

o o

o

o

o

o

o

o

ooo o

o

o

oo

o

oooo

o oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

ooo

o

o

o oo ooooooooooo

o

oo

o

o oo

o

o

o

oooo

o

o

oooo

oo

o

o

o

o

o o

o

o

o

o

o

o

oo

oo

o

o

o o

o

o oo o

ooo

o

o

o

o

oo

o

o

o

oo

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

ooo

o

o

o o oooo ooo oooo

o

o

oo

o

o oo

o

o

o

oooo

o

o

oo oo

oo

o

o

o

o

o o

o

o

o

o

o

o

oo

oo

o

o

o o

o

o oo o

o oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

oo o

o

o

oooooo o oooo ooo

o

oo

o

o oo

o

o

o

ooo

o

o

o

oooo

oo

o

o

o

o

oo

o

o

o

o

o

o

oo

o o

o

o

o o

o

oo o o

ooo

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o o

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

oooooooooooooo

o

oo

o

ooo

o

o

o

oooo

o

o

oooooo

o

o

o

o

oo

o

o

o

o

o

o

oooo

o

o

oo

o

oooo

ooo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

lcp

oo oooooooooooo

o

oo

o

ooo

o

o

o

oooo

o

o

ooooo

o

o

o

o

o

oo

o

o

o

o

o

o

oo

oo

o

o

oo

o

o ooo

o oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

-10

12

3

oo ooooooooooo

o

o

oo

o

ooo

o

o

o

ooo

o

o

o

oooooo

o

o

o

o

oo

o

o

o

o

o

o

oo

oo

o

o

oo

o

o o oo

o oo

o

o

o

o

oo

o

o

o

oo

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

6.07.0

8.09.0

oo

o

ooooooooo

ooo

o

o

oooo

o

o

o

oo

ooo

oooooo

o

o

ooo

o

o

o

ooo

o

o

oo

o

o

ooooo

o

oo

o

o

o

o

o

ooo

o

oooo

o

ooooooooo

o

oo

o

ooo

o

oooooo

oo

o

o oo ooo ooo

ooo

o

o

oo o o

o

o

o

o o

oo o

ooo ooo

o

o

o oo

o

o

o

ooo

o

o

oo

o

o

o o ooo

o

oo

o

o

o

o

o

o oo

o

o ooo

o

ooooooo oo

o

o o

o

o oo

o

oo ooo o

o o

o

ooooooooo

oo o

o

o

oo oo

o

o

o

oo

ooo

o o oooo

o

o

o oo

o

o

o

ooo

o

o

oo

o

o

o oo oo

o

oo

o

o

o

o

o

ooo

o

ooo o

o

oo ooooo o o

o

oo

o

o oo

o

oooooo

o o

o

ooo ooo ooo

o oo

o

o

oo oo

o

o

o

oo

ooo

ooo oo o

o

o

o oo

o

o

o

o oo

o

o

o o

o

o

o oo oo

o

oo

o

o

o

o

o

o o o

o

oo oo

o

oo o ooooo o

o

o o

o

oo o

o

o oo o oo

oo

o

ooo o oooo o

ooo

o

o

oo oo

o

o

o

oo

o oo

o ooooo

o

o

o oo

o

o

o

o oo

o

o

o o

o

o

ooo oo

o

o o

o

o

o

o

o

o o o

o

oo oo

o

o o oooo ooo

o

oo

o

oo o

o

oooo oo

oo

o

ooooooooo

ooo

o

o

oooo

o

o

o

oo

ooo

oooooo

o

o

o oo

o

o

o

ooo

o

o

oo

o

o

ooooo

o

oo

o

o

o

o

o

ooo

o

o oo o

o

oooo oooo o

o

o o

o

ooo

o

oooooo

oo

o

ooooooooo

oo o

o

o

oooo

o

o

o

oo

oo o

ooooo o

o

o

o oo

o

o

o

ooo

o

o

o o

o

o

o ooo o

o

oo

o

o

o

o

o

o oo

o

o oo o

o

o ooo ooo oo

o

o o

o

o oo

o

o o ooo ogleason

oo

o

ooooooooo

ooo

o

o

oooo

o

o

o

oo

oo o

oooooo

o

o

o oo

o

o

o

o oo

o

o

oo

o

o

ooooo

o

o o

o

o

o

o

o

o oo

o

o ooo

o

o ooo ooo oo

o

o o

o

o o o

o

o ooo oo

0 2 4oooooooooooo

o

ooo

o

oooooo

o

oo

o

o

o

oooooooooo

o

o

ooooo

o

o

oo

o

o

o

o

oooooo

o

o

o

o

oo

o

ooo

o

oo

o

o

oo

o

oooo

o

o

o

o

o

o

oo

oo

oo

o

o

oooo

o oo ooo ooo

o

ooo

o

oo o oo

o

o

o o

o

o

o

ooo ooo ooo

o

o

o

oo o

oo

o

o

oo

o

o

o

o

oo

ooo

o

o

o

o

o

oo

o

oo o

o

oo

o

o

oo

o

oo

oo

o

o

o

o

o

o

oo

oo

oo

o

o

o

3 4 5 6o oo

ooooooooo

o

o oo

o

oo oooo

o

oo

o

o

o

o o oooo ooo

o

o

o

ooo

oo

o

o

oo

o

o

o

o

oo

o oo

o

o

o

o

o

oo

o

ooo

o

o o

o

o

oo

o

oo

oo

o

o

o

o

o

o

oo

oo

oo

o

o

oo o

oooo ooo ooo

o

oo o

o

oo oooo

o

oo

o

o

o

ooo oo oooo

o

o

o

ooo

oo

o

o

o o

o

o

o

o

oo

ooo

o

o

o

o

o

oo

o

ooo

o

oo

o

o

oo

o

ooo

o

o

o

o

o

o

o

oo

oo

oo

o

o

o

-1 0 1 2oooooo o oooo o

o

ooo

o

oo ooo

o

o

oo

o

o

o

o ooooo ooo

o

o

o

ooooo

o

o

o o

o

o

o

o

oo

o oo

o

o

o

o

o

oo

o

oo o

o

oo

o

o

oo

o

oo

oo

o

o

o

o

o

o

oo

oo

oo

o

o

ooooooooooooo

o

ooo

o

oooooo

o

oo

o

o

o

ooooooooo

o

o

o

ooooo

o

o

oo

o

o

o

o

oooooo

o

o

o

o

oo

o

ooo

o

o o

o

o

oo

o

oooo

o

o

o

o

o

o

oo

oo

oo

o

o

o

-1 1 2 3oooooooooooo

o

o oo

o

ooooo

o

o

oo

o

o

o

ooooo ooooo

o

o

oo ooo

o

o

o o

o

o

o

o

oo

oooo

o

o

o

o

oo

o

oo o

o

o o

o

o

oo

o

ooo

o

o

o

o

o

o

o

oo

oo

oo

o

o

ooo

oooooooooo

o

ooo

o

ooooo

o

o

oo

o

o

o

oooooo ooo

o

o

o

oo o

oo

o

o

oo

o

o

o

o

oooooo

o

o

o

o

oo

o

oo o

o

oo

o

o

oo

o

oooo

o

o

o

o

o

o

oo

oo

oo

o

o

o

0 40 80

020

60100

pgg45

Figure 1.1: S atterplot matrix of the prostate an erdata. The �rst row shows the response against ea h ofthe predi tors in turn. Two of the predi tors, svi andgleason, are ategori al.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 26 / 34

Page 27: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

Elements of Statisti al Learning Hastie, Tibshirani & Friedman 2001 Chapter 3

Shrinkage Factor s

Coeffi

cients

0.0 0.2 0.4 0.6 0.8 1.0

-0.20.0

0.20.4

0.6

••

•• • • • • • • • • • • • • • • • lcavol

• • • • ••

••

•• • • • • • • • • • • • • • • • lweight

• • • • • • • • • • • • • ••

• • • • • • • • • •age

• • • • • • • • • ••

••

•• • • • • • • • • • • lbph

• • • • • • ••

••

••

•• • • • • • • • • • • •svi

• • • • • • • • • • • • • • ••

••

••

••

••

• lcp

• • • • • • • • • • • • • • • • • • • • • • • • •gleason• • • • • • • • • •

••

•• • • • • • • • • •

••pgg45

Figure 3.9: Pro�les of lasso oeÆ ients, as tuningparameter t is varied. CoeÆ ients are plotted versuss = t=Pp1 j�̂j j. A verti al line is drawn at s = 0:5, thevalue hosen by ross-validation. Compare Figure 3.7on page 7; the lasso pro�les hit zero, while those forridge do not.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 26 / 34

Page 28: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

Handwritten Digit Recognition

Handwritten Digit Recognition (Textbook page 4)

Goal: identify single digits 0 ∼ 9 based on the images.

Raw Data: images that are scaled segments from five digit ZIP codes.

Each digit has an image of 16× 16 eight-bit gray scale mapsPixel intensities range from 0 (black) to 255 (white)Images are normalized to have approximately the same size andorientation

Input X is a 16× 16 matrix, or a 256 dimensional vector.

Output G ∈ G = {0, 1, ..., 9}.The error rate should be very low to avoid misdirection of mail. Someobjects are assigned to “do not know” category and sorted by hand.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 27 / 34

Page 29: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 28 / 34

Page 30: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

Example 4: DNA Expression Microarray

DNA = Deoxy-ribo-nucleic acid, a basic material making up humanchromosomes.DNA Microarray Technique: for each sample from a tissue, the expressionlevel (the amount of mRNA) of thousands of genes are measured.

Training data: p = 6, 830 genes (rows), n = 64 samples (columns)(cancer tumors) taken from two classes.

Input X: the level of expression for each geneGoal: discover the relationship between gene and cancer type, or findthe gene signature of each cancer subtype

Challenge: p >> n (data matrix n × p “short and fat”’)

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 29 / 34

Page 31: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

How DNA Microarray Technique Works

A breakthrough technology in biology, facilitating the quantitative study ofthousands of genes simultaneously from a single sample of cells

1 The nucleotide sequences for a few thousand genes are printed on aglass slide;

2 A target sample and a reference sample are labeled with red andgreen dyes, and each are hybridized with the DNA on the slide;

3 Through fluroscopy, the log (red/green) intensities of RNAhybridizing each site is measured;

4 The results are a few thousand numbers, typically ranging from -6 to6, measuring the expression level of each gene in the target relative tothe reference sample (positive values indicate higher expression in thetarget versus the reference, and vice versa for negative values).

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 30 / 34

Page 32: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

ExamplesElements of Statisti al Learning Hastie, Tibshirani & Friedman 2001 Chapter 1

SID42354SID31984SID301902SIDW128368SID375990SID360097SIDW325120ESTsChr.10SIDW365099SID377133SID381508SIDW308182SID380265SIDW321925ESTsChr.15SIDW362471SIDW417270SIDW298052SID381079SIDW428642TUPLE1TUP1ERLUMENSIDW416621SID43609ESTsSID52979SIDW357197SIDW366311ESTsSMALLNUCSIDW486740ESTsSID297905SID485148SID284853ESTsChr.15SID200394SIDW322806ESTsChr.2SIDW257915SID46536SIDW488221ESTsChr.5SID280066SIDW376394ESTsChr.15SIDW321854WASWiskottHYPOTHETICALSIDW376776SIDW205716SID239012SIDW203464HLACLASSISIDW510534SIDW279664SIDW201620SID297117SID377419SID114241ESTsCh31SIDW376928SIDW310141SIDW298203PTPRCSID289414SID127504ESTsChr.3SID305167SID488017SIDW296310ESTsChr.6SID47116MITOCHONDRIAL60ChrSIDW376586HomosapiensSIDW487261SIDW470459SID167117SIDW31489SID375812DNAPOLYMERSID377451ESTsChr.1MYBPROTOSID471915ESTsSIDW469884HumanmRNASIDW377402ESTsSID207172RASGTPASESID325394H.sapiensmRNAGNALSID73161SIDW380102SIDW299104

BREA

STRE

NAL

MELA

NOMA

MELA

NOMA

MCF7

D-rep

roCO

LON

COLO

NK5

62B-r

epro

COLO

NNS

CLC

LEUK

EMIA

RENA

LME

LANO

MABR

EAST CN

SCN

SRE

NAL

MCF7

A-rep

roNS

CLC

K562

A-rep

roCO

LON

CNS

NSCL

CNS

CLC

LEUK

EMIA

CNS

OVAR

IANBR

EAST

LEUK

EMIA

MELA

NOMA

MELA

NOMA

OVAR

IANOV

ARIAN

NSCL

CRE

NAL

BREA

STME

LANO

MAOV

ARIAN

OVAR

IANNS

CLC

RENA

LBR

EAST

MELA

NOMA

LEUK

EMIA

COLO

NBR

EAST

LEUK

EMIA

COLO

NCN

SME

LANO

MANS

CLC

PROS

TATE

NSCL

CRE

NAL

RENA

LNS

CLC

RENA

LLE

UKEM

IAOV

ARIAN

PROS

TATE

COLO

NBR

EAST

RENA

LUN

KNOW

N

Figure 1.3: DNA mi roarray data: expression matrix of6830 genes (rows) and 64 samples ( olumns), for the humantumor data. Only a random sample of 100 rows are shown.The display is a heat map, ranging from bright green (nega-tive, under expressed) to bright red (positive, over expressed).Missing values are gray. The rows and olumns are displayedin a randomly hosen order.Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 31 / 34

Page 33: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

Microarray Data Analysis

Typical questions of interest:

Which samples are most similar to each other, in terms of theirexpression profiles across genes?

Which genes are most similar to each other, in terms of theirexpression profiles across sample?

Do certain genes show very high (or low) expression for certain cancersamples?

.......

This task could be viewed as

a supervised or an unsupervised problem

a regression or a classification problem

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 32 / 34

Page 34: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

Statistical Problems in Data Mining

Regression (linear, nonlinear, logistic regression, GLM, graphicalmodel, boosting, neural networks)

Classification (Discriminant Analysis; LDA, QDA, SVM, trees,random forest)

Feature Selection (Variable Selection; Sparse Analysis; LASSO,screening)

Dimension Reduction (PCA; ICA; SDR)

Regularization (control model complexity, aim for parsimonious andinterpretable solutions)

Clustering Analysis (k-means, association role)

Other Variations

Semi-supervised Learning

Mixture of the above

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 33 / 34

Page 35: ST790: Introduction to Data Mining and Machine …lu/ST7901/lecture notes...1 Principle and Theory for Data Mining and Machine Learning by Clark, Forkoue, Zhang (2009) 2 Pattern Recognition

Examples

New Challenges to Statisticians

Data Complexity: involves many variables which are often related incomplex (nonlinear) ways.

Feature Selection: many features are available but some areredundant, leading to the feature selection or dimension reductionproblem.

Optimization: many methods involve finding the “best” parametersvalues by solving complex and large (containing many parameters)optimization problems. Therefore, efficient optimization techniquesare required.

Visualization: much harder in the high dimensional space.

This is the so-called curse of dimensionality.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 34 / 34