Part I: Introductory Materials Introduction to Data Mining Dr. Nagiza F. Samatova Department of Computer Science North Carolina State University and Computer Science and Mathematics Division Oak Ridge National Laboratory

Upload: lucas-carmen

Post on 15-Dec-2015


Part I: Introductory Materials
Introduction to Data Mining

Dr. Nagiza F. Samatova
Department of Computer Science
North Carolina State University
and
Computer Science and Mathematics Division
Oak Ridge National Laboratory


What is common among all of them?


Who are the data producers? What data?
Application → Data

• Application Category: Finance
• Producer: Wall Street
• Data: stocks, stock prices, stock purchases, …

• Application Category: Academia
• Producer: NCSU
• Data: student admission data (name, DOB, GRE scores, transcripts, GPA, university/school attended, recommendation letters, personal statement, etc.)


Application Categories

• Finance (e.g., banks)
• Entertainment (e.g., games)
• Science (e.g., weather forecasting)
• Medicine (e.g., disease diagnostics)
• Cybersecurity (e.g., terrorists, identity theft)
• Commerce (e.g., e-Commerce)
• …


What questions to ask about the data?
Data → Questions

• Academia: NCSU: Admission data
1. Is there any correlation between the students’ GRE scores and their successful completion of a PhD program?
2. What are the groups of students that share common academic performance?
3. Are there any admitted students who would stand out as an anomaly? What type of anomaly is that?
4. If a student majors in Physics, what other major is he/she likely to choose as a double major?


Questions by Types?

• Correlation, similarity, comparison, …
• Association, causality, co-occurrence, …
• Grouping, clustering, …
• Categorization, classification, …
• Frequency or rarity of occurrence, …
• Anomalous or normal objects, events, behaviors, …
• Forecasting: future classes, future activity, …
• …


What information do we need to answer the questions?
Questions → Data Objects and Object Features

• Academia: NCSU: Admission data
– Objects: Students
– Object’s Features = Variables = Attributes = Dimensions, and their Types
  • Name: String (e.g., Name=Neil Shah)
  • GPA: Numeric (e.g., GPA=5.0)
  • Recommendation: Text (e.g., … the top 2% in my career…)
  • Etc.


How to compare two objects?
Data Object → Object Pairs

• Academia: NCSU: Admission data
– Objects: Students
– Based on a single feature:
  • Similar GPA
  • The same first letter in the last name
– Based on a set of features:
  • Similar academic records (GPA, GRE, etc.)
  • Similar demographic records
– Can you compute a numerical value for your similarity measure used for comparison? Why or why not?


How to represent data mathematically?
Data Object & its Features → Data Model

• What mathematical objects have you studied?
– Scalars
– Points
– Vectors
– Vector spaces
– Matrices
– Sets
– Graphs, networks (maybe)
– Tensors (maybe)
– Time series (maybe)
– Topological manifolds (maybe)
– …


Data object as vector with components…

City=(Latitude, Longitude): a 2-dimensional object

Vector components:
• Features, or
• Attributes, or
• Dimensions

Raleigh=(35.46, 78.39)
Boston=(42.21, 71.5)

Proximity(Raleigh, Boston)=?
• Geodesic distance
• Euclidean distance
• Length of the interstate route
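The first two proximity options can be compared with a short sketch. It is illustrative only: the decimal-degree coordinates below are rough approximations chosen for this example (they are not the slide's values), the Earth is treated as a sphere of radius 6371 km, and the function names are hypothetical.

```python
import math

# ASSUMPTION: approximate decimal-degree coordinates (west longitude negative),
# picked for this sketch rather than taken from the slide.
RALEIGH = (35.78, -78.64)
BOSTON = (42.36, -71.06)

def euclidean(p, q):
    """Straight-line distance in raw coordinate units (degrees here)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km, i.e. the geodesic on a spherical Earth."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

print("Euclidean (degrees):", euclidean(RALEIGH, BOSTON))
print("Great-circle (km):  ", haversine_km(RALEIGH, BOSTON))
```

The two numbers are in different units and answer different questions, which is the point of the slide: the proximity measure is a modeling choice, not a property of the data.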


A set of data objects as vector spaces

[Figure: 3-dimensional vector space with axes Latitude, Longitude, and Altitude; the plotted points are Raleigh and Moscow.]

Mining such data ~ studying vector spaces


Multi-dimensional vectors…

Student=(Name, GPA, Weight, Height, Income in K, …): a multi-dimensional object

S1=(John Smith, 5.0, 180, 6.0, 200)
S2=(Jane Doe, 3.0, 140, 5.4, 70)

Vector components:
• Features, or
• Attributes, or
• Dimensions

Proximity(S1, S2)=?
• How to compare when vector components are of heterogeneous types or on different scales?
• How to show the results of the comparison?
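One common way to handle different scales is to rescale every numeric feature to [0, 1] before measuring distance. The sketch below is only one option, not the slide's prescribed method: it drops the Name component, and the per-feature (min, max) ranges are invented for illustration.

```python
import math

# Numeric components from the slide: (GPA, Weight, Height, Income in K).
# The Name component is dropped because it is not numeric.
s1 = (5.0, 180.0, 6.0, 200.0)
s2 = (3.0, 140.0, 5.4, 70.0)

# ASSUMPTION: hypothetical (min, max) range for each feature.
ranges = [(0.0, 5.0), (90.0, 300.0), (4.5, 7.0), (0.0, 500.0)]

def min_max_scale(vector, feature_ranges):
    """Map each component to [0, 1] using its assumed feature range."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(vector, feature_ranges))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

raw = euclidean(s1, s2)
scaled = euclidean(min_max_scale(s1, ranges), min_max_scale(s2, ranges))
print("raw distance:   ", raw)     # dominated by Income and Weight
print("scaled distance:", scaled)  # every feature contributes on a comparable scale
```

Without scaling, the income difference (130) swamps the GPA difference (2.0), so the raw distance mostly reflects income.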


as matrices…

Example: A collection of text documents on the Web

Original Documents
D1: Child Safety at Home
D2: Infant & Toddler First Aid
D3: Your Baby's Health and Safety: From Infant to Toddler

Parsed Documents
D1: Child Safety Home
D2: Infant Toddler
D3: Bab Health Safety Infant Toddler

t-d term-document matrix (Terms=Features=Dimensions)

               D1  D2  D3
T1: Bab         0   0   1
T2: Child       1   0   0
T3: Health      0   0   1
T4: Home        1   0   0
T5: Infant      0   1   1
T6: Safety      1   0   1
T7: Toddler     0   1   1

Mining such data ~ studying matrices
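A minimal sketch of how the binary term-document matrix can be built from the parsed documents. The parsing step itself (stop-word removal, stemming "Baby's" to "Bab") is assumed to have already happened; the variable names are illustrative.

```python
# Parsed documents exactly as listed on the slide.
docs = {
    "D1": ["Child", "Safety", "Home"],
    "D2": ["Infant", "Toddler"],
    "D3": ["Bab", "Health", "Safety", "Infant", "Toddler"],
}

# Terms are the sorted union of all words: T1..T7 on the slide.
terms = sorted({word for words in docs.values() for word in words})
doc_ids = list(docs)

# Rows are terms, columns are documents; 1 means the term occurs in the document.
matrix = [[1 if term in docs[d] else 0 for d in doc_ids] for term in terms]

print("         ", "  ".join(doc_ids))
for term, row in zip(terms, matrix):
    print(f"{term:<10}", "   ".join(str(v) for v in row))
```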


or as trees

t-d term-document matrix (as on the previous slide)

[Figure: tree relating documents (D2 and D3 are marked) to groups of terms. The term groups shown are:]
• president government party election political elected national districts held district independence vice minister parties
• population area climate city miles province land topography total season 1999 square rate
• economy million products 1996 growth copra economic 1997 food scale exports rice fish

Is D2 similar to D3?

What if there are 10,000 terms?

Mining such data ~ studying trees
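One possible numerical answer to "Is D2 similar to D3?" is the cosine similarity between document columns of the term-document matrix. This is a swapped-in illustration, not the tree-based approach the slide's figure shows; the vectors are the D1, D2, D3 columns of the matrix over terms T1..T7.

```python
import math

# Document columns of the slide's term-document matrix (terms T1..T7).
D1 = [0, 1, 0, 1, 0, 1, 0]
D2 = [0, 0, 0, 0, 1, 0, 1]
D3 = [1, 0, 1, 0, 1, 1, 1]

def cosine(u, v):
    """Cosine of the angle between two term vectors (1 = same direction, 0 = no shared terms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print("cos(D2, D3) =", cosine(D2, D3))
print("cos(D1, D2) =", cosine(D1, D2))
print("cos(D1, D3) =", cosine(D1, D3))
```

With 10,000 terms the same function applies unchanged to longer vectors, although most entries would be zero, so a sparse representation would likely be preferable in practice.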


Or as networks, or graphs w/ nodes & links

[Figure: graph of documents. The term groups shown are:]
• population area climate city miles province land topography total season 1999 square rate
• president government party election political elected national districts held district independence vice minister parties
• economy million products 1996 growth copra economic 1997 food scale exports rice fish

Nodes=Documents
Links=Document similarity (e.g., if a document references another document)

Mining such data ~ studying graphs, or graph mining


What apps naturally deal w/ graphs?

• Semantic Web
• Social Networks
• World Wide Web
• Drug Design, Chemical compounds
• Computer networks
• Sensor networks

Credit: Images are from Google images via search of keywords


What questions to ask about graph data?
Graph Data → Graph Mining Questions

• Academia: NCSU: Admission data
1. Nodes=students; links=similar academics/demographics
2. How many distinct groups of students, by academic performance, were admitted to NCSU?
3. Which academic group is the largest?
4. Given a new student applicant, can we predict which academic group the student will likely belong to?
5. Do groups of students with similar demographics usually share similar academic performance?
6. Over the last decade, has the diversity in demographics of accepted student groups increased or decreased?
7. …
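Questions 2 and 3 can be sketched on a toy graph if a "group" is taken to mean a connected component, which is a simplifying assumption (real analyses may use other clustering or community-detection methods). The student IDs and links below are hypothetical, invented purely for illustration.

```python
from collections import defaultdict

# HYPOTHETICAL data: nodes are students, links mean "similar academics".
students = ["s1", "s2", "s3", "s4", "s5", "s6"]
links = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]

def connected_components(nodes, edges):
    """Return the groups of nodes that are reachable from one another."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:  # iterative depth-first search
            node = stack.pop()
            group.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(group))
    return groups

groups = connected_components(students, links)
largest = max(groups, key=len)
print("number of groups:", len(groups))
print("largest group:   ", largest)
```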


Recap: Data Mining and Graph Mining

Application → Data → Questions → Data Objects + Features → Mathematical Data Representation (Data Model)

Data models:
• Vectors
• Matrices
• Graphs
• Time series
• Tensors
• Sets
• Manifolds

• One hat does not fit all
• More than one model is needed
• Models are related


How much data?

Domains: Astrophysics, Cosmology, Climate, Biology, Ecology, Web
Volumes shown: 30 TB/day, 20-40 TB/simulation, 1 PB/year, 850 TB

1 TB (TeraByte) = 10^12 Bytes
1 PB (PetaByte) = 10^15 Bytes
My laptop: 60 GB (GigaBytes); 1 GB = 10^9 Bytes


It is not just the Size, but the Complexity

[Figure: petabytes of data that are noisy and high-dimensional, with non-linear correlations and ‘+’ and ‘-’ feedbacks.]


Data Describes Complex Patterns/Phenomena

How to untangle the riddles of the complexity?

[Figure: complex regulation of a single gene; ~30k genes; 50 trans elements control single gene expression.]

Challenge: How to “connect the dots” to answer important science/business questions?

Analytical tools that find the “dots” from data significantly reduce data.


Connecting the Dots

Sheer Volume of Data
• Climate
  – Now: 20-40 Terabytes/year
  – 5 years: 5-10 Petabytes/year
• Fusion
  – Now: 100 Megabytes/15 min
  – 5 years: 1000 Megabytes/2 min

Advanced Math+Algorithms
• Huge dimensional space
• Combinatorial challenge
• Complicated by noisy data
• Requires high-performance computers

Providing Predictive Understanding
• Produce bioenergy
• Stabilize CO2
• Clean toxic waste

Finding the Dots → Connecting the Dots → Understanding the Dots


Why Would Data Mining Matter?
Enables solving many large-scale data problems

Finding the Dots → Connecting the Dots → Understanding the Dots

Science Questions
• How to effectively produce bioenergy?
• How to stabilize carbon dioxide?
• How to convert toxic into non-toxic waste?
• ...


How to Move and Access the Data?
Technology trends are a rate limiting factor

[Figure: CPU, Disk, Network Trend; axes MIPS/$M, GB/$M, kB/s. Doubling: CPU every 1.2 years, Disk every 1.4 years, WAN every 0.7 years. Src: Richard Mount, SLAC]

[Figure: Latency and Speed – Storage Performance; Retrieval Rate (Mbytes/s) vs. log10(Object Size Bytes) for Memory, Disk, and Tape. J. W. Toigo, Avoiding a Data Crunch, Scientific American, May 2000]

• Most of these data will NEVER be touched!
• Naturally distributed but effectively immovable
• Streaming/Dynamic but not re-computable
• Data doubles every 9 months; CPU, every 18 months.
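The consequence of the two doubling periods can be worked out directly. This is a back-of-the-envelope sketch that assumes clean exponential growth; the 36-month horizon and the function name are arbitrary choices for illustration.

```python
def growth_factor(months, doubling_months):
    """Multiplicative growth after `months`, given one doubling every `doubling_months`."""
    return 2 ** (months / doubling_months)

horizon = 36  # months; arbitrary illustration horizon
data_growth = growth_factor(horizon, 9)    # data doubles every 9 months
cpu_growth = growth_factor(horizon, 18)    # CPU doubles every 18 months
gap = data_growth / cpu_growth             # how much faster data grew than compute

print(f"after {horizon} months: data x{data_growth:g}, CPU x{cpu_growth:g}, gap x{gap:g}")
```

Under these assumptions the gap itself doubles every 18 months, so the amount of data per unit of compute keeps growing.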


How to Make Sense of Data?
Know Your Limits & Be Smart

[Figure: data size (Megabytes, Gigabytes, Terabytes, Petabytes; "More data") against "More analysis"; labels: Scalability of analysis in full context; Human Bandwidth Overload?]

• Ultrascale Computations: Must be smart about which probe combinations to see!
• Physical Experiments: Must be smart about probe placement!

Not humanly possible to browse a petabyte of data. Analysis must reduce data to quantities of interest.

To see 1 percent of a petabyte at 10 megabytes per second takes:
35 8-hour days!
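The 35-day figure can be checked with simple arithmetic, using the decimal units defined earlier in the deck (1 PB = 10^15 bytes, 1 MB = 10^6 bytes). The variable names are illustrative.

```python
PETABYTE = 10 ** 15   # bytes
MEGABYTE = 10 ** 6    # bytes

bytes_to_view = PETABYTE // 100      # 1 percent of a petabyte
rate = 10 * MEGABYTE                 # 10 megabytes per second

seconds = bytes_to_view / rate       # total viewing time in seconds
hours = seconds / 3600
workdays = hours / 8                 # 8-hour days

print(f"{seconds:.0f} s = {hours:.1f} h = {workdays:.1f} eight-hour days")
```

The result is a little under 35 eight-hour days, consistent with the slide's rounded figure.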


What Analysis Algorithms to Use?
Even a simple big O analysis can elucidate simplicity.

Algorithmic Complexity:
• Calculate means: O(n)
• Calculate FFT: O(n log(n))
• Calculate SVD: O(r • c)
• Clustering algorithms: O(n^2)

For illustration, the chart assumes 10^-12 sec. (1 Tflop/sec) calculation time per data point.

Algorithm complexity by data size n:

Data size n    n             n log(n)      n^2
100B           10^-10 sec.   10^-10 sec.   10^-8 sec.
10KB           10^-8 sec.    10^-8 sec.    10^-4 sec.
1MB            10^-6 sec.    10^-5 sec.    1 sec.
100MB          10^-4 sec.    10^-3 sec.    3 hrs
10GB           10^-2 sec.    0.1 sec.      3 yrs.

Analysis algorithms fail for a few gigabytes.

If n=10GB, then what is O(n) or O(n^2) on a teraflop computer?

1 GB = 10^9 bytes
1 Tflop = 10^12 op/sec
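The closing question can be answered with a short calculation. The sketch assumes one operation per data point at 10^-12 seconds each, as the slide states, and a base-10 logarithm for the n log(n) column, which is an assumption that matches the slide's order-of-magnitude entries.

```python
import math

SECONDS_PER_OP = 1e-12  # 1 Tflop/sec, one operation per data point

SIZES = {"100B": 1e2, "10KB": 1e4, "1MB": 1e6, "100MB": 1e8, "10GB": 1e10}

def runtimes(n):
    """Estimated seconds for O(n), O(n log n), and O(n^2) work on n data points."""
    return {
        "n": n * SECONDS_PER_OP,
        "n log(n)": n * math.log10(n) * SECONDS_PER_OP,  # ASSUMPTION: log base 10
        "n^2": n * n * SECONDS_PER_OP,
    }

for label, n in SIZES.items():
    t = runtimes(n)
    print(f"{label:>6}: n={t['n']:.0e} s, n log(n)={t['n log(n)']:.0e} s, n^2={t['n^2']:.0e} s")

# For n = 10 GB the quadratic algorithm needs about 10^8 seconds.
years = runtimes(1e10)["n^2"] / (365 * 24 * 3600)
print(f"O(n^2) at 10GB: about {years:.1f} years")
```

A linear pass over 10 GB takes about a hundredth of a second, while a quadratic algorithm on the same data takes roughly three years.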