COMP5331 Knowledge Discovery in Databases: Overview. Prepared and presented by Raymond Wong (raywong@cse)

Upload: brendan-oliver

Post on 17-Jan-2016



COMP5331

Knowledge Discovery in Databases

Overview

Prepared by Raymond Wong

Presented by Raymond Wong

raywong@cse


Course Details

Reference books/materials:
- Papers
- Data Mining: Concepts and Techniques. Jiawei Han and Micheline Kamber. Morgan Kaufmann Publishers (3rd edition)
- Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Boston: Pearson Addison Wesley (2006)


Area

DB or AI

This course can count towards one of the areas ONLY and cannot be double counted towards the required credits.


Course Details

Grading Scheme:
- Assignment: 30%
- Project: 30%
- Final Exam: 40%


Assignment

If a student answers a selected question in class correctly, I will give him/her a coupon for each correct answer.

A coupon can be used to waive one question in an assignment, which means that s/he gets full marks for that question without answering it.


Assignment Guideline

For each assignment, each student can waive at most one question.

The student can waive any question s/he wants and obtain full marks for it, whether or not s/he answers it.

The student may still answer the waived question. We will mark it, but the question will receive full marks regardless.

When submitting the assignment, please staple the coupon to it and write on the coupon the number of the question you want to waive.


Project

Each project is completed by a group.

The number of students in a group depends on the class size.

The duration of each presentation depends on the class size.

It will be announced soon.


Project

Project Type (one of the following):
- Survey: your group only needs to read about 2~5 papers
- Implementation-oriented Project: your group only needs to read about 1~2 papers
- Research-oriented Project: you can read some papers and conduct research


Project

Deliverables and full score for each project type:

Survey (Full Score = 80%)
1. Proposal
2. Presentation
3. Final report

Implementation-oriented Project (Full Score = 90%)
1. Proposal
2. Presentation
3. Final report
4. Coding

Research-oriented Project (Full Score = 100%)
1. Proposal
2. Presentation
3. Final report (containing your proposed methodology)
4. Coding (if any)


Project

Project Topic:
- Some pre-selected topics/papers
- Your own choice

For fairness, please do not choose a topic that is closely related to your own research.


Exam

You are allowed to bring a calculator with you.

Please remember to prepare a calculator for the exam


Major Topics

1. Association
2. Clustering
3. Classification
4. Data Warehouse
5. Data Mining over Data Streams
6. Web Databases
7. Multi-criteria Decision Making


1. Association

Customer   Apple   Orange   Milk
Raymond    Apple   Orange
Ada                Orange   Milk
Grace      Apple   Orange
…          …       …        …

Items/Itemsets     Frequency
Apple              2
Orange             3
Milk               1
{Apple, Orange}    2
{Orange, Milk}     1

We are interested in the items/itemsets with frequency >= 2.

With this threshold, Apple and Orange are each a Frequent Pattern (or Frequent Item), and {Apple, Orange} is a Frequent Pattern (or Frequent Itemset).



Association Rule:

1. Apple → Orange (customers who buy apple will probably buy orange): 2/2 = 100%

2. Orange → Apple (customers who buy orange will probably buy apple): 2/3 = 67%

Problem: to find all frequent patterns and association rules
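
The frequencies and rule percentages above can be reproduced with a short script. This is a minimal sketch, not the mining algorithm the course presents: it checks every possible itemset by brute force, which only works for tiny examples. The names `transactions`, `MIN_FREQUENCY`, `frequency`, and `rule_strength` are hypothetical. The ratio computed for a rule is commonly called the rule's confidence.

```python
from itertools import combinations

# The three transactions shown in the table (the "…" rows are not included).
transactions = {
    "Raymond": {"Apple", "Orange"},
    "Ada": {"Orange", "Milk"},
    "Grace": {"Apple", "Orange"},
}
MIN_FREQUENCY = 2  # threshold from the slide: frequency >= 2


def frequency(itemset):
    """Count the transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)


def frequent_itemsets():
    """Brute force: test every non-empty combination of the known items."""
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, size):
            count = frequency(candidate)
            if count >= MIN_FREQUENCY:
                result[candidate] = count
    return result


def rule_strength(lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    return frequency(set(lhs) | set(rhs)) / frequency(lhs)


print(frequent_itemsets())
print(f"Apple -> Orange: {rule_strength({'Apple'}, {'Orange'}):.0%}")
print(f"Orange -> Apple: {rule_strength({'Orange'}, {'Apple'}):.0%}")
```

Both rules come from the same frequent itemset {Apple, Orange}; only the denominator changes (the frequency of Apple versus the frequency of Orange), which is why the two percentages differ.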


2. Clustering

Student   Computer   History
Raymond   100        40
Louis     90         45
Wyman     20         95
…         …          …

(Figure: students plotted by Computer score and History score, forming two clusters.)

Cluster 1 (e.g., High Score in Computer and Low Score in History)

Cluster 2 (e.g., High Score in History and Low Score in Computer)

Problem: to find all clusters
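
As an illustration of how clusters like these might be found, here is a minimal sketch using k-means. The slide does not name a clustering algorithm, so k-means is an assumption chosen only because it is short to write. The starting centroids and the names `scores`, `closest`, and `k_means` are also hypothetical.

```python
import math

# Scores from the table: (Computer, History). The "…" rows are not included.
scores = {
    "Raymond": (100, 40),
    "Louis": (90, 45),
    "Wyman": (20, 95),
}


def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, centroids, rounds=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its points, and repeat until stable."""
    assignment = {}
    for _ in range(rounds):
        assignment = {name: closest(p, centroids) for name, p in points.items()}
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [points[n] for n, c in assignment.items() if c == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep a centroid that attracted no points
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignment, centroids


# Assumed starting centroids, one per expected cluster.
assignment, centroids = k_means(scores, [(100, 0), (0, 100)])
print(assignment)
print(centroids)
```

With these starting points, Raymond and Louis end up in one group and Wyman in the other, matching Cluster 1 and Cluster 2 above.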


3. Classification

Decision tree:

root
  child=yes: 100% Yes, 0% No
  child=no:
    Income=high: 100% Yes, 0% No
    Income=low: 0% Yes, 100% No

Suppose there is a person:

Race    Income   Child   Insurance
white   high     no      ?
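
The decision tree can be read as a sequence of nested tests. The sketch below assumes a particular layout, namely that the root tests Child and that the Income test sits under the child=no branch; the flattened slide text does not make this certain. The function name `predict_insurance` and the dictionary keys are hypothetical.

```python
def predict_insurance(person):
    """Walk the decision tree from the root to a leaf.

    Assumed layout: the root tests Child, and the Income test sits
    under the child=no branch. Race is not used by this tree.
    """
    if person["child"] == "yes":
        return "yes"  # leaf: 100% Yes, 0% No
    if person["income"] == "high":
        return "yes"  # leaf: 100% Yes, 0% No
    return "no"  # leaf: 0% Yes, 100% No


new_person = {"race": "white", "income": "high", "child": "no"}
print(predict_insurance(new_person))
```

Under that assumed layout, the person in the table (no child, high income) reaches the Income=high leaf, so the predicted value for Insurance is Yes.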


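4. Data Warehouse

Without a data warehouse: users send each query directly to the databases and need to wait for a long time (e.g., 1 day to 1 week).

With a data warehouse: the data warehouse sits between the databases and the users and holds pre-computed results, so a query can be answered from those results.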
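
The idea of answering queries from pre-computed results can be sketched in a few lines. This is only an illustration of the principle, not how a real data warehouse is built. The sales records and the names `raw_sales`, `slow_query`, `build_summary`, and `fast_query` are all hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical raw records: (region, month, amount). Not from the slides.
raw_sales = [
    ("north", "2016-01", 120),
    ("north", "2016-02", 80),
    ("south", "2016-01", 200),
    ("south", "2016-01", 50),
]


def slow_query(region, month):
    """Answer from the raw data: scans every record on every query."""
    return sum(a for r, m, a in raw_sales if r == region and m == month)


def build_summary(records):
    """Pre-compute the total for every (region, month) pair once."""
    summary = defaultdict(int)
    for region, month, amount in records:
        summary[(region, month)] += amount
    return dict(summary)


summary = build_summary(raw_sales)


def fast_query(region, month):
    """Answer from the pre-computed results: a single lookup."""
    return summary.get((region, month), 0)


print(slow_query("south", "2016-01"), fast_query("south", "2016-01"))
```

The trade-off is that the summary has to be built, stored, and refreshed when the underlying data changes, in exchange for much cheaper queries.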


5. Data Mining over Static Data

Input: Static Data

Mining tasks:
1. Association
2. Clustering
3. Classification

Output: Data Mining Results


5. Data Mining over Data Streams

Input: Unbounded Data

Mining tasks:
1. Association
2. Clustering
3. Classification

Output: Data Mining Results

Requirement: Real-time Processing
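
To make the contrast with static data concrete, the sketch below processes items one at a time and has an up-to-date result available after every arrival, without storing the stream itself. It is an illustration only: the purchase stream and the names `running_frequencies` and `purchases` are hypothetical, and exact counting like this still needs memory that grows with the number of distinct items.

```python
from collections import Counter


def running_frequencies(stream):
    """Consume items one at a time and yield the updated counts after each
    arrival, without ever storing the stream itself."""
    counts = Counter()
    for item in stream:
        counts[item] += 1
        yield item, dict(counts)


def purchases():
    """Hypothetical stream; a real one would never end."""
    yield from ["Apple", "Orange", "Orange", "Milk", "Apple"]


snapshots = list(running_frequencies(purchases()))
for item, counts in snapshots:
    print(item, counts)
```

Collecting the snapshots into a list is done here only so the finite example can be printed; with truly unbounded data, each snapshot would be consumed as it is produced.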


6. Web Databases

Raymond Wong


How to rank the webpages?


7. Multi-criteria Decision Making

Hotel   Price   Distance to beach (km)
a       1000    4
b       2400    5
c       3000    1

3 hotels

Suppose we want to look for a hotel which is close to a beach.

We have two attributes. Which hotel should we select?

Suppose we compare hotel a and hotel b

We know that hotel a is “better” than hotel b because:
1. Price of hotel a is smaller
2. Distance of hotel a is smaller



Suppose we compare hotel a and hotel c

We cannot determine that hotel a is “better” than hotel c (wrt the two attributes).

We cannot determine that hotel c is “better” than hotel a (wrt the two attributes).

This is because:
1. Price of hotel a is smaller
2. Distance of hotel c is smaller
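
The hotel comparisons can be written as a small check. This is a minimal sketch under one assumption: a hotel counts as “better” when it is at least as good on both attributes and strictly better on at least one, with smaller values preferred for both price and distance. The names `hotels`, `is_better`, and `not_beaten` are hypothetical.

```python
# Hotels from the table: name -> (price, distance to beach in km).
hotels = {"a": (1000, 4), "b": (2400, 5), "c": (3000, 1)}


def is_better(x, y):
    """True if hotel x is at least as good as hotel y on every attribute
    and strictly better on at least one (smaller is better for both)."""
    pairs = list(zip(hotels[x], hotels[y]))
    return all(p <= q for p, q in pairs) and any(p < q for p, q in pairs)


def not_beaten():
    """Hotels for which no other hotel is better."""
    return [
        h for h in hotels
        if not any(is_better(other, h) for other in hotels if other != h)
    ]


print(is_better("a", "b"))
print(is_better("a", "c"), is_better("c", "a"))
print(not_beaten())
```

Hotel b drops out because hotel a is better on both attributes, while a and c both remain because neither is better than the other.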