data mining and knowledge discovery


Upload: james-wong

Post on 23-Jan-2017


TRANSCRIPT

Page 1: Data mining and knowledge discovery

From Data to Wisdom

• Data
  - The raw material of information
• Information
  - Data organized and presented by someone
• Knowledge
  - Information read, heard, or seen, then understood and integrated
• Wisdom
  - Distilled knowledge and understanding which can lead to decisions

The Information Hierarchy (top to bottom): Wisdom, Knowledge, Information, Data

Page 2: Data mining and knowledge discovery

Why Data Mining?

• The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability
    * Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    * Business: Web, e-commerce, transactions, stocks, …
    * Science: remote sensing, bioinformatics, scientific simulation, …
    * Society and everyone: news, images, video, documents
    * Internet …

Page 4: Data mining and knowledge discovery

How much data?

• Google: ~20-30 PB a day
• Wayback Machine: ~4 PB + 100-200 TB/month
• Facebook: ~3 PB of user data + 25 TB/day
• eBay: ~7 PB of user data + 50 TB/day
• CERN’s Large Hadron Collider generates 15 PB a year
• In 2010, enterprises stored 7 exabytes = 7,000,000,000 GB

640K ought to be enough for anybody.

Page 5: Data mining and knowledge discovery

Big Data Growing

The Untapped Data Gap: Most of the useful data will not be tagged or analyzed, partly due to a skill shortage.

IDC predicts: From 2005 to 2020, the digital universe will double every 2 years and grow from 130 exabytes to 40,000 exabytes, or 5,200 GB per person in 2020.
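The figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only the numbers on the slide and assumes decimal units (1 exabyte = 10^9 GB); the variable names are illustrative.

```python
import math

# Figures quoted on the slide
start_eb = 130       # size of the digital universe in 2005, in exabytes
end_eb = 40_000      # projected size in 2020, in exabytes
years = 2020 - 2005
gb_per_person = 5_200

# Assumption: decimal units, 1 exabyte = 10**9 gigabytes
GB_PER_EB = 10**9

# How many doublings does the projected growth correspond to?
growth_factor = end_eb / start_eb
doublings = math.log2(growth_factor)
implied_doubling_time = years / doublings

# How many people does the per-person figure imply?
implied_population = end_eb * GB_PER_EB / gb_per_person

print(f"growth factor: {growth_factor:.0f}x over {years} years")
print(f"implied doubling time: {implied_doubling_time:.2f} years")
print(f"implied population: {implied_population:.2e} people")
```

The implied doubling time comes out a little under two years, and the per-person figure corresponds to a world population of several billion, so the slide's numbers are roughly self-consistent.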

Page 6: Data mining and knowledge discovery

What Is Data Mining?

• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
• Data mining: a definition

  The non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data in large data repositories

  - Non-trivial: obvious knowledge is not useful
  - Implicit: hidden knowledge that is difficult to observe
  - Previously unknown
  - Potentially useful: actionable; easy to understand

Page 7: Data mining and knowledge discovery

Data Mining: Confluence of Multiple Disciplines

[Diagram: Data Mining at the center, drawing on Machine Learning, Statistics, Applications, Algorithm, Pattern Recognition, High-Performance Computing, Visualization, and Database Technology]

Page 8: Data mining and knowledge discovery

Data Mining’s Virtuous Cycle

1. Identifying the problem

2. Mining data to transform it into actionable information

3. Acting on the information

4. Measuring the results

Page 9: Data mining and knowledge discovery

The Knowledge Discovery Process

• Data Mining vs. Knowledge Discovery in Databases (KDD)
  - DM and KDD are often used interchangeably
  - Actually, DM is only one part of the KDD process

- The KDD Process

Page 10: Data mining and knowledge discovery

Types of Knowledge Discovery

• Two kinds of knowledge discovery: directed and undirected
• Directed knowledge discovery
  - Purpose: explain the value of some field in terms of all the others (goal-oriented)
  - Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances
  - Examples:
    * which products show increased sales when cream cheese is discounted
    * which banner ad to use on a web page for a given user coming to the site
• Undirected knowledge discovery
  - Purpose: find patterns in the data that may be interesting (no target field)
  - Method: clustering, affinity grouping
  - Examples:
    * which products in the catalog often sell together
    * market segmentation (find groups of customers/users with similar characteristics or behavioral patterns)
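To make the directed case concrete, here is a minimal sketch of the banner-ad example: the target field is "clicked", and the other fields are used to predict it. The click log, segment names, and the `best_banner` helper are all hypothetical, and picking the banner with the highest observed click rate is only one simple way to do this, not a method prescribed by the slides.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (user_segment, banner_shown, clicked)
log = [
    ("student", "laptop_ad", True),
    ("student", "laptop_ad", True),
    ("student", "travel_ad", False),
    ("retiree", "travel_ad", True),
    ("retiree", "laptop_ad", False),
    ("retiree", "travel_ad", True),
]

# Count impressions and clicks per (segment, banner)
shows = defaultdict(Counter)
clicks = defaultdict(Counter)
for segment, banner, clicked in log:
    shows[segment][banner] += 1
    if clicked:
        clicks[segment][banner] += 1

def best_banner(segment):
    """Return the banner with the highest observed click rate for a segment.

    Only works for segments that appear in the log.
    """
    seen = shows[segment]
    return max(seen, key=lambda b: clicks[segment][b] / seen[b])

print(best_banner("student"))
print(best_banner("retiree"))
```

An undirected analysis of the same log would have no "clicked" target; it would instead look for groups of users or banners that behave alike.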

Page 11: Data mining and knowledge discovery

From Data Mining to Data Science

Page 12: Data mining and knowledge discovery

Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
  - Object-relational databases, heterogeneous databases, and legacy databases
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks, and information networks
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

Page 13: Data mining and knowledge discovery

Data Mining: What Kind of Data?

• Structured databases
  - relational, object-relational, etc.
  - can use SQL to perform parts of the process, e.g., SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category
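The query above can be tried against a small in-memory database. This is a sketch using Python's built-in sqlite3 module; the `Items` table layout and the sample rows are assumptions made up for illustration, since the slide shows only the query.

```python
import sqlite3

# In-memory database with a hypothetical Items table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("Alien", "video", "sci-fi"),
        ("Solaris", "video", "sci-fi"),
        ("Duck Soup", "video", "comedy"),
        ("Dune", "book", "sci-fi"),
    ],
)

# Count video items per category; the string literal 'video' is quoted,
# and category is selected so each count is labeled.
rows = conn.execute(
    "SELECT category, count(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
conn.close()

for category, n in rows:
    print(category, n)
```

The ORDER BY clause is added only to make the output order predictable; GROUP BY alone does not guarantee one.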

Page 14: Data mining and knowledge discovery

Data Mining: What Kind of Data?

• Flat files
  - most common data source
  - can be text (or HTML) or binary
  - may contain transactions, statistical data, measurements, etc.
• Transactional databases
  - set of records, each with a transaction id, time stamp, and a set of items
  - may have an associated “description” file for the items
  - typical source of data used in market basket analysis
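A transactional record as described above (id, time stamp, set of items) maps naturally onto a small data structure. The sketch below uses made-up transactions and counts how often each pair of items appears in the same basket, which is the most basic building block of market basket analysis; real tools add measures and pruning that are not shown here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: id, time stamp, and a set of items
transactions = [
    {"id": 1, "time": "2017-01-20T09:15", "items": {"bagel", "cream cheese", "coffee"}},
    {"id": 2, "time": "2017-01-20T09:40", "items": {"bagel", "cream cheese"}},
    {"id": 3, "time": "2017-01-20T10:05", "items": {"coffee", "milk"}},
    {"id": 4, "time": "2017-01-20T10:30", "items": {"bagel", "coffee"}},
]

# Count co-occurrences of item pairs within each transaction.
# Sorting the items gives every pair a canonical (alphabetical) order.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t["items"]), 2):
        pair_counts[pair] += 1

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```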

Page 15: Data mining and knowledge discovery

Data Mining: What Kind of Data?

• Other types of databases
  - legacy databases
  - multimedia databases (usually very high-dimensional)
  - spatial databases (containing geographical information, such as maps, or satellite imaging data, etc.)
  - time-series and temporal data (time-dependent information such as stock market data; usually very dynamic)
• World Wide Web
  - basically a large, heterogeneous, distributed database
  - need for new or additional tools and techniques
    * information retrieval, filtering, and extraction
    * agents to assist in browsing and filtering
    * Web content, usage, and structure (linkage) mining tools
  - the “social Web”
    * user-generated metadata, social networks, shared resources, etc.

Page 16: Data mining and knowledge discovery

What Can Data Mining Do?

• Many data mining tasks
  - often inter-related
  - often need to try different techniques/algorithms for each task
  - each task may require different types of knowledge discovery
• What are some data mining tasks?
  - Classification
  - Prediction
  - Clustering
  - Affinity grouping / association discovery
  - Sequence analysis
  - Characterization
  - Discrimination

Page 17: Data mining and knowledge discovery

Some Applications of Data Mining

• Business data analysis and decision support
  - Marketing focalization
    * Recognizing specific market segments that respond to particular characteristics
    * Return on mailing campaign (target marketing)
  - Customer profiling
    * Segmentation of customers for marketing strategies and/or product offerings
    * Customer behavior understanding
    * Customer retention and loyalty
    * Mass customization / personalization

Page 18: Data mining and knowledge discovery

Some Applications of Data Mining

• Business data analysis and decision support (cont.)
  - Market analysis and management
    * Provide summary information for decision-making
    * Market basket analysis, cross-selling, market segmentation
    * Resource planning
  - Risk analysis and management
    * "What if" analysis
    * Forecasting
    * Pricing analysis, competitive analysis
    * Time-series analysis (e.g., stock market)

Page 19: Data mining and knowledge discovery

Some Applications of Data Mining

• Fraud detection
  - Detecting telephone fraud:
    * Telephone call model: destination of the call, duration, time of day or week
    * Analyze patterns that deviate from an expected norm
    * British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion-dollar fraud scheme
  - Detection of credit-card fraud
  - Detecting suspicious money transactions (money laundering)
• Text mining
  - Message filtering (e-mail, newsgroups, etc.)
  - Newspaper article analysis
  - Text and document categorization
• Web mining
  - Mining patterns from the content, usage, and structure of Web resources
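The fraud-detection idea of flagging "patterns that deviate from an expected norm" can be illustrated with a very simple rule: treat an account's past call durations as the norm and flag new calls that lie many standard deviations away. The data, the `is_suspicious` helper, and the threshold of 3 are illustrative assumptions; this is not the method British Telecom used, and real systems model destination and time of day as well.

```python
from statistics import mean, stdev

# Hypothetical call durations (minutes) previously seen on one account
history = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0]

# New calls to check against the account's usual behavior
new_calls = [3.8, 42.0]

mu = mean(history)       # the "expected norm"
sigma = stdev(history)   # typical spread around it

def is_suspicious(duration, threshold=3.0):
    """Flag a call whose duration is more than `threshold` standard
    deviations away from the account's historical mean."""
    return abs(duration - mu) / sigma > threshold

flags = [is_suspicious(d) for d in new_calls]
for duration, flagged in zip(new_calls, flags):
    print(duration, "suspicious" if flagged else "normal")
```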

Page 20: Data mining and knowledge discovery

Types of Web Mining

[Diagram: Web Mining divides into Web Content Mining, Web Structure Mining, and Web Usage Mining]

Page 21: Data mining and knowledge discovery

Types of Web Mining

Applications of Web content mining:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools

Page 22: Data mining and knowledge discovery

Types of Web Mining

Applications of Web usage mining:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems

Page 23: Data mining and knowledge discovery

Types of Web Mining

Applications of Web structure mining:
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis
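"Hubs" and "authorities" are the two page roles scored by the HITS algorithm: a good hub links to many good authorities, and a good authority is linked to by many good hubs. The sketch below runs that mutual reinforcement on a tiny made-up link graph. The graph, the `hits` function name, and the fixed iteration count are assumptions for illustration; the slides do not specify an implementation.

```python
import math

# Hypothetical link graph: page -> pages it links to
links = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a link graph."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to p
        auth = normalize(
            {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        )
        # Hub: sum of authority scores of the pages that p links to
        hub = normalize({p: sum(auth[q] for q in links[p]) for p in pages})
    return hub, auth

hub, auth = hits(links)
print("best authority:", max(auth, key=auth.get))
print("hub scores:", {p: round(s, 3) for p, s in hub.items()})
```

In this graph, page "d" receives the most links and ends up as the top authority, while "a" and "b", which link to both "c" and "d", share the top hub score.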

Page 24: Data mining and knowledge discovery

The Knowledge Discovery Process

- The KDD Process

• Next: We first focus on understanding the data and data preparation/transformation