data mining for scientific applications

60
© Copyright 2006, Natasha Balac 1 Introduction to Data Mining Natasha Balac, Ph.D.

Upload: tommy96

Post on 13-Jan-2015

1.980 views

Category:

Documents


1 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 1

Introduction to Data Mining

Natasha Balac, Ph.D.

Page 2: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 2

Outline

Motivation: Why Data Mining?

What is Data Mining?

History of Data Mining

Data Mining Functionality and Terminology

Data Mining Applications

Are all the Patterns Interesting?

Issues in Data Mining

Page 3: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 3

Necessity is the Mother of Invention

Data explosion

Automated data collection tools and mature database

technology lead to tremendous amounts of data stored in

databases, data warehouses and other information

repositories

We are drowning in data, but starving for

knowledge!

Page 4: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 4

Necessity is the Mother of Invention

We are drowning in data, but starving for

knowledge!

Solution

Data Mining

Extraction of interesting knowledge (rules, regularities,

patterns, constraints) from data in large databases

Page 5: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 5

Why DATA MINING?

Huge amounts of data Electronic records of our decisions

Choices in the supermarket Financial records Our comings and goings

We swipe our way through the world – every swipe is a record in a database

Data rich – but information poor Lying hidden in all this data is information!

Page 6: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 6

Data vs. Information

Society produces massive amounts of data business, science, medicine, economics, sports, …

Potentially valuable resource Raw data is useless

need techniques to automatically extract information Data: recorded facts Information: patterns underlying the data

Page 7: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 7

What is DATA MINING?

Extracting or “mining” knowledge from large amounts of data

Data -driven discovery and modeling of hidden patterns (we never new existed) in large volumes of data

Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data

Page 8: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 8

Data mining: Extraction of interesting (non-trivial, implicit,

previously unknown and potentially useful) information or patterns from data in large databases

What Is Data Mining?

Page 9: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 9

Data Mining is NOT

Data Warehousing (Deductive) query processing

SQL/ Reporting Software Agents Expert Systems Online Analytical Processing (OLAP) Statistical Analysis Tool Data visualization

Page 10: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 10

Data Mining

Programs that detect patterns and rules in the data

Strong patterns can be used to make non-trivial predictions on new data

Page 11: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 11

Data Mining Challenges

Problem 1: most patterns are not interesting

Problem 2: patterns may be inexact or completely spurious when noisy data present

Page 12: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 12

Machine Learning Techniques

Technical basis for data mining: algorithms for

acquiring structural descriptions from examples

Methods originate from artificial intelligence,

statistics, and research on databases

Page 13: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 13

Machine Learning Techniques

Structural descriptions represent patterns explicitly can be used to predict outcome in new situation understand and explain how prediction is derived

(maybe even more important)

Page 14: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 14

Multidisciplinary Field

Data Mining

Database Technology

Statistics

OtherDisciplines

Artificial Intelligence

MachineLearning Visualization

Page 15: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 15

Multidisciplinary Field

Database technology Artificial Intelligence

Machine Learning including Neural Networks Statistics Pattern recognition Knowledge-based systems/acquisition High-performance computing Data visualization

Page 16: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 16

History of Data Mining

Page 17: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 17

History

Emerged late 1980s Flourished –1990s Roots traced back along three family lines

Classical Statistics Artificial Intelligence Machine Learning

Page 18: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 18

Statistics

Foundation of most DM technologies Regression analysis, standard

distribution/deviation/variance, cluster analysis, confidence intervals

Building blocks Significant role in today’s data mining –

but alone is not powerful enough

Page 19: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 19

Artificial Intelligence

Heuristics vs. Statistics Human-thought-like processing Requires vast computer processing power Supercomputers

Page 20: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 20

Machine Learning

Union of statistics and AI Blends AI heuristics with advanced statistical

analysis Machine Learning – let computer programs

learn about data they study - make different decisions based on the quality of studied data

using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms

Page 21: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 21

Data Mining

Adoption of the Machine learning techniques to the real world problems

Union: Statistics, AI, Machine learning Used to find previously hidden trends or

patterns Finding increasing acceptance in science and

business areas which need to analyze large amount of data to discover trends which could not be found otherwise

Page 22: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 22

Terminology

Gold Mining Knowledge mining from databases Knowledge extraction Data/pattern analysis Knowledge Discovery Databases or KDD Information harvesting Business intelligence

Page 23: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 23

KDD Process

Database

Selection Transformation

Data Preparation

Data Data MiningMining

Training Data

Evaluation, Verification

Model, Patterns

Page 24: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 24

LEARNING ALGORITHMS

Fundamental idea:

learn rules/patterns/relationships automatically from the data

Page 25: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 25

Data Mining Tasks

Exploratory Data Analysis Predictive Modeling: Classification and Regression Descriptive Modeling

Cluster analysis/segmentation Discovering Patterns and Rules

Association/Dependency rules Sequential patterns Temporal sequences

Deviation detection

Page 26: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 26

Data Mining Tasks

Concept/Class description: Characterization and discrimination Generalize, summarize, and contrast data

characteristics, e.g., dry vs. wet regions

Association (correlation and causality) Multi-dimensional or single-dimensional association

age(X, “20-29”) ^ income(X, “60-90K”) buys(X, “TV”)

Page 27: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 27

Data Mining Tasks

Classification and Prediction

Finding models (functions) that describe and distinguish classes or concepts for future prediction

Example: classify countries based on climate, or classify cars based on gas mileage

Presentation: If-THEN rules, decision-tree, classification rule,

neural network Prediction: Predict some unknown or missing

numerical values

Page 28: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 28

Cluster analysis Class label is unknown: Group data to form

new classes, Example: cluster houses to find distribution

patterns

Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Data Mining Tasks

Page 29: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 29

Data Mining Tasks

Outlier analysis Outlier: a data object that does not comply with the

general behavior of the data

Mostly considered as noise or exception, but is

quite useful in fraud detection, rare events analysis

Trend and evolution analysis Trend and deviation: regression analysis

Sequential pattern mining, periodicity analysis

Page 30: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 30

Data Mining: Classification Schemes

General functionality Descriptive data mining Vs. Predictive data mining

Different views - different classifications Kinds of databases to be mined

Kinds of knowledge to be discovered

Kinds of techniques employed

Kinds of applications

Page 31: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 31

A Multi-Dimensional View of Data Mining Classification

Databases to be mined Relational, transactional, object-oriented, object-

relational, active, spatial, time-series, text, multi-media,WWW, etc.

Knowledge to be mined Characterization, discrimination, association,

classification, clustering, trend, deviation and outlier analysis, etc.

Multiple/integrated functions Mining at multiple levels of abstractions

Page 32: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 32

A Multi-Dimensional View of Data Mining Classification

Techniques utilized Decision/Regression trees, clustering, neural

networks, etc. Applications adapted

Retail, telecom, banking, DNA mining, stock market analysis, Web mining

Page 33: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 33

Data Mining Applications

Science: Chemistry, Physics, Medicine Biochemical analysis Remote sensors on a satellite Telescopes – star galaxy classification Medical Image analysis

Page 34: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 34

Data Mining Applications

Bioscience Sequence-based analysis Protein structure and function prediction Protein family classification Microarray gene expression

Page 35: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 35

Pharmaceutical companies, Insurance and Health care, Medicine Drug development Identify successful medical therapies Claims analysis, fraudulent behavior Medical diagnostic tools Predict office visits

Data Mining Applications

Page 36: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 36

Financial Industry, Banks, Businesses, E-commerce Stock and investment analysis Identify loyal customers vs. risky customer Predict customer spending Risk management Sales forecasting

Data Mining Applications

Page 37: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 37

Retail and Marketing Customer buying patterns/demographic

characteristics Mailing campaigns Market basket analysis Trend analysis

Data Mining Applications

Page 38: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 38

Database analysis and decision support Market analysis and management

target marketing, customer relation management, market

basket analysis, cross selling, market segmentation

Risk analysis and management Forecasting, customer retention, improved underwriting,

quality control, competitive analysis

Fraud detection and management

Data Mining Applications

Page 39: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 39

Sports and Entertainment IBM Advanced Scout analyzed NBA game

statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat

Astronomy JPL and the Palomar Observatory discovered

22 quasars with the help of data mining

Data Mining Applications

Page 40: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 40

DATA MINING EXAMPLES

Grocery store NBA Banking and Credit Card scoring

Fraud detection Personalization & Customer Profiling Campaign Management and Database

Marketing

Page 41: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 41

Data mining at work: Case study 1

Page 42: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 42

Processing Loan Applications

Given: questionnaire with financial and personal information

Problem: should money be lend? Borderline cases referred to loan officers But: 50% of accepted borderline cases defaulted! Solution:

reject all borderline cases? Borderline cases are most active customers!

Page 43: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 43

Enter Machine Learning

Given: 1000 training examples of borderline cases

20 attributes: age, years with current employer,years at current address,

years with the bank, years at current job, other credit cards Learned rules predicted 2/3 of borderline cases

correctly! Rules could be used to explain decisions to customers

Page 44: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 44

Case study 2:Screening images

Given: radar satellite images of coastal waters

Problem: detecting oil slicks in those images

Oil slicks = dark regions with changing size and shape

Look-alike dark regions can be caused by weather conditions (e.g. high wind)

Expensive process requiring highly trained personnel

Page 45: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 45

Dark regions extracted from normalized image Attributes:

size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background

Constraints: Scarcity of training examples (oil slicks are rare!) Unbalanced data: most dark regions aren’t oil

slicks Regions from same image form a batch Requirement is adjustable false-alarm rate

Enter Machine Learning

Page 46: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 46

Data Mining Challenges

Computationally expensive to investigate all possibilities

Dealing with noise/missing information and errors in data

Choosing appropriate attributes/input representation

Finding the minimal attribute space Finding adequate evaluation function(s) Extracting meaningful information Not overfitting

Page 47: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 47

Are All the “Discovered” Patterns Interesting?

Interestingness measures: A pattern is

interesting if it is easily understood by humans,

valid on new or test data with some degree of

certainty, potentially useful, novel, or validates

some hypothesis that a user seeks to confirm

Page 48: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 48

Are All the “Discovered” Patterns Interesting?

Objective vs. subjective measures: Objective: based on statistics and structures of

patterns support and confidence

Subjective: based on user’s belief in the data unexpectedness, novelty, action ability, etc.

Page 49: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 49

Can We Find All and Only Interesting Patterns?

Completeness - Find all the interesting

patterns Can a data mining system find all the interesting

patterns?

Association vs. classification vs. clustering

Page 50: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 50

Can We Find All and Only Interesting Patterns?

Optimization - Search for only interesting patterns

Can a data mining system find only the interesting

patterns?

Approaches First general all the patterns and then filter out the

uninteresting ones

Mining query optimization

Page 51: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 51

Major Issues in Data Mining

Mining methodology and user interaction Mining different kinds of knowledge in databases Incorporation of background knowledge Handling noise and incomplete data Pattern evaluation: the interestingness problem Expression and visualization of data mining results

Page 52: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 52

Performance and scalability Efficiency of data mining algorithms Parallel, distributed and incremental mining

methods Issues relating to the diversity of data types

Handling relational and complex types of data Mining information from diverse databases

Major Issues in Data Mining

Page 53: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 53

Issues related to applications and social impacts Application of discovered knowledge

Domain-specific data mining tools Intelligent query answering Expert systems Process control and decision making

A knowledge fusion problem Protection of data security, integrity, and privacy

Major Issues in Data Mining

Page 54: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 54

Summary

Data mining: discovering interesting patterns from large amounts of data

A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

Page 55: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 55

Summary

Mining can be performed in a variety of information repositories

Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.

Classification of data mining systems Major issues in data mining

Page 56: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 56

Exercise

Practical Data mining example

Page 57: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 57

Kinds of Data Mining

Decision Tree Learning Clustering Neural Networks Association Rules Support Vector Machines Genetic Algorithms Nearest Neighbor Method

Page 58: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 58

Decision Tree ExampleGrandparents

A lot A little

Page 59: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 59

DECISION TREE FOR THE CONCEPT

“Play Tennis”Day Outlook Temp Humidity Wind PlayTennis

D1 Sunny Hot High Weak NoD2 Sunny Hot High Strong NoD3 Overcast Hot High Weak YesD4 Rain Mild High Weak YesD5 Rain Cool Normal Weak YesD6 Rain Cool Normal Strong NoD7 Overcast Cool Normal Strong YesD8 Sunny Mild High Weak NoD9 Sunny Cool Normal Weak YesD10 Rain Mild Normal Weak YesD11 Sunny Mild Normal Strong YesD12 Overcast Mild High Strong YesD13 Overcast Hot Normal Weak YesD14 Rain Mild High Strong NoMitchell, 1997

Page 60: Data Mining for Scientific Applications

© Copyright 2006, Natasha Balac 60

DECISION TREE FOR THE CONCEPT

“Play Tennis”

[Mitchell,1997]