unit 3 part i data mining

Data Mining-PART I By M.Dhilsath Fathima

Upload: dhilsath-fathima

Post on 22-Jan-2018


Category:

Engineering



TRANSCRIPT

Page 1: Unit 3 part i Data mining

Data Mining-PART I

By

M.Dhilsath Fathima

Page 2: Unit 3 part i Data mining

Topics to cover

• Introduction

• Types of Data

• Data Mining Functionalities

• Interestingness of Patterns

• Classification of Data Mining Systems

• Data Mining Task Primitives

• Integration of a Data Mining System with a Data Warehouse

• Issues

• Data Preprocessing

Page 3: Unit 3 part i Data mining

What is a Database?

A database is any organized collection of data.

Page 4: Unit 3 part i Data mining

Examples: Co-workers

Page 5: Unit 3 part i Data mining

Examples: Patient Information

Page 6: Unit 3 part i Data mining

Examples: Airline reservation system

Page 7: Unit 3 part i Data mining

DATABASE

• Database: A shared collection of logically related data (and a description of this data), designed to meet the information needs of an organization.

• Database Management System (DBMS): A software system that enables users to define, create, and maintain the database and that provides controlled access to this database.

Page 8: Unit 3 part i Data mining

Who does it, and how?

• A Database Management System (DBMS) does this job.

• Software tools include Access, FileMaker, Lotus Notes, Oracle, and SQL Server, among others.

• A DBMS includes tools to add, modify, or delete data in the database, to ask questions (queries) about the stored data, and to produce reports summarizing selected contents.

Page 9: Unit 3 part i Data mining

Why do we need a database?

• To keep records of our:

– Clients

– Staff

– Volunteers

• To keep a record of activities

• To keep sales records

• To develop reports

• To perform queries

Page 10: Unit 3 part i Data mining

Data vs. information

• What is data?

– Data is unprocessed information.

• What is information?

– Information is data that has been organized and communicated in a logical and meaningful manner.

Page 11: Unit 3 part i Data mining

Purpose of a Database System / Stages of a Database System

– Data is converted into information, and information is converted into knowledge.

– Knowledge: information evaluated and organized so that it can be used purposefully.

The purpose is to transform:

Data (unprocessed information) → Information (processed data) → Knowledge (information evaluated using measures) → Action (data analysis and future prediction)

Page 12: Unit 3 part i Data mining


Data Mining works with Warehouse Data

• Data warehousing provides the enterprise with a memory.

• Data mining provides the enterprise with intelligence.

Page 13: Unit 3 part i Data mining

Data Mining works with Warehouse Data

Page 14: Unit 3 part i Data mining

What is Data Mining?

• Nowadays, huge data sets have become available due to advances in technology.

• As a result, there is increasing interest in various scientific communities in exploring emerging data mining techniques for the analysis of these large data sets.

• Data mining is the extraction of implicit, previously unknown, and potentially useful information, patterns, and associations from data.

• Data mining is the exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

Page 15: Unit 3 part i Data mining

WHO USES DATA MINING?

• Banking – future prediction

• Amazon.com (online stores) – recommendation

• Facebook – predicting how active a user will be after 3 months

13/03/16 Seval Ünver | CENG 553 15

Page 16: Unit 3 part i Data mining

Data mining is…


Page 17: Unit 3 part i Data mining

DATA MINING IS NOT…

• Data warehousing

• SQL / ad hoc queries / reporting

• Online Analytical Processing (OLAP)

• Data visualization

DATA MINING IS…

• Exploring data

• Finding patterns

• Performing prediction


Page 18: Unit 3 part i Data mining

KDD Process

• Knowledge discovery in databases (KDD) is a multi-step process of finding useful information and patterns in data.

• Data Mining is the use of algorithms to extract information and patterns derived by the KDD process.

• Many texts treat KDD and Data Mining as the same process, but it is also possible to think of Data Mining as the discovery part of KDD.

Page 19: Unit 3 part i Data mining

Steps of KDD Process

Page 20: Unit 3 part i Data mining

STEPS OF KDD PROCESS

1. Selection (data extraction): Obtain data from heterogeneous data sources such as databases, data warehouses, the World Wide Web, or other information repositories.

2. Preprocessing (data cleaning): Incomplete, noisy, or inconsistent data must be cleaned. Missing data may be ignored or predicted; erroneous data may be deleted or corrected.

3. Transformation (data integration): Combine data from multiple sources into a coherent store. Data can be encoded in common formats, normalized, or reduced.

Page 21: Unit 3 part i Data mining

Steps of KDD Process

4. Data mining: Apply algorithms to the transformed data and extract patterns.

5. Pattern interpretation/evaluation:

– Pattern evaluation: Evaluate the interestingness of the resulting patterns, or apply interestingness measures to filter the discovered patterns.

– Knowledge presentation: Present the mined knowledge; visualization techniques can be used.
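The steps of the KDD process can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two record sources, the function names, and the 0.5 support threshold are all hypothetical, and counting co-occurring item pairs stands in for whatever mining algorithm a real system would apply.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [
    {"id": 1, "items": "Bread, Milk"},
    {"id": 2, "items": "bread, butter"},
    {"id": 3, "items": None},  # incomplete record
]
source_b = [
    {"id": 4, "items": "milk, bread, butter"},
    {"id": 5, "items": "Milk, Bread"},
]

def select(*sources):
    """Step 1, selection: gather records from several sources."""
    return [record for source in sources for record in source]

def preprocess(records):
    """Step 2, preprocessing: here, incomplete records are simply ignored."""
    return [r for r in records if r["items"]]

def transform(records):
    """Step 3, transformation: encode every record in one common format."""
    return [
        frozenset(item.strip().lower() for item in r["items"].split(","))
        for r in records
    ]

def mine(transactions):
    """Step 4, data mining: count how often each pair of items co-occurs."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

def evaluate(pair_counts, n_transactions, min_support=0.5):
    """Step 5, evaluation: keep pairs whose support meets the threshold."""
    return {
        pair: count / n_transactions
        for pair, count in pair_counts.items()
        if count / n_transactions >= min_support
    }

transactions = transform(preprocess(select(source_a, source_b)))
patterns = evaluate(mine(transactions), len(transactions))

# Knowledge presentation: print the surviving patterns as a small table.
for pair, support in sorted(patterns.items()):
    print(f"{pair[0]} + {pair[1]}: support {support:.2f}")
```

Each function corresponds to one step, so a different cleaning rule or mining algorithm can be swapped in without touching the rest of the chain.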

Page 22: Unit 3 part i Data mining

Types of Data / What Kind of Data Can Be Mined

• Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data.

• Relational databases

• Data warehouses

• Transaction databases

• Advanced DB systems and information repositories

– Spatial databases

– Time-series data

– Multimedia databases

– WWW

Page 23: Unit 3 part i Data mining

Relational Databases

– A relational database consists of a set of tables containing either values of entity attributes or values of attributes from entity relationships.

– Tables have columns and rows, where columns represent attributes and rows represent tuples.

– A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key.
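A minimal sketch of these ideas, using Python's built-in sqlite3 module with an in-memory database. The table names, column names, and sample rows are hypothetical. Two tables hold entity attributes (customer, item), and a third holds a relationship between them (purchase); each tuple is identified by its key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Entity tables: columns are attributes, rows are tuples.
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)
conn.execute(
    "CREATE TABLE item (item_id INTEGER PRIMARY KEY, type TEXT, price REAL)"
)
# Relationship table: each tuple links one customer to one item.
conn.execute(
    """CREATE TABLE purchase (
           cust_id INTEGER REFERENCES customer(cust_id),
           item_id INTEGER REFERENCES item(item_id),
           PRIMARY KEY (cust_id, item_id))"""
)

conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)", [(1, "Asha", 34), (2, "Ravi", 27)]
)
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)", [(10, "laptop", 900.0), (11, "mouse", 15.0)]
)
conn.executemany("INSERT INTO purchase VALUES (?, ?)", [(1, 10), (1, 11), (2, 11)])

# The unique key identifies a tuple, so a second customer with cust_id 1 is rejected.
try:
    conn.execute("INSERT INTO customer VALUES (1, 'Duplicate', 40)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Follow the relationship tuples to list who bought what.
rows = conn.execute(
    """SELECT c.name, i.type
       FROM purchase p
       JOIN customer c ON c.cust_id = p.cust_id
       JOIN item i ON i.item_id = p.item_id
       ORDER BY c.name, i.type"""
).fetchall()
print(rows)
```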

Page 24: Unit 3 part i Data mining

Data Warehouse

• A data warehouse is a repository of data collected from multiple data sources (often heterogeneous) and is intended to be used as a whole under a single unified schema. It makes it possible to analyze data from different sources under the same roof.

Page 25: Unit 3 part i Data mining

Transaction Databases

• A transaction database is a set of records representing transactions, each with a time stamp, an identifier, and a set of items. Descriptive data for the items may also be associated with the transaction files.

• Transactions are usually stored in flat files or stored in two normalized transaction tables, one for the transactions and one for the transaction items.

• Applications: Airline reservation, Railway reservation, Log records etc.
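The two storage layouts mentioned above can be sketched in a few lines. The flat-file layout (identifier, time stamp, semicolon-separated items) and the sample values are assumptions made for illustration only; the loop splits each flat record into the two normalized tables.

```python
import csv
import io

# Hypothetical flat file: one line per transaction (identifier, time stamp, items).
flat_file = io.StringIO(
    "T100,2018-01-22T10:15,bread;milk\n"
    "T200,2018-01-22T10:40,bread;butter;jam\n"
)

transactions = []       # normalized table 1: one row per transaction
transaction_items = []  # normalized table 2: one row per (transaction, item)

for trans_id, timestamp, items in csv.reader(flat_file):
    transactions.append((trans_id, timestamp))
    for item in items.split(";"):
        transaction_items.append((trans_id, item))

def items_of(trans_id):
    """Rebuild the item set of one transaction from the item table."""
    return {item for tid, item in transaction_items if tid == trans_id}

print(transactions)
print(items_of("T200"))
```

The item sets rebuilt by `items_of` are the form that pattern-mining algorithms usually take as input.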

Page 26: Unit 3 part i Data mining

MULTIMEDIA DATABASE

• Multimedia databases include video, images, audio, sound clips, and text data. They can be stored in extended object-relational or object-oriented databases, or simply on a file system.

• Ex: Digital music players, social media, electronic publishing.

Page 27: Unit 3 part i Data mining

Spatial Databases

• A spatial database is a database that is enhanced to store and access spatial data that defines a geometric space.

• These data are often associated with geographic locations and features, or constructed features such as cities. Data in spatial databases are stored as coordinates, points, lines, polygons, and topology.

• Ex: Storing geographical information such as maps, and global or regional positioning.

Page 28: Unit 3 part i Data mining

Time Series Database

• A time-series database is a database that contains data for each point in time.

• Examples: Weather data, stock market data, logged browser activity, ocean tides.

Page 29: Unit 3 part i Data mining

Time Series Database-Example

Page 30: Unit 3 part i Data mining

World Wide Web

• The World Wide Web is the most heterogeneous and dynamic repository available.

• Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications.

Page 31: Unit 3 part i Data mining

Typical Architecture of Data Mining System

Page 32: Unit 3 part i Data mining

Integration of a Data Mining System with a Database/Data Warehouse System

The integration schemes are as follows −

• No coupling − In this scheme, the data mining system does not use any of the database or data warehouse functions. It fetches the data directly from a particular source and processes that data using data mining algorithms. The data mining result is stored in another file. (Ex: Collecting data directly from a transactional database.)

• Loose coupling / semi-tight coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data, or it fetches data directly from particular sources. (Ex: Data taken from a transactional DB plus a database/DWH.)

• Tight coupling − In this scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.
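The difference between the schemes lies in where the work happens, which the following sketch tries to make concrete. It assumes an in-memory SQLite database as a stand-in for the database/data warehouse system, a hypothetical `sales` table, and simple item counting as a stand-in for a mining algorithm. The SQL aggregation used for tight coupling is only an analogy: in a real tightly coupled system the mining operators themselves are part of the database system.

```python
import sqlite3
from collections import Counter

# Hypothetical export file contents: transaction id, item, price.
exported_lines = [
    "T1,laptop,900.0",
    "T1,mouse,15.0",
    "T2,mouse,15.0",
    "T3,laptop,950.0",
]

# The same data loaded into a database (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (trans_id TEXT, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(t, i, float(p)) for t, i, p in (line.split(",") for line in exported_lines)],
)

def no_coupling(lines):
    """Read a raw file; parsing, filtering, and counting all happen outside the DB."""
    items = []
    for line in lines:
        _, item, price = line.split(",")
        if float(price) >= 100:
            items.append(item)
    return dict(Counter(items))

def loose_coupling(connection):
    """Let the database select the relevant data, then count outside it."""
    rows = connection.execute(
        "SELECT item FROM sales WHERE price >= 100"
    ).fetchall()
    return dict(Counter(item for (item,) in rows))

def tight_coupling(connection):
    """Push the whole computation into the database (analogy only)."""
    rows = connection.execute(
        "SELECT item, COUNT(*) FROM sales WHERE price >= 100 GROUP BY item"
    ).fetchall()
    return dict(rows)

print(no_coupling(exported_lines), loose_coupling(conn), tight_coupling(conn))
```

All three functions return the same counts; what changes is how much of the database system's functionality is used along the way.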

Page 33: Unit 3 part i Data mining

Integrated Architecture of a Data Mining System with a DWH / An OLAM System Architecture

Page 34: Unit 3 part i Data mining

Data Mining Task Primitives

• We can specify a data mining task in the form of a data mining query.

• This query is input to the system.

• A data mining query is defined in terms of data mining task primitives.

• Note − These primitives allow us to communicate interactively with the data mining system.

Here is the list of data mining task primitives −

1. Kind of knowledge to be mined.

2. Set of task relevant data to be mined.

3. Representation for visualizing the discovered patterns.

4. Background knowledge to be used in discovery process.

5. Interestingness measures and thresholds for pattern evaluation.

Page 35: Unit 3 part i Data mining

Data Mining Task Primitives-Example of Data mining query

• Example query:

use database AllElectronics_db
use state_location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table
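One way to read the query above is as a bundle of the five task primitives. The sketch below records it in a plain Python structure; the `MiningQuery` class and its field names are hypothetical and are not part of any data mining query language or library.

```python
from dataclasses import dataclass, field

@dataclass
class MiningQuery:
    """Hypothetical container grouping a query by task primitive."""
    kind_of_knowledge: str                 # primitive 1
    relevant_attributes: list              # primitive 2: attributes of interest
    source_tables: list                    # primitive 2: where the data lives
    conditions: list                       # primitive 2: selection conditions
    display_as: str = "table"              # primitive 3
    background_knowledge: dict = field(default_factory=dict)  # primitive 4
    interestingness: dict = field(default_factory=dict)       # primitive 5

query = MiningQuery(
    kind_of_knowledge="characterization",
    relevant_attributes=["C.age", "I.type", "I.place_made"],
    source_tables=["customer C", "item I", "purchase P", "items_sold S", "branch B"],
    conditions=[
        "I.item_ID = S.item_ID",
        "P.cust_ID = C.cust_ID",
        'P.method_paid = "AmEx"',
        'B.address = "Canada"',
        "I.price >= 100",
    ],
    display_as="table",
    # A location concept hierarchy is applied to B.address.
    background_knowledge={"B.address": "location concept hierarchy"},
    interestingness={"noise_threshold": 0.05},
)

print(query.kind_of_knowledge, query.interestingness)
```

Grouping the clauses this way shows that each line of the query supplies exactly one of the primitives listed earlier.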

Page 36: Unit 3 part i Data mining

Data Mining Task Primitives (cont.)

1. Kind of knowledge to be mined

– It refers to the kind of functions to be performed. These functions are −

• Characterization

• Association and Correlation Analysis

• Classification

• Prediction

• Clustering

• Outlier Analysis

2. Set of task-relevant data to be mined

– This is the portion of the database in which the user is interested. This portion includes the following −

• Database attributes

• Data warehouse dimensions of interest

Page 37: Unit 3 part i Data mining

Data Mining Task Primitives (cont.)

3. Representation for visualizing the discovered patterns

– This refers to the form in which discovered patterns are to be displayed. These representations may include the following −

• Rules

• Tables

• Charts

• Graphs

• Decision Trees

• Cubes

Page 38: Unit 3 part i Data mining

Data Mining Task Primitives (cont.)

4. Background knowledge

– Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.

5. Interestingness measures and thresholds for pattern evaluation

– These are used to evaluate the patterns discovered by the knowledge discovery process. There are different interestingness measures for different kinds of knowledge.
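As an illustration of measures and thresholds, the sketch below uses support and confidence, two measures commonly applied to association rules. The slides do not name specific measures, so this choice, the sample transactions, and the threshold values are all assumptions.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Assumed thresholds supplied by the user as part of the mining task.
MIN_SUPPORT = 0.5
MIN_CONFIDENCE = 0.6

def is_interesting(antecedent, consequent):
    """A rule passes only if it meets both thresholds."""
    rule_support = support(set(antecedent) | set(consequent))
    return (
        rule_support >= MIN_SUPPORT
        and confidence(antecedent, consequent) >= MIN_CONFIDENCE
    )

print(is_interesting({"bread"}, {"milk"}))   # rule: bread -> milk
print(is_interesting({"milk"}, {"butter"}))  # rule: milk -> butter
```

Raising either threshold filters out more rules, which is how the user controls how many discovered patterns are reported.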

Page 39: Unit 3 part i Data mining

Classification of Data mining System

Page 40: Unit 3 part i Data mining

Classification of Data Mining Systems (cont.)

• Data to be mined: Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW

• Knowledge to be mined: Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; multiple/integrated functions and mining at multiple levels

• Techniques utilized: Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.

• Applications adapted: Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.