unit 3 part i data mining
Data Mining-PART I
By
M.Dhilsath Fathima
Topics to cover
• Introduction
• Types of Data
• Data Mining Functionalities
• Interestingness of Patterns
• Classification of Data Mining Systems
• Data Mining Task Primitives
• Integration of a Data Mining System with a Data Warehouse
• Issues
• Data Preprocessing
What is Database?
A database is any organized collection of data.
Examples: co-workers, patient information, airline reservation system
DATABASE
• Database: A shared collection of logically related data (and a description of this data), designed to meet the information needs of an organization.
• Database Management System: A software system that enables users to define, create, and maintain the database and that provides controlled access to this database.
Who does it, and how?
• A Database Management System (DBMS) does this job.
• Software tools: Access, FileMaker, Lotus Notes, Oracle, SQL Server, ...
• A DBMS includes tools to add, modify, or delete data in the database, to ask questions (queries) about the stored data, and to produce reports summarizing selected contents.
Why do we need a database?
• To keep records of our:
– Clients
– Staff
– Volunteers
• To keep a record of activities
• To keep sales records
• To develop reports
• To perform querying
Data vs. information
• What is data?
– Data is unprocessed information.
• What is information?
– Information is data that has been organized and communicated in a logical and meaningful manner.
Purpose of Database system/Stages of Database System
– Data is converted into information, and information is converted into knowledge.
– Knowledge: information evaluated and organized so that it can be used purposefully.
The purpose is to transform:
Data (unprocessed information) -> Information (processed data) -> Knowledge (information evaluated using measures) -> Action (data analysis and future prediction)
Data Mining works with Warehouse Data
• Data Warehousing provides the Enterprise with a memory.
• Data Mining provides the Enterprise with intelligence
What is Data Mining?
• Nowadays, huge data sets have become available due to advances in technology.
• As a result, there is increasing interest in various scientific communities in using emerging data mining techniques to analyze these large data sets.
• Data mining is the extraction of implicit, previously unknown, and potentially useful information, patterns, and associations from data.
• Data mining is the exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
WHO USES DATA MINING?
• Banking – future prediction
• Amazon.com (online stores) – recommendation
• Facebook – predicting how active a user will be after 3 months
13/03/16 Seval Ünver | CENG 553 15
Datamining is…
DATA MINING IS NOT…
• Data warehousing
• SQL / ad hoc queries / reporting
• Online Analytical Processing (OLAP)
• Data visualization
DATA MINING IS…
• Exploring data
• Finding patterns
• Performing prediction
KDD Process
• Knowledge discovery in databases (KDD) is a multistep process of finding useful information and patterns in data.
• Data mining is the use of algorithms to extract the information and patterns derived by the KDD process.
• Many texts treat KDD and data mining as the same process, but it is also possible to think of data mining as the discovery part of KDD.
Steps of KDD Process
1. Selection (data extraction) - Obtain data from heterogeneous data sources: databases, data warehouses, the World Wide Web, or other information repositories.
2. Preprocessing (data cleaning) - Incomplete, noisy, or inconsistent data is cleaned. Missing data may be ignored or predicted; erroneous data may be deleted or corrected.
3. Transformation (data integration) - Combine data from multiple sources into a coherent store. Data can be encoded in common formats, normalized, and reduced.
Steps of KDD Process
4. Data mining - Apply algorithms to the transformed data and extract patterns.
5. Pattern interpretation/evaluation - Evaluate the interestingness of the resulting patterns, or apply interestingness measures to filter the discovered patterns.
6. Knowledge presentation - Present the mined knowledge; visualization techniques can be used.
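The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the field names (`id`, `age`, `city`), the age bands, and the counting "algorithm" are all hypothetical and are not taken from the slides.

```python
from collections import Counter

# Hypothetical raw records held in two separate sources.
source_a = [{"id": 1, "age": 25, "city": "Chennai"},
            {"id": 2, "age": None, "city": "chennai "}]
source_b = [{"id": 3, "age": 41, "city": "Delhi"},
            {"id": 4, "age": -7, "city": "Delhi"}]

def select(*sources):
    """Selection: gather records from several sources into one list."""
    return [rec for src in sources for rec in src]

def preprocess(records):
    """Preprocessing: drop records whose age is missing or impossible."""
    return [r for r in records
            if r["age"] is not None and 0 <= r["age"] <= 120]

def transform(records):
    """Transformation: encode values in a common format."""
    return [{"id": r["id"],
             "city": r["city"].strip().lower(),
             "age_band": "under_30" if r["age"] < 30 else "30_plus"}
            for r in records]

def mine(records):
    """Data mining: a deliberately trivial step that counts (city, age_band) pairs."""
    return Counter((r["city"], r["age_band"]) for r in records)

def evaluate(patterns, min_count=1):
    """Pattern evaluation: keep only patterns that meet a count threshold."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def present(patterns):
    """Knowledge presentation: render the patterns as text lines."""
    return [f"{city:10s} {band:10s} {count}"
            for (city, band), count in sorted(patterns.items())]

patterns = evaluate(mine(transform(preprocess(select(source_a, source_b)))))
for line in present(patterns):
    print(line)
```

A real system would replace `mine` with an actual mining algorithm; the point here is only the order in which the stages feed each other.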
Types of Data /What kind of Data can be mined
• Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data.
• Relational databases
• Data warehouses
• Transaction databases
• Advanced database systems and information repositories
– Spatial databases
– Time-series data
– Multimedia databases
– WWW
Relational Databases
– A relational database consists of a set of tables containing either values of entity attributes or values of attributes from entity relationships.
– Tables have columns and rows, where columns represent attributes and rows represent tuples.
– A tuple in a relational table corresponds to either an object or a relationship between objects, and is identified by a set of attribute values representing a unique key.
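As a small illustration of these points, the sketch below builds two entity tables and one relationship table with Python's built-in `sqlite3` module. The table and column names (`customer`, `item`, `purchase`, `cust_id`, `item_id`) and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Entity tables: each row (tuple) describes one object, identified by its key.
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# Relationship table: each row links a customer to an item,
# and is identified by the pair of key values.
conn.execute("""
    CREATE TABLE purchase (
        cust_id INTEGER REFERENCES customer(cust_id),
        item_id INTEGER REFERENCES item(item_id),
        PRIMARY KEY (cust_id, item_id)
    )""")

conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Asha", 31), (2, "Ravi", 45)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?)",
                 [(10, "laptop", 900.0), (11, "mouse", 15.0)])
conn.executemany("INSERT INTO purchase VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 11)])

# Join the relationship tuples back to the objects they connect.
rows = conn.execute("""
    SELECT c.name, i.name
    FROM purchase p
    JOIN customer c ON c.cust_id = p.cust_id
    JOIN item i ON i.item_id = p.item_id
    ORDER BY c.name, i.name""").fetchall()
print(rows)
```

Columns are the attributes, rows are the tuples, and the primary keys are what make each tuple uniquely identifiable.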
Data Warehouse
• A data warehouse is a storehouse: a repository of data collected from multiple data sources (often heterogeneous) and intended to be used as a whole under the same unified schema. A data warehouse makes it possible to analyze data from different sources under the same roof.
Transaction Databases
• A transaction database is a set of records representing transactions, each with a time stamp, an identifier, and a set of items. Descriptive data for the items may also be associated with the transaction files.
• Transactions are usually stored in flat files or in two normalized transaction tables, one for the transactions and one for the transaction items.
• Applications: airline reservation, railway reservation, log records, etc.
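The two-table layout can be pictured with plain Python data. In this sketch the identifiers, time stamps, and items are made up for illustration, and `to_baskets` is a hypothetical helper that regroups the normalized rows into one item set per transaction, the form many mining algorithms consume.

```python
# Table 1: one row per transaction (identifier, time stamp).
transactions = [
    ("T100", "2024-01-05 10:05"),
    ("T200", "2024-01-05 10:20"),
]

# Table 2: one row per (transaction identifier, item).
transaction_items = [
    ("T100", "bread"), ("T100", "milk"),
    ("T200", "bread"), ("T200", "butter"), ("T200", "milk"),
]

def to_baskets(transactions, transaction_items):
    """Regroup the two normalized tables into one record per transaction."""
    baskets = {tid: {"time": ts, "items": set()} for tid, ts in transactions}
    for tid, item in transaction_items:
        baskets[tid]["items"].add(item)
    return baskets

baskets = to_baskets(transactions, transaction_items)
for tid, record in baskets.items():
    print(tid, record["time"], sorted(record["items"]))
```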
MULTIMEDIA DATABASE
• Multimedia databases include video, images, audio, sound clips, and text data. They can be stored on extended object-relational or object-oriented databases, or simply on a file system.
• Ex: Digital music players, social media, electronic publishing.
Spatial Databases
• A spatial database is a database that is enhanced to store and access spatial data that defines a geometric space.
• These data are often associated with geographic locations and features, or constructed features like cities. Data in spatial databases are stored as coordinates, points, lines, polygons, and topology.
• Ex: Storing geographical information such as maps and global or regional positioning data.
Time Series Database
• A time-series database is a database that contains data for each point in time.
• Examples: weather data, stock market data, logged browser activity, ocean tides.
Time Series Database-Example
World Wide Web
• The World Wide Web is the most heterogeneous and dynamic repository available.
• Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications.
Typical Architecture of Data Mining System
Integration of a Data Mining System with a Database/Data Warehouse System
The integration schemes are as follows −
• No coupling − In this scheme, the data mining system does not use any of the database or data warehouse functions. It fetches the data directly from a particular source and processes it using data mining algorithms. The data mining result is stored in another file. (Ex: data collected directly from a transactional database)
• Loose coupling / semi-tight coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems, or directly from particular sources, and performs data mining on that data. (Ex: data taken from a transactional DB plus a database/DWH)
• Tight coupling − In this scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.
Integrated architecture of a Data Mining with DWH/ AN OLAM SYSTEM ARCHITECTURE
Data Mining Task Primitives
• We can specify a data mining task in the form of a data mining query.
• This query is input to the system.
• A data mining query is defined in terms of data mining task primitives.
• Note − These primitives allow us to communicate interactively with the data mining system.
Here is the list of Data Mining Task Primitives −
1. Kind of knowledge to be mined.
2. Set of task relevant data to be mined.
3. Representation for visualizing the discovered patterns.
4. Background knowledge to be used in discovery process.
5. Interestingness measures and thresholds for pattern evaluation.
Data Mining Task Primitives-Example of Data mining query
use database AllElectronics_db
use state_ location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table
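To make the clauses of the query more concrete, here is a rough Python equivalent. It is a sketch under several assumptions: the rows are hypothetical and already joined, the `country` field stands in for `B.address`, and the noise threshold is read as "drop groups that account for less than 5% of the relevant rows", which is one interpretation of that clause rather than a definition from the slides.

```python
from collections import Counter

# Hypothetical rows, already joined across customer, item, purchase and branch.
rows = [
    {"age": 34, "type": "TV", "place_made": "Japan", "method_paid": "AmEx", "country": "Canada", "price": 450},
    {"age": 34, "type": "TV", "place_made": "Japan", "method_paid": "AmEx", "country": "Canada", "price": 500},
    {"age": 34, "type": "TV", "place_made": "Japan", "method_paid": "AmEx", "country": "Canada", "price": 300},
    {"age": 52, "type": "camera", "place_made": "Germany", "method_paid": "AmEx", "country": "Canada", "price": 250},
    {"age": 52, "type": "camera", "place_made": "Germany", "method_paid": "Visa", "country": "Canada", "price": 250},
    {"age": 34, "type": "cable", "place_made": "China", "method_paid": "AmEx", "country": "Canada", "price": 20},
    {"age": 41, "type": "TV", "place_made": "Japan", "method_paid": "AmEx", "country": "USA", "price": 700},
]

def characterize(rows, noise_threshold=0.05):
    """Return (age, type, place_made, count%) rows for the task-relevant data."""
    # "where" clause: the set of task-relevant data.
    relevant = [r for r in rows
                if r["method_paid"] == "AmEx"
                and r["country"] == "Canada"
                and r["price"] >= 100]
    # "in relevance to": the attributes used to describe the customers.
    counts = Counter((r["age"], r["type"], r["place_made"]) for r in relevant)
    total = sum(counts.values())
    table = []
    # "analyze count%" plus the noise threshold; largest groups first.
    for key, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        share = n / total
        if share >= noise_threshold:
            table.append((*key, round(100 * share, 1)))
    return table

# "display as table"
result = characterize(rows)
for row in result:
    print(row)
```

Each clause of the query corresponds to one of the task primitives: the `where` filter selects the task-relevant data, the grouping attributes describe the kind of knowledge, the threshold is the interestingness measure, and the printed rows are the chosen presentation.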
Data Mining Task Primitives-cont..
1. Kind of knowledge to be mined
– This refers to the kind of functions to be performed. These functions are −
• Characterization
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
2. Set of task relevant data to be mined
– This is the portion of the database in which the user is interested. This portion includes the following −
• Database attributes
• Data warehouse dimensions of interest
Data Mining Task Primitives-cont..
3. Representation for visualizing the discovered patterns
– This refers to the form in which discovered patterns are to be displayed. These representations may include the following −
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
Data Mining Task Primitives-cont..
4. Background knowledge
– Background knowledge allows data to be mined at multiple levels of abstraction. Concept hierarchies, for example, are one form of background knowledge that makes this possible.
5. Interestingness measures and thresholds for pattern evaluation
– These are used to evaluate the patterns discovered by the knowledge discovery process. Different kinds of knowledge have different interestingness measures.
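As one concrete case of an interestingness measure with thresholds, the sketch below uses support and confidence, the usual measures for association rules. The choice of these two measures, the baskets, and the threshold values are assumptions for illustration; the slides do not name specific measures.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, baskets) / base

def is_interesting(antecedent, consequent, baskets,
                   min_support=0.5, min_confidence=0.6):
    """Keep a rule only if it meets both user-specified thresholds."""
    return (support(antecedent | consequent, baskets) >= min_support
            and confidence(antecedent, consequent, baskets) >= min_confidence)

# Rule "bread => milk" meets both thresholds; "butter => milk" fails on support.
print(is_interesting({"bread"}, {"milk"}, baskets))
print(is_interesting({"butter"}, {"milk"}, baskets))
```

Raising or lowering `min_support` and `min_confidence` is how a user controls how many discovered patterns survive evaluation.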
Classification of Data mining System
Classification of Data mining System(Cont..)
Data to be mined
– Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
– Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.