datawarehousedataminingnotes

41

Upload: tommy96

Post on 19-Dec-2014

214 views

Category:

Documents


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: DatawarehouseDataMiningNotes
Page 2: DatawarehouseDataMiningNotes

Database and data warehousing◦ Chapter 4 pages 167-177

Data mining and OLAP◦ Chapter 8 pages 321-325

Page 3: DatawarehouseDataMiningNotes

In managing information, physical deals with the structure of information as it resides on various storage media.

Logical deals with how knowledge workers view their information needs, and includes such terms as:◦CHARACTER - our smallest unit of

information.◦FIELD - group of related characters.◦RECORD - group of related fields.◦FILE - group of related records.◦DATABASE - group of logically

associated files.◦DATA WAREHOUSE - information from

many databases.

Page 4: DatawarehouseDataMiningNotes

DATA DICTIONARY - contains the logical structure of information in a database.◦Definitions of all fields, records, and tables◦Relationships between tables◦Who is responsible for maintaining data in

the database◦Descriptions of who is authorized to access

different parts of the database Data dictionary contains meta data (data

about the data)

Page 5: DatawarehouseDataMiningNotes

Sample Data Dictionary Report

Page 6: DatawarehouseDataMiningNotes

Definition- a database that stores current and historical data designed to support business analysis activities and decision-making tasks of managers; typically a relational database model is used. The data warehouse uses special software (tools) to assist managers extract information.

Benefits improved access improved information isolation from operational systems tools permit advanced data analysis

Users and data marts

Page 7: DatawarehouseDataMiningNotes

Extraction phase – create files on the computer that will store the data warehouse and move transaction data to this machine; data may come from many sources or parts of the organization

Transformation phase – cleanse and standardize the data. Why is this necessary?

Load phase – transfer the data from the transformation phase into the data warehouse

The ETL process becomes automated to make regular transfers of transaction data into the data warehouse

Page 8: DatawarehouseDataMiningNotes

Operational Data Data is on many

systems Current operational

data Inconsistent data

definitions Functionally

organized data Data are constantly

changing Support OLTP

Warehouse Data Integrated in one

enterprise-wide system

Recent and historical data

Consistent data definitions

Data are organized around business entities

Data are stabilized Support OLAP

Page 9: DatawarehouseDataMiningNotes

• Data mining (knowledge discovery in databases): Extraction of interesting (non-trivial, previously

unknown and potentially useful) information or patterns from data in large databases

• Similar terms Exploratory data analysis Data driven discovery Deductive learning Knowledge extraction

Page 10: DatawarehouseDataMiningNotes

A computer-based data analysis process

• Utilizes historical organizational data, typically in a data warehouse

• Uses a variety of data analysis, modeling, and visualization techniques

• One use is to discover previously unknown patterns or potential relationships in the data: undirected

• Also used to make predictions, verify assumptions, or otherwise provide useful information: directed

• Allows businesses to make proactive, knowledge-driven decisions

Page 11: DatawarehouseDataMiningNotes

• Prediction Use some variables to predict unknown or

future values of other variables• Description

Find human-interpretable patterns that describe the data

Page 12: DatawarehouseDataMiningNotes
Page 13: DatawarehouseDataMiningNotes
Page 14: DatawarehouseDataMiningNotes

• Given a collection of data Each record contains a set of attributes, one of the

attributes is the class variable Usually, the given data set is divided into training and

test sets, with training set used to build the model and test set used to validate it

• Find a model for the class attribute as a function of the values of other attributes based on the training set Previously unseen data (test set) are used to

determine the accuracy of the model

Page 15: DatawarehouseDataMiningNotes

Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product

Approach:• Use the data for a similar product introduced before• We know which customers decided to buy and which decided

otherwise; this {buy, don’t buy} decision is the class variable• Collect various demographic, lifestyle, and company-

interaction related information about all such customers• Use this information as input attributes to create a classifier

model

Page 16: DatawarehouseDataMiningNotes

Goal: Predict fraudulent cases in credit card transactions

Approach:• Use past credit card transactions and the information on its

account-holder as attributes When does a customer buy, what does he buy,

how often he pays on time, etc• Determine whether past transactions were fraud or fair

transactions; this is the class variable• Create a model for the class of the transactions

Page 17: DatawarehouseDataMiningNotes

Goal: To predict whether a customer is likely to be lost to a competitor

Approach:• Use detailed record of transactions with each of the past

and present customers, to find attributes How often the customer calls, where he calls,

what time-of-the day he calls most, his financial status, marital status, etc

• Label the customers as still with the company or left the company (churned)

• Create a model for churn

Page 18: DatawarehouseDataMiningNotes

Clustering concerns segmenting a diverse population into several homogeneous subgroups or clusters. Clustering differs from classification in that there are no predefined classifications.

• Data points in one cluster are more similar to one another

• Data points in separate clusters are less similar to one another

Similarity Measures:• Euclidean distance if attributes are continuous• Other problem-specific measures

Page 19: DatawarehouseDataMiningNotes

Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mixApproach:

• Collect different attributes of customers based on their geographical and lifestyle related information

• Find clusters of similar customers• Measure the clustering quality by observing buying

patterns of customers in same cluster vs. those from different clusters

Page 20: DatawarehouseDataMiningNotes

Goal: To find groups of documents that are similar to each other based on the important terms appearing in them

Approach: To identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster

Gain: Retrieval can utilize the clusters to relate a new document or search term to clustered documents

Page 21: DatawarehouseDataMiningNotes

Given a set of records, each of which contains some number of items from a specific collection, produce dependency rules which will predict occurrence of an item based on occurrences of other items:

TID Items

1 Bread, Coke, Milk

2 Beer, Bread

3 Beer, Coke, Diaper, Milk

4 Beer, Bread, Diaper, Milk

5 Coke, Diaper, Milk

Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}

Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}

Page 22: DatawarehouseDataMiningNotes

Goal: To identify items that are bought together by sufficiently many customers

Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items

A classic case: If a customer buys diapers on Friday evening, then he is very likely to buy beer

Page 23: DatawarehouseDataMiningNotes

So, don’t be surprised if you find six-packs stacked next to diapers!

Page 24: DatawarehouseDataMiningNotes

• Privacy • • Profiling

• Unauthorized Use

• Big Brother

Page 25: DatawarehouseDataMiningNotes

Customer loyalty cards have multiple uses, but one use is to collect data for the data warehouse

Examples◦Grocery stores◦Web sites◦Harrah’s◦Store related credit cards

Assurance of a steady flow of data

Page 26: DatawarehouseDataMiningNotes

Multidimensional data analysis (or OLAP) enables users to view data using various dimensions, measures and time frames (i. e., OLAP)◦dimensions: products, business units,

country, industry (e.g., categories)◦measures: money, unit sales, head

count, variances◦ time: daily, weekly, monthly, quarterly,

yearly) This type of analysis also provides the

ability to view data in different ways (tables, charts, 3-D, geographically)

OLAP tools provide for this Pivot tables in Excel or Access

Page 27: DatawarehouseDataMiningNotes

Pro

pert

y

City

Time

Three dimensional revenue model

Page 28: DatawarehouseDataMiningNotes

14670

15056

15888

14555

16004

14578

15500

15890Fla

tH

ouse

Glasgow

London

Aberdeen

Q1 Q2 Q3 Q4

Page 29: DatawarehouseDataMiningNotes

Property Type City Time

Total Revenue

Flat Glasgow Q1 15056

House Glasgow Q1 14670

Flat Glasgow Q2 14555

House Glasgow Q2 15888

Flat Glasgow Q3 14578

House Glasgow Q3 16004

Flat Glasgow Q4 15890

House Glasgow Q4 15500

Flat London Q1 19678

House London Q1 23877

Flat London Q2 19567

House London Q2 28677

Page 30: DatawarehouseDataMiningNotes

• A common operation is to aggregate a measure over one or more dimensionsFind the total revenueFind the total revenue for each cityFind the top property-type for the 3rd

quarter based on total revenue across all cities

Page 31: DatawarehouseDataMiningNotes

• Roll-up: Aggregating data across different dimension levelsExample: given revenue by city, we can roll-up to get

total revenue by state

• Drill-down: The inverse of roll-up: disaggregating dataExample: given total revenue by state, we can drill-

down to get revenue by city

Page 32: DatawarehouseDataMiningNotes

Glasgow London Aberdeen

Q1 29726 43555 53210

Q2 30443 48244 34567

Q3 30582 56222 45677

Q4 31390 45632 50056Glasgow

London Aberdeen

Jan 9035 21005 5216

Feb 10788 14799 14944

Mar 9903 7751 33050

Apr 11273 10573 21884

May

9005 16896 8573

Jun 10265 20775 4110

Rollup

Drill Down

Page 33: DatawarehouseDataMiningNotes

• Slicing & Dicing: Selecting data within dimension categoriesExample: given revenue by city for the entire year, we

can extract the revenue for a given city for a given quarter

• Rotating: Reorienting the presentation of the data cubeExample: given a cube that presents revenue by city &

property type for each quarter, we can change the presentation to present the revenue by property type & quarter for each city

Page 34: DatawarehouseDataMiningNotes

Example: OLAP Usage at an Automobile Dealership

The StoryAn automobile dealership manager wants to improve business activity. Therefore she wants to view sales figures from different perspectives.

A QuestionWhat is the sales volume for a specific model and colors, for a specific salesperson ?

The Data NeedsSales by model

Sales by salesperson

Sales by color

Page 35: DatawarehouseDataMiningNotes

Example: The Multi-dimensional Data Model Used

Sales Volume

Blue Red White

Van

Coupe

Sedan

Miller

Clyde

Smith

COLOR

SALESPERSON

MOD E L

Page 36: DatawarehouseDataMiningNotes

Example: OLAP “Slicing & Dicing“ – Selecting Categories

Sales Volumes

Blue

Red White

Van

Coupe

Sedan

Miller

Clyde

Smith

COLOR

SALESPERSON

MOD E L

• Color: Blue and White

Clyde

Blue

White

Coupe

“Sliced & Diced“ Data

Choose a range out of each dimension: • Model: Coupe only

• Salesperson: Clyde only

Page 37: DatawarehouseDataMiningNotes

Example: OLAP “Rotation“ – Changing the Presentation of the Cube

Sales Volume

Blue Red White

Van

Coupe

Sedan

COLOR

MOD E L

View of the Account Manager

Rotate the data cube by 90°

SALESPERSON

Sales Volume

Miller

Smith Clyde

Van

Coupe

Sedan

MOD E L

View of the Product Manager

Different Users will require different views of the multidimensional cube

Page 38: DatawarehouseDataMiningNotes

Example: OLAP Drill-Down and Roll-UpData can be disaggregated and aggregated along a dimension according to the natural hierarchy

Drill-Down

Roll-Up

State

Region

Salesperson Miller Smith

Clyde Lucas Gleason

Atlanta Athens

Georgia

Sales Volume by Organization Dimension- three level hierarchy -

Page 39: DatawarehouseDataMiningNotes

Primarily used to exploit data warehouses Provides extremely fast response View combinations of two dimensions Enable drilling down (start with broad info

and get more specific) Produces results as counts or percentages Conversion of tables to charts/graphs Usually requires a tailored-made relational

database OLAP applications are widely used by mid-

level and upper level managers A form of business intelligence software

Page 40: DatawarehouseDataMiningNotes

Go to www.fedscope.opm.gov ◦Under data cubes on entry page click on

employment◦Demonstrate drill down and adding charts◦Data for this example comes from the

Central Personnel Data File (CPDF) of the federal government

◦The OLAP tool used to build this site is from a company named Cognos (PowerPlay)

OLAP tools based on Excel◦ http://www.cubularity.com

Page 41: DatawarehouseDataMiningNotes

Linkage between elements◦ spreadsheet - between cells in same table◦ DBMS - between elements in different tables

Orientation◦ spreadsheet is toward calculations◦ DBMS is tilted toward organization and

linkage of data elements in different tables Capabilities

◦ DBMS has extensive querying and reporting power

◦ spreadsheet is limited Memory requirements

◦ entire spreadsheet table must be in memory◦ not true for the database table