1 data warehousing & data mining presented by siva krishna nilesh kumar

31
1 Data warehousing Data warehousing & & Data mining Data mining Presented by Siva Krishna

Upload: samantha-daniel

Post on 31-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

11

Data warehousing Data warehousing & &

Data miningData mining

Presented by

Siva Krishna

Nilesh Kumar

22

Evolution of Data Warehousing

60’s: Batch reports hard to find and analyze information inflexible and expensive, reprogram every new request

70’s: Terminal-based DSS and EIS (executive information systems still inflexible, not integrated with desktop tool

80’s: Desktop data access and analysis tools query tools, spreadsheets, GUI’s easier to use, but only access operational databases

90’s: Data warehousing with integrated OLAP engines and tools

33

What is a Data Warehouse?What is a Data Warehouse?

A single, complete and A single, complete and consistent store of data consistent store of data obtained from a variety of obtained from a variety of different sources made different sources made available to end users in a available to end users in a what they can understand and what they can understand and use in a business context.use in a business context.

44

Data Warehouse ArchitectureData Warehouse Architecture

Data Warehouse Engine

Optimized Loader

ExtractionCleansing

AnalyzeQuery

Metadata Repository

RelationalDatabases

LegacyData

Purchased Data

ERPSystems

55

Components of the WarehouseComponents of the Warehouse

Data Extraction and LoadingData Extraction and Loading The Warehouse The Warehouse Analyze and Query -- OLAP ToolsAnalyze and Query -- OLAP Tools Metadata Metadata Data Mining toolsData Mining tools

66

Loading the WarehouseLoading the Warehouse

Cleaning the data before Cleaning the data before it is loadedit is loaded

77

Data Integrity ProblemsData Integrity Problems Same person, different spellingsSame person, different spellings

Agarwal, Agrawal, Aggarwal etc...Agarwal, Agrawal, Aggarwal etc... Multiple ways to denote company nameMultiple ways to denote company name

Persistent Systems, PSPL, Persistent Pvt. LTD.Persistent Systems, PSPL, Persistent Pvt. LTD. Use of different namesUse of different names

mumbai, bombaymumbai, bombay Different account numbers generated by different Different account numbers generated by different

applications for the same customerapplications for the same customer

88

Data Transformation TermsData Transformation Terms

ExtractingExtracting ConditioningConditioning HouseholdingHouseholding EnrichmentEnrichment ScoringScoring

99

ExtractingExtracting Capture of data from operational source in “as is” Capture of data from operational source in “as is”

statusstatus Sources for data generally in legacy mainframes in Sources for data generally in legacy mainframes in

VSAM, IMS, IDMS, DB2; more data today in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unixrelational databases on Unix

ConditioningConditioning The conversion of data types from the source to the The conversion of data types from the source to the

target data store (warehouse) -- always a relational target data store (warehouse) -- always a relational databasedatabase

1010

HouseholdingHouseholding Identifying all members of a household (living at Identifying all members of a household (living at

the same address)the same address) Ensures only one mail is sent to a householdEnsures only one mail is sent to a household Can result in substantial savings: 1 lakh catalogues Can result in substantial savings: 1 lakh catalogues

at Rs. 50 each costs Rs. 50 lakhs. A 2% savings at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh.would save Rs. 1 lakh.

EnrichmentEnrichment Bring data from external sources to Bring data from external sources to

augment/enrich operational data. Data augment/enrich operational data. Data sources include Dunn and Bradstreet, A. sources include Dunn and Bradstreet, A. C. Nielsen, CMIE, IMRA etc...C. Nielsen, CMIE, IMRA etc...

1111

Scoring computation of a probability of an event. e.g...., chance that a customer will defect to AT&T from MCI, chance that a customer is likely to buy a new product

1212

Data Warehouse StructureData Warehouse Structure

Subject Orientation -- customer, product, Subject Orientation -- customer, product, policy, account etc... A subject may be policy, account etc... A subject may be implemented as a set of related tables. E.g., implemented as a set of related tables. E.g., customer may be five tablescustomer may be five tables

1313

Data Warehouse StructureData Warehouse Structure base customer (1985-87)base customer (1985-87)

custid, from date, to date, name, phone, dobcustid, from date, to date, name, phone, dob base customer (1988-90)base customer (1988-90)

custid, from date, to date, name, credit rating, employercustid, from date, to date, name, credit rating, employer customer activity (1986-89) -- monthly summarycustomer activity (1986-89) -- monthly summary customer activity detail (1987-89)customer activity detail (1987-89)

custid, activity date, amount, clerk id, order nocustid, activity date, amount, clerk id, order no customer activity detail (1990-91)customer activity detail (1990-91)

custid, activity date, amount, line item no, order nocustid, activity date, amount, line item no, order no

1414

Schema DesignSchema Design Database organizationDatabase organization

must look like businessmust look like business must be recognizable by business usermust be recognizable by business user approachable by business userapproachable by business user Must be Must be simplesimple

Schema TypesSchema Types Star SchemaStar Schema Fact Constellation SchemaFact Constellation Schema Snowflake schemaSnowflake schema

1515

Dimension TablesDimension Tables

Dimension tablesDimension tables Define business in terms already familiar to usersDefine business in terms already familiar to users Wide rows with lots of descriptive textWide rows with lots of descriptive text Small tables (about a million rows) Small tables (about a million rows) Joined to fact table by a foreign keyJoined to fact table by a foreign key heavily indexedheavily indexed typical dimensionstypical dimensions

time periods, geographic region (markets, cities), time periods, geographic region (markets, cities), products, customers, salesperson, etc.products, customers, salesperson, etc.

1616

Fact TableFact Table Central tableCentral table

mostly raw numeric itemsmostly raw numeric items narrow rows, a few columns at mostnarrow rows, a few columns at most large number of rows (millions to a billion)large number of rows (millions to a billion) Access via dimensionsAccess via dimensions

1717

Star SchemaStar Schema A single fact table and for each dimension A single fact table and for each dimension

one dimension tableone dimension table Does not capture hierarchies directlyDoes not capture hierarchies directly

T ime

prod

cust

city

fact

date, custno, prodno, cityname, ...

1818

Snowflake schemaSnowflake schema Represent dimensional hierarchy directly by Represent dimensional hierarchy directly by

normalizing tables. normalizing tables. Easy to maintain and saves storageEasy to maintain and saves storage

T ime

prod

cust

city

fact

date, custno, prodno, cityname, ...

region

1919

Fact ConstellationFact Constellation Fact ConstellationFact Constellation

Multiple fact tables that share many dimension Multiple fact tables that share many dimension tablestables

Booking and Checkout may share many Booking and Checkout may share many dimension tables in the hotel industrydimension tables in the hotel industry

Hotels

Travel Agents

Promotion

Room Type

Customer

Booking

Checkout

2020

On-Line Analytical On-Line Analytical Processing (OLAP)Processing (OLAP)

Making Decision Making Decision Support PossibleSupport Possible

2121

What Is OLAP?What Is OLAP? Online Analytical Processing - coined by Online Analytical Processing - coined by

EF Codd in 1994 paper contracted by EF Codd in 1994 paper contracted by Arbor Software*Arbor Software*

Generally synonymous with earlier terms such as Generally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Decisions Support, Business Intelligence, Executive Information SystemInformation System

OLAP = Multidimensional DatabaseOLAP = Multidimensional Database MOLAP: Multidimensional OLAP (Arbor Essbase, MOLAP: Multidimensional OLAP (Arbor Essbase,

Oracle Express)Oracle Express) ROLAP: Relational OLAP (Informix MetaCube, ROLAP: Relational OLAP (Informix MetaCube,

Microstrategy DSS Agent)Microstrategy DSS Agent)

2222

Data Warehouse for Decision Data Warehouse for Decision Support & OLAPSupport & OLAP

Putting Information technology to help the Putting Information technology to help the knowledge worker make faster and better knowledge worker make faster and better decisionsdecisions Which of my customers are most likely to go to Which of my customers are most likely to go to

the competition?the competition? What product promotions have the biggest impact What product promotions have the biggest impact

on revenue?on revenue? How did the share price of software companies How did the share price of software companies

correlate with profits over last 10 years?correlate with profits over last 10 years?

2323

Decision SupportDecision Support

Used to manage and control businessUsed to manage and control business Data is historical or point-in-timeData is historical or point-in-time Optimized for inquiry rather than updateOptimized for inquiry rather than update Use of the system is loosely defined and Use of the system is loosely defined and

can be ad-hoccan be ad-hoc Used by managers and end-users to Used by managers and end-users to

understand the business and make understand the business and make judgementsjudgements

2424

Data MiningData Mining

The non trivial extraction of implicit, The non trivial extraction of implicit, previously unknown, and potentially useful previously unknown, and potentially useful information from data information from data

In simple wordsIn simple words Searching for new knowledgeSearching for new knowledge

2525

Data Mining GoalsData Mining Goals

PredictionPrediction IdentificationIdentification ClassificationClassification OptimizationOptimization

2626

Data mining techniquesData mining techniques clustering clustering data summarization data summarization learning classification rules learning classification rules finding dependency net works finding dependency net works analyzing changesanalyzing changes detecting anomalies detecting anomalies

2727

Data Mining in UseData Mining in Use The US Government uses Data Mining to track The US Government uses Data Mining to track

fraudfraud A Supermarket becomes an information brokerA Supermarket becomes an information broker Basketball teams use it to track game strategyBasketball teams use it to track game strategy Cross SellingCross Selling Warranty Claims RoutingWarranty Claims Routing Holding on to Good CustomersHolding on to Good Customers Weeding out Bad CustomersWeeding out Bad Customers

2828

Data Warehouse PitfallsData Warehouse Pitfalls You are going to spend much time extracting, cleaning, You are going to spend much time extracting, cleaning,

and loading dataand loading data Despite best efforts at project management, data Despite best efforts at project management, data

warehousing project scope will increasewarehousing project scope will increase You are going to find problems with systems feeding You are going to find problems with systems feeding

the data warehousethe data warehouse You will find the need to store data not being captured You will find the need to store data not being captured

by any existing systemby any existing system You will need to validate data not being validated by You will need to validate data not being validated by

transaction processing systemstransaction processing systems

2929

Data Warehouse PitfallsData Warehouse Pitfalls Some transaction processing systems feeding the Some transaction processing systems feeding the

warehousing system will not contain detailwarehousing system will not contain detail Many warehouse end users will be trained and never or Many warehouse end users will be trained and never or

seldom apply their trainingseldom apply their training After end users receive query and report tools, requests After end users receive query and report tools, requests

for IS written reports may increasefor IS written reports may increase Your warehouse users will develop conflicting business Your warehouse users will develop conflicting business

rulesrules Large scale data warehousing can become an exercise in Large scale data warehousing can become an exercise in

data homogenizingdata homogenizing

3030

ApplicationsApplications

Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.

Finance - stock market prediction, credit assessment, fraud detection etc.Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behavior' etc.

Knowledge Acquisition

Scientific Discovery - superconductivity research, etc.

Engineering - automotive diagnostic expert systems, fault detection etc.

3131

Thanking you

Nilesh and Siva Krishna…