1 data warehousing & data mining presented by siva krishna nilesh kumar
TRANSCRIPT
11
Data warehousing Data warehousing & &
Data miningData mining
Presented by
Siva Krishna
Nilesh Kumar
22
Evolution of Data Warehousing
60’s: Batch reports hard to find and analyze information inflexible and expensive, reprogram every new request
70’s: Terminal-based DSS and EIS (executive information systems still inflexible, not integrated with desktop tool
80’s: Desktop data access and analysis tools query tools, spreadsheets, GUI’s easier to use, but only access operational databases
90’s: Data warehousing with integrated OLAP engines and tools
33
What is a Data Warehouse?What is a Data Warehouse?
A single, complete and A single, complete and consistent store of data consistent store of data obtained from a variety of obtained from a variety of different sources made different sources made available to end users in a available to end users in a what they can understand and what they can understand and use in a business context.use in a business context.
44
Data Warehouse ArchitectureData Warehouse Architecture
Data Warehouse Engine
Optimized Loader
ExtractionCleansing
AnalyzeQuery
Metadata Repository
RelationalDatabases
LegacyData
Purchased Data
ERPSystems
55
Components of the WarehouseComponents of the Warehouse
Data Extraction and LoadingData Extraction and Loading The Warehouse The Warehouse Analyze and Query -- OLAP ToolsAnalyze and Query -- OLAP Tools Metadata Metadata Data Mining toolsData Mining tools
66
Loading the WarehouseLoading the Warehouse
Cleaning the data before Cleaning the data before it is loadedit is loaded
77
Data Integrity ProblemsData Integrity Problems Same person, different spellingsSame person, different spellings
Agarwal, Agrawal, Aggarwal etc...Agarwal, Agrawal, Aggarwal etc... Multiple ways to denote company nameMultiple ways to denote company name
Persistent Systems, PSPL, Persistent Pvt. LTD.Persistent Systems, PSPL, Persistent Pvt. LTD. Use of different namesUse of different names
mumbai, bombaymumbai, bombay Different account numbers generated by different Different account numbers generated by different
applications for the same customerapplications for the same customer
88
Data Transformation TermsData Transformation Terms
ExtractingExtracting ConditioningConditioning HouseholdingHouseholding EnrichmentEnrichment ScoringScoring
99
ExtractingExtracting Capture of data from operational source in “as is” Capture of data from operational source in “as is”
statusstatus Sources for data generally in legacy mainframes in Sources for data generally in legacy mainframes in
VSAM, IMS, IDMS, DB2; more data today in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unixrelational databases on Unix
ConditioningConditioning The conversion of data types from the source to the The conversion of data types from the source to the
target data store (warehouse) -- always a relational target data store (warehouse) -- always a relational databasedatabase
1010
HouseholdingHouseholding Identifying all members of a household (living at Identifying all members of a household (living at
the same address)the same address) Ensures only one mail is sent to a householdEnsures only one mail is sent to a household Can result in substantial savings: 1 lakh catalogues Can result in substantial savings: 1 lakh catalogues
at Rs. 50 each costs Rs. 50 lakhs. A 2% savings at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh.would save Rs. 1 lakh.
EnrichmentEnrichment Bring data from external sources to Bring data from external sources to
augment/enrich operational data. Data augment/enrich operational data. Data sources include Dunn and Bradstreet, A. sources include Dunn and Bradstreet, A. C. Nielsen, CMIE, IMRA etc...C. Nielsen, CMIE, IMRA etc...
1111
Scoring computation of a probability of an event. e.g...., chance that a customer will defect to AT&T from MCI, chance that a customer is likely to buy a new product
1212
Data Warehouse StructureData Warehouse Structure
Subject Orientation -- customer, product, Subject Orientation -- customer, product, policy, account etc... A subject may be policy, account etc... A subject may be implemented as a set of related tables. E.g., implemented as a set of related tables. E.g., customer may be five tablescustomer may be five tables
1313
Data Warehouse StructureData Warehouse Structure base customer (1985-87)base customer (1985-87)
custid, from date, to date, name, phone, dobcustid, from date, to date, name, phone, dob base customer (1988-90)base customer (1988-90)
custid, from date, to date, name, credit rating, employercustid, from date, to date, name, credit rating, employer customer activity (1986-89) -- monthly summarycustomer activity (1986-89) -- monthly summary customer activity detail (1987-89)customer activity detail (1987-89)
custid, activity date, amount, clerk id, order nocustid, activity date, amount, clerk id, order no customer activity detail (1990-91)customer activity detail (1990-91)
custid, activity date, amount, line item no, order nocustid, activity date, amount, line item no, order no
1414
Schema DesignSchema Design Database organizationDatabase organization
must look like businessmust look like business must be recognizable by business usermust be recognizable by business user approachable by business userapproachable by business user Must be Must be simplesimple
Schema TypesSchema Types Star SchemaStar Schema Fact Constellation SchemaFact Constellation Schema Snowflake schemaSnowflake schema
1515
Dimension TablesDimension Tables
Dimension tablesDimension tables Define business in terms already familiar to usersDefine business in terms already familiar to users Wide rows with lots of descriptive textWide rows with lots of descriptive text Small tables (about a million rows) Small tables (about a million rows) Joined to fact table by a foreign keyJoined to fact table by a foreign key heavily indexedheavily indexed typical dimensionstypical dimensions
time periods, geographic region (markets, cities), time periods, geographic region (markets, cities), products, customers, salesperson, etc.products, customers, salesperson, etc.
1616
Fact TableFact Table Central tableCentral table
mostly raw numeric itemsmostly raw numeric items narrow rows, a few columns at mostnarrow rows, a few columns at most large number of rows (millions to a billion)large number of rows (millions to a billion) Access via dimensionsAccess via dimensions
1717
Star SchemaStar Schema A single fact table and for each dimension A single fact table and for each dimension
one dimension tableone dimension table Does not capture hierarchies directlyDoes not capture hierarchies directly
T ime
prod
cust
city
fact
date, custno, prodno, cityname, ...
1818
Snowflake schemaSnowflake schema Represent dimensional hierarchy directly by Represent dimensional hierarchy directly by
normalizing tables. normalizing tables. Easy to maintain and saves storageEasy to maintain and saves storage
T ime
prod
cust
city
fact
date, custno, prodno, cityname, ...
region
1919
Fact ConstellationFact Constellation Fact ConstellationFact Constellation
Multiple fact tables that share many dimension Multiple fact tables that share many dimension tablestables
Booking and Checkout may share many Booking and Checkout may share many dimension tables in the hotel industrydimension tables in the hotel industry
Hotels
Travel Agents
Promotion
Room Type
Customer
Booking
Checkout
2020
On-Line Analytical On-Line Analytical Processing (OLAP)Processing (OLAP)
Making Decision Making Decision Support PossibleSupport Possible
2121
What Is OLAP?What Is OLAP? Online Analytical Processing - coined by Online Analytical Processing - coined by
EF Codd in 1994 paper contracted by EF Codd in 1994 paper contracted by Arbor Software*Arbor Software*
Generally synonymous with earlier terms such as Generally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Decisions Support, Business Intelligence, Executive Information SystemInformation System
OLAP = Multidimensional DatabaseOLAP = Multidimensional Database MOLAP: Multidimensional OLAP (Arbor Essbase, MOLAP: Multidimensional OLAP (Arbor Essbase,
Oracle Express)Oracle Express) ROLAP: Relational OLAP (Informix MetaCube, ROLAP: Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)Microstrategy DSS Agent)
2222
Data Warehouse for Decision Data Warehouse for Decision Support & OLAPSupport & OLAP
Putting Information technology to help the Putting Information technology to help the knowledge worker make faster and better knowledge worker make faster and better decisionsdecisions Which of my customers are most likely to go to Which of my customers are most likely to go to
the competition?the competition? What product promotions have the biggest impact What product promotions have the biggest impact
on revenue?on revenue? How did the share price of software companies How did the share price of software companies
correlate with profits over last 10 years?correlate with profits over last 10 years?
2323
Decision SupportDecision Support
Used to manage and control businessUsed to manage and control business Data is historical or point-in-timeData is historical or point-in-time Optimized for inquiry rather than updateOptimized for inquiry rather than update Use of the system is loosely defined and Use of the system is loosely defined and
can be ad-hoccan be ad-hoc Used by managers and end-users to Used by managers and end-users to
understand the business and make understand the business and make judgementsjudgements
2424
Data MiningData Mining
The non trivial extraction of implicit, The non trivial extraction of implicit, previously unknown, and potentially useful previously unknown, and potentially useful information from data information from data
In simple wordsIn simple words Searching for new knowledgeSearching for new knowledge
2525
Data Mining GoalsData Mining Goals
PredictionPrediction IdentificationIdentification ClassificationClassification OptimizationOptimization
2626
Data mining techniquesData mining techniques clustering clustering data summarization data summarization learning classification rules learning classification rules finding dependency net works finding dependency net works analyzing changesanalyzing changes detecting anomalies detecting anomalies
2727
Data Mining in UseData Mining in Use The US Government uses Data Mining to track The US Government uses Data Mining to track
fraudfraud A Supermarket becomes an information brokerA Supermarket becomes an information broker Basketball teams use it to track game strategyBasketball teams use it to track game strategy Cross SellingCross Selling Warranty Claims RoutingWarranty Claims Routing Holding on to Good CustomersHolding on to Good Customers Weeding out Bad CustomersWeeding out Bad Customers
2828
Data Warehouse PitfallsData Warehouse Pitfalls You are going to spend much time extracting, cleaning, You are going to spend much time extracting, cleaning,
and loading dataand loading data Despite best efforts at project management, data Despite best efforts at project management, data
warehousing project scope will increasewarehousing project scope will increase You are going to find problems with systems feeding You are going to find problems with systems feeding
the data warehousethe data warehouse You will find the need to store data not being captured You will find the need to store data not being captured
by any existing systemby any existing system You will need to validate data not being validated by You will need to validate data not being validated by
transaction processing systemstransaction processing systems
2929
Data Warehouse PitfallsData Warehouse Pitfalls Some transaction processing systems feeding the Some transaction processing systems feeding the
warehousing system will not contain detailwarehousing system will not contain detail Many warehouse end users will be trained and never or Many warehouse end users will be trained and never or
seldom apply their trainingseldom apply their training After end users receive query and report tools, requests After end users receive query and report tools, requests
for IS written reports may increasefor IS written reports may increase Your warehouse users will develop conflicting business Your warehouse users will develop conflicting business
rulesrules Large scale data warehousing can become an exercise in Large scale data warehousing can become an exercise in
data homogenizingdata homogenizing
3030
ApplicationsApplications
Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.
Finance - stock market prediction, credit assessment, fraud detection etc.Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behavior' etc.
Knowledge Acquisition
Scientific Discovery - superconductivity research, etc.
Engineering - automotive diagnostic expert systems, fault detection etc.