professional issues in computing · pdf filetelenor, ufone, mobilink, warid, zong ??? ... data...
TRANSCRIPT
Data Warehousing(Introduction, Types & Applications)
Naveed Iqbal, Assistant Professor
NUCES, Islamabad Campus(Lecture Slides Week # 2)
NUCES, Islamabad Campus Data Warehousing - Fall 2012 2
Why a Data Warehouse (DWH)?
Data recording and storage is growing: Almost every industry has huge amount of operational data.
Careful use/analysis of historic information may result inexcellent prediction for the future: Knowledge worker wants to turn available data into useful
information.
This information is used by them to support strategic decisionmaking.
Gives total view of the organization: It is a platform for consolidated historical data for analysis.
It stores data of good quality so that knowledge worker canmake correct decisions.
Intelligent decision-support is required for decision-making.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 3
Why a Data Warehouse? (Contd.)
From business perspective:
It is latest marketing weapon.
Helps to keep customers by learning more about
their needs.
Valuable tool in today‟s competitive fast evolving
world.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 4
Reason-I: Why a Data Warehouse (DWH)?
Data sets are growing:
How Much Data is that?1 MB 220 or 106 bytes Small novel 3½ Disk.
1 GB 230 or 109 bytesPaper reams that could fill the back of a
pickup van.
1 TB 240 or 1012 bytes50,000 trees chopped and converted into
paper and printed.
2 PB 1 PB = 250 or 1015 bytes Academic research libraries across USA.
5 EB 1 EB = 260 or 1018 bytesAll words ever spoken by the Human
Beings.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 5
Reason-I: Why a Data Warehouse (DWH)?
Size of Data Sets are going up.
Cost of Data Storage is coming down. The amount of data average business collects and stores is
doubling every year.
Total hardware and software cost to store and manage 1 MB ofdata: 1990: $ 15
2002: ¢ 15 (down 100 times)
2010: < ¢ 1 (down 150 times)
A few examples: Wall Mart: 24+ TB
Finance Telecom: 100+ TB
CERN: Upto 20 PB by 2006
Stanford Linear Accelerator Center (SLAC): 500 TB
Telenor, Ufone, Mobilink, Warid, Zong ???
NUCES, Islamabad Campus Data Warehousing - Fall 2012 6
Caution!
A Warehouse of Data
is NOT a
Data Warehouse.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 7
Caution!
Size
is NOT
Everything.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 8
Reason-2: Why a Data Warehouse (DWH)?
DBMS Approach
List of all items that were soldlast month?
List of all makeup itemspurchased by Sassi?
The total sales of the last monthgrouped by branch?
How many sales transactionsoccurred during the month ofJanuary?
Intelligent Enterprise
Which items sell together?Which items to stock?
Where and how to place theitems? What discounts to offer?
How best to target customers toincrease sales at a branch?
Which customers are most likelyto respond to my nextpromotional campaign, and why?
Businesses demand Intelligence (BI).
Complex questions from integrated data.
“Intelligent Enterprise”
NUCES, Islamabad Campus Data Warehousing - Fall 2012 9
Reason-3: Why a Data Warehouse (DWH)?
Businesses want much more …
What happened?
Why it happened?
What will happen?
What is happening?
What do you want to happen?
NUCES, Islamabad Campus Data Warehousing - Fall 2012 10
What is a Data Warehouse?
A complete repository of historical
corporate data extracted from
transaction systems that is
available for ad-hoc access by
knowledge workers.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 11
What is a Data Warehouse?
Transaction System: Management Information System (MIS)
Could be typed sheets (NOT transaction system)
Ad-Hoc Access: Does not have a certain access pattern
Queries not known in advance
Difficult to write SQL in advance
Knowledge Workers: Typically NOT IT literate (Executives, Analysts, Managers)
NOT clerical workers
Decision makers
NUCES, Islamabad Campus Data Warehousing - Fall 2012 12
What is a Data Warehouse?
Inmons‟s Definition:
A Data Warehouse is:
Subject-oriented
Integrated
Time-variant
Nonvolatile
Collection of data in support of
management‟s decision making process.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 13
Another View of a DWH
Subject
Oriented
Integrated
Time Variant
Non Volatile
NUCES, Islamabad Campus Data Warehousing - Fall 2012 14
Subject-oriented
Data Warehouse is organized around subjects such as sales,
product, customer.
It focuses on modeling and analysis of data for decision makers.
Excludes data not useful in decision support process.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 15
Integration
Data Warehouse is constructed by integrating multiple
heterogeneous sources.
Data Preprocessing are applied to ensure consistency.
RDBMS
Legacy
System
Data
Warehouse
Flat File Data Processing
Data Transformation
NUCES, Islamabad Campus Data Warehousing - Fall 2012 16
Time-variant
Provides information from historical perspective e.g.
past 5-10 years.
Every key structure contains either implicitly or
explicitly an element of time.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 17
Nonvolatile
Data once recorded cannot be updated.
Data Warehouse requires two operations in dataaccessing
Initial loading of data
Access of data
load
access
NUCES, Islamabad Campus Data Warehousing - Fall 2012 18
Summary: What is a Data Warehouse?
It is a blend of many technologies, the basicconcept being: Take all data from different operational systems
If necessary, add relevant data from industry
Transform all data and bring into a uniform format
Integrate all data as a single entity
Store data in a format supporting easy access fordecision support
Create performance enhancing indices
Implement performance enhancement joins
Run ad-hoc queries with slow selectivity
NUCES, Islamabad Campus Data Warehousing - Fall 2012 19
Benefits of Data Warehouse
High returns on investment.
Substantial competitive advantage.
Increased productivity of corporate decision-makers.
Fast reporting for decision making process.
Reduced reporting load on transactional systems.
Making institutional data more user-friendly andaccessible for knowledge workers.
Integrated data from different source systems.
Enabled „point-in-time‟ analysis and trending overtime.
Helps in identifying and resolving data integrity issues,either in the warehouse itself or in the source systemsthat collect the data.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 20
Data Warehouse: How is it Different?
1. Decision making is Ad-Hoc
NUCES, Islamabad Campus Data Warehousing - Fall 2012 21
Data Warehouse: How is it Different?
2. Different patterns of hardware utilization
Bus Service vs. Train
NUCES, Islamabad Campus Data Warehousing - Fall 2012 22
Data Warehouse: How is it Different?
3. Combines operational and historic data
Don‟t do data entry into a DWH. OLTP or ERP are the
source systems.
OLTP systems don‟t keep history, cannot get balance
statement more than a year old.
DWH keep historical data, even of bygone customers.
Why?
In the context of bank, want to know why the customer
left?
What are the events that led to his/her leaving? Why?
Customer retention
NUCES, Islamabad Campus Data Warehousing - Fall 2012 23
Data Warehouse: How is it Different?
How much history? Depends on:
Industry
Cost of storing historical data
Economic value of historical data
Industry and history
Telecom calls are much much more as compared to banktransactions
18 months
Retailers interested in analyzing yearly seasonal patterns 65 weeks, why?
Insurance companies want to do actuary analysis, use thehistorical data in order to predict risk
7 years
Hence NOT a complete repository of data.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 24
Data Warehouse: How is it Different?
How much history?
Economic value of data vs. storage cost
Data Warehouse a complete repository of
data?
NUCES, Islamabad Campus Data Warehousing - Fall 2012 25
Data Warehouse: How is it Different?
4. Usually (but not always) periodic or batch updatesrather than real-time
The boundary is blurring for active data warehousing.
For an ATM, if update not in real-time, then lot of realtrouble.
DWH is for strategic decision making based onhistorical data, would not hurt if transactions of lastone hour or day are absent.
Rate of update depends on:
Volume of data
Nature of business
Cost of keeping historical data
Benefit of keeping historical data
NUCES, Islamabad Campus Data Warehousing - Fall 2012 26
Data Warehouse: How is it Different?
5. Starts with 6x12 availability requirement … but 7x24usually becomes the goal.
Decision makers typically don‟t work 24 hrs a day and7 days a week. An ATM system does.
Once decision makers start using the DWH, and startreaping the benefits, they start liking it.
Start using the DWH more often, till want it available100% of the time.
For business across the globe, 50% of the world maybe sleeping at any one time, but the businesses areup 100% of the time.
100% availability not a trivial task, need to take intoaccount loading strategies, refresh rates etc.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 27
Data Warehouse: How is it Different?
6. Does not follows the traditional development model
• Requirements gathering
• Analysis
• Design
• Programming
• Testing
• Integration
• Implementation
• Design warehouse
• Integrate data
• Test for biasness / incorrectness
• Program w.r.t. data
• Design DSS system
• Analyze results
• Understand requirements
NUCES, Islamabad Campus Data Warehousing - Fall 2012 28
Data Warehouse: How is it Different?
7. Comparison of response times
OLAP (Online Analytical Processing) queries must be
executed in a small number of seconds.
Often requires de-normalization and/or sampling
Complex query scripts and large list selections can
generally be executed in a small number of minutes.
Sophisticated clustering algorithms e.g. data mining
can generally be executed in a small number of hours
(even for hundreds of thousands of customers).
NUCES, Islamabad Campus Data Warehousing - Fall 2012 29
Data Warehouse: How is it Different?
8. Data Warehouse vs. OLTP (Online Transaction Processing)
OLTP: Select tx_date, balance from tx_table where account_ID = 829;
DWH
Select balance, age, sal, gender from customer_table and tx_table
where age between (30 and 40) and education = „graduate‟ and
custID.customer_table = customer_ID.tx_table;
OLTP
Primary key used
No concept of primary index
May use a single table
Normally few rows returned
High selectivity of query
Indexing on primary key (unique)
DWH
Primary key NOT used
Primary index used
Mostly uses multiple tables
Normally many rows returned
Low selectivity of query
Indexing on primary index (non-
unique)
NUCES, Islamabad Campus Data Warehousing - Fall 2012 30
DWH vs. OLTP: Summary
DWH
Application neutral
Single source of „truth‟
Evolves over time
How to improve business
Historical, detailed data
Some summary
Lightly de-normalized
Hardly uses PK
No. of returned results in Ks
Minutes to hours
Typical availability 12x6
OLTP
Application specific
Multiple databases with repetition
Off the shelf application
Runs the business
Operational data
No summary
Fully normalized
Based on PK
No. of returned results in 100s
Sub seconds to seconds
Typical availability 24x7
Scope
Data
Perspective
Queries
Time Factor
NUCES, Islamabad Campus Data Warehousing - Fall 2012 31
DWH
Informational processing
Analysis
Knowledge workers
Decision support
Subject oriented
Complex query
Mostly read
Information out
High flexibility / autonomy
Query throughput
OLTP
Operational processing
Transaction
Clerks, DBAs etc.
Day to day operation
Application oriented
Short, simple transaction
Read / write
Data in
High performance / availability
Transaction throughput
Characteristics
Orientation
Users
Function
DB Design
Unit of work
Access
Focus
Priority
Metric
DWH vs. OLTP: Summary
NUCES, Islamabad Campus Data Warehousing - Fall 2012 32
Putting the pieces together
Data Warehouse Server
(Tier 1)
Data
Warehouse
Operational
Data Bases
Semistructured
Sources Query/Reporting
Data Marts
MOLAP
ROLAP
Clients
(Tier 3)
Tools
Meta
Data
Data sources
Data
(Tier 0)
IT
Users
Business
Users
Business Users
Data Mining
Archived
data
Analysis
OLAP Servers
(Tier 2)
Extract
Transform
Load
(ETL)
www data
NUCES, Islamabad Campus Data Warehousing - Fall 2012 33
Why is this hard?
Data sources are unstructured & heterogeneous.
Requirements are always changing.
Most computer scientist trained on OLTP systems,those concepts not valid for VLDB & DSS.
The scale factor in VLDB implementations is difficult tocomprehend.
Performance impacts are often non-linear O(n) vs.O(nlogn) e.g. scanning vs indexing.
Complex computer/database architectures.
Rapidly changing product characteristics.
…
NUCES, Islamabad Campus Data Warehousing - Fall 2012 34
DWH: High level implementation steps
Phase-I
Determine user needs
Determine DBMS Server platform
Determine hardware platform(s)
Information and Data Modeling
Construct metadata repository
Phase-II
Data acquisition and cleansing
Data transform, transport and populate
Determine middleware connectivity
Prototyping, querying and reporting
Data Mining
Online Analytical Processing (OLAP)
Phase-III
Deployment and System Management
NUCES, Islamabad Campus Data Warehousing - Fall 2012 35
Types of Data Warehouses
Financial
Telecommunication
Insurance
Human Resource
Global
Exploratory
…
NUCES, Islamabad Campus Data Warehousing - Fall 2012 36
Types of Data Warehouses
Financial
First Data Warehouse that an organization
builds. This is appealing because:
Nerve center, easy to get attention.
In most organizations, smallest data set.
Touches all aspects of an organization with a
common denomination i.e. money.
Inherent structure of data directly influenced by
the day-to-day activities of financial processing.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 37
Types of Data Warehouses
Telecommunication
Dominated by sheer volume of data
Many ways to accommodate call leveldetail: Only a few months of call level detail.
Storing lots of call level detail scattered overdifferent storage media.
Storing only selective call level detail etc.
Unfortunately, for many kinds of processing,working at an aggregate level is simply notpossible as finding patterns will be difficult.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 38
Types of Data Warehouses
Insurance
Insurance Data Warehouses are similar toother Data Warehouses BUT with fewexceptions: Store data that is very old and used for actuarial
processing / analysis.
Typical business may change dramatically overlast 40-50 years, but not insurance.
In retailing or telecom, there are few importantdates but in the insurance environment there aremany dates of many kinds.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 39
Types of Data Warehouses
Insurance
Insurance Data Warehouses are similar toother Data Warehouses BUT with fewexceptions (Contd.): Long operational business cycles, in years.
Processing time in months. Thus the operatingspeed is different.
Transactions are not gathered and processedbut are in kind of „frozen‟.
Thus a very unique approach of design &implementation.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 40
DWH: Typical Applications
Impact on organization‟s core business is tostreamline and maximize profitability. Fraud Detection
Profitability Analysis
Direct Mail / Database Marketing
Credit Risk Prediction
Customer Retention Modeling
Yield Management
Inventory Management
ROI on any one of these applications canjustify HW / SW and Consultancy costs in mostorganizations.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 41
DWH: Typical Applications
Fraud Detection
By observing data usage patterns
People have typical purchase patterns
Deviation patterns
Certain cities notorious for fraud
Certain items bought by stolen cards
Similar behavior for stolen cards
NUCES, Islamabad Campus Data Warehousing - Fall 2012 42
DWH: Typical Applications
Profitability Analysis
Banks know if they are profitable are not
Don‟t know which customers are profitable
Typically more than 50% are NOT profitable
Don‟t know which one?
Balance is not enough, transactionalbehavior is the key
Restructure products and pricing strategies
Life time profitability models (next 3-5 years)
NUCES, Islamabad Campus Data Warehousing - Fall 2012 43
DWH: Typical Applications
Direct Mail Marketing
Targeted marketing
Offering high bandwidth package NOT to all
users
Know from call detail records of web surfing
Saves marketing expense, saving pennies
Knowing your customer better
NUCES, Islamabad Campus Data Warehousing - Fall 2012 44
DWH: Typical Applications
Credit Risk Prediction
Who should get a loan?
Customer segregation i.e. stable vs. rolling
Qualitative decision making NOT subjective
Different interest rates for different
customers
Do not subsidize bad customer on the basis
of good
NUCES, Islamabad Campus Data Warehousing - Fall 2012 45
DWH: Typical Applications
Yield Management
Works for fixed inventory businesses
The price of item suddenly goes to zero
Item prices vary for varying customers
Examples: Airlines, Hotels etc.
E.g. Price of air ticket depends on: How much in advance ticket was bought?
How many vacant seats were available?
How profitable is the customer?
Ticket is one-way or return?
NUCES, Islamabad Campus Data Warehousing - Fall 2012 46
DWH: Recent Applications
Agriculture Systems
Agriculture related data collected for
decades
Metrological data consists of 50+ attributes
Decision making based on expert judgment
Lack of integration results in underutilization
What is required, in which amount and
when?
NUCES, Islamabad Campus Data Warehousing - Fall 2012 47
DWH: Typical Early Adopters
Financial service / insurance
Retailing and distribution
Telecommunications
Transportation
Government
Common thread:
Lots of customers and transactions.
NUCES, Islamabad Campus Data Warehousing - Fall 2012 48
DWH: End User Expectations
Point and click access to data
Insulation from DBMS structures
Want semantic data model – not 3rd normalform
Integration with existing tools: Excel,SAS etc.
Interactive response times for onlineanalysis but batch time is important aswell.