datawarehouse tutorials
DESCRIPTION
Datawarehouse TutorialsTRANSCRIPT
Case study
Insurance domain
Agenda
• Introduction to Insurance domain
• What is Health insurance
• Identify Dimensions
• ETL
• Reporting
• Identify Facts
• Historical data maintanance
3
Introduction to Insurance
• Insurance provides protection against financial losses resulting from a variety of threats. By purchasing insurance policies, individuals and businesses can receive reimbursement for losses due to car accidents, theft of property, and fire and storm damage, medical expenses, and loss of income due to disability or death.
• Types of Insurance: Auto Insurance Home Insurance Health Insurance Life Insurance etc…
4
Health Insurance
• Health insurance (popularly known as Medical Insurance or Mediclaim) protects one against any financial constraints arising on account of a medical emergency. It sometimes includes disability and long term medical needs.
• In Mediclaim, you pay a premium and in return the insurer commits to pay a predetermined sum of money to meet the claims.
• Health insurance is available to individual and groups.
5
Health Insurance
• Health insurance (popularly known as Medical Insurance or Mediclaim) protects one against any financial constraints arising on account of a medical emergency. It sometimes includes disability and long term medical needs.
• In Mediclaim, you pay a premium and in return the insurer commits to pay a predetermined sum of money to meet the claims.
• Health insurance is available to individual and groups.
Which are our lowest/highest margin
customers ?
Which are our lowest/highest margin
customers ?
Who are my customers and what products are they buying?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition ?
Which customers are most likely to go to the competition ?
What impact will new products/services
have on revenue and margins?
What impact will new products/services
have on revenue and margins?
What product prom--otions have the biggest
impact on revenue?
What product prom--otions have the biggest
impact on revenue?
What is the most effective distribution
channel?
What is the most effective distribution
channel?
A customer wants to know….
Data, data everywhere yet...
• I can’t find the data I need data is scattered over the network many versions, subtle differences
• I can’t get the data I need need an expert to get the data
• I can’t understand the data I found available data poorly documented
• I can’t use the data I found results are unexpected data needs to be transformed from
one form to other
What the knowledge users want is...
• Data should be integrated across the enterprise
• Summary data has a real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required
But the Present Systems are…
• Run mission critical applications
• Need to work with stringent performance requirements for routine tasks
• Used to run a business
• Online Transaction Processing (OLTP) systems a.k.a. Operational Systems
Operational Systems…
• Run the business in real time
• Optimized to handle large numbers of simple read/write transactions
• Based on up-to-the-second data
• Optimized for fast response to predefined transactions
• Used by people who deal with customers, products - clerks, salespeople etc.
…What they need is a ‘Data Warehouse’
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context
[Barry Devlin]
12
Evolution
• 60’s: Batch reports• hard to find and analyze information• inflexible and expensive, reprogram every new
request• 70’s: Terminal-based DSS and EIS (executive
information systems)• still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis tools• query tools, spreadsheets, GUIs• easier to use, but only access operational databases
• 90’s: Data warehousing with integrated OLAP engines and tools
Data Warehouse
A data warehouse is a
• subject-oriented
• integrated
• time-varying
• non-volatile
collection of data that is used primarily in organizational decision-making
- Bill Inmon, Building the Data Warehouse 1996
14
Application-Orientation vs. Subject Orientation
Application-Orientation
Operational Database
LoansCredit Card
Trust
Savings
Subject-Orientation
DataWarehouse
Customer
VendorProduct
Activity
OLTP vs. Data Warehouse
OLTP Warehouse (DSS)
• Application Oriented • Subject Oriented
• Used to run business • Used to analyze business
• Current up to date • Summarized and refined
• Detailed data • Snapshot data
• Isolated Data • Integrated Data
• Repetitive access • Ad-hoc access
• Performance Sensitive • Performance relaxed
• Few Records accessed at a time (tens)
• Large volumes accessed at a time (millions)
• Read/Update Access • Mostly Read (Batch Update)
OLTP vs. Data Warehouse
OLTP Warehouse (DSS)
• Clerical User • Knowledge User (Manager)
• No data redundancy • Redundancy present
• DB Size (100MB -100GB) • DB Size (100GB - few terabytes)
• Transaction throughput is the performance
metric
• Query throughput is the performance metric
• 100s – 1000s of users • 10s – 100s of users
To summarize ...
OLTP Systems are used to “run” a business
The Data Warehouse helps to “optimize” the business
Data Warehouse Architecture
activities
19
Data Warehouse Architecture
Data Warehouse Engine
Optimized Loader
ExtractionCleansing
AnalyzeQuery
Metadata Repository
RelationalDatabases
LegacyData
Purchased Data
ERPSystems
Architecture
• Operational database layer The source data for the data warehouse — An organization's Enterprise Resource Planning systems fall into this layer.
• Data access layer The interface between the operational and informational access layer — Tools to extract, transform, load data into the warehouse fall into this layer
• Metadata layer The data directory - This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.
• Informational access layer The data accessed for reporting and analyzing data— Business intelligence tools fall into this layer.
Dimensional Modeling
Database organization• must look like business• must be recognizable by business user• approachable by business user
Schema Types• Star Schema• Snowflake schema
Dimension Tables
• Define business in terms already familiar to users
• Wide rows with lots of descriptive text• Small tables (generally, few thousands)• Joined to fact table by a foreign key• Typical dimensions include time periods,
geographic regions (markets, cities), products, customers, salesperson etc.
Dimensions are similar set of members upon which the user wants to base an analysis.
• Definition Descriptions of the business. The “which, who, how, where, or when that describes or explains the fact.”
• Characteristics - Discretely valued descriptions of variables which are more or less constant. - Enables “slicing and dicing” the facts by different variables.
• Examples Time Customer Product
Sales Fact
RevenueQtyCostGross margin
CustomerDimension
TimeDimension
BranchDimension
ProductDimension
Dimensions
Fact Table
• Central table• Mostly raw numeric items• Narrow rows, a few columns at most• Large number of rows (millions to billions)• Accessed via dimensions
Facts are measurements that users will apply arithmetic calculations to.• What is the average of…
• What is the total of…
Dimensional Model Schemas
• Dimensional Data Models majorly fall into two types of models:• Star Schema• Snowflake Schema
• Several factors influence schema choice:• Presentation restrictions• Inconsistency of data• Complex queries and analysis
Star Schema
• Star schema represents a compromise between the fully normalized model and the denormalized model.
• Descriptive ‘dimension’ information is maintained in a set of denormalized dimension tables.
STAR SCHEMA
A database design that stores a central fact table surrounded by multiple dimension tables.
Star Schema
• A single fact table and for each dimension one dimension table
• Does not capture hierarchies directly
T ime
prod
cust
city
fact
date, custno, prodno, city, ...
Star Schema
Snowflake Schema
• Snowflake schemas are most often used when dealing with large hierarchies that are static.
• Snow flaked tables (look-up tables) may increase the speed of queries depending on the presentation tool
SNOWFLAKE SCHEMA
A database design that stores a central fact table surrounded by multiple dimension tables decomposed or normalized into one or more hierarchies.
Snowflake schema
• Represents dimensional hierarchy directly by normalizing tables.
T ime
prod
cust
city
fact
date, custno, prodno, city, ...
region
Snowflake Schema
Slowly-Changing Dimensions
• Most dimensions change over time. • Products change offered coverage or limits
and deductibles. • Employees are promoted, fired, or change
departments.• Customers change names and addresses.
• What are our choices for tracking these changes over time?
Slowly-Changing Dimensions
• There are three types of slowly changing dimensions:• Type 1: Overwrites the old data for a
record with new data. This eliminates the ability to track history over time.
• Type 2: Creates a new record with the new data at the type of the change. Accurately tracks history, but requires generalized key.
• Type 3: Tracks new and original values in separate fields at time of change. Intermediate values are lost.
Type 1 - Overwrite Old Values
• Customer Lynnette Groves is changing her name to ?
• If there is no value in tracking this change, we will overwrite the First Name and Last Name fields with the new values.
• ‘UPDATE’ statement; 1 record is maintained.
Type 2 - Create New Record
• Lynnette Groves is changing her name and we want to track both values
• Add a second record with a new Customer Key and make it the active row
• ‘INSERT’ statement for new, ‘UPDATE’ for active; 2 records are maintained
• New record for each change up to n records
Type 3 - Original and Current
• We decide that no matter how many times she changes her name, we only want to track the original and the current.
• Before any changes, original and current are the same. Any name change updates ‘current’ fields.
• UPDATE’ statement; 1 record is maintained
Degenerate Dimensions
• Certain attributes are tracked that don’t necessarily belong in their own dimension - orphan attributes.
• This may occur when fact tables are designed to reflect the actual working document.
DEGENERATE DIMENSIONS
Dimensions that are so small and have no attributes of their own that they have been added to the fact table.
Degenerate Dimensions
• Examples include ‘order_number’, ‘bill_of_lading_num’, and ‘invoice_number’.
• While these fields seem very transaction oriented, they are helpful in grouping things such as all line items on an invoice.
• Including these fields on the fact table amounts to denormalizing the attribute due to the granularity of the fact table being the document itself or a line item of the document.
Types of Facts
• Understanding which facts can be added across which dimensions is an important data design issue.
• Three Types of Facts:• Additive• Non-Additive• Semi-Additive
Additive Facts
• Since aggregation is a key element in the usefulness of the dimensional model, its best utilized for facts that are additive, numeric values.
• We can add revenue, cost, and quantity sold for all products, all stores, and any time period.
ADDITIVE FACTS
Measurements in a fact table that can be added across all dimensions.
Semi-Additive Facts
• Current Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?)
• It does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information
SEMI-ADDITIVE FACTS
Measurements in a fact table that can be added across some dimensions but not others.
Non-Additive Facts
• A new value will need to be calculated at each level, for each set of data.
• Ratios Averages & Variance facts.
NON-ADDITIVE FACTS
Measurements in a fact table that cannot be added across any dimensions, like ratios.
Dimensional Modeling Process
Step 1: Choose the grain of each fact table.
• Granularity defines the level of detailed data.
• It must be determined prior to going forward in the modeling process.
• Typical grains are individual transactions, time-based aggregation, and/or aggregations along a commonly used dimension.
Dimensional Modeling Process
Step 2: Choose the dimension attributes.
• For example, what should our time dimension look like? Should it have just ‘January for month’, or also ‘Jan’ and ‘1’?
• Should we store the code and the description, just the code, or just the description?
• What values will our users need to filter or report on?
Dimensional Modeling Process
Step 3: Identify dimensional hierarchies.
• A dimension such as time may have days rolling into months and then quarters, as well as days rolling into weeks which may cross months and quarters.
• Sales geography may differ from physical geography.
• Zip codes can cross city boundaries and cities are made up of multiple zip codes.
Dimensional Modeling Process
Step 4: Choose the dimensions that apply to each fact table.
• Typical dimensions include time, product, policyholder, agent, and geography.
• Remember to evaluate granularity when applying dimensions to facts.
Dimensional Modeling Process
Step 5: Choose the measured facts, including pre calculated facts.
• Each aggregated and derived fact will need to be evaluated for inclusion in the model or calculation in the application.
• Trade-offs include storage and indexing and must be weighed against the access requirements.
Dimensional Modeling Process
Step 6: Determine slowly changing dimensions
• These are the dimensions that change over time.
• If tracking these changes is important, the method must be decided.
• Options: overwrite the existing record, store all records with effective dates, or a historical and current value tables.