data warehousing xintao wu. can you easily answer these questions? what are personnel services costs...
TRANSCRIPT
Data Warehousing
Xintao Wu
Can You Easily Answer These Questions?
What are Personnel
Services costs across all
departments for all funding sources?
What are the effects of
outsourcing specific services?
What is the correlation between
expenditures and collection of
delinquent taxes?
What is the impact on revenues and
expenditures of changing the operating
hours of the Dept. of Motor Vehicles?
What is the economic impact of the small business initiative in
our district?
Overview: Data Warehousing and OLAP Technology for Data Mining
What is a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
What is a Warehouse?
Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile
more
What is a Warehouse?
Collection of tools gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making
Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date detailed, flat relational isolated
historical, summarized, multidimensional integrated, consolidated
usage repetitive ad-hoc
access read/write index/hash on prim. key
lots of scans
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size GB-TB TB-PB
metric transaction throughput query throughput, response
Overview: Data Warehousing and OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Why Separate Data Warehouse?
High performance for both systems DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery Warehouse—tuned for OLAP: complex OLAP
queries, multidimensional view, consolidation.
Different functions and different data: missing data: Decision support requires historical
data which operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Warehouse Architecture
Client Client
Warehouse
Source Source Source
Query & Analysis
Integration
Metadata
Advantages of Warehousing
High query performanceQueries not visible outside warehouseLocal processing at sources unaffectedCan operate when sources unavailableCan query data not stored in a DBMSExtra information at warehouse Modify, summarize (store aggregates) Add historical information
Overview: Data Warehousing and OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
all
time item location supplier
time,item time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,location
time,item,supplier
time,location,supplier
item,location,supplier
time, item, location, supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions &
measures Star schema: A fact table in the middle connected to a set
of dimension tables Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
Example of Star Schema
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcityprovince_or_streetcountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_type
item
branch_keybranch_namebranch_type
branch
Example of Snowflake Schema
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcity_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_key
item
branch_keybranch_namebranch_type
branch
supplier_keysupplier_type
supplier
city_keycityprovince_or_streetcountry
city
Example of Fact Constellation
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcityprovince_or_streetcountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_type
item
branch_keybranch_namebranch_type
branch
Shipping Fact Table
time_key
item_key
shipper_key
from_location
to_location
dollars_cost
units_shipped
shipper_keyshipper_namelocation_keyshipper_type
shipper
Typical OLAP Operations
Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
Overview: Data Warehousing and OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Multi-Tiered ArchitectureMulti-Tiered Architecture
DataWarehouse
ExtractTransformLoadRefresh
OLAP Engine
AnalysisQueryReportsData mining
Monitor&
IntegratorMetadata
Data Sources Front-End Tools
Serve
Data Marts
Operational DBs
other
sources
Data Storage
OLAP Server
OLAP Server Architectures
Relational OLAP (ROLAP) ROLAP - provides a Multi-dimensional view of a relational DB (e.g.
MicroStrategy) Use relational or extended-relational DBMS to store and manage warehouse data
and OLAP middle ware to support missing pieces Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services greater scalability
Multidimensional OLAP (MOLAP) Array-based multidimensional storage engine (sparse matrix techniques) fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers specialized support for SQL queries over star/snowflake schemas
MOLAP versus ROLAP
MOLAPMultidimensional OLAPData stored in multi-dimensional cubeTransformation requiredData retrieved directly from cube for analysisFaster analytical processingCube size limitations
ROLAPRelational OLAPData stored in relational database as virtual cubeNo transformation neededData retrieved via SQL from database for analysisSlower analytical processingNo size limitations
Data Warehouse Back-End Tools and UtilitiesData extraction: get data from multiple, heterogeneous, and external
sourcesData cleaning: detect errors in the data and rectify them when
possibleData transformation: convert data from legacy or host format to
warehouse formatLoad: sort, summarize, consolidate, compute views, check
integrity, and build indices and partitionsRefresh propagate the updates from the data sources to the
warehouse
Overview: Data Warehousing and OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehouse to data mining
Data Mining: A KDD Process
Data mining: the core of knowledge discovery process.
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of application
Creating a target data set: data selectionData cleaning and preprocessing: (may take 60% of effort!)Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining summarization, classification, regression, association,
clustering.Choosing the mining algorithm(s)Data mining: search for patterns of interestPattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Data Mining and Business Intelligence
Increasing potentialto supportbusiness decisions End User
Business Analyst
DataAnalyst
DBA
MakingDecisions
Data Presentation
Visualization Techniques
Data MiningInformation Discovery
Data Exploration
OLAP, MDA
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data SourcesPaper, Files, Information Providers, Database Systems, OLTP
Summary
Data warehouse A subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making process
A multi-dimensional model of a data warehouse
Star schema, snowflake schema, fact constellations A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivotingOLAP servers: ROLAP, MOLAP, HOLAPFrom OLAP to OLAM