Introduction to Data Warehousing, Profiling and Cleansing
TRANSCRIPT
COMP33111, 2011/2012 1
2
COMP33111 Lecture 2
Introduction to
Data Warehousing,
Profiling and Cleansing
Goran Nenadic
School of Computer Science
3
Aims
Understand the need for data warehousing
Learn basic principles of data warehousing
Understand data warehousing models and
architectures
Review the basic approaches to data
profiling and cleansing
4
Plan
Lecture today
Tutorial and lab exercise 2 covers practical aspects and examples
Lab test 1: 22 October 2012, 14-15 (3rd year Lab)
5
Need for data analysis
Modern enterprise environment
competition is more intense than ever
quantity of information is increasing
In order to succeed, an organisation must have a comprehensive view of all of its aspects
-> data integration
make informed and reliable decisions
-> data analysis
take timely actions and make accurate predictions
recall from the Intro lecture
6
Online transaction processing
Traditional “transactional” information, i.e.
operational data that documents everyday life
in an enterprise/organisation
retail (e.g. sales in supermarket stores)
financial services (e.g. ATM withdrawals)
transport (e.g. flight bookings)
telecommunications (e.g. mobile billing, Internet)
healthcare (e.g. drug prescriptions)
This is OLTP or operational or transactional
data
recall from the Intro lecture
7
Online transaction processing
OLTP: processing and recording transactions
that create new data and/or update existing
information in operational DBs
insertions, updates, deletions
Typically a small number of rows are affected in
each transaction
Traditional DBMS optimised to perform well in
OLTP, but not in comprehensive exploration,
aggregation and decision making
recall from the Intro lecture
8
Data management
Vertical fragmentation of information systems
into numerous, heterogeneous OLTP systems
9
Goal: unified access to data
Collect and combine data
Provide integrated view
Support data sharing
10
Traditional query-driven approach
Translate a query into a series of queries to the
different operational sources and then combine
the answers
11
Traditional query-driven approach
PROS
• Complete data: all data available at the sources can be used to answer the query
• Always “fresh” data
CONS
• Incomplete data: sometimes a data source is unavailable (e.g. broken connection)
• Delay in query processing: slow information sources, complex filtering and integration
• Inefficient and potentially expensive for frequent queries
• Competes with/disturbs local processing at the sources
• Possible problems with data access (e.g. personal data)
continued
12
Data warehousing approach
Data integrated in advance
13
Data warehouse (DW)
DW: an integrated database designed to support data analysis, business intelligence, and better and faster decision making
DWs integrate and aggregate data from various operational and external DBs maintained by different units
different data representations, inconsistencies, etc.
DW needs to provide more complex aggregation and analysis of data
mining “new” data (e.g. spending trends)
14
Data warehouse - definitions
“A data warehouse is a subject-oriented, integrated,
time-variant, and non-volatile collection of data in
support of the management’s decision-making
process.”
-- Bill Inmon (~1990)
15
Data warehouse - definition
Subject-oriented: the warehouse is organised
around the major subject(s) of the enterprise.
Integrated: it merges various source data from
different applications and systems.
Time-variant: data in the warehouse is only accurate and valid at some point in time or over some time interval.
Non-volatile: data is not updated in real time but is refreshed from operational systems on a regular basis.
continued
16
Data warehouse characteristics
DW typically integrates several resources
e.g. sales DBs from various regions/states/years
Requires more historical data than generally maintained
in operational DBs
Must be optimised for access to very large amounts of
data
Mostly read-accessed, rarely write-accessed
DW data may be more coarse grained than in
operational DBs;
e.g. aggregated, summarised, de-personalised
17
Data warehouse characteristics
DWs are maintained separately from operational data
functional reasons
“historical” data (e.g. sales over long periods of time)
consolidated data (aggregated, summarised from
various sources, including both internal and external)
performance reasons
avoid degradation of services (operational DBs are not
optimised for “non-transactional” applications)
DW updates may not be so frequent
Difference between real-time and derived data
continued
18
General DW role
[diagram: operational DBs and external DBs feed the data warehouse through ETL (Extract, Transform, Load); the warehouse and its derived data support OLAP, DSS, data mining and other applications]
19
Major DW issues
DW design
Data extraction
monitors (change detectors), updates
Integration
cleansing & merging
DW maintenance and evolution
Optimisations
20
Data warehouse
modelling and design
21
DW modelling and design
How to organise a DW to facilitate
views/queries with aggregates, e.g., sum,
min, avg, etc
maintenance difficult and
computation costly
need to decide what to pre-compute and when
storage of large amounts of data
22
DW modelling and design
Modelling questions
What are the user requirements (types of analysis needed)?
Which data should be considered (measures)?
Which data is available (OLTP, integration)?
What is the context of the data?
What are the data inter-dependences (integration)?
What is the scale of the project?
Define data model and design the schema
traditional ER (entity-relationship) modelling techniques may not be appropriate
multidimensional model
continued
23
Multidimensional data model
Designed for the analysis of multi-dimensional
data, i.e. data organised along dimensions, e.g.,
time (days, months, weeks, quarters, ...),
space (shop, region, country,...),
product (food, non-food, dairy, imported, blue,...)
customers (private, public, large, old,...),...
Popular data model for DW
24
Multidimensional data model
Dimensionality modelling
logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access.
Uses the concepts of entity-relationship (ER)
modelling with some restrictions
Focused on key performance indicators that
need to be analysed
continued
25
Multidimensional data model
DW: a set of points in a multi-dimensional space e.g. (store, product, amount, time)
time is typically one of the dimensions, as it is of particular significance to decision support
Objects of analysis (key performance indicators) are typically numeric measures that can be aggregated
e.g. sales, budget, number of applications, …
Dimensions describe characteristics of the measures, i.e. their “context”
continued
26
Example
[diagram: Sales (the measures) at the centre, linked to Time, Shop, Product and Location dimensions, each with its own characteristics; example aggregates: sales per item, sales per location, sales per month]
27
Cube representation
3 dimensions: store, product, time
1 measure: amount (amt)
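The cube can be sketched as a set of points keyed by dimension values, with the measure summed whenever a dimension is "collapsed"; the store/product/time values and amounts below are invented for illustration:

```python
from collections import defaultdict

# A tiny "cube": each cell is (store, product, time) -> amt (the measure).
# The dimension values and amounts are made up for illustration.
cells = {
    ("S1", "milk",  "2012-01"): 10,
    ("S1", "bread", "2012-01"): 5,
    ("S2", "milk",  "2012-01"): 7,
    ("S1", "milk",  "2012-02"): 12,
}

def aggregate(cells, keep):
    """Sum the measure over all dimensions NOT listed in `keep`.

    `keep` holds indices into the (store, product, time) key tuple."""
    result = defaultdict(int)
    for key, amt in cells.items():
        result[tuple(key[i] for i in keep)] += amt
    return dict(result)

# Total amount per store (summing out product and time):
print(aggregate(cells, keep=(0,)))  # {('S1',): 27, ('S2',): 7}
```

The same function answers any "per dimension(s)" query, e.g. `keep=(0, 2)` gives the amount per store and month.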
28
Multidimensional data example
Dimensions: Product, City, Date
Hyper-cube representation
[figure: cube with axes Product, City and Date/Time]
29
Example – cube
30
Example
[figure: multidimensional cube for credit card purchases
dimensions: Month (Jan.–Dec.), Category (Supermarket, Miscellaneous, Restaurant, Travel, Retail, Vehicle), Region (One–Four)
example cell: Month = Dec., Region = Two, Category = Vehicle, Count = 110, Amount = 6,720]
31
Dimension modelling
Rule of thumb: dimensions involve attributes that
provide the context for measures
Each dimension should have a single base level,
i.e. the level of finest granularity.
“day” can be the base level for the “time” dimension
“products” can be the base level for the “product”
dimension
“store” can be the base level for the “location” dimension
32
Dimension modelling
Base level elements can then be aggregated or
summarised to more coarse-grained levels
“days” -> weeks, months, quarter, year, ....
“products” -> brand, food/non-food, supplier, ....
“store” -> district, region, country, urban/rural, ....
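Aggregating from the base level to a coarser level can be sketched as a roll-up from days to months; the day-level sales figures are invented for illustration:

```python
from collections import defaultdict

# Base-level facts: sales per day (figures invented for illustration).
daily_sales = {
    "2012-01-05": 100,
    "2012-01-20": 150,
    "2012-02-03": 80,
}

def roll_up_to_month(daily):
    """Aggregate day-level sales to month level ("YYYY-MM" prefix of the date)."""
    monthly = defaultdict(int)
    for day, sales in daily.items():
        monthly[day[:7]] += sales  # coarser granularity: the month
    return dict(monthly)

print(roll_up_to_month(daily_sales))  # {'2012-01': 250, '2012-02': 80}
```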
33
Dimension modelling - attributes
Each dimension has a set of attributes
e.g. product: category, year, manufacturer
e.g. store: town, county, manager, size
Attributes can be related
hierarchically
lattice
34
Lattice hierarchy
35
DW data models
For multidimensional implementation in a relational DB, we use two types of tables:
dimension tables: tuples of attributes of a dimension
one table for each of the dimensions
fact table: contains the data – “facts”
typically only one fact table (potentially with multiple measures)
36
Example:
dimension table for product: product ID, name, manufacturer, type, etc.
dimension table for fiscal quarter: quarter ID, quarter number, year, BEG/END dates
dimension table for region: region ID, county, capital
fact table: (product_ID, quarter_ID, region_ID, sales, qty)
continued
DW data models
37
DW data models
[diagram: fact table with columns product_ID, quarter_ID, region_ID, sales, qty; the key columns join the fact table to the Product Info, Quarter Info and Region Info dimension tables, while sales and qty hold numerical values]
continued
38
DW data models
Each dimension table has a simple primary key, e.g. product_ID or quarter_ID
The fact table uses a composite primary key: it is made up of two or more foreign keys linking to the dimension tables.
Such a simple model structure is called a star schema
continued
39
Star schema example
[see separate sheet]
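A star schema along the lines of the product/quarter/region example can be sketched in SQLite; the table and column names here are illustrative, not those of the separate sheet:

```python
import sqlite3

# Minimal star schema following the product/quarter/region example;
# all table and column names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, manufacturer TEXT, type TEXT);
CREATE TABLE quarter (quarter_id INTEGER PRIMARY KEY, quarter_no INTEGER, year INTEGER);
CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, county TEXT, capital TEXT);
-- Fact table: composite primary key made up of foreign keys to the dimensions.
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product(product_id),
    quarter_id INTEGER REFERENCES quarter(quarter_id),
    region_id  INTEGER REFERENCES region(region_id),
    sales REAL, qty INTEGER,
    PRIMARY KEY (product_id, quarter_id, region_id)
);
""")
con.execute("INSERT INTO product VALUES (1, 'Milk', 'Acme', 'dairy')")
con.execute("INSERT INTO quarter VALUES (1, 1, 2012)")
con.execute("INSERT INTO region  VALUES (1, 'Greater Manchester', 'Manchester')")
con.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 1000.0, 400)")

# A typical star join: total sales per product name and year.
rows = con.execute("""
    SELECT p.name, q.year, SUM(f.sales)
    FROM sales_fact f
    JOIN product p ON f.product_id = p.product_id
    JOIN quarter q ON f.quarter_id = q.quarter_id
    GROUP BY p.name, q.year
""").fetchall()
print(rows)  # [('Milk', 2012, 1000.0)]
```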
40
Storing aggregated data
Star schema: data of all granularity is kept in a
single fact table
need an aggregation level indicator
“Fact constellation” schema
aggregate tables are kept separate from fact tables
one table per combination of levels of dimensions
41
Fact constellation schema
42
Snowflake schema
When dimensions have very high cardinality
(many attributes and tuples):
(partly) normalise the dimension tables by
attribute level, with each smaller dimension
table pointing to an appropriate aggregated
fact table
one table per dimension level
(star schema: 1 table per dimension)
different fact tables as in fact constellation
43
Snowflake schema example
Dimension Branch has its own
dimension City which in turn has a
dimension Region.
Dimension tables are organised in a hierarchy by normalisation
44
Snowflake schema
For each level of a dimension, all tuples of this
level are stored in their own table together with
the attributes for this level plus the key(s) for the
parent “level(s)”, e.g.
time(day,month,week,quarter,year) is split into
time1(day, week, month),
time2a(week, year),
time2b(month, quarter),
time3(quarter, year)
continued
45
Data model schemas
Star schema: a logical structure that has a fact table containing factual data in the centre, surrounded by dimension tables (for each dimension) containing reference data.
Snowflake schema: a variant of the star schema where dimension tables can further have dimension tables, a kind of fractal structure.
Starflake schema: a hybrid structure that contains a mixture of star and snowflake schemas.
46
Data model schemas
Understand which summary levels exist, i.e. will be used:
understand dimensions
Start with a star schema and then create the
“snowflakes” with queries:
find the proper “snowflaked” attribute tables
continued
47
Data warehouse
architectures
48
Data warehouse architecture
49
Architectural components - data
Operational data: source data (operational DBs) for the DW
Operational data-store: a repository of operational data
Meta-data: description of data (data about data)
used for acquisition of data, DW management, and querying
50
Architectural components - data
Detailed data
Lightly and highly summarised data: aggregation of other data, generated by the warehouse manager
e.g. “summary” tables that hold aggregations such as sums, averages, counts of values in other tables
other derived data (“calculated columns”)
Archive/backup data
continued
51
Architectural components - tools
Load manager: extraction, acquisition and loading of data into the DW
Warehouse manager: management of the DW, e.g. indexing, backup
Query manager: management of user queries
End-user access tools: reporting, querying, application development tools
executive information system (EIS) tools
online analytical processing (OLAP) tools
data mining tools
52
Load manager tasks
Data extraction: acquisition of data from operational DBs
design changes in operational DBs require adjustments
Data cleansing: detect data anomalies and rectify them
Data transformation/reformatting: transforming and linking data
Loading: loading data, building indexes, etc.
Refreshing: propagate updates from operational DBs to the DW
53
Data quality problems
Instance-level problems
errors and inconsistencies in the actual data contents
noisy, dirty data – e.g. missing values, duplicates, etc.
Schema-level problems
referential integrity
heterogeneous data models
54
Data quality problems
Rahm, E.; Do, H.H.
Data Cleaning: Problems and Current Approaches
IEEE Techn. Bulletin on Data Engineering, Dec. 2000
continued
55
Examples
56
Data cleansing
Several steps:
Data analysis and profiling: data type, illegal values,
length, value ranges, variance, uniqueness, null
values, typical string patterns (e.g. phone numbers) ...
Mapping rules: define functions to clean up as many of the inconsistencies as possible (e.g. attribute splitting, approximate joins, spelling checks, etc.)
Verification: has everything worked correctly?
Backflow of cleaned data: highlight the problems
Look at tutorials 1 and 2
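The profiling step above (null values, uniqueness, typical string patterns) can be sketched for a single column; the column values and the phone-number pattern are invented for illustration:

```python
import re

# Profiling one column: null count, distinct values, and a simple pattern
# check (a UK-style phone number pattern); the data is made up for illustration.
values = ["0161 306 0000", "0161 306 1234", None, "n/a", "0161 306 1234"]

nulls      = sum(1 for v in values if v in (None, "", "n/a"))  # missing values
non_null   = [v for v in values if v not in (None, "", "n/a")]
distinct   = len(set(non_null))                                # uniqueness
phone_pat  = re.compile(r"^\d{4} \d{3} \d{4}$")                # typical string pattern
bad_format = [v for v in non_null if not phone_pat.match(v)]   # illegal values

print(f"nulls={nulls}, distinct={distinct}, bad_format={bad_format}")
```

Such per-column statistics feed directly into the mapping rules of the next step.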
57
Data transformation/reformatting
Link and consolidate data from different sources: handle possibly different DB schemas
parsing and fuzzy matching may be needed
avoid unnecessary redundancy
Reformatting/mapping as appropriate: town -> region -> state
date -> month -> fiscal quarter -> year
Reconcile semantic mismatches: transform into uniform measures (e.g. currency)
e.g. “gender” and “sex” should be the same
58
Data transformation/reformatting
Format revisions (data format of variables)
Decoding of fields (0 = disable; 1 = enable)
Character sets, units of measurement, date/time
conversions
Calculated and derived values
Splitting single fields, merging of multiple fields
Key restructuring (come up with keys for the fact and
dimension tables based on the keys in source systems)
Deduplication
typical tasks
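A few of these tasks (decoding of fields, splitting single fields, deduplication) can be sketched as follows; the record layout and code table are invented for illustration:

```python
# Sketch of typical transformation tasks: decoding a coded field, splitting a
# single field, and deduplication; the records are made up for illustration.
raw = [
    {"name": "Smith, John", "status": "1"},
    {"name": "Smith, John", "status": "1"},  # duplicate record
    {"name": "Jones, Mary", "status": "0"},
]

STATUS = {"0": "disabled", "1": "enabled"}   # decoding of fields

def transform(record):
    # Splitting a single field into surname and forename.
    surname, forename = [p.strip() for p in record["name"].split(",")]
    return (surname, forename, STATUS[record["status"]])

# Deduplication: a set keeps only one copy of identical transformed records.
clean = sorted(set(transform(r) for r in raw))
print(clean)  # [('Jones', 'Mary', 'disabled'), ('Smith', 'John', 'enabled')]
```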
59
Loading data
“Materialise” prepared data and store it in DW
Large volumes of data to be loaded
may take several days to load
can be performed in parallel
Building
summary tables (various pre-calculated aggregations)
indexes (supporting the most likely queries)
meta-data (various information about the data)
60
Meta-data generation
Meta-data: data about data
Build a meta-data repository
Document the following:
source DBs
DW schema
loading process description (applied transformation rules, cleaning methods, etc.)
refreshing policy
security issues, user profiles/groups
statistics
etc.
61
ETL process
Previous tasks also known as ETL
extract
transform
load
[diagram: operational DBs and external DBs feed the data warehouse through ETL (Extract, Transform, Load)]
62
[figure slide: summary]
63
Refreshing data in DW
Refreshing intervals: e.g. every night, every week, or quarterly
typically not so frequent (depending on the subject area; e.g. for stock DWs, up-to-the-minute data may be required)
Possibly different refreshing periods for different operational DBs
Approaches:
full extraction and loading (expensive, not efficient)
incremental: detect and propagate only the changes in operational DBs
logical and transactional correctness and integrity
purging out-of-date data
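The incremental approach can be sketched with a "last modified" high-water mark; the row layout and dates are invented for illustration:

```python
# Incremental refresh sketch: propagate only the rows changed since the last
# load, using a "last modified" high-water mark; data made up for illustration.
source_rows = [
    {"id": 1, "amount": 10.0, "modified": "2012-10-01"},
    {"id": 2, "amount": 20.0, "modified": "2012-10-15"},
    {"id": 3, "amount": 30.0, "modified": "2012-10-20"},
]

def incremental_extract(rows, last_load):
    """Return only the rows modified after the previous load (ISO dates
    compare correctly as strings)."""
    return [r for r in rows if r["modified"] > last_load]

changes = incremental_extract(source_rows, last_load="2012-10-10")
print([r["id"] for r in changes])  # [2, 3]
```

Only `changes` would then be transformed and loaded into the DW, instead of re-extracting everything.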
65
Data warehouse data flows
66
Data warehouse data flows
In-flow: extraction, cleansing and loading of the source data
Up-flow: adding value to the data in the warehouse through summarising, packaging and distribution of the data
Down-flow: archiving and backing-up data in the warehouse
Out-flow: making the data available to end-users
Meta-flow: managing the meta-data
continued
68
DW architecture types
Central warehouse
Distributed warehouse: needs replication, partitioning, communication
replicated metadata repository on each site
Federated warehouse: decentralised autonomous warehouses (e.g. containing data marts)
Data web-house: a distributed data warehouse implemented over the Web with no central data repository
69
Data marts
Enterprise warehouse: collects all information
about subjects that span the entire organisation
products, sales, customers, locations, finance ...
requires extensive business modelling
may take years to design & build
Data marts: special-purpose warehouses
a subset of a data warehouse that supports the
requirements of a particular department or
business function.
70
Data marts
Can be standalone or linked centrally to the
corporate data warehouse
can be built without enterprise-DW: faster roll out, but
complex integration in the long run
Still, often used for building a DW
loading and consolidation from data marts
Good to control/allow access to data
continued
71
Data warehousing:
trends and open issues
72
Multi-dimensional model
Benefits
Predictable, standard framework
Responds well to changes in user reporting needs
Relatively easy to add data without reloading tables
Standard design approaches have been developed
A number of products support the dimensional model
Denormalised and indexed relational models are more flexible
Multidimensional models are simpler to use and more efficient
73
Multidimensional model
Typically not in 3NF, since performance can be truly awful; most of the work performed on denormalising a data model is an attempt to reach performance objectives.
The normalised structure can be overwhelmingly complex: we may wind up creating many small relations which the user might think of as a single relation or group of data.
continued
74
Typical problems
Data homogenisation
Complexity of integration
Hidden problems with source systems
Required data not captured
Data ownership/access/privacy policies
Underestimation of resources for data loading
High demand for resources
Increased end-user demands
High maintenance costs
75
Challenges
Design and management of the DW project: managing design, construction, implementation
long-term and large-scale project
monitoring utilisation patterns and design solutions
Administration of the DW: an intensive, major activity
includes possible evolutions of data models
Quality of data: merging data from heterogeneous sources
integration issues (different namings, etc.)
76
Open issues
Automating acquisition and loading of operational data
Self-maintainability
Intelligent meta-data extraction/generation
Performance optimisation: quite large DBs (many terabytes, growing by as much as 50% a year)
Privacy and security: depersonalisation of data, user authentication and authorisation, role-based access control
see [Elmasri & Navathe; sec. 23.3.2]
77
Examples of
DW systems
78
Examples of DW systems
IBM DB2 Universal Data Warehouse Edition: a business intelligence platform that includes DB2, federated data access, data partitioning, enhanced extract, transform, and load (ETL), workload management, etc.
see a case study on EDEKA
Microsoft SQL Server Data Transformation Services: ETL of data from source(s) to the data warehouse
see a case study on Barnes & Noble (web)
Oracle Business Intelligence Warehouse Builder: dimensional design, ETL, deployment manager
see tutorials/demos/documents on the web
81
Demos and tutorials
PALO OLAP (Jedox)
open source, available for download
demos available
Dundas OLAP services: demos available
Talend: enterprise-ready open source data warehousing and integration
see case studies (e.g. Virgin Mobile, Sony – gaming analytics)
Radar-soft OLAP and Visual Analysis
See Tutorial 3 for Dundas OLAP and Radar-soft
82
Dundas OLAP Services
See Tutorial 3
83
Data warehousing:
wrapping up
84 84
[diagram: operational data – detailed transactional data from the London, Sale and Manchester branches, plus external census and GIS data – is integrated, cleaned and summarised into the data warehouse; the warehouse supports data analysis through direct query, reporting tools, mining tools and OLAP]
85
General 3x3 DW architecture
86
Steps for building a DW
1. Choose relevant business processes
2. Choose the grain (granulation) of DW
3. Identify the dimensions and model the schema
4. Choose, extract, transform and load the facts
5. Store pre-calculations in the fact table(s)
6. Round out the dimension tables
7. Choose the duration/refresh of the database
8. Track slowly changing dimensions
9. Decide the query priorities and modes
87
Summary
DW: an integrated database for supporting higher-level analysis of data
analysis using multi-dimensional and pre-aggregated data
data model: multidimensional
Integrates and aggregates operational DBs
data profiling and cleansing
data quality
Access to huge amounts of data
optimised querying and indexing is needed
Huge benefits
supports competitive business intelligence
also, scientific data warehouses
88
Reading for this lecture
Chapters 31 and 32 in [Connolly & Begg]
Also, chapter 28 (28.1–28.4, 28.7) in [Elmasri & Navathe]
Rahm, E.; Do, H.H.: Data Cleaning: Problems and Current Approaches. IEEE Techn. Bulletin on Data Engineering, Dec. 2000
Other on-line materials: Blackboard and
http://www.cs.manchester.ac.uk/ugt/COMP33111/
Additional reading (on the web)
Chaudhuri, S.; Dayal, U.: An overview of data warehousing and OLAP technology (http://portal.acm.org/citation.cfm?id=248616)
W.H. Inmon: Building the Data Warehouse, Wiley, 2002, ISBN 0471081302
Complete Tutorial and lab exercises 2
Try Dundas OLAP services (Lab/tutorial exercise 3)