data warehousing and data mining
DESCRIPTION
TRANSCRIPT
1University of Sunderland CIFM06 DB Systems Development Session 17
Data Warehousing and Data Data Warehousing and Data MiningMining
May 2006
2University of Sunderland CIFM06 DB Systems Development Session 17
ContentsContents
• Data Warehousing
• OLAP
• Data Mining
• Further Reading
3University of Sunderland CIFM06 DB Systems Development Session 17
Data WarehousingData Warehousing
• OLTP (online transaction processing) systems – range in size from megabytes to terabytes– high transaction throughput
• Decision makers require access to all data– Historical and current– 'A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile collection of data in support of management’s decision-making process' (Inmon 1993)
4University of Sunderland CIFM06 DB Systems Development Session 17
BenefitsBenefits
• Potential high returns on investment– 90% of companies in 1996 reported return of
investment (over 3 years) of > 40%
• Competitive advantage– Data can reveal previously unknown, unavailable
and untapped information
• Increased productivity of corporate decision-makers– Integration allows more substantive, accurate
and consistent analysis
5University of Sunderland CIFM06 DB Systems Development Session 17
ComparisonComparisonOLTP Data Warehouse
Holds current data Holds historic data
Stores detailed data Detailed, lightly/highlysummarised data
Data is dynamic Data largely static
Repetitive processing Ad hoc querying, unstructuredand heuristic processing
High transaction throughput Medium-low level transactionthroughput
Predictable usage patterns Unpredictable usage patterns
Transaction driven Analysis driven
Application oriented Subject oriented
Supports day-to-day decisions Strategic decisions
Large number ofclerical/operational users
Lower number of managerialusers
Source: Connolly and Begg p1153
6University of Sunderland CIFM06 DB Systems Development Session 17
Typical ArchitectureTypical Architecture
Warehouse mgr
Loadmgr
Warehouse mgr
Querymanager
DBMS
Meta-data Highlysummarizeddata
Lightly summarizeddata
Detailed data
Mainframe operationaln/w,h/w data
DepartmentalRDBMS data
Private data
External dataArchive/backup
Reporting query, appdevelopment,EIS tools
OLAP tools
Data-mining tools
Source: Connolly and Begg p1157
7University of Sunderland CIFM06 DB Systems Development Session 17
Data WarehousesData Warehouses
• Types of Data– Detailed– Summarised– Meta-data– Archive/Back-up
8University of Sunderland CIFM06 DB Systems Development Session 17
Information FlowsInformation Flows
Warehouse Mgr
Loadmgr
Warehouse mgr
Querymanager
DBMS
Meta-data
Highlysumm.data
Lightlysumm.
Detailed data
Operational datasource 1
Operational datasource n
Archive/backup
Reporting query, appdevelopment,EIS tools
OLAP tools
Data-mining tools
Meta-flow
Inflow
Downflow
Upflow
Outflow
Source Connolly and Begg p1162
9University of Sunderland CIFM06 DB Systems Development Session 17
Information Flow Information Flow ProcessesProcesses
• Five primary information flows– Inflow - extraction, cleansing and loading of data
from source systems into warehouse– Upflow - adding value to data in warehouse
through summarizing, packaging and distributing data
– Downflow - archiving and backing up data in warehouse
– Outflow - making data available to end users– Metaflow - managing the metadata
10University of Sunderland CIFM06 DB Systems Development Session 17
Problems of Data Problems of Data WarehousingWarehousing
1. Underestimation of resources for data loading2. Hidden problems with source systems3. Required data not captured4. Increased end-user demands5. Data homogenization6. High demand for resources7. Data ownership8. High maintenance9. Long duration projects10.Complexity of integration
11University of Sunderland CIFM06 DB Systems Development Session 17
Data Warehouse Data Warehouse DesignDesign
• Data must be designed to allow ad-hoc queries to be answered with acceptable performance constraints
• Queries usually require access to factual data generated by business transactions– e.g. find the average number of properties rented
out with a monthly rent greater than £700 at each branch office over the last six months
• Uses Dimensionality Modelling
12University of Sunderland CIFM06 DB Systems Development Session 17
Dimensionality Dimensionality ModellingModelling
• Similar to E-R modelling but with constraints– composed of one fact table with a composite primary key– dimension tables have a simple primary key which
corresponds exactly to one foreign key in the fact table– uses surrogate keys based on integer values
– Can efficiently and easily support ad-hoc end-user queries
13University of Sunderland CIFM06 DB Systems Development Session 17
Star SchemasStar Schemas
• The most common dimensional model• A fact table surrounded by dimension tables• Fact tables
– contains FK for each dimension table– large relative to dimension tables– read-only
• Dimension tables– reference data– query performance speeded up by denormalising
into a single dimension table
14University of Sunderland CIFM06 DB Systems Development Session 17
E-R Model ExampleE-R Model Example
Source: Connolly and Begg
15University of Sunderland CIFM06 DB Systems Development Session 17
Star Schema ExampleStar Schema Example
Source: Connolly and Begg
16University of Sunderland CIFM06 DB Systems Development Session 17
Other SchemasOther Schemas
• Snowflake schemas– variant of star schema– each dimension can have its own dimensions
• Starflake schemas– hybrid structure– contains mixture of (denormalised) star and
(normalised) snowflake schemas
17University of Sunderland CIFM06 DB Systems Development Session 17
OLAPOLAP
• Online Analytical Processing– dynamic synthesis, analysis and consolidation of
large volumes of multi-dimensional data
– normally implemented using specialized multi-dimensional DBMS
• a method of visualising and manipulating data with many inter-relationships
– Support common analytical operations such as• consolidation
• drill-down
• slicing and dicing
18University of Sunderland CIFM06 DB Systems Development Session 17
Codd’s OLAP RulesCodd’s OLAP Rules1. Multi-dimensional conceptual view2. Transparency3. Accessibility4. Consistent reporting performance5. Client-server architecture6. Generic dimensionality7. Dynamic sparse matrix handling8. Multi-user support9. Unrestricted cross-dimensional operations10. Intuitive data manipulation11. Flexible reporting12. Unlimited dimensions and aggregation levels
19University of Sunderland CIFM06 DB Systems Development Session 17
OLAP ToolsOLAP Tools• Categorised according to architecture of
underlying database– Multi-dimensional OLAP
• data typically aggregated and stored according to predicted usage
• use array technology
– Relational OLAP• use of relational meta-data layer with enhanced SQL
– Managed Query Environment• deliver data direct from DBMS or MOLAP server to desktop
in form of a datacube
20University of Sunderland CIFM06 DB Systems Development Session 17
MOLAPMOLAP
RDBServer
Load
MOLAPserver
Request
Result
PresentationLayer
Database/ApplicationLogic Layer
21University of Sunderland CIFM06 DB Systems Development Session 17
ROLAPROLAP
RDBServer
ROLAPserver
Request
Result
PresentationLayer
ApplicationLogic Layer
SQL
Result
DatabaseLayer
22University of Sunderland CIFM06 DB Systems Development Session 17
MQEMQE
RDBServer
Load
MOLAPserver
Request
Result
SQL
Result
End-usertools
23University of Sunderland CIFM06 DB Systems Development Session 17
Data MiningData Mining
• ‘The process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions’ (Simoudis, 1996)
– focus is to reveal information which is hidden or unexpected
– patterns and relationships are identified by examining the underlying rules and features of the data
– work from data up– require large volumes of data
24University of Sunderland CIFM06 DB Systems Development Session 17
Example Data Mining Example Data Mining ApplicationsApplications
• Retail/Marketing– Identifying buying patterns of customers– Finding associations among customer
demographic characteristics– Predicting response to mailing campaigns– Market basket analysis
25University of Sunderland CIFM06 DB Systems Development Session 17
Example Data Mining Example Data Mining ApplicationsApplications
• Banking– Detecting patterns of fraudulent credit card use– Identifying loyal customers– Predicting customers likely to change their credit
card affiliation– Determining credit card spending by customer
groups
26University of Sunderland CIFM06 DB Systems Development Session 17
Data Mining Data Mining TechniquesTechniques
• Four main techniques– Predictive Modelling– Database Segmentation– Link Analysis– Deviation Direction
27University of Sunderland CIFM06 DB Systems Development Session 17
Data Mining Data Mining TechniquesTechniques
• Predictive Modelling– using observations to form a model of the
important characteristics of some phenomenon
• Techniques:– Classification– Value Prediction
28University of Sunderland CIFM06 DB Systems Development Session 17
Classification Example- Classification Example- Tree InductionTree Induction
Customer renting property> 2 years
Rent property
Rent property Buy property
Customer age> 25 years?
No Yes
No Yes
Source: Connolly and Begg
29University of Sunderland CIFM06 DB Systems Development Session 17
Data Mining Data Mining TechniquesTechniques
• Database Segmentation:– to partition a database into an unknown number
of segments (or clusters) of records which share a number of properties
• Techniques:– Demographic clustering– Neural clustering
30University of Sunderland CIFM06 DB Systems Development Session 17
Segmentation: Segmentation: Scatterplot ExampleScatterplot Example
Source: Connolly and Begg
31University of Sunderland CIFM06 DB Systems Development Session 17
Data Mining Data Mining TechniquesTechniques
• Link Analysis– establish associations between individual
records (or sets of records) in a database• e.g. ‘when a customer rents property for more than
two years and is more than 25 years old, then in 40% of cases, the customer will buy the property’
– Techniques• Association discovery• Sequential pattern discovery• Similar time sequence discovery
32University of Sunderland CIFM06 DB Systems Development Session 17
Data Mining Data Mining TechniquesTechniques
• Deviation Detection– identify ‘outliers’, something which deviates from
some known expectation or norm
– Statistics– Visualisation
33University of Sunderland CIFM06 DB Systems Development Session 17
Deviation Detection: Deviation Detection: Visualisation ExampleVisualisation Example
Source: Connolly and Begg
34University of Sunderland CIFM06 DB Systems Development Session 17
Mining and Mining and WarehousingWarehousing
• Data mining needs single, separate, clean, integrated, self-consistent data source
• Data warehouse well equipped:– populated with clean, consistent data– contains multiple sources– utilises query capabilities– capability to go back to data source
35University of Sunderland CIFM06 DB Systems Development Session 17
Further ReadingFurther Reading
• Connolly and Begg, chapters 31 to 34.
• W H Inmon, Building the Data Warehouse, New York, Wiley and Sons, 1993.
• Benyon-Davies P, Database Systems (2nd ed), Macmillan Press, 2000, ch 34, 35 & 36.