designing the data warehouse / data mart methodologies and techniques
Post on 15-Jan-2016
218 Views
Preview:
TRANSCRIPT
Designing the data warehouse/ data mart
Methodologies and Techniques
Mainframe and non-mainframe data sources
Extraction, Transformation and Loading Tool - for data quality and integrity
Using flat-files or a staging area
Relational Database on a dedicated Server:
Normalised, Snowflake or Star data schema
Static/Flexible Reports:
Batch and on-line reporting environment
Data Mining:
Using an inductive logic engine to identify correlations and hidden relationships
Analytical Tool:
Interactive analysis -OLAP tool
User Interface Deployment:
Web deployment, Power-user client interfaces
Need to provide rich presentation visualisation, user interaction and supplementary documentation
Scrutinising MultidimensionalSummary Cubes
Surveying Transaction-level
Data
Showing Reportsand
Analytical Results
Data Warehouse Store
Source Systems
Discovering Surprise
Relationships
Data Staging for Cleansing
Basic principles
Life cycle of the DW
Operational DatabasesOperational Databases Warehouse DatabaseWarehouse Database
First time loadFirst time load
RefreshRefresh
RefreshRefresh
RefreshRefresh
Purge or ArchivePurge or Archive
Data transfers into a database
• First time system implementation– From a manual system
• Data warehousing projects
• Database version upgrade
• ERP projects
• Migration– From old to new system
Data transfers between systems
• Dynamic data (eg. sales orders)– Interface required?
• Static data (eg. customers)– Conversion required?
What can go wrong
• Data not available– feature activated from implementation onwards– Massive data entry– Eg: different account structure
• Data incomplete• Data inconsistent (eg: engineering vs accounts)• Wrong level of granularity• Data not clean• New system requires changes – new product
codes
Data cleaning must address
• Different department record same info under different codes
• Multiple records of same company (under different names)
• Fields missing in input tables (eg: c/o)• Different depts. Record different
addresses for same customer• Use of different units for time periods
Labour intensive tasks
• Data entry• Data checks• Working on solving conflicts• Allocating new codes
• Solution = introduce as much automation as possible– SQL / SQL loader (Oracle)– Custom conversion programmes to extract, modify and upload
data– Filtering– Parsing (eg: excel)– Staging areas for conversion in progress
Data utilities
• ORACLE is king of data handling• Export: to transfer data between DBs
– Extract both table structure and data content into dump file
• Import: corresponding facility• SQL*loader automatic import from a
variety of file formats into DB files– Needs a control file
Control files: using SQLloader
• Data tranfers in and out of DB can be automated using the loader– Create a data file with the data(!)– Create a control file to guide the operation
• Load creates two files– Log file– “bad transactions” file
• Also a discard file if control file has selection criteria in it
Example 1 – the supplier file
Sup code Sup name Sup address City Phone4 digits
OLD
New supplier code to include city where firm is basedAssignation of category based on amounts purchased
Example 1 – the supplier file
Sup code Sup name Sup address City Phone4 digits
Sup code Sup name Sup address… Phone Cat3 letters + 1,2,3 depending4 digits on total purchases
last year
OLD
NEW
New supplier code to include city where firm is basedAssignation of category based on amounts purchased
Example 2 – New Cost Accounting Structure
Maintenance department expenditure:1 account => separate accounts for different production activities
Intervention code Desc. Date LabourParts Total
OLD
Example 2 – New Cost Accounting Structure
Maintenance department expenditure:1 account => separate accounts for different production activities
Intervention code Desc. Date LabourParts Total
Intervention code Desc. Date labour Parts Total Account
OLD
NEW
Example 3: merging files
• Complete customer file based on Accounts and Sales and Shipping
OLD (finance)
CustID name address city account number credit limit balance
OLD (sales)
OLD (Shipping)
CustID* name address city discount rates sales_to_date rep_name
CustID** name address city Preferred haulier
Example 4: change of business practices
• Payment by bank draft for international customers• Automatic payment into account for national customers
• Payment direct into account for all customers
Data Staging Area
• The construction site for the warehouse• Required by most scenarios• Connected to wide variety of sources• Clean / aggregate / compute / validate
data
ExtractExtract
TransformTransform
Operationalsystem TransportTransport
(Load)(Load)
Warehouse
Datastaging
area
Remote Staging ModelData staging area within the warehouse environment
Extract,Extract,transform,transform,transporttransport
TransformTransform
Operationalsystem TransportTransport
(Load)(Load)
Datastaging
areaWarehouse
Warehouse environmentWarehouse environmentOper. envt.Oper. envt.
Data staging area in its own environment, avoiding negative impact on the warehouse environment
Extract,Extract,transform,transform,transporttransport
TransformTransform
Operationalsystem TransportTransport
(Load)(Load)
Datastaging
areaWarehouse
Staging envt.Staging envt.Oper. envt.Oper. envt. Warehouse envt.Warehouse envt.
Onsite Staging Model
ExtractExtract
TransformTransform
Operationalsystem TransportTransport
(Load)(Load)
Datastaging
area Warehouse
Operational environmentOperational environment WH envt.WH envt.
Data staging area within the operational environment,
possibly affecting the operational system
Data Mart • A subset of a data warehouse that
supports the requirements of a particular department or business function.
• Characteristics include:– Do not normally contain detailed operational data
unlike data warehouses.– May contain certain levels of aggregation
MarketingSales
FinanceHuman Resources
Dependent Data Mart
Data Data WarehouseWarehouse
Data MartsData Marts
External DataExternal Data
Flat FilesFlat Files
Operational Systems Marketing
Sales
Finance
Independent Data Mart
Sales or MarketingSales or Marketing
External DataExternal Data
Flat FilesFlat FilesOperational Systems
Reasons for Creating a Data Mart
• To give users more flexible access to the data they need to analyse most often.
• To provide data in a form that matches the specific needs of a group of users
• To improve end-user response time.
• Potential users of a data mart are clearly defined and can be targeted for support
Reasons for Creating a Data Mart
• To provide appropriately structured data as dictated by the requirements of the end-user access tools.
• Building a data mart is simpler (and much quicker) compared with establishing a corporate data warehouse.
• The cost of implementing data marts is far less than that required to establish a data warehouse.
Exploiting the DW data
• DW is a platform for creating a wide array of reports
• It solves data feed problems, but does not lead to specific decision support
• Need a model for organising data into meaningful reports
• Need specific interfaces for users
ExtractionCleaningTransformationLoading
Relational Databaseon a dedicated Server
De normalised, data
Static Reporting
Scrutinising
MultidimensionalData CubesOLAP tools
Data Warehouse
Source Systems
Discovering
Data Mining…….
Data StagingArea
Exploiting the DW data
Multidimensional Models
The data is found at the intersection of dimensions.
ProductProduct
P/L_LineP/L_Line
TimeTime
FINANCE
MarketMarket
ProductProduct
TimeTime
SALES
CustomerCustomer
Representing multidimensional data
MOLAP Server• The application layer
stores data in a multidimensional structure
• The presentation layer provides the multidimensional view MOLAP
Engine
DSS clientDSS client
Application Application layerlayer
WarehouseWarehouse
• Efficient storage and processing• Complexity hidden from the
user (but NOT from developer)• Analysis using preaggregated
summaries and precalculated measures
ROLAP Server
• The warehouse stores atomic data.
• The application layer generates SQL for the three- dimensional view.
• The presentation layer provides the multidimensional view.
ROLAPengine
DSS clientDSS client
Application Application layerlayer
WarehouseWarehouseserverserver
Multiple Multiple SQLSQL
MOLAPMOLAP
ServerServer useruserWarehouseWarehouse
QueryQuery
DataData
MDDBMDDB
PeriodicPeriodicloadload
ROLAP
ServerServer useruserWarehouseWarehouse
DataDatacachecache
LiveLivefetchfetch
CacheCache
QueryQuery
DataData
Also Hybrid (HOLAP)
Choosing a Reporting Architecture
• Business needs
• Potential for growth
• interface
• enterprise architecture
• Network architecture
• Speed of access
• Openness
MOLAP
ROLAP
SimpleSimple ComplexComplex
QueryQueryPerformancePerformance
GoodGood
OKOK
AnalysisAnalysis
Modeling• Warehouses differ from operational Warehouses differ from operational
structures: structures: – Analytical requirementsAnalytical requirements– Subject orientationSubject orientation
• Data must map to subject oriented Data must map to subject oriented information:information:– Identify business subjectsIdentify business subjects– Define relationships between subjectsDefine relationships between subjects– Name the attributes of each subjectName the attributes of each subject
• Modeling is iterativeModeling is iterative• Modeling tools are availableModeling tools are available
1. Defining the business model
2. Creating the dimensional model
3. Modeling summaries
4. Creating the physical model
Physical model
11
2, 32, 3
44
Select a business process
Modeling the Data Warehouse
Identifying Business Rules
Product
Type Monitor Status
PC 15 inch NewServer 17 inch Rebuilt
19 inch CustomNone
Location
Geographic proximity
0 - 1 miles1 - 5 miles > 5 miles
Store
Store > District > Region
Time
Month > Quarter > Year
Creating the Dimensional ModelIdentify fact tables– Translate business measures into fact
tables– Analyze source system information for
additional measures– Identify base and derived measures– Document additivity of measures
Identify dimension tablesLink fact tables to the dimension tablesCreate views for users
Dimension TablesDimension tables have the following
characteristics:
• Contain textual information that represents the attributes of the business
• Contain relatively static data
• Are joined to a fact table through a foreign key reference Product Channel
Facts(units,price)
Customer Time
Fact TablesFact tables have the following characteristics:• Contain numeric measures (metrics) of the
business• May contain summarized (aggregated) data• May contain date-stamped data• Are typically additive• Have key value that is typically a
concatenated key composed of the primary keys of the dimensions
• Joined to dimension tables through foreign keys that reference primary keys in the dimension tables
Dimensional Model (Star Schema)
Dimensional Model (Star Schema)
Product Channel
Facts(units,price)
Customer Time
Dimension tablesDimension tables
Fact tableFact table
Star Schema Model
• Central fact table
• Radiating dimensions
• Denormalized model
Store TableStore_idDistrict_id...
Item TableItem_idItem_desc...
Time TableDay_idMonth_idPeriod_idYear_id
Product TableProduct_idProduct_desc…
Sales Fact TableProduct_idStore_idItem_idDay_idSales_dollarsSales_units...
Star Schema Model
• Easy for users to understand
• Fast response to simple queries
• Simple metadata
• Supported by many front end tools
• Less robust to change
• Does not support history
Using Summary Data
• Provides fast access to precomputed data
• Reduces use of I/O, CPU, and memory
• Is distilled from source systems and precalculated summaries
• Usually exists in summary fact tables
Phase 3: Modeling summaries
Designing Summary Tables
Units Sales(€) Store
Product ATotal
Product BTotal
Product CTotal
• Average• Maximum
• Total• Percentage
Summary Tables Example
SALES FACTSSales Region Month10,000 North Jan 9912,000 South Feb 9911,000 North Jan 9915,000 West Mar 9918,000 South Feb 9920,000 North Jan 9910,000 East Jan 992,000 West Mar 99
SALES BY MONTH/REGIONMonth Region Tot_Sales$Jan 99 North 41,000Jan 99 East 10,000Feb 99 South 40,000Mar 99 West 17,000
SALES BY MONTHMonth Tot_SalesJan 99 51,000Feb 99 40,000Mar 99 17,000
top related