ist722 data warehousing - syracuse...

19
IST722 Data Warehousing ETL Design and Development Michael A. Fudge, Jr.

Upload: vonhan

Post on 16-Feb-2018

222 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

IST722 Data Warehousing

ETL Design and Development

Michael A. Fudge, Jr.

Page 2: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Recall: Kimball Lifecycle

Presenter
Presentation Notes
Describes an approach for data warehouse projects
Page 3: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Objective:Outline ETL design and development

process.A “Recipe” for ETL

Page 4: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Before You BeginBefore you begin, you’ll need 1. Physical Design –

Star Schema implementation in ROLAP.

2. Architecture Plan – understanding of your DW/BI architecture.

3. Source to Target Mapping –Part of the detailed design process.

Page 5: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

The Plan• How the 34

subsystems map /related to the 10 step plan.

Page 6: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 1 – Draw The High Level Plan

Page 7: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 2 – Choose an ETL Tool

• Your ETL tool is responsible for moving data from the various sources into the data warehouse.

• Programming language vs. Graphical tool.• PL Flexibility, Customizable• GT Self Documenting, Easy for beginners

Page 8: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

ETL: Code vs Tool

Which of these is easier to understand?

Page 9: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 3 – Develop Detailed Strategies

• Data Extraction & Archival of Extracted Data• Data quality checks on dimensions & facts• Manage changes to dimensions• Ensure the DW and ETL meet systems

availability requirements• Design a data auditing subsystem• Organize the staging data

Page 10: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

The Role of theStaging Database

• Staging database stores copies of source extracts• Can create a history when none exists.• Reduces unnecessary processing of data source.

Data Sources

StagingDatabase

Data Warehouse

EXTRACT LOAD

ETL:TRANSFORM

ELT:TRANSFORM

Page 11: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 4 – Drill Down by Target Table

• Start drilling down into the detailed source to target flow for each target dimension and fact table

• Flowcharts and pseudo code are useful for building out your transformation logic.

• ETL Tools allow you to build and document the data flow at the same time:

Page 12: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 5 – Populate Dimension Tables w/ Historic Data

• Part of the one-time historic processing step.• Start with the simplest dimension table

(usually type 1 SCD’s)• Transformations

o Combine from separate sourceso Convert data ex. EBCDIC ASCIIo Decode production codes ex. TTT Track-Type Tractoro Verify rollups ex: Category Producto Ensure a “Natural” or “Business” key exists for SCD’so Assign Surrogate Keys to Dimension table

Page 13: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 6 – Perform the Fact Table Historic Load

• Part of the one-time historic processing step.• Transformations:

o Replace special codes (eg. -1) with NULL on additive and semi- additive facts

o Calculate and store complex derived facts ex: shipping amount is divided among the number of items on the order.

o Pivot rows into columns ex: account type, amount checking amount, savings amount

o Associate with Audit Dimensiono Lookup Dimension Keys using Natural/Business

Keys….

Page 14: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Example Surrogate Key Pipeline

Handles SCD’s

Page 15: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 7 – Dimension Table Incremental Processing

• Oftentimes the same logic used in the Historic load can be used.

• Identify New/ Changed data based on different attributes for the same natural keyo ETL tools usually can assist with this logic.

Page 16: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 8 – Fact Table Incremental Processing

• A complex ETL:o Can be difficult to determine which facts need

to be processed?o What happens to a fact when it is re-processed?o What if a dimension key lookup fails?

• Some ETL tool assist with processing this logic.o Degenerate dimensions can be used ex:

transaction number in order summaryo A combination of dimension keys ex: StudentKey

and ClassKey for grade processing.

Page 17: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 9 – Aggregate Table and OLAP Loads

• Further processing beyond the ROLAP star schema.

• Refresh / Reprocess o MOLAP cubeso INDEXED / MATERIALIZED viewso Aggregate summary tables

Page 18: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

Step 10 – ETL System Operation & Automation

• Schedule jobs• Catch and Log errors / exceptions• Database management tasks:

o Cleanup old datao Shrink Databaseo Rebuild indexeso Update Statistics

Page 19: IST722 Data Warehousing - Syracuse Universityclasses.ischool.syr.edu/ist722fudge/old/units/09/ETL-Design-And... · Step 2 – Choose an ETL Tool • Your ETL tool is responsible for

IST722 Data Warehousing

ETL Design and Development

Michael A. Fudge, Jr.