TRANSCRIPT
© 2006 IBM Corporation
IBM Information Server
Simplifying the Creation of the Data Warehouse
2
The New Role of the Data Warehouse
The data warehouse is becoming a more active and integrated participant in enterprise architectures
– A source of the best information in the business
– Active source of analytics
Because of this, the data warehouse has new requirements
– Must be more flexible and adaptable to change
– Must have trustworthy, auditable information
– Must represent the business view
– Must be capable of scaling to meet ever-growing information volumes
3
Critical Success Factors for Data Warehousing
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
4
The IBM Solution: IBM Information Server
Delivering information you can trust
– Understand: Discover, model, and govern information structure and content
– Cleanse: Standardize, merge, and correct information
– Transform: Combine and restructure information for new uses
– Deliver: Synchronize, virtualize, and move information for in-line delivery
All four capabilities run on parallel processing with rich connectivity to applications, data, and content, under unified deployment and unified metadata management.
5
Critical Success Factors for Data Warehousing
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
6
Collaboration for Data Warehouse Design
The IBM Metadata Server connects the roles and tools involved in warehouse design:
– Analysts (IBM Information Analyzer): data-driven analysis, reporting, monitoring, and data rule and integration specification
– Subject matter experts and data stewards (IBM Business Glossary): business definitions and ontology mapped to physical data
– Architects (Rational Data Architect): metadata- and data-driven data modeling and management
– Implementers and data admins (IBM DataStage, IBM QualityStage): database, application, and transformation development
Benefits: simplify integration, increase trust and confidence in information, increase compliance to standards, and facilitate change management and reuse.
Collaboration
7
Collaborative Metadata: From Analysis to Build
A common metamodel provides a seamless flow of metadata:
– Analysis activities populate information into the data flow design
– DataStage users can see table metadata from Information Analyzer
– Analysis results and notes are visible
• Provide insight into the quality of the source
• Provide guidance on how the flow should be defined
• Notes allow free-form collaboration across roles, ensuring knowledge is completely transferred from analysis to build
Collaboration
8
Critical Success Factors for Data Warehousing
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
9
Easily Embed Data Quality with Unified Design
– One design experience that speeds development time
– Extended user orientation in a simplified design environment
– Performance oriented
Auditable Data Quality
10
Measure Data Quality Over Time Using Baseline Reporting
– Compare quality results to a baseline to understand quality changes over time
– Embed profiling tasks into a sequencer to take before-and-after snapshots of data quality and rules adherence
Auditable Data Quality
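The baseline idea is tool-independent and can be sketched in a few lines. The snapshot format below (column name mapped to null rate) and the tolerance threshold are illustrative assumptions, not the Information Server profiling format:

```python
# Hypothetical sketch: compare a data-quality snapshot against a baseline.
# Metric names, snapshot shape, and tolerance are illustrative only.

def quality_drift(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return columns whose null rate moved more than `tolerance` from baseline."""
    drift = {}
    for column, base_rate in baseline.items():
        cur_rate = current.get(column, base_rate)
        if abs(cur_rate - base_rate) > tolerance:
            drift[column] = {"baseline": base_rate, "current": cur_rate}
    return drift

# Before-and-after snapshots taken around a load, as the slide suggests
baseline = {"customer_id": 0.00, "email": 0.05, "postal_code": 0.10}
current = {"customer_id": 0.00, "email": 0.12, "postal_code": 0.11}

# email jumped from 5% to 12% nulls and gets flagged; postal_code stays
# within tolerance and does not
print(quality_drift(baseline, current))
```

Running profiling before and after each load and diffing against the stored baseline is what makes the quality claim auditable rather than anecdotal.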
11
Critical Success Factors for Data Warehousing
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
12
In-Tool Metadata Visibility
Impact analysis results are shown in the Advanced Find window:
– Find dependencies: What does this item depend on?
– Find where used: Where is this item used?
Metadata-Driven Design
13
Job Difference – Integrated Report
The difference report is displayed in Designer; jobs open automatically from hot links in the report.
Options available to:
– Print the report
– Save the report as HTML
Metadata-Driven Design
14
Slowly Changing Dimension Design Acceleration
New engine capabilities
– Surrogate key management
– Updatable in-memory lookups
New and enhanced stages
– Surrogate Key Generator
– Slowly Changing Dimension
Single stage per dimension
– Quick setup and definition
– Easy single point of maintenance
Metadata-Driven Design
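The surrogate-key and dimension machinery above corresponds to the classic Type 2 slowly-changing-dimension pattern. A minimal, tool-independent sketch of that pattern (field names and the in-memory list are illustrative, not the DataStage stage model):

```python
from itertools import count

# Minimal Type 2 SCD sketch. Surrogate keys come from a generator (the role a
# Surrogate Key Generator stage plays); when a tracked attribute changes, the
# old version is expired and a new current version is inserted.

surrogate_keys = count(1)

def apply_scd2(dimension: list, incoming: dict, batch_date: str) -> None:
    """Upsert one incoming record into the dimension, Type 2 style."""
    for row in dimension:
        if row["natural_key"] == incoming["natural_key"] and row["current"]:
            if row["attrs"] == incoming["attrs"]:
                return  # no change, nothing to do
            row["current"] = False          # expire the old version
            row["end_date"] = batch_date
            break
    dimension.append({
        "surrogate_key": next(surrogate_keys),
        "natural_key": incoming["natural_key"],
        "attrs": incoming["attrs"],
        "start_date": batch_date,
        "end_date": None,
        "current": True,
    })

dim = []
apply_scd2(dim, {"natural_key": "C100", "attrs": {"city": "Austin"}}, "2006-01-01")
apply_scd2(dim, {"natural_key": "C100", "attrs": {"city": "Boston"}}, "2006-06-01")
# dim now holds two versions of C100: the Austin row expired, the Boston row current
```

Packing this expire-and-insert logic into a single stage per dimension is what gives the "single point of maintenance" the slide describes.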
15
Rapid Connectivity: Common Connectors
– Connection objects allow properties to be dropped onto a stage
– Diagram lets you select the link to edit as though you're on the canvas
– Warning signs indicate which fields are mandatory
– Test the connection instantly
– Parameter button on every field
– Graphical ODBC-specific SQL builder
Metadata-Driven Design
16
Critical Success Factors for Data Warehousing
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
17
Reuse: Find It
Find an item in the Repository tree:
– In-place find
– Find by name (full or partial)
– Wildcard support
– Find next
– Filter on type
Reuse
18
Find – Advanced Search Criteria
Search on the following criteria:
– Object type (job, table definition, stage, etc.)
– Creation date/time and user
– Last modification date/time and user
– Where used: What other objects use this object?
– Dependencies of: What does this object use?
Options:
– Case sensitivity
– Match on "name & description" or "name or description"
Reuse
19
Reuse: Connection Objects
Allow saving of a reusable connection path to a specific source or target (username, password, database name, etc.). Can be used for:
– Stage connection properties, either loaded in the stage editor or dragged and dropped from the Repository tree
– Metadata import from that source or target
– Dragging a table imported from that source or target onto the canvas to create a pre-configured stage instance
Reuse
20
Reuse: Job Parameter Sets
A new repository object that contains the names and values of job parameters. A parameter set can be referenced by one or more jobs, enabling easier deployment of jobs across machines and easy propagation of a changed parameter value.
Reuse
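The parameter-set idea can be illustrated with a plain-Python stand-in (this is not the DataStage object model; the set name, keys, and values are made up): a named bundle of parameters that several jobs resolve at run time, so one edit propagates to every referencing job.

```python
# Illustrative sketch of a shared job parameter set. Names and values are
# hypothetical; the point is the single shared definition.

parameter_sets = {
    "warehouse_env": {"db_host": "dev-db01", "schema": "DW", "commit_size": 2000},
}

def resolve_params(set_name: str, overrides: dict = None) -> dict:
    """A job resolves its parameters from the shared set, plus local overrides."""
    params = dict(parameter_sets[set_name])
    params.update(overrides or {})
    return params

# Two jobs reference the same set; one may still override a single value
load_job = resolve_params("warehouse_env")
audit_job = resolve_params("warehouse_env", {"commit_size": 500})

# Redeploying to another machine means editing one place...
parameter_sets["warehouse_env"]["db_host"] = "prod-db01"
# ...and every job that references the set picks up the change on next resolve
```

The override argument mirrors the common case where a job keeps most shared values but tunes one locally.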
21
Reuse: Simply Deploy Data Flows as Shared Services
– Automates the creation of information integration services, including federation
– Provides fundamental infrastructure services (security, logging, monitoring)
– Provisions services to leading bindings: JMS, EJB, and SOAP over HTTP
– Provides load balancing and fault tolerance for requests across multiple service providers
Reuse
22
Critical Success Factors for Data Warehousing
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
23
Job Performance Analysis
A new visualization tool that:
– Provides deeper insight into runtime job behavior
– Offers several categories of visualizations, including:
• Record throughput
• CPU utilization
• Job timing
• Job memory utilization
• Physical machine utilization
– Hides runtime complexity by emphasizing the stages the customer placed on the designer canvas
Scalability
24
Record Throughput
– Breakdown of records read and records written per second
– Initially filtered to show one line for each link drawn on the canvas
– Names used for each dataset are the actual stage names on the canvas
– Advanced users can turn off filters and see every runtime dataset, including the inner operators of composites and inserted operator datasets
– One tab per partition, plus an overlay view of every partition for smaller jobs
Scalability
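The records-per-second series behind such a chart is simple to derive: sample a cumulative record counter on each link at a fixed interval and difference the samples. A sketch with made-up sample data (not the tool's internal format):

```python
# Turn cumulative record counts, sampled at a fixed interval on one link,
# into the records/second series a throughput chart would plot.

def throughput_per_second(cumulative: list, interval_s: float) -> list:
    """Difference successive cumulative samples and scale to records/second."""
    return [(b - a) / interval_s for a, b in zip(cumulative, cumulative[1:])]

# Hypothetical counter sampled every 2 seconds on a single canvas link;
# the flat stretch (12000 -> 12000) shows up as a zero-throughput interval
samples = [0, 5000, 12000, 12000, 20000]
print(throughput_per_second(samples, 2.0))  # [2500.0, 3500.0, 0.0, 4000.0]
```

Plotting one such series per canvas link, per partition, reproduces the filtered view the slide describes.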
25
CPU Utilization
– Visualizes the CPU time of each operator
– Shows which operators dominated the CPU at different points during the run
– Percentage view shows what share of the job's CPU load each stage on the canvas was responsible for
– Inserted operators and composite sub-operators are automatically bundled into these results
– Advanced users can view operator combination, which changes the chart to reflect each process and the stages it contains
Charts: percentage CPU pie chart; total CPU and system time
Scalability
26
Physical Machine Utilization
Charts: disk throughput, average process distribution, percent CPU utilization, and a free-memory whisker box
Scalability
27
Resource Estimation
Provides estimates for required disk space and CPU utilization. Helps with:
– Job design: detect bottlenecks and optimize transformation logic to improve performance
– Error protection: run with a range of data of particular interest for better protection from job aborts due to bad data formats or insufficient null handling
– Resource allocation: determine allocation of scratch space and disk space to protect the job from aborts due to lack of space
Two statistical models:
– Static: provides worst-case disk space estimates based on schema and job design
– Dynamic: runs the job and statistically samples actual resource usage, then provides calculated estimates per node
Scalability
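The static model's worst case can be pictured as nothing more than schema arithmetic: every field contributes its maximum width, and variable-length fields contribute their declared maximum. A sketch with illustrative type widths (the byte sizes and schema are assumptions, not the product's cost model):

```python
# Sketch of a "static" worst-case disk estimate derived only from the schema
# and an expected row count, before any run. Type widths are illustrative.

TYPE_WIDTHS = {"int32": 4, "int64": 8, "decimal": 16, "varchar": None}

def worst_case_row_bytes(schema: list) -> int:
    """Sum maximum field widths; a varchar contributes its declared maximum."""
    total = 0
    for name, dtype, length in schema:
        width = TYPE_WIDTHS[dtype]
        total += length if width is None else width
    return total

schema = [("order_id", "int64", None),
          ("amount", "decimal", None),
          ("customer_name", "varchar", 100)]

rows = 50_000_000
estimate_bytes = worst_case_row_bytes(schema) * rows  # 124 bytes/row worst case
print(f"Worst-case staging space: {estimate_bytes / 1e9:.1f} GB")  # 6.2 GB
```

The dynamic model replaces these pessimistic widths with sampled actuals from a real run, which is why its per-node estimates are tighter.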
28
Resource Estimation Tool Layout
Scalability
29
Migration Path
Seamless upgrade for WebSphere DataStage users into the IBM Information Server
– All DataStage jobs, along with all other objects, will migrate into the IBM Information Server
Upgrade for WebSphere QualityStage into the IBM Information Server
– Existing QualityStage projects will migrate into the IBM Information Server
– Conversion utilities for the Standardize, Match, and Survive stages
– All other stages will continue to execute
30
Summary
– Data warehouses are becoming tier-one operational systems in many companies
– They must adapt to change more quickly, must have authoritative information, and must be scalable
– Platforms for building data warehouses must support metadata-driven design, collaboration, reuse, and auditable data quality, and must scale to support growing data volumes
– IBM Information Server provides all of this in a unified platform
IBM Information Server
Metadata-Driven Design Acceleration & Automation
Auditable Data Quality
Scalability
Collaboration
Reuse
31