TRANSCRIPT
© 2006 IBM Corporation
IBM Information Server
Simplifying the Creation of the Data Warehouse
2
The New Role of the Data Warehouse
The data warehouse is becoming a more active and integrated participant in enterprise architectures
– A source of the best information in the business
– Active source of analytics
Because of this, the data warehouse has new requirements
– Must be more flexible and adaptable to change
– Must have trustworthy, auditable information
– Must represent the business view
– Must be capable of scaling to meet ever-growing information volumes
3
Critical Success Factors for Data Warehousing
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
4
The IBM Solution: IBM Information Server
Delivering information you can trust
– Understand: Discover, model, and govern information structure and content
– Cleanse: Standardize, merge, and correct information
– Transform: Combine and restructure information for new uses
– Deliver: Synchronize, virtualize, and move information for in-line delivery
All four capabilities run on parallel processing with rich connectivity to applications, data, and content, under unified deployment and unified metadata management.
5
Critical Success Factors for Data Warehousing
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
6
Collaboration for Data Warehouse Design
The IBM Metadata Server connects the roles and tools involved in warehouse design:
– Analysts (IBM Information Analyzer): data-driven analysis, reporting, monitoring, and data rule and integration specification
– Subject matter experts and data stewards (IBM Business Glossary): business definitions and ontology mapped to physical data
– Architects (Rational Data Architect): metadata- and data-driven data modeling and management
– Implementers and data admins (IBM DataStage, IBM QualityStage): database, application, and transformation development
Benefits: simplify integration, increase trust and confidence in information, increase compliance to standards, and facilitate change management and reuse.
Collaboration
7
Collaborative Metadata: From Analysis to Build
A common metamodel provides a seamless flow of metadata:
– Analysis activities populate information into the data flow design
– DataStage users can see table metadata from Information Analyzer
– Analysis results and notes are visible
• Provide insight into the quality of the source
• Provide guidance on how the flow should be defined
• Notes allow free-form collaboration across roles, ensuring knowledge is completely transferred from analysis to build
Collaboration
8
Critical Success Factors for Data Warehousing
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
9
Easily Embed Data Quality with Unified Design
– One design experience that speeds development time
– Extended user orientation in a simplified design environment
– Performance oriented
Auditable Data Quality
10
Measure Data Quality Over Time Using Baseline Reporting
– Compare quality results to a baseline to understand quality changes over time
– Embed profiling tasks into a sequencer to take before-and-after snapshots of data quality and rules adherence
Auditable Data Quality
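The baseline idea is tool-independent and can be sketched in a few lines. The snapshot format below (column name mapped to null rate) and the tolerance threshold are illustrative assumptions, not the Information Server profiling format:

```python
# Hypothetical sketch: compare a data-quality snapshot against a baseline.
# Metric names, snapshot shape, and tolerance are illustrative only.

def quality_drift(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return columns whose null rate moved more than `tolerance` from baseline."""
    drift = {}
    for column, base_rate in baseline.items():
        cur_rate = current.get(column, base_rate)
        if abs(cur_rate - base_rate) > tolerance:
            drift[column] = {"baseline": base_rate, "current": cur_rate}
    return drift

# Before-and-after snapshots taken around a load, as the slide suggests
baseline = {"customer_id": 0.00, "email": 0.05, "postal_code": 0.10}
current = {"customer_id": 0.00, "email": 0.12, "postal_code": 0.11}

# email jumped from 5% to 12% nulls and gets flagged; postal_code stays
# within tolerance and does not
print(quality_drift(baseline, current))
```

Running profiling before and after each load and diffing against the stored baseline is what makes the quality claim auditable rather than anecdotal.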
11
Critical Success Factors for Data Warehousing
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
12
In-Tool Metadata Visibility
Impact analysis results are shown in the Advanced Find window:
– Find dependencies: What does this item depend on?
– Find where used: Where is this item used?
Metadata-Driven Design
13
Job Difference – Integrated Report
The difference report is displayed in Designer; jobs open automatically from hot links in the report.
Options available to:
– Print the report
– Save the report as HTML
Metadata-Driven Design
14
Slowly Changing Dimension Design Acceleration
New engine capabilities
– Surrogate key management
– Updatable in-memory lookups
New and enhanced stages
– Surrogate Key Generator
– Slowly Changing Dimension
Single stage per dimension
– Quick setup and definition
– Easy single point of maintenance
Metadata-Driven Design
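The surrogate-key and dimension machinery above corresponds to the classic Type 2 slowly-changing-dimension pattern. A minimal, tool-independent sketch of that pattern (field names and the in-memory list are illustrative, not the DataStage stage model):

```python
from itertools import count

# Minimal Type 2 SCD sketch. Surrogate keys come from a generator (the role a
# Surrogate Key Generator stage plays); when a tracked attribute changes, the
# old version is expired and a new current version is inserted.

surrogate_keys = count(1)

def apply_scd2(dimension: list, incoming: dict, batch_date: str) -> None:
    """Upsert one incoming record into the dimension, Type 2 style."""
    for row in dimension:
        if row["natural_key"] == incoming["natural_key"] and row["current"]:
            if row["attrs"] == incoming["attrs"]:
                return  # no change, nothing to do
            row["current"] = False          # expire the old version
            row["end_date"] = batch_date
            break
    dimension.append({
        "surrogate_key": next(surrogate_keys),
        "natural_key": incoming["natural_key"],
        "attrs": incoming["attrs"],
        "start_date": batch_date,
        "end_date": None,
        "current": True,
    })

dim = []
apply_scd2(dim, {"natural_key": "C100", "attrs": {"city": "Austin"}}, "2006-01-01")
apply_scd2(dim, {"natural_key": "C100", "attrs": {"city": "Boston"}}, "2006-06-01")
# dim now holds two versions of C100: the Austin row expired, the Boston row current
```

Packing this expire-and-insert logic into a single stage per dimension is what gives the "single point of maintenance" the slide describes.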
15
Rapid Connectivity: Common Connectors
– Connection objects allow properties to be dropped onto a stage
– Diagram lets you select the link to edit as though you're on the canvas
– Warning signs indicate which fields are mandatory
– Test the connection instantly
– Parameter button on every field
– Graphical ODBC-specific SQL builder
Metadata-Driven Design
16
Critical Success Factors for Data Warehousing
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
17
Reuse: Find It
Find an item in the Repository tree:
– In-place find
– Find by name (full or partial)
– Wildcard support
– Find next
– Filter on type
Reuse
18
Find – Advanced Search Criteria
Search on the following criteria:
– Object type (job, table definition, stage, etc.)
– Creation date/time and user
– Last modification date/time and user
– Where used: What other objects use this object?
– Dependencies of: What does this object use?
Options:
– Case sensitivity
– Match on "name & description" or "name or description"
Reuse
19
Reuse: Connection Objects
Allow saving of a reusable connection path to a specific source or target (username, password, database name, etc.). Can be used for:
– Stage connection properties, either loaded in the stage editor or dragged and dropped from the Repository tree
– Metadata import from that source or target
– Dragging a table imported from that source or target onto the canvas to create a pre-configured stage instance
Reuse
20
Reuse: Job Parameter Sets
A new repository object that contains the names and values of job parameters. A parameter set can be referenced by one or more jobs, enabling easier deployment of jobs across machines and easy propagation of a changed parameter value.
Reuse
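The parameter-set idea can be illustrated with a plain-Python stand-in (this is not the DataStage object model; the set name, keys, and values are made up): a named bundle of parameters that several jobs resolve at run time, so one edit propagates to every referencing job.

```python
# Illustrative sketch of a shared job parameter set. Names and values are
# hypothetical; the point is the single shared definition.

parameter_sets = {
    "warehouse_env": {"db_host": "dev-db01", "schema": "DW", "commit_size": 2000},
}

def resolve_params(set_name: str, overrides: dict = None) -> dict:
    """A job resolves its parameters from the shared set, plus local overrides."""
    params = dict(parameter_sets[set_name])
    params.update(overrides or {})
    return params

# Two jobs reference the same set; one may still override a single value
load_job = resolve_params("warehouse_env")
audit_job = resolve_params("warehouse_env", {"commit_size": 500})

# Redeploying to another machine means editing one place...
parameter_sets["warehouse_env"]["db_host"] = "prod-db01"
# ...and every job that references the set picks up the change on next resolve
```

The override argument mirrors the common case where a job keeps most shared values but tunes one locally.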
21
Reuse: Simply Deploy Data Flows as Shared Services
– Automates the creation of information integration services, including federation
– Provides fundamental infrastructure services (security, logging, monitoring)
– Provisions services to leading bindings: JMS, EJB, and SOAP over HTTP
– Provides load balancing and fault tolerance for requests across multiple service providers
Reuse
22
Critical Success Factors for Data Warehousing
Scalability
– Seamless expansion of capacity
– Resource estimation
– Accurate performance analysis and balancing
Collaboration
– Seamless flow of metadata across roles
– Shared understanding between business and IT
– Team development
Reuse
– Integrated object search
– Object reuse optimization
– Reuse of data flows through shared services
Auditable Data Quality
– Ensure quality is embedded in data flows
– Understand quality changes over time
– Provide proof of quality and lineage
Metadata-Driven Design Acceleration & Automation
– Automate the connection between design and build tasks
– Provide in-tool metadata visibility
– Easily connect to any data source
23
Job Performance Analysis
A new visualization tool that:
– Provides deeper insight into runtime job behavior
– Offers several categories of visualizations, including:
• Record throughput
• CPU utilization
• Job timing
• Job memory utilization
• Physical machine utilization
– Hides runtime complexity by emphasizing the stages the customer placed on the designer canvas
Scalability
24
Record Throughput
– Breakdown of records read and records written per second
– Initially filtered to show one line for each link drawn on the canvas
– Names used for each dataset are the actual stage names on the canvas
– Advanced users can turn off filters and see every runtime dataset, including the inner operators of composites and inserted operator datasets
– One tab per partition, plus an overlay view of every partition for smaller jobs
Scalability
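The records-per-second series behind such a chart is simple to derive: sample a cumulative record counter on each link at a fixed interval and difference the samples. A sketch with made-up sample data (not the tool's internal format):

```python
# Turn cumulative record counts, sampled at a fixed interval on one link,
# into the records/second series a throughput chart would plot.

def throughput_per_second(cumulative: list, interval_s: float) -> list:
    """Difference successive cumulative samples and scale to records/second."""
    return [(b - a) / interval_s for a, b in zip(cumulative, cumulative[1:])]

# Hypothetical counter sampled every 2 seconds on a single canvas link;
# the flat stretch (12000 -> 12000) shows up as a zero-throughput interval
samples = [0, 5000, 12000, 12000, 20000]
print(throughput_per_second(samples, 2.0))  # [2500.0, 3500.0, 0.0, 4000.0]
```

Plotting one such series per canvas link, per partition, reproduces the filtered view the slide describes.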
25
CPU Utilization
– Visualizes the CPU time of each operator
– Shows which operators dominated the CPU at different points during the run
– Percentage view shows what share of the job's CPU load each stage on the canvas was responsible for
– Inserted operators and composite sub-operators are automatically bundled into these results
– Advanced users can view operator combination, which changes the chart to reflect each process and the stages it contains
Charts: percentage CPU pie chart; total CPU and system time
Scalability
26
Physical Machine Utilization
Charts: disk throughput, average process distribution, percent CPU utilization, and a free-memory whisker box
Scalability
27
Resource Estimation
Provides estimates for required disk space and CPU utilization. Helps with:
– Job design: detect bottlenecks and optimize transformation logic to improve performance
– Error protection: run with a range of data of particular interest for better protection from job aborts due to bad data formats or insufficient null handling
– Resource allocation: determine allocation of scratch space and disk space to protect the job from aborts due to lack of space
Two statistical models:
– Static: provides worst-case disk space estimates based on schema and job design
– Dynamic: runs the job and statistically samples actual resource usage, then provides calculated estimates per node
Scalability
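The static model's worst case can be pictured as nothing more than schema arithmetic: every field contributes its maximum width, and variable-length fields contribute their declared maximum. A sketch with illustrative type widths (the byte sizes and schema are assumptions, not the product's cost model):

```python
# Sketch of a "static" worst-case disk estimate derived only from the schema
# and an expected row count, before any run. Type widths are illustrative.

TYPE_WIDTHS = {"int32": 4, "int64": 8, "decimal": 16, "varchar": None}

def worst_case_row_bytes(schema: list) -> int:
    """Sum maximum field widths; a varchar contributes its declared maximum."""
    total = 0
    for name, dtype, length in schema:
        width = TYPE_WIDTHS[dtype]
        total += length if width is None else width
    return total

schema = [("order_id", "int64", None),
          ("amount", "decimal", None),
          ("customer_name", "varchar", 100)]

rows = 50_000_000
estimate_bytes = worst_case_row_bytes(schema) * rows  # 124 bytes/row worst case
print(f"Worst-case staging space: {estimate_bytes / 1e9:.1f} GB")  # 6.2 GB
```

The dynamic model replaces these pessimistic widths with sampled actuals from a real run, which is why its per-node estimates are tighter.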
28
Resource Estimation Tool Layout
Scalability
29
Migration Path
Seamless upgrade for WebSphere DataStage users into the IBM Information Server
– All DataStage jobs, along with all other objects, will migrate into the IBM Information Server
Upgrade for WebSphere QualityStage into the IBM Information Server
– Existing QualityStage projects will migrate into the IBM Information Server
– Conversion utilities for the Standardize, Match, and Survive stages
– All other stages will continue to execute
30
Summary
– Data warehouses are becoming tier-one operational systems in many companies
– They must adapt to change more quickly, must have authoritative information, and must be scalable
– Platforms for building data warehouses must support metadata-driven design, collaboration, reuse, and auditable data quality, and must scale to support growing data volumes
– IBM Information Server provides all of this in a unified platform
IBM Information Server
Metadata-Driven Design Acceleration & Automation
Auditable Data Quality
Scalability
Collaboration
Reuse
31