Download - Managing Data Integration Initiatives
Leading Practices and Insights for Managing Data Integration
Initiatives
May 7, 2010
04/12/232
Agenda
Introductions
Overview
Key Drivers
Approaches / Strategy
Tools
Case Studies
Success Factors / Lessons Learned
Q&A
04/12/233
About Allin Wakefield, MA, based service provider of enterprise-quality solutions
and services to small and large companies Consulting practice provides technical and business expertise that
helps businesses implement strategic solutions, integrate key functions, and extend technical capability. Services include:
Integrative Application Development and Performance Tuning Systems Integration and Management Data Migration to comprehensive Design Services.
Solution focused in two key areas:
Microsoft® SharePoint® Virtualization software from Microsoft® and VMware®
Mark Bramhall, CTO Over 35 years of experience working with small and large firms Designed and implemented numerous data integration solutions Experienced technologist and integrator
04/12/234
About Optima Middleton, MA, based technology consulting firm providing IT
leadership for both strategic and tactical projects
Consulting practice specializes in advising small- to mid-sized firms on optimizing their technology investments and improving business processing. Key services include:
Application Implementations
Data Integrations
Technology Optimization
Business Process Transformation
Irving Burday, President
Experienced technology leader (CIO at several companies)
Led several data integration and data warehouse projects
Directed / managed many complex integration efforts
04/12/235
Data Integration Overview
Definition - combining data residing in different sources and providing users with a unified view of these dataDefinition - combining data residing in different sources and providing users with a unified view of these data
Mediated Schema Example Data Warehouse Example
04/12/236
Business / Technical Drivers
Application Integrations
Legacy to new system migration and conversion
Legacy system feeds to new system
New application / Web site processing
Retirement of legacy application
Business Intelligence / Analytics
Data aggregation for data mining
Supporting predictive models and analytics
Feeding decision support and other business intelligence needs
04/12/237
Business / Technical Drivers
Data Architecture Improvements
Managing feeds to/from Data Warehouse
Reorganization of Operational Data Stores and Marts
Coordinating feeds to/from Legacy systems
Coordinating feeds to/from External entities
External Factors
Supporting Acquisitions / Divestitures
Managing Mergers
Facilitating new Channels and/or Sales Growth
04/12/238
Key Variables / Considerations Business Related
Scale of Effort
Availability / Skills of Resources
Funding
Timing
Technical Factors Platforms and operating systems Data management software Data models, schemas, and data semantics Middleware User interfaces
Frequency of integrations Business rules and integrity constraints
Approaches / Strategies
04/12/239
04/12/2310
Integration approach depends on architectural level
Approaches / Strategies
04/12/2311
Several strategies for integrating data: Manual Integration – users directly interact with all relevant
information systems and manually integrate data. Requires users to have detailed knowledge of logical data representation and data semantics as well as dealing with different interfaces and query languages.
Application Specific – modifying applications or layering application specific code around an application to enable it to take data from or give data to external data stores.
Data Propagation - involves replicating data in different locations from different sources. Technologies include replication, database log scrapers and change data capture software.
Data Federation - enables a single unified virtual view of one or more source data files. Data federation technique normally employs a metadata reference file to connect related customer information together based on a common key.
Approaches
12
Application Specific Solutions Application specific tools and utilities are frequently
provided by vendors to integrate and manage data. Key considerations for utilizing this approach:
Developing and using requires deep system knowledge Best results for special-purpose applications
A new data source requires new code to be written Usually optimal for one-time conversions / migrations
Data cleanup frequently requires multiple human interventions
Fragile if updating / changing the underlying data sources (may affect the application)
Can be expensive - in terms of time and skills
04/12/231313
Data Propagation
Data Propagation is the distribution of data from one or more source data warehouses to one or more local access databases. Data Propagation methods include:
Bulk Extract – utilizes copy management tools or unload utilities to extract all or a subset of the operational relational database. The data which has been extracted may be transformed to the format used by the target on the host or target server.
File Compare – process compares the newly extracted operational data to the previous version. After that, a set of incremental change records is created and are applied.
Change Data Propagation – captures and records the changes to the file as part of the application change process. Techniques include triggers, log exits, or DBMS extensions. A file of incremental changes is created to contain the captured changes.
14
Data Federation
Links data from two or more physically different locations, making the access/linkage appear transparent as if the data was co-located (versus a data warehouse that houses data in one location). Key elements of a data federated approach:
Middleware consisting of database management system
Uniform access to number of heterogeneous data sources
Provides query language used to combine, contrast, analyze and manipulate the data
Data integration can be done through database integration
Combine data from multiple sources with a single SQL statement
Create a master system that relates data elements from all line of business systems
Data Integration Tools
04/12/2315
04/12/2316
ETL
ETL: Extract, Transform and Load
ETL tools extract data from a chosen source(s), transform it into new formats according to business rules, and then load it into target data structure(s)
Enables rules and processes for managing diverse data sources and processing of high volumes of data
Direct insight into source data before; provides data profiling and quality control capabilities
Provides ability to map physical data items with a unique metadata description or create an abstraction layer of common business definitions to map all similar data items to the same definition
ETL Integrated Architecture
XYZ Corp Systems
Data Extraction & Integration
Business Process LayerInformation Management
Presentation LayerAccessible throughout the organization
DistributionAd-hoc Analysis
To
ols
Core Business Systems
• Financial Systems• CRM Systems• Business Line systems• Other…• External Sources
Arc
hit
ec
ture
Portals/Dashboards Production Reports Ad-hoc reports Query Extracts Other data analysis and reporting tools
Reporting
Data Storage
Transformation
Subject Areas
Transformation
Extract, Transformation, & Load (ETL) Layer
The “Unified Business Model” and Information Management
Analytics and Reporting Tools
OLAP Cubes & Predictive Models
• Conformed facts and shared dimensions
• Pre-aggregated data stored in OLAP models to support variance analysis, exception reporting and drill-down
• Real-time drill through into relational storage
• Structured and unstructured content
ExceptionNotifications
Mining
• Extract, Transformation, and Load (ETL)
• Message Broker• Integration of actual data • Metadata (structured data)
management• Security Management• Administration
17
04/12/2318
EAI
EAI: Enterprise Application Integration
An integration framework composed of a collection of technologies and services
A centralized broker that handles security, access, and communication
An independent data model based on a standard data structure (e.g., XML)
A connector or agent model where each vendor, application, or interface can build a single component that can speak natively to that application and communicate with the centralized broker
A system model that defines the APIs, data flow and rules of engagement to the system such that components can be built to interface with it in a standardized way
04/12/2319
EAI Architecture
04/12/2320
Tool Comparison
ETL versus EAI – what’s the difference?
04/12/2321
ETL Tools - Key Features Architecture
Parallel Processing Scalability
− Job Distribution, Pipelining, Partitioning
− Common Warehouse Model (CWM) compliant
− Version Control
ETL Functionality Managing Data Streams (multiple targets,
splitting) Pivoting, De-pivoting, Unions Lookups Scheduling Error Handling
Data Extraction & Integration
Data Storage
Transformation
Transformation
Extract, Transformation, & Load (ETL) Layer
• Extract, Transformation, and Load (ETL)
• Message Broker• Integration of actual
data • Metadata (structured
data) management• Security Management• Administration
04/12/2322
ETL Tools – Key Features (continued) Reusability
Reuse of components Decomposition
Debugging Step by step, row by row, breakpoints Compiler / Validater
Connectivity Native connections support (ODBC, OLE DB,
Flat Files) Integration to package / application meta data Data Quality, Data Validation
Ease-of-Use WYSIWYG Documentation
Data Extraction & Integration
Data Storage
Transformation
Transformation
Extract, Transformation, & Load (ETL) Layer
• Extract, Transformation, and Load (ETL)
• Message Broker• Integration of actual
data • Metadata (structured
data) management• Security Management• Administration
Challenges
04/12/2323
04/12/2324
Data Preparation / Quality
Completeness / Accuracy of Data Records Duplicates Half Match-able Data Freshness of Data
Technology Issues
Multiple and Mixed Data Formats Disparate Operating systems and processing platforms Source system constraints
Organizational
Business and IT Politics Ownership / Stewardship of Source Data Dedication of IT resource to manage daily functions
Challenges
04/12/2325
Level of automation
Time based Event based
Frequency
Error handling
Reporting Requirements
Ownership of Error Remediation
− Technical failures
− Data failures
Auto-correction versus manual updating
Batch integrity
Challenges (continued)
04/12/2326
Data Handling
Transformations Manipulations Transmission
Magnitude of Effort
Number of systems Volume of data Number of runs
Integration Challenges (continued)
Case Studies
04/12/2327
04/12/2328
Business Case 1
Service Company – Revenue: $200m, Size: 300 FTEsService Company – Revenue: $200m, Size: 300 FTEs
Client’s Business Challenge:Integrating data from customer web sites / CRM systems into operational and financial systems
Client’s objective was to build a one-time solution to manage data migrations
Solution:Use SSIS to develop a data migration framework that would allows transformation of data
Build custom stored procedure scripts to extract data from legacy applications
Lessons Learned:Data rules and manipulations required extensive analysis and documentation in order to streamline future state process
Created cross tabular map of legacy application tables to facilitate data mappings and data handling procedures during conversion and testing activities
04/12/2329 04/12/2329
Education Company – Revenue: $500m, Size: 300 FTEsEducation Company – Revenue: $500m, Size: 300 FTEs
Client’s Business Challenge:Integrating data from a set of outsourced-function partners
Integration needed to be real time as clients transited Web sites, but could not fail in the face of network outages, system failures, etc
Solution:• Understand who is the data master and who only keep shadow copies for each type of data
• Design a way to uniquely identify data, even if multiple sources can create it• Deploy a publish / subscribe solution using reliable, persistent message queuing
Lessons Learned:• You cannot know your data too well; subtle relationships must become explicit
• Multi-partner integration requires extremely simple interfaces and definitions
Business Case 2
04/12/2330
Healthcare Company – Revenue: $150m, Size: 200 FTEsHealthcare Company – Revenue: $150m, Size: 200 FTEs
Client’s Business Challenge:Integrating data feeds from source systems into new data warehouse
Implementing a data hub to manage data feeds from external entities (e.g., customers, banks) into financial and customer support systems
Solution:• Select / implement a full featured ETL tool to manage and handle data warehouse and miscellaneous data feeds
• Created data extracts from sources to manage data extract requirements and file formats
• Deploy a data quality program that cleansed incoming and transferred data prior to loading into destination system
Lessons Learned:• Error handling required additional time and effort to define error cases and remediation actions
• Data ownership required executive intervention to staff and manage data management process
Business Case 3
Conclusions
04/12/2331
The Value of a Data Quality Effort Data Remediation
Data management processes can not allow junk data to be loaded, migrated or transported into a target system
Data remediation procedures should be designed into every solution
Key Performance Indicators: Data Quality Compliance
Data quality indicators should be defined and monitored at all times
The KPIs should be used by the data management operations team to manage data processing and testing
The KPIs for management must be business focused and should show how poor data quality is financially effecting their business
32 04/12/23
The Value of a Data Quality Effort Check Twice, Load Once
Data should be checked for validity prior to being loaded into target
Designers and developer MUST log the exact data quality errors and issues that are present in the data being processed
The data quality errors and issues must be summarized and reported on. Reports can be used by operations and source data owners to remediate the data and drive data compliance
33 04/12/23
Importance of Governance
Poor Governance and Lack of Communication account for over 85% of the issues in a data integration project
SizeIncorrect hardware or software
7%
2%
Failure to define objectives
17%
Unfamiliarity with scopeand complexity
17%
Lack of communication
20%
InadequateProject Management
32%
Other
5%
34 04/12/23
Data Stewardship
Data stewards act as the conduit between IT and the business and accept accountability for the data management process.
IT
Business
Business
Data Stewards
Domain Values Data Standards Business Rule Specifications Data Ownership Rules Data Quality Rules Security Requirements Data Retention Criteria
Data Stewards play a the central role in the management of data across the organization and in assuring its usefulness for the business.35 04/12/23
Data stewards become the “public face” for data and have the following responsibilities:
Success Factors
Establish and agree upon scope, high level requirements, expected benefits, and architecture
Benefits need to be emphasized from the top down and understood from the bottom up
Data integrity and Data cleansing cannot be over-emphasized
Even well-documented systems are usually prone to poor data quality
Common definitions and mapping is crucial
A complex business is not made any less complex by documenting the data and putting it in a operational store
Knowledgeable use of the data will still require knowledgeable users
36 04/12/23
Success Factors (continued)
Technology is only part of the answer No mater how sophisticated the implementation, significant process
change will be required
But, technology is key to success Having a key partner who has done this before will minimize risk
Much can be learned from similar efforts
This effort requires a full-time dedicated set of highly-skilled resources
Both technical and business knowledge are required
37 04/12/23
Appendices
04/12/2338
04/12/2339
ETL Vendors
ETL Vendors ETL ToolsMicrosoft SQL Server Integration Services Oracle Oracle Warehouse Builder (OWB)SAP Business Objects Data Integrator & Data Services IBM IBM Information Server (Datastage)IBM Data Manager/Decision Stream (Cognos)SAS Institute SAS Data Integration StudioInformatica PowerCenterAb Initio Co>Operating SystemInformation Builders Data MigratorAdeptia Adeptia Integration ServerCastIron Systems OmniConnect PlatformPitney Bowes Business Insight DataFlow ManagerPervasive Data IntegratorElixir Elixir RepertoireJavlin Clover ETLPentaho Pentaho Data Integration Talend Talend Open Studio
04/12/2340
ETL / EAI - Tool Strengths
ETL EAI Excels at bulk data movement Limited in data movement
capabilities
Provides for complex transformations, aggregation from multiple sources and sophisticated business rules.
Offer less sophisticated transformation and extraction functions
Assumes data delays. Operates in real time
Are batch-oriented, making them fast and simple for one-time projects and testing
Work better with continuously interacting systems
Offers little in the way of workflow Workflow-oriented at the core
Works primarily at the session layer Works primarily at the transport layer