
IBM Software

The fundamentals of data lifecycle management in the era of big data

How data lifecycle management complements a big data strategy

Contents

1 Introduction
2 Big data, big impact: Dealing with the three Vs
3 Best practices: Putting data lifecycle management into action
4 The power of enterprise-scale data lifecycle management
5 Enhance data warehouse agility with IBM InfoSphere
6 Why InfoSphere?


Introduction

Organizations are eager to harness the power of big data. But as new big data opportunities emerge, ensuring that information is trusted and protected becomes exponentially more difficult. If these challenges are not addressed directly, end users may lose confidence in the insights generated from their data—which can leave them unable to act on new opportunities or address threats.

The tremendous volume, variety and velocity of big data means that the old manual methods of discovering, governing and correcting data are no longer feasible. Organizations need to automate information integration and governance from the start. By automating information integration and governance and employing it at the point of data creation and throughout its lifecycle, organizations can help protect information and improve the accuracy of big data insights.



Information integration and governance solutions must become a natural part of big data projects. They must support automated discovery and profiling and they must facilitate an understanding of diverse data sets to provide the complete context required to make informed decisions.

They must be agile enough to accommodate a wide variety of data and seamlessly integrate with diverse technologies, from data marts to Apache Hadoop systems. Plus, they must discover, protect and monitor sensitive information across its lifecycle as part of big data applications.

Understanding the context of data and being able to extract the precise information necessary to meet a business objective is key to utilizing big data to the fullest. Managing the data lifecycle so that data is accurate, is appropriately used and is correctly stored to meet the required service levels and retention needs has wide-ranging benefits. These benefits include risk reduction, performance improvements and preventing an overload of useless information.


This e-book explores the challenges of managing big data, best practices for enterprise-scale data lifecycle management and how IBM® InfoSphere® Optim™ data lifecycle management solutions incorporate a comprehensive range of information integration and governance capabilities that enable companies to properly manage data over its lifetime.


Big data, big impact: Dealing with the three Vs

Without effective data lifecycle management, the increasing volume, variety and velocity of big data can reduce performance, erode margins and amplify risks.

Performance and time-to-market

As more users execute more queries on larger data volumes, slow response times and degraded application performance become major issues. If left unchecked, continued data growth will stretch resources beyond capacity and negatively impact response times for critical queries and reporting processes. These problems can affect production environments and hamper upgrades, migrations and disaster recovery efforts. Implementing intelligent data management of historical, dormant data is essential for avoiding these potentially business-halting issues.

Rapid data growth also makes testing more difficult. As data warehouses and big data environments grow to petabytes or more, testing processes are taxed by having to cull data for their specific needs. The results include longer test cycles, slower time-to-market and fewer defects identified in advance of release. Speeding up testing workflows and delivery of data warehouses requires organizations to automate the creation of realistic, rightsized test data—while keeping appropriate security measures in place.

Margins

Exponential data growth also can drive up infrastructure and operational costs, often consuming most of an organization’s data warehousing or big data budget. Rising data volumes require more capacity, and organizations often must buy more hardware and spend more money to maintain, monitor and administer their expanding infrastructure. Large data warehouses and big data environments generally require bigger servers, appliances and testing environments, which can also increase software licensing costs for the database and database tooling, not to mention labor, power and legal costs.



Risks

Following the “let’s keep it in case someone needs it later” mandate, many organizations already keep too much historical data. According to the CGOC 2012 Summit Survey, 69 percent of data has no value. Opening the doors to excessive storage and retention only exacerbates the situation.

At the same time, organizations must ensure the privacy and security of the growing volumes of confidential information. Government and industry regulations from around the world, such as the Health Insurance Portability and Accountability Act (HIPAA), the Personal Information Protection and Electronic Documents Act (PIPEDA) and the Payment Card Industry Data Security Standard (PCI DSS), require organizations to protect personal information no matter where it lives—even in test and development environments.

Maintaining compliance with data retention regulations, protecting privacy and archiving data are not just legal matters—they are essential for sustaining customer satisfaction and brand reputation. In recent IBM surveys, respondents indicated that data theft/cybercrime is the number-one threat to a company’s reputation—a greater threat than system failures. Sixty-four percent of respondents say their company will be focusing more on managing and protecting their reputation than they did five years ago.1

[Figure: Data breaches and attacks risk negative consumer sentiment. 75 percent of IT risks impact customer satisfaction and brand reputation; 43 percent of organizations are increasing focus on reputational risk because of growth in emerging technologies such as social media. Source: Insights from the 2012 Global Reputational Risk and IT Study.]


The danger of treating a backup as an archive

Many organizations are confused about the difference between archiving and backing up data. Archiving preserves data, providing a long-term repository of information that can be used by litigation and audit teams. By contrast, backing up data involves copying production data and moving it to another environment to enable disaster recovery and the restoration of deleted files. Backups are often retained for a short time, until a fresh backup replaces the existing backup.

Archiving complements backups by removing old, redundant and infrequently accessed data from a system and by reducing the size of databases and their backups. Approximately 75 percent of the data stored is typically inactive, rarely accessed by any user, process or application. An estimated 90 percent of all data access requests are serviced by new data—usually data that is less than a year old.2 With an effective archiving strategy, organizations can protect old data and comply with data retention rules while reducing costs and enhancing system performance.

In an attempt to meet archiving needs, some organizations simply back up data to a Hadoop environment. But this kind of backup will not ensure that data will be fully protected or remain query-able, the way a true archive would. With an effective data lifecycle management solution, companies can create an archive that protects data, meets compliance standards, and supports queries and reporting. An emerging trend is for organizations to use Hadoop as a lower-cost storage alternative for archives.
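The archive-versus-backup distinction above can be sketched in a few lines. This is an illustrative example only, not InfoSphere Optim functionality: the schema, cutoff date and SQLite databases are invented. The key property it shows is that dormant rows are relocated to a separate archive that, unlike a backup, remains directly query-able.

```python
import sqlite3

# Hypothetical retention rule for illustration; a real archive would also
# preserve related reference data and honor legal-hold requirements.
RETENTION_CUTOFF = "2020-01-01"  # rows older than this are archived

prod = sqlite3.connect(":memory:")     # stand-in for the production database
archive = sqlite3.connect(":memory:")  # stand-in for the archive store

prod.execute("CREATE TABLE orders (id INTEGER, order_date TEXT, amount REAL)")
prod.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2015-03-10", 100.0),   # dormant: will move to the archive
    (2, "2023-07-22", 250.0),   # active: stays in production
])
archive.execute("CREATE TABLE orders (id INTEGER, order_date TEXT, amount REAL)")

# Archive step: copy dormant rows out, then delete them from production.
old_rows = prod.execute(
    "SELECT id, order_date, amount FROM orders WHERE order_date < ?",
    (RETENTION_CUTOFF,)).fetchall()
archive.executemany("INSERT INTO orders VALUES (?, ?, ?)", old_rows)
prod.execute("DELETE FROM orders WHERE order_date < ?", (RETENTION_CUTOFF,))

# Unlike a backup, the archive stays query-able in place.
print(prod.execute("SELECT COUNT(*) FROM orders").fetchone()[0])     # 1
print(archive.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
```

A backup, by contrast, would be an opaque copy of the whole database, restorable but not directly usable by audit or litigation teams.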


Best practices: Putting data lifecycle management into action

The data lifecycle stretches through multiple phases as data is created, used, shared, updated, stored and eventually archived or defensibly disposed of. Data lifecycle management plays an especially important role in three activities across these phases: archiving, test data management and data masking.

Archiving

Retention policies are designed to keep important data elements for reference and for future use while deleting data that is no longer necessary to support the legal needs of an organization. Effective data lifecycle management includes the intelligence not only to archive data in its full context, which may include information across dozens of databases, but also to archive it based on specific parameters or business rules, such as the age of the data. It can also help storage administrators develop a tiered and automated storage strategy to archive dormant data in a data warehouse, thereby improving overall warehouse performance.
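The age-based business rules described above can be as simple as a policy function that maps how recently data was accessed to a storage tier. The tier names and thresholds below are invented for illustration; real policies would come from retention schedules and service-level requirements.

```python
from datetime import date

# Hypothetical tiering policy for illustration only; real rules would be
# driven by retention schedules, legal holds and service levels.
def storage_tier(last_accessed: date, today: date) -> str:
    """Map the age of a record to a storage tier."""
    age_days = (today - last_accessed).days
    if age_days < 90:
        return "warehouse"         # hot: keep in the production warehouse
    if age_days < 365 * 3:
        return "nearline-archive"  # warm: cheaper storage, still query-able
    return "deep-archive"          # cold: retained for compliance only

today = date(2024, 1, 1)
print(storage_tier(date(2023, 12, 1), today))  # warehouse
print(storage_tier(date(2022, 6, 1), today))   # nearline-archive
print(storage_tier(date(2018, 1, 1), today))   # deep-archive
```

A batch archiving job would apply such a function to dormant rows and move each to the matching tier automatically.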


The entire data lifecycle benefits from good governance, but management capabilities that focus on the use, share and archive steps have wide-ranging benefits for cost reduction and efficiency gains.

[Figure: Where management tasks fall in the data lifecycle. The lifecycle runs from create through use, share, update, archive, store/retain and dispose; test data management, data masking and archiving align with the use, share and archive steps.]


Many organizations envision big data as a large, pristine, centralized “data lake.” But a data lake can quickly turn into a “data swamp” when data is poorly managed and controlled. By setting up an intelligent data lifecycle management strategy and archiving to inexpensive storage, you can avoid turning your big data environment into a dumping ground.

Test data management

In development, testers must automate the creation of realistic, rightsized data sources that mirror the behaviors of existing production databases. To ensure that queries can be run easily and accurately, they must create a subset of actual production data and reproduce actual conditions to help identify defects or problems as early as possible in the testing cycle.
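The subsetting idea can be sketched with invented in-memory tables standing in for production data. The point is that every foreign key in the subset still resolves, so test queries behave like production queries; a real subsetter would walk relationships discovered from the database catalog or a relationship model.

```python
# Hypothetical "production" tables for illustration only.
customers = [
    {"cust_id": 101, "name": "Ann"},
    {"cust_id": 102, "name": "Bob"},
    {"cust_id": 103, "name": "Cho"},
]
orders = [
    {"order_id": 1, "cust_id": 101},
    {"order_id": 2, "cust_id": 101},
    {"order_id": 3, "cust_id": 103},
]

def subset(customers, orders, wanted_cust_ids):
    """Extract the chosen customers plus every order that references them,
    so foreign keys in the subset still resolve."""
    subset_customers = [c for c in customers if c["cust_id"] in wanted_cust_ids]
    kept_ids = {c["cust_id"] for c in subset_customers}
    subset_orders = [o for o in orders if o["cust_id"] in kept_ids]
    return subset_customers, subset_orders

test_customers, test_orders = subset(customers, orders, {101})
print(len(test_customers), len(test_orders))  # 1 2
```

The subset is rightsized (one customer rather than the full table) yet referentially complete, which is what makes functional testing against it meaningful.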

The tremendous size of big data systems creates challenges for testers. There is a greater need to speed delivery of big data applications, requiring organizations to create realistic, rightsized, masked test data for testing those applications for performance and functionality. Testers also need ways to generate test data sets that facilitate realistic functional and performance testing. Because production data contains information that may identify customers, organizations must mask that information in test environments to maintain compliance and privacy.

Many organizations hope that big data will provide a large, centralized “lake” of data, but in many cases, it becomes a data swamp full of unreliable information.

[Figure: Of all enterprise information, only 31 percent has value: 1 percent is subject to legal hold, 5 percent is needed for regulatory record keeping and 25 percent has business utility. The remaining 69 percent is everything else.]



Applying data masking techniques to the test data means testers use realistic-looking, but fictional data—no actual sensitive data is revealed. Application developers can also use test data management technologies to easily access and refresh test data, which speeds the testing and delivery of the new data source.

Organizations also need ways to mask certain sensitive data, such as credit card and phone numbers. While testing their big data environments, they must mask sensitive data from unauthorized users, even though those users might be authorized to see the data in aggregate.

[Figure: Data masking techniques protect the confidentiality of private information.]

Original data

Customers table
Cust ID   Name            Street
08054     Alice Bennett   2 Park Blvd
19101     Carl Davis      258 Main
27645     Elliot Flynn    96 Avenue

Orders table
Cust ID   Item #    Order date
27645     80-2382   20 June 2004
27645     86-4538   10 October 2005

De-identified data

Customers table
Cust ID   Name             Street
10000     Auguste Renoir   23 Mars
10001     Claude Monet     24 Venus
10002     Pablo Picasso    25 Saturn

Orders table
Cust ID   Item #    Order date
10002     80-2382   20 June 2004
10002     86-4538   10 October 2005

For example, a pharmaceutical company that is testing its data warehouse environment might mask Social Security numbers and dates of birth but not patients’ ages and other demographic information. Masking certain data this way satisfies corporate and industry regulations by removing identifiable information, while still maintaining business context and referential integrity for testing in non-production environments.
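One common way to keep masked data consistent across tables is a deterministic mapping: the same original value always masks to the same substitute, so joins between tables still line up. The sketch below is illustrative only, not how InfoSphere Optim implements masking; the salt, the fictional ID scheme and the substitute name list are invented.

```python
import hashlib

# Invented substitute names for illustration.
FAKE_NAMES = ["Auguste Renoir", "Claude Monet", "Pablo Picasso"]

def mask_id(cust_id: int, salt: str = "demo-salt") -> int:
    """Deterministically map a real customer ID to a fictional 5-digit ID."""
    digest = hashlib.sha256(f"{salt}:{cust_id}".encode()).hexdigest()
    return 10000 + int(digest, 16) % 90000

def mask_name(cust_id: int) -> str:
    """Deterministically pick a fictional replacement name."""
    return FAKE_NAMES[cust_id % len(FAKE_NAMES)]

customers = [{"cust_id": 27645, "name": "Elliot Flynn"}]
orders = [{"cust_id": 27645, "item": "80-2382"}]

masked_customers = [{"cust_id": mask_id(c["cust_id"]),
                     "name": mask_name(c["cust_id"])} for c in customers]
masked_orders = [{"cust_id": mask_id(o["cust_id"]), "item": o["item"]}
                 for o in orders]

# Referential integrity survives: the masked order still joins to the
# masked customer, though neither record reveals the real name.
print(masked_customers[0]["cust_id"] == masked_orders[0]["cust_id"])  # True
```

Because the mapping is a pure function of the original value, masking can run independently over each table and still produce a consistent, testable data set.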


As volume, variety and velocity increase the complexity of data infrastructures, scaling test environments becomes a significant problem. It isn’t unusual for Fortune 500 companies to spend up to USD30 million building a single test lab—and many of these organizations have dozens of labs. Add in rising wages, and testing costs begin to spiral out of control.

[Figure: Complex IT landscapes make setting up test labs extremely costly. A typical heterogeneous environment spans public and private clouds, routing and third-party services, collaboration and web portals, content providers, business partners, archives, shared and messaging services, EJB applications, file systems, an enterprise service bus, mainframes, directory identity services and data warehouses.]


The power of enterprise-scale data lifecycle management

Effective data lifecycle management benefits both IT and business stakeholders.

• Increasing margin: Lower infrastructure and capital costs, improved productivity and reduced application defects during the development lifecycle.

• Reducing risks: Reduced application downtime, minimized service and performance disruptions, and adherence to data retention requirements.

• Promoting business agility: Improved time-to-market, increased application performance and improved quality of applications through realistic test data.

With InfoSphere Optim, organizations gain a single data lifecycle management solution that can scale to meet enterprise needs. Whether they implement InfoSphere Optim for a single application, data warehouse or big data environment, organizations can streamline data lifecycle management with a consistent strategy. The unique relationship engine in InfoSphere Optim provides a single point of control to guide data processing activities such as archiving, subsetting and retrieving data.


Enhance data warehouse agility with IBM InfoSphere

InfoSphere Optim solutions help organizations meet requirements for information integration and governance and address challenges exacerbated by the increasing volume, variety and velocity of data. By archiving old data from huge data warehouse environments, businesses can improve response times and reduce costs by reclaiming valuable storage capacity. By creating realistic, rightsized data sources for testing, they can enhance the accuracy of testing and identify problems early in the testing cycle. And by implementing data masking capabilities, they can protect sensitive data and help ensure compliance with privacy regulations.

As a result, organizations gain more control of their IT budget while simultaneously helping their big data and data warehouse environments run more efficiently and reducing the risk of exposure of sensitive data.

InfoSphere Optim supports major big data and data warehouse environments, including IBM PureData™ for Analytics, IBM PureData for Transactions, IBM InfoSphere BigInsights™, Teradata, Oracle and popular Hadoop distributions. It also supports enterprise databases and operating systems, including IBM DB2®, Oracle Database, Sybase, Microsoft SQL Server, IBM Informix®, IBM IMS™, IBM Virtual Storage Access Method (VSAM), Microsoft Windows, UNIX, Linux and IBM z/OS®.

In addition, InfoSphere Optim supports key enterprise resource planning (ERP) and customer relationship management (CRM) applications such as Oracle E-Business Suite, PeopleSoft Enterprise, JD Edwards EnterpriseOne, Siebel, Amdocs CRM and the SAP ERP and CRM applications, as well as many custom applications.


The value of test data management at a US insurance company

With 42 high-volume back-end systems needed to generate a full end-to-end system test, a US insurance company could not confidently launch new features. Testing in production was becoming the norm. In fact, claims could not be processed in certain states because of application defects that the teams skipped over during the testing process. IT was consuming an increasing number of resources—yet application quality was declining rapidly.

After implementing a process to govern test data management, the insurance company reduced the costs of testing by USD400,000 per year. Today, the company can easily refresh 42 test systems from across the organization in record time while finding defects in advance.

The business value from implementing test data management included:

• Cost savings of approximately USD500,000 per year

• 41 percent less labor required over 12 months

• 44 percent fewer untested scenarios


Why InfoSphere?

As the foundation of the IBM big data platform, InfoSphere provides market-leading functionality across all the capabilities of information integration and governance. It is designed to handle the challenges of big data by providing optimal scale and performance for massive data volumes, agile and rightsized integration and governance for the increasing velocity of data, and support for a wide variety of data types and big data systems. InfoSphere helps make big data and analytics projects successful by delivering the confidence to act on insight.

InfoSphere capabilities include:

• Metadata, business glossary and policy management: Define metadata, business terminology and governance policies with IBM InfoSphere Business Information Exchange.

• Data integration: Handle all integration requirements, including batch data transformation and movement (InfoSphere Information Server), real-time replication (InfoSphere Data Replication) and data federation (InfoSphere Federation Server).

• Data quality: Parse, standardize, validate and match enterprise data with InfoSphere Information Server for Data Quality.

• Master data management: Act on a trusted view of your customers, products, suppliers, locations and accounts with InfoSphere MDM.

• Data lifecycle management: Manage data throughout its lifecycle, from requirements through retirement, with InfoSphere Optim test data automation and database archiving capabilities.

• Data security and privacy: Continuously monitor data access and protect repositories from data breaches, and support compliance with IBM InfoSphere Guardium®. Ensure sensitive data is masked and protected with InfoSphere Optim.



Additional resources

Ready to get started? Take a self-service InfoSphere Optim Business Value Assessment and show the ROI results to your big data project owner.

To learn more about InfoSphere Optim, check out these resources:

• Manage the Data Lifecycle of Big Data Environments

• IBM InfoSphere Optim solutions for data warehouses

• Demo: IBM InfoSphere Optim Data Growth Solution

• Demo: IBM InfoSphere Optim Test Data Management Solution

To learn more about the IBM approach to information integration and governance for big data, please contact your IBM representative or IBM Business Partner, or visit: ibm.com/software/data/information-integration-governance


IMM14126-USEN-00

© Copyright IBM Corporation 2013

IBM Corporation Software Group Route 100 Somers, NY 10589

Produced in the United States of America August 2013

IBM, the IBM logo, ibm.com, BigInsights, DB2, Guardium, IMS, Informix, InfoSphere, Optim, PureData, and z/OS are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation.

Please Recycle

1 Insights from the 2012 IBM Global Reputational Risk and IT Study. ibm.com/services/us/gbs/bus/html/risk_study-2012-infographic.html

2 Yuhanna, Noel. “Your Enterprise Data Archiving Strategy.” Forrester. February 2011. ftp://ftp.boulder.ibm.com/software/data/sw-library/data-management/optim/papers/your-enterprise-data-archiving-strategy.pdf