Debunking Common Myths of Hadoop Backup & Test Data Management

Posted on 08-Jan-2017

Category:

Engineering

TRANSCRIPT

Confidential and Proprietary

Debunking Common Myths About Hadoop Backup and Test Data Management

Hari Mankude, CTO

November 2016

My Background

Why Bother With Backup and Test Data Mgmt?

• The average cost of a data loss incident is $900,000

• 90% of enterprises delay applications because of a lack of test data

• Source: EMC, Talena

Myth #1: Data Replicas Prevent Data Loss

[Diagram: one Name Node with five Data Nodes]

Myth #2: Hadoop Replication Prevents Data Loss

[Diagram: Data Center #1 and Data Center #2, each with one Name Node and three Data Nodes, with DistCp between them]

Myth #3: Hadoop Snapshots Are An Effective Backup Strategy

PROBLEM: Snapshots result in storage amplification

PROBLEM: Need a scheduler to take timely snapshots and delete older restore points
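
The scheduling burden named on this slide can be made concrete with a minimal sketch of the retention logic someone would have to write and operate. It assumes a simple "keep the last N days" policy and a timestamp-based naming scheme; both, along with the function names, are hypothetical and not from the talk. The sketch only builds the `hdfs dfs -deleteSnapshot` command lines and does not run them.

```python
from datetime import datetime, timedelta


def snapshot_name(now):
    """Hypothetical naming scheme: 's' followed by a sortable timestamp."""
    return now.strftime("s%Y%m%d-%H%M%S")


def expired_snapshots(snapshots, now, keep_days=7):
    """Return the names of snapshots older than the retention window.

    snapshots maps snapshot name -> creation time (datetime).
    keep_days is an assumed retention policy, not a Hadoop default.
    """
    cutoff = now - timedelta(days=keep_days)
    return sorted(name for name, created in snapshots.items() if created < cutoff)


def delete_commands(path, names):
    """Build (but do not execute) one HDFS shell command per expired snapshot."""
    return [["hdfs", "dfs", "-deleteSnapshot", path, name] for name in names]


if __name__ == "__main__":
    now = datetime(2016, 11, 30, 12, 0, 0)
    existing = {
        "s20161120-000000": datetime(2016, 11, 20),
        "s20161129-000000": datetime(2016, 11, 29),
    }
    for cmd in delete_commands("/data", expired_snapshots(existing, now)):
        print(" ".join(cmd))
```

Even this small policy needs something to run it on a schedule, handle failures, and track which restore points exist, which is the operational overhead the slide points at.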

Myth #4: Restoring From Snapshots Is Trivial

PROBLEM: Requires metadata and data to be restored in sync

PROBLEM: Versioning complicates the restore process

Myth #5: DistCp Is Good Enough

• DistCp only copies data, not metadata or attributes

• Very resource intensive: takes up MapReduce slots on production

• Error recovery is not robust and can lead to failed jobs

• No restore point management (aka no point-in-time recovery)

Myth #6: The Traditional Backup/Restore Process Works

Weekly Fulls and Daily Incrementals

• 500 TB with 5% daily change = 650 TB moved per week

Agents

• Impact on CPU

• Management overhead of agents on 100s of nodes

Restores

• Involves going back to last full backup and applying all the incrementals
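
The 650 TB figure and the restore procedure on this slide can be checked with a short sketch. It assumes one full backup plus six daily incrementals per week, which is the reading that reproduces the slide's number; the function names and the (day, kind) tuple layout are hypothetical.

```python
def weekly_volume_tb(dataset_tb, daily_change_pct, incrementals_per_week=6):
    """Data moved per week: one full backup plus the daily incrementals.

    Assumes each incremental moves daily_change_pct percent of the dataset.
    """
    incremental_tb = dataset_tb * daily_change_pct / 100
    return dataset_tb + incrementals_per_week * incremental_tb


def restore_chain(backups, target_day):
    """Return the backups that must be applied, in order, to restore target_day.

    backups is a list of (day, kind) tuples sorted by day, with kind being
    "full" or "incr". The chain starts at the last full backup on or before
    target_day and includes every incremental after it up to target_day.
    """
    usable = [b for b in backups if b[0] <= target_day]
    full_positions = [i for i, b in enumerate(usable) if b[1] == "full"]
    if not full_positions:
        raise ValueError("no full backup available on or before the target day")
    return usable[full_positions[-1]:]


if __name__ == "__main__":
    print(weekly_volume_tb(500, 5))  # 500 TB full + 6 x 25 TB incrementals
    week = [(0, "full"), (1, "incr"), (2, "incr"), (3, "incr")]
    print(restore_chain(week, 3))
```

A restore late in the week has to read the full backup and every incremental since, so restore time grows with the distance from the last full.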

Myth #7: Test Data Management Is a Simple Process

• Change Request - 1 week

• Provision Production Data - 1 week

• Create Test DB and Mask Data - 1 week

• Create Samples of Production Data - 2 days

• Push Production Data to Test - Hours

• Repeat Process - 3-4 weeks

The Evolution of Data Management

THE NEXT 25 YEARS

THE TRADITIONAL WORLD

Data Management

Data Platforms

Talena in Production

[Diagram: Test Cluster, Research Cluster, Talena GUI, Hadoop/Spark Cluster, Cassandra Cluster, Vertica Cluster, Couchbase Cluster, Talena Smart Storage Cluster]

The Talena Architecture

• Deep de-duplication and compression with app-aware architecture

• Incremental-forever backup architecture

• High availability via erasure coding in distributed cluster architecture

Smart Storage Optimizer
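
To illustrate how de-duplication, compression, and incremental-forever backups fit together in general, here is a minimal sketch of a content-addressed chunk store. It is not Talena's implementation: the `ChunkStore` class, fixed-size chunking, and the tiny chunk size are assumptions chosen to keep the example short, and the talk's "app-aware" approach is not modeled.

```python
import hashlib
import zlib


class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept once, compressed."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # unrealistically small, for illustration
        self.chunks = {}              # SHA-256 hex digest -> compressed chunk

    def put(self, data):
        """Store data and return its recipe (the ordered list of chunk digests)."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # only new content costs storage
                self.chunks[digest] = zlib.compress(chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(zlib.decompress(self.chunks[d]) for d in recipe)


if __name__ == "__main__":
    store = ChunkStore()
    first = store.put(b"AAAABBBBAAAA")   # two unique chunks stored
    second = store.put(b"AAAACCCC")      # only the CCCC chunk is new
    print(len(store.chunks), store.get(first), store.get(second))
```

Because each backup is just a recipe pointing at shared chunks, every backup after the first stores only changed content yet can be restored directly, without replaying a chain of incrementals.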

The Talena Architecture

• Native querying and analytics via active compute layer

• Unbounded scale with a Hadoop-native architecture

Smart Storage Optimizer

Active Compute Services

Distributed File System

The Talena Architecture

• Google-like catalog shortens data recovery time

• Automatic schema generation for mirroring and backups

• Granular recovery at an object level

• Recovery to multiple topologies

• Native integration with LDAP and Kerberos for authentication

• Role-based access control defines specific privileges

• Transparent data encryption

• Masking for PII data

Smart Storage Optimizer

Active Compute Services

Distributed File System

Metadata Catalog

Data Orchestration Services

Security Services
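
As an illustration of the "masking for PII data" bullet above, the following sketch shows one common approach, deterministic keyed hashing, so that masked values remain joinable across tables in a test cluster. The talk does not say which technique Talena uses; the function names, field names, and key handling here are hypothetical.

```python
import hashlib
import hmac


def mask_value(value, key, length=12):
    """Replace a value with a truncated HMAC-SHA256 token.

    The same input and key always give the same token, so joins on masked
    columns still work, while the original value is not stored.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_record(record, pii_fields, key):
    """Return a copy of record with the named PII fields masked."""
    return {
        field: mask_value(str(value), key) if field in pii_fields else value
        for field, value in record.items()
    }


if __name__ == "__main__":
    key = b"example-key-kept-outside-the-test-cluster"  # placeholder secret
    row = {"name": "Alice Example", "city": "Paris", "visits": 3}
    print(mask_record(row, {"name"}, key))
```

Keeping the key out of the test environment matters: anyone holding it could confirm guesses about the original values.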

The Talena Architecture

• ‘Single pane of glass’ for multiple use cases and data platforms

• Agentless architecture minimizes management overhead

• GUI, CLI, REST-based Talena API options

GUI

CLI

API

Smart Storage Optimizer

Active Compute Services

Distributed File System

Metadata Catalog

Data Orchestration Services

Security Services

Hadoop Support

Supports, and/or is certified against, multiple distributions: Apache, Cloudera, Hortonworks, IBM BigInsights

Supports multiple applications: HDFS, Hive, HBase, Impala, Presto

Deployed either on-premises or in private/public clouds

Q&A

We’ll send you a link to our eBook “The Hadoop Backup Guide”

Additional resources: talena-inc.com/resources and talena-inc.com/blog

Ping us with any additional questions: info@talena-inc.com

Q and A
