Hadoop Archive and Tiering

Simplify and Automate Tiering & Archive for Hadoop Today

Upload: pete-kisich

Posted on 11-Jan-2017


TRANSCRIPT

Page 1: Hadoop Archive and Tiering


Simplify and Automate Tiering & Archive for Hadoop Today

Page 2: Hadoop Archive and Tiering


Reasons For Storage Tiering with Hadoop:

• A single tier leads to a large imbalance of compute and storage resources
• More applications create varying workloads
• A large percentage of data is cold in most cases
• More recently ingested data can be better balanced
• Fewer nodes per GB with archive nodes
• Lower infrastructure costs

Tiering on Hadoop

Archive Node Example (both tiers are managed by the same Name Nodes):

• Existing Tier Node: medium compute, medium capacity; holds actively accessed data
• Cold Tier Node: low compute, high-density capacity (4x less per GB); holds cold data
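In HDFS terms, the two tiers map onto storage types and block storage policies. A minimal sketch, assuming the stock HDFS archival-storage feature with its standard policy names (HOT, WARM, COLD) and storage types (DISK, ARCHIVE):

```python
# Sketch of how HDFS block storage policies place replicas on storage
# types. Policy and storage-type names follow the standard HDFS
# archival-storage feature; this is illustrative, not HDFS source code.

def replica_placement(policy: str, replication: int = 3) -> list[str]:
    """Return the storage type used for each replica under a policy."""
    if policy == "HOT":       # all replicas on fast DISK nodes
        return ["DISK"] * replication
    if policy == "COLD":      # all replicas on dense ARCHIVE nodes
        return ["ARCHIVE"] * replication
    if policy == "WARM":      # one replica on DISK, the rest on ARCHIVE
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    raise ValueError(f"unknown policy: {policy}")

print(replica_placement("WARM"))  # ['DISK', 'ARCHIVE', 'ARCHIVE']
```

Moving a file to the cold tier then amounts to setting its policy to COLD and letting the HDFS mover relocate its blocks to ARCHIVE-tagged datanodes during a maintenance window.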

Page 3: Hadoop Archive and Tiering


• Over 65% less hardware
• 60% fewer nodes (lower software licensing)
• Significant performance improvement
• Immediate ROI for cloud and private infrastructures

Example of Archive Tiering Benefits (4x fewer nodes):

• Single Tier HDFS Storage: Disk Data Nodes 100%, capacity 10 PB
• Tiered HDFS Storage: Disk Data Nodes 20%, Archive Data Nodes 80%, capacity 10 PB

“The price per GB of the ARCHIVE tier is 4x less” -eBay Hadoop Engineering Blog
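The savings follow from simple arithmetic. A back-of-the-envelope sketch, assuming 80% of the 10 PB is cold, archive storage costs 4x less per GB (the eBay figure), and archive nodes pack roughly 4x the capacity of standard disk nodes:

```python
# Back-of-the-envelope tiering savings for a 10 PB cluster.
# The 80% cold share and 4x cost/density factors come from the slides;
# treat them as assumptions, not measurements.

TOTAL_PB = 10.0
COLD_FRACTION = 0.8           # share of data that is rarely accessed
ARCHIVE_COST_FACTOR = 0.25    # 4x less per GB than the disk tier
ARCHIVE_DENSITY_FACTOR = 4.0  # capacity per archive node vs disk node

# Relative storage cost (disk-tier cost per PB normalized to 1.0)
single_tier_cost = TOTAL_PB * 1.0
tiered_cost = (TOTAL_PB * (1 - COLD_FRACTION) * 1.0
               + TOTAL_PB * COLD_FRACTION * ARCHIVE_COST_FACTOR)
cost_savings = 1 - tiered_cost / single_tier_cost

# Relative node count (disk-node capacity normalized to 1 PB/node)
single_tier_nodes = TOTAL_PB / 1.0
tiered_nodes = (TOTAL_PB * (1 - COLD_FRACTION) / 1.0
                + TOTAL_PB * COLD_FRACTION / ARCHIVE_DENSITY_FACTOR)
node_savings = 1 - tiered_nodes / single_tier_nodes

print(f"storage cost savings: {cost_savings:.0%}")  # 60%
print(f"node count savings:   {node_savings:.0%}")  # 60%
```

The 60% node reduction matches the deck's "60% fewer nodes" claim; the "over 65% less hardware" figure presumably also counts compute and chassis reductions not modeled here.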

Page 4: Hadoop Archive and Tiering


Pillars of Intelligent Tiering for Hadoop

HEAT: Access frequency of data is the most important metric for effective tiering.

AGE: Age is the easiest to determine. CAUTION: some data is long-term active, so age cannot be the only criterion.

SIZE: Zero-byte and small files should be treated differently when tiering Hadoop. Large cold files should have priority for archive.

USAGE: Knowing how long data is accessed once ingested can provide better capacity planning for your tiers.
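The four pillars can be sketched as a simple candidate check. This is an illustrative Python sketch, not FactorData's actual scoring logic; the thresholds are made-up assumptions:

```python
import time

# Illustrative archive-candidate check based on the four pillars
# (heat, age, size, usage). All thresholds are hypothetical examples.

DAY = 86_400  # seconds per day

def is_archive_candidate(size_bytes: int, created: float,
                         last_access: float, now: float,
                         min_age_days: int = 120,
                         min_idle_days: int = 90,
                         min_size_bytes: int = 64 * 1024 * 1024) -> bool:
    age_days = (now - created) / DAY       # AGE: time since ingest
    idle_days = (now - last_access) / DAY  # HEAT: time since last read
    # SIZE: skip zero/small files; archiving them saves little and they
    # are better handled by compaction. USAGE history would further
    # refine the idle threshold per dataset.
    if size_bytes < min_size_bytes:
        return False
    return age_days >= min_age_days and idle_days >= min_idle_days

now = time.time()
# Large file, 200 days old, untouched for 150 days -> candidate
print(is_archive_candidate(1 << 30, now - 200 * DAY, now - 150 * DAY, now))
```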

Page 5: Hadoop Archive and Tiering


Tier Hadoop HDFS By Heat, Age, Size & Activity In Three Easy Steps: The FactorData Approach

01/ INSTALL WITHOUT CHANGES TO CLUSTER
Installed on a server or VM outside your existing Hadoop cluster, without inserting any proprietary technology on the cluster or in the data path.

02/ VISUALIZE & REPORT
Report data usage (heat), small files, user activity, replication, and HDFS tier utilization. Customize rules and queries to properly utilize infrastructure and plan better for future scale.

03/ AUTOMATE OPTIMIZATION
Automatically archive, promote, or change the replication factor of data based on usage patterns and user-defined rules.

Page 6: Hadoop Archive and Tiering


FactorData Automates HDFS Tiering with HDFSplus

1. Query list based on size, heat, activity, and age
2. Apply storage policy based on custom query
3. Files are optimized during normal balancing window

FactorData Archive Tiering Example:

Custom Query Example:
• Move all files 120 days old and not accessed for 90 days to ARCHIVE…

Automated Tiering:
• FactorData creates a data list based on the query
• Limit automated runs by max files or capacity
• FactorData tracks completion of each run
• Data can be excluded from a run according to path, size, and application

Page 7: Hadoop Archive and Tiering


FactorData HDFSplus Architecture

Completely out of the data path: FactorData HDFSplus sits outside the Hadoop cluster and collects only metadata from the Hadoop cluster.

No software to install on the existing Hadoop cluster: because HDFSplus leverages only existing Hadoop APIs and features, there is no software to install on the cluster.

A highly scalable solution in a small footprint: HDFS visibility and automation for thousands of Hadoop nodes on a single node, VM, or server.

HDFSplus communicates with the Namenodes via the existing Hadoop API.

Suggested footprint: VM or physical machine, 32 GB RAM, 4 CPUs or vCPUs, 500 GB free disk.
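Metadata-only collection over the existing API could look like the following. This is a generic sketch against the stock WebHDFS REST interface (the `LISTSTATUS` operation), not FactorData's implementation; the parsing runs here against a hand-written sample response:

```python
import json

# Parse a WebHDFS LISTSTATUS response into flat metadata records.
# LISTSTATUS is a stock HDFS REST operation on the Namenode, e.g.:
#   GET http://<namenode>:50070/webhdfs/v1/<dir>?op=LISTSTATUS
# (port 9870 on Hadoop 3). The sample below is illustrative only.

SAMPLE = json.loads("""
{"FileStatuses": {"FileStatus": [
  {"pathSuffix": "part-0", "type": "FILE", "length": 1073741824,
   "accessTime": 1450000000000, "modificationTime": 1440000000000,
   "replication": 3},
  {"pathSuffix": "part-1", "type": "FILE", "length": 0,
   "accessTime": 1480000000000, "modificationTime": 1480000000000,
   "replication": 3}
]}}
""")

def parse_liststatus(response: dict, parent: str) -> list[dict]:
    """Flatten FileStatus entries into records for tiering analysis."""
    records = []
    for st in response["FileStatuses"]["FileStatus"]:
        if st["type"] != "FILE":
            continue  # directories are listed recursively elsewhere
        records.append({
            "path": f"{parent}/{st['pathSuffix']}",
            "size": st["length"],
            "access_ms": st["accessTime"],       # last access (epoch ms)
            "mtime_ms": st["modificationTime"],  # last write (epoch ms)
            "replication": st["replication"],
        })
    return records

print(parse_liststatus(SAMPLE, "/data/logs"))
```

Because only these metadata fields leave the cluster, the collector stays out of the block data path entirely.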

Page 8: Hadoop Archive and Tiering


Simplify and Automate Archive and Tiering in Hadoop Today
• Move less-accessed data to storage-dense nodes for better utilization
• Lower software licensing costs
• Free resources on existing namenodes and datanodes

FactorData Tiering & Archive on Hadoop

How can we get more performance out of our existing Hadoop cluster?

How can we move data not accessed for 90 days to archive nodes?

How can we better plan for future scale with real Hadoop storage metrics?

Result: Better Performance, Lower Hardware Costs, Lower Software Costs

Plus: Get Necessary Storage Visibility To Answer These Questions & More with FactorData HDFSplus

Page 9: Hadoop Archive and Tiering


Thank You

Visit us at: http://www.factordata.com