development history data management in hadoop
TRANSCRIPT
Development of history data management in Hadoop
Johan Gustavsson
WHO AM I?
• Name: Johan Gustavsson
• Born: Sweden
• At Treasure Data since August last year
• Hadoop Team
• Previously at DeNA
• Hadoop Infra Team
JOBTRACKER
JT - DATA RETENTION
• Container logs managed by TaskTrackers
• When they are flushed and purged feels rather random
• Usually the logs you want the most were already deleted
• Especially when the container dies for one reason or another
• Parameters kept in JT memory
• Purged after an arbitrary time period
• No REST API
• Only way to collect data for monitoring is to parse the HTML pages
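Because the JobTracker exposes no REST API, monitoring meant scraping its HTML status pages. A minimal sketch of that approach, using a simplified stand-in page (a real jobtracker.jsp differs between Hadoop versions, so the pattern would need adjusting):

```python
# Hedged sketch: scraping job counts from a JobTracker-style HTML page.
# SAMPLE_PAGE is a simplified stand-in, not real JobTracker markup.
import re

SAMPLE_PAGE = """
<html><body>
<h2>Cluster Summary</h2>
<table><tr><th>Running Jobs</th><th>Failed Jobs</th></tr>
<tr><td>12</td><td>3</td></tr></table>
</body></html>
"""

def parse_cluster_summary(html: str) -> dict:
    """Pull the numeric table cells out of the summary page."""
    cells = re.findall(r"<td>(\d+)</td>", html)
    return {"running": int(cells[0]), "failed": int(cells[1])}

print(parse_cluster_summary(SAMPLE_PAGE))  # {'running': 12, 'failed': 3}
```

This is exactly the kind of brittle screen-scraping a proper REST endpoint would make unnecessary.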
JT - MEMORY LEAK
MR2 HISTORY SERVER
MRV.2 HS - WHAT IT IS
• A server to manage historic data for MapReduce jobs running on top of Yarn
• Manages container logs, metrics, and job configs
• Stores data on HDFS
• Handles indexing, serving and purging of historic data
MRV.2 HS - THE GOOD PARTS
• Keeping historic data puts no strain on the RM
• All historic data managed with a centralised method
• Container logs are collected and saved to HDFS
• Metric data is collected and stored on HDFS
• REST API available to get job metrics
• Historic data purging settings are clear and functioning
• With separate configs, several servers can run on one cluster at once, serving different clients
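The REST API mentioned above serves job metrics as JSON; the listing endpoint is commonly `/ws/v1/history/mapreduce/jobs` on the History Server's web port (19888 by default, but verify both against your Hadoop version). A sketch that builds the URL and parses a trimmed sample response, with no live cluster needed:

```python
# Hedged sketch: reading job metrics from the MRv2 History Server REST API.
# Endpoint path and port are common defaults; SAMPLE_RESPONSE is a trimmed,
# hand-written stand-in for the real JSON payload.
import json

def history_jobs_url(host: str, port: int = 19888) -> str:
    return "http://%s:%d/ws/v1/history/mapreduce/jobs" % (host, port)

SAMPLE_RESPONSE = json.dumps({
    "jobs": {"job": [
        {"id": "job_1445919751965_0001", "state": "SUCCEEDED"},
    ]}
})

def job_states(body: str) -> dict:
    """Map job id -> final state from a jobs listing response."""
    return {j["id"]: j["state"] for j in json.loads(body)["jobs"]["job"]}

# On a live cluster one would fetch history_jobs_url(...) with urllib;
# here we parse the canned sample instead.
print(job_states(SAMPLE_RESPONSE))
```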
MRV.2 HS - THE NOT SO GOOD PARTS
• Keeping history for too long puts a strain on the NameNode
• Lots of small files from metrics data
• At least earlier versions have issues with memory leaks
• No HA support
• Pig and Hive use the MRv.2 HS during DAG execution
• If this one process goes down all Pig and Hive jobs stop running
• Due to design choices, HA workarounds don’t work well either
• Data is stored in a dir-tree like this “/jobhistory/done/{year}/{month}/{day}/{offset}/{jobid}”
• Metadata is stored in memory
• Only supports MRv.2 jobs history data
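The “/jobhistory/done/{year}/{month}/{day}/{offset}/{jobid}” layout quoted above means every finished job adds entries the NameNode must track, which is why long retention strains it. A sketch of how such a path could be built; the assumption that {offset} buckets jobs by serial number in groups of 1000 is illustrative, not taken from the talk:

```python
# Hedged sketch of the history dir-tree layout. The {offset} bucket is
# ASSUMED here to group job serial numbers 1000 per directory; treat the
# bucketing rule as illustrative.
from datetime import date

def history_path(finish: date, job_id: str, root: str = "/jobhistory/done") -> str:
    serial = int(job_id.rsplit("_", 1)[-1])   # e.g. 2871108 from the job id
    bucket = "%06d" % (serial // 1000)        # assumed: 1000 jobs per {offset} dir
    return "%s/%04d/%02d/%02d/%s/%s" % (
        root, finish.year, finish.month, finish.day, bucket, job_id)

print(history_path(date(2015, 10, 27), "job_1445919751965_2871108"))
# /jobhistory/done/2015/10/27/002871/job_1445919751965_2871108
```

One path (plus the small metric files inside it) per job is exactly the small-file growth pattern the NameNode handles poorly.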
TIMELINE SERVER
TLS - WHAT IT IS
• A server that collects both framework-centric and generic data from applications running on Yarn
• Storing it either in memory or LevelDB
• REST API to access that data, and a web UI with support for generic data
• Application data and container log purging management
• Original plans also contained support for
• Full HA
• Auxiliary services to plug in framework specific frontends
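Applications publish data to the Timeline Server as JSON entities over its REST API (commonly `/ws/v1/timeline` on port 8188; verify for your version). A sketch of the rough shape of a generic entity; the exact field names should be checked against the TimelineEntity serialization in your release:

```python
# Hedged sketch: the approximate JSON shape of a TLS v1 generic entity.
# Field names ("entity", "entitytype", "primaryfilters", ...) are believed
# to match the TimelineEntity serialization, but verify per version.
import json
import time

def make_entity(entity_id: str, entity_type: str, **primary_filters) -> dict:
    return {
        "entity": entity_id,
        "entitytype": entity_type,
        "starttime": int(time.time() * 1000),   # epoch millis
        "primaryfilters": {k: [v] for k, v in primary_filters.items()},
        "events": [],
    }

e = make_entity("application_1445919751965_0001", "MY_APP", user="johan")
print(json.dumps(e, sort_keys=True))
```

Because any framework can post entities of its own type, one server can hold history for Tez, Spark, MR, and so on, which is the generality the JT and MRv.2 HS lacked.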
TLS - THE GOOD PARTS
• It does support any Yarn-based framework
• It supports a unified endpoint for generic metrics
• Application-centric data is supported too, but only accessible through REST
• Handles the data lifecycle of both metrics and container logs
• No outside dependencies needed
TLS - THE NOT SO GOOD PARTS
• No framework-centric views yet (delayed due to API changes)
• No auxiliary view support yet either
• No HA support yet
• The ResourceManager proxy service still forwards MR jobs to the MRv.2 HS
• Still can’t replace MRv.2 HS on MR centric clusters
• Generic data is written by the RM directly to the Timeline Server, so there is no possibility of running more than one
RM GENERIC HISTORY DATA
• A method for the RM to publish generic history data to the TimeLine server
• Early adopters used `FileSystemApplicationHistoryStore.class` to write data to HDFS
• The Timeline server then served its content
• Recent ones post directly to the TimeLine server using its client
• Early adopters using `FileSystemApplicationHistoryStore.class` could run into the following error, since there was no purging built in
`NameSystem.startFile: /var/log/hadoop-yarn/yarn/system/history/ApplicationHistoryDataRoot/application_1445919751965_2871108 The directory item limit of /var/log/hadoop-yarn/yarn/system/history/ApplicationHistoryDataRoot is exceeded: limit=1048576 items=1048576`
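With no purging built in, operators hitting that directory-item limit had to clean the history directory themselves. A sketch of the selection logic such a cleanup job could use; a real job would list the HDFS directory (e.g. via WebHDFS or `hdfs dfs -ls`) and delete the returned paths, while here we work on plain (path, mtime) pairs:

```python
# Hedged sketch: picking stale application-history entries for deletion,
# since FileSystemApplicationHistoryStore had no built-in purging.
# Input is a list of (path, modification_time_in_epoch_millis) pairs.
RETENTION_MS = 7 * 24 * 3600 * 1000  # assumed policy: keep one week

def stale_entries(listing, now_ms, retention_ms=RETENTION_MS):
    """Return the paths whose mtime falls before the retention cutoff."""
    cutoff = now_ms - retention_ms
    return [path for path, mtime_ms in listing if mtime_ms < cutoff]

ROOT = "/var/log/hadoop-yarn/yarn/system/history/ApplicationHistoryDataRoot"
listing = [
    (ROOT + "/application_1445919751965_0001", 1_000),          # ancient
    (ROOT + "/application_1445919751965_2871108", 999_999_999), # recent
]
print(stale_entries(listing, now_ms=1_000_000_000))
```

Running something like this on a schedule keeps the directory under the 1048576-item limit seen in the error above.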
TIMELINE SERVER V2
THE PLAN
• New API, meaning services written for TLSv.1 might stop working
• Use scalable storage (current talks seem to be heading in the direction of HBase)
• Support for HA configuration (through the use of HBase)
• Auxiliary view support
MY HOPES &amp; EXPECTATIONS
• Addition of an MR auxiliary service
• If HBase becomes the default storage, keep the storage engine pluggable
• An HDFS-based storage option similar to that of the MRv.2 HS
• For HA support, limit outside systems to ZooKeeper (same as RM and NN)
THANKS FOR LISTENING