development history data management in hadoop
TRANSCRIPT
Development of history data management in Hadoop
Johan Gustavsson
WHO AM I?
• Name: Johan Gustavsson
• Born: Sweden
• At Treasure Data since August last year
• Hadoop Team
• Previously at DeNA
• Hadoop Infra Team
JOBTRACKER
JT - DATA RETENTION
• Container logs managed by TaskTrackers
• When they are flushed and purged feels rather random
• Usually the logs you want the most were already deleted
• Especially when the container dies for one reason or another
• Parameters kept in JT memory
• Purged after an arbitrary time period
• No REST API
• Only way to collect data for monitoring is to parse the HTML pages
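Because the JobTracker exposes no REST API, monitoring meant scraping its HTML status pages. A minimal sketch of that approach, using a simplified stand-in page (a real jobtracker.jsp differs between Hadoop versions, so the pattern would need adjusting):

```python
# Hedged sketch: scraping job counts from a JobTracker-style HTML page.
# SAMPLE_PAGE is a simplified stand-in, not real JobTracker markup.
import re

SAMPLE_PAGE = """
<html><body>
<h2>Cluster Summary</h2>
<table><tr><th>Running Jobs</th><th>Failed Jobs</th></tr>
<tr><td>12</td><td>3</td></tr></table>
</body></html>
"""

def parse_cluster_summary(html: str) -> dict:
    """Pull the numeric table cells out of the summary page."""
    cells = re.findall(r"<td>(\d+)</td>", html)
    return {"running": int(cells[0]), "failed": int(cells[1])}

print(parse_cluster_summary(SAMPLE_PAGE))  # {'running': 12, 'failed': 3}
```

This is exactly the kind of brittle screen-scraping a proper REST endpoint would make unnecessary.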
JT - MEMORY LEAK
MR2 HISTORY SERVER
MRV.2 HS - WHAT IT IS
• A server to manage historic data for MapReduce jobs running on top of Yarn
• Manages container logs, metrics, and job configs
• Stores data on HDFS
• Handles indexing, serving and purging of historic data
MRV.2 HS - THE GOOD PARTS
• Keeping historic data puts no strain on the RM
• All historic data managed with a centralised method
• Container logs are collected and saved to HDFS
• Metric data is collected and stored on HDFS
• REST API available to get job metrics
• Historic data purging settings are clear and functioning
• With separate configs, several servers can run on one cluster at once, serving different clients
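The REST API mentioned above serves job metrics as JSON; the listing endpoint is commonly `/ws/v1/history/mapreduce/jobs` on the History Server's web port (19888 by default, but verify both against your Hadoop version). A sketch that builds the URL and parses a trimmed sample response, with no live cluster needed:

```python
# Hedged sketch: reading job metrics from the MRv2 History Server REST API.
# Endpoint path and port are common defaults; SAMPLE_RESPONSE is a trimmed,
# hand-written stand-in for the real JSON payload.
import json

def history_jobs_url(host: str, port: int = 19888) -> str:
    return "http://%s:%d/ws/v1/history/mapreduce/jobs" % (host, port)

SAMPLE_RESPONSE = json.dumps({
    "jobs": {"job": [
        {"id": "job_1445919751965_0001", "state": "SUCCEEDED"},
    ]}
})

def job_states(body: str) -> dict:
    """Map job id -> final state from a jobs listing response."""
    return {j["id"]: j["state"] for j in json.loads(body)["jobs"]["job"]}

# On a live cluster one would fetch history_jobs_url(...) with urllib;
# here we parse the canned sample instead.
print(job_states(SAMPLE_RESPONSE))
```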
MRV.2 HS - THE NOT SO GOOD PARTS
• Keeping history for too long puts a strain on the NameNode
• Lots of small files from metrics data
• At least earlier versions have issues with memory leaks
• No HA support
• Pig and Hive use the MRv.2 HS during DAG execution
• If this one process goes down all Pig and Hive jobs stop running
• Due to design choices, HA workarounds don’t work well either
• Data is stored in a dir-tree like this “/jobhistory/done/{year}/{month}/{day}/{offset}/{jobid}”
• Metadata is stored in memory
• Only supports MRv.2 jobs history data
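The “/jobhistory/done/{year}/{month}/{day}/{offset}/{jobid}” layout quoted above means every finished job adds entries the NameNode must track, which is why long retention strains it. A sketch of how such a path could be built; the assumption that {offset} buckets jobs by serial number in groups of 1000 is illustrative, not taken from the talk:

```python
# Hedged sketch of the history dir-tree layout. The {offset} bucket is
# ASSUMED here to group job serial numbers 1000 per directory; treat the
# bucketing rule as illustrative.
from datetime import date

def history_path(finish: date, job_id: str, root: str = "/jobhistory/done") -> str:
    serial = int(job_id.rsplit("_", 1)[-1])   # e.g. 2871108 from the job id
    bucket = "%06d" % (serial // 1000)        # assumed: 1000 jobs per {offset} dir
    return "%s/%04d/%02d/%02d/%s/%s" % (
        root, finish.year, finish.month, finish.day, bucket, job_id)

print(history_path(date(2015, 10, 27), "job_1445919751965_2871108"))
# /jobhistory/done/2015/10/27/002871/job_1445919751965_2871108
```

One path (plus the small metric files inside it) per job is exactly the small-file growth pattern the NameNode handles poorly.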
TIMELINE SERVER
TLS - WHAT IT IS
• A server that collects both framework-centric and generic data from applications running on Yarn
• Storing it either in memory or LevelDB
• REST API to access that data, and a web UI with support for generic data
• Application data and container log purging management
• Original plans also contained support for
• Full HA
• Auxiliary services to plug in framework specific frontends
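Applications publish data to the Timeline Server as JSON entities over its REST API (commonly `/ws/v1/timeline` on port 8188; verify for your version). A sketch of the rough shape of a generic entity; the exact field names should be checked against the TimelineEntity serialization in your release:

```python
# Hedged sketch: the approximate JSON shape of a TLS v1 generic entity.
# Field names ("entity", "entitytype", "primaryfilters", ...) are believed
# to match the TimelineEntity serialization, but verify per version.
import json
import time

def make_entity(entity_id: str, entity_type: str, **primary_filters) -> dict:
    return {
        "entity": entity_id,
        "entitytype": entity_type,
        "starttime": int(time.time() * 1000),   # epoch millis
        "primaryfilters": {k: [v] for k, v in primary_filters.items()},
        "events": [],
    }

e = make_entity("application_1445919751965_0001", "MY_APP", user="johan")
print(json.dumps(e, sort_keys=True))
```

Because any framework can post entities of its own type, one server can hold history for Tez, Spark, MR, and so on, which is the generality the JT and MRv.2 HS lacked.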
TLS - THE GOOD PARTS
• It does support any Yarn-based framework
• It supports a unified endpoint for generic metrics
• Application-centric data is supported too, but only accessible through REST
• Handles the data lifecycle of both metrics and container logs
• No outside dependencies needed
TLS - THE NOT SO GOOD PARTS
• No framework-centric views yet (delayed due to API changes)
• No auxiliary view support yet either
• No HA support yet
• The ResourceManager proxy service still forwards MR jobs to the MRv.2 HS
• Still can’t replace MRv.2 HS on MR centric clusters
• Generic data is written by the RM directly to the Timeline Server, so there is no possibility of running more than one
RM GENERIC HISTORY DATA
• A method for the RM to publish generic history data to the TimeLine server
• Early adopters used `FileSystemApplicationHistoryStore.class` to write data to HDFS
• The Timeline server then served its content
• Recent ones post directly to the TimeLine server using its client
• Early adopters using `FileSystemApplicationHistoryStore.class` could run into the following error, since there was no purging built in
`NameSystem.startFile: /var/log/hadoop-yarn/yarn/system/history/ApplicationHistoryDataRoot/application_1445919751965_2871108 The directory item limit of /var/log/hadoop-yarn/yarn/system/history/ApplicationHistoryDataRoot is exceeded: limit=1048576 items=1048576`
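With no purging built in, operators hitting that directory-item limit had to clean the history directory themselves. A sketch of the selection logic such a cleanup job could use; a real job would list the HDFS directory (e.g. via WebHDFS or `hdfs dfs -ls`) and delete the returned paths, while here we work on plain (path, mtime) pairs:

```python
# Hedged sketch: picking stale application-history entries for deletion,
# since FileSystemApplicationHistoryStore had no built-in purging.
# Input is a list of (path, modification_time_in_epoch_millis) pairs.
RETENTION_MS = 7 * 24 * 3600 * 1000  # assumed policy: keep one week

def stale_entries(listing, now_ms, retention_ms=RETENTION_MS):
    """Return the paths whose mtime falls before the retention cutoff."""
    cutoff = now_ms - retention_ms
    return [path for path, mtime_ms in listing if mtime_ms < cutoff]

ROOT = "/var/log/hadoop-yarn/yarn/system/history/ApplicationHistoryDataRoot"
listing = [
    (ROOT + "/application_1445919751965_0001", 1_000),          # ancient
    (ROOT + "/application_1445919751965_2871108", 999_999_999), # recent
]
print(stale_entries(listing, now_ms=1_000_000_000))
```

Running something like this on a schedule keeps the directory under the 1048576-item limit seen in the error above.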
TIMELINE SERVER V2
THE PLAN
• New API, meaning services written for TLSv.1 might stop working
• Use scalable storage (current talks seem to be heading in the direction of HBase)
• Support for HA configuration (through the use of HBase)
• Auxiliary view support
MY HOPES &amp; EXPECTATIONS
• Addition of an MR auxiliary service
• If HBase becomes the default storage, keep the storage engine pluggable
• An HDFS-based storage option similar to that of the MRv.2 HS
• For HA support, limit outside systems to ZooKeeper (same as RM and NN)
THANKS FOR LISTENING