Execution Environments for Distributed Computing
Self-Adapting, Energy-Conserving Distributed
File Systems
EEDC 34330
European Master in Distributed Computing - EMDC
EEDC Presentation
Mário Almeida - 4knahs[@]gmail.com
www.marioalmeida.eu
*
Outline
● Introduction
○ Green Computing
○ Distributed File Systems
○ DFS issues
● Hadoop Distributed File System
○ Overview
○ Evaluation
● Green HDFS
○ Overview
○ Design
○ Goal
○ Energy-management policies
○ Machine learning
○ Evaluation
● Conclusions
● References
*
Introduction - Green Computing
● Environmentally sustainable computing with minimal impact on the environment.
● Reduces energy consumption, greenhouse gas emissions and operational costs.
*
Introduction - Distributed FS
● A Distributed File System (DFS) is any file system that allows multiple hosts to access and share files over a computer network.
● May include facilities for transparent replication and fault tolerance.
*
Introduction - DFS Issues
● Distributed File Systems are often built to run on a large number of commodity servers.
● This means that:
○ the cluster generates heat and consumes large amounts of energy.
○ total costs depend not only on the initial acquisition but also on power, cooling, etc.
*
Introduction - DFS Issues
● Common approach:
○ Scale-down: transitioning servers into low-power states.
○ Other approaches, not exclusive to DFS, include renewable energy, free cooling, etc.
*
HDFS Overview
● Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
● HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
*
HDFS Evaluation
In 2010, a detailed analysis of files was carried out on a production Yahoo! Hadoop cluster with the following characteristics:
● 2600 servers
● 34 million files
● Over 5 PB of data
● 3 months of observation
*
HDFS Evaluation
Key observations:
● Files are heterogeneous in access and lifespan patterns.
● 60% of data is "cold" or dormant.
● 95-98% of files have a very short "hotness" lifespan of less than 3 days.
● 90% of files were dormant or "cold" for more than 18 days.
● The majority of the data had a news-server-like access pattern.
*
GHDFS Overview
● Self-adaptive: depends only on HDFS and file access patterns.
● Applies data-classification techniques.
● Energy-aware placement of data.
● Trades off cost, performance and power by separating the cluster into logical zones.
*
GHDFS Design
Hot Zone
● Files currently accessed and newly created
● High energy usage and performance
Cold Zone
● Files with low to rare access
● Low energy use and sleeping mode
*
GHDFS - Management Policies
GreenHDFS uses three different management policies:
● FMP - File Migration Policy
● SCP - Server Power Conserver Policy
● FRP - File Reversal Policy
[Diagram: Hot Zone and Cold Zone]
*
GHDFS - File Migration Policy
[Diagram: Hot Zone and Cold Zone, with transitions labelled "Coldness > Threshold" and "Hotness > Threshold"]
● FMP monitors the dormancy of files.
● Runs in the Hot Zone.
● Gives higher storage efficiency for the Hot Zone, as less-accessed files are moved to the Cold Zone.
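The migration rule can be sketched as follows. This is a minimal illustration, not the GreenHDFS implementation: the `FileInfo` record, the day-based dormancy measure and the 18-day threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    # Hypothetical per-file record; field names are illustrative only.
    path: str
    last_access_day: int  # day number of the most recent access
    zone: str = "hot"


def run_file_migration(files, today, coldness_threshold_days):
    """Move Hot Zone files whose dormancy exceeds the threshold to the Cold Zone."""
    migrated = []
    for f in files:
        dormancy = today - f.last_access_day  # days since the last access
        if f.zone == "hot" and dormancy > coldness_threshold_days:
            f.zone = "cold"
            migrated.append(f.path)
    return migrated


files = [
    FileInfo("/logs/2010-01-01.log", last_access_day=1),
    FileInfo("/logs/2010-01-29.log", last_access_day=29),
]
# Assumed threshold of 18 days, evaluated on day 30.
moved = run_file_migration(files, today=30, coldness_threshold_days=18)
print(moved)
```

Only the file that has been dormant for longer than the threshold is migrated; the recently accessed file stays in the Hot Zone.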
*
GHDFS - Power Conserver Policy
● SCP runs in the Cold Zone.
● Determines which servers can go into standby/sleep mode.
● Uses hardware techniques to transition CPU, disks and DRAM into a low-power state.
● Wakes a server up only if:
○ Data on that server is accessed
○ New data needs to be placed on that server
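A rough sketch of this sleep/wake behaviour is shown below. The `ColdServer` record, the idle-time criterion and the 24-hour limit are assumptions made for illustration; the slides do not state how SCP decides that a server is idle.

```python
from dataclasses import dataclass


@dataclass
class ColdServer:
    # Hypothetical Cold Zone server record; field names are illustrative only.
    name: str
    idle_hours: float
    asleep: bool = False


def apply_power_policy(servers, idle_limit_hours):
    """Put Cold Zone servers that have been idle long enough to sleep."""
    slept = []
    for s in servers:
        if not s.asleep and s.idle_hours >= idle_limit_hours:
            s.asleep = True
            slept.append(s.name)
    return slept


def on_request(server):
    """Wake a server when its data is accessed or new data is placed on it.

    Returns True if the server had to be woken up.
    """
    woke = server.asleep
    server.asleep = False
    server.idle_hours = 0.0
    return woke


servers = [ColdServer("cold-1", idle_hours=48), ColdServer("cold-2", idle_hours=2)]
slept = apply_power_policy(servers, idle_limit_hours=24)  # assumed idle limit
woke = on_request(servers[0])  # an access to cold-1 wakes it up again
print(slept, woke)
```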
*
GHDFS - File Reversal Policy
● FRP runs in the Cold Zone.
● Ensures that QoS, bandwidth and response time are well managed in case a file becomes popular.
[Diagram: Hot Zone and Cold Zone, with a transition labelled "#accesses > Threshold"]
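The reversal rule can be sketched in a few lines. The dictionary-based bookkeeping and the threshold value are assumptions for illustration only, not the GreenHDFS data structures.

```python
from collections import Counter


def run_file_reversal(zones, access_counts, access_threshold):
    """Move Cold Zone files back to the Hot Zone once they become popular.

    zones: dict mapping file path -> "hot" or "cold" (hypothetical bookkeeping)
    access_counts: Counter of accesses observed per file path
    """
    reversed_paths = []
    for path, zone in zones.items():
        if zone == "cold" and access_counts[path] > access_threshold:
            zones[path] = "hot"
            reversed_paths.append(path)
    return reversed_paths


zones = {"/data/a": "cold", "/data/b": "cold", "/data/c": "hot"}
access_counts = Counter({"/data/a": 12, "/data/b": 1})
# Assumed threshold: more than 5 accesses marks a file as popular again.
promoted = run_file_reversal(zones, access_counts, access_threshold=5)
print(promoted)
```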
*
GHDFS - Machine Learning
● Machine learning designs and develops algorithms that allow computers to evolve behaviors based on empirical data.
● These algorithms recognize patterns and make decisions based on data.
*
GHDFS - Machine Learning
GHDFS uses:
● Supervised machine learning.
● A variant of Multiple Linear Regression to find the statistical correlation between directory and file attributes.
● Training data prepared from audit logs and metadata.
● Prediction of a file's lifespan, size and heat upon creation of the file.
This works because there is a high correlation between the directory hierarchy and file attributes in a well-laid-out and partitioned namespace.
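To illustrate the idea of predicting a file attribute from directory attributes, here is a sketch of plain multiple linear regression fitted by ordinary least squares. It is not the variant used by GreenHDFS: the two features (directory depth and a "temporary directory" flag) and the training values are invented for the example.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [[1.0] + list(x) for x in X]  # prepend an intercept column
    n = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, x):
    """Apply the fitted weights (intercept first) to one feature vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


# Hypothetical training set: (directory depth, is temporary directory)
# mapped to an observed file lifespan in days. All values are made up.
X = [(1, 0), (2, 0), (3, 1), (4, 1), (2, 1)]
y = [12.0, 14.0, 11.0, 13.0, 9.0]

weights = fit_linear_regression(X, y)
lifespan = predict(weights, (3, 0))  # new file at depth 3, not temporary
print(round(lifespan, 2))
```

In a real setting the features would be derived from the audit logs and namespace metadata mentioned above, and separate models could be fitted for lifespan, size and heat.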
*
GHDFS - Evaluation
● Energy consumption was reduced by 24%, saving $2.1 million in energy costs per annum (38000 servers).
● Maximizes the usage of the power budget by allowing the infrastructure to expand. More Hot Zone servers offer more availability and performance.
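As a quick sanity check on these figures, the arithmetic below derives the saving per server. The implied baseline bill is only valid under the assumption that the 24% reduction and the $2.1 million refer to the same 38000-server energy bill, which the slide does not state explicitly.

```python
annual_savings_usd = 2_100_000  # from the slide
servers = 38_000                # from the slide
reduction = 0.24                # from the slide

# Average saving per server per year.
savings_per_server = annual_savings_usd / servers

# Assumption: both figures describe the same baseline energy bill.
implied_baseline_usd = annual_savings_usd / reduction

print(f"per server: ${savings_per_server:.2f} per year")
print(f"implied baseline: ${implied_baseline_usd:,.0f} per year")
```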
*
Conclusions
● Machine learning can be applied to build a predictive, self-managed energy control system that achieves better results than reactive approaches.
● Good energy-management policies can result in high savings in energy consumption.
● Data-classification techniques can help achieve better energy-aware placement of data in Distributed File Systems.
● The presented techniques, applied in conjunction with other more common green computing technologies, can significantly impact the maintenance costs of the cluster.
*
References
● GreenHDFS: Towards an Energy-Conserving, Storage-Efficient, Hybrid Hadoop Compute Cluster
● Evaluation and Analysis of GreenHDFS: A Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System
● Predictive Data and Energy Management in GreenHDFS
● The Hadoop Distributed File System
● Introduction to Machine Learning (Adaptive Computation and Machine Learning)