HDFS - Hadoop and Solid State Drives


These are my musings related to the Apache Hadoop Project.

Saturday, May 5, 2012

Hadoop and Solid State Drives

Is there a story for the Hadoop storage stack (HDFS+HBase) on Solid State Drives (SSDs)? This is a question that I have been asked by quite a few people in the last two days, mostly by people at the Open Compute Summit. This piece discusses the possible use cases of using SSDs with Hadoop or HBase.

    Use Case

Currently, there are two primary use cases for HDFS: data warehousing using map-reduce and a key-value store via HBase. In the data warehouse case, data is mostly accessed sequentially from HDFS, so there isn't much benefit from using an SSD to store data. In a data warehouse, a large portion of queries access only recent data, so one could argue that keeping the last few days of data on SSDs could make queries run faster. But most of our map-reduce jobs are CPU bound (decompression, deserialization, etc.) and bottlenecked on map-output-fetch; reducing the data access time from HDFS does not impact the latency of a map-reduce job. Another use case would be to put map outputs on SSDs; this could potentially reduce map-output-fetch times, and it is one option that needs some benchmarking.

For the second use case, HDFS+HBase could theoretically use the full potential of SSDs to make online-transaction-processing workloads run faster. This is the use case that the rest of this blog post tries to address.

    Background

The read/write latency of data on an SSD is orders of magnitude smaller than the read/write latency of spinning disk storage; this is especially true for random reads and writes. For example, a random read from an SSD takes about 30 microseconds while a random read from a spinning disk takes 5 to 10 milliseconds. Also, an SSD device can support 100K to 200K operations/sec while a spinning disk controller can possibly issue only 200 to 300 ops/sec. This means that random reads/writes are not a bottleneck on SSDs. On the other hand, most of our existing database technology is designed to store data on spinning disks, so the natural question is: can these databases harness the full potential of SSDs? To answer this question, we ran two separate artificial random-read workloads, one on HDFS and one on HBase. The goal was to stretch these products to the limit and establish their maximum sustainable throughput on SSDs.

    HDFS random-read on cached data

In the first experiment, we created an HDFS cluster with a single NameNode and a single DataNode. We created a 2 GB HDFS file with an HDFS block size of 256 MB and a replication factor of 1. We configured the DataNode to run on 16 hyper-threaded cores, and it stored block data on xfs. Our benchmark program was co-located on the DataNode machine and had hdfs-read-shortcircuit switched on, i.e. the DFSClient bypassed the DataNode and issued read calls directly to the local xfs filesystem. The entire 2 GB of data was cached in the OS buffer cache, and this benchmark did not trigger any IO from disk. The fact that all the data was in the OS cache essentially simulated the behavior of an ultra-fast SSD. We varied the number of client threads, and each client thread did a pre-configured number of 16K read calls from HDFS. Since there were only 8 blocks in the file, the DFSClient cached the locations of all 8 blocks and there were no repeated calls to the NameNode. The first few iterations of this test showed that HDFS can sustain a maximum random-read throughput of around 50K ops/sec, but surprisingly the CPU was not maxed out. We found that the read-shortcircuit code path spent considerable time in DNS lookup calls and updating metric counters. We fixed these two pieces of code and observed that HDFS could sustain a peak random-read throughput of around 92K ops/sec; the CPUs were now close to 95% usage. HdfsPreadImage is a plot that captures this scenario. The takeaway is that a database that is layered above HDFS would not be able to utilize all the iops offered by a single SSD.
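For readers who want to try a similar measurement, here is a minimal sketch of such a pread benchmark. It is illustrative rather than the exact harness we used: the file path, the thread-pool structure and the short-circuit configuration key are assumptions, and the key name in particular differs across Hadoop versions.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPreadBench {
  public static void main(String[] args) throws Exception {
    final int threads = Integer.parseInt(args[0]);        // number of client threads
    final int readsPerThread = Integer.parseInt(args[1]); // 16K reads issued by each thread
    final long fileSize = 2L * 1024 * 1024 * 1024;        // 2 GB benchmark file
    final int readSize = 16 * 1024;                       // 16K positional reads

    Configuration conf = new Configuration();
    // Let the DFSClient bypass the DataNode for local blocks; the key name varies by version.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    final FileSystem fs = FileSystem.get(conf);
    final Path path = new Path("/bench/data2g");          // hypothetical pre-created file

    final AtomicLong totalReads = new AtomicLong();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    long start = System.nanoTime();
    for (int i = 0; i < threads; i++) {
      pool.submit(() -> {
        byte[] buf = new byte[readSize];
        try (FSDataInputStream in = fs.open(path)) {
          for (int r = 0; r < readsPerThread; r++) {
            // Random aligned offset; a positional read does not move the stream offset.
            long pos = ThreadLocalRandom.current().nextLong(fileSize / readSize) * readSize;
            in.readFully(pos, buf, 0, readSize);
            totalReads.incrementAndGet();
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%d reads in %.1f s = %.0f ops/sec%n",
        totalReads.get(), secs, totalReads.get() / secs);
  }
}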

A profiled run of the HDFS code shows that the DFSClient's code paths are quite long and cause an appreciable impact on throughput for cached random reads. If data-fetch times are in the millisecond range (from spinning disks), the long code paths in the DFSClient do not add appreciable overhead, but when the data is cached in the OS cache (or in SSDs), these code paths need some major rework. Another option would be to write an HDFS read-only client in C or C++, thereby avoiding some of the overhead of the current Java-based DFSClient.

    HBase random-get on cached data

In the second experiment, we ran a similar test on HBase. We created a single table with a single region, and all data was cached in the OS cache of a single HBase regionserver. The OS cache again simulates a super fast SSD device. We used a set of 4 client machines to drive random-get calls to the regionserver. The regionserver was configured to use a maximum of 2000 threads. The HBase table had lzo compression and delta-encoding-fast-diff enabled. Since the data set is cached in OS buffers, this benchmark does not cause any disk IO from spinning disks.


We saw that the HBase throughput maxed out at around 35K ops/sec, and we were not able to drive the CPU usage on that machine to more than 45%. Heavy lock contention and heavy context switching cause the regionserver to be unable to use all the available CPU on the machine. The detailed chart is at Cache4G.
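For illustration, a client driver for this kind of test can be sketched as below. It uses the current HBase client API, which is newer than the client we ran in this test, and the table name, numeric row-key encoding and pre-loaded row count are assumptions.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomGetBench {
  public static void main(String[] args) throws Exception {
    final int threads = Integer.parseInt(args[0]);
    final int getsPerThread = Integer.parseInt(args[1]);
    final long numRows = Long.parseLong(args[2]);         // rows pre-loaded into the table

    Configuration conf = HBaseConfiguration.create();
    final AtomicLong found = new AtomicLong();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      long start = System.nanoTime();
      for (int i = 0; i < threads; i++) {
        pool.submit(() -> {
          // The Connection is shared and thread-safe; each thread uses its own Table handle.
          try (Table table = conn.getTable(TableName.valueOf("bench"))) {
            for (int g = 0; g < getsPerThread; g++) {
              long row = ThreadLocalRandom.current().nextLong(numRows);
              Result r = table.get(new Get(Bytes.toBytes(row)));
              if (!r.isEmpty()) {
                found.incrementAndGet();
              }
            }
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
          return null;
        });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
      double secs = (System.nanoTime() - start) / 1e9;
      System.out.printf("%d gets in %.1f s = %.0f ops/sec%n",
          found.get(), secs, found.get() / secs);
    }
  }
}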

What does this mean?

The two experiments show that HBase+HDFS, as it stands today, will not be able to harness the full potential that is offered by SSDs. It is possible that some code restructuring could improve the random-read throughput of these solutions, but my guess is that it will need significant engineering time to make HBase+HDFS sustain a throughput of 200K ops/sec.

These results are not unique to HBase+HDFS. Experiments on other non-Hadoop databases show that they also need to be re-engineered to achieve SSD-capable throughputs. My conclusion is that database and storage technologies would need to be developed from scratch if we want to utilize the full potential of Solid State Devices. The search is on for these new technologies!


    Posted by Dhruba Borthakur at 10:48 PM 33 comments

    Labels: hadoop and solid state, hadoop and ssd, hdfs and ssd, SSD

Sunday, February 19, 2012

Salient features for a BigData Benchmark

    Recently, I was asked to write up about my vision of a BigData Benchmark. That begs the question: What is BigData? Does it

    refer to a dataset that is large in size, and if so, what is large? Does it refer to the type of data that such a data store

    contains? Shall we refer to BigData only if it does not conform to a relational schema? Here are some of my random thoughts.

Software industry professionals have started to use the term BigData to refer to data sets that are typically many orders of magnitude larger than traditional databases. The largest Oracle database or the largest NetApp filer could be many hundred terabytes at most, but BigData refers to storage sets that can scale to many hundred petabytes. Thus, the first and foremost characteristic of a BigData store is that a single instance of it can be many petabytes in size. These data stores can have a multitude of interfaces, from traditional SQL-like queries to customized key-value access methods. Some of them are batch systems while others are interactive systems. Again, some of them are organized for index-free full-scan access while others have fine-grained indexes and low-latency access. How can we design a benchmark (or benchmarks) for such a wide variety of data stores?

Most benchmarks focus on the latency and throughput of queries, and rightly so. However, in my opinion, the key to designing a BigData benchmark lies in understanding the deeper commonalities of these systems. A BigData benchmark should measure latencies and throughput, but with a great deal of variation in the workload, with skew in the data set, and in the presence of faults. I list below some of the common characteristics that distinguish BigData installations from other data storage systems.

    Elasticity of resources

A primary feature of a BigData system is that it should be elastic in nature. One should be able to add software and hardware resources when needed. Most BigData installations do not want to pre-provision for all the data that they might collect in the future, and the trick to being cost-efficient is to be able to add resources to a production store without incurring downtime. A BigData system typically has the ability to decommission parts of the hardware and software without off-lining the service, so that obsolete or defective hardware can be replaced dynamically. In my mind, this is one of the most important features of a BigData system, so a benchmark should be able to measure it. The benchmark should be such that we can add and remove resources from the system while the benchmark is executing.

    Fault Tolerance

The elasticity feature described above indirectly implies that the system has to be fault-tolerant. If a workload is running on your system and some parts of the system fail, the other parts of the system should configure themselves to share the work of the failed parts. This means that the service does not fail even in the face of some component failures. The benchmark should measure this aspect of BigData systems. One simple option could be that the benchmark itself introduces component failures as part of its execution.

    Skew in the data set

Many BigData systems take in un-curated data. That means there are always data points that are extreme outliers, and these introduce hotspots in the system. The workload on a BigData system is not uniform; some small parts of it are major hotspots and incur tremendously higher load than the rest of the system. Our benchmarks should be designed to operate on datasets that have large skew and should introduce workload hotspots.
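One simple way for a benchmark driver to produce such a workload is to route a disproportionate share of requests to a small hot set of keys. The sketch below is only illustrative; the 0.1% hot-key fraction and the 50% request share are tunable assumptions, not measurements from any real installation.

import java.util.concurrent.ThreadLocalRandom;

public class SkewedKeyChooser {
  private final long numKeys;
  private final long hotKeys;       // the small set of hot keys
  private final double hotShare;    // fraction of requests routed to the hot set

  public SkewedKeyChooser(long numKeys, double hotKeyFraction, double hotRequestShare) {
    this.numKeys = numKeys;
    this.hotKeys = Math.max(1, (long) (numKeys * hotKeyFraction));
    this.hotShare = hotRequestShare;
  }

  /** Returns the next key to query: most requests hammer the small hot set. */
  public long nextKey() {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    if (rnd.nextDouble() < hotShare) {
      return rnd.nextLong(hotKeys);                       // a hotspot key
    }
    return hotKeys + rnd.nextLong(numKeys - hotKeys);     // somewhere in the long cold tail
  }

  public static void main(String[] args) {
    // Illustrative skew: 0.1% of one million keys receive roughly half of all requests.
    SkewedKeyChooser chooser = new SkewedKeyChooser(1_000_000L, 0.001, 0.5);
    long hotHits = 0;
    for (int i = 0; i < 1_000_000; i++) {
      if (chooser.nextKey() < chooser.hotKeys) {
        hotHits++;
      }
    }
    System.out.println("requests that hit the hot 0.1% of keys: " + hotHits);
  }
}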

There have been a few previous attempts to define a unified benchmark for BigData. DeWitt and Stonebraker touched upon a few areas in their SIGMOD paper. They describe experiments that use a grep task, a join task and a simple SQL aggregation query. But none of those experiments are done in the presence of system faults, nor do they add or remove hardware while the experiment is in progress. Similarly, the YCSB benchmark proposed by Cooper and Ramakrishnan suffers from the same


    deficiency.

    How would I run the experiments proposed by Dewitt and Stonebraker? Here are some of my early thoughts:

Focus on a 100-node experiment only. This is the setting that is appropriate for BigData systems.

Increase the number of URLs such that the data set is at least a few hundred terabytes.

Make the benchmark run for at least an hour or so. The workload should be a set of multiple queries. Pace the workload so that there are constant fluctuations in the number of in-flight queries.

Introduce skew in the data set. The URL data should be such that maybe 0.1% of those URLs occur 1000 times more frequently than other URLs.

Introduce system faults by killing one of the 100 nodes once every minute, keep it shut down for a minute, then bring it back online, and continue the process with the remaining nodes until the entire benchmark is done (a sketch of such a fault injector follows this list).
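Such a fault injector does not need to be sophisticated. The sketch below is one way to do it, assuming password-less ssh from the benchmark driver to every worker node and a standard hadoop-daemon.sh control script; the install path shown is illustrative.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class FaultInjector {
  private static void ssh(String host, String command) throws Exception {
    // Assumes password-less ssh from the benchmark driver to every worker node.
    new ProcessBuilder("ssh", host, command).inheritIO().start().waitFor();
  }

  public static void main(String[] args) throws Exception {
    int minutes = Integer.parseInt(args[0]);                       // how long to keep injecting faults
    List<String> nodes = List.of(args).subList(1, args.length);    // hostnames of the worker nodes
    String daemon = "/opt/hadoop/sbin/hadoop-daemon.sh";           // illustrative install path

    String previous = null;
    for (int i = 0; i < minutes; i++) {
      if (previous != null) {
        ssh(previous, daemon + " start datanode");                 // last victim comes back after a minute
      }
      String victim = nodes.get(ThreadLocalRandom.current().nextInt(nodes.size()));
      System.out.println("minute " + i + ": stopping datanode on " + victim);
      ssh(victim, daemon + " stop datanode");
      previous = victim;
      TimeUnit.MINUTES.sleep(1);                                   // one failure per minute
    }
    if (previous != null) {
      ssh(previous, daemon + " start datanode");                   // leave the cluster whole at the end
    }
  }
}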

    My hope is that there is somebody out there who can repeat the experiments with the modified settings listed above and

    present their findings. This research would greatly benefit the BigData community of users and developers!

    On a side note, I am working with some of my esteemed colleagues to document a specific data model and custom workload

    for online serving of queries from a multi-petabyte BigData system. I will write about it in greater detail in a future post.


    Posted by Dhruba Borthakur at 9:21 PM 9 comments

Saturday, July 2, 2011

Realtime Hadoop usage at Facebook: The Complete Story

I had earlier blogged about why Facebook is starting to use Apache Hadoop technologies to serve realtime workloads. We presented the paper at the SIGMOD 2011 conference and it was very well received.

Here is a link to the complete paper for those who are interested in understanding the details of why we decided to use Hadoop technologies, the workloads that we have on realtime Hadoop, the enhancements that we made to Hadoop to support our workloads, and the processes and methodologies we have adopted to deploy these workloads successfully. A shortened version of the first two sections of the paper is also covered in the slides that you can find here.


    Posted by Dhruba Borthakur at 6:02 PM 29 comments

    Labels: hadoop and facebook, sigmod and hbase


Saturday, May 28, 2011

Realtime Hadoop usage at Facebook -- Part 2 - Workload Types

This is the second part of our SIGMOD 2011 paper that describes our use case for Apache Hadoop and Apache HBase in realtime workloads. You can find the first part here. We describe why Hadoop and HBase fit the requirements of each of these applications.

    OUR WORKLOADS

    Before deciding on a particular software stack and whether or not to move away from our MySQL-based architecture, we

    looked at a few specific applications where existing solutions may be problematic. These use cases would have workloads that

    are challenging to scale because of very high write throughput, massive datasets, unpredictable growth, or other patterns that

    may be difficult or suboptimal in a sharded RDBMS environment.

    1. Facebook Messaging

    The latest generation of Facebook Messaging combines existing Facebook messages with e-mail, chat, and SMS. In addition

    to persisting all of these messages, a new threading model also requires messages to be stored for each participating user.

    As part of the application server requirements, each user will be sticky to a single data center.


    1.1 High Write Throughput

    With an existing rate of millions of messages and billions of instant messages every day, the volume of ingested data would be

    very large from day one and only continue to grow. The denormalized requirement would further increase the number of writes

    to the system as each message could be written several times.

    1.2 Large Tables

As part of the product requirements, messages would not be deleted unless explicitly done so by the user, so each mailbox would grow indefinitely. As is typical with most messaging applications, messages are read only a handful of times when they are recent, and then are rarely looked at again. As such, a vast majority would not be read from the database but must be available at all times and with low latency, so archiving would be difficult. Storing all of a user's thousands of messages meant that we'd have a database schema that was indexed by the user with an ever-growing list of threads and messages. With this type of random write workload, write performance will typically degrade in a system like MySQL as the number of rows in the table increases. The sheer number of new messages would also mean a heavy write workload, which could translate to a high number of random IO operations in this type of system.
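To make the schema discussion concrete, here is a purely hypothetical row-key layout for per-user messages in an HBase-style store. It is not the schema used by Facebook Messages, just an illustration of indexing by user with an ever-growing, time-ordered set of entries.

import org.apache.hadoop.hbase.util.Bytes;

public class MessageKeys {
  /**
   * Builds a row key of the form userId : reversedTimestamp : messageId, so that all of a
   * user's messages cluster together and the newest ones sort first within that cluster.
   */
  public static byte[] rowKey(long userId, long timestampMs, long messageId) {
    byte[] key = new byte[3 * Bytes.SIZEOF_LONG];
    int off = 0;
    off = Bytes.putLong(key, off, userId);
    off = Bytes.putLong(key, off, Long.MAX_VALUE - timestampMs);  // newest first
    Bytes.putLong(key, off, messageId);
    return key;
  }
}

A key like this keeps each user's messages clustered and sorted newest-first, while the write stream across millions of users remains exactly the kind of random-write workload described above.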

    1.3 Data Migration

One of the most challenging aspects of the new Messaging product was the new data model. This meant that all existing users' messages needed to be manipulated and joined for the new threading paradigm and then migrated to the new system.

    The ability to perform large scans, random access, and fast bulk imports would help to reduce the time spent migrating users

    to the new system.

2. Facebook Insights

Facebook Insights provides developers and website owners with access to real-time analytics related to Facebook activity across websites with social plugins, Facebook Pages, and Facebook Ads. Using anonymized data, Facebook surfaces activity such as impressions, click-through rates and website visits. These analytics can help everyone from businesses to bloggers gain insights into how people are interacting with their content so they can optimize their services. Domain and URL analytics were previously generated in a periodic, offline fashion through our Hadoop and Hive analytics data warehouse. However, this does not yield a rich user experience, as the data is only available several hours after the activity has occurred.

2.1 Realtime Analytics

The insights teams wanted to make statistics available to their users within seconds of user actions rather than the hours previously supported. This would require a large-scale, asynchronous queuing system for user actions as well as systems to process, aggregate, and persist these events. All of these systems need to be fault-tolerant and support more than a million events per second.

    2.2 High Throughput Increments

To support the existing insights functionality, time- and demographic-based aggregations would be necessary. However, these

    aggregations must be kept up-to-date and thus processed on the fly, one event at a time, through numeric counters. With

    millions of unique aggregates and billions of events, this meant a very large number of counters with an even larger number of

    operations against them.
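HBase supports this pattern with server-side atomic counters, so a client can bump a counter without doing a read-modify-write round trip of its own. A minimal sketch, in which the table name, column family and row-key convention are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class InsightsCounters {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table counters = conn.getTable(TableName.valueOf("insights_counters"))) {
      // One atomic, server-side increment per event: the row identifies an aggregate
      // (here an assumed "domain:hourBucket" key) and each column is one metric.
      byte[] row = Bytes.toBytes("example.com:2011052817");
      counters.incrementColumnValue(row, Bytes.toBytes("m"), Bytes.toBytes("impressions"), 1L);
      counters.incrementColumnValue(row, Bytes.toBytes("m"), Bytes.toBytes("clicks"), 1L);
    }
  }
}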

    3. Facebook Metrics System

    At Facebook, all hardware and software feed statistics into a metrics collection system called ODS (Operations Data Store).

    For example, we may collect the amount of CPU usage on a given server or tier of servers, or we may track the number of

    write operations to an HBase cluster. For each node or group of nodes we track hundreds or thousands of different metrics,

    and engineers will ask to plot them over time at various granularities. While this application has hefty requirements for write

    throughput, some of the bigger pain points with the existing MySQL-based system are around the resharding of data and the

    ability to do table scans for analysis and time roll-ups. This use-case is gearing up to be in production very shortly.

    3.1 Automatic Sharding

    The massive number of indexed and time-series writes and the unpredictable growth patterns are difficult to reconcile on a

    sharded MySQL setup. For example, a given product may only collect ten metrics over a long period of time, but following a

    large rollout or product launch, the same product may produce thousands of metrics. With the existing system, a single MySQL

server may suddenly be receiving much more load than it can handle, forcing the team to manually re-shard data from this

    server onto multiple servers.

    3.2 Fast Reads of Recent Data and Table Scans

A vast majority of reads to the metrics system are for very recent, raw data; however, all historical data must also be available. Recently written data should be available quickly, but the entire dataset will also be periodically scanned in order to perform time-based rollups.

(Credit to the authors of the paper: Dhruba Borthakur, Kannan Muthukkaruppan, Karthik Ranganathan, Samuel Rash, Joydeep Sen Sarma, Jonathan Gray, Nicolas Spiegelberg, Hairong Kuang, Dmytro Molkov, Aravind Menon, Rodrigo Schmidt, Amitanand Aiyer)



    Posted by Dhruba Borthakur at 12:56 AM 9 comments

    Labels: facebook and hdfs, hadoop, hadoop and facebook, hbase, sigmod and hbase

Tuesday, May 17, 2011

Realtime Hadoop usage at Facebook -- Part 1

Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. It uses HDFS and HBase as core technologies for this solution. Since then, many more applications have started to use HBase. We have gained some experience in deploying and operating HDFS and HBase at petabyte scale for realtime workloads, and we decided to write a paper detailing some of these insights. This paper will be published in SIGMOD 2011.

You can find the full paper here later, but here are some highlights:

    WHY HADOOP AND HBASE

    The requirements for the storage system for our workloads can be summarized as follows:

    1. Elasticity: We need to be able to add incremental capacity to our storage systems with minimal overhead and no downtime.

    In some cases we may want to add capacity rapidly and the system should automatically balance load and utilization across

    new hardware.

    2. High write throughput: Most of the applications store (and optionally index) tremendous amounts of data and require high

    aggregate write throughput.

    3. Efficient and low-latency strong consistency semantics within a data center: There are important applications like

    Messages that require strong consistency within a data center. This requirement often arises directly from user expectations.

For example, unread message counts displayed on the home page and the messages shown in the inbox page view should

    be consistent with respect to each other. While a globally distributed strongly consistent system is practically impossible, a

    system that could at least provide strong consistency within a data center would make it possible to provide a good user

experience. We also knew that (unlike other Facebook applications) Messages was easy to federate so that a particular user could be served entirely out of a single data center, making strong consistency within a single data center a critical requirement

    for the Messages project. Similarly, other projects, like realtime log aggregation, may be deployed entirely within one data

    center and are much easier to program if the system provides strong consistency guarantees.

    4. Efficient random reads from disk: In spite of the widespread use of application level caches (whether embedded or via

    memcached), at Facebook scale, a lot of accesses miss the cache and hit the back-end storage system. MySQL is very

    efficient at performing random reads from disk and any new system would have to be comparable.

5. High Availability and Disaster Recovery: We need to provide a service with very high uptime to users that covers both

    planned and unplanned events (examples of the former being events like software upgrades and addition of hardware/capacity

    and the latter exemplified by failures of hardware components). We also need to be able to tolerate the loss of a data center

    with minimal data loss and be able to serve data out of another data center in a reasonable time frame.

6. Fault Isolation: Our long experience running large farms of MySQL databases has shown us that fault isolation is critical.

    Individual databases can and do go down, but only a small fraction of users are affected by any such event. Similarly, in our

    warehouse usage of Hadoop, individual disk failures affect only a small part of the data and the system quickly recovers from

    such faults.

    7. Atomic read-modify-write primitives: Atomic increments and compare-and-swap APIs have been very useful in building

lockless concurrent applications and are a must-have from the underlying storage system.

8. Range Scans: Several applications require efficient retrieval of a set of rows in a particular range, for example the last 100 messages for a given user or the hourly impression counts over the last 24 hours for a given advertiser.
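Requirement 8 maps directly onto HBase's scan API. The sketch below uses the current client API and assumes a table named "messages" whose rows are keyed by user id followed by a reversed timestamp, so a scan bounded by the user's prefix returns the newest rows first; the table name and key encoding are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RecentMessages {
  /** Fetches up to the 100 most recent messages for one user via a bounded range scan. */
  public static List<Result> lastMessages(Connection conn, long userId) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("messages"))) {
      // Rows are assumed to be keyed by userId followed by a reversed timestamp, so a scan
      // bounded by the user's prefix returns that user's rows, newest first.
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes(userId))
          .withStopRow(Bytes.toBytes(userId + 1));     // exclusive upper bound on the prefix
      List<Result> out = new ArrayList<>();
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          out.add(r);
          if (out.size() == 100) {
            break;                                     // only the most recent 100 are needed
          }
        }
      }
      return out;
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      System.out.println(lastMessages(conn, Long.parseLong(args[0])).size() + " messages");
    }
  }
}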

    It is also worth pointing out non-requirements:

    1. Tolerance of network partitions within a single data center: Different system components are often inherently

    centralized. For example, MySQL servers may all be located within a few racks, and network partitions within a data center

    would cause major loss in serving capabilities therein. Hence every effort is made to eliminate the possibility of such events at

    the hardware level by having a highly redundant network design.

2. Zero Downtime in case of individual data center failure: In our experience such failures are very rare, though not

    impossible. In a less than ideal world where the choice of system design boils down to the choice of compromises that are

    acceptable, this is one compromise that we are willing to make given the low occurrence rate of such events. We might revise

    this non-requirement at a later time.

    3. Active-active serving capability across different data centers: As mentioned before, we were comfortable making the

    assumption that user data could be federated across different data centers (based ideally on user locality). Latency (when


    user and data locality did not match up) could be masked by using an application cache close to the user.

    Some less tangible factors were also at work. Systems with existing production experience for Facebook and in-house

    expertise were greatly preferred. When considering open-source projects, the strength of the community was an important

    factor. Given the level of engineering investment in building and maintaining systems like these it also made sense to choose

    a solution that was broadly applicable (rather than adopt point solutions based on differing architecture and codebases for

    each workload).

    After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for

    these next generation applications. The decision was based on the state of HBase at the point of evaluation as well as our

confidence in addressing the features that were lacking at that point via in-house engineering. HBase already provided a highly consistent, high write-throughput key-value store. The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode (AvatarNode) in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase's version of LSM trees, making local DataNode reads efficient and caching NameNode

    metadata). Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and

    isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal

    of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects,

    both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.

    HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides

    row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column-

    orientation gives extreme flexibility in storing data and wide rows allow the creation of billions of indexed values within a single

table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data and large indices, and need the flexibility to scale out quickly.

HBase is now being used by many other workloads internally at Facebook. I will describe these different workloads in a later

    post.

(Credit to the authors of the paper: Dhruba Borthakur, Kannan Muthukkaruppan, Karthik Ranganathan, Samuel Rash, Joydeep Sen Sarma, Jonathan Gray, Nicolas Spiegelberg, Hairong Kuang, Dmytro Molkov, Aravind Menon, Rodrigo Schmidt, Amitanand Aiyer)


    Posted by Dhruba Borthakur at 8:50 AM 20 comments

    Labels: hadoop, hadoop and facebook, hbase, hdfs, sigmod and hbase, sigmod and hdfs

Thursday, April 14, 2011

Data warehousing at Facebook

Many people have asked me to describe the best practices that we have adopted to run a multi-PB data warehouse using Hadoop. Most of the details were described in a paper that we presented at SIGMOD 2010. This document refers to our state of affairs as it was about a year back, but is still an interesting read. Below is the abstract of this paper. You can find the complete paper here.

    ABSTRACT

    Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non- engineering.

    Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's

    site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the

    Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever

    increasing amount of data, a flexible infrastructure that scales up in a cost effective manner, is critical. We have leveraged, authored and

    contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and

    Hive which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will

    present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB

    after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our

design choices, the capabilities of this solution, the challenges that we face in day-to-day operations and future capabilities and improvements

    that we are working on.

Facebook has open-sourced the version of Apache Hadoop that we use to power our production clusters. You can find more

    details about our usage of Hadoop at the Facebook Engineering Blog.


    Posted by Dhruba Borthakur at 12:15 AM 7 comments


    Labels: hadoop and facebook, hadoop warehouse

Sunday, November 21, 2010

Hadoop Research Topics

    Recently, I visited a few premier educational institutes in India, e.g. Indian Institute of Technology (IIT) at Delhi and Guwahati.

    Most of the undergraduate students at these two institutes are somewhat familiar with Hadoop and would like to work on

Hadoop-related projects as part of their course work. One commonly asked question that I got from these students is: what Hadoop feature can I work on?

    Here are some items that I have in mind that are good topics for students to attempt if they want to work in Hadoop.

    Ability to make Hadoop scheduler resource aware, especially CPU, memory and IO resources.

    The current implementation is based on statically configured slots.

Ability to make a map-reduce job take new input splits even after the job has already started.

Ability to dynamically increase replicas of data in HDFS based on access patterns. This is needed to handle hot-spots of data. (A small sketch of the existing replication API appears after this list.)

Ability to extend the map-reduce framework to be able to process data that resides partly in memory. One assumption of the current implementation is that the map-reduce framework is used to scan data that resides on disk devices. But memory on commodity machines is becoming larger and larger. A cluster of 3000 machines with 64 GB each can keep about 200TB of data in memory! It would be nice if the Hadoop framework could support caching the hot set of data in the RAM of the tasktracker machines. Performance should increase dramatically because it is costly to serialize/compress data from the disk into memory for every query.

Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not handling outright failures but rather anomalies that make parts of the cloud slow (without failing completely); these impact the performance of jobs.

Make map-reduce jobs work across data centers. In many cases, a single Hadoop cluster cannot fit into a single data center and a user has to partition the dataset into two Hadoop clusters in two different data centers.

    High Availability of the JobTracker. In the current implementation, if the JobTracker machine

    dies, then all currently running jobs fail.

    Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a

    dataset that was erroneously modified/deleted by a buggy application.
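On the replication item above: the low-level primitive already exists, since any HDFS client can raise the replication factor of a file and the NameNode will schedule the extra copies in the background. The research project is the policy layer that detects hot files from access patterns and decides when to invoke it. A trivial sketch of that existing call, where the file path and target factor would come from a hypothetical monitor:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HotFileReplicator {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path hotFile = new Path(args[0]);                 // a file some monitor has flagged as hot
    short targetReplication = Short.parseShort(args[1]);
    // HDFS already lets a client raise the replication factor of an existing file;
    // the NameNode then schedules the extra copies in the background.
    boolean accepted = fs.setReplication(hotFile, targetReplication);
    System.out.println("setReplication accepted: " + accepted);
  }
}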

The first thing for a student who wants to do any of these projects is to download the code from HDFS and MAPREDUCE. Then create an account in the bug tracking software here. Please search for an existing JIRA that describes your project; if none exists, then please create a new JIRA. Then please write a design document proposal and post it to the relevant JIRA so that the greater Apache Hadoop community can deliberate on the proposal.

If anybody else has any new project ideas, please add them as comments to this blog post.


    Posted by Dhruba Borthakur at 8:22 AM 45 comments
