A Seminar-II Report
On
"Pipelined-MapReduce: An improved MapReduce Parallel programming model"
submitted in partial fulfilment of the requirements
for the award of the degree of
First Year Master of Engineering
in
Computer Engineering
By
Prof. Deshmukh Sachin B.
Under the guidance of
Prof. B.L.Gunjal
DEPARTMENT OF COMPUTER ENGINEERING
AMRUTVAHINI COLLEGE OF ENGINEERING, SANGAMNER
A/P-GHULEWADI-422608, TAL. SANGAMNER, DIST. AHMEDNAGAR (MS) INDIA
2013-2014
DEPARTMENT OF COMPUTER ENGINEERING, AMRUTVAHINI COLLEGE OF ENGINEERING, SANGAMNER
A/P-GHULEWADI-422608, TAL. SANGAMNER, DIST. AHMEDNAGAR (MS) INDIA
2013-14
CERTIFICATE

This is to certify that the Seminar-II report entitled
"Pipelined-MapReduce: An improved MapReduce Parallel programming model"
is submitted in partial fulfilment of the curriculum of First Year M.E.
in Computer Engineering
By
Prof. Deshmukh Sachin B.
Prof. B.L.Gunjal Prof. B.L.Gunjal
(Seminar Guide) (Seminar Coordinator)
Prof. R.L.Paikrao Prof. Dr.G.J. Vikhe Patil
(H.O.D.) (Principal)
University of Pune
CERTIFICATE
This is to certify that
Prof. Deshmukh Sachin B.
Student of
First Year M.E.(Computer Engineering)
was examined in the Dissertation Report entitled
"Pipelined-MapReduce: An improved MapReduce Parallel programming model"
on / /2013
at
Department of Computer Engineering,
Amrutvahini College of Engineering, Sangamner - 422608
(. . . . . . . . . . . . . . . . . . . . .) (. . . . . . . . . . . . . . . . . . . . .)
Prof. B.L.Gunjal Prof. A.N.Nawathe
Internal Examiner External Examiner
ACKNOWLEDGEMENT
As a wise person said about the nobility of the teaching profession: catch a fish and
you feed a man his dinner, but teach a man how to catch a fish, and you feed him for
life.
In this context, I would like to thank my helpful seminar guide Prof. B. L. Gunjal,
who has been an incessant source of inspiration and help. Not only did she inspire me
to undertake this assignment, she also advised me throughout its course and helped me
during my times of trouble. I would like to thank the H.O.D. of the Department of
Computer Engineering, Prof. Paikrao R. L., and the M.E. co-ordinator, Prof. Gunjal
B. L., for motivating me.
Also, I would like to thank the other members of the Computer Engineering department,
who helped me to handle this assignment efficiently, and who were always ready to go
out of their way to help us during our times of need.
Prof. Deshmukh Sachin B.
M.E.Computer Engineering
ABSTRACT
Cloud MapReduce (CMR) is a framework for processing large batch data sets in the
cloud. The Map and Reduce phases run sequentially, one after another. This leads to:
1. Compulsory batch processing
2. No parallelization of the map and reduce phases
3. Increased delays
The current implementation is not suited for processing streaming data. We propose a
novel architecture to support streaming data as input, using pipelining between the Map
and Reduce phases in CMR so that the output of the Map phase is made available
to the Reduce phase as soon as it is produced. This 'Pipelined MapReduce' approach
leads to increased parallelism between the Map and Reduce phases, thereby:
1. Supporting streaming data as input
2. Reducing delays
3. Enabling the user to take 'snapshots' of the approximate output generated in a
stipulated time frame
4. Supporting cascaded MapReduce jobs
This cloud implementation is light-weight and inherently scalable.
Contents

Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

1 INTRODUCTION 1
1.1 Need of system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Map Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 LITERATURE SURVEY 3
2.1 Map Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Mappers and Reducers . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Introduction to Hadoop . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Design of HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 Basic concepts of HDFS . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Cloud Map Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Online Map Reduce (Hadoop Online Prototype) . . . . . . . . . . 8

3 PROPOSED ARCHITECTURE 9
3.1 Architecture of Pipelined CMR . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.1 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.2 Mapper Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.3 Reduce Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 The first design option . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Handling Reducer Failures . . . . . . . . . . . . . . . . . . . . . . 11
3.3 The second design option . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Hybrid Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Pipelined Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Hadoop and HDFS 15
4.1 Hadoop System Architecture . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1.1 Operating a Server Cluster . . . . . . . . . . . . . . . . . . . . . 15
4.2 Data Storage and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3 Comparison with Other Systems . . . . . . . . . . . . . . . . . . . . . . 17
4.3.1 RDBMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.4 Hadoop Distributed File System (HDFS) . . . . . . . . . . . . . . . . . 19
4.4.1 HDFS Federation . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4.2 HDFS High-Availability . . . . . . . . . . . . . . . . . . . . . . . 20
4.5 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5.1 Algorithm For Word Count . . . . . . . . . . . . . . . . . . . . . 21

5 ADVANTAGES, FEATURES AND APPLICATIONS 23
5.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2.1 Time windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2.2 Snapshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2.3 Cascaded MapReduce Jobs . . . . . . . . . . . . . . . . . . . . . 24

Conclusion 24
References 25
List of Figures
2.1 Map Reduce Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Architecture Of Complete Hadoop Structure . . . . . . . . . . . . . . . . 5
2.3 Architecture Of Hadoop Distributed File System . . . . . . . . . . . . . . 7
3.1 Architecture of Pipelined Map Reduce . . . . . . . . . . . . . . . . . . . 11
3.2 Hadoop Data flow for Batch And Pipelined Map Reduce Data Flow . . . 13
4.1 Hadoop System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Operation Of Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . . 16
Chapter 1
INTRODUCTION
1.1 Need of system
Cloud Map Reduce (CMR) is gaining popularity among small companies for pro-
cessing large data sets in cloud environments. The current implementation of CMR
is designed for batch processing of data. Today, streaming data constitutes an impor-
tant portion of web data in the form of continuous click-streams, feeds, micro-blogging,
stock-quotes, news and other such data sources. For processing streaming data in Cloud
MapReduce (CMR), significant changes are required to be made to the existing CMR
architecture. We use pipelining between the Map and Reduce phases as an approach to support
stream data processing, in contrast to the current implementation where the Reducers
do not start working unless all Mappers have finished. Thus, in our architecture, the
Reduce phase too gets a continuous stream of data and can produce continuous output.
This new architecture is explained in subsequent section.
Compute- and data-intensive processing is increasingly prevalent. In the near future,
the data volumes processed by applications are expected to cross the peta-scale
threshold, increasing computational requirements accordingly.
1.2 Map Reduce
Google's MapReduce is a programming model and an associated implementation for
processing and generating large data sets. Users specify a map function that processes
(key, value) pairs and produces a series of intermediate (key, value) pairs, and a
reduce function that merges all the values associated with the same intermediate key.
MapReduce allows program developers to think in a data-centric way: they focus on
the transformations applied to sets of data records, and leave the details of
distributed processing, network communication, coordination, and fault tolerance to
the MapReduce framework.
The MapReduce model is usually applied to large batch computations whose main concern
is total completion time. Google's MapReduce framework and the open-source Hadoop
framework reinforce this batch usage of the model in their implementations: the entire
output of each map and reduce stage is materialized to stable storage before it is
consumed by the next stage or produced as final output. Batch materialization allows a
simple and practical checkpoint/restart fault-tolerance mechanism, which is critical
for large deployments.
Pipelined-MapReduce is an improved implementation in which intermediate data is
transferred between operators through pipelines, while retaining the programming
interface and fault-tolerance model of the original MapReduce. Pipelined-MapReduce
has many advantages: a downstream dataflow element can begin consuming data before its
upstream producer has finished, which increases the opportunities for parallelism,
improves efficiency, and reduces response time. Because reducers begin processing as
soon as mappers produce data, they can generate and refine an approximation of the
final result during execution. Pipelined-MapReduce thus broadens the field in which
MapReduce can be applied.
Pipelined-MapReduce: An improved MapReduce Parallel programing model P2
Chapter 2
LITERATURE SURVEY
2.1 Map Reduce
MapReduce is a programming model developed by Google for processing large data
sets. Initially, the data is divided into smaller 'splits'. The splits are independent
of each other and can hence be processed in parallel fashion. Each split consists of a
set of (key, value) pairs that form records. The splits are divided among computing
nodes called Mappers in a distributed fashion. The model consists of two phases: a Map
phase and a Reduce phase. Mappers map the input (key, value) pairs into intermediate
(keyi, valuei) pairs. The following stage consists of Reducers, which are computing
nodes that take the intermediate (keyi, valuei) pairs generated by the Mappers. The
Reducers combine the set of values associated with a particular keyi obtained from all
the Mappers to produce a final output in the form of (keyi, value). For example, the
often-cited word-count problem elegantly illustrates the computing phases of
MapReduce. A large document is to be scanned and the number of occurrences of each
word is to be determined. The solution using MapReduce proceeds as follows:
1. The document is divided into several splits (ensuring that no split is such that it
results in the splitting of a whole word). The ID of a split is the input key, and the actual
split (or a pointer to it) is the value corresponding to that key. Thus, the document is
divided into a number of splits, each containing a set of (key, value) pairs.
2. There is a set of m Mappers. Each mapper is assigned an input split based on
some scheduling policy (FIFO, fair scheduling, etc.); when a Mapper has consumed its
split, it asks for the next one. Each Mapper processes each input (key, value) pair in the split
assigned to it according to some user defined Map function to produce intermediate key
value pairs. In this case, a Mapper, for 2 occurrences of the word ’cake’ and 3 occurrences
of the word ’pastry’ outputs the following: (Cake, 1) (Cake, 1) (Pastry, 1) (Pastry, 1)
(Pastry, 1) Here, each intermediate (key, value) pair indicates one occurrence of the key.
3. The set of intermediate (key, value) pairs is then pulled by the Reducer workers
from the Mappers.
4. The Reducers then work to combine all the values associated with a particular
intermediate key (keyi) according to a user defined Reduce function. Thus, in the above
example, a Reducer worker, on processing the set of intermediate (key, value) pairs of
word count, would output: (Cake, 2) (Pastry, 3)
5. This output is then written to disk as the final output of the MapReduce job, which
gives the count of all words in the document in this example. An optional Combiner phase
is usually included at the Mapper worker to combine the local result of the Mapper before
sending it over to the Reducer. This works to reduce network traffic between the Mapper
and the Reducer.
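The five steps above can be sketched as a minimal in-memory word count. This is a toy illustration of the Map, shuffle, and Reduce phases on a single machine, not the distributed implementation; the function names are ours.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(split_id, text):
    # Step 2: emit one intermediate (word, 1) pair per occurrence.
    return [(word, 1) for word in text.split()]

def reduce_fn(key, values):
    # Step 4: combine all values associated with one intermediate key.
    return (key, sum(values))

def mapreduce(splits, map_fn, reduce_fn):
    # Map phase: every split is processed independently.
    intermediate = []
    for split_id, text in splits.items():
        intermediate.extend(map_fn(split_id, text))
    # Shuffle: group the intermediate pairs by key (here via sorting).
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per distinct intermediate key.
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))]

splits = {0: "cake cake pastry", 1: "pastry pastry"}
print(mapreduce(splits, map_fn, reduce_fn))  # [('cake', 2), ('pastry', 3)]
```

In the distributed setting, the map and reduce calls run on separate worker nodes and the "shuffle" is a network transfer, but the dataflow is the same.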
2.1.1 Mappers and Reducers:
The Map Reduce framework operates exclusively on (key, value) pairs; that is, the
framework views the input to the job as a set of (key, value) pairs and produces a set
of (key, value) pairs as the output of the job, conceivably of different types. Key-value
pairs form the basic data structure in Map Reduce. Keys and values may be primitives
such as integers, floating-point values, strings, and raw bytes, or they may be arbitrarily
complex structures (lists, tuples, associative arrays, etc.).
In Map Reduce, the programmer defines a mapper and a reducer with the following
signatures:
Map: (k1, v1) → [(k2, v2)]
Reduce: (k2, [v2]) → [(k3, v3)]
Hence, the complete Hadoop cluster, i.e. HDFS together with Map Reduce, looks as
shown in the following figure.
As the figure suggests, the programmer can specify that the system sort each bin's
contents according to any provided criterion. This customization can occasionally be
useful. The programmer can also specify partitioning methods to more coarsely or more finely bin
the map results. Although every job must specify both a mapper and a reducer, either
of them can be the identity, or an empty stub that creates output records equal to the
input records. In particular, an algorithm frequently calls for input categorization or
ordering over many successive stages, suggesting a sequence of reducers without the
need for intermediate mapping; identity mappers fill in to satisfy such architectural
needs.

Figure 2.1: Map Reduce Process

Figure 2.2: Architecture Of Complete Hadoop Structure
2.2 Hadoop
Hadoop is an implementation of the MapReduce programming model developed by
Apache. The Hadoop framework is used for batch processing of large data sets on a
physical cluster of machines. It incorporates a distributed file system called the Hadoop
Distributed File System (HDFS), a common set of commands, a scheduler, and the
MapReduce evaluation framework. As the entire complexity of cluster management,
redundancy in HDFS, and consistency and reliability in case of node failure is included
in the framework itself, the code base is huge: around 300,000 lines of code (LOC).
Hadoop is popular
for processing huge data sets, especially in social networking, targeted advertisements,
internet log processing etc.
2.2.1 Introduction to Hadoop:
Technically, Hadoop consists of two key services: reliable data storage using the
Hadoop Distributed File System (HDFS) and high-performance parallel data processing
using a technique called Map Reduce. Hadoop runs on a collection of commodity, shared-
nothing servers.
2.2.2 Design of HDFS:
Very large files:
Very large in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access:
HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern.
Commodity hardware:
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to
run on clusters of commodity hardware.
2.2.3 Basic concepts of HDFS:
Blocks:
HDFS has the concept of a block, but it is a much larger unit: 64 MB by default. As in
a file system for a single disk, files in HDFS are broken into block-sized chunks, which
are stored as independent units. Unlike a file system for a single disk, however, a file
in HDFS that is smaller than a single block does not occupy a full block's worth of
underlying storage.
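The chunking described above can be illustrated with a small calculation, using the 64 MB default block size. The function name is ours; this only models how a file's size maps onto chunks, not actual HDFS behaviour.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size: 64 MB

def split_into_blocks(file_size):
    """Sizes of the block-sized chunks a file of file_size bytes is stored as."""
    sizes = [BLOCK_SIZE] * (file_size // BLOCK_SIZE)
    if file_size % BLOCK_SIZE:
        # The last chunk occupies only the storage it actually needs: a file
        # smaller than one block does not consume a full block on disk.
        sizes.append(file_size % BLOCK_SIZE)
    return sizes

# A 1 MB file occupies a single 1 MB chunk, not a full 64 MB block.
print(split_into_blocks(1 * 1024 * 1024))         # [1048576]
# A 150 MB file becomes two full 64 MB blocks plus one 22 MB chunk.
print(len(split_into_blocks(150 * 1024 * 1024)))  # 3
```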
Name Node:
The name node manages the file system namespace. It maintains the file system tree
and the metadata for all the files and directories in the tree. This information is stored
persistently on the local disk in the form of two files: the namespace image and the edit
log. The name node also knows the data nodes on which all the blocks for a given file
are located; however, it does not store block locations persistently, since this information
is reconstructed from data nodes when the system starts. Without the name node, the
file system cannot be used.
Data Node:
Data nodes are the work horses of the file system. They store and retrieve blocks
when they are told to and they report back to the name node periodically with lists of
blocks that they are storing. The following diagram shows the basic HDFS architecture.
Figure 2.3: Architecture Of Hadoop Distributed File System
2.3 Cloud Map Reduce
Cloud MapReduce is a light-weight implementation of MapReduce programming
model on top of the Amazon cloud OS, using Amazon EC2 instances. It is very compact
(around 3000 LOC) as compared to Hadoop. It also gives significant performance im-
provements (in one case, over 60x) over traditional Hadoop. The architecture of CMR,
as described in [2], consists of one input queue; multiple reduce queues, which act as
staging areas for holding the intermediate (key, value) pairs produced by the Mappers;
a master reduce queue that holds the pointers to the reduce queues; and an output queue
that holds the final results. In addition, S3 file system is used to store the data to be
processed, and SimpleDB is used to communicate the status of the worker nodes which
is then used to ensure consistency and reliability. Initially all the queues are set up.
The user then puts the data (key, value) pairs to be processed in the input SQS queue.
Typically, the key is a unique ID and the value is a pointer into S3. The Mappers poll
the queue for work whenever they are free. Then, they dequeue one message from the
queue and process it according to the user defined Map function. The intermediate (key,
value) pairs are pushed by the Mappers to the intermediate Reduce queues as soon as
they are produced.
The choice of a reduce queue is made by hashing on the intermediate key generated,
to ensure even load balancing, and also to ensure that the records with the same key land
up in the same reduce queue. After the Map phase is complete, each Reducer dequeues a
pointer to one of the reduce queues from the master reduce queue and processes the reduce
queue associated with that pointer. They apply the user defined Reduce function on the
reduce queue by sorting the queue by intermediate key, merging the values associated
with a key, and then writing the final output to the Output queue. This is again batch
processing, as the user submits a batch of data to be processed. Also, the Reducer
workers don't start working unless all the Mappers have finished all their Map tasks.
Thus, it is not suited for processing streaming data.
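The queue-selection rule described above (hash the intermediate key to pick a reduce queue) can be sketched as follows. The queue count and function name are illustrative assumptions.

```python
import zlib

NUM_REDUCE_QUEUES = 8  # illustrative; at least the number of reducers

def choose_reduce_queue(intermediate_key):
    # Hashing balances load across queues and, crucially, guarantees that
    # all records with the same intermediate key land in the same queue.
    # A stable hash (CRC32) is used rather than Python's per-process
    # randomized hash(), so every worker computes the same mapping.
    return zlib.crc32(intermediate_key.encode("utf-8")) % NUM_REDUCE_QUEUES

# Records with equal keys always map to the same reduce queue:
assert choose_reduce_queue("cake") == choose_reduce_queue("cake")
```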
2.4 Online Map Reduce (Hadoop Online Prototype)
It is a modification to traditional Hadoop framework that incorporates pipelining be-
tween the Map and Reduce phases, thereby supporting parallelism between these phases,
and providing support for processing streaming data. The output of the Mappers is made
available to the Reducers as soon as it is produced. A downstream dataflow element can
begin processing before an upstream producer finishes. It carries out online aggregation
of data to produce incrementally correct output. This also supports continuous queries.
This can also be used with batch data, and gives approximately 30% better throughput
and response time because of parallelism between the Map and Reduce phases. It also
supports ’snapshots’ of output data where the user can get a view of the output produced
till some time instant instead of waiting for the entire batch of data to finish processing.
Also, Cascaded MapReduce jobs with pipelining are supported in this implementation,
whereby the output of one job is fed to the input of another job as soon as it is produced.
This leads to parallelism within a job, as well as between jobs. The architecture is par-
ticularly suitable for processing streaming data, as it is modelled as dataflow architecture
with windowing and online aggregation.
Chapter 3
PROPOSED ARCHITECTURE
The above implementations suffer from the following drawbacks:
1. HOP is unsuitable for cloud, as Hadoop is a framework for distributed computing.
Hence, it lacks the inherent scalability and flexibility of cloud.
2. In HOP, the code for handling HDFS, reliability, scheduling, etc. is a part of the
Hadoop framework itself, which makes it large and heavy-weight (around 300,000 LOC).
3. Cloud MapReduce does not support stream data processing, which is an important
use case.
Our proposal aims at bridging this gap between the heavyweight HOP and the
light-weight, scalable Cloud MapReduce implementation, by providing support for pro-
cessing stream data in Cloud MapReduce. This architecture is easier to build, maintain
and run as the underlying complexity of managing resources in cloud is handled by the
Amazon cloud OS, in contrast to a Hadoop-like framework where the issues of handling
filesystem, scheduling, availability, reliability are done in the framework itself. The chal-
lenges involved in the implementation include:
1. Providing support for streaming data at input
2. A novel design for output aggregation
3. Handling Reducer failures
4. Handling windows based on timestamps.
Currently, no open-source implementation exists for processing streaming data using
MapReduce on top of Cloud. To the best of our knowledge, this is the first such attempt
to integrate stream data processing capability with MapReduce on Amazon Web Services
using EC2 instances. Such an implementation will provide significant value to stream
processing applications such as the ones outlined in section IV. We now describe the
architecture of the Pipelined CMR approach.
3.1 Architecture of Pipelined CMR
3.1.1 Input
A drop-box concept can be used, where a folder on S3 (for example) is used to hold the
data that is to be processed by Cloud MapReduce. The user is responsible for providing
data in the drop-box from which it will be sent to the input SQS queue. That way, the
user can control the data that is sent to the drop-box and can use the implementation
in either a batch or a stream processing manner. The data provided by the user must
be in (key, value) format. An Input Handler function running on one EC2 instance polls
the drop-box for new data. When a new record of (key, value) format is available in the
drop-box, the Input Handler appends a unique MapID, Timestamp, and Unique Tag to
the record, generating a record of a new form: (Metadata, Key, Value). The
addition of Timestamp is important for taking ’snapshots’ of the output (described later)
and for preserving the order of data processing in case of stragglers. The Input Handler
then pushes this generated record into SQS for processing by the Mapper.
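The Input Handler's transformation of a user record into a (Metadata, Key, Value) record might look like the sketch below. The field names and the example S3-style value are illustrative assumptions, not the actual CMR record layout.

```python
import itertools
import time
import uuid

_map_ids = itertools.count()  # source of unique, monotonically increasing MapIDs

def wrap_record(key, value):
    """Wrap a user (key, value) record with the metadata the Input Handler
    adds: a unique MapID, a Timestamp (used for snapshots and for preserving
    processing order under stragglers), and a unique Tag."""
    metadata = {
        "map_id": next(_map_ids),
        "timestamp": time.time(),
        "tag": uuid.uuid4().hex,
    }
    return (metadata, key, value)

# Hypothetical example: the key is a unique ID, the value points into S3.
record = wrap_record("doc-17", "s3://bucket/splits/17")
```

The resulting tuple is what would be pushed into the input SQS queue for the Mappers.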
3.1.2 Mapper Operation
Input Queue-Mapper interface is unchanged from the CMR implementation [2]. The
Mapper, whenever it is free, pops one message from the input SQS queue thereby re-
moving the message from the queue for a visibility timeout and processes it according to
the user-defined Map function. If the Mapper fails, the message re-appears on the Input
queue after the visibility timeout. If the Mapper is successful, it deletes the message from
the queue and writes its status to SimpleDB. Mapper generates a set of intermediate (key,
value) pairs for each input split processed.
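The Mapper's pop/process/delete cycle against the visibility-timeout semantics described above can be sketched with a toy queue. The queue class below is a stand-in for SQS, not the real API; it only models the property the text relies on: a popped message reappears after the timeout unless it is explicitly deleted.

```python
import time

class VisibilityQueue:
    """Toy stand-in for SQS: popped messages become invisible for a timeout
    and reappear unless explicitly deleted."""
    def __init__(self, visibility_timeout=30.0):
        self.messages = {}  # msg_id -> (payload, invisible_until)
        self.timeout = visibility_timeout
        self._next_id = 0

    def push(self, payload):
        self.messages[self._next_id] = (payload, 0.0)
        self._next_id += 1

    def pop(self):
        now = time.monotonic()
        for msg_id, (payload, invisible_until) in self.messages.items():
            if invisible_until <= now:
                # Hide the message for the visibility timeout; if the worker
                # crashes before deleting it, it reappears automatically.
                self.messages[msg_id] = (payload, now + self.timeout)
                return msg_id, payload
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

def run_mapper(queue, map_fn, emit):
    # Process messages until the queue yields nothing visible.
    while (msg := queue.pop()) is not None:
        msg_id, (key, value) = msg
        for pair in map_fn(key, value):
            emit(pair)            # push intermediate pairs downstream
        queue.delete(msg_id)      # success: remove the message for good
```

Deleting only after the intermediate pairs are emitted is what gives the at-least-once guarantee: a Mapper that fails mid-message leaves the message to reappear for another worker.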
3.1.3 Reduce Phase
The Mapper writes the intermediate records produced to ReduceQueues. Reduce-
Queues are intermediate staging queues implemented using SQS for holding the mapper
output. Reducers take their input from these queues. The Reduce phase as implemented
in CMR will have to be modified to process Stream Data. For this, we considered two
design options:
Chapter3 3. PROPOSED ARCHITECTURE
Pipelined-MapReduce: An improved MapReduce Parallel programing modelP1212
Figure 3.1: Architecture of Pipelined Map Reduce
3.2 The first design option:
As in CMR, the Mapper pushes each intermediate (key, value) pair to one of the
reduce queues based on hash value of the intermediate key. The hash function can be
user-defined or a default function provided, as in CMR. The number of reduce queues
must be at least equal to the number of reducers and preferably much larger to balance
load better. Unlike the CMR implementation, where a reducer can process any of the
reduce queues by dequeueing one message from the master reduce queue (which holds
pointers to all the ReduceQueues), in this case, the reducer will have to be statically
bound to one or more reduce queues, so that records with a particular intermediate
key are always processed by the same reducer. This is essential to aggregate values
associated with the same key in the Reduce phase. Also, each reducer will be statically
bound to an output split associated with that reducer, so that during online aggregation
of values associated with a particular reduce key, a particular key does not land up
in two different output splits, giving incorrect aggregation. Thus, each reducer will
be bound statically to: one or more reduce queues and one output split. This can
be achieved by maintaining a database record containing (ReducerID, ReduceQueuePointers,
OutputQueuePointers, Status, timeLastUpdated), or an equivalent data structure in a
non-relational database such as SimpleDB. In the static implementation,
ReduceQueuePointers holds the pointers to the Reduce queues associated with a
particular Reducer, identified by its ReducerID. OutputQueuePointers holds pointers
to OutputQueues (or rather OutputFilePointers) associated with a particular ReducerID.
Chapter3 3. PROPOSED ARCHITECTURE
Pipelined-MapReduce: An improved MapReduce Parallel programing modelP1313
3.2.1 Handling Reducer Failures
The Status field is used for handling Reducer failures. Status can be one of Live,
Dead, Idle. Each Reducer periodically updates the isAlive bit in its Status field to
indicate that it has not failed (similar to the heartbeat signal in traditional Hadoop). A
thread running in an EC2 instance monitors to see if any reducer has not updated its
status in the past TIMEOUT seconds. If it finds a Reducer which has not updated its
status in such time, it sets its Status to Dead, searches for a Reducer with Idle Status
and assigns the new Reducer the ReduceQueuePointers and OutputPointers previously
held by the old Reducer. When the visibility timeout expires, the messages processed
by the failed Reducer again become visible on the ReduceQueues. As the Reducer does
not update its Status to Done in the SimpleDB unless it has finished processing a given
window of the input and removed the corresponding messages from the ReduceQueues, it
is guaranteed that if the Reducer fails, the messages will reappear on the ReduceQueues,
as SQS requires that messages be specifically deleted, or else they re-appear after the
visibility timeout. Thus, Idle Reducers are used to handle other Reducer failures.
Also, if an Idle reducer's original queues start filling up after it has taken over the job
of a failed reducer, it will process those queues as well, since the queues associated
with the failed reducer are added to its original list of queues, without replacement.
The issue of which output split to write to, when a record is read from a
reduce queue (because it may be read from the original queue associated with a reducer,
or a new queue assigned because of some other Reducer’s failure) can be determined by
holding another database table that associates a particular reduce queue to a particular
output split. This is essential for correct aggregation. The Reducer then writes the output
generated to the correct output split, checking for duplicates and if the key already
exists, aggregating the new output with the output already available. This is called
online aggregation. Thus, the following sequence of steps occurs for each intermediate
(key, value) pair generated by the mapper (ignoring the issues involved in handling
reducer failures, discussed above):
1. The Mapper pushes the intermediate (key, value) pair to one of the reduce queues
based on the hash value of the key.
2. Each Reducer is statically bound to one or more ReduceQueues, and polls only those
queues for work.
3. The Reducer then writes the output to the output split assigned to it.
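One pass of the failure-monitoring thread described in this subsection can be sketched over a status table of the kind introduced earlier. The TIMEOUT value and the table layout are illustrative assumptions.

```python
TIMEOUT = 60.0  # seconds without a status update before a reducer is presumed dead

def monitor(bindings, now):
    """One monitoring pass: mark silent reducers Dead and hand their queues
    and output pointers to an Idle reducer, without replacement."""
    for rid, row in bindings.items():
        if row["status"] == "Live" and now - row["time_last_updated"] > TIMEOUT:
            row["status"] = "Dead"
            idle = next((r for r in bindings.values() if r["status"] == "Idle"),
                        None)
            if idle is not None:
                # The failed reducer's queues are *added* to the idle
                # reducer's own lists, so it keeps serving both workloads.
                idle["reduce_queues"] += row["reduce_queues"]
                idle["output_splits"] += row["output_splits"]
                idle["status"] = "Live"
```

The unprocessed messages themselves need no hand-off: once the visibility timeout expires, they simply reappear on the ReduceQueues now owned by the new reducer.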
3.3 The second design option:
Alternatively, we can have a single queue between the Mappers and the Reducers,
with all the intermediate (key, value) pairs generated by all the Mappers pushed to this
intermediate queue. Reducers poll this IntermediateQueue for records. Aggregation is
carried out as follows: There are a fixed number of Output splits. Whenever a Reducer
reads a (key, value) record from the IntermediateQueue, it applies a user-defined Reduce
function to the record, to produce an output (key, value) pair. It then selects an output
split by hashing on the output key produced. If there already exists a record for that key
in the output split, the new value is merged with the existing value. Otherwise a new
record is created with (key, value) as the output record. An issue with this approach is
that, when aggregating records directly in S3, the latency of S3 must be taken into
account. If the delays are too long, the first design option would be better.
3.4 Hybrid Approach
Both the above approaches could be combined as follows: Have multiple Reduce-
Queues, each linked statically to a particular Reducer, but instead of linking the output
splits to the Reducer statically, use hashing on the output key of the (key, value) pair
generated by the Reducer to select an output split. This will require fewer changes to the
existing CMR architecture, but will involve static linking of Reducers to ReduceQueues.
This is the approach that we prefer.
3.5 Pipelined Architecture
The figure depicts the dataflow of two MapReduce implementations. The dataflow on
the left corresponds to the output-materialization approach used by Hadoop; the dataflow
on the right allows pipelining, and we call it Pipelined-MapReduce. In the remainder
of this section, we present our design and implementation for the Pipelined-MapReduce
dataflow.
In stock Hadoop, reduce tasks traditionally issue HTTP requests to pull map output from
each Task-Tracker. This means that map task execution is completely decoupled from
reduce task execution. To support pipelining, we modified the map task to instead push
data to reducers as it is produced. To give an intuition for how this works, we begin by
describing a straightforward pipelined design, and then discuss the changes we had to
make to achieve good performance.
Figure 3.2: Data flow for batch Hadoop MapReduce and for Pipelined-MapReduce
In our Pipelined-MapReduce implementation, we modified Hadoop to send data directly
from map to reduce tasks. When a client submits a new job to Hadoop, the Job-Tracker
assigns the map and reduce tasks associated with the job to the available Task-Tracker
slots. For purposes of discussion, we assume that there are enough free slots to assign all
the tasks for each job. We modified Hadoop so that each reduce task contacts every map
task upon initiation of the job, and opens a TCP socket which will be used to pipeline
the output of the map function. As each map output record is produced, the mapper
determines which partition (reduce task) the record should be sent to, and immediately
sends it via the appropriate socket.
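The map-side routing just described can be sketched as follows. This is an illustrative stand-in rather than Hadoop's actual code: a per-reducer list takes the place of the TCP socket opened by each reduce task, while the partition choice mirrors the behaviour of Hadoop's HashPartitioner.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of pipelined map output: as each record is produced,
// the mapper picks the reduce partition by hashing the key and pushes the
// record onto the channel for that reducer immediately, instead of
// materializing it to local disk. Here a list per reducer stands in for
// the TCP socket.
public class PipelinedMapOutput {
    private final List<List<String>> channels = new ArrayList<>();

    public PipelinedMapOutput(int numReduceTasks) {
        for (int i = 0; i < numReduceTasks; i++) channels.add(new ArrayList<>());
    }

    // Same partitioning rule as Hadoop's HashPartitioner.
    public int partitionFor(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % channels.size();
    }

    // Called once per map output record: route and "send" immediately.
    public void emit(String key, String value) {
        channels.get(partitionFor(key)).add(key + "\t" + value);
    }

    public List<String> channel(int partition) {
        return channels.get(partition);
    }
}
```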
A reduce task accepts the pipelined data it receives from each map task and stores it
in an in-memory buffer, spilling sorted runs of the buffer to disk as needed. Once the
reduce task learns that every map task has completed, it performs a final merge of all
the sorted runs, applies the user-defined reduce function as normal, and writes the
output to HDFS.
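The reduce-side behaviour (buffer, spill sorted runs, final merge, then reduce) can be sketched as below. The class is a hypothetical single-machine illustration: spilling to disk becomes appending to an in-memory list of sorted runs, and the reduce function is fixed to summation, as in word count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the pipelined reduce side: records accumulate in an in-memory
// buffer; when the buffer exceeds a threshold it is sorted and "spilled"
// (here, to a list standing in for disk); once all map tasks are done, the
// sorted runs are merged and a summing reduce function is applied.
public class PipelinedReducer {
    private final int spillThreshold;
    private List<String[]> buffer = new ArrayList<>();
    private final List<List<String[]>> sortedRuns = new ArrayList<>();

    public PipelinedReducer(int spillThreshold) {
        this.spillThreshold = spillThreshold;
    }

    // Called for each pipelined record received from a map task.
    public void accept(String key, int value) {
        buffer.add(new String[] { key, Integer.toString(value) });
        if (buffer.size() >= spillThreshold) spill();
    }

    private void spill() {
        buffer.sort((a, b) -> a[0].compareTo(b[0]));   // a sorted run
        sortedRuns.add(buffer);
        buffer = new ArrayList<>();
    }

    // Final merge once every map task has completed.
    public Map<String, Integer> finish() {
        if (!buffer.isEmpty()) spill();
        Map<String, Integer> out = new TreeMap<>();
        for (List<String[]> run : sortedRuns)
            for (String[] rec : run)
                out.merge(rec[0], Integer.parseInt(rec[1]), Integer::sum);
        return out;
    }
}
```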
The dataset used for the experiment is enwiki-20100904-pages-articles.xml[6], which
contains the full content of Wikipedia and is an XML dataset of around 6 GB. We
implemented the Naive Bayes algorithm[7] in the Hadoop environment[8] to classify the
XML dataset. First, the dataset is divided into chunks, and the chunks are written into
HDFS. A Naive Bayes classifier consists of two
processes: building a model that relates document features to categories from tracked
sample documents, and then using that model to predict the categories of new documents
whose categories are not yet known. The first step, called training, builds a model by
looking at the contents of pre-classified samples and derives the probability of each word
given a category. The second step, called classification, combines the trained model with
the content of a new document, using Bayes' theorem to predict the category of the
new document. We set up the test datasets, trained the model on the training data,
and then used the model to classify the new documents. Thus, to apply the Naive
Bayes classifier to the XML dataset, we first train the model and then use it to classify
new documents.
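For concreteness, the two steps can be sketched as a minimal single-machine multinomial Naive Bayes classifier. This illustrates the algorithm itself rather than the Hadoop job; the class name and the Laplace smoothing constant are our own choices.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal multinomial Naive Bayes: training counts word occurrences per
// category, and classification scores a new document with log-probabilities
// (Laplace-smoothed) and picks the highest-scoring category.
public class TinyNaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> classTotals = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private int totalDocs = 0;

    public void train(String label, String document) {
        totalDocs++;
        docCounts.merge(label, 1, Integer::sum);
        for (String w : document.toLowerCase().split("\\s+")) {
            wordCounts.computeIfAbsent(label, k -> new HashMap<>()).merge(w, 1, Integer::sum);
            classTotals.merge(label, 1, Integer::sum);
        }
    }

    public String classify(String document) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            for (String w : document.toLowerCase().split("\\s+")) {
                int count = wordCounts.get(label).getOrDefault(w, 0);
                // Laplace smoothing avoids zero probabilities for unseen words.
                score += Math.log((count + 1.0) / (classTotals.get(label) + 1.0));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
}
```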
In order to evaluate the performance of the MapReduce technologies, we first measured,
for the different implementations[9], how task execution time grows with the amount of
input data. Fig. 2 depicts our results: Hadoop and Pipelined-MapReduce show almost
similar performance, possibly because both are limited by I/O bandwidth, and the extra
overhead the Pipelined-MapReduce implementation adds to the overall completion time
is small.
Chapter 4
Hadoop and HDFS
4.1 Hadoop System Architecture:
Each cluster has one master node with multiple slave nodes. The master node runs
Name Node and Job Tracker functions and coordinates with the slave nodes to get
the job done. The slaves run Task Tracker, HDFS to store data, and map and reduce
functions for data computation. The basic stack includes Hive and Pig for language
and compilers, HBase for NoSQL database management, and Scribe and Flume for log
collection. ZooKeeper provides centralized coordination for the stack.
Figure 4.1: Hadoop System Architecture
4.1.1 Operating a Server Cluster:
A client submits a job to the master node, which orchestrates with the slaves in the
cluster. The Job Tracker controls the MapReduce job, and each Task Tracker reports back to it. In the
event of a failure, Job Tracker reschedules the task on the same or a different slave node,
whichever is most efficient. HDFS is location-aware or rack-aware and manages data
within the cluster, replicating the data on various nodes for data reliability. If one of
the data replicas on HDFS is corrupted, Job Tracker, aware of where other replicas are
located, can reschedule the task right where it resides, decreasing the need to move data
back from one node to another. This saves network bandwidth and keeps performance
and availability high. Once the job is mapped, the output is sorted and divided into
several groups, which are distributed to reducers. Reducers may be located on the same
node as the mappers or on another node.
Figure 4.2: Operation Of Hadoop Cluster
4.2 Data Storage and Analysis
The problem is simple: while the storage capacities of hard drives have increased
massively over the years, access speeds (the rate at which data can be read from drives)
have not kept up. A typical drive from 1990 could store 1,370 MB of data and had a
transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five
minutes. Over 20 years later, one terabyte drives are the norm, but the transfer speed
is around 100 MB/s, so it takes more than two and a half hours to read all the data off
the disk.
This is a long time to read all data on a single drive and writing is even slower. The
obvious way to reduce the time is to read from multiple disks at once. Imagine if we
had 100 drives, each holding one hundredth of the data. Working in parallel, we could
read the data in less than two minutes. Only using one hundredth of a disk may seem
wasteful. But we can store one hundred datasets, each of which is one terabyte, and
provide shared access to them. We can imagine that the users of such a system would be
happy to share access in return for shorter analysis times, and, statistically, that their
analysis jobs would be likely to be spread over time, so they wouldn't interfere with each
other too much. There is more to being able to read and write data in parallel to or from
multiple disks, though. The first problem to solve is hardware failure: as soon as you
start using many pieces of hardware, the chance that one will fail is fairly high.
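The arithmetic quoted above is easy to check with a short calculation:

```java
// Back-of-the-envelope check: a 1 TB drive read at 100 MB/s, sequentially
// versus split evenly across 100 drives read in parallel.
public class DriveReadTime {
    public static double secondsToRead(double megabytes, double mbPerSec, int drives) {
        return (megabytes / drives) / mbPerSec;
    }

    public static void main(String[] args) {
        double oneDrive = secondsToRead(1_000_000, 100, 1);    // 10,000 s, about 2.8 hours
        double hundred  = secondsToRead(1_000_000, 100, 100);  // 100 s, under two minutes
        System.out.printf("one drive: %.1f hours, 100 drives: %.0f seconds%n",
                oneDrive / 3600, hundred);
    }
}
```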
A common way of avoiding data loss is through replication: redundant copies of
the data are kept by the system so that in the event of failure, there is another copy
available. This is how RAID works, for instance, although Hadoop's file system, the
Hadoop Distributed File System (HDFS), takes a slightly different approach, as you shall
see later. The second problem is that most analysis tasks need to be able to combine
the data in some way; data read from one disk may need to be combined with the data
from any of the other 99 disks. Various distributed systems allow data to be combined
from multiple sources, but doing this correctly is notoriously challenging. Map Reduce
provides a programming model that abstracts the problem from disk reads and writes,
transforming it into a computation over sets of keys and values.
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis
system. The storage is provided by HDFS and analysis by Map Reduce. There are other
parts to Hadoop, but these capabilities are its kernel.
4.3 Comparison with Other Systems:
The approach taken by MapReduce may seem like a brute-force approach. The
premise is that the entire dataset, or at least a good portion of it, is processed for each
query. But this is its power. MapReduce is a batch query processor, and the ability to
run an ad hoc query against your whole dataset and get the results in a reasonable time
is transformative. It changes the way you think about data, and unlocks data that was
previously archived on tape or disk. It gives people the opportunity to innovate with
data. Questions that took too long to get answered before can now be answered, which
in turn leads to new questions and new insights. For example, Mailtrust, Rackspace's
mail division, used Hadoop for processing email logs. One ad hoc query they wrote was
to find the geographic distribution of their users. In their words: "This data was so useful
that we've scheduled the MapReduce job to run monthly and we will be using this data
to help us decide which Rackspace data centers to place new mail servers in as we grow."
By bringing several hundred gigabytes of data together and having the tools to analyze
it, the Rackspace engineers were able to gain an understanding of the data that they
otherwise would never have had, and, furthermore, they were able to use what they had
learned to improve the service for their customers.
4.3.1 RDBMS:
Why can't we use databases with lots of disks to do large-scale batch analysis? Why
is MapReduce needed? The answer to these questions comes from another trend in disk
drives: seek time is improving more slowly than transfer rate. Seeking is the process
of moving the disk's head to a particular place on the disk to read or write data. It
characterizes the latency of a disk operation, whereas the transfer rate corresponds to a
disk's bandwidth. If the data access pattern is dominated by seeks, it will take longer
to read or write large portions of the dataset than streaming through it, which operates
at the transfer rate. On the other hand, for updating a small proportion of records in
a database, a traditional B-Tree (the data structure used in relational databases, which
is limited by the rate at which it can perform seeks) works well. For updating the majority of a
database, a B-Tree is less efficient than Map Reduce, which uses Sort/Merge to rebuild
the database. In many ways, MapReduce can be seen as a complement to an RDBMS.
MapReduce is a good fit for problems that need to analyze the whole dataset, in a
batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or
updates, where the dataset has been indexed to deliver low-latency retrieval and update
times of a relatively small amount of data.
Map Reduce suits applications where the data is written once, and read many times,
whereas a relational database is good for datasets that are continually updated. Another
difference between Map Reduce and an RDBMS is the amount of structure in the datasets
that they operate on. Structured data is data that is organized into entities that have a
defined format, such as XML documents or database tables that conform to a particular
predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other
hand, is looser, and though there may be a schema, it is often ignored, so it may be used
only as a guide to the structure of the data: for example, a spreadsheet, in which the
structure is the grid of cells, although the cells themselves may hold any form of data.
Unstructured data does not have any particular internal structure: for example, plain
text or image data. MapReduce works well on unstructured or semi-structured data,
since it is designed to interpret the data at processing time. In other words, the input
keys and values for Map Reduce are not an intrinsic property of the data, but they are
chosen by the person analyzing the data.
Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for Map Reduce, since it makes reading a record a nonlocal
operation, and one of the central assumptions that Map Reduce makes is that it is possible
to perform (high-speed) streaming reads and writes. A web server log is a good example
of a set of records that is not normalized (for example, the client hostnames are specified
in full each time, even though the same client may appear many times), and this is one
reason that log files of all kinds are particularly well-suited to analysis with Map Reduce.
MapReduce is a linearly scalable programming model. The programmer writes two
functions, a map function and a reduce function, each of which defines a mapping from one
set of key-value pairs to another. These functions are oblivious to the size of the data or
the cluster that they are operating on, so they can be used unchanged for a small dataset
and for a massive one. More importantly, if you double the size of the input data, a job
will take twice as long. But if you also double the size of the cluster, a job will run as fast
as the original one. This is not generally true of SQL queries. Over time, however, the
differences between relational databases and Map Reduce systems are likely to blur both
as relational databases start incorporating some of the ideas from Map Reduce (such as
Aster Data's and Greenplum's databases) and, from the other direction, as higher-level
query languages built on Map Reduce (such as Pig and Hive) make Map Reduce systems
more approachable to traditional database programmers.
4.4 Hadoop Distributed File System (HDFS):
When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines. File systems that manage
the storage across a network of machines are called distributed file systems. Since they
are network-based, all the complications of network programming kick in, thus making
distributed file systems more complex than regular disk file systems.
For example, one of the biggest challenges is making the file system tolerate node
failure without suffering data loss. Hadoop comes with a distributed file system called
HDFS, which stands for Hadoop Distributed File System. HDFS is Hadoop's flagship file
system and is the focus of this chapter, but Hadoop actually has a general-purpose file
system abstraction, so we'll see along the way how Hadoop integrates with other storage
systems (such as the local file system).
4.4.1 HDFS Federation:
The name node keeps a reference to every file and block in the file system in memory,
which means that on very large clusters with many files, memory becomes the limiting
factor for scaling. HDFS federation addresses this by adding name nodes, each of which
manages a portion of the file system namespace. For example, one name node might
manage all the files rooted under /user, say, and a second name node might handle files
under /share.
Under federation, each name node manages a namespace volume, which is made up
of the metadata for the namespace, and a block pool containing all the blocks for the files
in the namespace. Namespace volumes are independent of each other, which means name
nodes do not communicate with one another, and furthermore the failure of one name
node does not affect the availability of the namespaces managed by other name nodes.
Block pool storage is not partitioned, however, so data nodes register with each name
node in the cluster and store blocks from multiple block pools. To access a federated
HDFS cluster, clients use client-side mount tables to map file paths to name nodes. This
is managed in configuration using ViewFileSystem and viewfs:// URIs.
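The client-side mount-table idea can be illustrated with a small sketch. The resolver below is hypothetical (real federated HDFS configures this through viewfs), but it shows the longest-matching-prefix lookup that maps a path such as /user/alice to the name node responsible for it.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative client-side mount table: a file path is resolved to the name
// node whose mount point is the longest prefix of the path. Names such as
// "namenode-1" are made up for the example.
public class MountTable {
    private final TreeMap<String, String> mounts = new TreeMap<>();

    public void addMount(String pathPrefix, String nameNode) {
        mounts.put(pathPrefix, nameNode);
    }

    public String nameNodeFor(String path) {
        String best = null;
        // Entries iterate in sorted order, so a longer matching prefix is
        // seen after (and overrides) a shorter one.
        for (Map.Entry<String, String> e : mounts.entrySet())
            if (path.startsWith(e.getKey())) best = e.getValue();
        return best;
    }
}
```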
4.4.2 HDFS High-Availability
The combination of replicating name node metadata on multiple file systems, and
using the secondary name node to create checkpoints protects against data loss, but does
not provide high-availability of the file system. The name node is still a single point
of failure (SPOF), since if it did fail, all clients, including MapReduce jobs, would be
unable to read or write files, because the name node holds the file-to-block mapping. In such an event the whole Hadoop system would
effectively be out of service until a new name node could be brought online. To recover
from a failed name node in this situation, an administrator starts a new primary name
node with one of the file system metadata replicas, and configures data nodes and clients
to use this new name node. The new name node is not able to serve requests until it
has i) loaded its namespace image into memory, ii) replayed its edit log, and iii) received
enough block reports from the data nodes to leave safe mode. On large clusters with
many files and blocks, the time it takes for a name node to start from cold can be 30
minutes or more. The long recovery time is a problem for routine maintenance too. In
fact, since unexpected failure of the name node is so rare, the case for planned downtime
is actually more important in practice. The 0.23 release series of Hadoop remedies this
situation by adding support for HDFS high-availability (HA).
In this implementation there is a pair of name nodes in an active-standby configuration.
In the event of the failure of the active name node, the standby takes over
its duties to continue servicing client requests without a significant interruption. A few
architectural changes are needed to allow this to happen:
The name nodes must use highly-available shared storage to share the edit log. (In the
initial implementation of HA this will require an NFS filer, but in future releases more
options will be provided, such as a BookKeeper-based system built on ZooKeeper.)
When a standby name node comes up it reads up to the end of the shared edit log to
synchronize its state with the active name node, and then continues to read new entries
as they are written by the active name node.
Data nodes must send block reports to both name nodes since the block mappings are
stored in a name node's memory, and not on disk. Clients must be configured to handle
name node failover, which uses a mechanism that is transparent to users. If the active
name node fails, then the standby can take over very quickly (in a few tens of seconds)
since it has the latest state available in memory: both the latest edit log entries, and an
up-to-date block mapping.
The actual observed failover time will be longer in practice (around a minute or so),
since the system needs to be conservative in deciding that the active name node has
failed. In the unlikely event of the standby being down when the active fails, the admin-
istrator can still start the standby from cold. This is no worse than the non-HA case, and
from an operational point of view it's an improvement, since the process is a standard
operational procedure built into Hadoop.
4.5 Algorithm:
4.5.1 Algorithm For Word Count:
Example: Wordcount Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

Example: Wordcount Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Chapter 5
ADVANTAGES, FEATURES AND
APPLICATIONS
5.1 Advantages
The design has the following advantages as compared to the existing implementations
surveyed above:
1. Either design allows Reducers to start processing as soon as data is made available
by the Mappers. This allows parallelism between the Map and Reduce phases.
2. A downstream processing element can start processing as soon as some data is
available from an upstream element.
3. The network is better utilized, as data is continuously pushed from one phase to
the next.
4. The final output is computed incrementally.
5. Introduction of a pipeline between the Reduce phase of one job and the Map phase
of the next job will support cascaded MapReduce jobs and reduce the delays experienced
between the jobs. A module that pushes data produced by one job into the InputQueue
of the next job will pipeline and parallelise the operation of the two jobs. A typical
application is database queries where, for example, the output of a join operation may
be required as the input to a grouping operation.
5.2 Features
5.2.1 Time windows
The user can define a time-window for processing where the user can specify the range
of time values over which the output for the job is to be calculated. For example, the
user could specify a query X over the click-stream data from 24 hours ago until the
present time. This will use the timestamp metadata added to the input records and
process only those records whose timestamps lie within the range.
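A minimal sketch of such a filter, assuming each input record carries a millisecond timestamp as its metadata (class and method names are our own):

```java
// Hypothetical time-window filter: only records whose timestamp metadata
// falls inside the user-specified window are handed to the job.
public class TimeWindowFilter {
    private final long startMillis;
    private final long endMillis;

    public TimeWindowFilter(long startMillis, long endMillis) {
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }

    // True if the record's timestamp lies within [start, end].
    public boolean accept(long recordTimestamp) {
        return recordTimestamp >= startMillis && recordTimestamp <= endMillis;
    }
}
```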
5.2.2 Snapshots
Incremental processing of data can allow users to take 'snapshots' of the partial output
generated, based on the job running time and/or the percentage of the job completed. For
example, in the word count job, the user could request a snapshot of the data processed
up to time t = 15 minutes from the start of the job. A snapshot will essentially be a snapshot of the
output folder of S3 where the output is being collected. Snapshots allow users to get an
approximate idea of the output without having to wait for all processing to be done till
the final output is available.
5.2.3 Cascaded MapReduce Jobs
Cascaded MapReduce jobs are those where the output of one MapReduce job is
to be used by the next MapReduce job. Pipelining between jobs further increases the
throughput while decreasing latency, which suits popular and typical stream-processing
applications. With accurate delay guarantees, this design can also be used to process
real-time data.
Conclusion
As described above, the design fulfills a real need of processing streaming data using
MapReduce. It is also inherently scalable, as it is cloud-based. This also gives it a
lightweight nature, as the handling of distributed resources is done by the cloud OS. The
gains due to parallelism are expected to be similar to, or better than, those obtained in [3].
Currently, the authors are working on implementing the said design. Future work will
include supporting rolling windows for obtaining outputs over arbitrary time intervals of the
input stream. Further work can be done in maintaining intermediate output information
for supporting rolling windows. Also, reducing the delays to levels acceptable for real-
time data is an important use case that needs to be supported. Future scope also includes
designing a generic system that is portable across several cloud operating systems.
Bibliography
[1] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on
Large Clusters in OSDI, 2004.
[2] Huan Liu, Dan Orban: Cloud MapReduce: a MapReduce Implementation on top
of a Cloud Operating System. Accenture Technology Labs; in Cluster, Cloud and
Grid Computing (CCGrid), 2011 11th IEEE/ACM International Symposium.
[3] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy,
Russell Sears: MapReduce Online in 7th USENIX conference on Networked systems
design and implementation 2010
[4] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G.
Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia: Above the Clouds: A
Berkeley View of Cloud Computing. Technical Report UCB/EECS-2009-28, EECS
Department, University of California, Berkeley, Feb 2009.
[5] Daniel Warneke and Odej Kao: Exploiting Dynamic Resource Allocation for Effi-
cient Parallel Data Processing in the Cloud in IEEE Transactions on parallel and
distributed systems, 2011
[6] Shrideep Pallickara, Jaliya Ekanayake and Geoffrey Fox: Granules- A Lightweight,
Streaming Runtime for Cloud Computing With Support for MapReduce in Cluster
Computing and Workshops, CLUSTER’09, IEEE International Conference.
[7] P. Floreen, M. Przybilski, P. Nurmi, J. Koolwaaij, A. Tarlano, M. Wagner, M.
Luther, F. Bataille, M. Boussard, B. Mrohs, et al. (2005): Towards a Context
Management Framework for MobiLife. 14th IST Mobile Wireless Summit.
[8] Nathan Backman, Karthik Pattabiraman, Ugur Cetintemel: C-MR: A Continuous
MapReduce Processing Model for Low-Latency Stream Processing on Multi-Core
Architectures in Department of Computer Science, Brown University.
[9] Zikai Wang: A Distributed Implementation of Continuous- MapReduce Stream Pro-
cessing Framework in Department of Computer Science, Brown University.
[10] Huan Liu: Cutting MapReduce Cost with Spot Market In Proc. Of Usenix HotCloud
(2011).
[11] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica: Im-
proving MapReduce Performance in Heterogeneous Environments In proceedings of
8th Usenix Symposium on Operating Systems Design and Implementation.
[12] Michael Stonebraker, Ugur Cetintemel, Stan Zdonik: The 8 Requirements of Real-
Time Stream Processing in ACM SIGMOD Record Volume 34 Issue 4, 2005
[13] Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.
Twister: A runtime for iterative MapReduce In The First International Workshop
on MapReduce and its Applications (2010).
[14] Hadoop. http://hadoop.apache.org/
[15] Amazon Elastic MapReduce aws.amazon.com/elasticmapreduce/
[16] Gunho Lee, Byung-Gon Chun, Randy H. Katz: Heterogeneity-Aware Resource
Allocation and Scheduling in the Cloud in Usenix HotCloud 2011
[17] Amazon ec2. http://aws.amazon.com/ec2
[18] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S.
Shenker, and I. Stoica. Mesos: A platform for finegrained resource sharing in the data
center In USENIX symposium on Networked Systems Design and Implementation,
2011.
[19] J. Polo, D. Carrera, Y. Becerra, V. Beltran, J. Torres, and E. Ayguade: Perfor-
mance Management of Accelerated MapReduce Workloads in Heterogeneous Clusters
in 39th International Conference on Parallel Processing (ICPP 2010), 2010.
[20] Pramod Bhatotia, Alexander Wieder, Istemi Ekin Akkus, Rodrigo Rodrigues, Umut
A. Acar: Large-scale Incremental Data Processing with Change Propagation. Max
Planck Institute for Software Systems (MPI-SWS).
[21] Logothetis, D., Olston, C., Reed, B., Webb, K. C., and Yocum, K.: Stateful
Bulk Processing for Incremental Analytics in Proc. 1st Symp. on Cloud Computing
(SoCC'10).
[22] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom:
Models and Issues in Data Stream Systems. Department of Computer Science,
Stanford University.