Overview of Hadoop and MapReduce
Ganesh Neelakanta Iyer
Research Scholar, National University of Singapore
About Me
3 years of industry work experience:
• Sasken Communication Technologies Ltd, Bangalore
• NXP Semiconductors Pvt Ltd (formerly Philips Semiconductors), Bangalore
Master's in Electrical and Computer Engineering from NUS (National University of Singapore), 2008
Currently a Research Scholar at NUS under the guidance of A/P Bharadwaj Veeravalli
Research Interests: Cloud computing, game theory, resource allocation and pricing
Personal Interests: Kathakali, teaching, travelling, photography
Agenda
• Introduction to Hadoop
• Introduction to HDFS
• MapReduce Paradigm
• Some practical MapReduce examples
• MapReduce in Hadoop
• Concluding remarks
Introduction to Hadoop
Data!
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage
• The New York Stock Exchange generates about one terabyte of new trade data per day
• In the last week alone, I took 15 GB of photos while travelling. Imagine the storage needed for all the photos taken worldwide in a single day!
Hadoop
• Open-source cloud framework supported by Apache
• Reliable shared storage and analysis system
• Uses a distributed file system (called HDFS), similar to Google's GFS
• Can be used for a variety of applications
Typical Hadoop Cluster
Pro-Hadoop by Jason Venner
[Figure: aggregation switch connecting rack switches]
• 40 nodes/rack, 1000-4000 nodes per cluster
• 1 Gbps bandwidth within a rack, 8 Gbps out of the rack
• Node specs (Yahoo! terasort): 8 x 2 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
Images from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf and http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
Introduction to HDFS
HDFS – Hadoop Distributed File System
Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Optimized for Batch Processing
– Data locations exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
Runs in user space, on heterogeneous OSes
Data Coherency
– Write-once-read-many access model
– Clients can only append to existing files
Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
Intelligent Client
– Client can find the location of blocks
– Client accesses data directly from the DataNode
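The block and replication scheme above can be sketched as a toy model in Python. The function names and round-robin placement policy here are illustrative only; real HDFS uses a rack-aware placement policy decided by the NameNode.

```python
# Toy sketch (not real HDFS code): split a file into fixed-size blocks
# and assign each block to several DataNodes, round-robin.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size above
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Map block index -> list of DataNodes holding a replica."""
    return {i: [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
            for i in range(num_blocks)}

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file -> 3 blocks
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement[0])
```

With three replicas per block, losing any single DataNode leaves every block recoverable, which is the failure model the bullets above describe.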
MapReduce Paradigm
MapReduce
Simple data-parallel programming model designed for scalability and fault-tolerance
Framework for distributed processing of large data sets
Originally designed by Google
Pluggable user code runs in generic framework
Pioneered by Google - Processes 20 petabytes of data per day
What is MapReduce used for?
At Google:
• Index construction for Google Search
• Article clustering for Google News
• Statistical machine translation
At Yahoo!:
• "Web map" powering Yahoo! Search
• Spam detection for Yahoo! Mail
At Facebook:
• Data mining
• Ad optimization
• Spam detection
What is MapReduce used for?
In research:
• Astronomical image analysis (Washington)
• Bioinformatics (Maryland)
• Analyzing Wikipedia conflicts (PARC)
• Natural language processing (CMU)
• Particle physics (Nebraska)
• Ocean climate simulation (Washington)
• <Your application here>
MapReduce Programming Model
Data type: key-value records
Map function: (Kin, Vin) → list(Kinter, Vinter)
Reduce function: (Kinter, list(Vinter)) → list(Kout, Vout)
Example: Word Count

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
Input → Map → Shuffle & Sort → Reduce → Output

Input (three splits):
    the quick brown fox
    the fox ate the mouse
    how now brown cow

Map outputs:
    the, 1   quick, 1   brown, 1   fox, 1
    the, 1   fox, 1   ate, 1   the, 1   mouse, 1
    how, 1   now, 1   brown, 1   cow, 1

Reduce outputs (after shuffle & sort groups values by key):
    ate, 1   brown, 2   cow, 1   fox, 2
    how, 1   mouse, 1   now, 1   quick, 1   the, 3
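The whole flow can be reproduced with a minimal in-memory sketch of the MapReduce runtime. This is a single-process illustration of the model, not how Hadoop itself executes jobs; `run_mapreduce` is a made-up driver name.

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    yield (key, sum(values))

def run_mapreduce(inputs, mapper, reducer):
    # Map phase: apply mapper to every input record
    intermediate = []
    for record in inputs:
        intermediate.extend(mapper(record))
    # Shuffle & sort: group all values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reducer call per key, in sorted key order
    output = {}
    for key in sorted(groups):
        output.update(dict(reducer(key, groups[key])))
    return output

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
result = run_mapreduce(lines, mapper, reducer)
print(result)
```

The output matches the reduce column of the diagram above: "the" occurs 3 times, "brown" and "fox" twice, everything else once.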
MapReduce Execution Details
Single master controls job execution on multiple slaves
Mappers are preferentially placed on the same node or same rack as their input block
– Minimizes network usage
Mappers save outputs to local disk before serving them to reducers
– Allows recovery if a reducer crashes
– Allows having more reducers than nodes
Fault Tolerance in MapReduce
1. If a task crashes:
– Retry on another node
– OK for a map because it has no dependencies
– OK for a reduce because map outputs are on disk
– If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
Fault Tolerance in MapReduce
2. If a node crashes:
– Re-launch its current tasks on other nodes
– Re-run any maps the node previously ran, because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3. If a task is going slowly (straggler):
– Launch a second copy of the task on another node ("speculative execution")
– Take the output of whichever copy finishes first, and kill the other
Surprisingly important in large clusters
– Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
– A single straggler may noticeably slow down a job
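Speculative execution can be illustrated with a small thread-based sketch. The task function and its delays are invented for the demo; in a real cluster the master schedules the backup copy on a different node and kills the loser.

```python
import concurrent.futures
import time

def task(delay, label):
    # Stand-in for a map or reduce task; delay models a slow node.
    time.sleep(delay)
    return label

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # Launch two copies of the "same" task on different "nodes".
    fast = pool.submit(task, 0.01, "copy-on-healthy-node")
    slow = pool.submit(task, 1.0, "copy-on-straggler")
    # Take the output of whichever copy finishes first.
    done, _ = concurrent.futures.wait(
        [fast, slow], return_when=concurrent.futures.FIRST_COMPLETED)
    winner = done.pop().result()
    slow.cancel()  # the losing copy would be killed in a real cluster

print(winner)
```

The job's latency is now bounded by the faster copy, which is exactly why speculative execution defuses stragglers.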
Takeaways
By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
• Automatic division of a job into tasks
• Automatic placement of computation near data
• Automatic load balancing
• Recovery from failures & stragglers
User focuses on application, not on complexities of distributed computing
Some practical MapReduce examples
1. Search
Input: (lineNumber, line) records
Output: lines matching a given pattern

Map:
    if line matches pattern:
        output(line)

Reduce: identity function
– Alternative: no reducer (map-only job)
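A local sketch of this map-only job in plain Python. The helper name `search_mapper` and the sample records are illustrative, not from the slides.

```python
import re

def search_mapper(record, pattern):
    # Map: emit the line if it matches; no reducer is needed.
    line_number, line = record
    if re.search(pattern, line):
        yield line

records = [(1, "the quick brown fox"),
           (2, "hello world"),
           (3, "fox and hound")]
matches = [line for rec in records for line in search_mapper(rec, r"fox")]
print(matches)  # ['the quick brown fox', 'fox and hound']
```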
2. Sort
Input: (key, value) records
Output: same records, sorted by key

Map: identity function
Reduce: identity function

Trick: pick a partitioning function h such that k1 < k2 => h(k1) < h(k2)
Input splits:
    pig, sheep, yak, zebra
    aardvark, ant, bee, cow, elephant

Map (identity) routes each key with h:
    Reducer [A-M] receives: ant, bee / aardvark, elephant / cow
    Reducer [N-Z] receives: zebra / pig / sheep, yak

Each reducer sorts its partition; concatenating the outputs gives a globally sorted result.
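A minimal sketch of the partitioning trick, assuming two reducers that split on the letter M as in the figure (the function name `h` follows the slide's notation):

```python
def h(key):
    # Order-preserving partitioner: k1 < k2 implies h(k1) <= h(k2),
    # so concatenating the reducers' sorted outputs is a global sort.
    return 0 if key[0].upper() <= "M" else 1

words = ["pig", "sheep", "yak", "zebra",
         "aardvark", "ant", "bee", "cow", "elephant"]

# "Map" is the identity; each record is routed to a reducer by h.
partitions = {0: [], 1: []}
for w in words:
    partitions[h(w)].append(w)

# Each reducer sorts its own partition; concatenation is globally sorted.
output = sorted(partitions[0]) + sorted(partitions[1])
print(output)
```

Because h preserves key order across partitions, no merge step is needed after the reducers finish.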
3. Inverted Index
Input: (filename, text) records
Output: list of files containing each word

Map:
    for word in text.split():
        output(word, filename)

Combine: uniquify filenames for each word

Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))
Inverted Index Example

Input:
    hamlet.txt: "to be or not to be"
    12th.txt: "be not afraid of greatness"

Map outputs:
    to, hamlet.txt   be, hamlet.txt   or, hamlet.txt   not, hamlet.txt
    be, 12th.txt   not, 12th.txt   afraid, 12th.txt   of, 12th.txt   greatness, 12th.txt

Reduce outputs:
    afraid, (12th.txt)
    be, (12th.txt, hamlet.txt)
    greatness, (12th.txt)
    not, (12th.txt, hamlet.txt)
    of, (12th.txt)
    or, (hamlet.txt)
    to, (hamlet.txt)
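A single-process sketch of the map/combine/reduce steps above, collapsed into one function for illustration (`build_inverted_index` is a made-up name):

```python
from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(set)
    for filename, text in documents:
        for word in text.split():       # map: emit (word, filename)
            index[word].add(filename)   # combine: uniquify filenames
    # reduce: sort the filename list for each word
    return {word: sorted(files) for word, files in index.items()}

docs = [("hamlet.txt", "to be or not to be"),
        ("12th.txt", "be not afraid of greatness")]
index = build_inverted_index(docs)
print(index["be"])   # ['12th.txt', 'hamlet.txt']
```

Using a set for the combine step mirrors the slide's "uniquify filenames": "to" appears twice in hamlet.txt but the file is listed only once.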
4. Most Popular Words
Input: (filename, text) records
Output: top 100 words occurring in the most files

Two-stage solution:
Job 1: Create an inverted index, giving (word, list(file)) records
Job 2: Map each (word, list(file)) to (count, word); sort these records by count as in the sort job
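The two jobs can be sketched as one single-process function; the sample documents and the `top_n` parameter are illustrative.

```python
from collections import defaultdict

def most_popular_words(documents, top_n=100):
    # Job 1: inverted index -> (word, set_of_files)
    index = defaultdict(set)
    for filename, text in documents:
        for word in text.split():
            index[word].add(filename)
    # Job 2: map each (word, files) to (count, word), then sort by count
    counts = [(len(files), word) for word, files in index.items()]
    counts.sort(reverse=True)
    return counts[:top_n]

docs = [("a.txt", "hadoop mapreduce hdfs"),
        ("b.txt", "hadoop mapreduce"),
        ("c.txt", "hadoop")]
top = most_popular_words(docs, top_n=2)
print(top)  # [(3, 'hadoop'), (2, 'mapreduce')]
```

Counting distinct files per word (rather than raw occurrences) is what the inverted index from Job 1 buys you.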
MapReduce in Hadoop
MapReduce in Hadoop
Three ways to write jobs in Hadoop:
• Java API
• Hadoop Streaming (for Python, Perl, etc.)
• Pipes API (C++)
Word Count in Python with Hadoop Streaming
Mapper.py:

import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t" + "1")

Reducer.py:

import sys
counts = {}
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print(word + "\t" + str(count))
Concluding remarks
Conclusions
The MapReduce programming model hides the complexity of work distribution and fault tolerance

Principal design philosophies:
• Make it scalable, so you can throw hardware at problems
• Make it cheap, lowering hardware, programming and admin costs
MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time
Cloud computing makes it straightforward to start using Hadoop (or other parallel software) at scale
What next?
MapReduce has limitations – applications are limited
Some developments:
• Pig, started at Yahoo! Research
• Hive, developed at Facebook
• Amazon Elastic MapReduce
Resources
Hadoop: http://hadoop.apache.org/core/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Video tutorials: http://www.cloudera.com/hadoop-training
Amazon Web Services: http://aws.amazon.com/
Amazon Elastic MapReduce guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
Slides of the talk delivered by Matei Zaharia, EECS, University of California, Berkeley
Thank You!
[email protected]
http://ganeshniyer.com