mapreduce, hadoop and amazon awslopes/teaching/cs221w12/slides/hadoop-aws.pdf · • hadoop is a...
TRANSCRIPT
![Page 1: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/1.jpg)
MapReduce, Hadoop and Amazon AWS
Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa
February 2011
![Page 2: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/2.jpg)
What is Hadoop?
• A software framework that supports data-intensive distributed applications.
• It enables applications to work with thousands of nodes and petabytes of data.
• Hadoop was inspired by Google's MapReduce and Google File System (GFS).
• Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language.
• Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
![Page 3: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/3.jpg)
Who uses Hadoop?
http://wiki.apache.org/hadoop/PoweredBy
![Page 4: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/4.jpg)
Who uses Hadoop?
• Yahoo! – More than 100,000 CPUs in >36,000 computers.
• Facebook – Used in reporting/analytics and machine learning and also
as storage engine for logs.
– A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
– A 300-machine cluster with 2400 cores and about 3 PB raw storage.
– Each (commodity) node has 8 cores and 12 TB of storage.
![Page 5: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/5.jpg)
Very Large Storage Requirements
• Facebook has Hadoop clusters with 15 PB of raw storage (15,000,000 GB).
• No single storage can handle this amount of data.
• We need a large set of nodes each storing part of the data.
![Page 6: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/6.jpg)
HDFS: Hadoop Distributed File System
1
2
1
2
1
2
3 3 3
Data Nodes
Namenode
Client
1. filename, index
2. Datanodes, Blockid
3. Read data
![Page 7: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/7.jpg)
Terabyte Sort Benchmark
• http://sortbenchmark.org/
• Task: Sorting 100TB of data and writing results on disk (10^12 records each 100 bytes).
• Yahoo’s Hadoop Cluster is the current winner:
– 173 minutes
– 3452 nodes x (2 Quadcore Xeons, 8 GB RAM)
This is the first time that a Java program has won this competition.
![Page 8: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/8.jpg)
Counting Words by MapReduce
Hello World Bye World Hello Hadoop Goodbye Hadoop
Hello World Bye World
Hello Hadoop Goodbye Hadoop
Split
![Page 9: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/9.jpg)
Counting Words by MapReduce
Hello World Bye World
Mapper
Hello, <1> World, <1> Bye, <1> World, <1>
Bye, <1> Hello, <1> World, <1, 1>
Sort & Merge
Bye, <1> Hello, <1> World, <2>
Combiner
Node 1
![Page 10: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/10.jpg)
Counting Words by MapReduce
Sort & Merge
Bye, <1> Hello, <1> World, <2>
Goodbye, <1> Hadoop, <2> Hello, <1>
Bye, <1> Goodbye, <1> Hadoop, <2> Hello, <1, 1> World, <2>
Split
Bye, <1> Goodbye, <1> Hadoop, <2>
Hello, <1, 1> World, <2>
![Page 11: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/11.jpg)
Counting Words by MapReduce
Bye, <1> Goodbye, <1> Hadoop, <2>
Hello, <1, 1> World, <2>
Reducer
Bye, <1> Goodbye, <1> Hadoop, <2>
Reducer Hello, <2> World, <2>
Node 1
Node 2
Write on Disk
part‐00000
Bye 1 Goodbye 1 Hadoop 2
Hello 2 World 2
part‐00001
![Page 12: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/12.jpg)
Writing Word Count in Java
• Download hadoop core (version 0.20.2): – http://www.apache.org/dyn/closer.cgi/hadoop/core/
• It would be something like:
– hadoop-0.20.2.tar.gz
• Unzip the package and extract:
– hadoop-0.20.2-core.jar
• Add this jar file to your project class path
Warning! Most of the sample codes on web are for older versions of Hadoop.
![Page 13: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/13.jpg)
Word Count: Mapper
Source files are available at: http://www.ics.uci.edu/~yganjisa/files/2011/hadoop-presentation/WordCount-v1-src.zip
![Page 14: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/14.jpg)
Word Count: Reducer
![Page 15: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/15.jpg)
Word Count: Main Class
![Page 16: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/16.jpg)
My Small Test Cluster
• 3 nodes – 1 master (ip address: 50.17.65.29) – 2 slaves
• Copy your jar file to master node: – Linux:
• scp WordCount.jar [email protected]:WordCount.jar
– Windows (you need to download pscp.exe):
• pscp.exe WordCount.jar [email protected]:WordCount.jar
• Login to master node:
– ssh [email protected]
![Page 17: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/17.jpg)
Counting words in U.S. Constitution!
• Download text version:
wget http://www.usconstitution.net/const.txt
• Put input text file on HDFS:
hadoop dfs -put const.txt const.txt
• Run the job: hadoop jar WordCount.jar edu.uci.hadoop.WordCount const.txt word-count-result
![Page 18: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/18.jpg)
Counting words in U.S. Constitution!
• List my files on HDFS:
– Hadoop dfs -ls
• List files in word-count-result folder:
– Hadoop dfs -ls word-count-result/
![Page 19: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/19.jpg)
Counting words in U.S. Constitution!
• Downloading results from HDFS:
hadoop dfs -cat word-count-result/part-r-00000 > word-count.txt
• Sort and view results:
sort -k2 -n -r word-count.txt | more
![Page 20: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/20.jpg)
Hadoop Map/Reduce - Terminology
• Running “Word Count” across 20 files is one job
• Job Tracker initiates some number of map tasks and some number of reduce tasks.
• For each map task at least one task attempt will be performed… more if a task fails (e.g., machine crashes).
![Page 21: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/21.jpg)
High Level Architecture of MapReduce
Master Node
JobTracker
Slave Node Slave Node Slave Node
TaskTracker TaskTracker TaskTracker
Client Computer
Task Task Task Task Task
![Page 22: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/22.jpg)
High Level Architecture of Hadoop
Master Node
JobTracker
Slave Node
TaskTracker
MapReduce layer
HDFS layer
TaskTracker
NameNode
DataNode DataNode
Slave Node
TaskTracker
DataNode
![Page 23: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/23.jpg)
Web based User interfaces
• JobTracker: http://50.17.65.29:9100/
• NameNode: http://50.17.65.29:9101/
![Page 24: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/24.jpg)
Hadoop Job Scheduling
• FIFO queue matches incoming jobs to available nodes
– No notion of fairness
– Never switches out running job
• Warning! Start your job as soon as possible.
![Page 25: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/25.jpg)
Reporting Progress
Source files are available at: http://www.ics.uci.edu/~yganjisa/files/2011/hadoop-presentation/WordCount-v2-src.zip
If your tasks don’t report anything in 10 minutes they would be killed by Hadoop!
![Page 26: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/26.jpg)
Distributed File Cache
• The Distributed Cache facility allows you to transfer files from the distributed file system to the local file system (for reading only) of all participating nodes before the beginning of a job.
![Page 27: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/27.jpg)
TextInputFormat
Split
LineRecordReader
<offset1, line1>
<offset2, line2>
<offset3, line3>
For more complex inputs, You should extend:
• InputSplit • RecordReader • InputFormat
![Page 28: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/28.jpg)
Part 2: Amazon Web Services (AWS)
![Page 29: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/29.jpg)
What is AWS?
• A collection of services that together make up a cloud computing platform: – S3 (Simple Storage Service)
– EC2 (Elastic Compute Cloud)
– Elastic MapReduce
– Email Service
– SimpleDB
– Flexibile Payments Service
– …
![Page 30: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/30.jpg)
Case Study: yelp
• Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day.
• Features powered by Amazon Elastic MapReduce include: – People Who Viewed this Also Viewed – Review highlights – Auto complete as you type on search – Search spelling suggestions – Top searches – Ads
• Yelp runs approximately 200 Elastic MapReduce jobs per day, processing 3TB of data.
![Page 31: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/31.jpg)
Amazon S3
• Data Storage in Amazon Data Center
• Web Service interface
• 99.99% monthly uptime guarantee
• Storage cost: $0.15 per GB/Month
• S3 is reported to store more than 102 billion objects as of March 2010.
![Page 32: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/32.jpg)
Amazon S3
• You can think of S3 as a big HashMap where you store your files with a unique key:
– HashMap: key -> File
![Page 33: MapReduce, Hadoop and Amazon AWSlopes/teaching/cs221W12/slides/Hadoop-AWS.pdf · • Hadoop is a top-level Apache project being built and used by a global community of contributors,](https://reader031.vdocuments.us/reader031/viewer/2022021903/5ba201bc09d3f2666b8d8ef0/html5/thumbnails/33.jpg)
References
• Hadoop Project Page: http://hadoop.apache.org/
• Amazon Web Services: http://aws.amazon.com/