MapReduce in Amazon Web Services
Introduction
• Amazon Elastic MapReduce
– Amazon provides the MapReduce framework and interface
– Data Store: Amazon Simple Storage Service (Amazon S3)
– Interface: Web, Console, API
• Running Hadoop Manually
– Set up Amazon EC2 instances
– Set up Hadoop manually on the instances
Amazon Web Services
• Amazon EC2
– Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
– i.e., a computer; note that an instance is reset when it is powered off
• Amazon EBS
– Amazon Elastic Block Store (EBS) provides block-level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are off-instance storage that persists independently from the life of an instance.
– i.e., an external hard drive that can be attached to EC2; its data is stored persistently
• Amazon S3
– Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.
– i.e., a distributed storage system like HDFS; reading and writing requires a separate API
• Amazon Elastic MapReduce
– It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
– i.e., Amazon's MapReduce solution, which provides an interface for running MapReduce programs
Amazon Elastic MapReduce
Running Hadoop Manually
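• Setup Methods
1. Launch EC2 from an image with Hadoop already installed, then configure it manually
2. Install Hadoop on an EBS-backed AMI, copy it, then configure it manually
3. Use hadoop-ec2, which is included with Hadoop
4. Use Whirr
• Methods 1 and 2 take a lot of effort: you have to launch the EC2 instances, or find the IP addresses of the launched EC2 instances, in order to configure Hadoop
• Method 3 is a program included in Hadoop's contrib package; its development now continues in Whirr, but it is not continuously maintained
• Method 4 is the most convenient
– Drawback: when the cluster goes down, any changes made to HDFS are lost
– Data needs to be stored in an external storage service such as EBS or S3
• Reference
– http://diveintodata.org/2011/03/whirr-usage-for-hadoop-cluster-in-amazon-ec2/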
Amazon Web Services
• http://aws.amazon.com/
• Create an AWS Account
Amazon Web Services
• Account Information
• Payment Method
Amazon Web Services
• Payment Method
• Sign in to the AWS Management Console
Amazon Web Services
• AWS Management Console
Whirr
• https://incubator.apache.org/whirr/
• Apache Incubator Project
• A library that automatically installs, configures, and runs the services you want in commercial cloud environments such as Amazon EC2
• Supported cloud environments and services
Cloud provider           Cassandra  Hadoop  ZooKeeper  HBase  elasticsearch  Voldemort
Amazon EC2               Yes        Yes     Yes        Yes    Yes            Yes
Rackspace Cloud Servers  Yes        Yes     Yes        Yes    Yes            Yes
Preparation
• Security Credentials
• Create a new Access Key
Security Credentials
Preparation
• Download Hadoop and Whirr
• Extract them
• Whirr in 5 minutes
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.5.0-incubating/whirr-0.5.0-incubating.tar.gz
tar zxf whirr-0.5.0-incubating.tar.gz; cd whirr-0.5.0-incubating
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo
bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
Configuring
• Setting Environment Variables to Specify AWS Credentials
– AWS Access Key ID
– AWS Secret Access Key
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
• Configure a Hadoop cluster
– Make a copy of hadoop-ec2.properties
cd whirr-0.5.0-incubating
cp recipes/hadoop-ec2.properties ./hadoop-ec2-mod.properties
– Edit hadoop-ec2-mod.properties
vim hadoop-ec2-mod.properties
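The properties file reads both credentials from the environment, so a launch is likely to fail if either export above was skipped. A minimal sketch of a pre-flight check follows; the function name check_aws_credentials is hypothetical and not part of Whirr.

```shell
# Hypothetical helper (not part of Whirr): succeed only if both
# credential variables are set and non-empty.
check_aws_credentials() {
  missing=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is to chain it in front of the launch command, for example check_aws_credentials && bin/whirr launch-cluster --config hadoop-ec2-mod.properties.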
Configuring
• hadoop-ec2-mod.properties
• http://incubator.apache.org/whirr/configuration-guide.html
whirr.cluster-user=hadoop
whirr.cluster-name=hadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${whirr.private-key-file}.pub
whirr.hardware-id=m1.xlarge
whirr.image-id=us-east-1/ami-08f40561
whirr.location-id=us-east-1d
# Expert: specify the version of Hadoop to install.
#whirr.hadoop.version=0.20.203.0
#whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz
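The cluster name set here also names the directory ~/.whirr/<cluster-name> that later holds the generated client configuration, so it can be handy to read a value back out of the properties file. The sketch below assumes simple key=value lines; get_property is a hypothetical helper, not a Whirr command.

```shell
# Hypothetical helper: print the value of a key from a properties file.
# Usage: get_property KEY FILE
# Assumes plain "key=value" lines; lines starting with # are ignored.
get_property() {
  awk -F= -v key="$1" '
    /^[ \t]*#/ { next }       # skip commented-out settings
    $1 == key {
      sub(/^[^=]*=/, "")      # drop "key=" and keep the rest of the line
      value = $0
    }
    END { print value }       # the last definition wins
  ' "$2"
}
```

For example, get_property whirr.cluster-name hadoop-ec2-mod.properties would print hadoopcluster for the file shown above.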
Configuring
• whirr.instance-templates
– The number of instances to launch for each set of roles in a service
– e.g., 1 nn+jt,10 dn+tt means one instance with the roles nn (namenode) and jt (jobtracker), and ten instances each with the roles dn (datanode) and tt (tasktracker)
• whirr.image-id
– The ID of the image to use for instances. If not specified then a vanilla Linux image is chosen.
– e.g., http://alestic.com/
• whirr.location-id
– The location to launch instances in. If not specified then an arbitrary location will be chosen.
– If you choose a different location, make sure whirr.image-id is updated too
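Because every instance is billed, it may be worth checking how many machines a whirr.instance-templates value will start before launching. The sketch below sums the leading counts of each comma-separated group; count_instances is a hypothetical helper written for illustration.

```shell
# Hypothetical helper: sum the instance counts in a
# whirr.instance-templates value such as "1 nn+jt,10 dn+tt".
count_instances() {
  # Put each "<count> <roles>" group on its own line, then add up the counts.
  echo "$1" | tr ',' '\n' | awk '{ total += $1 } END { print total + 0 }'
}

# The template used in hadoop-ec2-mod.properties: one master and two workers.
count_instances "1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker"   # prints 3
```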
Configuring
• whirr.hardware-id
– http://aws.amazon.com/ec2/instance-types/
Configuring
• Price of On-Demand Instances
Configuring
• Generate a keypair
ssh-keygen -t rsa -P ''
Launch
• Run the following command to launch a cluster
bin/whirr launch-cluster --config hadoop-ec2-mod.properties
Run a MapReduce Job
• A hadoop-site.xml file is created in the directory ~/.whirr/<cluster-name>
• You can use it to connect to the cluster by setting the HADOOP_CONF_DIR environment variable
export HADOOP_CONF_DIR=~/.whirr/hadoopcluster
• Run a proxy
. ~/.whirr/hadoopcluster/hadoop-proxy.sh
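Setting HADOOP_CONF_DIR only helps if the launch actually wrote hadoop-site.xml into that directory. A minimal sketch of a guard follows; use_whirr_cluster is a hypothetical helper, and it assumes the ~/.whirr/<cluster-name> layout described above.

```shell
# Hypothetical helper: point HADOOP_CONF_DIR at a Whirr cluster directory,
# but only if the generated hadoop-site.xml is present there.
# Usage: use_whirr_cluster CLUSTER_NAME
use_whirr_cluster() {
  conf_dir="$HOME/.whirr/$1"
  if [ ! -f "$conf_dir/hadoop-site.xml" ]; then
    echo "no hadoop-site.xml in $conf_dir; has the cluster been launched?" >&2
    return 1
  fi
  export HADOOP_CONF_DIR="$conf_dir"
}
```

With the configuration in this deck, the call would be use_whirr_cluster hadoopcluster.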
Run a MapReduce Job
• You should now be able to browse HDFS:
cd ..
cd hadoop-0.20.2/
bin/hadoop fs -ls /
Run a MapReduce Job
• You can run a MapReduce job from the local machine
bin/hadoop fs -mkdir input
bin/hadoop fs -put LICENSE.txt input
bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output
Run a MapReduce Job
• View the result of the MapReduce job
bin/hadoop fs -cat output/part-* | tail
Destroy a cluster
• When you've finished using a cluster you can terminate the instances and clean up resources with the following.
• All data will be deleted when you destroy the cluster.
bin/whirr destroy-cluster --config hadoop-ec2-mod.properties
Using Amazon EBS
• Transfer any data that can be reused
Using Amazon EBS
ssh -i /home/xeryeon/.ssh/id_rsa [email protected] ebs
sudo mkfs.ext4 /dev/sdf
sudo mount /dev/sdf ./ebs/