MapReduce in Amazon Web Services

Introduction

• Amazon Elastic MapReduce
  – Amazon provides the MapReduce framework and interface
  – Data Store: Amazon Simple Storage Service (Amazon S3)
  – Interface: Web, Console, API

• Running Hadoop Manually
  – Set up Amazon EC2 instances
  – Set up Hadoop manually on the instances

Amazon Web Services

• Amazon EC2
  – Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
  – i.e., a computer; note that it is reset when the instance is powered off

• Amazon EBS
  – Amazon Elastic Block Store (EBS) provides block level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are off-instance storage that persists independently from the life of an instance.
  – i.e., an external hard drive that can be attached to EC2; the data is stored persistently

• Amazon S3
  – Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.
  – i.e., a distributed storage system like HDFS; reading and writing require a separate API

• Amazon Elastic MapReduce
  – It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
  – i.e., the MapReduce solution provided by Amazon; it offers an interface for running MapReduce programs

Amazon Elastic MapReduce

Running Hadoop Manually

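• Setup Methods
  1. Launch EC2 from an image with Hadoop already installed, then configure it manually
  2. Install Hadoop on an EBS-backed AMI, copy it, then configure it manually
  3. Use hadoop-ec2, which is included with Hadoop
  4. Use Whirr

• Methods 1 and 2 take a lot of effort: you have to launch the EC2 instances, or find the IP addresses of the running instances, and then configure Hadoop yourself

• Method 3 is a program included in Hadoop's contrib package; the work now continues in Whirr, but it is not consistently maintained

• Method 4 is the most convenient
  – Drawback: when the cluster goes down, any changes made to HDFS are lost
  – Data needs to be stored on an external storage service such as EBS or S3

• Reference
  – http://diveintodata.org/2011/03/whirr-usage-for-hadoop-cluster-in-amazon-ec2/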

Amazon Web Services

• http://aws.amazon.com/
• Create an AWS Account

Amazon Web Services

• Account Information
• Payment Method

Amazon Web Services

• Payment Method
• Sign in to the AWS Management Console

Amazon Web Services

• AWS Management Console

Whirr

• https://incubator.apache.org/whirr/
• Apache Incubator Project
• A library that automatically installs, configures, and runs the services you want in commercial cloud environments such as Amazon EC2
• Supported cloud environments and services

Cloud provider            Cassandra  Hadoop  ZooKeeper  HBase  elasticsearch  Voldemort
Amazon EC2                Yes        Yes     Yes        Yes    Yes            Yes
Rackspace Cloud Servers   Yes        Yes     Yes        Yes    Yes            Yes

Preparation

• Security Credentials

• Create a new Access Key

Security Credentials

Preparation

• Download Hadoop and Whirr
• Extract them

• Whirr in 5 minutes

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.5.0-incubating/whirr-0.5.0-incubating.tar.gz
tar zxf whirr-0.5.0-incubating.tar.gz; cd whirr-0.5.0-incubating

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr

echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo

bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr

Whirr in 5 minutes

Configuring

• Setting Environment Variables to Specify AWS Credentials
  – AWS Access Key ID
  – AWS Secret Access Key

• Configure a Hadoop cluster
  – Make a copy of hadoop-ec2.properties
  – Edit hadoop-ec2-mod.properties

cd whirr-0.5.0-incubating
cp recipes/hadoop-ec2.properties ./hadoop-ec2-mod.properties

vim hadoop-ec2-mod.properties

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
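Since the properties file reads both credentials from the environment, it can help to confirm they are set before launching anything. The following is a minimal sketch, not part of Whirr: check_aws_credentials is a hypothetical helper name, and it only tests that the two variables are non-empty, not that the keys are valid.

```shell
# Hypothetical pre-flight helper (not a Whirr command).
# The function body runs in a subshell, so the loop variables do not leak
# into the caller and "exit" only ends that subshell.
check_aws_credentials() (
  missing=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # Read the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Status 0 when both variables are set, 1 otherwise.
  exit "$missing"
)
```

One way to use it is to chain it in front of the launch command, for example check_aws_credentials && bin/whirr launch-cluster --config hadoop-ec2-mod.properties, so that the launch is skipped when a variable is missing.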

Configuring

• hadoop-ec2-mod.properties

• http://incubator.apache.org/whirr/configuration-guide.html

whirr.cluster-user=hadoop
whirr.cluster-name=hadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker

whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${whirr.private-key-file}.pub

whirr.hardware-id=m1.xlarge
whirr.image-id=us-east-1/ami-08f40561
whirr.location-id=us-east-1d

# Expert: specify the version of Hadoop to install.
#whirr.hadoop.version=0.20.203.0
#whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz

Configuring

• whirr.instance-templates
  – The number of instances to launch for each set of roles in a service
  – e.g., 1 nn+jt,10 dn+tt means one instance with the roles nn (namenode) and jt (jobtracker), and ten instances each with the roles dn (datanode) and tt (tasktracker)

• whirr.image-id
  – The ID of the image to use for instances. If not specified then a vanilla Linux image is chosen.
  – e.g., http://alestic.com/

• whirr.location-id
  – The location to launch instances in. If not specified then an arbitrary location will be chosen.
  – If you choose a different location, make sure whirr.image-id is updated too
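Because every instance is billed, it is useful to know how many EC2 instances a given whirr.instance-templates value will start. The sketch below adds up the leading number of each comma-separated group. count_instances is a hypothetical helper written for illustration, not something Whirr provides, and it assumes the "<count> <role>+<role>" layout described above.

```shell
# Hypothetical helper (not a Whirr command): total instances in a template string.
# Each comma-separated group looks like "<count> <role>+<role>...".
count_instances() {
  # Put each group on its own line, then sum the first field of every line.
  printf '%s\n' "$1" | tr ',' '\n' | awk '{ total += $1 } END { print total + 0 }'
}

count_instances '1 nn+jt,10 dn+tt'   # prints 11
```

For the template used in hadoop-ec2-mod.properties (1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker) the same function reports 3.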

Configuring

• whirr.hardware-id
  – http://aws.amazon.com/ec2/instance-types/

Configuring

• Price of On-Demand Instances

Configuring

• Generate a keypair

ssh-keygen -t rsa -P ''

Launch

• Run the following command to launch a cluster

bin/whirr launch-cluster --config hadoop-ec2-mod.properties

Run a MapReduce Job

• After the launch, a hadoop-site.xml file is created in the directory ~/.whirr/<cluster-name>

• You can use this to connect to the cluster by setting the HADOOP_CONF_DIR environment variable

• Run a proxy

export HADOOP_CONF_DIR=~/.whirr/hadoopcluster

. ~/.whirr/hadoopcluster/hadoop-proxy.sh
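If the launch failed or the cluster name is mistyped, the directory will not contain hadoop-site.xml and the hadoop client will not be pointed at the cluster. A minimal sketch of a guard around the export follows. use_whirr_conf is a hypothetical helper, and its optional second argument (a base directory other than ~/.whirr) is an assumption added only so the function can be tried without a real cluster.

```shell
# Hypothetical helper: export HADOOP_CONF_DIR only if Whirr wrote hadoop-site.xml.
# $1 = cluster name, $2 = base directory (optional, defaults to ~/.whirr).
use_whirr_conf() {
  conf_dir="${2:-$HOME/.whirr}/$1"
  if [ ! -f "$conf_dir/hadoop-site.xml" ]; then
    echo "no hadoop-site.xml in $conf_dir; has the cluster been launched?" >&2
    return 1
  fi
  export HADOOP_CONF_DIR="$conf_dir"
}
```

Calling use_whirr_conf hadoopcluster then has the same effect as the export line above, but returns a non-zero status instead of silently setting a directory that has no configuration in it.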

Run a MapReduce Job

• You should now be able to browse HDFS:

cd ..
cd hadoop-0.20.2/
bin/hadoop fs -ls /

Run a MapReduce Job

• You can submit a MapReduce job from your local machine

bin/hadoop fs -mkdir input
bin/hadoop fs -put LICENSE.txt input
bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output

Run a MapReduce Job

• View the result of the MapReduce job

bin/hadoop fs -cat output/part-* |tail
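The tail above shows only the last few lines of the output in key order. To see the most frequent words instead, the output can be sorted by count. This is a small sketch under the assumption that each output line is a "word<TAB>count" pair, which is how the wordcount example writes its results; top_words is a hypothetical helper name.

```shell
# Hypothetical helper: print the N most frequent words (default 10) from
# wordcount output read on stdin, one "word<TAB>count" pair per line.
top_words() {
  # Sort by the second tab-separated field, numerically, largest first;
  # ties are broken alphabetically by the word.
  sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "${1:-10}"
}
```

It reads from a pipe, so it can replace tail directly: bin/hadoop fs -cat output/part-* | top_words 5.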

Destroy a cluster

• When you've finished using a cluster, you can terminate the instances and clean up resources with the following command.

• All data will be deleted when you destroy the cluster.

bin/whirr destroy-cluster --config hadoop-ec2-mod.properties

Using Amazon EBS

• Transfer the data you want to reuse to an EBS volume

Using Amazon EBS

ssh -i /home/xeryeon/.ssh/id_rsa [email protected] ebs
sudo mkfs.ext4 /dev/sdf
sudo mount /dev/sdf ./ebs/
