Big Data on AWS Cloud - 1
![Page 1: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/1.jpg)
BUSI758B
Big Data Analytics On
Amazon Web Services
![Page 2: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/2.jpg)
Yelp was able to save $55,000 in upfront hardware costs.
Unilever processes genetic sequences 20 times faster.
Swipely generates insights from millions of credit card transactions.
Expedia processes clickstream data from its global network of websites.
![Page 3: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/3.jpg)
The Big Question Is: How?
![Page 4: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/4.jpg)
The Answer Is:
![Page 5: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/5.jpg)
Some Background on Cloud Computing and AWS
![Page 6: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/6.jpg)
What is Cloud Computing?
- Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. (NIST definition)
- This cloud model is composed of five essential characteristics, three service models, and four deployment models.
![Page 7: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/7.jpg)
Essential Characteristics:
- On-demand self-service
- Broad network access
- Resource pooling
- Rapid elasticity
- Measured service
![Page 8: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/8.jpg)
Service Models:
IaaS providers: AWS, HP Cloud, Rackspace
PaaS providers: Google App Engine, Heroku, Red Hat OpenShift
SaaS providers: Salesforce, LinkedIn, Taleo
![Page 9: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/9.jpg)
Deployment Models:
- Public cloud
- Private cloud
- Hybrid cloud
- Community cloud*
* NIST defines a community cloud as cloud infrastructure provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations).
![Page 10: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/10.jpg)
Now, a Few Questions
1. Which service model does AWS fall into?
2. What are the advantages of using a cloud platform for Big Data?
3. How does AWS leverage those advantages to provide Big Data analytics?
![Page 11: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/11.jpg)
Advantages of a Cloud Platform
- Ability to scale the infrastructure
- OPEX instead of CAPEX
- Custom solutions as needed
- Easier and faster deployment
- Helps focus on core business solutions/analytics
So it can safely be said that the cloud platform acts as an enabler of Big Data technology.
![Page 12: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/12.jpg)
AWS Big Data Analytics:
![Page 13: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/13.jpg)
Elastic MapReduce (EMR)
![Page 14: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/14.jpg)
Elastic MapReduce (EMR)
![Page 15: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/15.jpg)
Hadoop as a Service
- Amazon Elastic MapReduce supports the Hadoop software ecosystem (Hadoop 1.x, Hadoop 2.x).
- The Amazon EMR control software is responsible for the automated arrangement, coordination, and management of the Hadoop cluster.
- Amazon Elastic MapReduce also supports MapR, an Apache Hadoop-derived distribution.
![Page 16: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/16.jpg)
Integrated With Tools
Amazon EMR gives you root access to the cluster.
Any additional software you need can be installed and configured on the cluster before Hadoop starts by creating a bootstrap action.
* Spark is installed using a bootstrap action.
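A bootstrap action can also be described in code rather than through the console. The sketch below only builds the dictionary for one bootstrap action; it assumes the request shape used by the `BootstrapActions` parameter of boto3's EMR `run_job_flow` call. The bucket name and script path are hypothetical placeholders, not values from the slides.

```python
def bootstrap_action(name, script_path, args=None):
    """Build one entry for the BootstrapActions list of an EMR cluster request."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,          # script that runs before Hadoop starts
            "Args": list(args or []),     # optional arguments passed to the script
        },
    }


# Hypothetical script location; replace it with a script you have uploaded.
install_spark = bootstrap_action(
    "Install Spark",
    "s3://my-example-bucket/bootstrap/install-spark.sh",
)
print(install_spark)
```

The resulting dictionary would be passed inside a list, since a cluster can carry several bootstrap actions.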
![Page 17: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/17.jpg)
MapReduce Engine
- Job/Task
- Roles of servers:
  - a) Master node
  - b) Core node
  - c) Task node
- Step: a unit of work
The MapReduce engine implements Hadoop's distributed processing framework.
![Page 18: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/18.jpg)
MapReduce Engine (cont.)

| Hadoop | AWS |
| --- | --- |
| NameNode | Master Node |
| DataNode | Core Node |

Additional concepts: task nodes and steps
Task node: Task nodes are optional. You can add task nodes when you start the cluster, or you can add task groups to a running cluster. Because they do not store data and can be added to and removed from a cluster, you can use task nodes to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later.
Step: A step contains one or more Hadoop jobs. It is an instruction to manipulate data using Hadoop jobs.
The maximum number of pending and active steps allowed in a cluster is 256.
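As an illustration of steps and the 256-step limit, here is a minimal sketch that builds a step definition and checks whether new steps would fit under the limit. The step dictionary assumes the shape accepted by boto3's EMR `add_job_flow_steps` call; the helper names, JAR path, and bucket are hypothetical.

```python
MAX_PENDING_AND_ACTIVE_STEPS = 256  # limit quoted on the slide


def hadoop_step(name, jar, args, action_on_failure="CONTINUE"):
    """Describe one step that runs a Hadoop JAR with the given arguments."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }


def can_submit(new_steps, pending_or_active_count):
    """Return True if adding new_steps keeps the cluster within the step limit."""
    return pending_or_active_count + len(new_steps) <= MAX_PENDING_AND_ACTIVE_STEPS


# Hypothetical JAR and data locations, for illustration only.
steps = [
    hadoop_step(
        "Word count",
        "s3://my-example-bucket/jars/wordcount.jar",
        ["s3://my-example-bucket/input/", "s3://my-example-bucket/output/"],
    )
]
print(can_submit(steps, pending_or_active_count=255))  # True
print(can_submit(steps, pending_or_active_count=256))  # False
```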
![Page 19: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/19.jpg)
Massively Parallel
- Virtual instances: much easier to scale.
- Quick and cost-effective scaling.
- Dynamic resizing while the job is running.
- A distributed Hadoop system in the true sense.
- Multiple clusters can access the same data.
![Page 20: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/20.jpg)
Cost-Effective AWS Wrapper
- Spot Instances
- Pay as you go.
- Automatic cluster termination after job completion.
- Licensed software bundled with the infrastructure.
- Economy of scale
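Two of the points above, Spot Instances and automatic termination, map onto cluster configuration. The following sketch builds an instance configuration with on-demand master and core nodes, optional Spot task nodes, and termination once all steps finish. It assumes the `Instances` structure of boto3's EMR `run_job_flow` request; the instance type is an arbitrary placeholder and the function name is hypothetical.

```python
def instances_config(core_count, task_count, instance_type="m3.xlarge"):
    """Build an Instances block: on-demand master/core, optional Spot task nodes."""
    groups = [
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so losing a Spot instance costs only compute.
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count}
        )
    return {
        "InstanceGroups": groups,
        # False means the cluster shuts down when no steps remain.
        "KeepJobFlowAliveWhenNoSteps": False,
    }


config = instances_config(core_count=2, task_count=4)
print([g["InstanceRole"] for g in config["InstanceGroups"]])
```

Placing only the task group on the Spot market is a design choice in this sketch, not something the slides prescribe.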
![Page 21: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/21.jpg)
Integrated to AWS Services
- Amazon EMR is integrated with other Amazon Web Services such as Amazon EC2, Amazon S3, DynamoDB, Amazon RDS, CloudWatch, and AWS Data Pipeline.
- You can easily access data stored in AWS from an EMR cluster and use the functionality offered by other Amazon Web Services to manage your cluster and store its output.
Compute: EC2
Networking: VPC, ELB, Route 53
Storage: EBS, S3, Glacier
Data Services: RDS, DynamoDB, Redshift
Deployment and Management: AWS Management Console, AWS Command Line Interface, AWS IAM, CloudWatch
![Page 22: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/22.jpg)
Life Cycle of an EMR Cluster
![Page 23: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/23.jpg)
How to Launch and Connect to an EMR Cluster: Quick Demo
![Page 24: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/24.jpg)
![Page 25: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/25.jpg)
Click on Create Cluster
![Page 26: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/26.jpg)
- Provide a cluster name for easier identification.
- Set Termination Protection to 'Yes' to prevent accidental termination of the cluster.
- Enable Logging so that cluster activity is logged automatically.
- Provide an S3 folder location for the logs.
- Enable Debugging so that cluster activity can be troubleshot.
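The console settings above have counterparts in the API. As a rough sketch, the helper below collects the cluster name, the S3 log location, and termination protection into keyword arguments shaped like boto3's EMR `run_job_flow` request. The helper name, cluster name, and bucket are hypothetical, and the Debugging option is left out because the slides do not say how it is configured outside the console.

```python
def cluster_settings(name, log_uri, termination_protected=True):
    """Collect the basic cluster options chosen on the Create Cluster page."""
    if not log_uri.startswith("s3://"):
        raise ValueError("log_uri should point at an S3 folder")
    return {
        "Name": name,
        "LogUri": log_uri,
        "Instances": {"TerminationProtected": termination_protected},
    }


# Hypothetical name and bucket, for illustration only.
settings = cluster_settings("demo-cluster", "s3://my-example-bucket/emr-logs/")
print(settings)
```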
![Page 27: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/27.jpg)
- Tags are optional, but using them is always encouraged.
- A tag is a key/value pair that gets associated with every resource in the cluster.
- Tags help in monitoring and managing cluster resources easily.
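Since a tag is just a key/value pair, tags are convenient to keep as a plain dictionary and convert when needed. This small sketch turns such a dictionary into a list of `Key`/`Value` entries, the form assumed here for the `Tags` parameter of boto3's EMR `run_job_flow` call. The tag names and values are made up for illustration.

```python
def to_emr_tags(tags):
    """Convert {"key": "value"} pairs into a list of Key/Value tag entries."""
    return [{"Key": key, "Value": value} for key, value in sorted(tags.items())]


# Hypothetical tags, for illustration only.
print(to_emr_tags({"team": "analytics", "course": "BUSI758B"}))
```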
![Page 28: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/28.jpg)
![Page 29: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/29.jpg)
![Page 30: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/30.jpg)
![Page 31: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/31.jpg)
![Page 32: BigData- On - AWS Cloud -1](https://reader031.vdocuments.us/reader031/viewer/2022020116/55af7ac41a28ab2e568b46c4/html5/thumbnails/32.jpg)