2014 import.io data summit - including hadoop/impala getting started demo

Post on 06-Aug-2015

TRANSCRIPT

ianmas@amazon.com @IanMmmm

LARGE SCALE DATA ANALYSIS WITH AWS

Ian Massingham – Technical Evangelist

THE MORE DATA YOU COLLECT, THE MORE VALUE YOU CAN DERIVE FROM IT!

THE COST OF DATA GENERATION IS FALLING!

We are constantly producing more data

From all types of industries

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Lower cost, higher throughput

Highly constrained

+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

AWS Import/Export, AWS Direct Connect

•  Inbound data transfer is free
•  Multipart upload to S3
•  Physical media
•  AWS Direct Connect

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Amazon S3, Amazon Glacier,

Amazon DynamoDB, Amazon RDS,

Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Amazon EC2, Amazon Elastic MapReduce

AMAZON ELASTIC MAPREDUCE

HADOOP AS A SERVICE!

•  SPLITS DATA INTO PIECES
•  LETS PROCESSING OCCUR
•  GATHERS THE RESULTS
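The three steps on this slide can be illustrated locally without Hadoop. The following is a minimal sketch, not how EMR itself is implemented: it splits a CSV file into chunks, counts values of one column in each chunk independently, then sums the partial counts. The function name `count_by_column` and the sample rows are hypothetical and are not part of the demo data set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count rows per value of one CSV column, MapReduce-style, on a local file.
# Usage: count_by_column <csv_file> <column_number> [lines_per_chunk]
count_by_column() {
  local file=$1 col=$2 chunk_lines=${3:-2}
  local workdir chunk
  workdir=$(mktemp -d)

  # 1. Split the input into fixed-size pieces.
  split -l "$chunk_lines" "$file" "$workdir/chunk_"

  # 2. Process each piece independently: one partial count file per chunk.
  for chunk in "$workdir"/chunk_*; do
    awk -F',' -v c="$col" '{ n[$c]++ } END { for (k in n) print k "," n[k] }' \
      "$chunk" > "$chunk.partial"
  done

  # 3. Gather the partial results: sum the counts per key.
  cat "$workdir"/chunk_*.partial |
    awk -F',' '{ total[$1] += $2 } END { for (k in total) print k "," total[k] }' |
    sort

  rm -rf "$workdir"
}

# Hypothetical sample: flight number, status.
sample=$(mktemp)
printf '%s\n' 'BA123,LANDED' 'LH900,DELAYED' 'AF100,LANDED' \
  'BA456,LANDED' 'KL1001,DELAYED' > "$sample"

count_by_column "$sample" 2
```

On a real cluster the chunks would live on different nodes and step 2 would run in parallel; here the loop runs sequentially, which keeps the sketch short while preserving the split, process, gather shape.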

[Diagram labels: HDFS, EMR, Kinesis, S3, DynamoDB, Data management, Pig, Analytics languages/engines, RDS, Redshift, AWS Data Pipeline]

EMR + IMPALA DEMO

STARTING AN EMR CLUSTER WITH HADOOP ECOSYSTEM TOOLS PRE-INSTALLED

COPY & LOAD OUR DATASET

$ scp -i EMRKeyPair.pem ~/aws/hadoop/LHRarrivals*.csv hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com:
$ ssh -i EMRKeyPair.pem hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com
$ hadoop fs -mkdir /data/
$ hadoop fs -put <uploaded_files> /data/
$ hadoop fs -ls -h -R /data/

Or, at scale, use Distributed Copy (S3DistCp) to load from S3 in parallel:

$ . /home/hadoop/impala/conf/impala.conf
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://s3bucketname/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ --outputCodec 'none'

** Run on a cluster master node

CREATE EXTERNAL TABLE

$ # check the size of our data set
$ wc -l LHRarrivals*.csv
   850 LHRarrivals2.csv
  1526 LHRarrivals.csv
  2376 total
$ impala-shell
Welcome to the Impala shell.
> create EXTERNAL TABLE flights (
    input STRING,
    id BIGINT,
    widget STRING,
    source STRING,
    resultnum BIGINT,
    pageurl STRING,
    scheduled STRING,
    flightnumber STRING,
    airport STRING,
    status STRING,
    terminal STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/';
> select count(*) from flights;

This should return a count(*) of 2376, matching the size of the data set.
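The row count confirms that every line was loaded, but not that every line split into the expected columns. As far as I know, a table declared with FIELDS TERMINATED BY ',' splits on every comma, so a value containing a comma would shift the later columns. The sketch below is one possible pre-load check: it counts lines whose field count differs from the 11 columns declared for the flights table above. The function name `count_bad_rows` and the sample rows are hypothetical and are not taken from the LHRarrivals files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the number of lines in a CSV file whose comma-separated field count
# differs from the expected column count.
# Usage: count_bad_rows <csv_file> <expected_columns>
count_bad_rows() {
  awk -F',' -v want="$2" 'NF != want { bad++ } END { print bad + 0 }' "$1"
}

# Hypothetical sample data; the third row has a comma inside a quoted value.
sample=$(mktemp)
printf '%s\n' \
  'a,1,w,s,1,http://example.com/,10:05,BA123,JFK,LANDED,5' \
  'a,2,w,s,2,http://example.com/,10:10,LH900,FRA,EXPECTED 10:20,2' \
  'a,3,w,s,3,http://example.com/,10:15,AF100,"PARIS, CDG",LANDED,4' \
  > "$sample"

# The flights table declares 11 columns.
count_bad_rows "$sample" 11
rm -f "$sample"
```

A result of 0 on the real files would suggest the plain comma split is safe; anything else points to rows worth inspecting before querying.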

DEMO OF ODBC ACCESS

This part runs on Amazon WorkSpaces, using the Simba Cloudera Impala ODBC Driver.

Set up an SSH tunnel to the master node so that the WorkSpaces desktop can connect on port 25010 to the Impala ODBC port.

A previously configured system DSN lets us work with the data from our EMR/Impala cluster directly within Microsoft Excel.
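The slides do not show the tunnel command itself. The sketch below builds one plausible form of it, under these assumptions: an OpenSSH-style client is available on the desktop, the key file and host name are the ones used earlier in the demo, and port 25010 is used on both ends of the tunnel. The helper name `tunnel_command` is hypothetical; it only prints the command so it can be reviewed before being run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build the ssh command for a local port forward to the cluster master node.
# Usage: tunnel_command <key_file> <master_host> <port>
tunnel_command() {
  local key=$1 host=$2 port=$3
  # -N: do not run a remote command, only forward ports.
  # -L: listen on the local port and forward it to the same port on the
  #     master node (assumed here; the slides give only one port number).
  printf 'ssh -i %s -N -L %s:localhost:%s hadoop@%s\n' \
    "$key" "$port" "$port" "$host"
}

tunnel_command EMRKeyPair.pem \
  ec2-54-76-242-238.eu-west-1.compute.amazonaws.com 25010
```

With the tunnel open, the ODBC DSN on the desktop would point at localhost and the forwarded port rather than at the master node directly.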

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Amazon S3, Amazon DynamoDB,

Amazon RDS, Amazon Redshift,

Data on Amazon EC2

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

BATCH PROCESSING

GENERATE ➔ ➔ SHARE!

STREAM PROCESSING

AMAZON KINESIS

REAL-TIME DATA STREAM PROCESSING

Real-time response to content in semi-structured data streams

Relatively simple computations on data (aggregates, filters, sliding window, etc.)
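To make "aggregates, filters, sliding window" concrete, here is a minimal local sketch that uses standard input as a stand-in for a stream; it does not use Kinesis or any AWS API. It filters records by key and, after each matching record, prints the sum of the last few matching values. The function name `sliding_sum` and the `key,value` record format are assumptions made for this illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read "key,value" records from stdin, keep only records whose key matches
# (filter), and after each one print the sum of the last <window> matching
# values (sliding-window aggregate).
# Usage: sliding_sum <key> <window>
sliding_sum() {
  awk -F',' -v key="$1" -v w="$2" '
    $1 == key {
      n++
      slot = n % w            # circular buffer position for this record
      sum += $2 - buf[slot]   # add the new value, drop the one leaving the window
      buf[slot] = $2
      print sum
    }'
}

# Hypothetical stream: per-event error counts mixed with other records.
printf '%s\n' 'error,1' 'info,7' 'error,2' 'error,3' 'error,4' |
  sliding_sum error 3
```

The circular buffer keeps memory constant regardless of stream length, which is the property that makes this kind of computation suitable for continuous processing.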

Hourly server logs: how your systems went wrong an hour ago

Weekly/monthly bill: what you spent this past billing cycle

Daily customer report from your website: tells you what deal or ad to try next time

Daily fraud reports: tell you if there was fraud yesterday

Daily business reports: tell you how customers used AWS services yesterday

Real-time metrics: what just went wrong now

Real-time spending alerts/caps: guaranteeing you can’t overspend

Real-time analysis: what to offer the current customer now

Real-time detection: blocks fraudulent use now

Fast ETL into Amazon Redshift: how are customers using services now

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Amazon S3, Amazon DynamoDB,

Amazon RDS, Amazon Redshift,

Data on Amazon EC2

Amazon EC2, Amazon Elastic MapReduce

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2

AWS Import/Export, AWS Direct Connect

GENERATE ➔ ➔ SHARE!

STREAM PROCESSING

Amazon S3, Amazon DynamoDB,

Amazon RDS, Amazon Redshift,

Data on Amazon EC2

Amazon Kinesis Stream Processing on

Amazon EC2

WANT TO KNOW MORE?

aws.amazon.com/solutions/case-studies/big-data/
