2014 import.io data summit - including hadoop/impala getting started demo

ianmas@amazon.com @IanMmmm

LARGE SCALE DATA ANALYSIS WITH AWS

Ian Massingham, Technical Evangelist

THE MORE DATA YOU COLLECT THE MORE VALUE YOU CAN DERIVE FROM IT!

THE COST OF DATA GENERATION IS FALLING!

We are constantly producing more data

From all types of industries

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Lower cost, higher throughput

Highly constrained

+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS

AWS Import / Export, AWS Direct Connect

Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
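As a rough illustration of what "multipart upload to S3" involves, the sketch below plans how a large object could be cut into parts. It only computes byte ranges and does not call any AWS API. The function name `plan_parts` and the 8 MiB default part size are assumptions made for this example; the 5 MiB minimum part size (for every part except the last) and the 10,000-part maximum are the documented S3 limits.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts in one upload


def plan_parts(total_size, part_size=8 * MIB):
    """Return (part_number, offset, length) tuples covering total_size bytes.

    Hypothetical helper: it only plans the byte ranges that would be uploaded.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")
    # Double the part size until the object fits within the part-count limit.
    while -(-total_size // part_size) > MAX_PARTS:
        part_size *= 2
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    for part in plan_parts(20 * MIB):
        print(part)
```

Because each part covers an independent byte range, parts can be sent in parallel and a failed part can be retried on its own.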

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2

Amazon EC2, Amazon Elastic MapReduce

AMAZON ELASTIC MAPREDUCE

HADOOP AS A SERVICE!

•  SPLITS DATA INTO PIECES
•  LETS PROCESSING OCCUR
•  GATHERS THE RESULTS!
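The three steps above can be sketched in plain Python on a single machine. This is only an illustration of the split / process / gather pattern, not Hadoop or EMR code; the function names and the sample status values are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, n_pieces):
    """Step 1: split the input into n_pieces roughly equal pieces."""
    return [records[i::n_pieces] for i in range(n_pieces)]


def process_piece(piece):
    """Step 2: process one piece on its own (here: count each value)."""
    return Counter(piece)


def gather(partials):
    """Step 3: combine the partial results into one final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_values(records, n_pieces=3):
    pieces = split(records, n_pieces)
    # Each piece is handled by its own worker, standing in for a cluster node.
    with ThreadPoolExecutor(max_workers=n_pieces) as pool:
        partials = list(pool.map(process_piece, pieces))
    return gather(partials)


if __name__ == "__main__":
    # Hypothetical flight statuses, made up for this sketch.
    statuses = ["LANDED", "DELAYED", "LANDED", "EXPECTED", "LANDED", "DELAYED"]
    print(count_values(statuses))
```

On a real cluster the pieces live on different nodes and the processing moves to the data; the threads here only stand in for that.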

EMR Kinesis

S3 DynamoDB

Data management

Analytics languages/engines

Redshift AWS Data Pipeline

EMR + IMPALA DEMO

STARTING AN EMR CLUSTER WITH HADOOP ECOSYSTEM TOOLS PRE-INSTALLED

COPY & LOAD OUR DATASET

$ scp -i EMRKeyPair.pem ~/aws/hadoop/LHRarrivals*.csv hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com:
$ ssh -i EMRKeyPair.pem hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com
$ hadoop fs -mkdir /data/
$ hadoop fs -put <uploaded_files> /data/
$ hadoop fs -ls -h -R /data/

or at scale, Distributed Copy using S3DistCp to parallel load from S3

$ . /home/hadoop/impala/conf/impala.conf
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://s3bucketname/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ --outputCodec 'none'

** Run on a cluster master node

CREATE EXTERNAL TABLE

$ # check the size of our data set
$ wc -l LHRarrivals*.csv
   850 LHRarrivals2.csv
  1526 LHRarrivals.csv
  2376 total
$ impala-shell
Welcome to the Impala shell.
> create EXTERNAL TABLE flights (
    input STRING,
    id BIGINT,
    widget STRING,
    source STRING,
    resultnum BIGINT,
    pageurl STRING,
    scheduled STRING,
    flightnumber STRING,
    airport STRING,
    status STRING,
    terminal STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/';
> select count(*) from flights;

Should return count(*) = 2376, reflecting the size of the data set
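To make the table definition more concrete, here is a rough sketch of how one comma-delimited line lines up with the eleven columns declared for the flights table. It is plain Python written for illustration and is not how Impala itself reads the files. The sample line and the `parse_line` helper are invented; only the column names and types come from the CREATE statement.

```python
# Column names and types as declared for the flights table
# (STRING -> str, BIGINT -> int).
COLUMNS = [
    ("input", str), ("id", int), ("widget", str), ("source", str),
    ("resultnum", int), ("pageurl", str), ("scheduled", str),
    ("flightnumber", str), ("airport", str), ("status", str),
    ("terminal", str),
]

# Hypothetical line, made up for this sketch; the real CSV content may differ.
SAMPLE_LINE = (
    "lhr-arrivals,1,widget-1,source-1,7,"
    "http://example.com/arrivals,10:05,BA123,Paris,LANDED,5"
)


def parse_line(line):
    """Split one line on commas and convert each field to its column type."""
    fields = line.rstrip("\r\n").split(",")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return {name: cast(value) for (name, cast), value in zip(COLUMNS, fields)}


if __name__ == "__main__":
    row = parse_line(SAMPLE_LINE)
    print(row["flightnumber"], row["status"], row["terminal"])
```

A field that itself contains a comma would break this simple split, so it is worth checking the source files for that before loading.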

DEMO OF ODBC ACCESS

Doing this part on Amazon WorkSpaces using the Simba Cloudera Impala ODBC Driver.

Set up an SSH tunnel to the master node so that the WorkSpaces desktop can connect to the Impala ODBC port (21050).

A previously configured system DSN allows us to work with the data from our EMR/Impala cluster directly within Microsoft Excel.

Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2

BATCH PROCESSING

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

STREAM PROCESSING

AMAZON KINESIS

REAL-TIME DATA STREAM PROCESSING!

Real-time response to content in semi-structured data streams

Relatively simple computations on data (aggregates, filters, sliding window, etc.)
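The kind of computation listed above (a filter, then an aggregate over a sliding window) can be sketched in a few lines. This is plain Python over an in-memory list and uses no Kinesis API; the class and function names, the event format and the threshold are all assumptions made for the example.

```python
from collections import deque


class SlidingWindowCounter:
    """Count the events seen in the last window_seconds seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp):
        """Record one event and return how many events are in the window.

        Timestamps are assumed to arrive in non-decreasing order.
        """
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


def error_alerts(events, window_seconds, threshold):
    """Yield the timestamps at which errors in the window reach threshold."""
    counter = SlidingWindowCounter(window_seconds)
    for timestamp, level in events:
        if level != "ERROR":  # filter: ignore everything except errors
            continue
        if counter.add(timestamp) >= threshold:  # aggregate over the window
            yield timestamp


if __name__ == "__main__":
    # Hypothetical (timestamp_seconds, level) events, made up for this sketch.
    events = [(0, "ERROR"), (1, "INFO"), (2, "ERROR"), (3, "ERROR"), (20, "ERROR")]
    print(list(error_alerts(events, window_seconds=10, threshold=3)))
```

Each event is handled once as it arrives and only the current window is kept in memory, which is what lets this style of processing respond while the data is still flowing.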

Hourly server logs: how your systems went wrong an hour ago

Weekly / monthly bill: what you spent this past billing cycle

Daily customer report from your website: tells you what deal or ad to try next time

Daily fraud reports: tell you if there was fraud yesterday

Daily business reports: tell me how customers used AWS services yesterday

Real-time metrics: what just went wrong now

Real-time spending alerts/caps: guaranteeing you can’t overspend

Real-time analysis: what to offer the current customer now

Real-time detection: blocks fraudulent use now

Fast ETL into Amazon Redshift: how are customers using services now

Data on Amazon EC2

Amazon EC2, Amazon Elastic MapReduce

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2

AWS Import / Export, AWS Direct Connect

STREAM PROCESSING

Data on Amazon EC2

Amazon Kinesis Stream Processing on Amazon EC2

WANT TO KNOW MORE?

aws.amazon.com/solutions/case-studies/big-data/

ianmas@amazon.com @IanMmmm
