sql on hadoop for enterprise analytics

59
A LITTLE BIT OF HISTORY Everything old is new again. SQL Forever.

Upload: dataversity

Post on 19-Feb-2017

738 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: SQL on Hadoop for Enterprise Analytics

A LITTLE BIT OF HISTORY

Everything old is new again. SQL Forever.

Page 2: SQL on Hadoop for Enterprise Analytics

The story so far

Why hasn’t SQL died yet? It’s 2016 and we’re still using it?!

Page 3: SQL on Hadoop for Enterprise Analytics

Everything old is new again

Existing architecture keeps reappearing

It takes time to figure out what tools are right for what jobs

SQL is still the best tool for business analytics

Page 4: SQL on Hadoop for Enterprise Analytics

A long long time ago…

Page 5: SQL on Hadoop for Enterprise Analytics

Growing pains

Late 1990

Page 6: SQL on Hadoop for Enterprise Analytics

Database problems

Database outage

Data integrity issues

Data latency

Late 1990

Page 7: SQL on Hadoop for Enterprise Analytics

Master Slave

Late 1990

Page 8: SQL on Hadoop for Enterprise Analytics

Transactions

Late 1990

Page 9: SQL on Hadoop for Enterprise Analytics

Performance

Late 1990

Page 10: SQL on Hadoop for Enterprise Analytics

By the time I graduated, SQL was on its last legs

2009

Page 11: SQL on Hadoop for Enterprise Analytics

Cache all the things!

2009

Page 12: SQL on Hadoop for Enterprise Analytics

Stop copying Twitter!

2009

Page 13: SQL on Hadoop for Enterprise Analytics

SQL golden age ends, NoSQL takes off

2010

Column Graph

Key-Value Document

Page 14: SQL on Hadoop for Enterprise Analytics

NoSQL

2010

Page 15: SQL on Hadoop for Enterprise Analytics

Awesome things about NoSQL

No SQL, normal languages as APIs!

Non relational!

FAST!

2010

Page 16: SQL on Hadoop for Enterprise Analytics

Remember ORMs?

~2000

Page 17: SQL on Hadoop for Enterprise Analytics

Active Record

~2000

Page 18: SQL on Hadoop for Enterprise Analytics

ORMs 👎

2011

Page 19: SQL on Hadoop for Enterprise Analytics

Remember EAV(Entity Attribute Value)?

1968

Page 20: SQL on Hadoop for Enterprise Analytics

Kind of looks like columns…

1968

Page 21: SQL on Hadoop for Enterprise Analytics

Modern EAV

2010

Page 22: SQL on Hadoop for Enterprise Analytics

Tedious to query

2010

Page 23: SQL on Hadoop for Enterprise Analytics

Voila!

2010

Page 24: SQL on Hadoop for Enterprise Analytics

No joins is a feature!

2010

Page 25: SQL on Hadoop for Enterprise Analytics

NoSQL has some rough bumps

2010

Page 26: SQL on Hadoop for Enterprise Analytics

NoSQL has A LOT of rough bumps…

2011

Page 27: SQL on Hadoop for Enterprise Analytics

Throwback Thursday!

2011

Page 28: SQL on Hadoop for Enterprise Analytics

Lock the doors

2011

Page 29: SQL on Hadoop for Enterprise Analytics

MPP columnar DBs! Wait... SQL is back?!

2015

Page 30: SQL on Hadoop for Enterprise Analytics

Hadoop on SQL

2016

Page 31: SQL on Hadoop for Enterprise Analytics

A long long time ago…

Page 32: SQL on Hadoop for Enterprise Analytics

What’s next?

~2020?

Page 33: SQL on Hadoop for Enterprise Analytics

What’s next?

~2020?

“If you have an architecture where you’re trying to periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.” – Todd Lipcon

Page 34: SQL on Hadoop for Enterprise Analytics

SQL is far past hype

Page 35: SQL on Hadoop for Enterprise Analytics

Fin

“If it ain’t broke, don’t fix it”

Page 36: SQL on Hadoop for Enterprise Analytics

CUSTOMER STORYBuilding a event analytics pipeline

using Hadoop and Spark

Page 37: SQL on Hadoop for Enterprise Analytics

Why Consider a Big Data Pipeline?

37

You are rapidly exceeding the limits of your existing database

Everything on your website can be

analyzed.

Waiting until the next day isn’t for

you

Data comes and goes to many places, and you want

one process for it

Page 38: SQL on Hadoop for Enterprise Analytics

Big DATA CULTURE

38

Summary data is not good enough

Company is mandating new technologies

You want to build a data driven culture

Big SQL is the heart of a data-driven culture

Page 39: SQL on Hadoop for Enterprise Analytics

CASE STUDY

39

A major healthcare provider wants to create a web event pipeline that:

During periods of healthcare registration and new coverage

start and can dial back the rest of the year

Massive Scaling Large data volumes

10-15M customers worth of data. Provides data for

analysis in under 1 minute.

AND Utilizes existing in house technologies (such as Cloudera Impala)

Page loadsRegistrations

LoginsErrors

All events processed

Page 40: SQL on Hadoop for Enterprise Analytics

Solution: Build an event processing framework

5

Events

Event Collector

Hadoop?

Page 41: SQL on Hadoop for Enterprise Analytics

High Level Process

6

Events

Event Collector

Message ProcessingHDFS

Looker

To be designed

Page 42: SQL on Hadoop for Enterprise Analytics

Why is Hadoop so hard?

7

Need to write in Java and Scala

We don’t have structure

Not easy to get data out into BI tools

Event Collectors don’t tend to feed to HDFS

out of the box

Typically follow a batch processing

framework

Page 43: SQL on Hadoop for Enterprise Analytics

Ingestion mechanism

8

Low-Latency In flight transformation and

processing

Ability to populate multiple destinations

Our ideal ingestion would have three key aspects

Page 44: SQL on Hadoop for Enterprise Analytics

Spark vs Storm

9

VS

• Own Master Server • Run on HDFS• Micro batching • Exact once delivery

(eliminates vulnerability)

• Not native to Hadoop• Less Developed• One at a time• ETL in flight• Sub second latency

Two of the major players in data streaming / processing

Page 45: SQL on Hadoop for Enterprise Analytics

Flume

45

Source Interceptor Selector Channel Sinks

Managed by the Flume Agent

Web Server

Web Server

Web Server

Web Server Investor Channel

HDFSNo in flight transformation, so this just needs to meet workload

Page 46: SQL on Hadoop for Enterprise Analytics

KAFKA

46

Broker

Broker

Broker

Producer Broker Consumer

Producer

Producer

Spark Streaming

Other

ZooKeeper

Broker

Page 47: SQL on Hadoop for Enterprise Analytics

Flume vs. Kafka

12

Use Both: Out-of-the box with Flafka and native connectors

Flume

Kafka

Source

SparkCustom

connector

Customconnector

Flume KafkaSource Spark

Page 48: SQL on Hadoop for Enterprise Analytics

Storing the output

48

Data can be queried via Hive, Impala, or Spark SQL

Cloudera is our Enterprise choice

We can process a subset in-stream with Mlib or other machine learning

algorithms

Output summaries to other RDBMS

systems

Our streaming Spark cluster consumes messages from Kafka. We batch these every minute into a HDFS cluster. We chose this because

Page 49: SQL on Hadoop for Enterprise Analytics

Final Result

14

Events

Event Collector Kafka

Flume Spark SQL Cloudera

Other storage (RDBMS)

Other storage (logs)

Page 50: SQL on Hadoop for Enterprise Analytics

Pipeline Summary

15

Add data to any point of the pipeline

Kafka, Flume, Impala, Looker without many

custom connectors

Pipeline includes additional sources like teradata, oracle

Add in-flight predictive model training and execution without significant additional processing time

Our pipeline provides several points for flexibility as well as meets our key priorities.

Page 51: SQL on Hadoop for Enterprise Analytics

Priority # 1: Scale

Kafka is easy to scale, As more volume comes in, adding new brokers can be automated using the Partition Reassignment Tool

By monitoring batch times in Looker on Spark SQL, we can alert when we need to scale up the cluster using Scheduled Looks

16

Page 52: SQL on Hadoop for Enterprise Analytics

Priority #2: Flexibility

17

Different events can be parsed out to different Spark streaming applications with Kafka topics (Or another type of consumer)

Add more data at any point (flume, kafka producer, or directly to spark)

Looker connects to wherever the data lands, as long as we can query it. Perform analysis IN CLUSTER

Page 53: SQL on Hadoop for Enterprise Analytics

Priority #3 Speed Analyzing the stream

53

Events per hour

Identify missing batches

Volume and Timing

Right sizing hardware

Duplicate events

And missing information

Page 54: SQL on Hadoop for Enterprise Analytics

Priority #4: In house Technologies

19

Provide access to Hadoop/Impala via a centralized data hub:A single place to access web based reports, explores, BI tools and code libraries

Enable users to ask questions and query web data without writing SQL or knowing about the pipeline

Page 55: SQL on Hadoop for Enterprise Analytics

Analyzing the stream

55

Looking for Lost data

=/=

Page 56: SQL on Hadoop for Enterprise Analytics

Analyzing the stream

21

By connecting Looker to various points in the stream we can verify complete loads:

We also mask the location of information, one dashboard may show a variety of reliable sources.

• Impala SQL• Source Logs• Summary Reports

Page 57: SQL on Hadoop for Enterprise Analytics

Other uses and benefits

57

Match data in flight to find bad user

accounts

In flight alerts for missing

data

Analysis without needing to know

the location in the stream

SQL on Hadoop BI solution doesn’t

require new skillset

Page 58: SQL on Hadoop for Enterprise Analytics

THANK YOU!