SQL on Hadoop for Enterprise Analytics


A LITTLE BIT OF HISTORY

Everything old is new again. SQL Forever.

The story so far

Why hasn’t SQL died yet? It’s 2016 and we’re still using it?!

Everything old is new again

Existing architecture keeps reappearing

It takes time to figure out what tools are right for what jobs

SQL is still the best tool for business analytics

A long long time ago…

Growing pains

Late 1990s: database problems (database outages, data integrity issues, data latency)

Late 1990s: master/slave replication

Late 1990s: transactions

Late 1990s: performance

2009: By the time I graduated, SQL was on its last legs

2009: Cache all the things!

2009: Stop copying Twitter!

2010: SQL's golden age ends, NoSQL takes off

2010: NoSQL arrives: column, graph, key-value, and document stores

2010: Awesome things about NoSQL: no SQL (normal languages as APIs!), non-relational, and FAST!

~2000: Remember ORMs?

~2000: Active Record

2011: ORMs 👎

1968: Remember EAV (Entity Attribute Value)?

1968: Kind of looks like columns…

2010: Modern EAV

2010: Tedious to query

2010: Voila!

2010: No joins is a feature!

2010: NoSQL has some rough bumps

2011: NoSQL has A LOT of rough bumps…

2011: Throwback Thursday!

2011: Lock the doors

2015: MPP columnar DBs! Wait... SQL is back?!

2016: SQL on Hadoop

A long long time ago…

~2020: What's next?

“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.” – Todd Lipcon

SQL is far past hype

Fin

“If it ain’t broke, don’t fix it”

CUSTOMER STORY: Building an event analytics pipeline using Hadoop and Spark

Why Consider a Big Data Pipeline?

You are rapidly exceeding the limits of your existing database

Everything on your website can be analyzed

Waiting until the next day isn't for you

Data comes and goes to many places, and you want one process for it

BIG DATA CULTURE

Summary data is not good enough

Company is mandating new technologies

You want to build a data-driven culture

Big SQL is the heart of a data-driven culture

CASE STUDY

A major healthcare provider wants to create a web event pipeline that:

Scales massively for large data volumes during periods of healthcare registration and new coverage starts, and can dial back the rest of the year

Handles 10-15M customers' worth of data and provides data for analysis in under 1 minute

AND utilizes existing in-house technologies (such as Cloudera Impala)

All events processed: page loads, registrations, logins, errors

Solution: Build an event processing framework

Events → Event Collector → Hadoop?

High Level Process

Events → Event Collector → Message Processing (to be designed) → HDFS → Looker

Why is Hadoop so hard?

Need to write in Java and Scala

We don't have structure

Not easy to get data out into BI tools

Event collectors don't tend to feed HDFS out of the box

Typically follows a batch processing framework

Ingestion mechanism

Our ideal ingestion mechanism would have three key aspects:

Low latency

In-flight transformation and processing

Ability to populate multiple destinations

Spark vs. Storm

Two of the major players in data streaming / processing

Spark: own master server, runs on HDFS, micro-batching, exactly-once delivery (eliminates vulnerability)

Storm: not native to Hadoop, less developed, one-at-a-time processing, ETL in flight, sub-second latency

Flume

Source → Interceptor → Selector → Channel → Sinks, all managed by the Flume agent

Web servers → Source → Interceptor → Channel → HDFS

No in-flight transformation, so this just needs to meet the workload

KAFKA

Producers → Brokers → Consumers (Spark Streaming, other), coordinated by ZooKeeper
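To make the producer → broker → consumer flow concrete, here is a minimal Scala sketch of a producer publishing one web event; the broker addresses, the topic name web-events, and the JSON payload are illustrative assumptions rather than details from the deck.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Hypothetical broker list; point this at your own Kafka cluster.
    props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // One web event serialized as JSON; the schema is made up for illustration.
    val event = """{"type":"page_load","userId":"12345","ts":"2016-06-01T12:00:00Z"}"""
    producer.send(new ProducerRecord[String, String]("web-events", event))

    producer.close()
  }
}
```

In the pipeline described here, the event collector and Flume play this role rather than hand-written producer code.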

Flume vs. Kafka

Use both: out of the box with Flafka and native connectors

Without Flafka, each hop (Source → Flume → Kafka → Spark) needs a custom connector

With Flafka, Flume's Kafka source and sink connect Flume, Kafka, and Spark natively

Storing the output

Our streaming Spark cluster consumes messages from Kafka and batches them every minute into an HDFS cluster (sketched below). We chose this because:

Data can be queried via Hive, Impala, or Spark SQL

Cloudera is our enterprise choice

We can process a subset in-stream with MLlib or other machine learning algorithms

Output summaries to other RDBMS systems
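A minimal sketch of that streaming job, assuming the spark-streaming-kafka-0-10 integration and a one-minute batch interval; the broker list, topic name, consumer group, and HDFS path are placeholders, not values from the case study.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object EventStreamToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("event-stream-to-hdfs")
    // One-minute batch interval: each micro-batch lands in HDFS roughly once a minute.
    val ssc = new StreamingContext(conf, Seconds(60))

    // Hypothetical broker list, consumer group, and offset policy.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092,broker2:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "event-pipeline",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("web-events"), kafkaParams))

    // Write each batch of raw event payloads to a time-stamped HDFS directory,
    // where Hive, Impala, or Spark SQL can pick them up.
    stream.map(_.value).saveAsTextFiles("hdfs:///data/web_events/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```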

Final Result

Events → Event Collector → Flume → Kafka → Spark (Spark SQL) → Cloudera

Other storage (RDBMS)

Other storage (logs)

Pipeline Summary

Our pipeline provides several points of flexibility as well as meeting our key priorities:

Add data at any point of the pipeline

Kafka, Flume, Impala, Looker without many custom connectors

Pipeline can include additional sources like Teradata and Oracle

Add in-flight predictive model training and execution without significant additional processing time (see the sketch after this list)
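For the in-flight predictive model item above, a sketch along these lines would score each micro-batch with a pre-trained MLlib model as it passes through the stream; the model path, the event format, and the toy featurizer are assumptions for illustration only.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.dstream.DStream

object InFlightScoring {
  // Attach a score to every event in the stream using a model trained offline.
  def score(sc: SparkContext, events: DStream[String]): DStream[(String, Double)] = {
    // Hypothetical HDFS path to a previously saved MLlib model.
    val model = LogisticRegressionModel.load(sc, "hdfs:///models/event_classifier")
    events.map { raw =>
      (raw, model.predict(Vectors.dense(extractFeatures(raw))))
    }
  }

  // Toy featurizer: a real pipeline would parse the event JSON into meaningful numeric features.
  private def extractFeatures(raw: String): Array[Double] =
    Array(raw.length.toDouble)
}
```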

Priority #1: Scale

Kafka is easy to scale. As more volume comes in, adding new brokers can be automated using the Partition Reassignment Tool.

By monitoring batch times in Looker on Spark SQL, we can alert when we need to scale up the cluster using Scheduled Looks.

Priority #2: Flexibility

Different events can be parsed out to different Spark Streaming applications with Kafka topics (or another type of consumer); a sketch follows this list

Add more data at any point (Flume, a Kafka producer, or directly to Spark)

Looker connects to wherever the data lands, as long as we can query it. Perform analysis IN CLUSTER
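One way the topic-per-event-type split could look on the producer side, as a Scala sketch; the topic naming scheme is an assumption, and in practice the routing might live in the event collector or in Flume instead.

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventRouter {
  // Route each event to a per-type Kafka topic so that separate Spark Streaming
  // applications (or other consumers) can subscribe to only the events they need.
  def routeEvent(producer: KafkaProducer[String, String], eventType: String, payload: String): Unit = {
    val topic = eventType match {
      case "error"        => "events.errors"
      case "registration" => "events.registrations"
      case other          => s"events.$other"
    }
    producer.send(new ProducerRecord[String, String](topic, payload))
  }
}
```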

Priority #3: Speed, analyzing the stream (a query sketch follows this list)

Events per hour

Identify missing batches

Volume and timing

Right-sizing hardware

Duplicate events and missing information
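A sketch of how those checks could be expressed with Spark SQL over the events landed in HDFS, assuming each event is a JSON record with event_id and ts (ISO-8601) fields; the path and column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object StreamHealthChecks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stream-health-checks").getOrCreate()

    // Hypothetical landing directory for the one-minute batches written by the streaming job.
    spark.read.json("hdfs:///data/web_events/*").createOrReplaceTempView("events")

    // Events per hour: dips or gaps point at missing batches or under-sized hardware.
    val perHour = spark.sql(
      "SELECT substr(ts, 1, 13) AS hour, count(*) AS events " +
      "FROM events GROUP BY substr(ts, 1, 13) ORDER BY hour")

    // Duplicate events: the same event_id landing more than once.
    val dupes = spark.sql(
      "SELECT event_id, count(*) AS copies FROM events GROUP BY event_id HAVING count(*) > 1")

    perHour.show()
    dupes.show()
  }
}
```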

Priority #4: In-house technologies

Provide access to Hadoop/Impala via a centralized data hub: a single place to access web-based reports, explores, BI tools, and code libraries

Enable users to ask questions and query web data without writing SQL or knowing about the pipeline

Analyzing the stream

Looking for lost data

Analyzing the stream

By connecting Looker to various points in the stream we can verify complete loads:

• Impala SQL
• Source logs
• Summary reports

We also mask the location of information; one dashboard may show a variety of reliable sources.

Other uses and benefits

Match data in flight to find bad user accounts

In-flight alerts for missing data

Analysis without needing to know the location in the stream

The SQL on Hadoop BI solution doesn't require a new skillset

THANK YOU!
