strata+hadoop 2015 nyc end user panel on real-time data analytics

46
End User Panel on Real-Time Data Analytics Building Predictive Applications with Real-Time Data Pipelines and Streamliner Eric Frenkiel, CEO and Co-Founder, MemSQL

Upload: memsql

Post on 15-Apr-2017

787 views

Category:

Data & Analytics


0 download

TRANSCRIPT

End User Panel on Real-Time Data Analytics

Building Predictive Applications with Real-Time Data Pipelines and Streamliner

Eric Frenkiel, CEO and Co-Founder, MemSQL

Going Real-Time is the Next Phase for Big Data

MoreDevices

More Interconnectivity

MoreUser Demand

…and companies are at risk of being left behind

MemSQL Architecture

S t r e a m i n g Da t a W a r e h o u s e

Streaming

Integrated streamingwith Streamliner

Database

High volume transactions for structured and unstructured data

Data Warehouse

Fast, scalableSQL for immediate

analytics

Applications and Technology Trends

Real-Time Analytics Risk-Management Personalization

Portfolio Tracking Monitoring and Detection

Internet of Things | Real-Time Data Pipelines | Operationalizing Apache Spark

Put Apache Spark in the fast lane.Persist. Perform. Perfect.

Changing the Way the World Invests

Noah Zucker, Vice President – Tactical Engineering, Novus Partners

Scalable Portfolio Intelligence with MemSQL

100+ Investment Managers, $2 Trillion AUM Research Platform: 10,000+ Institutions Founded 2007, Privately Held

We help investors discover their true investment acumen and risk

About Novus

True Investment Acumen and Risk…at Scale

Top-Tier Client List

24/7 ETL Handholding

Overnight Failure = Business Hours Slowdown

Scala worker pool limited by the database

Non-trivial code changes needed to shard and scale

Before MemSQL…

Today’s Portfolio Intelligence…Right Now

Before MemSQL:

With MemSQL:

90 Min.

2 Min.

Customer Data

Persistent StoreETL Analytics

(Scala)

First-Class JSON Support…Happy Developers

memsql> select * from tasks t where t.task::uid::%clientId = 7;

+---------+---------------------------------------------------------------+| task_id | task |+---------+---------------------------------------------------------------+| 3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"noahlz"} |+---------+---------------------------------------------------------------+1 row in set (0.00 sec)

Salat

Client team focuses on service, not ETL

Predictable application performance

Scala workers: 12 126 Add servers to scale –

No code changes needed

With MemSQL…

http://www.novus.com

http://tech.novus.com

@NovusCode

Ian Hansen, Software Engineering ManagerDigital Ocean

ETL Tools for Small Teams

Problem: Business Intelligence Slows as We Grow

Data lives in SQL Easy to ask new questions in SQL But… Business Intelligence tasks taking longer Database isn’t built for quick aggregations

Solution: Scale-out SQL Database SQL team stays powerful Quick to iterate with quick answers Prepare for the future!

Problem: Data isn’t in MemSQL

Plus

You don’t have an engineer on your team

It’s hard to get an engineer’s time

You’ve got a job to do… (which is taking more and more time)

Solution: ETL Using REPLACE INTO MySQL SQL flavor (available in MemSQL) Handles new rows and updates on rows Easy to write

• Query source database then replace into target database

Many other scale-out SQL databases don’t have equivalent

Problem: Now Load JSON Event Data ~300K events per day Many different types of JSON events

Solution: MemSQL Loader + JSON Type

Only loads new files (or files whose content has changed)

Parallelizes the process Transformation script

simple: return id and raw json data

SQL team unaffected by new JSON events

./memsql-loader load /opt/events/** --table events --script=/opt/events-etl --file-id-column file_id --columns id,data

Problem: Processing Data on Select Need computed value in SQL query Computing the value slows down queries Computed value used on many queries

• e.g. domain from a URL string

Solution: Persistent Columns

Pre-compute result and save it on the row

Automatically updated if row changes

No need to alter ETL pipeline

ALTER TABLE events ADD COLUMN ( referring_domain AS substring_index(substring(data::$referrer, (locate('//', data::$referrer)) + 2), '/', 1) PERSISTED varchar(255))

Solution: Persistent Columns

Use pre-computed value in select

memsql> select data, referring_domain from events limit 2;+-------------------------------------+------------------+| data | referring_domain |+-------------------------------------+------------------+| {"referrer":"http://example.com/b"} | example.com || {"referrer":"http://example.com/a"} | example.com |+-------------------------------------+------------------+

Tools

REPLACE INTO syntax JSON native type MemSQL Loader Persistent columns Now, MemSQL Streamliner

We Want More Data

We are Hiring

Mike DePrizio, Senior Architect, Akamai Technologies

Unlocking Revenue with In-Memory Technology

We are the leading provider of cloud services for delivering, optimizing and securing online content and business applications

$1.96BRevenue

1,300Locations

5,000+Customers

5,100+Employees

CORPORATE STATS (2014):

OUR HISTORY: Founded 1998 and rooted in MIT technology—solving Internet congestion with math not hardware

The Business of Billing

Billing domino effect Akamai Customers Sub-customers

Daily billing requires: Fast data delivery

Accurate data

Old Model New Model

Generating a bill at end of month for customer services

Generating a bill at the end of every day for sub-customer services

Current Billing Data Management

Gather logs from 190,000+ servers in 1400 locations in 110 countries Multiple PBs/day aggregate/reduce into relevant billing data feed

Typical data record: 3 key fields plus metrics

Load resulting data record into our RDBMS system

Greatest Challenges Current system cannot handle expected throughput Difficult to quickly scale up existing environments New model will generate 10x+ data

Deploying MemSQLApplicationDaily Sub-customer billing

ProblemExisting RDMS pipeline loads were maxed out at 150-300K upserts/second, could not keep up with projected size of new billing model

ResultsMemSQL cluster performs at 1.9 million upserts/second, allowing transition from monthly to daily billing

Billing Data resource usage statistics

INSERT... ON DUPLICATE KEY UPDATE... (1.9 million/sec)

Billing Application

• Compute sub-customer charges daily

• Roll up sub-customer usage by customer/cloud provider

• More sophisticated platform offers customers better service, partners new business opportunities

Results Speak for Themselves 2M upserts/second on AWS EC2

instances

Scalability on commodity hardware

Meeting our billing windows

Unlocking revenue

Adapt PoC for real-world situations

Continue scaling linearly

Optimize results with small cluster deployment 

What Next?

Eric Frenkiel, MemSQL CEO and co-founderSeptember 30, 2015 • New York, NY

Introducing MemSQL Streamliner

One click deployment of integrated Apache Spark

Put Spark in the Fast Lane• GUI pipeline setup• Multiple data pipelines• Real-time transformation

Eliminates batch ETL Open source on GitHub

Introducing the MemSQL Streamliner

Simple Deployment Process

Application

1. Deploy MemSQL

Cluster

In-Memory | Distributed | Relational

Application

2. Deploy Spark

Cluster

Application

Kafka Connects to Each Node

Cluster

Application

Streamliner ArchitectureFirst of many integrated Apache Spark solutions

Other Real-Time Data

Sources Application

Apache Spark

Future Solution

Future Machine Learning Solution

STREAMLINER

Streamliner ETL Detail

Other Real-Time Data

Sources Application

Apache Spark

Future Solution

Future Machine Learning Solution

STREAMLINER

STREAMLINER

Custom

Future Extractor

JSON

Custom

Future Transformer

Extract Transform Load

Building Predictive Applications

StreamlinerInput

User JarSAS Generated PMML

Industrial Equipment

Sensor Data

S1 S2 S3 P1 P2 P3

Scoring Real-Time Data with Predictive Models

Sensor 1 Predictive Model 1

Streamliner Benefits

Build end-to-end data pipelines in minutes

Reduce data latency from days or hours to ZERO

Support thousands of concurrent users running real-time queries

Give users immediate access to fresh data via innovative applications 

THE GAME

See MemSQL Streamliner in Action at Booth #831