clearstorydata.com

Using Spark and Shark for Fast Cycle Analysis on Diverse Data

12.2.13

Vaibhav Nivargi

About ClearStory Data

Analysis in the New Data Landscape

New use cases seen in all industries.

• Live situational analysis requiring fast-cycle analysis across internal data and sources of external data

• Multi-source analysis whose insights refresh as data from the sources evolves

• Large-scale analysis of structured and unstructured data combined in integrated insights

Example: Interactive Multi-source Analysis

More data and more people change the analysis.

• Facebook: Shares, Likes, Comments

• News Coverage: Online, Print, Television

• Twitter: Followers, Tweets, Retweets

• Donations: New Members, Donations

• Website Traffic: Traffic, Referrals, Content

• Corporate Sponsors: Corporate Engagement, New Inquiries

Data Intelligence

Interactive analysis on diverse internal & external data

Today’s Need is Speed, Scale & Ad Hoc Flexibility

With more sources, more data and more people.

Why Spark and Shark?

• RDDs
– Low latency & scale
– Iterative and interactive computation

• Lineage and fault tolerance
– Able to re-derive data

• Expressive power of Scala and SQL
– Operations beyond aggregations, joins, and statistical operators
– Advanced: ML, data mining, segmentation, approximate queries, graphs …

• Support for structured and semi-structured data

• BDAS Stack & AMPLab
– Tachyon, MLBase, BlinkDB, GraphX …

• Community and adoption
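
The "lineage and fault tolerance" point can be illustrated with a small sketch. This is plain Python, not the Spark API: the `Dataset` class and its methods are hypothetical stand-ins for an RDD, assumed here only to show that a derived dataset can be rebuilt from its parent and its recorded transformation after its materialized copy is lost.

```python
from typing import Callable, List, Optional


class Dataset:
    """Toy stand-in for an RDD (hypothetical): it records how it was derived."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["Dataset"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self._source = source        # rows, for a root dataset only
        self._parent = parent        # lineage: the dataset this one derives from
        self._transform = transform  # lineage: how to derive it from the parent
        self._cached: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "Dataset":
        # Nothing is computed yet; only the derivation is recorded.
        return Dataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "Dataset":
        return Dataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def compute(self) -> List[int]:
        # Materialize on demand by walking the lineage back towards the root.
        if self._cached is None:
            if self._parent is None:
                self._cached = list(self._source)
            else:
                self._cached = self._transform(self._parent.compute())
        return self._cached

    def drop_cache(self) -> None:
        # Simulates losing the materialized copy (for example, a failed node).
        self._cached = None


base = Dataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.compute()
evens_squared.drop_cache()           # the materialized result is gone
rebuilt = evens_squared.compute()    # re-derived from the recorded lineage
```

In this sketch `rebuilt` equals `first` because the result is a pure function of the parent data and the recorded transformations, which is the property the slide refers to as being able to re-derive data.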

The ClearStory Solution

[Diagram: Data Sources, ClearStory Platform, ClearStory Application]

• Data Inference & Profiling

• Harmonization

• In-Memory Data Units

• Visualization

• Collaboration

Where do Spark & Shark fit?

[Architecture diagram; labels recovered from the slide]

• User Application

• ClearStory API

• Harmonization Engine and Blended Data Processing

• Spark Cluster + ClearStory IP

• Data Access, Inference and Lineage

• Data Source API

• Data sources: Public, Premium, Web, RDBMS, Hadoop, Files

How we leverage Spark & Shark

• User intent captured and translated to custom API

• Harmonization-as-a-Service

• Manages Spark and Shark query execution

• Reads cached data from HDFS

• RESTful

• Merges datasets (RDDs) on the fly, on user request

• Supports conversion of user actions to backend queries

• Query optimizations

• Performance optimizations

• Mixed-mode execution (sql2rdd & Spark native)

• Caching

• Pre-computation
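
As an illustration of converting user actions to backend queries, here is a minimal sketch in plain Python. The slides do not describe ClearStory's actual API, so everything named here is an assumption: the `UserAction` shape, the `to_sql` helper, and the table and column names are hypothetical, and the output is a generic SQL string of the kind a SQL engine such as Shark could be handed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ALLOWED_AGGREGATES = {"SUM", "AVG", "COUNT", "MIN", "MAX"}
ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


@dataclass
class UserAction:
    """Hypothetical description of what a user asked for in the application."""
    table: str
    group_by: List[str]
    measure: str                      # column to aggregate
    aggregate: str = "SUM"
    # Each filter is (column, operator, string literal).
    filters: List[Tuple[str, str, str]] = field(default_factory=list)


def to_sql(action: UserAction) -> str:
    """Translate a UserAction into a SQL string (sketch; identifiers are not validated)."""
    if action.aggregate not in ALLOWED_AGGREGATES:
        raise ValueError(f"unsupported aggregate: {action.aggregate}")

    select_cols = list(action.group_by)
    select_cols.append(f"{action.aggregate}({action.measure}) AS value")
    sql = f"SELECT {', '.join(select_cols)} FROM {action.table}"

    if action.filters:
        clauses = []
        for column, op, literal in action.filters:
            if op not in ALLOWED_OPS:
                raise ValueError(f"unsupported operator: {op}")
            escaped = literal.replace("'", "''")  # escape quotes in the literal
            clauses.append(f"{column} {op} '{escaped}'")
        sql += " WHERE " + " AND ".join(clauses)

    if action.group_by:
        sql += " GROUP BY " + ", ".join(action.group_by)
    return sql


action = UserAction(
    table="engagement",               # hypothetical table name
    group_by=["channel"],
    measure="donations",
    filters=[("region", "=", "west")],
)
query = to_sql(action)
print(query)
# SELECT channel, SUM(donations) AS value FROM engagement WHERE region = 'west' GROUP BY channel
```

Keeping the user's intent in a structured object rather than a raw query string leaves room for a service layer to rewrite, optimize, or cache the query before execution, which is in the spirit of the bullets above.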

How we leverage Spark & Shark

• Query results returned to the application for scalable visualization and ClearStory-specific viz techniques

• RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals

• Data updates automatically processed as source data changes

• ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale
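
The idea of caching and un-caching datasets based on usage patterns can be sketched as follows. This is plain Python rather than Spark, and the policy is an assumption made for illustration (the slides do not say which signals ClearStory uses): the hypothetical `UsageCache` keeps a dataset in memory only after it has been requested `min_hits` times, and evicts the least-requested entry only when the newcomer has been requested more often.

```python
from collections import Counter
from typing import Callable, Dict, List


class UsageCache:
    """Toy usage-driven cache (hypothetical policy, not ClearStory's)."""

    def __init__(self, capacity: int, min_hits: int = 2):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.min_hits = min_hits
        self.hits: Counter = Counter()        # request count per dataset name
        self.store: Dict[str, List] = {}      # materialized (cached) datasets

    def get(self, name: str, compute: Callable[[], List]) -> List:
        self.hits[name] += 1
        if name in self.store:
            return self.store[name]
        result = compute()
        if self.hits[name] >= self.min_hits:
            self._admit(name, result)
        return result

    def _admit(self, name: str, result: List) -> None:
        if len(self.store) >= self.capacity:
            coldest = min(self.store, key=lambda n: self.hits[n])
            if self.hits[coldest] >= self.hits[name]:
                return                        # keep the more heavily used entry
            del self.store[coldest]           # "un-cache" the least used entry
        self.store[name] = result


calls: List[str] = []


def load_a() -> List[int]:
    calls.append("a")                         # record each real computation
    return [1, 2, 3]


def load_b() -> List[int]:
    calls.append("b")
    return [4, 5, 6]


cache = UsageCache(capacity=1)
cache.get("a", load_a)   # first request: computed, not yet cached
cache.get("a", load_a)   # second request: computed again, now cached
cache.get("a", load_a)   # served from the cache
cache.get("b", load_b)   # computed, not cached (only one request so far)
cache.get("b", load_b)   # computed, but "a" is used more, so "a" stays cached
```

After these calls, `calls` holds two computations of `a` and two of `b`, and only `a` is cached. A real deployment would likely weigh more signals, such as dataset size and recomputation cost.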

Spark Developments – What We Like

• Query cancellation, progress indication (0.8.1 and beyond)

• More performance breakthroughs

• Workload Management

• BlinkDB

• MLBase

• Tachyon

• GraphX

We’re Hiring!

• Working with the community, giving back

• Lots of exciting new developments

• This is like the early days of Hadoop – massive momentum gathering

The First Spark Summit!

More Meet-ups!