Outside the Box: Alternate Query Models and the Future of Big Data

Grab some coffee and enjoy the pre-show banter before the top of the hour!

The Briefing Room

Outside the Box: Alternate Query Models & the Future of Big Data

Twitter Tag: #briefr


Welcome

Host: Eric Kavanagh




Mission

• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!



Topics

This Month: INNOVATORS

January: ANALYTICS

February: BIG DATA

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room



Data Discovery & Visualization

INNOVATORS



Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group




Infobright

• Infobright’s columnar database is used for applications and data marts that analyze large volumes of machine-generated data
• It leverages patented compression and optimization techniques, and a “knowledge grid,” to achieve real-time analytics
• Infobright offers a commercial version of its software, as well as a freely available, open source product



Guests: Don DeLoach and Jeff Kibler

Don DeLoach is CEO and President of Infobright

Jeff Kibler is Senior Technical Architect for Infobright

Turning “Huh?” into “Aha!” Alternate Query Models and Big Data Analytics

About Infobright

• 400+ direct and OEM customers across North America, EMEA and Asia
• 1,000 installations
• 8 of Top 10 Global Telecom Carriers use Infobright via OEM/ISVs

Logistics, Manufacturing, Business Intelligence
Online & Mobile Advertising/Web Analytics, eCommerce, Social Networks
Government, Utilities, Research
Financial Services
Telecom, Security

Core Competencies

Columnar Database
• Designed for fast analytics
• Deep data compression

Intelligence, not Hardware
• Knowledge Grid
• Iterative Engine

Administrative Simplicity
• No manual tuning
• Minimal ongoing administration

Machine-Generated Data Is Everywhere

• Weblogs
• Computer, network events
• Call detail records
• Financial trade data
• Sensors, RFID
• Online game data

Businesses need to extract insight in near-real time from rapidly growing data volume:

• Segment and target website visitors
• Troubleshoot networks
• Identify security threats and fraud
• Optimize online/mobile ads

Internet of Things is a Multiplier for EVERYTHING

Emerging Data Analytics Stack: Days of One-Size-Fits-All Are Gone

• Data management: Hadoop transforming this area
• Transparent analytic stack: operational, investigative, predictive; machine-generated, text
• User consumption: real-time, interactive visualization & query creation
• Data Center / Data Warehouse: infrastructure strategies, options proliferating

“Yesterday’s BI-ETL-EDW stack is wrong-sided for tomorrow’s needs, and quickly becoming irrelevant.” Gigamon

Infobright: Columnar Architecture

Smarter architecture
• Load data and go
• No indices or partitions to build and maintain
• Knowledge Grid automatically updated as data packs are created or updated
• Super-compact data footprint can leverage off-the-shelf hardware

Data Packs – data stored in manageably sized, highly compressed data packs

Data compressed using algorithms tailored to data type

Knowledge Grid – statistics and metadata “describing” the super-compressed data

Column Orientation

The Knowledge Grid

Information about the data

• Knowledge Grid applies to the whole table
• Knowledge Nodes built for each Data Pack

[Figure: Column A, Column B, … each divided into Data Packs DP1 to DP6. Labels: Global knowledge; Dynamic knowledge; String and character data; Numeric data; Distributions; Built during LOAD; Built per query, e.g. for aggregates, joins]

• Knowledge Nodes answer the query directly, or
• Identify only required Data Packs, minimizing decompression, and
• Predict required data in advance based on workload

Optimizer / Granular Engine

[Figure: Q: How are my sales doing this year? Labels: Query, Results, Knowledge Grid, Compressed Data, 1%]

1. Query received
2. Engine iterates on Knowledge Grid
3. Each pass eliminates Data Packs
4. If any Data Packs are needed to resolve query, only those are decompressed
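The four steps above can be sketched in a few lines of Python. This is only an illustration of the general idea of pruning by per-pack metadata, not Infobright's engine: the `PackStats` class, the tiny `PACK_SIZE`, and the function names are all hypothetical, and "decompression" is simulated by simply reading the raw values.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 4  # tiny for illustration; the slides describe 65,536 values per pack


@dataclass
class PackStats:
    """Hypothetical per-pack metadata, standing in for a Knowledge Node."""
    lo: int
    hi: int
    total: int


def build_packs(values: List[int]) -> List[Tuple[List[int], PackStats]]:
    """Split a column into packs and record min, max and sum for each pack."""
    packs = []
    for i in range(0, len(values), PACK_SIZE):
        chunk = values[i:i + PACK_SIZE]
        packs.append((chunk, PackStats(min(chunk), max(chunk), sum(chunk))))
    return packs


def sum_where_greater(packs, threshold: int) -> Tuple[int, int]:
    """Return (SUM of values > threshold, number of packs that had to be opened)."""
    result, opened = 0, 0
    for chunk, stats in packs:
        if stats.hi <= threshold:
            continue  # irrelevant pack: eliminated using metadata only
        if stats.lo > threshold:
            result += stats.total  # fully relevant pack: answered from metadata
            continue
        opened += 1  # suspect pack: only now is the actual data touched
        result += sum(v for v in chunk if v > threshold)
    return result, opened


column = [1, 2, 3, 4, 10, 11, 12, 13, 3, 9, 4, 8, 20, 21, 22, 23]
print(sum_where_greater(build_packs(column), 5))  # (149, 1)
```

In this toy column only one of the four packs straddles the threshold, so only that pack is read; the other three are resolved from their statistics alone.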

Infobright Architecture: Data Packs and Compression

[Figure: a column divided into four 64K data packs]

Data Packs
• Each data pack contains 65,536 data values
• Compression is applied to each individual data pack
• The compression algorithm varies depending on data type and distribution

Compression
• Results vary depending on the distribution of data among data packs
• A typical overall compression ratio seen in the field is 10:1
• Some customers have seen results of 40:1 and higher
• For example, 1TB of raw data compressed 10 to 1 would only require 100GB of disk capacity

Patent-Pending Compression Algorithms
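The arithmetic in the last bullet, and the idea of compressing each pack on its own, can be illustrated as follows. This is a sketch under stated assumptions: `zlib` is used only as a stand-in codec (the slides do not describe Infobright's own algorithms), values are stored as 8-byte integers, and the function names are made up for the example.

```python
import zlib
from array import array
from typing import List, Tuple

PACK_VALUES = 65_536  # values per data pack, as stated on the slide


def disk_needed_gb(raw_gb: float, ratio: float) -> float:
    """Disk capacity needed for raw_gb of data at a given compression ratio."""
    return raw_gb / ratio


def compress_packs(values: List[int]) -> List[Tuple[int, int]]:
    """Compress each pack independently; return (raw bytes, compressed bytes) per pack."""
    sizes = []
    for i in range(0, len(values), PACK_VALUES):
        raw = array("q", values[i:i + PACK_VALUES]).tobytes()
        sizes.append((len(raw), len(zlib.compress(raw))))  # zlib is a stand-in codec
    return sizes


print(disk_needed_gb(1000, 10))  # 100.0, i.e. 1TB at 10:1 needs 100GB

# Two packs of highly repetitive values, similar in spirit to machine-generated data
for raw_bytes, packed_bytes in compress_packs([i % 100 for i in range(2 * PACK_VALUES)]):
    print(raw_bytes, packed_bytes, round(raw_bytes / packed_bytes, 1))
```

Because each pack is compressed separately, the ratio achieved depends on how the values happen to be distributed within that pack, which is consistent with the slide's remark that results vary.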

What Your Data Looks Like Now

Original data (10TB) = Compressed data (50 GB, avg compression ratio of 20:1) + Knowledge Grid (< .5 GB, < 1% of compressed data)

Alternate Query Models: When Good Enough Works

• The “principle of exactness” is the default for most data analytics and access systems today
• With “approximate queries,” good-enough answers can be found using fewer resources
• This works best when users can easily alternate between approximation and exactness
• The result is an interactivity that accelerates time to answers and reduces the computing resources required

Tools for Investigative Analysis

Today, Infobright provides:

• Standard Queries: the Knowledge Grid is used to aid performance; only the required data packs are opened, and exact results are retrieved
• Rough Queries: only the Knowledge Grid is used to derive an answer quickly, typically for analytics like SUM, AVG, MAX
• Approximate Queries: use a combination of the Knowledge Grid and Intelligent Random Sampling to return results very quickly; applicable to any type of query
  • Exact results are not important
  • Top-N type queries
  • Investigative analytics
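To make the rough query idea concrete, here is a minimal sketch of answering from per-pack metadata alone, without opening any data. It is an assumption-laden illustration, not Infobright's implementation: `PackStats` and both functions are hypothetical, and the SUM bounds assume non-negative values.

```python
from typing import List, NamedTuple, Tuple


class PackStats(NamedTuple):
    """Hypothetical per-pack metadata: minimum, maximum and row count."""
    lo: float
    hi: float
    count: int


def rough_max(stats: List[PackStats]) -> float:
    """MAX needs no data at all: it is the largest per-pack maximum."""
    return max(s.hi for s in stats)


def rough_sum_at_least(stats: List[PackStats], threshold: float) -> Tuple[float, float]:
    """Bound SUM(x) WHERE x >= threshold using metadata only (non-negative x assumed)."""
    low = high = 0.0
    for s in stats:
        if s.hi < threshold:
            continue  # no row in this pack can match
        if s.lo >= threshold:
            low += s.lo * s.count  # every row matches and is at least lo
        else:
            low += s.hi  # at least the row holding the maximum matches
        high += s.hi * s.count  # no row can exceed hi
    return low, high


stats = [PackStats(1, 4, 4), PackStats(10, 13, 4), PackStats(3, 9, 4), PackStats(20, 23, 4)]
print(rough_max(stats))              # 23
print(rough_sum_at_least(stats, 5))  # (129.0, 180.0)
```

The answer is a range rather than a single number. If the range is too wide for the question at hand, a standard query over the suspect packs would narrow it to the exact value.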

Tools for Investigative Analysis

Fast and Informative:

• Approximate Query is useful when looking for data in an exploratory fashion (e.g. anomalous events, understanding data characteristics)
• Example: Find the “Top-10” protocols and ports extracted from event records
• Exact Query may take minutes; Approximate Query can answer in seconds. What’s important is the Top-10, not necessarily the exact numbers
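A Top-N query of this kind can be approximated by aggregating over a random sample and scaling the totals back up. The sketch below uses plain uniform sampling, which is an assumption made for illustration; the slides refer to "Intelligent Random Sampling" combined with the Knowledge Grid, and that technique is not reproduced here. All function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Tuple

Row = Tuple[str, int]  # (protocol name, value to aggregate)


def exact_top_n(rows: List[Row], n: int) -> List[Tuple[str, int]]:
    """Scan every row and return the n largest totals."""
    totals = Counter()
    for name, value in rows:
        totals[name] += value
    return totals.most_common(n)


def approximate_top_n(rows: List[Row], n: int, fraction: float, seed: int = 0) -> List[Tuple[str, int]]:
    """Aggregate a uniform random sample, then scale totals by 1 / fraction."""
    rng = random.Random(seed)
    totals = Counter()
    for name, value in rows:
        if rng.random() < fraction:  # keep roughly `fraction` of the rows
            totals[name] += value
    return [(name, round(total / fraction)) for name, total in totals.most_common(n)]


# Synthetic event records, one row per event
rows = [("HTTP-80", 1)] * 30_000 + [("DNS", 1)] * 20_000 + [("HTTPS-443", 1)] * 10_000 + [("XMPP", 1)] * 100
print(exact_top_n(rows, 3))
print(approximate_top_n(rows, 3, fraction=0.1))
```

The sampled totals differ from the exact ones, but when the leading groups are well separated the same names come out on top while only about a tenth of the rows are aggregated.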

Use Case

EXACT QUERY

DY_HR  SUM(TDR)  AP_NAME
8      14269152  DNS
8      13716936  HTTP-80
8      13527636  HTTPS-443
8      13044432  UNDEFINED
8      11486904  NO APPL PORT
8      4280412   UNDEFINED
8      2313288   HTTP-ALT-8080
8      1278876   5223
8      1214100   DNS-53
8      991560    NO APPL PORT
8      899220    XMPP-Client

APPROXIMATE QUERY

DY_HR  SUM(TDR)  AP_NAME
8      16872663  HTTP-80
8      15361320  DNS
8      14528793  HTTPS-443
8      13578984  UNDEFINED
8      11613616  NO APPL PORT
8      3659742   UNDEFINED
8      2724149   HTTP-ALT-8080
8      1427824   5223
8      1194147   DNS-53
8      1083973   NO APPL PORT
8      967579    XMPP-Client
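The two result sets can be compared row by row with a short script. The figures are copied from the slide; the pairing of rows is an assumption, since UNDEFINED and NO APPL PORT each appear twice, so the sketch matches the k-th occurrence of a name in one list with the k-th occurrence in the other.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]

EXACT: List[Row] = [
    ("DNS", 14269152), ("HTTP-80", 13716936), ("HTTPS-443", 13527636),
    ("UNDEFINED", 13044432), ("NO APPL PORT", 11486904), ("UNDEFINED", 4280412),
    ("HTTP-ALT-8080", 2313288), ("5223", 1278876), ("DNS-53", 1214100),
    ("NO APPL PORT", 991560), ("XMPP-Client", 899220),
]
APPROX: List[Row] = [
    ("HTTP-80", 16872663), ("DNS", 15361320), ("HTTPS-443", 14528793),
    ("UNDEFINED", 13578984), ("NO APPL PORT", 11613616), ("UNDEFINED", 3659742),
    ("HTTP-ALT-8080", 2724149), ("5223", 1427824), ("DNS-53", 1194147),
    ("NO APPL PORT", 1083973), ("XMPP-Client", 967579),
]


def keyed(rows: List[Row]) -> Dict[Tuple[str, int], int]:
    """Key rows by (name, occurrence index) so repeated names stay distinct (assumption)."""
    seen: Dict[str, int] = {}
    out: Dict[Tuple[str, int], int] = {}
    for name, value in rows:
        k = seen.get(name, 0)
        seen[name] = k + 1
        out[(name, k)] = value
    return out


def relative_errors(exact: List[Row], approx: List[Row]) -> Dict[Tuple[str, int], float]:
    """Signed relative error of each approximate total against the exact total."""
    e, a = keyed(exact), keyed(approx)
    return {key: (a[key] - e[key]) / e[key] for key in e}


errors = relative_errors(EXACT, APPROX)
for (name, k), err in sorted(errors.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:<14} #{k}  {err:+.1%}")
```

Under that pairing, both queries return the same set of entries. The largest gap is about 23% (HTTP-80), and the top two entries, DNS and HTTP-80, swap places, which fits the slide's point that the membership of the list matters more than the exact numbers.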

Example: Online Advertising Segmentation

The goal in this example is to create a targeted campaign. There is a minimum number of participants that have to be included in the target group.

Find the top n individuals who meet criterion 1. Then find the top m individuals who meet criterion 1 and criterion 2. They also have to look at how many individuals are in each permutation of the criteria.

This is repeated until they are in the range that they want to work with. There can be up to 1,500 different criteria, though they normally stop after 7 or 8 different filters.

This process can take a considerable amount of time. Approximate queries could dramatically reduce the time it takes to determine which set of criteria they should use. They can (if desired) use exact queries to calculate the exact final numbers, instead of having to run exact queries for all the runs.

[Figure labels: Traditional Queries, Approximate Queries]

This process can collapse an effort that takes hours into minutes or seconds
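The segmentation workflow described above can be sketched as a loop that adds one criterion at a time, sizes the audience on a sample, and runs a single exact count at the end. This is a simplified illustration: the user records, the criteria, the `target_max` stopping rule and all names are hypothetical, and only an upper bound on audience size is modeled, not the minimum mentioned on the slide.

```python
import random
from typing import Callable, Dict, List, Tuple

User = Dict[str, int]
Criterion = Callable[[User], bool]


def exact_count(users: List[User], criteria: List[Criterion]) -> int:
    """Count users that satisfy every criterion (a full scan)."""
    return sum(1 for u in users if all(c(u) for c in criteria))


def pick_criteria(users: List[User], candidates: List[Criterion], target_max: int,
                  fraction: float = 0.1, seed: int = 0) -> Tuple[List[Criterion], int]:
    """Add criteria until the estimated audience fits; verify once with an exact count."""
    rng = random.Random(seed)
    sample = rng.sample(users, max(1, int(len(users) * fraction)))
    scale = len(users) / len(sample)
    chosen: List[Criterion] = []
    for criterion in candidates:
        estimate = exact_count(sample, chosen) * scale  # cheap: scans the sample only
        if estimate <= target_max:
            break
        chosen.append(criterion)
    return chosen, exact_count(users, chosen)  # one exact query for the final numbers


users = [{"age": i % 50, "visits": i % 7} for i in range(10_000)]
candidates = [lambda u: u["age"] >= 25, lambda u: u["visits"] >= 4]
chosen, final_size = pick_criteria(users, candidates, target_max=3_000)
print(len(chosen), final_size)
```

Every intermediate sizing step touches only the sample, so the full data set is scanned once, for the final count, rather than once per candidate filter.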

Big Data Analytics At the End of the Day

HIGH AVAILABILITY
LOW TOUCH
AFFORDABILITY TCO
AD HOC PERFORMANCE SCALABILITY
COMPRESSION
LOAD SPEEDS

Thank you!



Perceptions & Questions

Analyst: Robin Bloor

The Current Disposition

• 10 bn connected devices
• 13 to 14 bn new processors embedded every year
• Estimate 31 bn connected devices by 2020
• Sensors, RFID tags, DSPs, FPGAs, CPUs, etc.
• To control, alert, log and report
• Data growth at 55% pa

IOT Data Characteristics

• Arrives in continuous streams
• Generally reliable (i.e., not in need of cleansing)
• Very high volume
• “Big tables” of predictably structured data
• So, very little need for ETL activity
• If “valuable” then processing speed is likely to be critical

IOT Apps and Database

• Mostly streaming, for alerts and BI (analysis, discovery)
• DBMS choice is a “horses for courses” thing
• If performance matters, probably not a Hadoop app
• The data structure does not favor the prominent NoSQL DBMSs
• Traditional RDBMS will not do well
• Hence column-store approach is most logical

The Coming Inversion

1. Instrument existing (dumb) devices

2. Gather and analyze data

3. Redesign device and its instrumentation from knowledge gained

4. Iterate

In terms of DATA VOLUMES we expect the IOT DATA VOLUME to swamp all other sources of data

Going Forward

• Do the high compression rates you achieve occur because it is machine data, i.e., it’s a function of the characteristics of the data?
• Is the “approximate query” an Infobright invention?
• How frequently do customers use this type of query and for what type of applications?
• Who, typically, are the Infobright end users?
• What “relationship” does Infobright favor with Hadoop?
• What statistical functions, if any, does Infobright offer?
• What does the product roadmap look like?



Upcoming Topics

www.insideanalysis.com

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

This Month: INNOVATORS

January: ANALYTICS

February: BIG DATA



Thank You for Your Attention
