outside the box: alternate query models and the future of big data

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Upload: inside-analysis

Post on 20-Aug-2015

TRANSCRIPT

The Briefing Room

Outside the Box: Alternate Query Models & the Future of Big Data

Twitter Tag: #briefr

Welcome

Host: Eric Kavanagh

[email protected]

Mission

•  Reveal the essential characteristics of enterprise software, good and bad

•  Provide a forum for detailed analysis of today’s innovative technologies

•  Give vendors a chance to explain their product to savvy analysts

•  Allow audience members to pose serious questions... and get answers!

Topics

This Month: INNOVATORS

January: ANALYTICS

February: BIG DATA

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

Data Discovery & Visualization

INNOVATORS

Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group

[email protected]

Infobright

•  Infobright’s columnar database is used for applications and data marts that analyze large volumes of machine-generated data

•  It leverages patented compression and optimization techniques, and a “knowledge grid,” to achieve real-time analytics

•  Infobright offers a commercial version of its software, as well as a freely available, open source product

Guests: Don DeLoach and Jeff Kibler

Don DeLoach is CEO and President of Infobright

Jeff Kibler is Senior Technical Architect for Infobright

Turning “Huh?” into “Aha!” Alternate Query Models and Big Data Analytics

About Infobright

•  Logistics, Manufacturing, Business Intelligence

•  Online & Mobile Advertising/Web Analytics, eCommerce, Social Networks

•  Government, Utilities, Research

•  Financial Services

•  Telecom, Security

•  400+ direct and OEM customers across North America, EMEA and Asia

•  1,000 installations

•  8 of Top 10 Global Telecom Carriers use Infobright via OEM/ISVs

Core Competencies

Columnar Database
   - Designed for fast analytics
   - Deep data compression

Intelligence, not Hardware
   - Knowledge Grid
   - Iterative Engine

Administrative Simplicity
   - No manual tuning
   - Minimal ongoing administration

Machine-Generated Data Is Everywhere

•  Weblogs

•  Computer, network events

•  Call detail records

•  Financial trade data

•  Sensors, RFID

•  Online game data

Businesses need to extract insight in near-real time from rapidly growing data volume:

•  Segment and target website visitors

•  Troubleshoot networks

•  Identify security threats and fraud

•  Optimize online/mobile ads

Internet of Things is a Multiplier for EVERYTHING

Emerging Data Analytics Stack: Days of One-Size-Fits-All Are Gone

•  Data management
   - Hadoop transforming this area

•  Transparent analytic stack
   - Operational, investigative, predictive
   - Machine-generated, text

•  User consumption
   - Real-time, interactive visualization & query creation

•  Data Center / Data Warehouse
   - Infrastructure strategies, options proliferating

“Yesterday’s BI-ETL-EDW stack is wrong-sided for tomorrow’s needs, and quickly becoming irrelevant.” Gigamon

Infobright: Columnar Architecture

Smarter architecture

•  Load data and go

•  No indices or partitions to build and maintain

•  Knowledge Grid automatically updated as data packs are created or updated

•  Super-compact data footprint can leverage off-the-shelf hardware

Data Packs – data stored in manageably sized, highly compressed data packs

Data compressed using algorithms tailored to data type

Knowledge Grid – statistics and metadata “describing” the super-compressed data

Column Orientation

The Knowledge Grid

Knowledge Grid applies to the whole table

Knowledge Nodes built for each Data Pack

[Diagram: Column A, Column B, … each divided into Data Packs DP1 to DP6, with Knowledge Nodes holding information about the data]

Information about the data

•  Global knowledge: string and character data, numeric data, distributions; built during LOAD

•  Dynamic knowledge: built per query, e.g. for aggregates, joins

Knowledge Nodes:

•  Answer the query directly, or

•  Identify only required Data Packs, minimizing decompression, and

•  Predict required data in advance based on workload

Optimizer / Granular Engine

Q: How are my sales doing this year?

[Diagram labels: Query Results; Knowledge Grid; Compressed Data; 1%]

1. Query received

2. Engine iterates on Knowledge Grid

3. Each pass eliminates Data Packs

4. If any Data Packs are needed to resolve query, only those are decompressed
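The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Infobright's implementation: it assumes the Knowledge Grid keeps only a minimum and maximum per data pack, it uses a tiny pack size so the example is easy to follow, and the names `PackStats`, `classify` and `count_greater` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 4  # tiny packs for illustration; the slides cite 65,536 values per pack


@dataclass
class PackStats:
    """Hypothetical per-pack metadata: only min and max are kept here."""
    lo: int
    hi: int


def build_stats(column: List[int], pack_size: int = PACK_SIZE) -> List[PackStats]:
    """Compute min/max for each pack, as would happen while loading data."""
    packs = [column[i:i + pack_size] for i in range(0, len(column), pack_size)]
    return [PackStats(min(p), max(p)) for p in packs]


def classify(stats: PackStats, threshold: int) -> str:
    """Classify a pack for the predicate `value > threshold`."""
    if stats.hi <= threshold:
        return "irrelevant"  # no value can match: skip the pack
    if stats.lo > threshold:
        return "relevant"    # every value matches: metadata alone suffices
    return "suspect"         # mixed: the pack has to be opened


def count_greater(column: List[int], threshold: int,
                  pack_size: int = PACK_SIZE) -> Tuple[int, int]:
    """Return (count, packs_opened) for COUNT(*) WHERE value > threshold."""
    count = opened = 0
    for i, stats in enumerate(build_stats(column, pack_size)):
        start = i * pack_size
        pack = column[start:start + pack_size]
        kind = classify(stats, threshold)
        if kind == "relevant":
            count += len(pack)
        elif kind == "suspect":
            opened += 1  # stands in for decompressing the pack
            count += sum(1 for v in pack if v > threshold)
    return count, opened


sales = [1, 2, 3, 4, 10, 20, 30, 40, 5, 50, 6, 60]
print(count_greater(sales, 8))  # (6, 1): six matches, one pack opened
```

Of the three packs, the first is skipped, the second is answered from metadata, and only the third is scanned.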

Infobright Architecture: Data Packs and Compression

[Diagram: a column split into data packs of 64K values each]

Data Packs

•  Each data pack contains 65,536 data values

•  Compression is applied to each individual data pack

•  The compression algorithm varies depending on data type and distribution

Compression

•  Results vary depending on the distribution of data among data packs

•  A typical overall compression ratio seen in the field is 10:1

•  Some customers have seen results of 40:1 and higher

•  For example, 1TB of raw data compressed 10 to 1 would only require 100GB of disk capacity

Patent-Pending Compression Algorithms
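To make the per-pack idea concrete, here is a small sketch that splits a column into packs of 65,536 values and compresses each pack independently. It assumes zlib as a stand-in compressor, since the slides do not describe Infobright's own algorithms. The function names and the sample data are hypothetical.

```python
import struct
import zlib

PACK_VALUES = 65_536  # values per data pack, as stated on the slide


def compress_column(values, pack_values=PACK_VALUES):
    """Split a column of integers into packs and compress each pack on its own.

    zlib is only a stand-in here; it is not Infobright's algorithm.
    """
    packs = []
    for i in range(0, len(values), pack_values):
        chunk = values[i:i + pack_values]
        raw = struct.pack(f"<{len(chunk)}q", *chunk)  # 8 bytes per value
        packs.append(zlib.compress(raw))
    return packs


def compression_ratio(values, packs):
    """Raw size (8 bytes per value) divided by total compressed size."""
    raw_bytes = 8 * len(values)
    stored_bytes = sum(len(p) for p in packs)
    return raw_bytes / stored_bytes


# Synthetic, highly repetitive column (a handful of port numbers).
ports = [80, 443, 53, 8080] * 50_000  # 200,000 values
packs = compress_column(ports)
print(len(packs), "packs")
print(f"ratio about {compression_ratio(ports, packs):.0f}:1")
```

Because each pack is compressed separately, a query that needs one pack can decompress it without touching the others. The ratio on this artificial data is far higher than real data would give.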

What Your Data Looks Like Now

Original data (10TB) = Compressed data (50 GB, avg compression ratio of 20:1) + Knowledge Grid (< .5 GB, < 1% of compressed data)
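The storage figures on this slide and the earlier 1TB example can be checked with simple division. The sketch below assumes decimal units (1 TB = 1,000 GB), and the helper names are illustrative. Note that 10 TB at 20:1 works out to roughly 500 GB, whereas 50 GB corresponds to 1 TB at 20:1, so the slide may be mixing two examples.

```python
GB_PER_TB = 1000  # decimal units, an assumption


def compressed_size_gb(raw_gb, ratio):
    """Disk space needed for raw_gb of data compressed at ratio:1."""
    return raw_gb / ratio


def implied_ratio(raw_gb, compressed_gb):
    """Compression ratio implied by a raw size and a compressed size."""
    return raw_gb / compressed_gb


print(compressed_size_gb(1 * GB_PER_TB, 10))   # 100.0 GB, the 1TB at 10:1 example
print(compressed_size_gb(10 * GB_PER_TB, 20))  # 500.0 GB for 10 TB at 20:1
print(compressed_size_gb(1 * GB_PER_TB, 20))   # 50.0 GB for 1 TB at 20:1
print(implied_ratio(10 * GB_PER_TB, 50))       # 200.0, i.e. 10 TB into 50 GB
print(0.5 / 50)                                # 0.01: a 0.5 GB grid is 1% of 50 GB
```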

Alternate Query Models: When Good Enough Works

•  The “principle of exactness” is the default for most data analytics and access systems today

•  With “approximate queries,” good-enough answers can be found using fewer resources

•  This works best when it is easy to alternate between approximation and exactness

•  The result is an interactivity that accelerates time to answers and reduces the computing resources required

•  Standard Queries: the Knowledge Grid is used to aid performance; only the required data packs are opened, and exact results are retrieved

•  Rough Queries: only the Knowledge Grid is used to derive an answer quickly, typically for analytics like SUM, AVG, MAX
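A rough query can be pictured as answering from per-pack summaries alone. The sketch below assumes the metadata holds a count, sum, minimum and maximum per pack, and assumes non-negative values; the slides do not spell out what the Knowledge Grid stores. The names `PackSummary` and `rough_sum_where_greater` are hypothetical. Full-column aggregates come out exact, while a filtered SUM comes back as a range.

```python
from dataclasses import dataclass


@dataclass
class PackSummary:
    """Hypothetical metadata kept for one data pack."""
    count: int
    total: int
    lo: int
    hi: int


def summarize(column, pack_size):
    """Build one summary per pack, as would happen at load time."""
    out = []
    for i in range(0, len(column), pack_size):
        pack = column[i:i + pack_size]
        out.append(PackSummary(len(pack), sum(pack), min(pack), max(pack)))
    return out


def rough_max(summaries):
    return max(s.hi for s in summaries)


def rough_sum(summaries):
    return sum(s.total for s in summaries)


def rough_avg(summaries):
    return rough_sum(summaries) / sum(s.count for s in summaries)


def rough_sum_where_greater(summaries, threshold):
    """Bounds (low, high) on SUM(value) WHERE value > threshold.

    No pack is opened. Assumes all values are non-negative.
    """
    low = high = 0
    for s in summaries:
        if s.lo > threshold:      # whole pack matches: its total is exact
            low += s.total
            high += s.total
        elif s.hi > threshold:    # mixed pack: at least the max matches
            low += s.hi
            high += s.total
    return low, high


values = [1, 2, 3, 4, 10, 20, 30, 40, 5, 50, 6, 60]
grid = summarize(values, pack_size=4)
print(rough_max(grid), rough_sum(grid), rough_avg(grid))  # 60 231 19.25
print(rough_sum_where_greater(grid, 8))  # (160, 221); the exact answer is 210
```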

Tools for Investigative Analysis

Today, Infobright provides:

•  Approximate Queries: use a combination of the Knowledge Grid and Intelligent Random Sampling to return results very quickly; applicable to any type of query
   - Exact results are not important
   - Top-N type queries
   - Investigative Analytics

Tools for Investigative Analysis

Fast and Informative:

•  Approximate Query is useful when looking for data in an exploratory fashion (e.g. anomalous events, understanding data characteristics)

•  Example: Find the “Top-10” protocols and ports extracted from event records.

•  An Exact Query may take minutes; an Approximate Query can answer in seconds. What’s important is the Top-10, not necessarily the exact numbers.
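The sampling half of this idea can be sketched as follows. This is a generic uniform-sampling estimate, not Infobright's "Intelligent Random Sampling", whose details are not given here. The records are synthetic and the function names are hypothetical. For simplicity the sketch still iterates over every record; a real engine would avoid reading the unsampled data.

```python
import random
from collections import Counter


def exact_top_n(records, n):
    """Scan every record and return the n largest per-key totals."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return totals.most_common(n)


def approx_top_n(records, n, fraction, seed=0):
    """Estimate the top n from a uniform random sample of the records.

    Each sampled total is scaled by 1/fraction to estimate the full total.
    """
    rng = random.Random(seed)
    totals = Counter()
    for key, value in records:
        if rng.random() < fraction:
            totals[key] += value
    return [(key, round(total / fraction)) for key, total in totals.most_common(n)]


# Synthetic event records (protocol, bytes); not taken from any real system.
records = (
    [("HTTP-80", 100)] * 20_000
    + [("DNS", 100)] * 12_000
    + [("HTTPS-443", 100)] * 6_000
    + [("XMPP-Client", 100)] * 2_000
)

print(exact_top_n(records, 3))
print(approx_top_n(records, 3, fraction=0.1))
```

With totals this far apart, a 10% sample recovers the same ranking while the estimated sums differ somewhat from the exact ones.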

Use Case

EXACT QUERY

DY_HR   SUM(TDR)   AP_NAME
8       14269152   DNS
8       13716936   HTTP-80
8       13527636   HTTPS-443
8       13044432   UNDEFINED
8       11486904   NO APPL PORT
8        4280412   UNDEFINED
8        2313288   HTTP-ALT-8080
8        1278876   5223
8        1214100   DNS-53
8         991560   NO APPL PORT
8         899220   XMPP-Client

APPROXIMATE QUERY

DY_HR   SUM(TDR)   AP_NAME
8       16872663   HTTP-80
8       15361320   DNS
8       14528793   HTTPS-443
8       13578984   UNDEFINED
8       11613616   NO APPL PORT
8        3659742   UNDEFINED
8        2724149   HTTP-ALT-8080
8        1427824   5223
8        1194147   DNS-53
8        1083973   NO APPL PORT
8         967579   XMPP-Client
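The two result sets above can be compared directly. The sketch below copies the figures from the slide and checks how far the approximate sums drift and whether the ranking is preserved. The variable and function names are illustrative only.

```python
# (AP_NAME, SUM(TDR)) rows in the order shown on the slide.
exact = [
    ("DNS", 14269152), ("HTTP-80", 13716936), ("HTTPS-443", 13527636),
    ("UNDEFINED", 13044432), ("NO APPL PORT", 11486904), ("UNDEFINED", 4280412),
    ("HTTP-ALT-8080", 2313288), ("5223", 1278876), ("DNS-53", 1214100),
    ("NO APPL PORT", 991560), ("XMPP-Client", 899220),
]
approx = [
    ("HTTP-80", 16872663), ("DNS", 15361320), ("HTTPS-443", 14528793),
    ("UNDEFINED", 13578984), ("NO APPL PORT", 11613616), ("UNDEFINED", 3659742),
    ("HTTP-ALT-8080", 2724149), ("5223", 1427824), ("DNS-53", 1194147),
    ("NO APPL PORT", 1083973), ("XMPP-Client", 967579),
]


def names(rows):
    return [name for name, _ in rows]


# Same entries overall, but is the order identical?
same_members = sorted(names(exact)) == sorted(names(approx))
same_order = names(exact) == names(approx)

# Relative error for the five largest entries (their names are unique).
exact5, approx5 = dict(exact[:5]), dict(approx[:5])
rel_err = {name: abs(approx5[name] - exact5[name]) / exact5[name] for name in exact5}

print("same members:", same_members, "| same order:", same_order)
for name, err in rel_err.items():
    print(f"{name:13s} {err:.1%}")
```

Both queries return the same entries. Only the first two (DNS and HTTP-80) swap places, and the largest deviation among the top five is about 23%, for HTTP-80.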

Example: Online Advertising Segmentation

The goal in this example is to create a targeted campaign. They have a minimum number of participants that have to be included in the target group.

Traditional Queries

•  Find the top n individuals who meet criteria 1

•  Then find the top m individuals who meet criteria 1 and criteria 2

•  They also have to look at how many individuals are in each permutation of the criteria

•  This is repeated until they are in the range that they want to work with. There can be up to 1500 different criteria, though they normally stop after 7 or 8 different filters

•  This process can take a considerable amount of time

Approximate Queries

•  Approximate queries could dramatically reduce the time it takes to determine which set of criteria they should use

•  They can (if desired) use exact queries to calculate the exact final numbers, instead of having to do exact queries for all the runs

•  This process can collapse an effort that takes hours into minutes or seconds
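The workflow described here, approximate counts while choosing filters and a single exact count at the end, might look like the sketch below. The audience data, the criteria and all function names are invented for illustration, and plain uniform sampling stands in for whatever approximation the database actually uses.

```python
import random


def exact_count(people, criteria):
    """Count people matching every (name, predicate) pair in criteria."""
    return sum(1 for p in people if all(test(p) for _, test in criteria))


def approx_count(people, criteria, fraction, rng):
    """Estimate the same count from a random sample, scaled by 1/fraction."""
    hits = sum(
        1 for p in people
        if rng.random() < fraction and all(test(p) for _, test in criteria)
    )
    return round(hits / fraction)


def pick_criteria(people, candidates, min_size, max_size, fraction=0.1, seed=0):
    """Add filters using approximate counts; run one exact count at the end."""
    rng = random.Random(seed)
    chosen = []
    for candidate in candidates:
        trial = chosen + [candidate]
        estimate = approx_count(people, trial, fraction, rng)
        if estimate < min_size:
            continue  # this filter would make the audience too small: skip it
        chosen = trial
        if estimate <= max_size:
            break     # estimated audience is inside the target range
    return [name for name, _ in chosen], exact_count(people, chosen)


# Synthetic audience; the attributes and criteria are invented for illustration.
people = [{"age": 18 + i % 60, "mobile": i % 2 == 0} for i in range(20_000)]
candidates = [
    ("under_40", lambda p: p["age"] < 40),
    ("mobile", lambda p: p["mobile"]),
]
print(pick_criteria(people, candidates, min_size=2_000, max_size=5_000))
```

Only the final `exact_count` call touches every row for an exact answer; each intermediate step works from a 10% sample.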

Big Data Analytics At the End of the Day

HIGH AVAILABILITY

LOW TOUCH

AFFORDABILITY TCO

AD HOC PERFORMANCE SCALABILITY

COMPRESSION

LOAD SPEEDS

Thank  you!  

Perceptions & Questions

Analyst: Robin Bloor

The Current Disposition

•  10 bn connected devices

•  13 to 14 bn new processors embedded every year

•  Estimate 31 bn connected devices by 2020

•  Sensors, RFID tags, DSPs, FPGAs, CPUs, etc.

•  To control, alert, log and report

•  Data growth at 55% pa
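To put the 55% annual growth figure in perspective, the compound-growth arithmetic is short. The helper names are illustrative, and the five-year horizon is an arbitrary choice for the example.

```python
import math


def growth_factor(rate, years):
    """Multiplier after compounding `rate` annually for `years` years."""
    return (1 + rate) ** years


def doubling_time(rate):
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


print(f"{doubling_time(0.55):.2f} years to double")     # 1.58
print(f"{growth_factor(0.55, 5):.1f}x after 5 years")   # 8.9x
```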

IOT Data Characteristics

•  Arrives in continuous streams

•  Generally reliable (i.e., not in need of cleansing)

•  Very high volume

•  “Big tables” of predictably structured data

•  So, very little need for ETL activity

•  If “valuable” then processing speed is likely to be critical

IOT Apps and Database

•  Mostly streaming – for alerts and BI (analysis, discovery)

•  DBMS choice is a “horses for courses” thing

•  If performance matters, probably not a Hadoop app

•  The data structure does not favor the prominent NoSQL DBMSs

•  Traditional RDBMS will not do well

•  Hence column-store approach is most logical

The Coming Inversion

1. Instrument existing (dumb) devices

2. Gather and analyze data

3. Redesign device and its instrumentation from knowledge gained

4. Iterate

In terms of DATA VOLUMES we expect the IOT DATA VOLUME to swamp all other sources of data

Going Forward

•  Do the high compression rates you achieve occur because it is machine data, i.e., it’s a function of the characteristics of the data?

•  Is the “approximate query” an Infobright invention?

•  How frequently do customers use this type of query and for what type of applications?

•  Who, typically, are the Infobright end users?

•  What “relationship” does Infobright favor with Hadoop?

•  What statistical functions, if any, does Infobright offer?

•  What does the product roadmap look like?

Upcoming Topics

www.insideanalysis.com

Thank You for Your Attention