Outside the Box: Alternate Query Models and the Future of Big Data
TRANSCRIPT
Twitter Tag: #briefr
The Briefing Room
! Reveal the essential characteristics of enterprise software, good and bad
! Provide a forum for detailed analysis of today’s innovative technologies
! Give vendors a chance to explain their product to savvy analysts
! Allow audience members to pose serious questions... and get answers!
Mission
Topics
This Month: INNOVATORS
January: ANALYTICS
February: BIG DATA
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
Infobright
! Infobright’s columnar database is used for applications and data marts that analyze large volumes of machine-generated data
! It leverages patented compression and optimization techniques, and a “knowledge grid,” to achieve real-time analytics
! Infobright offers a commercial version of its software, as well as a freely-available, open source product
Guests: Don DeLoach and Jeff Kibler
Don DeLoach is CEO and President of Infobright
Jeff Kibler is Senior Technical Architect for Infobright
Logistics, Manufacturing, Business Intelligence
Online & Mobile Advertising/Web Analytics, eCommerce, Social Networks
Government, Utilities, Research
Financial Services
Telecom, Security
§ 400+ direct and OEM customers across North America, EMEA and Asia
§ 1,000 installations
§ 8 of Top 10 Global Telecom Carriers use Infobright via OEM/ISVs
About Infobright
Columnar Database
Designed for fast analytics
Deep data compression
Intelligence, not Hardware
Knowledge Grid
Iterative Engine
Administrative Simplicity
No manual tuning
Minimal ongoing administration
Core Competencies
Machine-Generated Data Is Everywhere
§ Weblogs
§ Computer, network events
§ Call detail records
§ Financial trade data
§ Sensors, RFID
§ Online game data
Businesses need to extract insight in near-real time from rapidly growing data volume:
• Segment and target website visitors
• Troubleshoot networks
• Identify security threats and fraud
• Optimize online/mobile ads
§ Data management § Hadoop transforming this area
§ Transparent analytic stack § Operational, investigative, predictive § Machine-generated, text
§ User consumption § Real-time, interactive visualization & query creation
§ Data Center / Data Warehouse § Infrastructure strategies, options proliferating
Emerging Data Analytics Stack: Days of One-Size-Fits-All Are Gone
“Yesterday’s BI-ETL-EDW stack is wrong-sided for tomorrow’s needs, and quickly becoming irrelevant.” Gigamon
Infobright: Columnar Architecture
Smarter architecture
§ Load data and go
§ No indices or partitions to build and maintain
§ Knowledge Grid automatically updated as data packs are created or updated
§ Super-compact data footprint can leverage off-the-shelf hardware
Data Packs – data stored in manageably sized, highly compressed data packs
Data compressed using algorithms tailored to data type
Knowledge Grid – statistics and metadata “describing” the super-compressed data
Column Orientation
The Knowledge Grid
Knowledge Grid applies to the whole table
[Diagram: Column A, Column B, … divided into Data Packs DP1 through DP6]
Information about the data
Knowledge Nodes built for each Data Pack
Dynamic knowledge
Global knowledge
String and character data
Numeric data
Distributions
Built during LOAD
Built per query, e.g. for aggregates, joins
[Diagram label: DP1, Column A]
§ Knowledge Nodes answer the query directly, or
§ Identify only required Data Packs, minimizing decompression, and
§ Predict required data in advance based on workload
Optimizer / Granular Engine
Q: How are my sales doing this year?
[Diagram labels: Query, Results, Knowledge Grid, Compressed Data, 1%]
1. Query received
2. Engine iterates on Knowledge Grid
3. Each pass eliminates Data Packs
4. If any Data Packs are needed to resolve query, only those are decompressed
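The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Infobright's engine: `PackStats`, `build_packs` and `sum_where_greater` are hypothetical names, the per-pack metadata is limited to min, max, sum and count, and "decompression" is simulated by scanning a plain list.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 65_536  # values per data pack, as stated in the slides


@dataclass
class PackStats:
    """Hypothetical per-pack metadata (an assumption, not the real Knowledge Node layout)."""
    lo: int
    hi: int
    total: int
    count: int


def build_packs(values: List[int], pack_size: int = PACK_SIZE):
    """Split a column into packs and record min/max/sum/count for each pack."""
    packs = [values[i:i + pack_size] for i in range(0, len(values), pack_size)]
    stats = [PackStats(min(p), max(p), sum(p), len(p)) for p in packs]
    return packs, stats


def sum_where_greater(packs, stats, threshold) -> Tuple[int, int]:
    """SUM(value) WHERE value > threshold, opening as few packs as possible."""
    total = 0
    opened = 0
    for pack, s in zip(packs, stats):
        if s.hi <= threshold:
            continue              # eliminated: no row in this pack can match
        if s.lo > threshold:
            total += s.total      # every row matches: metadata answers directly
            continue
        opened += 1               # only this pack has to be "decompressed"
        total += sum(v for v in pack if v > threshold)
    return total, opened


# Toy data: a sorted column of one million integers (an assumption that
# makes pack elimination very effective).
column = list(range(1_000_000))
packs, stats = build_packs(column)
result, opened = sum_where_greater(packs, stats, 500_000)
print(f"sum={result}, opened {opened} of {len(packs)} packs")  # opens 1 of 16 packs
```

How many packs can be skipped depends heavily on how values are distributed across packs; randomly ordered data would leave far more packs to open.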
Infobright Architecture: Data Packs and Compression
[Diagram: four Data Packs, each labeled 64K]
Data Packs
§ Each data pack contains 65,536 data values
§ Compression is applied to each individual data pack
§ The compression algorithm varies depending on data type and distribution
Compression
§ Results vary depending on the distribution of data among data packs
§ A typical overall compression ratio seen in the field is 10:1
§ Some customers have seen results of 40:1 and higher
§ For example, 1TB of raw data compressed 10 to 1 would only require 100GB of disk capacity
Patent-Pending Compression Algorithms
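The disk-capacity arithmetic in the compression bullets can be checked with a short calculation. The helper name `compressed_size_gb` is hypothetical, and decimal units (1 TB = 1000 GB) are assumed.

```python
def compressed_size_gb(raw_gb: float, ratio: float) -> float:
    """Disk space needed for raw_gb of data at a ratio:1 compression ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_gb / ratio


GB_PER_TB = 1000  # decimal units, an assumption

print(compressed_size_gb(1 * GB_PER_TB, 10))  # 100.0 GB, the 10:1 example from the slide
print(compressed_size_gb(1 * GB_PER_TB, 40))  # 25.0 GB at the 40:1 some customers report
```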
What Your Data Looks Like Now
Original data: 10TB
=
Compressed data: 50 GB (avg compression ratio of 20:1)
+
Knowledge Grid: < .5 GB (< 1% of compressed data)
§ “Principle of exactness” is the default for most data analytics and access systems today
§ Using “approximate queries”, good-enough answers can be found using fewer resources
§ Works best when given the ability to alternate easily between approximation and exactness
§ Creating an interactivity that accelerates time to answers and reduces computing resources required
Alternate Query Models: When Good Enough Works
§ Standard Queries: the Knowledge Grid is used to aid performance; only the required data packs are opened, and exact results are retrieved
§ Rough Queries: only the Knowledge Grid is used to derive an answer quickly, typically for analytics like SUM, AVG, MAX
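As a rough illustration of answering from metadata alone, the sketch below keeps only a per-pack summary and never touches the underlying values at query time. It is a simplification under stated assumptions: `PackStats`, `summarize` and the `rough_*` functions are hypothetical names, the summary holds just min, max, sum and count, and the filtered SUM assumes non-negative values so that it can return a lower and an upper bound.

```python
from typing import List, NamedTuple, Sequence, Tuple


class PackStats(NamedTuple):
    """Hypothetical per-pack summary; field names are illustrative."""
    lo: float
    hi: float
    total: float
    count: int


def summarize(values: Sequence[float], pack_size: int) -> List[PackStats]:
    """Build one summary per pack (in a real system this would happen at load time)."""
    stats = []
    for i in range(0, len(values), pack_size):
        pack = values[i:i + pack_size]
        stats.append(PackStats(min(pack), max(pack), sum(pack), len(pack)))
    return stats


def rough_avg(stats: List[PackStats]) -> float:
    """AVG over the whole column, computed from the summaries only."""
    return sum(s.total for s in stats) / sum(s.count for s in stats)


def rough_max(stats: List[PackStats]) -> float:
    """MAX over the whole column, computed from the summaries only."""
    return max(s.hi for s in stats)


def rough_sum_at_least(stats: List[PackStats], threshold: float) -> Tuple[float, float]:
    """Bounds on SUM(v) WHERE v >= threshold, assuming all values are non-negative."""
    low = high = 0.0
    for s in stats:
        if s.hi < threshold:
            continue            # no row in this pack can match
        if s.lo >= threshold:
            low += s.total      # every row matches
            high += s.total
        else:
            low += s.hi         # at least the maximum row matches
            high += s.total     # at most every row matches
    return low, high


stats = summarize(list(range(1, 13)), pack_size=4)
print(rough_avg(stats), rough_max(stats))   # 6.5 12
print(rough_sum_at_least(stats, 7))         # (50.0, 68.0); the exact answer is 57
```

Unfiltered aggregates come out exact here, while the filtered SUM is only bracketed; that gap is what a standard query closes by opening the partially matching packs.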
Tools for Investigative Analysis
Today, Infobright provides:
§ Approximate Queries: Uses a combination of the Knowledge Grid and Intelligent Random Sampling to return results very quickly - applicable for any type of query
§ Exact results are not important
§ Top-N type queries
§ Investigative Analytics
Tools for Investigative Analysis
Fast and Informative:
§ Approximate Query is useful when looking for data in an exploratory fashion (e.g. anomalous events, understanding data characteristics)
§ Example: Find the “Top-10” protocols and ports extracted from event records.
§ Exact Query may take minutes, Approximate Query can answer in seconds. What’s important is the Top-10, not necessarily the exact numbers
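The Top-10 idea can be illustrated with plain uniform random sampling. This is a stand-in, not Infobright's Intelligent Random Sampling (whose details are not given in the slides); the function names, the 10% sampling fraction and the synthetic rows are all assumptions.

```python
import random
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, int]  # (protocol name, value to be summed)


def exact_top_n(rows: Iterable[Row], n: int) -> List[Tuple[str, int]]:
    """Exact SUM(value) GROUP BY name, largest n groups first."""
    totals = Counter()
    for name, value in rows:
        totals[name] += value
    return totals.most_common(n)


def approximate_top_n(rows: Iterable[Row], n: int,
                      fraction: float = 0.1, seed: int = 0) -> List[Tuple[str, float]]:
    """Estimate the same ranking from a uniform random sample of the rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    totals = Counter()
    for name, value in rows:
        if rng.random() < fraction:      # keep roughly `fraction` of the rows
            totals[name] += value
    # Scale sampled sums back up to estimate the full-table sums.
    return [(name, total / fraction) for name, total in totals.most_common(n)]


# Synthetic, heavily skewed event records (illustrative only).
rows = [("HTTP-80", 10)] * 10_000 + [("DNS", 10)] * 1_000 + [("XMPP-Client", 10)] * 100
print(exact_top_n(rows, 3))
print(approximate_top_n(rows, 3))
```

With skewed data like this the ranking is stable even though the estimated sums differ from the exact ones; groups with nearly equal totals may swap places in the approximate result.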
Use Case
EXACT QUERY
DY_HR  SUM(TDR)  AP_NAME
8      14269152  DNS
8      13716936  HTTP-80
8      13527636  HTTPS-443
8      13044432  UNDEFINED
8      11486904  NO APPL PORT
8      4280412   UNDEFINED
8      2313288   HTTP-ALT-8080
8      1278876   5223
8      1214100   DNS-53
8      991560    NO APPL PORT
8      899220    XMPP-Client

APPROXIMATE QUERY
DY_HR  SUM(TDR)  AP_NAME
8      16872663  HTTP-80
8      15361320  DNS
8      14528793  HTTPS-443
8      13578984  UNDEFINED
8      11613616  NO APPL PORT
8      3659742   UNDEFINED
8      2724149   HTTP-ALT-8080
8      1427824   5223
8      1194147   DNS-53
8      1083973   NO APPL PORT
8      967579    XMPP-Client
Example: Online Advertising Segmentation
The goal in this example is to create a targeted campaign. They have a minimum number of participants that have to be included in the target group.
Find the top n individuals who meet criteria 1
Then find the top m individuals who meet criteria 1 and criteria 2
They also have to look at how many individuals are in each permutation of the criteria.
This is repeated until they are in the range that they want to work with, and there can be up to 1500 different criteria, though they normally stop after 7 or 8 different filters
This process can take a considerable amount of time
Approximate query could dramatically reduce the amount of time it takes to determine which set of criteria they should use
They can (if desired) use exact queries to calculate the exact final numbers, instead of having to do exact queries for all the runs.
[Diagram labels: Traditional Queries, Approximate Queries]
This process can collapse an effort that takes hours into minutes or seconds
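The segmentation workflow can be sketched as a loop that tries filters against a small sample and runs a single exact count at the end. This is a simplified reading of the process, with several assumptions: filters are added greedily in a fixed order, the only stopping rule is the minimum audience size, sampling is uniform, and `count_matching`, `choose_criteria` and the example attributes are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Person = Dict[str, int]
Criterion = Callable[[Person], bool]


def count_matching(people: Sequence[Person], criteria: Sequence[Criterion]) -> int:
    """Exact count of people who satisfy every criterion."""
    return sum(1 for p in people if all(c(p) for c in criteria))


def choose_criteria(people: Sequence[Person], candidates: Sequence[Criterion],
                    min_size: int, fraction: float = 0.05,
                    seed: int = 0) -> Tuple[List[Criterion], int]:
    """Add filters while the estimated audience stays at or above min_size."""
    rng = random.Random(seed)
    sample = [p for p in people if rng.random() < fraction]
    chosen: List[Criterion] = []
    for criterion in candidates:
        # Approximate step: count on the sample, then scale up.
        estimate = count_matching(sample, chosen + [criterion]) / fraction
        if estimate < min_size:
            break                      # this filter would shrink the group too far
        chosen.append(criterion)
    # Exact step: one full count, only for the final set of filters.
    return chosen, count_matching(people, chosen)


# Synthetic audience with two illustrative attributes.
people = [{"age": i % 100, "visits": i % 10} for i in range(10_000)]
candidates = [
    lambda p: p["age"] >= 50,
    lambda p: p["visits"] >= 5,
    lambda p: p["age"] >= 90,
]
chosen, exact = choose_criteria(people, candidates, min_size=1_000)
print(len(chosen), exact)
```

Every trial run touches only the sample, so the cost of exploring many filter combinations stays small; the full data set is scanned once, for the final numbers.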
HIGH AVAILABILITY
Big Data Analytics At the End of the Day
LOW TOUCH
AFFORDABILITY TCO
AD HOC PERFORMANCE SCALABILITY
COMPRESSION
LOAD SPEEDS
The Current Disposition
u 10 bn connected devices
u 13 to 14 bn new processors embedded every year
u Estimate 31 bn connected devices by 2020
u Sensors, RFID tags, DSPs, FPGAs, CPUs, etc.
u To control, alert, log and report
u Data growth at 55% pa
IOT Data Characteristics
u Arrives in continuous streams
u Generally reliable (i.e., not in need of cleansing)
u Very high volume
u “Big tables” of predictably structured data
u So, very little need for ETL activity
u If “valuable” then processing speed is likely to be critical
IOT Apps and Database
u Mostly streaming – for alerts and BI (analysis, discovery)
u DBMS choice is a “horses for courses” thing
u If performance matters, probably not a Hadoop app
u The data structure does not favor the prominent NoSQL DBMSs
u Traditional RDBMS will not do well
u Hence column-store approach is most logical
The Coming Inversion
1. Instrument existing (dumb) devices
2. Gather and analyze data
3. Redesign device and its instrumentation from knowledge gained
4. Iterate
In terms of DATA VOLUMES
we expect the IOT DATA VOLUME
to swamp all other sources of data
Going Forward
u Do the high compression rates you achieve occur because it is machine data, i.e., it’s a function of the characteristics of the data?
u Is the “approximate query” an Infobright invention?
u How frequently do customers use this type of query and for what type of applications?
u Who, typically, are the Infobright end users?
u What “relationship” does Infobright favor with Hadoop?
u What statistical functions, if any, does Infobright offer?
u What does the product roadmap look like?
Upcoming Topics
www.insideanalysis.com
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
This Month: INNOVATORS
January: ANALYTICS
February: BIG DATA