Sqrrl October Webinar: Data Modeling and Indexing
DESCRIPTION
This webinar provides a technical deep dive into the NoSQL database Apache Accumulo. Sqrrl extends Accumulo with additional security, analytical, and data modeling tools. Topics include data modeling techniques, secondary indices, and JSON and graph capabilities for Accumulo.
TRANSCRIPT
Securely explore your data
DATA MODELING AND INDEXING FOR APACHE ACCUMULO
Sqrrl Webinar Series October, 2013 Adam Fuchs, CTO Sqrrl Data, Inc.
RECAP
In our September Webinar: Sqrrl, Apache Accumulo, and Cell-Level Security
1. Introduction to Sqrrl and Accumulo
2. Security In The Wild
3. Sqrrl and Accumulo Technology
4. The Data-Centric Security Ecosystem
Sqrrl Data, Inc. Confidential and Proprietary
TODAY’S DISCUSSION
1. Sqrrl and Accumulo Technology Review
2. Table Designs
   1. Dynamic Documents
   2. Graphs
   3. Inverted Indexes
3. Putting It All Together with Sqrrl
Data Modeling and Indexing for Apache Accumulo
LAYERED ARCHITECTURE
Turtles all the way down...
Application
Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
Sqrrl Enterprise
Accumulo RPC (Sorted Key/Value I/O)
Hadoop RPC (File I/O)
ACCUMULO DATA FORMAT
An Accumulo key is a 5-tuple, consisting of:
• Row: Controls Atomicity
• Column Family: Controls Locality
• Column Qualifier: Controls Uniqueness
• Visibility Label: Controls Access
• Timestamp: Controls Versioning

Accumulo Key/Value Example
Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
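To make the 5-tuple concrete, here is a minimal Python sketch of the example rows. It is a toy model, not the Accumulo client API: the `Key` tuple and the `sort_order`, `is_visible`, and `scan` helpers are hypothetical names, and the visibility check handles only flat labels such as `JD|PCP_JD` (no parentheses, no mixing of `|` and `&`). It shows entries coming back in sorted key order, filtered by the caller's authorizations.

```python
from collections import namedtuple

# Hypothetical stand-in for an Accumulo key; not the real client class.
Key = namedtuple("Key", ["row", "family", "qualifier", "visibility", "timestamp"])

def sort_order(key):
    """Ascending on the first four elements, newest timestamp first."""
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)

def is_visible(expression, authorizations):
    """Evaluate a flat visibility label such as 'JD|PCP_JD' or 'A&B'.

    Simplification for this sketch: no parentheses, no mixing of | and &.
    """
    if "|" in expression:
        return any(label in authorizations for label in expression.split("|"))
    return all(label in authorizations for label in expression.split("&"))

def scan(table, authorizations):
    """Return the (key, value) pairs the caller may see, in sorted key order."""
    visible = [(k, v) for k, v in table if is_visible(k.visibility, authorizations)]
    return sorted(visible, key=lambda kv: sort_order(kv[0]))

# The four entries from the slide, deliberately inserted out of order.
table = [
    (Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513), "1010110110100..."),
    (Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801), "Pass"),
    (Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912), "Patient suffers from an acute ..."),
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
]

# A primary care physician holding only the PCP_JD authorization.
for key, value in scan(table, {"PCP_JD"}):
    print(key.family, key.qualifier, value)
```

With only `PCP_JD`, the scan returns the Notes entry and the Cholesterol result; the Mental Health and X-Ray cells are filtered out because their labels require `JD`, `PSYCH_JD`, or `PHYS_JD`.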
THE ACCUMULO CLIENT API
Instance: new ZooKeeperInstance(...) or new MockInstance()
  getConnector(...) -> Connector
    TableOperations
    InstanceOperations
    SecurityOperations
    createScanner(...) -> Scanner; createBatchScanner(...) -> BatchScanner
      Range
      IteratorOption
      iterator() -> Map.Entry (Key, Value)
    createBatchWriter(...) -> BatchWriter
      addMutation(...) <- Mutation
ACCUMULO TECHNOLOGY
Strengths
• Shared-Nothing => Scalability
• Micro-Batching for Efficient Random I/O
• High Concurrency, Low Latency for Denormalized Data
• Sparse, Flexible Schema supports dynamic and diverse data models
• Cell-level Security promotes sharing
Weaknesses
• Sorting induces write multiplication factor
• Sparse schema support induces additional storage overhead

[Figure: Tablet Data Flow. Labels: Writes, In-Memory Map, Write Ahead Log (For Recovery), Minor Compaction, Sorted, Indexed Files, Merging / Major Compaction, Reads, Scan, Iterator Tree]
[Figure: cluster components. Labels: Applications, ZooKeeper, Master, Tablet Servers (each hosting Tablets), HDFS. Arrows: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority]
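The tablet data flow can be sketched as a toy log-structured store. This is an illustrative Python model only, not Accumulo code: `ToyTablet`, its method names, and the `flush_threshold` parameter are all hypothetical. It shows writes landing in a write-ahead log and an in-memory map, a minor compaction flushing that map to a sorted file, a major compaction merging files, and a scan merging every source into one sorted view.

```python
import heapq

class ToyTablet:
    """Hypothetical in-memory model of a tablet; not Accumulo code."""

    def __init__(self, flush_threshold=2):
        self.memory_map = {}        # recent writes (sorted on read or flush in this toy)
        self.write_ahead_log = []   # append-only record kept for recovery
        self.sorted_files = []      # each "file" is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.write_ahead_log.append((key, value))
        self.memory_map[key] = value
        if len(self.memory_map) >= self.flush_threshold:
            self.minor_compact()

    def minor_compact(self):
        """Flush the in-memory map to a new sorted file."""
        if self.memory_map:
            self.sorted_files.append(sorted(self.memory_map.items()))
            self.memory_map = {}
            self.write_ahead_log = []  # flushed data no longer needs the log

    def major_compact(self):
        """Merge all sorted files into one, rewriting every entry."""
        merged = list(self._merge(self.sorted_files))
        self.sorted_files = [merged] if merged else []

    def scan(self):
        """Merged, sorted view over the files plus the in-memory map."""
        sources = self.sorted_files + [sorted(self.memory_map.items())]
        return list(self._merge(sources))

    @staticmethod
    def _merge(sources):
        # Tag entries with the negated source age so that, for duplicate keys,
        # the entry from the newest source sorts first and wins.
        tagged = [
            [(key, -age, value) for key, value in source]
            for age, source in enumerate(sources)
        ]
        last_key = object()
        for key, _, value in heapq.merge(*tagged):
            if key != last_key:
                yield key, value
                last_key = key

tablet = ToyTablet(flush_threshold=2)
tablet.write("b", 1)
tablet.write("a", 2)   # second write triggers a minor compaction
tablet.write("a", 3)   # newer value for "a" stays in memory
print(tablet.scan())
```

Every entry is rewritten each time `major_compact` runs, which is one way to picture the write multiplication factor listed under Weaknesses.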
PROXY/NETFLOW EXAMPLE
Source    Destination   Port   Bytes In     Bytes Out    Protocol
10.1.2.3  google.com    80     73,824       15,632       http
10.1.2.4  facebook.com  443    10,328       13,284,129   https
10.1.2.4  google.com    80     623,249      93,125       http
10.1.2.3  abcd1234.ru   31337  158          523,698,104  unknown
10.1.2.3  netflix.com   443    434,855,357  1,392,994    https
10.1.2.4  google.com    443    23,084       583,331      https
10.1.2.3  10.1.2.5      22     204          158          ssh
INDEXES AND QFDS
[Figure: Input -> Logs/Observations -> Transformation -> Indexes and Question-Focused Datasets]
• Immutable
• Append-Only
• Real-Time

• Online
• Sorted
• Grouped
• Aggregated
QFD KEY GENERATION
Source    Destination  Port  Bytes In  Bytes Out  Protocol
10.1.2.3  google.com   80    73,824    15,632     http

Key                       ->  Value
10.1.2.3, Bytes In        ->  +73,824
10.1.2.3, Bytes Out       ->  +15,632
10.1.2.3, Ports Used      ->  +{80}
10.1.2.3, Protocols Used  ->  +{http}

Hosts QFD (key space 0x00 ... 0xFF)
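The key generation step can be sketched as a small Python function. The function name, the dictionary field names, and the tuple-shaped keys are assumptions made for illustration; the slide only shows the resulting key/value pairs. Numeric contributions are meant to be added to a running total, and set contributions unioned into a running set.

```python
def hosts_qfd_contributions(record):
    """Turn one flow record into (key, contribution) pairs for the source host."""
    source = record["source"]
    return [
        ((source, "Bytes In"), record["bytes_in"]),
        ((source, "Bytes Out"), record["bytes_out"]),
        ((source, "Ports Used"), {record["port"]}),
        ((source, "Protocols Used"), {record["protocol"]}),
    ]

# The single record shown on the slide.
record = {
    "source": "10.1.2.3",
    "destination": "google.com",
    "port": 80,
    "bytes_in": 73824,
    "bytes_out": 15632,
    "protocol": "http",
}

contributions = hosts_qfd_contributions(record)
for key, contribution in contributions:
    print(key, "->", contribution)
```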
HOSTS QFD WITH AGGREGATION
IP        Ports Used            Protos Used                  Total Bytes In  Total Bytes Out  Ports Hosted  Protos Hosted
10.1.2.3  {22, 80, 443, 31337}  {http, https, ssh, unknown}  434,931,543     525,106,888      -             -
10.1.2.4  {80, 443}             {http, https}                656,661         13,960,585       -             -
10.1.2.5  -                     -                            158             204              {22}          {ssh}

New Contribution: (10.1.2.5, Total Bytes In -> +3,215)
158 + 3,215 = 3,373
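The aggregation step amounts to folding each new contribution into the stored value for its key. The Python sketch below is illustrative only: `combine` and `apply_contributions` are hypothetical names, and a plain dictionary stands in for the table. In Accumulo itself this kind of merge is typically delegated to a combiner iterator rather than done in client code.

```python
def combine(current, contribution):
    """Fold one contribution into the stored value: add numbers, union sets."""
    if current is None:
        return contribution
    if isinstance(current, set):
        return current | contribution
    return current + contribution

def apply_contributions(qfd, contributions):
    """Apply (key, contribution) pairs to a dict standing in for the QFD table."""
    for key, contribution in contributions:
        qfd[key] = combine(qfd.get(key), contribution)
    return qfd

# The new contribution from the slide: 158 + 3,215 = 3,373.
qfd = {("10.1.2.5", "Total Bytes In"): 158}
apply_contributions(qfd, [(("10.1.2.5", "Total Bytes In"), 3215)])
print(qfd[("10.1.2.5", "Total Bytes In")])  # 3373

# Rebuilding the 10.1.2.4 row from its three flow records.
host = "10.1.2.4"
flows = [(443, 10328, 13284129, "https"), (80, 623249, 93125, "http"), (443, 23084, 583331, "https")]
row = {}
for port, bytes_in, bytes_out, protocol in flows:
    apply_contributions(row, [
        ((host, "Total Bytes In"), bytes_in),
        ((host, "Total Bytes Out"), bytes_out),
        ((host, "Ports Used"), {port}),
        ((host, "Protos Used"), {protocol}),
    ])
```

Because addition and set union are associative and commutative, contributions can be applied in any order and in partial batches, which is what makes this style of aggregation safe to run incrementally.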
CONNECTIVITY GRAPH
[Figure: connectivity graph over the nodes 10.1.2.3, 10.1.2.4, 10.1.2.5, facebook.com, google.com, abcd1234.ru, and netflix.com]
Row       Col. Fam.  Col. Qual.    Val.
10.1.2.3  Contacts   10.1.2.5      -
10.1.2.3  Contacts   abcd1234.ru   -
10.1.2.3  Contacts   google.com    -
10.1.2.3  Contacts   netflix.com   -
10.1.2.4  Contacts   facebook.com  -
10.1.2.4  Contacts   google.com    -

Row           Col. Fam.  Col. Qual.  Val.
10.1.2.5      Serves     10.1.2.3    -
abcd1234.ru   Serves     10.1.2.3    -
facebook.com  Serves     10.1.2.4    -
google.com    Serves     10.1.2.3    -
google.com    Serves     10.1.2.4    -
netflix.com   Serves     10.1.2.3    -
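The two edge tables above can be derived from the flow records by writing each edge twice, once per direction. The Python sketch below is a toy illustration: `edge_rows` and `neighbors` are hypothetical helpers, and a sorted list of tuples stands in for the sorted table. Storing both directions means either endpoint's neighbors sit in one contiguous row range.

```python
def edge_rows(flows):
    """Emit a forward ('Contacts') and a reverse ('Serves') row per distinct flow."""
    rows = set()
    for source, destination in flows:
        rows.add((source, "Contacts", destination))
        rows.add((destination, "Serves", source))
    return sorted(rows)  # sorted by row, then column family, then qualifier

def neighbors(rows, node, family):
    """Read one node's neighbors from its row and column family."""
    return [qualifier for row, fam, qualifier in rows if row == node and fam == family]

# (source, destination) pairs taken from the proxy/netflow example.
flows = [
    ("10.1.2.3", "google.com"),
    ("10.1.2.4", "facebook.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "abcd1234.ru"),
    ("10.1.2.3", "netflix.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "10.1.2.5"),
]

rows = edge_rows(flows)
print(neighbors(rows, "10.1.2.3", "Contacts"))
print(neighbors(rows, "google.com", "Serves"))
```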
INVERTED INDEXING
Table:             Forward Index  Inverted Index
Row:               <UUID>         <Field>
Column Family:     <Type>         <Term>
Column Qualifier:  <Field>        <UUID>
Value:             <Term>         <Digest of Event>
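A small Python sketch of the forward and inverted index pair follows. It rests on several assumptions: the key layout follows the order in which the slide lists the elements (forward: UUID, type, field, term; inverted: field, term, UUID, digest), the `index_event` and `lookup` helpers are hypothetical, and the event type doubles as a placeholder digest. Lists of tuples stand in for the two tables.

```python
def index_event(uuid, event_type, fields, digest=None):
    """Return (forward_rows, inverted_rows) for one event.

    Forward:  row=UUID,  family=type, qualifier=field, value=term
    Inverted: row=field, family=term, qualifier=UUID,  value=digest
    """
    if digest is None:
        digest = event_type  # placeholder; a real digest might summarize the event
    items = sorted(fields.items())
    forward = [(uuid, event_type, field, term) for field, term in items]
    inverted = [(field, term, uuid, digest) for field, term in items]
    return forward, inverted

def lookup(inverted_rows, field, term):
    """Find the UUIDs of events whose field equals term."""
    return [uuid for f, t, uuid, _ in inverted_rows if f == field and t == term]

# Hypothetical events with made-up UUIDs.
events = {
    "e1": {"protocol": "http", "destination": "google.com"},
    "e2": {"protocol": "https", "destination": "google.com"},
}

forward_index, inverted_index = [], []
for uuid, fields in events.items():
    forward, inverted = index_event(uuid, "flow", fields)
    forward_index.extend(forward)
    inverted_index.extend(inverted)
inverted_index.sort()  # mimic the table's sorted key order

print(lookup(inverted_index, "destination", "google.com"))
print(lookup(inverted_index, "protocol", "https"))
```

The forward index answers "what does event X contain?" and the inverted index answers "which events contain this term?", with the digest letting a query return a summary without a second lookup.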
ADVANCED INDEXING
Table: Shard Table
Row: <Partition ID>

Column Family   Column Qualifier (Tuples)  Value
"Docs"          <UUID>, <Field>            <Value>
"Inv. Index"    <Term>, <UUID>
"Field Index"   <Field:Term>, <UUID>
"Geo"           <Hash>, <UUID>
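The shard table idea, in which a document and all of its index entries share one partition row, can be sketched in Python as follows. Much of this is assumption: the qualifier tuples are a best reading of the flattened slide, the partition count and the CRC32 hash are arbitrary choices, the whitespace tokenizer is a placeholder, the "Geo" family is left out, and all function names are hypothetical.

```python
import zlib

NUM_PARTITIONS = 4  # assumed; the slide does not give a partition count

def partition_id(uuid):
    """Hash a document UUID to a partition (row) ID."""
    return zlib.crc32(uuid.encode("utf-8")) % NUM_PARTITIONS

def shard_entries(uuid, fields):
    """Emit (row, family, qualifier, value) entries that all share one row."""
    row = partition_id(uuid)
    entries = []
    for field, value in fields.items():
        entries.append((row, "Docs", (uuid, field), value))
        entries.append((row, "Field Index", (f"{field}:{value}", uuid), ""))
        for term in str(value).lower().split():
            entries.append((row, "Inv. Index", (term, uuid), ""))
    return entries

def search(table, term):
    """Per partition: match the term in the inverted index, then read the docs."""
    hits = {}
    for row, family, qualifier, _ in table:
        if family == "Inv. Index" and qualifier[0] == term:
            hits.setdefault(row, set()).add(qualifier[1])
    results = {}
    for row, family, qualifier, value in table:
        if family == "Docs" and qualifier[0] in hits.get(row, ()):
            results.setdefault(qualifier[0], {})[qualifier[1]] = value
    return results

# Hypothetical documents with made-up UUIDs.
table = (
    shard_entries("doc-1", {"protocol": "http", "destination": "google.com"})
    + shard_entries("doc-2", {"protocol": "ssh"})
)
print(search(table, "http"))
```

Because index entries and documents share a row, each partition can resolve a query locally, and the partitions can be searched in parallel.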
SQRRL ENTERPRISE
Simple API for Advanced Accumulo Usage
• Dynamic Documents
  • JSON I/O support
  • Cell-level Security and Efficient Aggregation Extensions
• Dynamic Graphs
  • Co-partitioned with Documents for Integrated Search and Discovery
• Search
  • Lucene Query Syntax
  • Accumulo Indexes Preserve Security Model
• Processing
  • SQL-Like Language for Transforming and Aggregating Results
  • Parallel Slicing and Extraction
REAL-TIME OPERATIONAL APPS
Contact us for a demo
HOW TO LEARN MORE
Download our White Paper: www.sqrrl.com/whitepaper
Watch a video: www.sqrrl.com/downloads#videos
Request a demo or one-on-one workshop: www.sqrrl.com/contact
Come meet us:
• Accumulo Meetup (October 28, New York)
• Strata + Hadoop World (October 28-30, New York)
• IBM IOD (November 4-7, Las Vegas)
• SC13 (November 18-21, Denver)
THANK YOU
Thanks for attending!
To keep up to date with Sqrrl, check out our social media sites:
www.twitter.com/sqrrl_inc
www.linkedin.com/company/sqrrl