Adam Fuchs' Accumulo Talk at NoSQL Now! 2013
DESCRIPTION
Adam Fuchs provides an overview of Accumulo and Sqrrl Enterprise at the 2013 NoSQL Now! conference.
TRANSCRIPT
Securely explore your data
SQRRL ENTERPRISE +
APACHE ACCUMULO:
A secure, scalable, real-time analysis framework
Adam Fuchs, CTO
Sqrrl Data, Inc.
August 21, 2013
© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential
OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks
TWO HALVES OF REAL-TIME
Data-Driven Real-Time: reduce event-to-reaction time
Query-Driven Real-Time: reduce ingest-to-query latency
1. SPE queries NoSQL to enrich streaming data
2. SPE persists results in NoSQL for future query
3. SPE takes action automatically
4. SPE issues data-driven alerts
5. Sqrrl provides context for dashboards
6. Analysis tools use Sqrrl to search and manipulate historical data
Data-Driven + Query-Driven Real-Time Ecosystem
[Diagram: Data, SPE, NoSQL+, Dashboards, Actions, and Interactive Analysis Tools (Discovery + Forensics), linked by arrows numbered 1 to 6 that correspond to the steps listed above.]
This talk focuses on the database.
ACCUMULO DATA FORMAT
Accumulo Key/Value Example
An Accumulo key is a 5-tuple, consisting of:
- Row: Controls Atomicity
- Column Family: Controls Locality
- Column Qualifier: Controls Uniqueness
- Visibility Label: Controls Access
- Timestamp: Controls Versioning
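The 5-tuple above can be sketched as a small Python type to show how the key components drive sort order. This is an illustration only, not Accumulo's API: the `Key` class, the `sort_key` helper, the visibility labels `public` and `private`, and the older age value `48` are all made up for the example. Real Accumulo compares bytes; plain string comparison is used here as a simplification.

```python
from typing import NamedTuple


class Key(NamedTuple):
    row: str               # controls atomicity
    column_family: str     # controls locality
    column_qualifier: str  # controls uniqueness
    visibility: str        # controls access
    timestamp: int         # controls versioning


def sort_key(key: Key):
    # Ascending on the first four components; the timestamp is negated
    # so that the newest version of a cell sorts first.
    return (key.row, key.column_family, key.column_qualifier,
            key.visibility, -key.timestamp)


# Hypothetical entries; labels and the older "48" version are invented.
entries = {
    Key("bill", "attribute", "age", "public", 2): "49",
    Key("bill", "attribute", "age", "public", 1): "48",
    Key("george", "attribute", "age", "public", 1): "27",
    Key("bill", "purchases", "sneakers", "private", 1): "$100",
}

ordered = sorted(entries, key=sort_key)
for key in ordered:
    print(key, "->", entries[key])
```

All of bill's entries sort together, and the two versions of his age sit next to each other with the newer one first.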
ACCUMULO TABLETS
Collections of KV pairs form Tables
Tables are partitioned into Tablets
Metadata tablets hold info about other tablets, forming a 3-level hierarchy
A Tablet is a unit of work for a Tablet Server
Well-Known Location (ZooKeeper)
  Root Tablet: -∞ to ∞
    Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
    Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
Table: Adam’s Table
  Data Tablet: -∞ to thing
  Data Tablet: thing to ∞
Table: Encyclopedia
  Data Tablet: -∞ to Ocelot
  Data Tablet: Ocelot to Yak
  Data Tablet: Yak to ∞
Table: Foo
  Data Tablet: -∞ to ∞
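Because a table is partitioned into tablets by sorted row ranges, finding the tablet for a row amounts to a binary search over the split points. The sketch below uses the Encyclopedia splits from the slide. The function names are hypothetical, and it assumes each tablet's end row is inclusive and its start row is exclusive.

```python
from bisect import bisect_left


def tablet_index(split_points, row):
    # split_points is the sorted list of tablet end rows.
    # With inclusive end rows, a row equal to a split point belongs
    # to the tablet that ends at that split point.
    return bisect_left(split_points, row)


def tablet_range(split_points, row):
    # Returns (exclusive start row, inclusive end row); None means unbounded.
    i = tablet_index(split_points, row)
    start = split_points[i - 1] if i > 0 else None
    end = split_points[i] if i < len(split_points) else None
    return (start, end)


encyclopedia_splits = ["Ocelot", "Yak"]

for row in ["Aardvark", "Ocelot", "Panda", "Zebra"]:
    print(row, "->", tablet_range(encyclopedia_splits, row))
```

The same lookup applies one level up: a metadata tablet is itself a sorted range of entries describing data tablets, which is what makes the hierarchy on the slide work.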
ACCUMULO PROCESSES
[Diagram: three Applications, three Tablet Servers (each hosting Tablets), a Master, three ZooKeeper nodes, and HDFS. Edge labels: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority.]
TABLET DATA FLOW
[Diagram: inside a Tablet, Writes go to the In-Memory Map and to the Write-Ahead Log (for recovery). Minor Compaction, Merging / Major Compaction, and Scan (Reads) are each drawn with an Iterator Tree over the In-Memory Map and three sorted, indexed files.]
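The data flow on this slide can be sketched with a toy tablet in Python. This is a simplified model under stated assumptions, not Accumulo code: the `Tablet` class, its method names, and the `flush_threshold` parameter are hypothetical, keys are plain strings, and the write-ahead log is just an in-memory list.

```python
import heapq


class Tablet:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.write_ahead_log = []   # would be replayed on recovery
        self.in_memory_map = {}
        self.files = []             # sorted lists of (key, value), oldest first

    def write(self, key, value):
        self.write_ahead_log.append((key, value))
        self.in_memory_map[key] = value
        if len(self.in_memory_map) >= self.flush_threshold:
            self.minor_compact()

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file.
        if self.in_memory_map:
            self.files.append(sorted(self.in_memory_map.items()))
            self.in_memory_map = {}
            self.write_ahead_log = []  # flushed entries no longer need the log

    def major_compact(self):
        # Merge all sorted files into one.
        if self.files:
            self.files = [list(self._merge(self.files))]

    def scan(self):
        # Reads merge every file with the in-memory map.
        sources = self.files + [sorted(self.in_memory_map.items())]
        return list(self._merge(sources))

    @staticmethod
    def _merge(sources):
        # Tag entries with the negated source age so that, for equal keys,
        # the newest source sorts first and wins.
        tagged = [
            [(key, -age, value) for key, value in source]
            for age, source in enumerate(sources)
        ]
        previous = object()
        for key, _, value in heapq.merge(*tagged):
            if key != previous:
                yield key, value
                previous = key


tablet = Tablet(flush_threshold=2)
for key, value in [("b", "1"), ("a", "2"), ("b", "3")]:
    tablet.write(key, value)
print(tablet.scan())
```

The merge step is where an iterator tree would sit in the real system: the same merged, sorted stream feeds scans and compactions alike.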
WORD COUNT:
Summing Aggregating Iterator
Input Corpus
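A minimal sketch of the word count idea: ingest writes one entry with the value 1 per word occurrence, and a summing step combines entries that share a key as the sorted stream is read. The functions below are hypothetical stand-ins, not Accumulo's combiner classes, and the corpus is invented.

```python
from itertools import groupby


def ingest(corpus):
    # One (word, 1) entry per occurrence; no read is needed before writing.
    return [(word, 1) for word in corpus.lower().split()]


def summing_combiner(sorted_entries):
    # Entries arrive sorted by key, so equal keys are adjacent and can be
    # summed in a single pass.
    for word, group in groupby(sorted_entries, key=lambda entry: entry[0]):
        yield word, sum(count for _, count in group)


corpus = "the quick brown fox jumps over the lazy dog the end"
table = sorted(ingest(corpus))          # the table keeps entries sorted
counts = dict(summing_combiner(table))
print(counts["the"])
```

The design point is that the writer never reads the current count; summation is deferred to scan or compaction time.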
ITERATOR FRAMEWORK
Iterator Operations:
- File Reads
- Block Caching
- Merging
- Deletion
- Isolation
- Locality Groups
- Range Selection
- Column Selection
- Cell-level Security
- Versioning
- Filtering
- Aggregation
- Partitioned Joins
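Several of the operations listed above compose naturally as a stack, where each iterator consumes the sorted stream produced by the one below it. The sketch below models three of them (versioning, column selection, filtering) as Python generators. The function names and the sample data are hypothetical, and entries are simplified to (row, column, timestamp, value).

```python
def versioning(entries, max_versions=1):
    # Input is sorted by (row, column) ascending and timestamp descending,
    # so the newest versions of each cell come first.
    current, kept = None, 0
    for row, column, timestamp, value in entries:
        if (row, column) != current:
            current, kept = (row, column), 0
        if kept < max_versions:
            kept += 1
            yield row, column, timestamp, value


def column_selection(entries, columns):
    return (entry for entry in entries if entry[1] in columns)


def filtering(entries, predicate):
    return (entry for entry in entries if predicate(entry))


stored = sorted(
    [
        ("bill", "age", 1, "48"),
        ("bill", "age", 2, "49"),
        ("bill", "discount", 1, "40%"),
        ("george", "age", 1, "27"),
    ],
    key=lambda e: (e[0], e[1], -e[2]),
)

# Build the stack bottom-up: versioning, then column selection, then a filter.
stack = filtering(
    column_selection(versioning(stored), {"age"}),
    lambda entry: int(entry[3]) > 30,
)
result = list(stack)
print(result)
```

Because every stage is lazy, nothing is materialized until the scan pulls entries through the top of the stack.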
ACCUMULO LATENCIES
[Diagram: Ingesters send Input through a BatchWriter to Tablet Servers. Each Tablet Server holds an In-Memory Map and RFiles, with compaction iterators and scan iterators. Queriers read Output through a Scanner or BatchScanner. Latency annotations: ~ms on three stages, and ms to min on one stage.]
ACCUMULO THROUGHPUT
[Diagram: the same ingest and query pipeline as on the latency slide, annotated with throughput.]
Read-Modify-Write Latency: ~ms
>1K entries/s challenging with R-M-W
Ingest: up to 500K entries/s per node
Scan: up to 1M entries/s per node
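The gap between these numbers follows from simple arithmetic: if each read-modify-write must wait for the previous one, throughput is bounded by one over the latency. The sketch below works that out. It assumes "~ms" means exactly 1 ms, which the slide does not state, and the function names are hypothetical.

```python
def serial_ops_per_second(latency_us):
    # One client that waits latency_us microseconds per operation.
    return 1_000_000 // latency_us


def concurrent_ops_needed(target_ops_per_second, latency_us):
    # Operations that must be in flight at once to sustain the target rate
    # (ceiling division on integers).
    return -(-target_ops_per_second * latency_us // 1_000_000)


RMW_LATENCY_US = 1_000   # assumption: "~ms" taken as exactly 1 ms
INGEST_RATE = 500_000    # from the slide: up to 500K entries/s per node

print(serial_ops_per_second(RMW_LATENCY_US))
print(concurrent_ops_needed(INGEST_RATE, RMW_LATENCY_US))
```

Under that assumption a serialized read-modify-write loop tops out near 1K operations per second, and matching the quoted ingest rate would take on the order of 500 independent operations in flight at all times.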
SQRRL ENTERPRISE
Built on Apache Accumulo
Sqrrl Server
Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
• Sqrrl proprietary
• Automated indexing
• Custom iterators
• Lucene integration
• Security extensions
Accumulo RPC (Sorted Key/Value I/O)
Hadoop RPC (File I/O)
• Open source (including Sqrrl contributions)
• Open source or commercial distributions
Graph + Document I/O
Exploratory / Operational Apps
Bulk Processing Integration
DATA-CENTRIC SECURITY
Definition: Data carries with it information that is required to make policy decisions on its releasability.
[Diagram: User 1, User 2, Sqrrl/Accumulo]
SECURITY
Example Accumulo Key/Value Pairs
Accumulo is the only NoSQL database with cell-level access controls
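A rough sketch of how a cell-level check can work: each key carries a visibility expression over labels, and a user sees the cell only if their authorizations satisfy it. The evaluator below is a simplified illustration, not Accumulo's implementation. It assumes labels are alphanumeric, that `&` and `|` are the only operators, that mixing them requires parentheses, and that an empty expression is visible to everyone. The labels used are invented.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    tokens, position = [], 0
    expression = expression.strip()
    while position < len(expression):
        match = TOKEN.match(expression, position)
        if not match:
            raise ValueError(f"bad character at offset {position}")
        tokens.append(match.group(1))
        position = match.end()
    return tokens


def is_visible(expression, authorizations):
    if not expression.strip():
        return True  # assumption: an empty label is visible to everyone
    tokens = tokenize(expression)
    position = 0

    def parse_term():
        nonlocal position
        if position >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[position]
        position += 1
        if token == "(":
            value = parse_expression()
            if position >= len(tokens) or tokens[position] != ")":
                raise ValueError("missing closing parenthesis")
            position += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    def parse_expression():
        nonlocal position
        value = parse_term()
        operator = None
        while position < len(tokens) and tokens[position] in ("&", "|"):
            if operator is None:
                operator = tokens[position]
            elif tokens[position] != operator:
                raise ValueError("mixing & and | requires parentheses")
            position += 1
            right = parse_term()
            value = (value and right) if operator == "&" else (value or right)
        return value

    result = parse_expression()
    if position != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(is_visible("admin&(finance|audit)", {"admin", "audit"}))
```

Because the expression travels with the data, the same check applies no matter which application or query path reaches the cell.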
DATA-CENTRIC SECURITY ECOSYSTEM
[Diagram components: Data Labeler, Sqrrl Enterprise, Apps, End Users, User Attributes, Audits, Policies, Auth. Service, Policy Engine, Key Mgmt.]
HIERARCHICAL DECOMPOSITION
Row: <person>
Column Family: attribute, purchases, returns
Column Qualifier → Value: age → <age>, discount → <rate>, sneakers → <cost>, hat → <cost>
MATERIALIZED TABLE
[Table: row george has attribute / age = 27 and purchases / sneakers = $83. Row bill has age = 49, discount = 40%, and sneakers = $100 under the attribute and purchases column families. Each Column Family / Column Qualifier / Value triple under a row is one key/value pair.]
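The hierarchical decomposition can be sketched as a pair of functions: one flattens a nested record into sorted key/value pairs, the other rebuilds the materialized view from them. The helper names are hypothetical, and only the values that are unambiguous in the slide (age and sneakers) are used.

```python
def decompose(documents):
    # Each level of nesting becomes one key component:
    # row -> column family -> column qualifier -> value.
    for row, families in documents.items():
        for family, qualifiers in families.items():
            for qualifier, value in qualifiers.items():
                yield (row, family, qualifier, value)


def materialize(entries):
    # Rebuild the nested view from flat key/value pairs.
    table = {}
    for row, family, qualifier, value in entries:
        table.setdefault(row, {}).setdefault(family, {})[qualifier] = value
    return table


people = {
    "george": {"attribute": {"age": "27"}, "purchases": {"sneakers": "$83"}},
    "bill": {"attribute": {"age": "49"}, "purchases": {"sneakers": "$100"}},
}

entries = sorted(decompose(people))   # sorted order groups each row together
for entry in entries:
    print(entry)
```

Sorting keeps all of a person's entries adjacent, so reading one person back is a single contiguous range scan.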
FORWARD AND INVERTED INDEX
Table: Row / Column Family / Column Qualifier / Value
Forward Index: <UUID> / <Type> / <Field> / <Term>
Inverted Index: <Term> / <UUID> / <Type+Field> / <Digest of Event>
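A sketch of how one event could be written to both tables and then found by term. The component order follows the transcript as listed, and the exact column placement in the inverted index is an assumption, since the slide layout is not fully recoverable. The function names, the `:` separator for type and field, the sample events, and the digest strings are all hypothetical.

```python
from bisect import bisect_left


def index_event(uuid, event_type, fields, digest):
    # Forward entries are keyed by UUID, inverted entries by term.
    forward, inverted = [], []
    for field, term in fields.items():
        forward.append((uuid, event_type, field, term))
        inverted.append((term, uuid, f"{event_type}:{field}", digest))
    return forward, inverted


def lookup(sorted_inverted, term):
    # Range scan over the sorted inverted index: seek to the first entry
    # whose row is the term, then read while the row still matches.
    i = bisect_left(sorted_inverted, (term,))
    uuids = []
    while i < len(sorted_inverted) and sorted_inverted[i][0] == term:
        uuids.append(sorted_inverted[i][1])
        i += 1
    return uuids


forward_index, inverted_index = [], []
events = [
    ("e1", "login", {"user": "bill", "host": "web01"}, "bill logged in"),
    ("e2", "purchase", {"user": "bill", "item": "sneakers"}, "bill bought sneakers"),
]
for uuid, event_type, fields, digest in events:
    forward, inverted = index_event(uuid, event_type, fields, digest)
    forward_index.extend(forward)
    inverted_index.extend(inverted)
forward_index.sort()
inverted_index.sort()

print(lookup(inverted_index, "bill"))
```

Storing a digest in the inverted index value lets a search return a summary without a second lookup in the forward index.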
CUSTOM INDEXING
Table: Geo Index
Row: <GeoHash>
Column Family: <Event Type>
Column Qualifier: <UUID>
Value: <Digest of Event>
Latitude: 10110101001
Longitude: 00111010010
Depth: 11010110110
GeoHash (bits interleaved): 101001110111010101011100001011100
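The bit strings on this slide are consistent with a simple interleave: one bit of latitude, one of longitude, one of depth, repeated. The sketch below reproduces the slide's value with hypothetical helper names. How the coordinates are quantized into bits in the first place is not shown in the deck and is left out here.

```python
def interleave(*dimensions):
    # Take one bit from each dimension in turn, most significant bit first.
    if len({len(bits) for bits in dimensions}) != 1:
        raise ValueError("all dimensions must have the same number of bits")
    return "".join("".join(group) for group in zip(*dimensions))


def deinterleave(geohash, dimension_count):
    # Inverse operation: every n-th bit belongs to the same dimension.
    return [geohash[i::dimension_count] for i in range(dimension_count)]


latitude = "10110101001"
longitude = "00111010010"
depth = "11010110110"

geohash = interleave(latitude, longitude, depth)
print(geohash)
```

Since the leading bits of the row come from the leading bits of every dimension, points that are close in all three dimensions tend to share a row prefix and therefore land near each other in the sorted table.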
D4M 2.0 SCHEMA FOR TWITTER DATA
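Table: Tedge
Row: <UUID>
Column Family: “stat” | Column Qualifier: <stat> | Value: “1”
Column Family: “time” | Column Qualifier: <time> | Value: “1”
Column Family: “user” | Column Qualifier: <user> | Value: “1”
Column Family: “word” | Column Qualifier: <word> | Value: “1”
Table: TedgeT
Row: <value>
Column Family: “stat” | Column Qualifier: <UUID> | Value: “1”
Column Family: “time” | Column Qualifier: <UUID> | Value: “1”
Column Family: “user” | Column Qualifier: <UUID> | Value: “1”
Column Family: “word” | Column Qualifier: <UUID> | Value: “1”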
D4M 2.0 SCHEMA FOR TWITTER DATA
Table: TedgeDegT
Row: <value>
Column Family: “stat” | Column Qualifier: “degree” | Value: <count>
Column Family: “time” | Column Qualifier: “degree” | Value: <count>
Column Family: “user” | Column Qualifier: “degree” | Value: <count>
Column Family: “word” | Column Qualifier: “degree” | Value: <count>
Table: Ttext
Row: <UUID>
Column Family: “text” | Column Qualifier: - | Value: <text>
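A sketch of how one tweet might be exploded into the four tables as laid out on these slides. It follows the row and column placement shown in the transcript, but the function names, the sample tweets, and the field values are hypothetical, and the degree counts are computed in one pass here rather than accumulated incrementally.

```python
from collections import Counter


def explode(uuid, record, text):
    # Tedge: row = tweet UUID, one column per (field, value), value "1".
    tedge = [
        (uuid, field, value, "1")
        for field, values in record.items()
        for value in values
    ]
    # TedgeT: the transpose, row = value, column qualifier = UUID.
    tedge_t = [(value, field, uuid, "1") for _, field, value, _ in tedge]
    # Ttext: the raw text stored once per tweet, with an empty qualifier.
    ttext = [(uuid, "text", "", text)]
    return tedge, tedge_t, ttext


def degrees(tedge_t_entries):
    # TedgeDegT: how many tweets each (value, field) pair appears in.
    counts = Counter((value, field) for value, field, _, _ in tedge_t_entries)
    return [
        (value, field, "degree", str(count))
        for (value, field), count in sorted(counts.items())
    ]


tweets = [
    ("t1", {"stat": ["200"], "time": ["2013-08-21"], "user": ["alice"],
            "word": ["hello", "world"]}, "hello world"),
    ("t2", {"stat": ["200"], "time": ["2013-08-21"], "user": ["bob"],
            "word": ["hello"]}, "hello"),
]

tedge, tedge_t, ttext = [], [], []
for uuid, record, text in tweets:
    edges, transposed, raw = explode(uuid, record, text)
    tedge += edges
    tedge_t += transposed
    ttext += raw
tedge_deg_t = degrees(tedge_t)

print(tedge_deg_t)
```

The degree table lets a query planner check how common a value is before scanning TedgeT for it.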
D4M 2.0 SCHEMA FOR TWITTER DATA
Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., HPEC 2013
ACCUMULO WITH D4M 2.0 SCHEMA PERFORMANCE
Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., HPEC 2013
Maximizing throughput on an 8-node, 192-core cluster:
ACCUMULO SCALABILITY: GRAPH500 BENCHMARK
source: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
ATOMIC INCREMENT PERFORMANCE COMPARISON
Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)
QUESTIONS?
Adam Fuchs, CTO
Sqrrl Data, Inc.