accumulo summit 2015: using d4m for rapid prototyping of analytics for apache accumulo [frameworks]
TRANSCRIPT
D4M and Apache Accumulo
Vijay Gadepally, Lauren Edwards,
Dylan Hutchison, Jeremy Kepner
Accumulo SummitCollege Park, MD
April 29, 2015
This work is sponsored by the Assistant Secretary of Defense for Research
and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions,
interpretations, recommendations and conclusions are those of the authors
and are not necessarily endorsed by the United States Government
Accumulo Summit
VNG - 2
Giving away the punch line
• D4M is a popular open source software tool that
connects scientists with Big Data technologies
• D4M-Accumulo binding provides high performance
connectivity to Apache Accumulo for quick analytic
prototyping
• Graphulo: Implement GraphBLAS server-side
iterators and operators on Accumulo tables
Accumulo Summit
VNG - 3
Outline
• Introduction
• D4M Overview
• D4M Details
• Demonstration
• Conclusions
Accumulo Summit
VNG - 4
Common Big Data Challenge
CommandersOperators Analysts
Users
MaritimeGround SpaceC2 CyberOSINT
<html>
Data
AirHUMINTWeather
Data
Users
Gap
2000 2005 2010 2015 & Beyond
Rapidly increasing
- Data volume
- Data velocity
- Data variety
- Data veracity (security)
Accumulo Summit
VNG - 5
Common Big Data Architecture
WarfightersOperators Analysts
Users
MaritimeGround SpaceC2 CyberOSINT
<html>
Data
AirHUMINTWeather
Analytics
A
C
D E
B
Computing
Web
Files
Scheduler
Ingest &
EnrichmentIngest &
EnrichmentIngestDatabases
Accumulo Summit
VNG - 6
Common Big Data Architecture- Data Volume: Cloud Computing -
WarfightersOperators Analysts
Users
MaritimeGround SpaceC2 CyberOSINT
<html>
Data
AirHUMINTWeather
Analytics
A
C
D E
B
Computing
Web
Files
Scheduler
Ingest &
EnrichmentIngest &
EnrichmentIngestDatabases
Operators
MIT
SuperCloud
Enterprise Cloud
Big Data Cloud Database Cloud
Compute Cloud
MIT SuperCloud merges four clouds
Accumulo Summit
VNG - 7
WarfightersOperators Analysts
Users
MaritimeGround SpaceC2 CyberOSINT
<html>
Data
AirHUMINTWeather
Analytics
A
C
D E
B
Computing
Web
Files
Scheduler
Ingest &
EnrichmentIngest &
EnrichmentIngestDatabases
Lincoln benchmarking
validated Accumulo performance
Common Big Data Architecture- Data Velocity: Accumulo Database -
Accumulo Summit
VNG - 8
WarfightersOperators Analysts
Users
MaritimeGround SpaceC2 CyberOSINT
<html>
Data
AirHUMINTWeather
Analytics
A
C
D E
B
Computing
Web
Files
Scheduler
Ingest &
EnrichmentIngest &
EnrichmentIngestDatabases
D4M demonstrated a
universal approach to diverse data
columnsro
ws
Σ
raw
Common Big Data Architecture- Data Variety: D4M Schema -
intel reports, DNA, health records, publication
citations, web logs, social media, building alarms,
cyber, … all handled by a common 4 table schema
Accumulo Summit
VNG - 9
Common Big Data Architecture- Data Veracity: Security Tools-
WarfightersOperators Analysts
Users
MaritimeGround SpaceC2 CyberOSINT
<html>
Data
AirHUMINTWeather
Analytics
A
C
D E
B
Computing
Web
Files
Scheduler
Ingest &
EnrichmentIngest &
EnrichmentIngestDatabases
Using cryptography to protect
sensitive data-Verifiable Query Results-
-Computing on Masked Data-
Big Data Cloud
Masked
Query
Plaintext
Query
Encrypt
CMD
Masked
Analytic
Result
Decrypt
Plaintext
Analytic
Result
Accumulo Summit
VNG - 10
Outline
• Introduction
• D4M Overview
• D4M Details
• Demonstration
• Conclusions
Accumulo Summit
VNG - 11
High Level Language: D4Mhttp://d4m.mit.edu
Accumulo
Distributed Database
Query:Alice
Bob
Cathy
David
Earl
Associative ArraysNumerical Computing Environment
D4MDynamic
Distributed
Dimensional
Data ModelA
C
DE
B
A D4M query returns a sparse
matrix or a graph…
…for statistical signal processing
or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid
prototyping of data-intensive cloud analytics and visualization
Accumulo Summit
VNG - 12
What is D4M?
• The Dynamic Distributed Dimensional Data Model:
– Support for mathematical foundation – associative arrays
– Schema to represent most unstructured data as associative arrays
– Software tools to connect with variety of databases such as Apache Accumulo, SciDB, mySQL, PostgreSQL, …
• Software tools currently implemented in MATLAB/Octave, and Julia (v1)
• Connect to databases via JDBC (relational), SHIM (SciDB) or custom Java API (Accumulo)
Accumulo Summit
VNG - 13
• Key innovation: mathematical closure
– All associative array operations return associative arrays
• Enables composable mathematical operations
A + B A - B A & B A|B A*B
• Enables composable query operations via array indexing
A('alice bob ',:) A('alice ',:) A('al* ',:)
A('alice : bob ',:) A(1:2,:) A == 47.0
• Simple to implement in a library in programming environments
with: 1st class support of 2D arrays, operator overloading,
sparse linear algebra
Mathematical Foundation: Associative Arrays
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high performance parallel implementation
• Need a schema to convert arbitrary data to associative array
Accumulo Summit
VNG - 14
D4M Data Schema
• A structure described in a language supported by the database management system.
• Use D4M schema to represent heterogeneous data types in common data format
– Schema converts structured or unstructured raw text to a tuple representation supported by Accumulo:
• Usually use a 4 table representation
– The Edge Table, the Transpose Table, Degree Table, Raw Table
33659254179712 2013-05-20 21:21:42 20798128
kiefpief web 3b77caf94bfc81fe I am
sending love to Oklahoma. And actually -- to everyone who
may need it. You are loved. And you are not alone.
Promise. #PrayforOklahoma
33660010027264 2013-05-20 21:54:56 35.99894978 -
78.90660222 -8783842.7781526 4300476.86376416
22435220 RyanBLeslie Twitter for iPad348803787
bced47a0c99c71d0 @HaydenBigCntry RT @jiminhofe:
The devastation in Oklahoma is
…
D4M
Schema
(33659254179712, time|2013-05-20 21:21:42, 1)
(33659254179712, user|kiefpief, 1)
(33659254179712, text, Sending love to OK #PrayforOklahoma)
(33659254179712, word|Sending, 1)
(33660010027264, time|2013-05-20 21:54:56, 1)
(33660010027264, lat|-78.90660222, 1 )
(33660010027264, lon|35.99894978, 1)
(33660010027264, user|RyanBLeslie, 1)
(33660010027264, RT|@HaydenBigCntry , 1)
(33660010027264, word|Oklahoma, 1)
…
Accumulo Summit
VNG - 15
4 Table D4M Schema
row_num col1 col2 col3
001 row1col1 row1col2 word1 word2 word3
002 row2col1 row2col2 word2 word3
003 … … word1 word3
col1|row1col1 col1|row2col1 col2|row1col2 col2|row2col2 col3|word1 col3|word2 col3|word3
row_num|001 1 1 1 1 1
row_num|002 1 1 1 1
row_num|003 1 1
col1|row1col1 col1|row2col1 col2|row1col2 col2|row2col2 col3|word1 col3|word2 col3|word3
Degree 1 1 1 1 2 2 3
row_num|001 row_num|002 row_num|003
col1|row1col1 1
col1|row2col1
col2|row1col2 1 1
col2|row2col2 1
col3|word1 1 1
col3|word2 1 1
col3|word3 1 1
Tedge
TedgeDeg
TedgeT
text
row_num|001 word1 word2 word3
row_num|002 word2 word3
row_num|003 word1 word3
TedgeTxt
Accumulo Summit
VNG - 16
Outline
• Introduction
• D4M Overview
• D4M Details
• Demonstration
• Conclusions
Accumulo Summit
VNG - 17
D4M Software Library
• Associative Array representation works very well as an interface among databases.
• D4M currently implemented in languages with first class support of sparse matrices:
– MATLAB
– GNU Octave
– Julia (in progress)
• Implemented in ~2000 lines of MATLAB code
Download D4M
Source from
d4m.mit.edu
d4m_api.zip
matlab_src/
d4m_api_java.jar
libext.zip
dependency JARs
Accumulo Summit
VNG - 18
D4M: What a user sees
(row, col, val)
Matlab strings
d4m
Matlab API
d4m_api_java
Java API
Accumulo
Java APIAccumulo
Table
% D4M Associative Array APIrow = 'r1,r2,'; col = 'c1,c1,'; val = '7,3,';
A = Assoc(row,col,val,@min);
% D4M Accumulo APIDB = DBserver(’zoohost.edu:2181', 'Accumulo',
'instance', 'user', 'password');
T = DB('Table'); % Create table if doesn't exist.
put(T,A); % Put associative array in T.
Aret = T(:,:); % Scan all of T.
Accumulo Summit
VNG - 19
D4M: What a developer sees
Type Matlab/Julia File Java Class Use
Table
management
DBcreate.m D4mDbTableOperations Create table
@DBserver/ls.m D4mDbInfo List tables
@DBtable/nnz.m D4mDbTableOperationsNumber of entries in table,
summed from table's tablets
DBdelete.m D4mDbTableOperations Delete table
Write DBinsert.m D4mDbInsert Insert
Scan@DBtable/DBtable.m D4mDataSearch Create query holder
@DBtable/subsref.m D4mDataSearch Do query, possibly holding batches
@DBtable/close.m D4mDataSearch Reset query
Delete@DBtable/deleteTriple.m AccumuloDelete Delete entries
@DBtable/deleteAssoc.m AccumuloDelete Delete entries
Iterators@DBtable/ColCombiner.m D4mDbTableOperations List table iterators
@DBtable/addColCombiner.m D4mDbTableOperations Add all-scope table iterator
@DBtable/deleteColCombiner.m D4mDbTableOperations Remove iterator
Splits
@DBtable/Splits.m D4mDbTableOperationsReturn splits, number of entries in each
tablet, tablet server addresses
@DBtable/addSplits.m D4mDbTableOperations Add new table split
@DBtable/putSplits.m D4mDbTableOperations Replace table splits, merging old splits
@DBtable/mergeSplits.m D4mDbTableOperations Remove splits by merging tablets
• Source code released and available!
Accumulo Summit
VNG - 20
D4M Write
More details on Batched Insert – 500 kB by default
• putNumBytes() controls #entries to insert in one batch, on MATLAB side
• Independent batches: each creates, flushes and closes separate
BatchWriters
• Guarantee BatchWriters correctly closed
• No need to maintain BatchWriter lifecycle in MATLAB
• 30 ms maximum latency before flushing
• 50 Write threads
• 1 MB maximum memory on BatchWriter, plenty for default batch size
KeyValue
Assoc
Val
Row ID
Assoc
Row
Column Timestamp
FamilyputColumn
Family()
QualifierAssoc
Col
Visibilityput
Security()
Accumulo Summit
VNG - 21
D4M Scan Example
1. Translate Matlab queries into ranges for BatchScanner
T(:,:) %Scan all
T('r1;r5;:;r7;', :) %Scan given row ranges
T(:, 'c1;') %Use fetchColumn(), or row scan
Transpose table
T('r5;:;r9;', 'c1;:;c3;') %Complicated; break into simpler
queries
2. Hold state of Scanner iterator as state of MATLAB object
T_it = Iterator(T, 'elements', 1e5); % 100k entry batch size
A = T_it(:,:); % Initial query
while nnz(A) % While there is another batch
handleBatch(A);
A = T_it(); % Get next batch
end
Accumulo Summit
VNG - 22
Parallel Accumulo Access
Sample script writing files to Accumulo in parallel:
T = DB('Tedge','TedgeT');
myFiles = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1)));
for i = myFiles
fname = ['data/' num2str(i)]; % Create filename.
load([fname '.A.mat']); % Load file data.
put(T,num2str(A)); % Insert to Accumulo.
end
Run on 4 local processors: eval(pRUN('Script',4,{}));
• D4M + pMATLAB gives rise to high performance
Accumulo Summit
VNG - 23
Accumulo Scaling on MIT SuperCloud
• Scales linearly with ingest processes, server nodes, and data sizeS
erv
er n
od
es
Accumulo Summit
VNG - 24
115,000,000 inserts per second
• Using supercomputing techniques allows peak insert to be achieve
within seconds of launch
1M edge
Graph500
graph
43K
43B edges in
5 minutes
Accumulo Summit
VNG - 25
Outline
• Introduction
• D4M Overview
• D4M Details
• Demonstration
• Conclusions
Accumulo Summit
VNG - 26
D4M Twitter Demo
• August 24, 2014: Earthquake in Northern California
• Tweets from August 24-25
• Using D4M for:
– Exploration
– Analytics
– Visualization
Accumulo Summit
VNG - 27
Set Table Bindings
Accumulo Summit
VNG - 28
Query Tweets
Accumulo Summit
VNG - 29
Find Common Locations
Accumulo Summit
VNG - 30
Filter Tweets
Accumulo Summit
VNG - 31
Query for Full Tweets
Accumulo Summit
VNG - 32
Load Stopwords
Accumulo Summit
VNG - 33
Remove Stopwords
Accumulo Summit
VNG - 34
Find Co-Occurring Words
Accumulo Summit
VNG - 35
Remove Diagonal
Accumulo Summit
VNG - 36
See Words Most Used Together
Accumulo Summit
VNG - 37
Display on Map
Accumulo Summit
VNG - 38
Outline
• Introduction
• D4M Overview
• D4M Details
• Demonstration
• Conclusions
Accumulo Summit
VNG - 39
Summary
• D4M is a popular software tool that connects
scientists with Big Data technologies
• D4M-Accumulo binding provides high performance
connectivity to Apache Accumulo for quick analytic
prototyping
• Current research expands this connection to support
high performance graph analytics
Accumulo Summit
VNG - 40
• Graphulo: Implement GraphBLAS server-side iterators and operators on
Accumulo tables
• Use case: Queued analytics = Localized within a neighborhood
• Aim for Accumulo Contrib
• Released:
– Design Document
• Upcoming:
– Beta version of tools in
late May/early June
• Future:
– Scalability
– Schemas
– More example algorithms
G R A P H U L O
http://graphulo.mit.edu
Graphulo:
Contact Dylan Hutchison if you have any thoughts!
Accumulo Summit
VNG - 41
Acknowledgements
• Bill Arcand
• Bill Bergeron
• David Bestor
• Chansup Byun
• Matt Hubbell
• Jeremy Kepner
• Jake Bolewski
• Pete Michaleas
• Julie Mullen
• Andy Prout
• Albert Reuther
• Tony Rosa
• Charles Yee
• Dylan Hutchison
And many more …
Accumulo Summit
VNG - 42
Thank you!
• Contact:
– Vijay Gadepally ([email protected])
– Lauren Edwards ([email protected])
– Jeremy Kepner ([email protected])