an introduction to accumulo

Post on 08-Sep-2014

DESCRIPTION

This was presented for an O'Reilly Media webcast. http://www.oreilly.com/pub/e/3152?cmp=tw-na-webcast-product-webcast_an_introduction_to_apache_accumulo

This webcast will cover the basics of Apache Accumulo architecture and how it works, along with examples of how it is used. We'll also talk about some interesting use cases, such as text indexing, fine-grained multi-level access controls, and storing large-scale graphs. We'll also briefly touch on what sets Accumulo apart from other similar and not-so similar systems and where we think the Accumulo project is headed in a technical direction.

A description of Accumulo from the Apache Accumulo website:

The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and feature are outlined here.

Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, Hypertable, and Cassandra. Accumulo began its development in 2008 and joined the Apache community in 2011.

TRANSCRIPT

AN INTRODUCTION TO APACHE ACCUMULO

HOW IT WORKS, WHY IT EXISTS, AND HOW IT IS USED

Donald Miner

CTO, ClearEdge IT Solutions

@donaldpminer

August 5th, 2014

The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.

COPY AND PASTED FROM

ACCUMULO.APACHE.ORG

Adelaide Bartkowski
Alyssa Files
Beatriz Palmore
Cecilia Ours
Craig Avalos
Dianna Lapointe
Erma Davis
Fermina Smead
Garrett Harsh
Gaylene Sherry
Gilberto Pardue
Hui Nodal
Janell Tomita
Jannette Betters
Jeana Delk
Madlyn Radke
Peggie Allis
Rhona Zygmont
Tran Degarmo
Wilhelmina Papp

-inf to D: Adelaide Bartkowski, Alyssa Files, Beatriz Palmore, Cecilia Ours, Craig Avalos, Dianna Lapointe

E to H: Erma Davis, Fermina Smead, Garrett Harsh, Gaylene Sherry, Gilberto Pardue, Hui Nodal

J to +inf: Janell Tomita, Jannette Betters, Jeana Delk, Madlyn Radke, Peggie Allis, Rhona Zygmont, Tran Degarmo, Wilhelmina Papp

Accumulo Master

TabletServer TabletServer TabletServer

ZooKeeper

KEY                  VALUE
Adelaide Bartkowski  91294124
Alyssa Files         491294
Beatriz Palmore      4124124124
Cecilia Ours         419120
Craig Avalos         940124
Dianna Lapointe      4921
Erma Davis           050194
Fermina Smead        10024599949
Garrett Harsh        140095931
Gaylene Sherry       914815
Gilberto Pardue      412414124124
Hui Nodal            962195192
Janell Tomita        12121
Jannette Betters     9192012
Jeana Delk           9120150
Madlyn Radke         4921
Peggie Allis         944944
Rhona Zygmont        123103
Tran Degarmo         9499494
Wilhelmina Papp      11221

Lookup “Garrett Harsh”

FAST

Lookup “4921”

SLOW
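
The fast/slow contrast above can be illustrated with an ordinary sorted map. This is only a sketch in plain Java, not Accumulo code: the class and method names are made up for illustration, and a `TreeMap` stands in for a sorted table. Looking up by key uses the sort order; looking up by value has to visit every entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SortedLookupSketch {
    // A few key/value pairs taken from the table above, kept sorted by key.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("Dianna Lapointe", "4921");
        table.put("Garrett Harsh", "140095931");
        table.put("Madlyn Radke", "4921");
        table.put("Wilhelmina Papp", "11221");
        return table;
    }

    // Fast: the sorted structure is searched by key in O(log n).
    static String lookupByKey(TreeMap<String, String> table, String key) {
        return table.get(key);
    }

    // Slow: values are not indexed, so every entry is inspected.
    static List<String> lookupByValue(TreeMap<String, String> table, String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            if (e.getValue().equals(value)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = sampleTable();
        System.out.println(lookupByKey(table, "Garrett Harsh"));
        System.out.println(lookupByValue(table, "4921"));
    }
}
```

The index tables shown later in the deck are the usual way to turn the slow case into the fast one: the value becomes the key of a second table.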

MIT Lincoln Lab study: 100 million inserts per second using Accumulo

http://arxiv.org/ftp/arxiv/papers/1406/1406.4923.pdf
http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf

Booz Allen Hamilton study:
• 942 tablet servers, 7.56 trillion entries, 408 TB, 26 hours
• 94 MB/sec, 15 TB/hr, 80 million inserts per second
• 11 tablet servers went down with no interruption
• Showed linear scalability for write throughput
• 22,000 queries per second

Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and feature are outlined here.

Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, Hypertable, and Cassandra. Accumulo began its development in 2008 and joined the Apache community in 2011.

COPY AND PASTED FROM

ACCUMULO.APACHE.ORG

HBase vs. Accumulo
• Slight differences in visibility labels
• Coprocessors vs. Iterators
• Accumulo has faster write throughput*
• HBase’s reads are faster*
• HBase has more ecosystem integration
• BatchScanner
• Accumulo can shift around locality groups after the fact
• Accumulo has been shown to work with no problems at 1,000 nodes (BAH paper). Facebook and others run a “cell” design for HBase. Largest clusters in the hundreds*.

* We believe

Disclaimer: I am biased

VISIBILITY LABELS!

(admin & developer) | analyst

Column Visibility Syntax

Label              Description
A & B              Both ‘A’ and ‘B’ are required
A | B              Either ‘A’ or ‘B’ is required
A & (C | B)        ‘A’ and ‘C’ or ‘A’ and ‘B’ is required
A | (B & C)        ‘A’ or ‘B’ and ‘C’ is required
(A | B) & (C & D)  ?
A & (B & (C | D))  ?

Patient has schizophrenia: insurer | (MD & psych)
Patient has stomach ulcers: insurer | doctor
Patient has cavity: insurer | dentist
Patient has consent for general anesthesia: surgeon
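
To make the label syntax concrete, here is a minimal sketch of how an expression could be checked against a user's authorizations. It is a stand-alone illustration in plain Java, not Accumulo's own evaluator: the class `VisibilitySketch` and the method `canSee` are hypothetical names, labels are assumed to be alphanumeric, and spaces are stripped for readability. Mixing `&` and `|` at the same nesting level without parentheses is rejected, which to my knowledge mirrors the rule Accumulo applies to visibility expressions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr.replace(" ", ""); // assumption: spaces carry no meaning here
        this.auths = auths;
    }

    // True when the given authorizations satisfy the visibility expression.
    static boolean canSee(String expression, String... authorizations) {
        VisibilitySketch p =
            new VisibilitySketch(expression, new HashSet<>(Arrays.asList(authorizations)));
        boolean result = p.parseExpr();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return result;
    }

    // expr := term (op term)*, where every op at one nesting level is the same
    private boolean parseExpr() {
        boolean result = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char next = expr.charAt(pos++);
            if (op != 0 && op != next) {
                throw new IllegalArgumentException("mixing & and | needs parentheses");
            }
            op = next;
            boolean rhs = parseTerm(); // always parsed, so the position keeps advancing
            result = (op == '&') ? (result && rhs) : (result || rhs);
        }
        return result;
    }

    // term := label | '(' expr ')'
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean inner = parseExpr();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing )");
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a label at position " + pos);
        }
        return auths.contains(expr.substring(start, pos));
    }
}
```

For example, `VisibilitySketch.canSee("(admin & developer) | analyst", "analyst")` is true, while the same expression with only `"admin"` is false. The two rows marked `?` in the table can be tried the same way.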

ITERATORS!

More cool features
• Constraints: user-defined Java functions that allow or prevent new writes based on a condition
• Large rows: no limit on data stored in a row
• Multiple masters & FATE: able to execute table operations in a fault-tolerant manner
• MapReduce InputFormats
• Bulk import utilities: write directly to Accumulo file formats
• Batch scanner: client scans multiple ranges at once
• Batch writer: client buffers and organizes data before writing in parallel

More cool features
• Thrift proxy: access Accumulo through Ruby, Python, …
• Monitor page: shows performance, status, errors, more
• Locality groups: group column families together on disk for performance tuning (changeable later)
• On-HDFS at rest encryption (work in progress)
• Table import and export

Scalability & Performance
• Multiple HDFS volumes: Accumulo can use multiple NameNodes to store its data
• Master stores metadata in an Accumulo table
• Native in-memory map: data is first written into a buffer written in C++, outside of Java
• Relative encoding: consecutive keys with the same values are flagged instead of rewritten
• Scan pipelines: stages of the read path are parallelized into separate threads
• Caching: data recently scanned is cached

HOW IT WORKS

Data Model

KEY: ROW ID, COLUMN (FAMILY, QUALIFIER, VISIBILITY), TIMESTAMP
VALUE

Row ID  Family   Qualifier  Visibility       Timestamp  Value
derek   …        …          …                …          …
don     contact  email      admin | private  11905014   dminer@gopivotal.com
don     contact  email      admin | private  12412412   dminer@clearedgeit.com
don     contact  email      public           12412412   dm…@cl....com
don     contact  twitter    public           12423523   @donaldpminer
don     info     height     public           12314514   5’ 9”
don     info     SSN        private          12314514   123-45-6789
erica   …        …          …                …          …

Data Model

Row ID  Family   Qualifier  Visibility        Timestamp  Value
derek   …        …          …                 …          …
don     contact  email      admin | private   11905014   dminer@gopivotal.com
don     contact  email      admin | private   12412412   dminer@clearedgeit.com
don     contact  email      public            12412412   dm…@cl....com
don     contact  twitter    public | private  12423523   @donaldpminer
don     info     height     public | private  12314514   5’ 9”
don     info     SSN        private           12314514   123-45-6789
erica   …        …          …                 …          …

Name   email        twitter        picture     height  SSN
derek  de…@ad….com                 9efe23aa…   6’2”
don    dm…@cl….com  @donaldpminer              5’ 9”
erica               @erica         aef319eaf…

Data Model

KEY: ROW ID, COLUMN (FAMILY, QUALIFIER, VISIBILITY), TIMESTAMP
VALUE

• Row ID: lookup key
• Family: collection of data that is kept together
• Qualifier: what the data is
• Visibility: who can see the data
• Timestamp: when the data was created
• UNIQUENESS
• SORTED
• Value: some piece of information

Writing data into Accumulo

Row ID  Family  Qualifier  Visibility  Timestamp  Value
don     info    picture    public      13119103   dd3ae1d3b951a33f…

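import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

// conn is an existing Connector; MyPictureObj is whatever holds the picture bytes
Text rowID = new Text("don");
Text colFam = new Text("info");
Text colQual = new Text("picture");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();
Value value = new Value(MyPictureObj.getBytes());

// A Mutation groups the changes to a single row
Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);

BatchWriterConfig config = new BatchWriterConfig();
BatchWriter writer = conn.createBatchWriter("usertable", config);

writer.add(mutation);
writer.close();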

Row ID  Family   Qualifier  Visibility       Timestamp  Value
derek   …        …          …                …          …
don     contact  email      admin | private  11905014   dminer@gopivotal.com
don     contact  email      admin | private  12412412   dminer@clearedgeit.com
don     contact  email      public           12412412   dm…@cl....com
don     contact  twitter    public           12423523   @donaldpminer
don     info     height     public           12314514   5’ 9”
don     info     picture    public           13119103   dd3ae1d3b951a33f…
don     info     SSN        private          12314514   123-45-6789
erica   …        …          …                …          …

Writing data into Accumulo

1. New Record → Write Ahead Log (WAL): append
2. New Record → MemTable: sorted
3. Minor Compaction: MemTable → RFile (minc), sorted. Each further minor compaction adds another RFile (minc).
4. Major Compaction: RFile (minc), RFile (minc), RFile (minc) → RFile (majc), sorted
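
The write path above can be sketched with in-memory collections. This is a simplified illustration in plain Java and not how a tablet server is implemented: `WritePathSketch` and its members are hypothetical names, a list stands in for the write-ahead log, and sorted maps stand in for the MemTable and the RFiles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class WritePathSketch {
    // Append-only log: records are kept in arrival order.
    final List<String> writeAheadLog = new ArrayList<>();
    // Sorted in-memory map that receives every new record.
    TreeMap<String, String> memTable = new TreeMap<>();
    // Sorted files produced by compactions, oldest first.
    final List<TreeMap<String, String>> files = new ArrayList<>();

    void write(String key, String value) {
        writeAheadLog.add(key + "=" + value); // 1. append to the log
        memTable.put(key, value);             // 2. insert into the sorted MemTable
    }

    // Minor compaction: flush the MemTable into a new sorted file.
    void minorCompact() {
        if (memTable.isEmpty()) {
            return;
        }
        files.add(memTable);
        memTable = new TreeMap<>();
    }

    // Major compaction: merge all files into one sorted file.
    // Files are merged oldest first, so newer values overwrite older ones.
    void majorCompact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : files) {
            merged.putAll(file);
        }
        files.clear();
        files.add(merged);
    }
}
```

Because every input is already sorted, a real merge can stream through the files instead of loading them; `putAll` is used here only to keep the sketch short.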

Reading data

Range    Family  Visibilities
don-don  info    public

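import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

// conn is an existing Connector
Authorizations auths = new Authorizations("public");

Scanner scan = conn.createScanner("usertable", auths);

scan.setRange(new Range("don", "don"));
scan.fetchColumnFamily(new Text("info"));

for (Entry<Key,Value> entry : scan) {
  Text row = entry.getKey().getRow();
  Value value = entry.getValue();
}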

Reading data

Range    Family  Visibilities
don-don  info    public

Tablet: c - f

MemTable, RFile (minc), RFile (minc), RFile (minc), RFile (majc)

Reading data

Range    Family  Visibilities
don-don  info    public, user, tech

Reading data: Scan

Range    Visibilities
don-don  public, user, tech
d-e      public, user, tech

Iterators
• Iterators run tablet server side at these times:
  1. Scan Time
  2. Minor Compaction
  3. Major Compaction
• Multiple iterators are included with Accumulo
• Custom iterators can be created using the Iterator API

Scan Time Iterator

Minor Compaction Iterator

Major Compaction Iterator

Age-Off Iterator

Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      score             public             1005       24
bob     attribute      score             public             1004       55
bob     attribute      score             public             1003       71
bob     attribute      score             public             1002       66
bob     attribute      score             public             1001       39
bob     attribute      score             public             1000       33

Current Time: 1102

Entries < 100s old

Entries > 100s old

Scan time: server side filtering
Major compaction time: age off
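
The age-off idea can be sketched as a simple filter over timestamped entries. This is plain Java for illustration, not the Accumulo Iterator API: `AgeOffSketch`, `Entry`, and `ageOff` are hypothetical names. The slide does not say what happens to an entry that is exactly 100 seconds old, so dropping it here is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AgeOffSketch {
    static final class Entry {
        final long timestamp;
        final String value;

        Entry(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // The six versions of bob/attribute/score from the table above.
    static List<Entry> sample() {
        return Arrays.asList(
            new Entry(1005, "24"), new Entry(1004, "55"), new Entry(1003, "71"),
            new Entry(1002, "66"), new Entry(1001, "39"), new Entry(1000, "33"));
    }

    // Keep entries strictly younger than ttl; everything else is aged off.
    static List<Entry> ageOff(List<Entry> entries, long currentTime, long ttl) {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (currentTime - e.timestamp < ttl) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        for (Entry e : ageOff(sample(), 1102, 100)) {
            System.out.println(e.timestamp + " " + e.value);
        }
    }
}
```

Applied at scan time, a filter like this only hides old entries from the reader; applied during a major compaction, the old entries are left out of the new file and are gone for good.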

Combiner Iterators

Apply a function to all available versions of a particular key

Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      score             public             1005       33
bob     attribute      score             public             1004       65
bob     attribute      score             public             1003       71
bob     attribute      score             public             1002       59
bob     attribute      score             public             1001       57
bob     attribute      score             public             1000       51

MAX 71

Scan time: server side combining
Minor & Major compaction time: consolidation
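
A combiner reduces all versions of one key to a single value. The following is a minimal stand-alone sketch of that reduction in plain Java; it does not use Accumulo's Combiner classes, and `CombinerSketch.combine` is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.LongBinaryOperator;

final class CombinerSketch {
    // Fold every version of one key into a single value with the given function.
    static long combine(List<Long> versions, LongBinaryOperator fn) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("no versions to combine");
        }
        long result = versions.get(0);
        for (int i = 1; i < versions.size(); i++) {
            result = fn.applyAsLong(result, versions.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // The six score versions from the table above.
        List<Long> scores = Arrays.asList(33L, 65L, 71L, 59L, 57L, 51L);
        System.out.println("MAX " + combine(scores, Math::max)); // MAX 71
        System.out.println("SUM " + combine(scores, Long::sum)); // SUM 336
    }
}
```

Functions such as max and sum suit this pattern because combining a partial result with more versions gives the same answer as combining everything at once, so the work can be spread across scans and compactions.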

USE CASES

Basic Structured Data

Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      surname           public             Jul 2013   doe
bob     attribute      height            public             Jun 2012   5’11”
bob     insurance      dental            private            Sep 2009   MetLife
jane    attribute      bloodType         public             Jul 2011   ab-
jane    attribute      surname           public             Aug 2013   doe
jane    contact        cellPhone         public             Dec 2010   (808) 345-9876
jane    insurance      vision            private            Jan 2008   VSP
john    allergy        major             private            Feb 1988   amoxicillin
john    attribute      weight            public             Sep 2013   180
john    contact        homeAddr          public             Mar 2003   34 Baker LN

Indexing Everything

Row ID Column Fam Column Qual Visibility Time value

index Column Fam Column Qual:Row ID Visibility Time -

to Column Fam Column Qual:Row ID Visibility Time -

values Column Fam Column Qual:Row ID Visibility Time -

Event Table

Index Table

Index Table

Row ID          Column Family  Column Qualifier  Column Visibility  Timestamp  Value
(808) 345-9876  contact        cellPhone:jane    public             Dec 2010   -
180             attribute      weight:john       public             Sep 2013   -
34 Baker LN     contact        homeAddr:john     public             Mar 2003   -
5’11”           attribute      height:bob        public             Jun 2012   -
MetLife         insurance      dental:bob        private            Sep 2009   -
VSP             insurance      vision:jane       private            Jan 2008   -
ab-             attribute      bloodType:jane    public             Jul 2011   -
amoxicillin     allergy        major:john        private            Feb 1988   -
doe             attribute      surname:bob       public             Jul 2013   -
doe             attribute      surname:jane      public             Aug 2013   -
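
The mapping from an event-table entry to its index-table entry is mechanical and can be sketched as below. This is plain Java for illustration only; `IndexSketch` and `toIndexEntry` are hypothetical names, visibility and timestamp are left out for brevity (the tables above carry them over unchanged), and an empty string stands in for the `-` value.

```java
final class IndexSketch {
    static final class Entry {
        final String row;
        final String family;
        final String qualifier;
        final String value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // event: (row, family, qualifier) -> value
    // index: (value, family, qualifier:row) -> empty
    static Entry toIndexEntry(Entry event) {
        return new Entry(event.value, event.family, event.qualifier + ":" + event.row, "");
    }

    public static void main(String[] args) {
        Entry event = new Entry("john", "allergy", "major", "amoxicillin");
        Entry index = toIndexEntry(event);
        System.out.println(index.row + " " + index.family + " " + index.qualifier);
    }
}
```

Since the indexed value is now the row ID, a lookup such as “who has anything to do with amoxicillin” becomes a fast key lookup instead of a full scan.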

Data Lake

PATIENTS MEDICINES DOCTORS

INDEX

Tell me everything you know of amoxicillin

Data Lake

PATIENTS DISEASES DOCTORS

INDEX

amoxicillin
• bob:allergy:amoxicillin
• larry:takes:amoxicillin
• Stomach ulcer:treatment:amoxicillin
• smith:prescribed:amoxicillin
• Infection:treatment:amoxicillin
• Diarrhea:side effect:amoxicillin

Graphs

Nodes: a, b, c, d, e

Adjacency matrix (columns: Start Nodes, rows: End Nodes)

   a  b  c  d  e
a  -     1
b  1  -
c        -  1
d  1     1  -  1
e              -

Row ID Column Family Column Qualifier Value

a edge b 1

a edge d 1

c edge a 1

c edge d 1

d edge c 1

e edge d 1
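
With edges stored as one entry per (start node, end node), finding a node's neighbors is a lookup on a single row. The sketch below shows this with nested sorted collections in plain Java; it is not Accumulo code, and `GraphSketch` and its methods are hypothetical names. The `edge` family and the value `1` from the table are omitted.

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class GraphSketch {
    // Row ID = start node, column qualifier = end node.
    final TreeMap<String, TreeSet<String>> edges = new TreeMap<>();

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
    }

    // Out-neighbors of a node: everything stored under that one row.
    Set<String> neighbors(String node) {
        return edges.getOrDefault(node, new TreeSet<>());
    }

    // Nodes reachable in exactly two hops: one row lookup per intermediate node.
    Set<String> twoHops(String node) {
        Set<String> result = new TreeSet<>();
        for (String mid : neighbors(node)) {
            result.addAll(neighbors(mid));
        }
        return result;
    }

    // The six edges from the table above.
    static GraphSketch sample() {
        GraphSketch g = new GraphSketch();
        g.addEdge("a", "b");
        g.addEdge("a", "d");
        g.addEdge("c", "a");
        g.addEdge("c", "d");
        g.addEdge("d", "c");
        g.addEdge("e", "d");
        return g;
    }
}
```

For the sample graph, `neighbors("a")` gives `[b, d]` and `twoHops("a")` gives `[c]`. Finding incoming edges this way would need a second table keyed by end node, in the same spirit as the index table earlier.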

Term-Partitioned Index

Query: “baseball” AND “football” AND “soccer”

Tablet Server 1 (knows about the term “baseball”)

Row ID    Column Family  Value
baseball  document       docid_3
baseball  document       docid_2
bat       document       docid_2

RESULTS: [docid_2, docid_3]

Tablet Server 2 (knows about the term “football”)

Row ID    Column Family  Value
football  document       docid_1
football  document       docid_3
glove     document       docid_1

RESULTS: [docid_1, docid_3]

Tablet Server 3 (knows about the term “soccer”)

Row ID  Column Family  Value
nba     document       docid_1
shoes   document       docid_1
soccer  document       docid_3

RESULTS: [docid_3]

Client

Client-side Set Intersection

[docid_2, docid_3], [docid_1, docid_3], [docid_3]
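
The client-side step is an ordinary set intersection over the per-term results. A minimal sketch in plain Java follows; `IntersectSketch` and its helpers are hypothetical names, and fetching the per-term lists from the tablet servers is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class IntersectSketch {
    // Small helper to build a sorted set of document IDs.
    static Set<String> docs(String... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    // Keep only the documents that appear in every per-term result.
    static Set<String> intersect(List<Set<String>> perTermResults) {
        if (perTermResults.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> result = new TreeSet<>(perTermResults.get(0));
        for (Set<String> other : perTermResults.subList(1, perTermResults.size())) {
            result.retainAll(other);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> results = Arrays.asList(
            docs("docid_2", "docid_3"),  // baseball
            docs("docid_1", "docid_3"),  // football
            docs("docid_3"));            // soccer
        System.out.println(intersect(results)); // [docid_3]
    }
}
```

The cost of this design is that every per-term list travels to the client before the intersection happens, which matters when a term matches many documents.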

Geospatial Indexing: Z-Order Curve

33.333W, 55.555N = 3535.353535
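
The example above interleaves the digits of the two coordinates so that points close together on the map tend to get keys that sort close together. Here is a small sketch of that digit interleaving in plain Java. `ZOrderSketch` is a hypothetical name, the inputs are assumed to be decimal strings already padded to the same number of digits, and the hemisphere letters are ignored. Practical implementations often interleave bits rather than decimal digits.

```java
final class ZOrderSketch {
    // Interleave two equal-length digit strings: "33" and "55" become "3535".
    static String interleave(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("pad both inputs to the same length first");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < a.length(); i++) {
            out.append(a.charAt(i)).append(b.charAt(i));
        }
        return out.toString();
    }

    // Interleave integer parts and fractional parts separately, then rejoin them.
    static String zOrder(String lon, String lat) {
        String[] x = lon.split("\\.");
        String[] y = lat.split("\\.");
        if (x.length != 2 || y.length != 2) {
            throw new IllegalArgumentException("expected values like 33.333");
        }
        return interleave(x[0], y[0]) + "." + interleave(x[1], y[1]);
    }

    public static void main(String[] args) {
        System.out.println(zOrder("33.333", "55.555")); // 3535.353535
    }
}
```

Using the resulting string as a row ID lets a bounding-box query be served by a small number of range scans over the sorted keys.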

WHERE TO GO FROM HERE

Resources

Apache Accumulo website

accumulo.apache.org

Accumulo Summit 2014

accumulosummit.com

slideshare.net/AccumuloSummit

Multi-day in-person training

UMBC Training Centers

ClearEdge IT Solutions

Sqrrl

Find a job
