introduction to apache accumulo

Post on 27-Jan-2015

138 Views

Category:

Technology

6 Downloads

Preview:

Click to see full reader

DESCRIPTION

Presented at the Boulder/Denver BigData Meetup on March 21, 2012

TRANSCRIPT

Introduction to Apache Accumulo

Boulder/Denver BigData Meetup - March 21,2012Jared Winick@jaredwinick

Accumulo /əˈkjuːmjʊˌloʊ/

1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing

http://yourmotivational.com/uploads/8604.jpg

Annotation AddedJeff Dean: Designs, Lessons and Advice from Building Large Distributed Systemshttp://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

Enables interactive access to…

Trillions of recordspetabytes of indexed data

across 100s-1000s of servers

Short Accumulo History Lesson

http://www.flickr.com/photos/mr_t_in_dc/4249886990/sizes/l/in/photostream/

2006

2008

http://upload.wikimedia.org/wikipedia/commons/8/84/National_Security_Agency_headquarters%2C_Fort_Meade%2C_Maryland.jpg

2011

2012

Uses of BigTable and Kin

(BigTable)

•Google Analytics1

•Crawl1

•AppEngine Datastore2

•Many more1

(HBase)

•Messages3,4,6

•Insights5,6

(Accumulo)

•???

(Cassandra)

•Rainbird (realtime analytics)7

1.) http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf2.) http://code.google.com/appengine/articles/storage_breakdown.html3.) http://www.facebook.com/note.php?note_id=4549916089194.) http://mvdirona.com/jrh/TalksAndPapers/KannanMuthukkaruppan_StorageInfraBehindMessages.pdf5.) http://www.facebook.com/note.php?note_id=101501039002589206.) http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf7.) http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011

Accumulo /əˈkjuːmjʊˌloʊ/

1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing

Multi-dimension Key

Key

Row ID

Column

Timestamp

Family Qualifier Visibility

Value

http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html

Keys Sorted Lexicographically

Row ID, Column Family, Column Qualifier, Column Visibility, Timestamp

Everything is a byte[] except the Timestamp which is a long

Physical Layout

Row ID Col Fam Col Qual Col Vis Time Value

Alice properties age public March 2011 31

Alice properties phone private Feb 2011 555-1234

Alice purchases Xbox public Feb 2011 $299

Bob properties phone private March 2011 555-4321

Bob purchases iPhone Public Feb 2011 $399

Key Value

Queries

•By exact Key or range of Keys•Data is always returned in sorted order

Query Requirements Drive Data Model Design

http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html

Accumulo

Hadoop HDFS Zookeeper

Storage Configuration/State

Hadoop MapReduce

Analytics

Clients

Read/Write

Accumulo

Hadoop HDFS

Tablet Server

Tablet Server

Tablet Server

Master

Data Node

Data Node

Data Node

Name Node

. . .

. . .

TableTablets

… … … … … …

Accumulo

Hadoop HDFS

Tablet Server

Tablet Server

Tablet Server

Master

Data Node

Data Node

Data Node

Name Node

. . .

. . .

TableTablets

Tablet Server Failure

1.) Detect Failure

2.) Reassign

Accumulo

Hadoop HDFS

Tablet Server

Writes

Client

Write-Ahead

Log (WAL)

1MemTable2

Data Node

Data Node

Data Node. . .

Tablet

Accumulo

Hadoop HDFS

Tablet Server

Writes

Client

Write-Ahead

Log (WAL)

1MemTable2

Data Node

Data Node

Data Node. . .

File 1

3

Tablet

Compactions

Minor Major

The process of flushing a MemTable of a Tablet to a single file in HDFS

The process of combining multiple files into a single file

• Tablets are split when they reach a max size• Always split on row boundary• Master assigns a split Tablet to another Tablet

server (no data is moved!)

Tablet Splits

Reads

AccumuloTablet Server

Client MemTable

File 1

Tablet

File 1

Merge

Accumulo /əˈkjuːmjʊˌloʊ/

1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing

http://wiki.eeng.dcu.ie/ee557/287-EE/version/default/part/ImageData/data/server-side_intro.gif

Iterators: Server-side programming

Iterators

Can be run at:•Scan Time•Minor Compaction•Major Compaction

Can do things like:•Aggregation (Combiners)•Age-Off•Filtering (access control)•Transformation

Push Processing to the Data

Accumulo /əˈkjuːmjʊˌloʊ/

1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing

• Every key-value has a visibility label• Label is defined with boolean operators• Label is arbitrary and ad-hoc

• Authorizations presented at scan time• Data is filtered out automatically by system-

level Iterator

Access Control

Public Private | Admin Finance | (HR & Manager)

Access Control – Typical Architecture

Web Server

Enterprise Identity

Management

Accumulo1.) Pass Credentials

2.) Lookup User

3.) Return Authorizations

4.) Proxy Authorization

Trusted Zone

5.) Return Visible Data6.) Return Data

Access Control – Typical Architecture

3.) Auths:[SECRET, UNCLASSIFIED,PROJECT X, PROJECT Y]

Web Server

Enterprise Identity

Management

Accumulo

1.) PKI Cert

2.) Lookup Bob

4.) Proxy Bob’s Auths

Trusted Zone

5.) Return [6,8]6.) Return [6,8]

Bob

SECRET&PROJECT X, 6SECRET&PROJECT Y, 8SECRET&PROJECT Z, 3

Demo

Application Requirements

Build an application to analyze trends in Twitter messages.

•Query for word/phrase and view real-time activity in a time series graph•View at different time ranges (1 day, 7 days, 30 days, etc)•Allow multiple query terms to compare activity (ex. Breakfast,Lunch)•Automatically extract daily trends for the user

Demo Setup/Data

• Twitter Streaming API• US country codes only messages• 1,2,3-grams built• Data since Dec 24 – Live• Running on average workstation, 1 SATA disk,

6 GB memory.• 72GB, 2.6 billion entries and counting

Data Model• Tweets table– Row ID: n-gram– Column Family: Date Granularity (DAY, HOUR)– Column Qual: Date Value– Value: Count– SummingCombiner (Iterator) used to update Count

Row ID Col Fam Col Qual Value

breakfast DAY 20120318 31

breakfast DAY 20120319 56

… … … …

lunch HOUR 2012031801 3

lunch HOUR 2012031802 4

Data Model• Trends table– Row ID: (Date Granularity + Date Value)– Column Family: (Integer.MAX_VALUE – trendScore)– Column Qual: n-gram– Value: []

Row ID Col Fam Col Qual Value

DAY:20120318 2147483145 church

DAY:20120318 2147483316 hangover

… … … …

DAY:20120319 2147476521 the broncos

DAY:20120319 2147477704 tim tebow

• Utilize MapReduce for building trends• AccumuloInputFormat reads from tweets

table• AccumuloOutputFormat writes to trends

table• AccumuloStorage LoadFunc for Pig

available on github

MapReduce Analytics

Summary

•Accumulo exploits locality to enable interactive access to huge data sets while adding cell-level access control and server-side programming

•Nothing in life is free. Accumulo comes with the complexity and responsibility of managing a distributed system and designing indexes on your data

References

http://incubator.apache.org/accumulo/

http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012

• Documentation, Mailing Lists, Links

• HBase Shootout

• Trendulohttps://github.com/jaredwinick/trendulo

top related