accumulo nutch/gora, storm, and pig

15
Large Scale Web Analytics with Accumulo (and Nutch/Gora, Pig, and Storm) Jason Trost [email protected] @jason_trost

Upload: jason-trost

Post on 27-Jan-2015

124 views

Category:

Documents


3 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Accumulo Nutch/GORA, Storm, and Pig

Large Scale Web Analytics with Accumulo (and Nutch/Gora, Pig, and Storm)

Jason Trost [email protected] @jason_trost

Page 2: Accumulo Nutch/GORA, Storm, and Pig

Introductions

• Jason Trost ([email protected])

• Senior Software Engineer at Endgame Systems

• Former Accumulo Trainer

• Apache Accumulo Committer

– Apache Pig integration with Accumulo

– some minor bug fixes

Page 3: Accumulo Nutch/GORA, Storm, and Pig

Agenda

• Technologies Introduction –Apache Accumulo –Apache Gora –Apache Nutch/Gora –Storm

• Accumulo at Endgame –Web Crawl Analytics –Real-time DNS Processing –Operations

Page 4: Accumulo Nutch/GORA, Storm, and Pig

Apache Accumulo

• Accumulo is a BigTable implementation with cell level security

• It is conceptually very similar to HBase, but it has some nice features that HBase is currently lacking.

• Some of these features are: – Cell level security

– No fat row problem

– No limitation on col fams or when col fams can be created

– Server side, data local, programming abstraction called Iterators

– Iterators enable fast aggregation, searching, filtering, streaming Reduce

Page 5: Accumulo Nutch/GORA, Storm, and Pig

Apache Gora

• Gora is a object relational/non-relational mapping for arbitrary data stores including both relational (MySQL) and non-relational data stores (HBase, Cassandra, Accumulo, Redis, Voldermort, etc.).

• It was designed for Big Data applications and has support (interfaces) for Apache Pig, Apache Hive, Cascading, and generic MapReduce.

Page 6: Accumulo Nutch/GORA, Storm, and Pig

Apache Nutch/Gora

• Nutch is a highly scalable web crawler built over Hadoop MapReduce.

• It was designed from the ground up to be an Internet scale web crawler and to enable large scale search applications

• GORA enables the storing of the web crawl data and metadata in Accumulo

Page 7: Accumulo Nutch/GORA, Storm, and Pig

Storm

• Highly scalable streaming event processing system

• Conceptually similar to MapReduce, but operates on streaming data in real-time

• Released by Twitter after they acquired Backtype

• Development led by Nathan Marz

Twitter Storm

• At-least-once-processing of events

• Spouts and Bolts are wired together to form computation Topologies

• Topologies run until killed

Page 8: Accumulo Nutch/GORA, Storm, and Pig

at

Page 9: Accumulo Nutch/GORA, Storm, and Pig

Web Crawl Analytics

• Formerly used Heritrix with a Cassandra backend for collection and storage

• We now use Nutch/GORA to perform Large-scale web crawling

• All pages and HTTP headers are stored in Accumulo

• Run Pig scripts for pulling data out of Accumulo, performing rollups, performing pattern matching (using regular expressions), and processing the pages using python scripts

Page 10: Accumulo Nutch/GORA, Storm, and Pig

Real-time DNS Processing

• We used to use MapReduce/PIG to generate daily reports on all DNS event data from files in HDFS; this took several hours

• Now, we use an internally developed framework called Velocity that was built over Storm

• In real-time, enrich DNS and security events with IP geo data (country, city, company, vertical), correlate with internally developed/maintained DNS blacklists

• Store the events in Accumulo & use custom Accumulo iterators to perform rollups

• At report generation time, Accumulo aggregates records server side

• This process now takes minutes, not hours, and we can query for partial results instead of having to wait until the end of the day Twitter Storm

Page 11: Accumulo Nutch/GORA, Storm, and Pig

Custom Iterators & Aggregation

Ingest Format

Row GROUP BY FIELDS

Col Fam Constant String

Col Qual Event UUID

Val -

Format After Custom Iterator

Row GROUP BY FIELDS

Col Fam Constant String

Col Qual “”

Val “1”

At Ingest • RowID contains a CSV record that

represents the fields used to basically perform a GROUP BY

• Col Qual contains the event UUID

At Scan time • Basically strip off the event UUID • Set the value to be “1” • Prepares Key/Value for input into

SummingCombiner • Output from SummingCombiner is an

accurate count of aggregated records • This is, in essence, a streaming

Reduce

Page 12: Accumulo Nutch/GORA, Storm, and Pig

Operations with Accumulo

• Hadoop Streaming jobs tend to kill tablet servers

– Streaming jobs use more memory than Hadoop allows

– This can make service memory allocations challenging

– Reducing number of Map tasks helped

• Running tablet servers under supervision is critical

– Tablet servers fail fast

– Supervisord or daemontools restart failed processes

– Has improved our cluster’s stability dramatically

• Pre-splitting tables is very important for throughput

– Our rows lead with a day day, e.g. “20120101”

• Locality Groups are your friend for Nutch/Gora

Page 13: Accumulo Nutch/GORA, Storm, and Pig

We’re Hiring

• Like to work on hard problems with Big Data?

• Are you familiar/interested in these technologies? – Hadoop, Storm, Django, Nutch/GORA

– Accumulo, Solr/ElasticSearch, Redis

– Python, Java, Pig, Node.JS, Github

• Want to contribute to Open Source?

• We have offices in Atlanta, Washington DC, Baltimore, and San Antonio

• www.linkedin.com/jobs/at-Endgame-Systems

Page 14: Accumulo Nutch/GORA, Storm, and Pig

Questions?

Page 15: Accumulo Nutch/GORA, Storm, and Pig

Contact Info

• Jason Trost

• Email: [email protected]

• Twitter: @jason_trost

• Blog: www.covert.io