Large Scale Crawling with Apache Nutch


Large Scale Crawling with Apache Nutch

Julien Nioche [email protected]

ApacheCon Europe 2012

I'll be talking about large scale web crawling and more specifically about Apache Nutch, an open source project based on Hadoop.

About myself

DigitalPebble Ltd, Bristol (UK)

Specialised in Text Engineering

Web Crawling

Natural Language Processing

Information Retrieval

Data Mining

Strong focus on Open Source & Apache ecosystem

Apache Nutch VP

Apache Tika committer

User / contributor : SOLR, Lucene

GATE, UIMA

Mahout

Behemoth

A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from web crawling to natural language processing, information retrieval and data mining. What makes the identity of DigitalPebble is its strong focus on open source and the Apache ecosystem. The main projects I am involved in are the ones listed above.

Objectives

Overview of the project

Nutch in a nutshell

Nutch 2.x

Future developments

Nutch?

Distributed framework for large scale web crawling

but does not have to be large scale at all

or even on the web (file-protocol)

Based on Apache Hadoop

Indexing and Search

Apache TLP since May 2010

Note that I mention crawling and not web search: Nutch is used not only for search. It used to do indexing and search using Lucene, but now delegates this to SOLR.

Short history

2002/2003 : Started by Doug Cutting & Mike Cafarella

2004 : sub-project of Lucene @Apache

2005 : MapReduce implementation in Nutch

2006 : Hadoop sub-project of Lucene @Apache

2006/7 : Parser and MimeType in Tika

2008 : Tika sub-project of Lucene @Apache

May 2010 : TLP project at Apache

June 2012 : Nutch 1.5.1

Oct 2012 : Nutch 2.1

Major Releases

October 2012 : 2.1

July 2012 : 2.0

June 2012 : 1.5

November 2011 : 1.4

June 2011 : 1.3

September 2010 : 1.2

June 2010 : 1.1

March 2009 : 1.0

April 2007 : 0.9

July 2006 : 0.8

Recent Releases

[Timeline diagram: releases on the 1.x and 2.x branches, from 1.0 (06/09) through 1.5.1, 2.0 and 2.1 to trunk (06/12)]

Mailing lists

http://pulse.apache.org/#nutch.apache.org

Mailing list user@nutch.apache.org
Current subscribers: 984
Current digest subscribers: 15
Total posts (607 days): 5390
Mean posts per day: 8.88

Mailing list dev@nutch.apache.org
Current subscribers: 487
Current digest subscribers: 5
Total posts (607 days): 6099
Mean posts per day: 10.05

Community

6 active committers / PMC members (4 within the last 18 months)

Constant stream of new contributions & bug reports

Steady numbers of mailing list subscribers and traffic

Nutch is a very healthy 10-year-old

Why use Nutch?

Features, e.g. indexing with SOLR

PageRank implementation

Can be extended with plugins

Usual reasons: mature, business-friendly license, community, ...

Scalability: tried and tested at very large scale

Not the best option when ...

Requires a Hadoop cluster : installation and skills

Hadoop based == batch processing == high latency

No guarantee that a page will be fetched / parsed / indexed within X minutes|hours

Javascript / Ajax not supported (yet)

Use cases

Crawl for IR: generic or vertical

Index and Search with SOLR

Single node to large clusters on Cloud

but also: Data Mining

NLP (e.g. Sentiment Analysis)

ML

MAHOUT / UIMA / GATE

Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)

Customer cases

Specificity (Verticality)

Scale

Use case : BetterJobs.com

Single server

Aggregates content from job portals

Extracts and normalizes structure (description, requirements, locations)

~1M pages total

Feeds SOLR index

Use case : SimilarPages.com

Large cluster on Amazon EC2 (up to 400 nodes)

Fetched & parsed 3 billion pages

10+ billion pages in crawlDB (~100TB data)

200+ million lists of similarities

No indexing / search involved

Typical Nutch Steps

1. Inject : populates CrawlDB from seed list

2. Generate : selects URLs to fetch in a segment

3. Fetch : fetches URLs from segment

4. Parse : parses content (text + metadata)

5. UpdateDB : updates CrawlDB (new URLs, new status...)

6. InvertLinks : builds the Webgraph

7. SOLRIndex : sends docs to SOLR

8. SOLRDedup : removes duplicate docs based on signature

Sequence of batch operations

Or use the all-in-one crawl script

Repeat steps 2 to 8

Same in 1.x and 2.x

Main steps in Nutch. More actions available. These are shell wrappers around Hadoop commands; a Java sketch of one round follows below.
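Because the shell wrappers delegate to plain Hadoop Tool classes, a round can also be driven from Java. This is a minimal sketch, not the project's official driver: the tool class names are the real Nutch 1.x ones, but the paths ("urls", "crawl/...") and the way the freshly generated segment is located are illustrative assumptions.

// Sketch of one crawl round using the Nutch 1.x Tool classes.
// Paths and the segment lookup are illustrative assumptions.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.crawl.LinkDb;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class OneCrawlRound {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();

    // 1. Inject seed URLs into the CrawlDB
    ToolRunner.run(conf, new Injector(), new String[] { "crawl/crawldb", "urls" });
    // 2. Generate a new segment of URLs due for fetching
    ToolRunner.run(conf, new Generator(), new String[] { "crawl/crawldb", "crawl/segments" });

    // Locate the segment the Generator just created (newest timestamped directory)
    FileSystem fs = FileSystem.get(conf);
    FileStatus[] dirs = fs.listStatus(new Path("crawl/segments"));
    Arrays.sort(dirs);
    String segment = dirs[dirs.length - 1].getPath().toString();

    // 3-4. Fetch, then parse the segment
    ToolRunner.run(conf, new Fetcher(), new String[] { segment });
    ToolRunner.run(conf, new ParseSegment(), new String[] { segment });
    // 5. Update the CrawlDB with new URLs and statuses
    ToolRunner.run(conf, new CrawlDb(), new String[] { "crawl/crawldb", segment });
    // 6. Invert links into the LinkDB
    ToolRunner.run(conf, new LinkDb(), new String[] { "crawl/linkdb", segment });
  }
}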

Main steps

[Diagram: the Seed List feeds the CrawlDB; each round produces a Segment containing /crawl_generate/, /crawl_fetch/, /content/, /crawl_parse/, /parse_data/ and /parse_text/; link inversion feeds the LinkDB]


Frontier expansion

Manual discovery: adding new URLs by hand, seeding

Automatic discovery of new resources (frontier expansion)

Not all outlinks are equally useful - control

Requires content parsing and link extraction

[Diagram: frontier expansion from the seed over successive iterations i = 1, 2, 3]

[Slide courtesy of A. Bialecki]

An extensible framework

Endpoints: Protocol

Parser

HtmlParseFilter (ParseFilter in Nutch 2.x)

ScoringFilter (used in various places)

URLFilter (ditto)

URLNormalizer (ditto)

IndexingFilter

Plugins: activated with parameter 'plugin.includes'

Implement one or more endpoints

Endpoints are called in various places: URL filters and normalisers in a lot of places, same for scoring filters.
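To make the endpoint idea concrete, here is a hypothetical URLFilter that keeps only http(s) URLs. The org.apache.nutch.net.URLFilter interface is the real 1.x extension point; the class itself is only a sketch and would additionally need a plugin.xml descriptor and an entry in 'plugin.includes' to be activated.

// Hypothetical URLFilter endpoint: keep http(s) URLs, reject everything else.
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class HttpOnlyURLFilter implements URLFilter {

  private Configuration conf;

  // Contract: return the (possibly rewritten) URL to keep it, or null to reject it.
  public String filter(String urlString) {
    if (urlString != null
        && (urlString.startsWith("http://") || urlString.startsWith("https://"))) {
      return urlString;
    }
    return null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}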

Features

Fetcher: multi-threaded

Follows robots.txt

Groups URLs per hostname / domain / IP

Limits the number of URLs per round of fetching

Default values are polite but can be made more aggressive

Crawl strategy: breadth-first, but can be depth-first

Configurable via custom scoring plugins

Scoring: OPIC (On-line Page Importance Computation) by default

LinkRank

Fetcher : multithreaded but polite

Features (cont.)

Protocols: http, https, file, ftp

Scheduling: specified or adaptive

URL filters: regex, FSA, TLD, prefix, suffix

URL normalisers: default, regex


Features (cont.)

Other plugins: CreativeCommons

Feeds

Language Identification

Rel tags

Arbitrary Metadata

Indexing to SOLR: bespoke schema

Parsing with Apache Tika: hundreds of formats supported

But some legacy parsers as well

Data Structures in 1.x

MapReduce jobs => I/O : Hadoop [Sequence|Map]Files

CrawlDB => status of known pages

CrawlDB is a MapFile :

byte status;   // fetched? unfetched? failed? redir?
long fetchTime;
byte retries;
int fetchInterval;
float score = 1.0f;
byte[] signature = null;
long modifiedTime;
org.apache.hadoop.io.MapWritable metaData;

Input of : generate - index

Output of : inject - update

The value is a Writable object : CrawlDatum
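Because the CrawlDB is an ordinary Hadoop MapFile of Text keys to CrawlDatum values, it can be read with the plain Hadoop API. A minimal sketch follows; 'bin/nutch readdb' is the usual tool for this, and the part-file path below is an assumption.

// Sketch: dump CrawlDb entries by reading one part of the MapFile directly.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class DumpCrawlDb {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The 'data' file of one part of the CrawlDb MapFile; path is illustrative.
    Path part = new Path("crawl/crawldb/current/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      System.out.println(url + "\tstatus=" + datum.getStatus()
          + " score=" + datum.getScore());
    }
    reader.close();
  }
}

The same pattern applies to the LinkDB described two slides down (Text keys mapping to Inlinks values).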

Data Structures 1.x

Segment :

/crawl_generate/ : SequenceFile
/crawl_fetch/ : MapFile
/content/ : MapFile
/crawl_parse/ : SequenceFile
/parse_data/ : MapFile
/parse_text/ : MapFile

Segment => round of fetching

Identified by a timestamp

Can have multiple versions of a page in different segments

Data Structures 1.x

LinkDB is a MapFile :

Inlinks : HashSet<Inlink>
Inlink : String fromUrl, String anchor

Output of : invertlinks

Input of : SOLRIndex

LinkDB => storage for the Web Graph

NUTCH 2.x

2.0 released in July 2012

2.1 in October 2012

Common features as in 1.x: delegation to SOLR, Tika, MapReduce etc.

Moved to a table-based architecture

Wealth of NoSQL projects in the last few years

Abstraction over storage layer Apache GORA

Apache GORA

http://gora.apache.org/

ORM for NoSQL databases, with limited SQL support + file-based storage

Serialization with Apache AVRO

Object-to-datastore mappings (backend-specific)

DataStore implementations

0.2.1 released in August 2012

Accumulo

Cassandra

HBase

Avro

DynamoDB (soon)

SQL

AVRO Schema => Java code

{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }},[]

Mapping file (backend-specific, e.g. HBase)

DataStore operations

Atomic operations: get(K key)

put(K key, T obj)

delete(K key)

Querying: execute(Query query) => Result

deleteByQuery(Query query)

Wrappers for Apache Hadoop: GoraInput|OutputFormat

GoraRecordReader|Writer

GoraMapper|Reducer
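As a sketch of these operations with Nutch's pre-generated WebPage class: the factory call follows the Gora 0.2 tutorial API, while the key literal and the backend configuration below are assumptions for illustration.

// Sketch of the Gora DataStore API with Nutch's generated WebPage class.
// Assumes a backend (e.g. HBase) is configured as the default Gora datastore.
import org.apache.avro.util.Utf8;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;

public class GoraSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

    WebPage page = new WebPage();
    page.setBaseUrl(new Utf8("http://nutch.apache.org/"));
    page.setStatus(1);

    // Nutch keys pages by reversed URL; this literal key is illustrative.
    store.put("org.apache.nutch:http/", page);
    store.flush();

    WebPage stored = store.get("org.apache.nutch:http/");
    store.close();
  }
}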

GORA in Nutch

AVRO schema provided and Java code pre-generated

Mapping files provided for backends; can be modified if necessary

Need to rebuild to get dependencies for the backend

No binary distribution of Nutch 2.x

http://wiki.apache.org/nutch/Nutch2Tutorial

What does this mean for Nutch?

Benefits

Storage still distributed and replicated

but one big table: status, metadata, content, text in one place

Simplified logic in Nutch: simpler code for updating / merging information

More efficient (?) : no need to read / write the entire structure to update records

No comparison available yet + early days for GORA

Easier interaction with other resources: third-party code just needs to use Gora and the schema

What does this mean for Nutch?

Drawbacks

More stuff to install and configure :-)

Not as stable as Nutch 1.x

Dependent on success of Gora

2.x Work in progress

Stabilise backend implementations: GORA-HBase currently the most reliable

Synchronize features with 1.x: e.g. 2.x has ElasticSearch support but is missing a LinkRank equivalent

Filter-enabled scans (GORA-119): no need to de-serialize the whole dataset

Future

New functionalities: support for SOLRCloud

Sitemap (from Crawler Commons library)

Canonical tag

More indexers (e.g. ElasticSearch) + pluggable indexers?

Both 1.x and 2.x in parallel, but more frequent releases for 2.x

More delegation

Great deal done in recent years (SOLR, Tika)

Share code with crawler-commons (http://code.google.com/p/crawler-commons/)

Fetcher / protocol handling

Robots.txt parsing

URL normalisation / filtering

Delegate PageRank-like computations to a graph library, e.g. Apache Giraph

Should be more efficient as well

Where to find out more?

Project page : http://nutch.apache.org/

Wiki : http://wiki.apache.org/nutch/

Mailing lists : user@nutch.apache.org

dev@nutch.apache.org

Chapter in 'Hadoop: The Definitive Guide' (T. White)

Understanding Hadoop is essential anyway...

Support / consulting : http://wiki.apache.org/nutch/Support

Questions ?