Large Scale Crawling with Apache Nutch and Friends
DESCRIPTION
Presented by Julien Nioche, Director, DigitalPebble. This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will focus on the latest developments in Nutch, the differences between the 1.x and 2.x branches, and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point for crawling on a large scale with Apache Nutch and SOLR.
TRANSCRIPT
Large Scale Crawling with Apache Nutch and friends...
Julien Nioche ([email protected])
LUCENE/SOLR REVOLUTION EU 2013
2 / 43
About myself
DigitalPebble Ltd, Bristol (UK): specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
Strong focus on Open Source & the Apache ecosystem
VP of Apache Nutch; user | contributor | committer on:
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth
3 / 43
Outline
– Overview
– Installation and setup
– Main steps
– Nutch 2.x
– Future developments
4 / 43
Nutch?
“Distributed framework for large scale web crawling” (but does not have to be large scale at all)
Based on Apache Hadoop
Apache TLP since May 2010
Indexing and Search by SOLR
5 / 43
A bit of history
2002/2003 : Started by Doug Cutting & Mike Cafarella
2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache
2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache
May 2010 : TLP project at Apache
Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache
6 / 43
Recent Releases
[Timeline diagram: releases on the 1.x branch (trunk) – 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6 and 1.7 – between 06/09 and 06/13, with the 2.x branch releasing 2.0, 2.1 and 2.2.1 in parallel]
7 / 43
Why use Nutch?
Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised
Usual reasons
– Open source with a business-friendly license, mature, community, ...
Scalability
– Tried and tested on very large scale
– Standard Hadoop
8 / 43
Use cases
Crawl for search
– Generic or vertical
– Index and search with SOLR et al.
– Single node to large clusters on the Cloud
… but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML with MAHOUT / UIMA / GATE
– Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)
9 / 43
Customer cases
[Chart: customers plotted by Specificity (Verticality) vs. Size]
BetterJobs.com (CareerBuilder)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index
SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved
10 / 43
CommonCrawl
http://commoncrawl.org/
Open repository of web crawl data
– 2012 dataset : 3.83 billion docs, ARC files on Amazon S3
Using Nutch 1.7, with a few modifications to the Nutch code
– https://github.com/Aloisius/nutch
Next release imminent
11 / 43
Outline
– Overview
– Installation and setup
– Main steps
– Nutch 2.x
– Future developments
12 / 43
Installation
http://nutch.apache.org/downloads.html
1.7 => src and bin distributions
2.2.1 => src only
'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts
Binary distribution for 1.x == runtime/local
13 / 43
Configuration and resources
Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf
Specify configuration in nutch-site.xml
– Leave nutch-default alone!
At least:

<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>
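Beyond the agent name, a few other properties are commonly tuned before a first crawl. A minimal nutch-site.xml sketch (the property names exist in nutch-default.xml; the values shown are illustrative assumptions, not recommendations):

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>WhateverNameDescribesMyMightyCrawler</value>
  </property>
  <!-- which plugins are activated (regular expression over plugin ids) -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- politeness: seconds to wait between requests to the same server -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
  <!-- for vertical crawls: ignore outlinks pointing to other hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>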
14 / 43
Running it!
bin/crawl script : typical sequence of steps
bin/nutch : individual Nutch commands
– Inject / generate / fetch / parse / update ...
Local mode : great for testing and debugging
Recommended : deploy + Hadoop in (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor crawl, check logs, counters
15 / 43
Monitor Crawl with MapReduce UI
16 / 43
Counters and logs
17 / 43
Overview
Installation and setup
Main steps
Nutch 2.x
Future developments
Outline
18 / 43
Typical Nutch Steps
1) Inject → populates CrawlDB from seed list
2) Generate → Selects URLs to fetch into a segment
3) Fetch → Fetches URLs from segment
4) Parse → Parses content (text + metadata)
5) UpdateDB → Updates CrawlDB (new URLs, new status...)
6) InvertLinks → Builds the web graph (LinkDB)
7) Index → Send docs to [SOLR | ES | CloudSearch | … ]
Sequence of batch operations
Or use the all-in-one crawl script
Repeat steps 2 to 7
Same in 1.x and 2.x
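In local mode, the sequence above maps onto individual bin/nutch commands. A hedged sketch for the 1.x branch (directory names such as urls and crawl/ are illustrative; check the usage output of bin/nutch for the exact options of your version):

bin/nutch inject crawl/crawldb urls                          # 1) seed the CrawlDB
bin/nutch generate crawl/crawldb crawl/segments -topN 1000   # 2) select URLs into a segment
s1=`ls -d crawl/segments/2* | tail -1`                       # pick the newest segment
bin/nutch fetch $s1                                          # 3) fetch
bin/nutch parse $s1                                          # 4) parse
bin/nutch updatedb crawl/crawldb $s1                         # 5) update the CrawlDB
bin/nutch invertlinks crawl/linkdb -dir crawl/segments       # 6) build the LinkDB
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s1   # 7) index

Repeating steps 2 to 7 grows the crawl one generation at a time, which is exactly what the bin/crawl script automates.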
19 / 43
Main steps from a data perspective
[Diagram: Seed List → CrawlDB → Segment → LinkDB; each segment holds the subdirectories crawl_generate/, crawl_fetch/, content/, crawl_parse/, parse_data/ and parse_text/]
20 / 43
Frontier expansion
Manual “discovery”
– Adding new URLs by hand, “seeding”
Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful → control
– Requires content parsing and link extraction

[Diagram: crawl frontier growing outwards from the seed over iterations i = 1, 2, 3]
[Slide courtesy of A. Bialecki]
21 / 43
An extensible framework
Endpoints
– Protocol
– Parser
– HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)
Plugins
– Activated with parameter 'plugin.includes'
– Implement one or more endpoints
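To make the endpoint idea concrete, a custom URLFilter comes down to a single method that returns the URL to keep it, or null to drop it. A minimal sketch (the class and its rejection rule are hypothetical; the URLFilter interface and its Hadoop Configurable parent are the real types):

// Hypothetical plugin: drop any URL that carries a query string.
package org.example.nutch; // illustrative package name

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class NoQueryStringURLFilter implements URLFilter {

  private Configuration conf;

  // Return the URL (possibly rewritten) to keep it, null to filter it out.
  @Override
  public String filter(String urlString) {
    if (urlString == null) return null;
    return urlString.contains("?") ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

The class is then declared in the plugin's plugin.xml descriptor and activated through 'plugin.includes'.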
22 / 43
Features
Fetcher
– Multi-threaded
– Queues URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive
Crawl strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters
Scoring
– OPIC (On-line Page Importance Computation) by default
– LinkRank
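The OPIC intuition is simple: every page holds some "cash", and fetching a page spends its cash equally across its outlinks, so frequently linked pages accumulate score. A toy sketch of that idea (my illustration, not Nutch's actual ScoringFilter code):

import java.util.List;
import java.util.Map;

public class OpicToy {
  // Distribute a fetched page's cash equally among its outlinks.
  public static void distribute(Map<String, Double> cash, String page, List<String> outlinks) {
    if (outlinks.isEmpty()) return;
    double share = cash.getOrDefault(page, 0.0) / outlinks.size();
    for (String link : outlinks) {
      cash.merge(link, share, Double::sum); // targets accumulate cash
    }
    cash.put(page, 0.0); // the page has spent its cash for this round
  }
}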
23 / 43
Features (cont.)
Protocols
– http, file, ftp, https
– Respects robots.txt directives
Scheduling
– Fixed or adaptive
URL filters
– Regex, FSA, TLD, prefix, suffix
URL normalisers
– Default, regex
24 / 43
Features (cont.)
Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata
Pluggable indexing
– SOLR | ES etc.
Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well
25 / 43
Indexing
Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud: https://issues.apache.org/jira/browse/NUTCH-1377
ElasticSearch
– Version 0.90.1
AWS CloudSearch
– WIP: https://issues.apache.org/jira/browse/NUTCH-1517
Easy to build your own
– Text, DB, etc.
26 / 43
Typical Nutch document
Some of the fields (set by IndexingFilters in plugins or core code)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type
Configurable ones
– meta tags (keywords, description, etc.)
– arbitrary metadata
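These fields must be declared in the schema.xml shipped to SOLR (see the previous slide). A hedged sketch of what the corresponding declarations can look like (types and flags here are illustrative assumptions; use the schema.xml bundled with your Nutch release):

<field name="url" type="string" stored="true" indexed="true"/>
<field name="content" type="text" stored="false" indexed="true"/>
<field name="title" type="text" stored="true" indexed="true"/>
<field name="anchor" type="text" stored="true" indexed="true" multiValued="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="host" type="string" stored="false" indexed="true"/>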
27 / 43
Outline
– Overview
– Installation and setup
– Main steps
– Nutch 2.x
– Future developments
28 / 43
NUTCH 2.x
2.0 released in July 2012
2.2.1 in July 2013
Same core features as 1.x
– MapReduce, Tika, delegation to SOLR, etc.
Moved to a 'big table'-like architecture
– Wealth of NoSQL projects in the last few years
Abstraction over storage layer → Apache GORA
29 / 43
Apache GORA
http://gora.apache.org/
ORM for NoSQL databases
– and limited SQL support + file-based storage
Serialization with Apache AVRO
Object-to-datastore mappings (backend-specific)
Current version: 0.3
DataStore implementations:
– Accumulo
– Cassandra
– HBase
– Avro
– DynamoDB
– SQL (broken)
30 / 43
AVRO Schema => Java code
{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }},[…]
31 / 43
Mapping file (backend-specific – HBase)

<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f" maxVersions="1"/>
    <family name="s" maxVersions="1"/>
    <family name="il" maxVersions="1"/>
    <family name="ol" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
    <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
32 / 43
DataStore operations
Basic operations
– get(K key)
– put(K key, T obj)
– delete(K key)
Querying
– execute(Query<K, T> query) → Result<K, T>
– deleteByQuery(Query<K, T> query)
Wrappers for Apache Hadoop
– GoraInputFormat | GoraOutputFormat
– GoraRecordReader | GoraRecordWriter
– GoraMapper | GoraReducer
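Putting the API together, a minimal Java sketch of reading and updating a page through GORA (assumptions: conf/gora.properties points at a configured backend, the key follows Nutch 2.x's reversed-host convention, and error handling is omitted):

import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;

public class GoraSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Opens the backend declared in gora.properties (e.g. gora-hbase)
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

    // Nutch 2.x keys rows by reversed-host URL, e.g. "com.example.www:http/"
    String key = "com.example.www:http/";
    WebPage page = store.get(key);
    if (page != null) {
      page.setFetchTime(System.currentTimeMillis()); // field from the AVRO schema
      store.put(key, page);
    }
    store.flush();
    store.close();
  }
}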
33 / 43
GORA in Nutch
AVRO schema provided and Java code pre-generated
Mapping files provided for the backends
– can be modified if necessary
Need to rebuild to get the dependencies for a given backend
– hence the source-only distribution of Nutch 2.x
http://wiki.apache.org/nutch/Nutch2Tutorial
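Concretely, pointing Nutch 2.x at HBase involves (a sketch based on the Nutch2Tutorial above; check it for the details of your version) enabling the gora-hbase dependency in ivy/ivy.xml, rebuilding with 'ant runtime', and declaring the default store in conf/gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore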
34 / 43
Benefits
Storage still distributed and replicated
… but one big table
– status, metadata, content, text → one place
– no more segments
Resume-able fetch and parse steps
Easier interaction with other resources
– Third-party code just needs to use GORA and the schema
Simplifies the Nutch code
Potentially faster (e.g. the update step)
35 / 43
Drawbacks
More stuff to install and configure
– Higher hardware requirements
Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2 + HBase: 2.7x slower than 1.x
– N2 + Cassandra: 4.4x slower than 1.x
– Due mostly to the GORA layer: not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
– Not all backends provide data locality!
Not as stable as Nutch 1.x
36 / 43
2.x Work in progress
Stabilise backend implementations
– GORA-HBase the most reliable
Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSoC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)
Filter-enabled scans (GORA-119)
– => no need to de-serialize the whole dataset
37 / 43
Outline
– Overview
– Installation and setup
– Main steps
– Nutch 2.x
– Future developments
38 / 43
Future
New functionalities
– Support for SOLRCloud
– Sitemaps (from the CrawlerCommons library)
– Canonical tag
– Generic deduplication (NUTCH-656)
1.x and 2.x to coexist in parallel
– 2.x not yet a replacement for 1.x
Move to the new MapReduce API
– Use Nutch on Hadoop 2.x
39 / 43
More delegation
A great deal done in recent years (SOLR, Tika)
Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering
Delegate PageRank-like computations to a graph library
– Apache Giraph
– Should be more efficient + less code to maintain
40 / 43
Longer term
Hadoop 2.x & YARN
Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …
End of 100% batch operations?
– Fetch and parse as streaming?
– Always be fetching
– Generate / update / pagerank remain batch
See https://github.com/DigitalPebble/storm-crawler
41 / 43
Where to find out more?
Project page : http://nutch.apache.org/
Wiki : http://wiki.apache.org/nutch/
Mailing lists :
– [email protected]
– [email protected]
Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...
Support / consulting :
– http://wiki.apache.org/nutch/Support
42 / 43
Questions?
43 / 43