nutch - web-scale search engine toolkit

54

Click here to load reader

Upload: abial

Post on 06-May-2015

8.277 views

Category:

Technology


3 download

DESCRIPTION

This slideset presents the Nutch search engine (http://lucene.apache.org/nutch). A high-level architecture is described, as well as some challenges common in web-crawling and solutions implemented in Nutch. The presentation closes with a brief look into the Nutch future.

TRANSCRIPT

Page 1: Nutch - web-scale search engine toolkit

Nut

ch ndash

Apa

cheC

on U

S 0

9

Web-scale search engine toolkit

Today and tomorrow

Andrzej Białeckiabapacheorg

Apache

2

Nut

ch ndash

Apa

cheC

on U

S 0

9

AgendaAbout the projectWeb crawling in generalNutch architecture overviewNutch workflow

bull Setupbull Crawlingbull SearchingChallenges (and some solutions)Nutch present and futureQuestions and answers

3

Nut

ch ndash

Apa

cheC

on U

S 0

9

Apache Nutch projectFounded in 2003 by Doug Cutting the Lucene creator and Mike CafarellaApache project since 2004 (sub-project of Lucene)Spin-offs

bull Map-Reduce and distributed FS rarr Hadoopbull Content type detection and parsing rarr TikaMany installations in operation mostly vertical searchCollections typically 1 mln - 200 mln documents

4

Nut

ch ndash

Apa

cheC

on U

S 0

9

5

Nut

ch ndash

Apa

cheC

on U

S 0

9

Web as a directed graphNodes (vertices) URL-s as unique identifiersEdges (links) hyperlinks like lta href=rdquotargetUrlrdquogtEdge labels lta href=rdquordquogtanchor textltagtOften represented as adjacency (neighbor) listsTraversal follow the edges breadth-first depth-first random

1

3

4

56

2

7

8

9

1 rarr 2 3 4 5 65 rarr 6 97 rarr 3 4 8 9

6

Nut

ch ndash

Apa

cheC

on U

S 0

9

Whats in a search engine

hellip a few things that may surprise you

7

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

8

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular

minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework

minus Map-reduce processingFull-text indexer amp search engine

minus Using Lucene or Solrminus Support for distributed search

Robust API and integration options

9

Nut

ch ndash

Apa

cheC

on U

S 0

9

Hadoop foundationFile system abstraction

bull Local FS orbull Distributed FS

minus also Amazon S3 Kosmos and other FS implMap-reduce processing

bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-

reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly

minus Hadoop data is immutable once createdminus Updates are really merge amp replace

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 2: Nutch - web-scale search engine toolkit

2

Nut

ch ndash

Apa

cheC

on U

S 0

9

AgendaAbout the projectWeb crawling in generalNutch architecture overviewNutch workflow

bull Setupbull Crawlingbull SearchingChallenges (and some solutions)Nutch present and futureQuestions and answers

3

Nut

ch ndash

Apa

cheC

on U

S 0

9

Apache Nutch projectFounded in 2003 by Doug Cutting the Lucene creator and Mike CafarellaApache project since 2004 (sub-project of Lucene)Spin-offs

bull Map-Reduce and distributed FS rarr Hadoopbull Content type detection and parsing rarr TikaMany installations in operation mostly vertical searchCollections typically 1 mln - 200 mln documents

4

Nut

ch ndash

Apa

cheC

on U

S 0

9

5

Nut

ch ndash

Apa

cheC

on U

S 0

9

Web as a directed graphNodes (vertices) URL-s as unique identifiersEdges (links) hyperlinks like lta href=rdquotargetUrlrdquogtEdge labels lta href=rdquordquogtanchor textltagtOften represented as adjacency (neighbor) listsTraversal follow the edges breadth-first depth-first random

1

3

4

56

2

7

8

9

1 rarr 2 3 4 5 65 rarr 6 97 rarr 3 4 8 9

6

Nut

ch ndash

Apa

cheC

on U

S 0

9

Whats in a search engine

hellip a few things that may surprise you

7

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

8

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular

minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework

minus Map-reduce processingFull-text indexer amp search engine

minus Using Lucene or Solrminus Support for distributed search

Robust API and integration options

9

Nut

ch ndash

Apa

cheC

on U

S 0

9

Hadoop foundationFile system abstraction

bull Local FS orbull Distributed FS

minus also Amazon S3 Kosmos and other FS implMap-reduce processing

bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-

reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly

minus Hadoop data is immutable once createdminus Updates are really merge amp replace

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 3: Nutch - web-scale search engine toolkit

3

Nut

ch ndash

Apa

cheC

on U

S 0

9

Apache Nutch projectFounded in 2003 by Doug Cutting the Lucene creator and Mike CafarellaApache project since 2004 (sub-project of Lucene)Spin-offs

bull Map-Reduce and distributed FS rarr Hadoopbull Content type detection and parsing rarr TikaMany installations in operation mostly vertical searchCollections typically 1 mln - 200 mln documents

4

Nut

ch ndash

Apa

cheC

on U

S 0

9

5

Nut

ch ndash

Apa

cheC

on U

S 0

9

Web as a directed graphNodes (vertices) URL-s as unique identifiersEdges (links) hyperlinks like lta href=rdquotargetUrlrdquogtEdge labels lta href=rdquordquogtanchor textltagtOften represented as adjacency (neighbor) listsTraversal follow the edges breadth-first depth-first random

1

3

4

56

2

7

8

9

1 rarr 2 3 4 5 65 rarr 6 97 rarr 3 4 8 9

6

Nut

ch ndash

Apa

cheC

on U

S 0

9

Whats in a search engine

hellip a few things that may surprise you

7

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

8

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular

minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework

minus Map-reduce processingFull-text indexer amp search engine

minus Using Lucene or Solrminus Support for distributed search

Robust API and integration options

9

Nut

ch ndash

Apa

cheC

on U

S 0

9

Hadoop foundationFile system abstraction

bull Local FS orbull Distributed FS

minus also Amazon S3 Kosmos and other FS implMap-reduce processing

bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-

reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly

minus Hadoop data is immutable once createdminus Updates are really merge amp replace

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 4: Nutch - web-scale search engine toolkit

4

Nut

ch ndash

Apa

cheC

on U

S 0

9

5

Nut

ch ndash

Apa

cheC

on U

S 0

9

Web as a directed graphNodes (vertices) URL-s as unique identifiersEdges (links) hyperlinks like lta href=rdquotargetUrlrdquogtEdge labels lta href=rdquordquogtanchor textltagtOften represented as adjacency (neighbor) listsTraversal follow the edges breadth-first depth-first random

1

3

4

56

2

7

8

9

1 rarr 2 3 4 5 65 rarr 6 97 rarr 3 4 8 9

6

Nut

ch ndash

Apa

cheC

on U

S 0

9

Whats in a search engine

hellip a few things that may surprise you

7

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

8

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular

minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework

minus Map-reduce processingFull-text indexer amp search engine

minus Using Lucene or Solrminus Support for distributed search

Robust API and integration options

9

Nut

ch ndash

Apa

cheC

on U

S 0

9

Hadoop foundationFile system abstraction

bull Local FS orbull Distributed FS

minus also Amazon S3 Kosmos and other FS implMap-reduce processing

bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-

reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly

minus Hadoop data is immutable once createdminus Updates are really merge amp replace

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 5: Nutch - web-scale search engine toolkit

5

Nut

ch ndash

Apa

cheC

on U

S 0

9

Web as a directed graphNodes (vertices) URL-s as unique identifiersEdges (links) hyperlinks like lta href=rdquotargetUrlrdquogtEdge labels lta href=rdquordquogtanchor textltagtOften represented as adjacency (neighbor) listsTraversal follow the edges breadth-first depth-first random

1

3

4

56

2

7

8

9

1 rarr 2 3 4 5 65 rarr 6 97 rarr 3 4 8 9

6

Nut

ch ndash

Apa

cheC

on U

S 0

9

Whats in a search engine

hellip a few things that may surprise you

7

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

8

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular

minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework

minus Map-reduce processingFull-text indexer amp search engine

minus Using Lucene or Solrminus Support for distributed search

Robust API and integration options

9

Nut

ch ndash

Apa

cheC

on U

S 0

9

Hadoop foundationFile system abstraction

bull Local FS orbull Distributed FS

minus also Amazon S3 Kosmos and other FS implMap-reduce processing

bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-

reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly

minus Hadoop data is immutable once createdminus Updates are really merge amp replace

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 6: Nutch - web-scale search engine toolkit

6

Nut

ch ndash

Apa

cheC

on U

S 0

9

Whats in a search engine

hellip a few things that may surprise you

7

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

8

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular

minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework

minus Map-reduce processingFull-text indexer amp search engine

minus Using Lucene or Solrminus Support for distributed search

Robust API and integration options

9

Nut

ch ndash

Apa

cheC

on U

S 0

9

Hadoop foundationFile system abstraction

bull Local FS orbull Distributed FS

minus also Amazon S3 Kosmos and other FS implMap-reduce processing

bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-

reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly

minus Hadoop data is immutable once createdminus Updates are really merge amp replace

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 7: Nutch - web-scale search engine toolkit

7

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

8

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular

minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework

minus Map-reduce processingFull-text indexer amp search engine

minus Using Lucene or Solrminus Support for distributed search

Robust API and integration options

9

Nut

ch ndash

Apa

cheC

on U

S 0

9

Hadoop foundationFile system abstraction

bull Local FS orbull Distributed FS

minus also Amazon S3 Kosmos and other FS implMap-reduce processing

bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-

reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly

minus Hadoop data is immutable once createdminus Updates are really merge amp replace

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 8: Nutch - web-scale search engine toolkit

8

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch features at a glancePage database and link database (web graph)Plugin-based highly modular

minus Most behavior can be changed via pluginsMulti-protocol multi-threaded distributed crawlerPlugin-based content processing (parsing filtering)Robust crawling frontier controlsScalable data processing framework

minus Map-reduce processingFull-text indexer amp search engine

minus Using Lucene or Solrminus Support for distributed search

Robust API and integration options

9

Nut

ch ndash

Apa

cheC

on U

S 0

9

Hadoop foundationFile system abstraction

bull Local FS orbull Distributed FS

minus also Amazon S3 Kosmos and other FS implMap-reduce processing

bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-

reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly

minus Hadoop data is immutable once createdminus Updates are really merge amp replace

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 9: Nutch - web-scale search engine toolkit

9

Nut

ch ndash

Apa

cheC

on U

S 0

9

Hadoop foundationFile system abstraction

bull Local FS orbull Distributed FS

minus also Amazon S3 Kosmos and other FS implMap-reduce processing

bull Currently central to Nutch algorithmsbull Processing tasks are executed as one or more map-

reduce jobsbull Data maintained as Hadoop MapFile-s SequenceFile-sbull Massive updates very efficientbull Small updates costly

minus Hadoop data is immutable once createdminus Updates are really merge amp replace

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 10: Nutch - web-scale search engine toolkit

10

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search engine building blocks

Web graph-page info-links (inout)

Crawler

Parser

Searcher

IndexerUpdater

Scheduler

Contentrepository

Injector

Crawling frontier controls

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 11: Nutch - web-scale search engine toolkit

11

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch building blocks

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 12: Nutch - web-scale search engine toolkit

12

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Maintains info on all known URL-s Fetch schedule Fetch status Page signature Metadata

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 13: Nutch - web-scale search engine toolkit

13

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

For each target URL keeps info on incoming links ie list of source URL-s and their associated anchor text

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 14: Nutch - web-scale search engine toolkit

14

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch data

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

Shards (ldquosegmentsrdquo) keep Raw page content Parsed content + discovered metadata + outlinks Plain text for indexing and snippets Optional per-shard indexes

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 15: Nutch - web-scale search engine toolkit

15

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch shards (aka ldquosegmentsrdquo)Unit of work (batch) ndash easier to process massive datasetsConvenience placeholder using predefined directory namesUnit of deployment to the search infrastructureMay include per-shard Lucene indexesOnce completed they are basically unmodifiable

minus No in-place updates of content or replacing of obsolete contentPeriodically phased-out

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

2009043012345

crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_text

200904301234crawl_generatecrawl_fetchcontentcrawl_parseparse_dataparse_textindexes

Generator

Fetcher

Parser

Indexersnippets

ldquocachedrdquo view

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 16: Nutch - web-scale search engine toolkit

16

Nut

ch ndash

Apa

cheC

on U

S 0

9

URL life-cycle and shardsinjected

discoveredscheduledfor fetching

S1

fetched up-to-date

scheduledfor fetching

fetched up-to-date

A time-lapse view of realityGoals

bull Fresh index rarr sync with the real rate of changes

bull Minimize re-fetching rarr dont fetch unmodified pages

Each page may need its own crawl scheduleShard management

bull Page may be present in many shards but only the most recent record is valid

bull Inconvenient to update in-place just mark as deleted

bull Phase-out old shards and force re-fetch of the remaining pages

scheduledfor fetching

fetched up-to-date

time

S2

S3

real page changesobserved changes

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 17: Nutch - web-scale search engine toolkit

17

Nut

ch ndash

Apa

cheC

on U

S 0

9

Crawling frontierNo authoritative catalog of web pagesSearch engines discover their view of web universe

minus Start from ldquoseed listrdquominus Follow (walk) some (useful interesting) outlinks

Many dangers of simply wandering aroundminus explosion or collapse of the frontierminus collecting unwanted content (spam junk offensive)

I could usesome guidance

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 18: Nutch - web-scale search engine toolkit

18

Nut

ch ndash

Apa

cheC

on U

S 0

9

Controlling the crawling frontierURL filter plugins

minus White-list black-list regexminus May use external resources

(DB-s services )

URL normalizer pluginsminus Resolving relative path

elementsminus ldquoEquivalentrdquo URLs

Additional controls using scoring plugins

minus priority metadata selectblockCrawler traps ndash difficult

minus Domain host path-level stats

seed

i = 1

i = 2i = 3

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 19: Nutch - web-scale search engine toolkit

19

Nut

ch ndash

Apa

cheC

on U

S 0

9

Wide vs focused crawlingDifferences

bull Little technical difference in configurationbull Big difference in operations maintenance and qualityWide crawling

minus (Almost) Unlimited crawling frontierminus High risk of spamming and junk contentminus ldquoPolitenessrdquo a very important limiting factorminus Bandwidth amp DNS considerations

Focused (vertical or enterprise) crawlingminus Limited crawling frontierminus Bandwidth or politeness is often not an issueminus Low risk of spamming and junk content

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 20: Nutch - web-scale search engine toolkit

20

Nut

ch ndash

Apa

cheC

on U

S 0

9

Vertical amp enterprise searchVertical search

bull Range of selected ldquoreferencerdquo sitesbull Robust control of the crawling frontierbull Extensive content post-processingbull Business-driven decisions about rankingEnterprise search

bull Variety of data sources and data formatsbull Well-defined and limited crawling frontierbull Integration with in-house data sourcesbull Little danger of spambull PageRank-like scoring usually works poorly

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 21: Nutch - web-scale search engine toolkit

21

Nut

ch ndash

Apa

cheC

on U

S 0

9

Face to face with Nutch

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 22: Nutch - web-scale search engine toolkit

22

Nut

ch ndash

Apa

cheC

on U

S 0

9

Simple set-up and runningYou already have Java 5+ rightGet a 10 release or a nightly build (pretty stable)Simple search setup uses Tomcat for search web appCommand-line bash script binnutch

bull Windows users get Cygwinbull Early version of a web-based UI consolebull Hadoop web-based monitoring

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 23: Nutch - web-scale search engine toolkit

23

Nut

ch ndash

Apa

cheC

on U

S 0

9

Configuration filesEdit configuration in confnutch-sitexml

bull Check nutch-defaultxml for defaults and docsbull You MUST at least fill the name of your agentActive plugins configurationPer-plugin propertiesExternal configuration files

bull regex-urlfilterxmlbull regex-normalizexmlbull parse-pluginsxml mapping of MIME type to plugin

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 24: Nutch - web-scale search engine toolkit

24

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch pluginsPlugin-based extensions for

bull Crawl schedulingbull URL filtering and normalizationbull Protocol for getting the contentbull Content parsingbull Text analysis (tokenization)bull Page signature (to detect near-duplicates)bull Indexing filters (index fields amp metadata)bull Snippet generation and highlightingbull Scoring and rankingbull Query translation and expansion (user rarr Lucene)

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 25: Nutch - web-scale search engine toolkit

25

Nut

ch ndash

Apa

cheC

on U

S 0

9

Main Nutch workflow

Inject initial creation of CrawlDBbull Insert seed URLsbull Initial LinkDB is empty

Generate new shards fetchlistFetch raw contentParse content (discovers outlinks)Update CrawlDB from shardsUpdate LinkDB from shardsIndex shards

(repeat)

Command-line binnutchinject

generatefetchparseupdatedbupdatelinkdbindex solrindex

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 26: Nutch - web-scale search engine toolkit

26

Nut

ch ndash

Apa

cheC

on U

S 0

9

Injecting new URL-s

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 27: Nutch - web-scale search engine toolkit

27

Nut

ch ndash

Apa

cheC

on U

S 0

9

Generating fetchlists

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 28: Nutch - web-scale search engine toolkit

28

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow generating fetchlistsWhat to fetch next

bull Breadth-first ndash important due to ldquopolitenessrdquo limitsbull Expired (longer than fetchTime + fetchInterval)bull Highly ranking (PageRank)bull Newly addedFetchlist generation

bull ldquotopNrdquo - select best candidatesminus Priority based on many factors pluggable

Adaptive fetch schedulebull Detect rate of changes amp time of changebull Detect unmodified content

minus Hmm and how to recognize this rarr approximate page signatures (near-duplicate detection)

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 29: Nutch - web-scale search engine toolkit

29

Nut

ch ndash

Apa

cheC

on U

S 0

9

Fetching content

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 30: Nutch - web-scale search engine toolkit

30

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow fetching

Multi-protocol HTTP(s) FTP file etc Coordinates multiple threads accessing the same hostbull ldquopolitenessrdquo issues vs crawling efficiencybull Host an IP or DNS namebull Redirections should follow immediately What kind

Other netiquette issuesbull robotstxt disallowed paths Crawl-DelayPreparation for parsingbull Content type detection issuesbull Parsing usually executed as a separate step

(resource hungry and sometimes unstable)

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 31: Nutch - web-scale search engine toolkit

31

Nut

ch ndash

Apa

cheC

on U

S 0

9

Content processing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 32: Nutch - web-scale search engine toolkit

32

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow content processingProtocol plugins retrieve content as plain bytes + protocol-level metadata (eg HTTP headers)Parse pluginsbull Content is parsed by MIME-type specific parsersbull Content is parsed into parseData (title outlinks other

metadata) and parseText (plain text content)Nutch supports many popular file formats

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 33: Nutch - web-scale search engine toolkit

33

Nut

ch ndash

Apa

cheC

on U

S 0

9

Link inversion

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 34: Nutch - web-scale search engine toolkit

34

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow link inversionPages have outgoing links (outlinks)

hellip I know where Im pointing to

Question who points to mehellip I dont know there is no catalog of pageshellip NOBODY knows for sure either

In-degree indicates importance of the pageAnchor text provides important semantic infoPartial answer invert the outlinks that I know about and group by target

src 1

tgt 2

tgt 3

tgt 4tgt 5

tgt 1

tgt 1

src 2

src 3

src 4src 5

src 1

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 35: Nutch - web-scale search engine toolkit

35

Nut

ch ndash

Apa

cheC

on U

S 0

9

Page importance - scoring

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 36: Nutch - web-scale search engine toolkit

36

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow page scoringQuery-independent importance factor

bull Affects search rankingbull Affects crawl prioritizationMay include arbitrary decision of the operatorCurrently two systems (plugins + tools)OPIC scoring

bull Doesnt require explicit steps - ldquoonlinerdquobull Difficult to stabilizePageRank scoring with loop detection

bull Periodically run to update CrawlDb scoresbull Computationally intensive esp loop detection

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 37: Nutch - web-scale search engine toolkit

37

Nut

ch ndash

Apa

cheC

on U

S 0

9

Indexing

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 38: Nutch - web-scale search engine toolkit

38

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow indexingIndexing plugins

bull Create full-text search index from segment databull Apply adjustments to score per documentbull May further post-process the parsed content (eg

language identification) to facilitate advanced searchIndexers

bull Lucene indexer ndash builds indexes to be served by Nutch search servers

bull Solr indexer ndash submits documents to a Solr instance

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 39: Nutch - web-scale search engine toolkit

39

Nut

ch ndash

Apa

cheC

on U

S 0

9

De-duplication

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 40: Nutch - web-scale search engine toolkit

40

Nut

ch ndash

Apa

cheC

on U

S 0

9

Workflow de-duplicationThe same page may be present in many shards

bull Obsolete versions in older shardsbull Mirrored pages or equivalent (acom rarr wwwacom)Many other pages are almost identical

bull Template-related differences (banners current date)bull Font layout changes minor re-wording

Hmm hellip what is a significant change bull Tricky issue Hint what is the page content

Near-duplicate detection and removalbull Nutch uses approximate page signatures (fingerprints)bull Duplicates are only marked as deleted

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 41: Nutch - web-scale search engine toolkit

41

Nut

ch ndash

Apa

cheC

on U

S 0

9

Cycles may overlap

Fetcher

Parser

Searcher

IndexerUpdater

Generator

Shards(segments)

LinkDB

CrawlDB

Linkinverter

Injector

URL filters amp normalizers parsingindexing filters scoring plugins

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 42: Nutch - web-scale search engine toolkit

42

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch distributed search

Multiple search serversSearch front-end dispatches queries and collects search results

bull multiple access methods (API OpenSearch XML JSON)

Page servers build snippetsFault-tolerant

with degraded quality

Search front-end(s)Web app

APIXML

NutchBean

JSON

Search server(Nutch or Solr)

indexed shards

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 43: Nutch - web-scale search engine toolkit

43

Nut

ch ndash

Apa

cheC

on U

S 0

9

Search configurationNutch syntax is limited ndash on purpose

bull Some queries are costly eg leading wildcard very longbull Some queries may need implicit (hidden) expansionQuery plugins

bull From user query to LuceneSolr queryUser query web search+(urlweb^40 anchorweb^20 contentweb titleweb^15 hostweb^20)+(urlsearch^40 anchorsearch^20 contentsearch titlesearch^15hostsearch^20) urlweb search~10^40 anchorweb search~4^20contentweb search~10 titleweb search~10^15hostweb search~10^20

Search server configurationbull Single search vs distributedbull Using Nutch searcher or Solr or a mixbull Currently no global IDF calculation

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 44: Nutch - web-scale search engine toolkit

44

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment single-serverThe easiest but the most limited deploymentCentralized storage centralized processingHadoop LocalFS and LocalJobTrackerDrop nutchwar in Tomcatwebapps and point to shards

Searchfront-endSearch

front-end

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 45: Nutch - web-scale search engine toolkit

45

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment distributed searchLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

LocalFSStorage

LocalFSStorage

FetcherFetcher

IndexerIndexer

DB mgmtDB mgmt

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 46: Nutch - web-scale search engine toolkit

46

Nut

ch ndash

Apa

cheC

on U

S 0

9

Deployment map-reduce processing amp distrib search

Fully distributed storage and processingLocal storage on search servers preferred (perf)

Searchfront-endSearch

front-end

Search serverSearch server

Search server

DFS Data NodeDFS Data Node

DFS Data Node

DFS Name NodeDFS Name Node(Job Tracker)

FetcherIndexer

DB mgmt

(Job Tracker)FetcherIndexer

DB mgmt

Task TrackerTask Tracker

Task Tracker

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 47: Nutch - web-scale search engine toolkit

47

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch on a Hadoop cluster

Assumes an up amp running Hadoop clusterhellip which is covered elsewhere

Build nutchjob jar and use it as usual

binhadoop jar nutchjob ltclassNamegt ltargsgt

Note Nutch configuration is inside nutchjobminus When you change it you need to rebuild job jar

Searching is not a map-reduce job ndash often on a separate group of machines

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 48: Nutch - web-scale search engine toolkit

48

Nut

ch ndash

Apa

cheC

on U

S 0

9

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 49: Nutch - web-scale search engine toolkit

49

Nut

ch ndash

Apa

cheC

on U

S 0

9

Nutch index

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 50: Nutch - web-scale search engine toolkit

50

Nut

ch ndash

Apa

cheC

on U

S 0

9

hellip and press Search

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 51: Nutch - web-scale search engine toolkit

51

Nut

ch ndash

Apa

cheC

on U

S 0

9

Conclusions(This overview is a tip of the iceberg)

NutchImplements all core search engine componentsScales wellExtremely configurable and modularIts a complete solution ndash and a toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 52: Nutch - web-scale search engine toolkit

52

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of NutchAvoid code duplicationParsing rarr Tika

bull Almost total overlapbull Still some missing functionality in Tika rarr contributePlugins rarr OSGI

bull Home-grown plugin system has some deficienciesbull Initial port availableIndexing amp Search rarr Solr Zookeeper

bull Distributed and replicated search is difficultbull Initial integration needs significant improvementbull Shard management - KattaWeb-graph amp page repository rarr HBase

bull Combine CrawlDB LinkDB and shard storagebull Avoid tedious shard managementbull Initial port available

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 53: Nutch - web-scale search engine toolkit

53

Nut

ch ndash

Apa

cheC

on U

S 0

9

Future of Nutch (2)Whats left then

bull Crawling frontier management discoverybull Re-crawl algorithmsbull Spider trap handlingbull Fetcherbull Ranking enterprise-specific user-feedbackbull Duplicate detection URL aliasing (mirror detection)bull Template detection and cleanup pagelet-level crawlingbull Spam amp junk controlShare code with other crawler projects rarr crawler-commons

Vision aacute la carte toolkit scalable from1-1000s nodes

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch

Page 54: Nutch - web-scale search engine toolkit

54

Nut

ch ndash

Apa

cheC

on U

S 0

9

Summary

QampA

Further informationbull httpluceneapacheorgnutch