University of Cyprus
Department of
Computer Science
EPL660: Information
Retrieval and Search
Engines – Lab 8
Παύλος Αντωνίου
Office: B109, ΘΕΕ01
What is Apache Nutch?
• Production ready Web Crawler
• Operates at one of three scales:
– local filesystem (reliable, no network errors, caching is
unnecessary)
– Intranet (local/corporate network)
– whole web (whole Web crawling is difficult)
• Nutch can run on a single machine (local mode), but gains a lot of its strength from running on a Hadoop cluster (deploy mode)
• Relies on Apache Hadoop data structures, which are
great for batch processing
• Open source
• Implemented in Java
Nutch Code Bases
• Nutch 1.x:
– A well-matured, production-ready crawler
– Fine grained configuration
– Relies on Apache Hadoop data structures
• Nutch 2.x:
– An emerging alternative taking direct inspiration from 1.x
– Differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object-to-persistent mappings
– Provides an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions
Nutch vs Lucene
• Nutch uses Lucene (through Solr or Elasticsearch) for indexing
• Common question: "Should I use Lucene or Nutch?"
– Simple answer: use Lucene if you don't need a web crawler, i.e. a component that fetches the documents to be indexed
• Nutch is a better fit for sites
– where you don't have direct access to the underlying data
– where data comes from disparate sources
• multiple domains
• different doc formats: JSON, XML, text, HTML, ...
Nutch vs Solr/Elasticsearch
• Nutch is a web crawler
– collects web pages or other web-accessible resources
– uses Solr or Elasticsearch for indexing
• Solr/Elasticsearch is a search platform
– No crawling: it does not fetch the data, you have to feed it
– Perfect if you already have the data to be indexed (in XML, JSON, a database, etc.)
Nutch building blocks
Nutch Data
• Nutch data is composed of:
– crawl/crawldb
• contains information about all pages (URLs) known
to the crawler and their status, such as the last time
it visited the page, its fetching status, refresh
interval, content checksum, page importance, etc.
– crawl/linkdb
• for each URL known to Nutch, it contains a list of
other URLs pointing to it (incoming links) and their
associated anchor text (from HTML <a href="…">anchor text</a> elements)
Nutch Data
– crawl/segments
Segments are directories with the following
subdirectories:
• a crawl_generate names a set of URLs to be fetched
• a crawl_fetch contains the status of fetching each URL
• a content contains the raw content retrieved from each
URL (for indexing)
• a parse_text contains the parsed text of each URL
• a parse_data contains outlinks and metadata parsed
from each URL (such as anchor text)
• a crawl_parse contains the outlink URLs, used to
update the crawldb
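Once a crawl has run, these structures can be inspected from the command line. A minimal sketch, assuming the crawl/ directory layout used in the walkthrough later in this lab (linkdump is just an arbitrary output directory name):
bin/nutch readdb crawl/crawldb -stats             # summary of all known URLs and their statuses
bin/nutch readlinkdb crawl/linkdb -dump linkdump  # dump incoming links and anchor text per URL
bin/nutch readseg -list -dir crawl/segments       # list segments with generated/fetched/parsed counts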
Crawling frontier challenge
• No authoritative catalog of web pages
• Where to start crawling from?
• Crawlers need to discover their view of web
universe
– Start from “seed list” & follow (walk) some (useful?
interesting?) outlinks
• Many dangers of simply wandering around
– explosion or collapse of the frontier; collecting
unwanted content (spam, junk, offensive)
Main Nutch workflow
• Inject: initial creation of CrawlDB
– Insert seed URLs to CrawlDB
– Initial LinkDB is empty
• Generate a new shard's fetchlist (from crawldb to crawl/segments/crawl_generate)
• Fetch raw content
• Parse content (discovers outlinks)
• Update CrawlDB from shards
• Update LinkDB from shards
• Index shards
• Repeat
• Every step is implemented as one (or more) MapReduce job(s)
• Command-line: bin/nutch inject | generate | fetch | parse | updatedb | invertlinks | index / solrindex
Injecting new URLs
1) Specify a list of URLs you want to crawl
2) Use a URL filter
3) Use the injector to add URLs to the crawldb
Note: filters, normalizers and plugins allow Nutch to be highly modular, flexible and very customizable throughout the whole process.
Generating fetchlists
4) Generate a fetch list from the crawldb
5) Create a segment directory for the generated fetch list
Fetching content
6) Fetch segment
Content processing
7) Parse the results and update CrawlDB
Link inversion
8) Before indexing, invert all links, so that incoming anchor
text can be indexed with pages
Link Inversion
• Pages (urls) have outgoing links (outlinks)
– … I know where I am pointing to
• Question: Who points to me?
– … I don’t know, there is no catalog of pages
– … NOBODY knows for sure either!
• In-degree may indicate importance of the page
• Anchor text provides important semantic info
• Answer: invert the outlinks that I know about
Link Inversion as MR job
• Goal: Compute inlinks for all downloaded and
parsed pages
• Input: each page as a pair <srcUrl, ParseData>
– ParseData contains the page's outlinks (destUrls)
• Map: <srcUrl, ParseData> → <destUrl, Inlinks>
– where Inlinks: <srcUrl, anchorText>
• Reduce: map output pairs <destUrl, Inlinks> are grouped by destUrl; the Inlinks are appended into a dedicated Java Writable class
• Output: <destUrl, list of Inlinks>
Page importance - scoring
9) Page importance metadata based on inverted links is stored in CrawlDB
Indexing
10) Using data from all possible sources (crawlDB, linkDB,
segments) the indexer creates an index and saves it within
the Solr directory. For indexing, the Lucene library is used.
11) Users can search for information
regarding the crawled web pages via Solr.
Nutch from binary distribution
• Download the Apache Nutch 1.16 binary package from the downloads page at http://nutch.apache.org/ (you can alternatively download Nutch 2.4)
• Unzip your binary Nutch package
• cd apache-nutch-1.16/
• Confirm correct installation
– run "bin/nutch"
• If you are seeing "Permission denied"
– run "chmod +x bin/nutch"
Crawl your first website
• Nutch requires two configuration changes before
a website can be crawled:
1. Customize your crawl properties, where at a
minimum, you provide a name for your crawler
for external servers to recognize
2. Set a seed list of URLs to crawl
Customize your crawl properties
• Default crawl properties: conf/nutch-default.xml
– This file should mostly remain unchanged
• conf/nutch-site.xml serves as a place to add your own custom crawl properties that override conf/nutch-default.xml
– Add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, inside the <configuration> element:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
Crawl your first website: Seed list
• A URL seed list is a text file listing websites, one per line, which Nutch will crawl
• Create a URL seed list:
– mkdir -p urls
– cd urls
– nano seed.txt to create a text file seed.txt under urls/ with one URL per line for each site you want Nutch to crawl, e.g.:
• http://nutch.apache.org/
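The seed list can also be created non-interactively from the shell; a minimal sketch using the single example URL above:
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt
cat urls/seed.txt    # one URL per line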
Configure Reg. Expression Filters
• conf/regex-urlfilter.txt provides regular expressions that allow Nutch to filter and narrow down the types of web resources to crawl and download
• Edit the file conf/regex-urlfilter.txt and
REPLACE
# accept anything else
+.
WITH
+^http://([a-z0-9]*\.)*nutch.apache.org/
if, for example, you wished to limit the crawl to
the nutch.apache.org domain
• NOTE: Not specifying any domains to include within regex-urlfilter.txt will lead to all
domains linking to your seed URLs file being crawled as well.
Seeding crawldb with list of URLs
• The injector adds URLs to the crawldb
– bin/nutch inject crawl/crawldb urls
• STEP 1: FETCHING, PARSING PAGES
• Generate fetch list for all pages due to be fetched.
The fetch list is placed in a newly created
segment directory
– bin/nutch generate crawl/crawldb crawl/segments
– The segment directory is named after the time it was created; save its name in a shell variable:
• s1=`ls -d crawl/segments/2* | tail -1`
• echo $s1
• Run the fetcher on this segment
– bin/nutch fetch $s1
Seeding crawldb with list of URLs
• Parse the entries
– bin/nutch parse $s1
• When this is complete, we update the crawldb
database with the results of the fetch:
– bin/nutch updatedb crawl/crawldb $s1
• First fetch: the crawldb now contains updated entries for all initial pages, plus new entries for newly discovered pages linked from the initial set.
Seeding crawldb with list of URLs
• Now we generate and fetch a new segment containing the top-scoring 1,000 pages:
– bin/nutch generate crawl/crawldb crawl/segments -topN 1000
– s2=`ls -d crawl/segments/2* | tail -1`
– bin/nutch fetch $s2
– bin/nutch parse $s2
– bin/nutch updatedb crawl/crawldb $s2
• Let's fetch one more round:
– bin/nutch generate crawl/crawldb crawl/segments -topN 1000
– s3=`ls -d crawl/segments/2* | tail -1`
– bin/nutch fetch $s3
– bin/nutch parse $s3
– bin/nutch updatedb crawl/crawldb $s3
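Each additional round repeats the same generate / fetch / parse / updatedb sequence, so further rounds can be scripted; a minimal sketch (the number of rounds and the -topN value are illustrative and can be tuned):
for round in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $segment
  bin/nutch parse $segment
  bin/nutch updatedb crawl/crawldb $segment
done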
Seeding crawldb with list of URLs
• STEP 2: INVERTLINKS
• Before indexing we first invert all links, so that we
may index incoming anchor text with the pages.
– bin/nutch invertlinks crawl/linkdb -dir crawl/segments
• STEP 3: INDEXING INTO APACHE SOLR [Nutch-Solr integration needed]
• Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
• Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize
Seeding crawldb with list of URLs
• STEP 4: DELETING DUPLICATES
• Ensure URLs are unique in the index
• Usage: bin/nutch solrdedup <solr url>
• Example: bin/nutch solrdedup http://localhost:8983/solr
• STEP 5: CLEANING SOLR
• Scans the crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents
• Usage: bin/nutch solrclean <crawldb> <solrurl>
• Example: bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
All In One: Using the Crawl Command
bin/crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
-i|--index Indexes crawl results into a configured indexer
-D A Java property to pass to Nutch calls
Seed Dir Directory in which to look for a seeds file
Crawl Dir Directory where the crawl/link/segments dirs are saved
Num Rounds The number of rounds to run this crawl for
Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
Nutch Command Line Options
Below are some of the command line options
• bin/nutch readdb crawlDir/crawldb -stats
• bin/nutch readdb crawlDir/crawldb -dump outdump
• bin/nutch readdb crawlDir/crawldb -topN 2 outreaddbtop
• bin/nutch readlinkdb crawlDir/linkdb -dump outputlinkdb
For more options:
http://wiki.apache.org/nutch/CommandLineOptions
Nutch deploy mode
• local mode: run Nutch in a single process on one machine, using Hadoop as a dependency
• deploy mode: takes into account the Hadoop configuration installed on the machine
– Copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml from /usr/local/hadoop/conf to ~/apache-nutch-1.16/conf
• sudo cp /usr/local/hadoop/conf/hadoop-env.sh ~/apache-nutch-1.16/conf
• sudo cp /usr/local/hadoop/conf/hdfs-site.xml ~/apache-nutch-1.16/conf
• sudo cp /usr/local/hadoop/conf/mapred-site.xml ~/apache-nutch-1.16/conf
• sudo cp /usr/local/hadoop/conf/core-site.xml ~/apache-nutch-1.16/conf
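In deploy mode the jobs run on Hadoop and input/output paths refer to HDFS, so the seed list must first be copied there. A minimal sketch, assuming HDFS is running, the hdfs command is on the PATH, and the same seed directory and Solr URL as in the local-mode example (exact requirements depend on your Hadoop/Nutch setup):
hdfs dfs -mkdir -p urls
hdfs dfs -put urls/seed.txt urls/
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2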
Integrate Solr with Nutch
• https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
• Replace Solr schema.xml with Nutch-specific
schema.xml
• Run the Solr Index command:
– bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
Useful Links
• http://wiki.apache.org/nutch/NutchTutorial
• http://wiki.apache.org/nutch/
• http://nutch.apache.org/
• http://wiki.apache.org/nutch/CommandLineOptions
• http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html