University of Cyprus
Department of
Computer Science
EPL660: Information
Retrieval and Search
Engines – Lab 8
Παύλος Αντωνίου
Office: B109, ΘΕΕ01
What is Apache Nutch?
• Production ready Web Crawler
• Operates at one of three scales:
– local filesystem (reliable, no network errors, caching is
unnecessary)
– Intranet (local/corporate network)
– whole web (whole Web crawling is difficult)
• Nutch can run on a single machine (local mode), but gains a lot of its strength from running on a Hadoop cluster (deploy mode)
• Relies on Apache Hadoop data structures, which are
great for batch processing
• Open source
• Implemented in Java
Nutch Code Bases
• Nutch 1.x:
– A well-matured, production-ready crawler
– Fine grained configuration
– Relies on Apache Hadoop data structures
• Nutch 2.x:
– An emerging alternative taking direct inspiration from 1.x
– Differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object-to-persistent mappings
– Provides an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions
Nutch vs Lucene
• Nutch uses Lucene (through Solr or Elasticsearch) for indexing
• Common question: "Should I use Lucene or Nutch?"
– Simple answer: use Lucene if you don't need a web crawler, i.e. a component that fetches the documents to be indexed
• Nutch is a better fit for sites
– where you don't have direct access to the underlying data
– where data comes from disparate sources
• multiple domains
• different doc formats: JSON, XML, text, HTML, ...
Nutch vs Solr/Elasticsearch
• Nutch is a web crawler
– collects web pages or other web-accessible resources
– uses Solr or Elasticsearch for indexing
• Solr/Elasticsearch is a search platform
– No crawling: it does not fetch the data, you have to feed it
– Perfect if you already have the data to be indexed (in XML, JSON, a database, etc.)
Nutch building blocks
Nutch Data
• Nutch data is composed of:
– crawl/crawldb
• contains information about all pages (URLs) known
to the crawler and their status, such as the last time
it visited the page, its fetching status, refresh
interval, content checksum, page importance, etc.
– crawl/linkdb
• for each URL known to Nutch, it contains a list of
other URLs pointing to it (incoming links) and their
associated anchor text (from HTML <a href="…">anchor text</a> elements)
Nutch Data
– crawl/segments
Segments are directories with the following
subdirectories:
• a crawl_generate names a set of URLs to be fetched
• a crawl_fetch contains the status of fetching each URL
• a content contains the raw content retrieved from each
URL (for indexing)
• a parse_text contains the parsed text of each URL
• a parse_data contains outlinks and metadata parsed
from each URL (such as anchor text)
• a crawl_parse contains the outlink URLs, used to
update the crawldb
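Once a crawl has run, these structures can be inspected from the command line. A minimal sketch, assuming the crawl/ directory layout used in the walkthrough later in this lab (linkdump is just an arbitrary output directory name):
bin/nutch readdb crawl/crawldb -stats             # summary of all known URLs and their statuses
bin/nutch readlinkdb crawl/linkdb -dump linkdump  # dump incoming links and anchor text per URL
bin/nutch readseg -list -dir crawl/segments       # list segments with generated/fetched/parsed counts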
Crawling frontier challenge
• No authoritative catalog of web pages
• Where to start crawling from?
• Crawlers need to discover their view of web
universe
– Start from “seed list” & follow (walk) some (useful?
interesting?) outlinks
• Many dangers of simply wandering around
– explosion or collapse of the frontier; collecting
unwanted content (spam, junk, offensive)
Main Nutch workflow
• Inject: initial creation of CrawlDB
– Insert seed URLs to CrawlDB
– Initial LinkDB is empty
• Generate a new shard's fetchlist (from crawldb to crawl/segments/crawl_generate)
• Fetch raw content
• Parse content (discovers outlinks)
• Update CrawlDB from shards
• Update LinkDB from shards
• Index shards
• Repeat
• Every step is implemented as one (or more) MapReduce job(s)
• Command-line: bin/nutch inject | generate | fetch | parse | updatedb | invertlinks | index / solrindex
Injecting new URLs
1) Specify a list of URLs you want to crawl
2) Use a URL filter
3) Use the injector to add URLs to the crawldb
Note: filters, normalizers and plugins allow Nutch to be highly modular, flexible and very customizable throughout the whole process.
Generating fetchlists
4) Generate a fetch list from the crawldb
5) Create a segment directory for the generated fetch list
Fetching content
6) Fetch segment
Content processing
7) Parse the results and update CrawlDB
Link inversion
8) Before indexing, invert all links, so that incoming anchor
text can be indexed with pages
Link Inversion
• Pages (urls) have outgoing links (outlinks)
– … I know where I am pointing to
• Question: Who points to me?
– … I don’t know, there is no catalog of pages
– … NOBODY knows for sure either!
• In-degree may indicate importance of the page
• Anchor text provides important semantic info
• Answer: invert the outlinks that I know about
Link Inversion as MR job
• Goal: Compute inlinks for all downloaded and
parsed pages
• Input: each page as a pair <srcUrl, ParseData>
– ParseData contains the page's outlinks (destUrls)
• Map: <srcUrl, ParseData> → <destUrl, Inlinks>
– where Inlinks: <srcUrl, anchorText>
• Reduce: map output pairs <destUrl, Inlinks> are grouped by destUrl; the Inlinks are appended into a dedicated Java Writable class
• Output: <destUrl, list of Inlinks>
Page importance - scoring
9) Page importance metadata based on inverted links is stored in CrawlDB
Indexing
10) Using data from all possible sources (crawlDB, linkDB,
segments) the indexer creates an index and saves it within
the Solr directory. For indexing, the Lucene library is used.
11) Users can search for information
regarding the crawled web pages via Solr.
Nutch from binary distribution
• Download the Apache Nutch 1.16 binary package from the downloads page at http://nutch.apache.org/ (you can alternatively download Nutch 2.4)
• Unzip your binary Nutch package
• cd apache-nutch-1.16/
• Confirm correct installation
– run "bin/nutch"
• If you are seeing "Permission denied"
– run "chmod +x bin/nutch"
Crawl your first website
• Nutch requires two configuration changes before
a website can be crawled:
1. Customize your crawl properties, where at a
minimum, you provide a name for your crawler
for external servers to recognize
2. Set a seed list of URLs to crawl
Customize your crawl properties
• Default crawl properties: conf/nutch-default.xml
– This file should mostly remain unchanged
• conf/nutch-site.xml serves as a place to add your own custom crawl properties that override conf/nutch-default.xml
– Add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, inside the <configuration> element:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
Crawl your first website: Seed list
• A URL seed list is a text file listing websites, one per line, which Nutch will crawl
• Create a URL seed list:
– mkdir -p urls
– cd urls
– nano seed.txt to create a text file seed.txt under urls/ with one URL per line for each site you want Nutch to crawl, e.g.:
• http://nutch.apache.org/
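The seed list can also be created non-interactively from the shell; a minimal sketch using the single example URL above:
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt
cat urls/seed.txt    # one URL per line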
Configure Reg. Expression Filters
• conf/regex-urlfilter.txt provides regular expressions that allow Nutch to filter and narrow down the types of web resources to crawl and download
• Edit the file conf/regex-urlfilter.txt and
REPLACE
# accept anything else
+.
WITH
+^http://([a-z0-9]*\.)*nutch.apache.org/
if, for example, you wished to limit the crawl to
the nutch.apache.org domain
• NOTE: Not specifying any domains to include within regex-urlfilter.txt will lead to all
domains linking to your seed URLs file being crawled as well.
Seeding crawldb with list of URLs
• The injector adds URLs to the crawldb
– bin/nutch inject crawl/crawldb urls
• STEP 1: FETCHING, PARSING PAGES
• Generate fetch list for all pages due to be fetched.
The fetch list is placed in a newly created
segment directory
– bin/nutch generate crawl/crawldb crawl/segments
– The segment directory is named after the time it was created; save its name in a shell variable:
• s1=`ls -d crawl/segments/2* | tail -1`
• echo $s1
• Run the fetcher on this segment
– bin/nutch fetch $s1
Seeding crawldb with list of URLs
• Parse the entries
– bin/nutch parse $s1
• When this is complete, we update the crawldb
database with the results of the fetch:
– bin/nutch updatedb crawl/crawldb $s1
• First fetch: the crawldb now contains updated entries for all initial pages, plus new entries for newly discovered pages linked from the initial set.
Seeding crawldb with list of URLs
• Now we generate and fetch a new segment containing the top-scoring 1,000 pages:
– bin/nutch generate crawl/crawldb crawl/segments -topN 1000
– s2=`ls -d crawl/segments/2* | tail -1`
– bin/nutch fetch $s2
– bin/nutch parse $s2
– bin/nutch updatedb crawl/crawldb $s2
• Let's fetch one more round:
– bin/nutch generate crawl/crawldb crawl/segments -topN 1000
– s3=`ls -d crawl/segments/2* | tail -1`
– bin/nutch fetch $s3
– bin/nutch parse $s3
– bin/nutch updatedb crawl/crawldb $s3
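Each additional round repeats the same generate / fetch / parse / updatedb sequence, so further rounds can be scripted; a minimal sketch (the number of rounds and the -topN value are illustrative and can be tuned):
for round in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $segment
  bin/nutch parse $segment
  bin/nutch updatedb crawl/crawldb $segment
done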
Seeding crawldb with list of URLs
• STEP 2: INVERTLINKS
• Before indexing we first invert all links, so that we
may index incoming anchor text with the pages.
– bin/nutch invertlinks crawl/linkdb -dir crawl/segments
• STEP 3: INDEXING INTO APACHE SOLR [Nutch-Solr integration needed]
• Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
• Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize
Seeding crawldb with list of URLs
• STEP 4: DELETING DUPLICATES
• Ensure URLs are unique in the index
• Usage: bin/nutch solrdedup <solr url>
• Example: bin/nutch solrdedup http://localhost:8983/solr
• STEP 5: CLEANING SOLR
• Scans the crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents
• Usage: bin/nutch solrclean <crawldb> <solrurl>
• Example: bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
All In One: Using the Crawl Command
bin/crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
-i|--index Indexes crawl results into a configured indexer
-D A Java property to pass to Nutch calls
Seed Dir Directory in which to look for a seeds file
Crawl Dir Directory where the crawl/link/segments dirs are saved
Num Rounds The number of rounds to run this crawl for
Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
Nutch Command Line Options
Below are some of the command line options
• bin/nutch readdb crawlDir/crawldb -stats
• bin/nutch readdb crawlDir/crawldb -dump outdump
• bin/nutch readdb crawlDir/crawldb -topN 2 outreaddbtop
• bin/nutch readlinkdb crawlDir/linkdb -dump outputlinkdb
For more options:
http://wiki.apache.org/nutch/CommandLineOptions
Nutch deploy mode
• local mode: run Nutch in a single process on one machine, using Hadoop as a dependency
• deploy mode: takes into account the Hadoop configuration installed on the machine
– Copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml from /usr/local/hadoop/conf to ~/apache-nutch-1.16/conf
• sudo cp /usr/local/hadoop/conf/hadoop-env.sh ~/apache-nutch-1.16/conf
• sudo cp /usr/local/hadoop/conf/hdfs-site.xml ~/apache-nutch-1.16/conf
• sudo cp /usr/local/hadoop/conf/mapred-site.xml ~/apache-nutch-1.16/conf
• sudo cp /usr/local/hadoop/conf/core-site.xml ~/apache-nutch-1.16/conf
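In deploy mode the jobs run on Hadoop and input/output paths refer to HDFS, so the seed list must first be copied there. A minimal sketch, assuming HDFS is running, the hdfs command is on the PATH, and the same seed directory and Solr URL as in the local-mode example (exact requirements depend on your Hadoop/Nutch setup):
hdfs dfs -mkdir -p urls
hdfs dfs -put urls/seed.txt urls/
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2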
Integrate Solr with Nutch
• https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
• Replace Solr schema.xml with Nutch-specific
schema.xml
• Run the Solr Index command:
– bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
Useful Links
• http://wiki.apache.org/nutch/NutchTutorial
• http://wiki.apache.org/nutch/
• http://nutch.apache.org/
• http://wiki.apache.org/nutch/CommandLineOptions
• http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html