
Getting Started With Apache Nutch

By Akaram Siddiqui and Abdulbasit F Shaikh


Installing and Configuring Nutch

Step 1: Grab a distribution of Nutch 2.x from http://apache.claz.org/nutch/2.2/.

Step 2: Download HBase. You can get it from http://archive.apache.org/dist/hbase/hbase-0.90.4/.

Step 3: Extract it.

Step 4: Open hbase-site.xml (/../hbase-0.90.4/conf) and modify it as below:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value><Your path></value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value><Your path></value>
  </property>
</configuration>
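For illustration, on a local single-node setup the two values might simply point at local file-system paths (these example paths are placeholders, not part of the original guide):

<property>
  <name>hbase.rootdir</name>
  <value>file:///home/user/hbase-data</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>file:///home/user/zookeeper-data</value>
</property>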


Step 5: Specify the Gora backend in nutch-site.xml (/../apache-nutch-2.2.1/conf):

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

Step 6: Ensure the gora-hbase dependency is available in ivy/ivy.xml (/../apache-nutch-2.2.1/ivy):

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

(This line is commented out by default, so uncomment it.)

Step 7: Ensure that HBaseStore is set as the default datastore in gora.properties (/../apache-nutch-2.2.1/conf):

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

(This line is not present by default, so add it at the top of the file.)

Step 8: Go to the Apache Nutch home directory (/../apache-nutch-2.2.1) and run the following command from a terminal:

ant runtime

Step 9: Make sure HBase is started and working properly.
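If HBase is not running yet, it can be started from the HBase home directory, using the same command that appears again in the crawl steps below:

cd /../hbase-0.90.4
./bin/start-hbase.sh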


Step 10: To check, go to the HBase home directory (/../hbase-0.90.4) in a terminal and type the following command:

./bin/hbase shell

If everything succeeded, you will see output like this:

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010

hbase(main):001:0>

Step 11: You should then be able to use Nutch from /../apache-nutch-2.2.1/runtime/local/bin.

You can find more details in the log at /../apache-nutch-2.2.1/runtime/local/logs/hadoop.log.

Verify your Nutch installation

1) Go to the local directory of Apache Nutch (/../apache-nutch-2.2.1/runtime/local) in a terminal and type the following command:

bin/nutch

If everything succeeded, you will see output like the following:

Usage: nutch COMMAND
...
Most commands print help when invoked w/o parameters.

2) Run the following command if you are seeing "Permission denied":

chmod +x bin/nutch

3) Set JAVA_HOME if you are seeing "JAVA_HOME not set". On Mac, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=<Your Java path>
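On macOS, one common way to fill in that path (an assumption, not part of the original guide) is to let the system resolve it:

export JAVA_HOME=$(/usr/libexec/java_home)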

Crawl your first website

1) Add your agent name in the value field of the http.agent.name property in nutch-site.xml(/../apache-nutch-2.2.1/runtime/local/conf), for example:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>

2) Go to the local directory of Apache Nutch (/../apache-nutch-2.2.1/runtime/local) and create a directory called urls:

mkdir -p urls


3) cd urls

4) Type the following command to create seed.txt under urls/ (it will contain one URL per line for each site you want Nutch to crawl):

touch seed.txt

5) Edit the file and add the following content:

http://nutch.apache.org/

6) Edit the file regex-urlfilter.txt(/../apache-nutch-2.2.1/conf) and replace

# accept anything else

+.

with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:

+^http://([a-z0-9]*\.)*nutch.apache.org/

This will include any URL in the domain nutch.apache.org.
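As another illustration (example.com is a placeholder domain, not from the original guide), restricting a crawl to example.com would follow the same pattern:

+^http://([a-z0-9]*\.)*example.com/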

Crawling a website using the crawl script

Note: This has been tested with Solr 3.6.2. If you want to run it with a higher version, you will need to configure it accordingly.

1) Download solr from http://apache.mirrors.hoobly.com/lucene/solr/3.6.2/

2) Extract it.

3) Go to the example directory (/../apache-solr-3.6.2/example) in a terminal.


4) Type the following command:

java -jar start.jar

If everything succeeded, you will see output like the following:

...

5948 [main] INFO org.eclipse.jetty.server.AbstractConnector – Started SelectChannelConnector@0.0.0.0:8983

5) Verify the Solr installation by opening the following URL in a browser:

http://localhost:8983/solr/admin/

6) Now both Nutch and Solr are installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). The steps below delegate searching to Solr so that the crawled links become searchable:

7) cp /../apache-nutch-2.2.1/conf/schema.xml /../apache-solr-3.6.2/example/solr/conf/

8) Restart Solr with the command “java -jar start.jar” under /../apache-solr-3.6.2/example

9) Go to the HBase home directory (/../hbase-0.90.4) in a terminal and start HBase with the following command:

./bin/start-hbase.sh


If everything succeeded, you will see output like:

starting Master, logging to logs/hbase-user-master-example.org.out

If you get the following output instead, HBase is already running and there is no need to start it again:

master running as process 2948. Stop it first.

10) Go to the local directory (/../apache-nutch-2.2.1/runtime/local) in a terminal and type the following command:

bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2

If everything succeeded, you will see output ending like this:

...

Adding 1 documents

SOLR dedup -> http://localhost:8983/solr/

The crawl script has a lot of parameters set, and you can modify them to your needs. It is worth understanding the parameters before setting up big crawls.
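As a rough sketch, the positional arguments in the invocation above can be read like this (the parameter names are inferred from that example and may differ slightly between Nutch versions):

# bin/crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2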

Crawling the web, the CrawlDb, and URL filters

Crawling the web was already explained above. You can add more URLs to the seed.txt file and crawl them in the same way.


Updating the CrawlDB is done automatically by the crawl script, as shown above. Previously this had to be done manually, but the Apache Nutch developers replaced the manual steps with the crawl script. The steps the crawl script follows for the CrawlDB are listed below, followed by a sketch of the loop it runs.

1) Generate : $bin/nutch generate $commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId $CRAWL_ID -batchId $batchId

2) Fetch : $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId $CRAWL_ID -threads 50

3) Parse : $bin/nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId $CRAWL_ID

4) Update : $bin/nutch updatedb $commonOptions -crawlId $CRAWL_ID
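Conceptually, the script runs those four steps once per crawl round. A minimal loop sketch, assuming the illustrative values shown here (50000 URLs per fetch list, a 180-minute fetch time limit) rather than whatever your copy of bin/crawl actually sets:

CRAWL_ID=TestCrawl
LIMIT=2
for ((round = 1; round <= LIMIT; round++)); do
  batchId=`date +%s`-$RANDOM   # a fresh batch id for each round
  bin/nutch generate -topN 50000 -noNorm -noFilter -adddays 0 -crawlId $CRAWL_ID -batchId $batchId
  bin/nutch fetch -D fetcher.timelimit.mins=180 $batchId -crawlId $CRAWL_ID -threads 50
  bin/nutch parse $batchId -crawlId $CRAWL_ID
  bin/nutch updatedb -crawlId $CRAWL_ID
done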

URL filters were also explained above. For reference, see step 6 in the "Crawl your first website" section above.

Parsing and Parse Filters

Parsing produces the parsed text of each URL, the outlink URLs used to update the crawldb, and the metadata parsed from each URL.

Parsing is also done by the crawl script, as explained above. To do it manually, you first need to execute the inject, generate, and fetch commands, in that order.

Go to the local directory of Apache Nutch (/../apache-nutch-2.2.1/runtime/local) and type the following commands:

Inject: bin/nutch inject urls (you can pass different arguments as needed)


Generate: bin/nutch generate -topN 1 (you can pass different arguments as needed)

Fetch: bin/nutch fetch -all (you can pass different arguments as needed)

Parse: bin/nutch parse -all (you can pass different arguments as needed)

Parse Filters:

HtmlParseFilter -- permits one to add additional metadata to HTML parses; a minimal sketch of such a filter is shown below.
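The sketch below is illustrative only: the class name, package, and the "text.length" metadata key are invented for this example, and the method signature follows the Nutch 1.x HtmlParseFilter API that the plugin examples later in this guide also use.

package org.apache.nutch.parse.example;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Hypothetical filter that records how many characters of text were parsed from each page.
public class TextLengthParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse != null) {
      // attach an extra piece of metadata to the parse of this URL
      parse.getData().getParseMeta().set("text.length",
          String.valueOf(parse.getText().length()));
    }
    return parseResult;
  }

  public Configuration getConf() { return conf; }

  public void setConf(Configuration conf) { this.conf = conf; }
}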

Analysis, Link analysis, and scoring

The link analysis programs in this package converge to stable global scores for each URL.

WebGraph

The WebGraph program is the first job that must be run once all segments are fetched and ready to be processed. WebGraph is found at org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the program's usage.

usage: WebGraph

-help show this help message

-segment <segment> the segment(s) to use

-webgraphdb <webgraphdb> the web graph database to use

The WebGraph program can take multiple segments to process and requires an output directory in which to place the completed web graph components. The WebGraph creates three different components: an inlink database, an outlink database, and a node database. The inlink database is a listing of each URL and all of its inlinks. The outlink database is a listing of each URL and all of its outlinks. The node database is a listing of each URL with node meta information, including the number of inlinks and outlinks and, eventually, the score for that node.
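For example, the WebGraph invocation used later in the scoring walkthrough looks like this (the segment timestamp is elided, as in the original):

bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/xxxxxxxxxxxxxx/ -webgraphdb crawl/webgraphdb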

Loops

Once the web graph is built we can begin the process of link analysis. Loops is an optional program that attempts to help weed out spam sites by determining link cycles in a web graph. An example of a link cycle would be sites A, B, C, and D, where A links to B, which links to C, which links to D, which links back to A. This program is computationally expensive and usually, due to time and space requirements, can't be run on more than a three or four level depth. While it does identify sites which appear to be spam, and those links are then discounted in the later LinkRank program, its benefit-to-cost ratio is very low. It is included in this package for completeness and because there may be a better way to perform this function with a different algorithm. But on current large production webgraphs, its use is discouraged. Loops is found at org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the program's usage.

usage: Loops

-help show this help message

-webgraphdb <webgraphdb> the web graph database to use

LinkRank

With the web graph built we can now run LinkRank to perform an iterative link analysis. LinkRank is a PageRank-like link analysis program that converges to stable global scores for each URL. Similar to PageRank, the LinkRank program starts with a common score for all URLs. It then creates a global score for each URL based on the number of incoming links and the scores of those links, as well as the number of outgoing links from the page. The process is iterative and scores tend to converge after a given number of iterations. It is different from PageRank in that nepotistic links, such as links internal to a website and reciprocal links between websites, can be ignored. The number of iterations can also be configured; by default 10 iterations are performed. Unlike the previous OPIC scoring, the LinkRank program does not keep scores from one processing run to another. The web graph and the link scores are recreated at each processing run, so we don't have the problem of ever-increasing scores. LinkRank requires the WebGraph program to have completed successfully, and it stores its output scores for each URL in the node database of the webgraph. LinkRank is found at org.apache.nutch.scoring.webgraph.LinkRank. Below is a printout of the program's usage.

usage: LinkRank

-help show this help message

-webgraphdb <webgraphdb> the web graph db to use

ScoreUpdater

Once the LinkRank program has been run and link analysis is completed, the scores must be updated into the crawl database to work with the current Nutch functionality. The ScoreUpdater program takes the scores stored in the node database of the webgraph and updates them in the crawldb. If a URL exists in the crawldb that doesn't exist in the webgraph, then its score is cleared in the crawldb. The ScoreUpdater requires that the WebGraph and LinkRank programs have both been run, and it requires a crawl database to update. ScoreUpdater is found at org.apache.nutch.scoring.webgraph.ScoreUpdater. Below is a printout of the program's usage.

usage: ScoreUpdater

-crawldb <crawldb> the crawldb to use


-help show this help message

-webgraphdb <webgraphdb> the webgraphdb to use

Scoring

Note: Apache Nutch 2.2.1 does not support this, so it has been configured with Apache Nutch 1.7. You can install apache-nutch-1.7 the same way as apache-nutch-2.2.1.

The new scoring functionality can be found in org.apache.nutch.scoring.webgraph. This package contains multiple programs that build web graphs, perform a stable convergent link analysis, and update the crawldb with those scores. To run the scoring, go to the local directory of Apache Nutch (/../apache-nutch-1.7/runtime/local) in a terminal and type the following commands:

bin/nutch inject crawl/crawldb urls/

bin/nutch generate crawl/crawldb/ crawl/segments

bin/nutch fetch crawl/segments/xxxxxxxxxxxxxx/

bin/nutch updatedb crawl/crawldb/ crawl/segments/xxxxxxxxxxxxxxxxx/

bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/xxxxxxxxxxxxxx/ -webgraphdb crawl/webgraphdb

One thing to point out here is that WebGraph is meant to be used on larger web crawls to create web graphs. By default it ignores outlinks to pages on the same domain (including subdomains) and pages with the same hostname. It also limits links to the same page or the same domain to a single outlink per page. All of these options can be changed through the following configuration properties:

<!-- linkrank scoring properties -->


<property>
  <name>link.ignore.internal.host</name>
  <value>true</value>
  <description>Ignore outlinks to the same hostname.</description>
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>true</value>
  <description>Ignore outlinks to the same domain.</description>
</property>
<property>
  <name>link.ignore.limit.page</name>
  <value>true</value>
  <description>Limit to only a single outlink to the same page.</description>
</property>
<property>
  <name>link.ignore.limit.domain</name>
  <value>true</value>
  <description>Limit to only a single outlink to the same domain.</description>
</property>

By default, then, if you are only crawling pages within a single domain or a set of subdomains, all outlinks will be ignored and you will end up with an empty webgraph, which in turn will cause an error in the LinkRank job. The flip side is that by NOT ignoring links to the same domain/host, and by not limiting those links, the webgraph becomes much, much denser, so there are many more links to process, which probably won't affect relevancy as much.
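For a single-domain crawl, a sketch of the overrides you might place in nutch-site.xml (simply flipping the defaults shown above; whether to flip all four properties depends on your crawl) could look like this:

<property>
  <name>link.ignore.internal.host</name>
  <value>false</value>
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>false</value>
</property>

Then continue with the remaining scoring commands: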

bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/

bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb/

bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb/

bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper -scores -topn 1000 -webgraphdb crawl/webgraphdb/ -output crawl/webgraphdb/dump/scores

bin/nutch readdb crawl/crawldb/ -stats

--------------------------------------------------------------------------------

CrawlDb statistics start: crawl/crawldb/


Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

Statistics for CrawlDb: crawl/crawldb/

TOTAL urls: 16711

retry 0: 16686

retry 1: 25

min score: 0.0

avg score: 0.022716654

max score: 0.495

status 1 (db_unfetched): 15739

status 2 (db_fetched): 677

status 3 (db_gone): 75

status 4 (db_redir_temp): 143

status 5 (db_redir_perm): 77

CrawlDb statistics: done

Apache Nutch plugin:

This example is for Nutch 1.4.


Plugin use: I simply want to add a new field to the index. This new field should indicate the length of the parsed content of the respective web page and is therefore called "pageLength".

As a first step, you need to create all the necessary new files. Let's say we call the plugin "myPlugin". Then you need to create the new folder $NUTCH_HOME/src/plugin/myPlugin. Next, simply copy and paste all the files from the urlmeta plugin ($NUTCH_HOME/src/plugin/urlmeta) into the myPlugin folder. Now, rename and delete the appropriate files and directories in order to get the following structure (you can do this within Eclipse as well as directly on the file system):

Create the folder myPlugin in $NUTCH_HOME/src/plugin/myPlugin.

Place the following files inside the myPlugin folder:

plugin.xml, build.xml, ivy.xml, and src/java/org/apache/nutch/indexer/, and add the AddField.java class file inside src/java/org/apache/nutch/indexer/.

Your plugin.xml should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="myPlugin" name="Add Field to Index"
        version="1.0.0" provider-name="your name">
  <runtime>
    <library name="myPlugin.jar">
      <export name="*"/>
    </library>
  </runtime>
  <extension id="org.apache.nutch.indexer.myPlugin"
             name="Add Field to Index"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="myPlugin"
                    class="org.apache.nutch.indexer.AddField"/>
  </extension>
</plugin>

4. You need to change your build.xml file as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project name="myPlugin" default="jar">
  <import file="../build-plugin.xml"/>
</project>

5. Add the following code to your AddField.java class:

package org.apache.nutch.indexer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.parse.Parse;

public class AddField implements IndexingFilter {

  private static final Log LOG = LogFactory.getLog(AddField.class);

  private Configuration conf;

  // Implements the filter method, which gives you access to important objects
  // such as the NutchDocument being built for this URL.
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    String content = parse.getText();
    // add the new field to the document (as a string, so it works with
    // NutchDocument.add regardless of the exact Nutch 1.x signature)
    doc.add("pageLength", String.valueOf(content.length()));
    return doc;
  }

  // Boilerplate
  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

6. In src/plugin/build.xml, add:

<ant dir="myPlugin" target="deploy" />


7. In ./conf/nutch-site.xml, add the following property:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|myPlugin</value>
  <description>Added myPlugin</description>
</property>

8. In $SOLR_HOME/…/solr/conf/schema.xml, add the following line:

<field name="pageLength" type="long" stored="true" indexed="true"/>

9. In $NUTCH_HOME/conf/solrindex-mapping.xml, add the following line:

<field dest="pageLength" source="pageLength"/>

Now, as a last step, you need to rebuild Nutch by running ant against $NUTCH_HOME/build.xml.
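For example (using the same ant target as in the installation section; adjust if your build differs):

cd $NUTCH_HOME
ant runtime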

Apache Nutch plugin:

This example is for Nutch 1.3.

This section covers the integral components required to develop and use a plugin. As you can see inside the $NUTCH_HOME/src/plugin directory, the plugin folder urlmeta contains the following:

A plugin.xml file that tells Nutch about your plugin.

A build.xml file that tells ant how to build your plugin.


An ivy.xml file containing either the description of the module's dependencies, its published artifacts, and its configurations, or else the location of another file which specifies this information.

A /src directory containing the source code of our plugin with the directory structure shown in the hierarchical view below.

plugin.xml
build.xml
ivy.xml
src/java/org/apache/nutch/indexer/
    package.html
    URLMetaIndexingFilter.java
src/java/org/apache/nutch/scoring/
    package.html
    URLMetaScoringFilter.java

4) Your plugin.xml file should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="urlmeta"
        name="URL Meta Indexing Filter"
        version="1.0.0"
        provider-name="sgonyea">
  <runtime>
    <library name="urlmeta.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.apache.nutch.indexer.urlmeta"
             name="URL Meta Indexing Filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="indexer-urlmeta"
                    class="org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter"/>
  </extension>
  <extension id="org.apache.nutch.scoring.urlmeta"
             name="URL Meta Scoring Filter"
             point="org.apache.nutch.scoring.ScoringFilter">
    <implementation id="scoring-urlmeta"
                    class="org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter" />
  </extension>
</plugin>

5) build.xml looks like this:

<?xml version="1.0"?>
<project name="recommended" default="jar-core">
  <import file="../build-plugin.xml"/>
</project>


6) ivy.xml

This file describes the plugin's dependencies on other libraries. It looks like this:

<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
      Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="${nutch.root}/ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>
</ivy-module>

7) The Indexer Extension

This is the source code for the IndexingFilter extension. Meta tags that are included in your crawl URLs during injection will be propagated throughout the outlinks of those crawl URLs. This means that when you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried.

package org.apache.nutch.indexer.urlmeta;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class URLMetaIndexingFilter implements IndexingFilter {

  private static final Log LOG = LogFactory
      .getLog(URLMetaIndexingFilter.class);
  private static final String CONF_PROPERTY = "urlmeta.tags";
  private static String[] urlMetaTags;
  private Configuration conf;

  /**
   * This will take the metatags that you have listed in your "urlmeta.tags"
   * property, and looks for them inside the CrawlDatum object. If they exist,
   * this will add it as an attribute inside the NutchDocument.
   *
   * @see IndexingFilter#filter
   */
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    if (conf != null)
      this.setConf(conf);

    if (urlMetaTags == null || doc == null)
      return doc;

    for (String metatag : urlMetaTags) {
      Text metadata = (Text) datum.getMetaData().get(new Text(metatag));
      if (metadata != null)
        doc.add(metatag, metadata.toString());
    }

    return doc;
  }

  /** Boilerplate */
  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    if (conf == null)
      return;

    urlMetaTags = conf.getStrings(CONF_PROPERTY);
  }
}

8) The Scoring Extension

The following is the code for the URLMetaScoringFilter extension. If the document being indexed had a recommended meta tag this extension adds a lucene text field to the index called "recommended" with the content of that meta tag.

package org.apache.nutch.scoring.urlmeta;

import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.Map.Entry;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.ScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public class URLMetaScoringFilter extends Configured implements ScoringFilter {

  private static final Log LOG = LogFactory.getLog(URLMetaScoringFilter.class);
  private static final String CONF_PROPERTY = "urlmeta.tags";
  private static String[] urlMetaTags;
  private Configuration conf;

  public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
      ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
      CrawlDatum adjust, int allCount) throws ScoringFilterException {
    if (urlMetaTags == null || targets == null || parseData == null)
      return adjust;

    Iterator<Entry<Text, CrawlDatum>> targetIterator = targets.iterator();
    while (targetIterator.hasNext()) {
      Entry<Text, CrawlDatum> nextTarget = targetIterator.next();

      for (String metatag : urlMetaTags) {
        String metaFromParse = parseData.getMeta(metatag);
        if (metaFromParse == null)
          continue;

        nextTarget.getValue().getMetaData().put(new Text(metatag),
            new Text(metaFromParse));
      }
    }
    return adjust;
  }

  public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) {
    if (urlMetaTags == null || content == null || datum == null)
      return;

    for (String metatag : urlMetaTags) {
      Text metaFromDatum = (Text) datum.getMetaData().get(new Text(metatag));
      if (metaFromDatum == null)
        continue;

      content.getMetadata().set(metatag, metaFromDatum.toString());
    }
  }

  public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    if (urlMetaTags == null || content == null || parse == null)
      return;

    for (String metatag : urlMetaTags) {
      String metaFromContent = content.getMetadata().get(metatag);
      if (metaFromContent == null)
        continue;

      parse.getData().getParseMeta().set(metatag, metaFromContent);
    }
  }

  /** Boilerplate */
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    return initSort;
  }

  /** Boilerplate */
  public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
      throws ScoringFilterException {
    return initScore;
  }

  public void initialScore(Text url, CrawlDatum datum)
      throws ScoringFilterException {
    return;
  }

  public void injectedScore(Text url, CrawlDatum datum)
      throws ScoringFilterException {
    return;
  }

  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List inlinked) throws ScoringFilterException {
    return;
  }

  public void setConf(Configuration conf) {
    super.setConf(conf);
    if (conf == null)
      return;

    urlMetaTags = conf.getStrings(CONF_PROPERTY);
  }

  public Configuration getConf() {
    return conf;
  }
}

9) Getting Nutch to Use Your Plugin

In order to get Nutch to use your plugin, you need to edit your conf/nutch-site.xml file and add in a block like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

10) You'll want to edit the regular expression so that it includes the name of the urlmeta plugin.

11) Getting Ant to Compile Your Plugin

In order for ant to compile and deploy your plugin, you need to edit the src/plugin/build.xml file (NOT the build.xml in the root of your checkout directory). You'll see a number of lines that look like the one below.

Edit this block to add a line for your plugin before the </target> tag:

<ant dir="urlmeta" target="deploy" />

Running 'ant' in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl, both the scoring and indexing extensions will be used, which lets us search for meta tags within our Solr index.

Run the crawl from a terminal with the following command:

bin/nutch crawl ./urls/seed.txt/ -solr http://localhost:8983/solr/ -depth 3 -topN 5
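For the urlmeta plugin to have anything to propagate, the tags also need to be named in conf/nutch-site.xml and attached to the seed URLs. The sketch below is an assumption based on the urlmeta.tags property used in the code above and the Nutch 1.x seed-file syntax of tab-separated key=value pairs; the tag name "recommended" is only an example:

<property>
  <name>urlmeta.tags</name>
  <value>recommended</value>
</property>

And in urls/seed.txt (the separator between the URL and the key=value pair is a tab):

http://nutch.apache.org/	recommended=yes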
