Web Scraping Using Nutch and Solr - Part 2

The following example assumes that you have

Watched Web Scraping with Nutch and Solr

The video ID of the above movie is cAiYBD4BQeE

Set up Linux based Nutch/Solr environment

Run the web scrape in the above movie

Now we will

Clean up that environment

Web scrape a parameterised url

View the urls in the data

Empty Nutch Database

Clean up the Nutch crawl database

Previously used apache-nutch-1.6/nutch_start.sh

This contained -dir crawl option

This created apache-nutch-1.6/crawl directory

Which contains our Nutch data

Clean this with

cd apache-nutch-1.6; rm -rf crawl

Safe to delete only because it contained dummy data!

Next run of script will create dir again
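
The clean-up above can be wrapped in a small guard, so that rm -rf only runs when the crawl directory is really there. This is a minimal sketch: the function name clean_crawl is hypothetical (it is not part of Nutch), and its argument is assumed to be the path to your apache-nutch-1.6 directory.

```shell
# Hypothetical helper (not part of Nutch): remove <nutch_home>/crawl if present.
clean_crawl() {
  nutch_home="$1"
  # Refuse to run without a path, so we never build the path "/crawl".
  if [ -z "$nutch_home" ]; then
    echo "usage: clean_crawl <nutch_home>" >&2
    return 1
  fi
  crawl_dir="$nutch_home/crawl"
  if [ -d "$crawl_dir" ]; then
    rm -rf "$crawl_dir"
    echo "removed $crawl_dir"
  else
    echo "no crawl directory at $crawl_dir"
  fi
}

# Usage, from the directory that holds apache-nutch-1.6:
#   clean_crawl apache-nutch-1.6
```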

Empty Solr Database

Clean Solr database via a url

Bookmark this url

Only use it if you need to empty your data

Run the following (with the Solr server running)

curl 'http://localhost:8983/solr/update?commit=true' -d '<delete><query>*:*</query></delete>'
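
Because this request deletes everything in the index, it can help to wrap it in a function with a dry-run switch. The sketch below assumes curl is installed and uses Solr's XML delete-by-query message; the function name solr_delete_all and the DRY_RUN variable are hypothetical names chosen for this example.

```shell
# Hypothetical wrapper around the delete-all request.
# Set DRY_RUN=1 to print the request instead of sending it.
solr_delete_all() {
  solr_url="${1:-http://localhost:8983/solr}"
  # Delete-by-query message that matches every document in the index.
  body='<delete><query>*:*</query></delete>'
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST $solr_url/update?commit=true $body"
  else
    curl -s "$solr_url/update?commit=true" \
      -H 'Content-Type: text/xml' \
      --data-binary "$body"
  fi
}

# Show what would be sent, without touching the index:
DRY_RUN=1 solr_delete_all
```

Run solr_delete_all without DRY_RUN only when you really want to empty the index.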

Set up Nutch

Now we will do something more complex

Web scrape a url that has parameters i.e.

http://<site>/<path>?var1=val1&var2=val2

This web scrape will

Have extra url characters '?=&'

Need greater search depth

Need better url filtering
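
To see what the extra characters do, a parameterised url can be taken apart in the shell: '?' starts the parameter list, '&' separates the parameters and '=' splits each name from its value. This is only an illustration; the function name show_params and the example.com host are hypothetical.

```shell
# Hypothetical helper: print each "name = value" pair found in a url.
show_params() {
  url="$1"
  # Keep only the text after the first '?', if there is one.
  case "$url" in
    *\?*) query="${url#*\?}" ;;
    *)    query="" ;;
  esac
  # Split the query string on '&', then split each pair on the first '='.
  old_ifs="$IFS"
  IFS='&'
  for pair in $query; do
    echo "${pair%%=*} = ${pair#*=}"
  done
  IFS="$old_ifs"
}

show_params 'http://example.com/Search?var1=val1&var2=val2'
```

The url is single quoted because an unquoted '&' would be treated by the shell as a background operator.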

Remember that you need to get permission to scrape a third party web site

Nutch Configuration

Change seed file for Nutch

apache-nutch-1.6/urls/seed.txt

In this instance I will use a url of the form

http://somesite.co.nz/Search?DateRange=7&industry=62

( this is not a real url just an example )
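
The seed file can also be written from the command line. A minimal sketch, assuming the seed file should hold just this one url; the function name write_seed is hypothetical, and note that it overwrites any existing seed.txt.

```shell
# Hypothetical helper: write a single seed url to <nutch_home>/urls/seed.txt,
# replacing whatever the file held before.
write_seed() {
  nutch_home="$1"
  seed_url="$2"
  mkdir -p "$nutch_home/urls"
  printf '%s\n' "$seed_url" > "$nutch_home/urls/seed.txt"
}

# Usage (single quotes stop the shell from interpreting '&' and '?'):
#   write_seed apache-nutch-1.6 'http://somesite.co.nz/Search?DateRange=7&industry=62'
```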

Change the conf/regex-urlfilter.txt entries, i.e.

# skip URLs containing certain characters

-[*!@]

# accept only somesite.co.nz Search urls

+^http://([a-z0-9]*\.)*somesite\.co\.nz/Search

This will only accept somesite.co.nz Search urls
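
Before running a crawl, the two filter rules can be tried out against a few sample urls. The sketch below emulates them with grep -E, which is only an approximation of the Java regular expressions that Nutch uses but behaves the same for patterns this simple. The dots in the host name are escaped here so that they match only a literal dot. The function name url_allowed and the sample urls are hypothetical.

```shell
# Hypothetical check that mimics the two filter rules, applied in order.
url_allowed() {
  url="$1"
  # Rule 1 (-): reject urls containing '*', '!' or '@'.
  if printf '%s\n' "$url" | grep -Eq '[*!@]'; then
    return 1
  fi
  # Rule 2 (+): accept somesite.co.nz Search urls, with optional subdomains.
  if printf '%s\n' "$url" | grep -Eq '^http://([a-z0-9]*\.)*somesite\.co\.nz/Search'; then
    return 0
  fi
  # No rule matched: the url is ignored.
  return 1
}

for u in \
  'http://somesite.co.nz/Search?DateRange=7&industry=62' \
  'http://www.somesite.co.nz/Search?page=2' \
  'http://somesite.co.nz/About' \
  'http://othersite.co.nz/Search?DateRange=7'
do
  if url_allowed "$u"; then echo "accept $u"; else echo "reject $u"; fi
done
```

Only the first two urls should be reported as accepted; the '?', '=' and '&' characters pass because they are not in the skip list.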

Run Nutch

Now run Nutch using the start script

cd apache-nutch-1.6 ; ./nutch_start.bash

Monitor for errors in the Solr admin log window

The Nutch crawl should end with

crawl finished: crawl
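
If the output of the start script is captured in a file, the final message can be checked automatically. A rough sketch, assuming the message appears on the last non-empty line of the captured output; the function name crawl_finished and the log file name crawl.log are hypothetical.

```shell
# Hypothetical check: succeed if the last non-empty line of a captured log
# starts with "crawl finished: ".
crawl_finished() {
  log_file="$1"
  last_line=$(grep -v '^[[:space:]]*$' "$log_file" | tail -n 1)
  case "$last_line" in
    "crawl finished: "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, after capturing the output of your start script in crawl.log:
#   crawl_finished crawl.log && echo "crawl completed"
```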

Checking Data

Data should have been indexed in Solr

In Solr Admin window

Set 'Core Selector' = collection1

Click 'Query'

In Query window set fl field = url

Click Execute Query

The result ( next ) shows the filtered list of urls in Solr
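
The same query can be expressed as a url, which is handy for repeating the check without the admin window. A minimal sketch, assuming the default collection1 core and a query of *:* (match everything); the function name solr_select_url is hypothetical.

```shell
# Hypothetical helper: build a Solr select url that returns only the url field.
solr_select_url() {
  base="${1:-http://localhost:8983/solr}"
  core="${2:-collection1}"
  # q=*:* matches all documents, fl=url limits the output to the url field.
  echo "$base/$core/select?q=*:*&fl=url&wt=json&indent=true"
}

solr_select_url

# To run it against a live server (assumes curl is installed):
#   curl -s "$(solr_select_url)"
```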

Checking Data

Results

Congratulations, you have completed your second crawl

With parameterised urls

More complex url filtering

With a Solr Query search

Contact Us

Feel free to contact us at

www.semtech-solutions.co.nz

We offer IT project consultancy

We are happy to hear about your problems

You can just pay for those hours that you need

To solve your problems
