web scraping with nutch solr part 2

Upload: semtech-solutions-ltd

Post on 16-Apr-2017

Web Scraping Using Nutch and Solr - Part 2

The following example assumes that you have

Watched "Web Scraping with Nutch and Solr"

The video ID of that movie is cAiYBD4BQeE

Set up a Linux-based Nutch/Solr environment

Run the web scrape shown in that movie

Now we will

Clean up that environment

Web scrape a parameterised url

View the urls in the data

Empty Nutch Database

Clean up the Nutch crawl database

Previously we used apache-nutch-1.6/nutch_start.bash

This contained -dir crawl option

This created apache-nutch-1.6/crawl directory

Which contains our Nutch data

Remove it with

cd apache-nutch-1.6; rm -rf crawl

Only do this because it contained dummy data!

The next run of the script will create the directory again
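
The same clean-up can be sketched in Python with a guard, so that a missing crawl directory is not an error. This is only an illustration; the function name remove_crawl_dir is hypothetical, and the install path is passed in by the caller.

```python
import shutil
from pathlib import Path


def remove_crawl_dir(nutch_home):
    """Delete <nutch_home>/crawl if it exists.

    Returns True when a directory was removed, False when there was nothing to do.
    """
    crawl_dir = Path(nutch_home) / "crawl"
    if crawl_dir.is_dir():
        shutil.rmtree(crawl_dir)  # equivalent of: rm -rf crawl
        return True
    return False
```

Calling remove_crawl_dir("apache-nutch-1.6") from the directory that holds the Nutch install mirrors the shell command above.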

Empty Solr Database

Clean the Solr database via its update url

Keep a note of this command

Only use it if you need to empty your data

Run the following ( with solr server running )

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' -d '<delete><query>*:*</query></delete>'
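
The delete-all request can also be sketched in Python. This assumes the request body is Solr's XML delete-by-query message with the match-all query *:*, posted to the update url shown above. The function names are hypothetical.

```python
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update?commit=true"
# Delete-by-query message that matches every document in the index
DELETE_ALL_BODY = b"<delete><query>*:*</query></delete>"


def build_delete_all_request(update_url=SOLR_UPDATE_URL):
    """Build (but do not send) the POST request that empties the index."""
    return urllib.request.Request(
        update_url,
        data=DELETE_ALL_BODY,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


def empty_solr(update_url=SOLR_UPDATE_URL):
    """Send the request; needs a running solr server. Returns the HTTP status."""
    with urllib.request.urlopen(build_delete_all_request(update_url), timeout=10) as response:
        return response.status
```

Building the request separately from sending it means the destructive call only happens when empty_solr() is run on purpose.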

Set up Nutch

Now we will do something more complex

Web scrape a url that has parameters i.e.

http://<host>/<path>?var1=val1&var2=val2

This web scrape will

Have extra url characters '?=&'

Need greater search depth

Need better url filtering

Remember that you need to get permission before you scrape a third-party web site

Nutch Configuration

Change seed file for Nutch

apache-nutch-1.6/urls/seed.txt

In this instance I will use a url of the form

http://somesite.co.nz/Search?DateRange=7&industry=62

( this is not a real url just an example )
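
To see which parts of this example url carry the parameters, it can be split with Python's standard library. This is a small sketch only; describe_url is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs

# The example seed url from above ( not a real url )
SEED_URL = "http://somesite.co.nz/Search?DateRange=7&industry=62"


def describe_url(url):
    """Split a url into host, path and a dict of query parameters."""
    parts = urlsplit(url)
    return {
        "host": parts.netloc,
        "path": parts.path,
        "params": parse_qs(parts.query),  # text after '?', split on '&' and '='
    }
```

The characters '?', '=' and '&' all sit in the query part, which is why the url filter has to let them through.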

Change the conf/regex-urlfilter.txt entries i.e.

# skip URLs containing certain characters

-[*!@]

# accept only the Search urls on somesite.co.nz

+^http://([a-z0-9]*\.)*somesite\.co\.nz/Search

This will only consider somesite.co.nz Search urls
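
The effect of these two rules can be tried out before crawling. The sketch below imitates how the Nutch regex url filter is understood to work: rules are read top to bottom, the first pattern found anywhere in the url decides, and a url that matches no rule is rejected. Nutch uses Java regular expressions; for patterns this simple Python's re module is assumed to behave the same. Only the two rules shown above are included, with the dots in the host name escaped, and url_is_accepted is a hypothetical name.

```python
import re

# '-' rejects, '+' accepts; order matters
RULES = [
    r"-[*!@]",
    r"+^http://([a-z0-9]*\.)*somesite\.co\.nz/Search",
]


def url_is_accepted(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the url is ignored
```

For example, url_is_accepted("http://somesite.co.nz/Search?DateRange=7&industry=62") is True, while a url on the same site that does not start with /Search is rejected.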

Run Nutch

Now run Nutch using the start script

cd apache-nutch-1.6 ; ./nutch_start.bash

Monitor for errors in the Solr admin log window

The Nutch crawl should end with

crawl finished: crawl
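
The contents of the start script are not shown in this part. As a sketch only, assuming it wraps the Nutch 1.x "bin/nutch crawl" command with the -dir crawl option mentioned earlier, the command line and the end-of-crawl check could look like this. The depth and topN values and both function names are illustrative assumptions.

```python
def build_crawl_command(depth=5, top_n=50, solr_url="http://localhost:8983/solr/"):
    """Build the argument list for a one-step Nutch 1.x crawl that indexes into solr."""
    return [
        "bin/nutch", "crawl", "urls",  # 'urls' is the directory holding seed.txt
        "-solr", solr_url,
        "-dir", "crawl",               # where the Nutch data is written
        "-depth", str(depth),          # raise this for a greater search depth
        "-topN", str(top_n),
    ]


def crawl_finished(output):
    """True if the last non-blank line of the crawl output reports completion."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return bool(lines) and lines[-1].startswith("crawl finished:")
```

The list can be handed to subprocess.run(..., cwd="apache-nutch-1.6", capture_output=True, text=True) and the captured output passed to crawl_finished(); whether the message arrives on standard output depends on the local logging setup.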

Checking Data

Data should have been indexed in Solr

In Solr Admin window

Set 'Core Selector' = collection1

Click 'Query'

In the Query window set the 'fl' field to url

Click Execute Query

The result shows the filtered list of urls in Solr
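
The same check can be made without the admin window. The sketch below assumes the standard Solr select handler on the collection1 core and the JSON response writer; it builds the query url and pulls the url field out of a response. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_url_query(core="collection1", rows=10, base="http://localhost:8983/solr"):
    """Build a select url that matches all documents and returns only the url field."""
    params = urlencode({"q": "*:*", "fl": "url", "wt": "json", "rows": rows})
    return f"{base}/{core}/select?{params}"


def extract_urls(response_text):
    """Return the url value of every document in a Solr JSON response."""
    docs = json.loads(response_text)["response"]["docs"]
    return [doc["url"] for doc in docs if "url" in doc]
```

Fetching build_url_query() with any HTTP client while solr is running and passing the body to extract_urls() should list the same urls as the Query window.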

Results

Congratulations, you have completed your second crawl

With parameterised urls

More complex url filtering

With a Solr Query search

Contact Us

Feel free to contact us at

www.semtech-solutions.co.nz

[email protected]

We offer IT project consultancy

We are happy to hear about your problems

You can just pay for those hours that you need

To solve your problems