Web Scraping Using Nutch and Solr - Part 2
The following example assumes that you have
Watched the movie 'Web Scraping with Nutch and Solr'
The above movie identity is cAiYBD4BQeE
Set up a Linux-based Nutch/Solr environment
Run the web scrape in the above movie
Now we will
Clean up that environment
Web scrape a parameterised url
View the urls in the data
Empty Nutch Database
Clean up the Nutch crawl database
Previously used apache-nutch-1.6/nutch_start.sh
This contained -dir crawl option
This created apache-nutch-1.6/crawl directory
Which contains our Nutch data
Clean this with
cd apache-nutch-1.6; rm -rf crawl
Only do this because it contained dummy data !
The next run of the script will create the directory again
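The same clean-up can be scripted. This is a minimal Python sketch, assuming the crawl data sits in a `crawl` directory under the Nutch home as described above. The function name `remove_crawl_dir` is hypothetical and not part of Nutch.

```python
import shutil
from pathlib import Path


def remove_crawl_dir(nutch_home: str, crawl_name: str = "crawl") -> bool:
    """Delete <nutch_home>/<crawl_name> if it exists; return True if it was removed."""
    crawl_dir = Path(nutch_home) / crawl_name
    if not crawl_dir.is_dir():
        return False  # nothing to clean
    shutil.rmtree(crawl_dir)  # same effect as: rm -rf crawl
    return True
```

Calling `remove_crawl_dir("apache-nutch-1.6")` from the directory that holds the Nutch install would remove the old crawl data, and returns False if there is nothing to remove.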
Empty Solr Database
Clean Solr database via a url
Bookmark this url
Only use it if you need to empty your data
Run the following curl command ( with solr server running )
curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" -d '<delete><query>*:*</query></delete>'
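The delete-all request can also be sent from a script. This is a minimal sketch using only the Python standard library. It assumes Solr listens on localhost:8983 as above and accepts an XML delete-by-query body on its update handler. The helper name `build_delete_all_request` is hypothetical. The request is only built here, and nothing is deleted until it is passed to `urlopen`.

```python
import urllib.request

# XML body that matches every document in the index
DELETE_ALL_XML = b"<delete><query>*:*</query></delete>"


def build_delete_all_request(solr_url: str = "http://localhost:8983/solr") -> urllib.request.Request:
    """Build (but do not send) a POST that deletes all documents and commits."""
    return urllib.request.Request(
        url=solr_url.rstrip("/") + "/update?commit=true",
        data=DELETE_ALL_XML,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


# To actually empty the index (destructive), with the Solr server running:
# with urllib.request.urlopen(build_delete_all_request()) as resp:
#     print(resp.status)
```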
Set up Nutch
Now we will do something more complex
Web scrape a url that has parameters i.e.
http:///?var1=val1&var2=val2
This web scrape will
Have extra url characters '?=&'
Need greater search depth
Need better url filtering
Remember that you need to get permission to scrape a third party web site
Nutch Configuration
Change seed file for Nutch
apache-nutch-1.6/urls/seed.txt
In this instance I will use a url of the form
http://somesite.co.nz/Search?DateRange=7&industry=62
( this is not a real url just an example )
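To make the structure of such a url concrete, here is a small sketch that assembles the example seed url from its parameters and writes it to a seed file. It uses the placeholder site and parameter names above. `build_seed_url` and `write_seed_file` are hypothetical helpers, not Nutch tools.

```python
from pathlib import Path
from urllib.parse import urlencode, urlsplit, parse_qs


def build_seed_url(base: str, params: dict) -> str:
    """Join a base url and its query parameters using '?', '=' and '&'."""
    return base + "?" + urlencode(params)


def write_seed_file(seed_path: str, urls: list) -> None:
    """Write one seed url per line, e.g. to apache-nutch-1.6/urls/seed.txt."""
    path = Path(seed_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(urls) + "\n")


seed = build_seed_url("http://somesite.co.nz/Search", {"DateRange": 7, "industry": 62})
print(seed)                            # http://somesite.co.nz/Search?DateRange=7&industry=62
print(parse_qs(urlsplit(seed).query))  # {'DateRange': ['7'], 'industry': ['62']}
```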
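Change the conf/regex-urlfilter.txt entries i.e.
# skip URLs containing certain characters
# ( '?' and '=' removed from the default -[?*!@=] so that parameterised urls pass )
-[*!@]
# accept only the somesite Search urls
+^http://([a-z0-9]*\.)*somesite\.co\.nz/Search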
This will only consider some site Search urls
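The effect of these two rules can be checked before running a crawl. The sketch below is an approximation in Python, not Nutch code. It assumes the usual regex-urlfilter behaviour: rules are tried in order, the first pattern found anywhere in the url decides ( '+' accept, '-' reject ), and a url that matches no rule is rejected. Nutch itself uses Java regular expressions, but these simple patterns behave the same way in Python. The dots in the site name are escaped here so that they only match a literal '.'.

```python
import re

# (sign, pattern) pairs in file order, modelled on the filter entries above
RULES = [
    ("-", re.compile(r"[*!@]")),
    ("+", re.compile(r"^http://([a-z0-9]*\.)*somesite\.co\.nz/Search")),
]


def accepts(url: str) -> bool:
    """First matching rule wins; a url that matches no rule is rejected."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False


for url in [
    "http://somesite.co.nz/Search?DateRange=7&industry=62",  # accepted
    "http://www.somesite.co.nz/Search?DateRange=30",         # accepted (sub domain)
    "http://somesite.co.nz/About",                           # rejected (not /Search)
    "http://somesite.co.nz/Search?user=a@b",                 # rejected (contains '@')
]:
    print(accepts(url), url)
```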
Run Nutch
Now run nutch using start script
cd apache-nutch-1.6 ; ./nutch_start.bash
Monitor for errors in solr admin log window
The Nutch crawl should end with
crawl finished: crawl
Checking Data
Data should have been indexed in Solr
In Solr Admin window
Set 'Core Selector' = collection1
Click 'Query'
In Query window set fl field = url
Click Execute Query
The result ( next ) shows the filtered list of urls in Solr
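The same check can be made without the Admin window by calling the select handler directly. This is a minimal sketch, assuming the collection1 core on localhost:8983 as above and Solr's JSON response format ( wt=json ). `build_url_query` and `extract_urls` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode


def build_url_query(solr_core_url: str = "http://localhost:8983/solr/collection1", rows: int = 10) -> str:
    """Build a select url that matches all documents and returns only the url field."""
    params = {"q": "*:*", "fl": "url", "wt": "json", "rows": rows}
    return solr_core_url.rstrip("/") + "/select?" + urlencode(params)


def extract_urls(response_text: str) -> list:
    """Pull the url field out of each document in a Solr JSON response."""
    docs = json.loads(response_text)["response"]["docs"]
    return [doc["url"] for doc in docs if "url" in doc]


# With the Solr server running:
# import urllib.request
# with urllib.request.urlopen(build_url_query()) as resp:
#     for url in extract_urls(resp.read().decode("utf-8")):
#         print(url)
```

Every url printed should be one of the somesite Search urls if the filter did its job.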
Results
Congratulations, you have completed your second crawl
With parameterised urls
More complex url filtering
With a Solr Query search
Contact Us
Feel free to contact us at
www.semtech-solutions.co.nz
We offer IT project consultancy
We are happy to hear about your problems
You can just pay for those hours that you need
To solve your problems