Creating a Web Crawler in 3 Steps
Issac Goldstand
Mirimar Networks
http://www.mirimar.net/
The 3 steps
• Creating the User Agent
• Creating the content parser
• Tying it together
Step 1 – Creating the User Agent
• libwww-perl (LWP)
• OO interface for creating user agents that interact with remote websites and web applications
• We will look at LWP::RobotUA
Creating the LWP Object
• User agent
• Cookie jar
• Timeout
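These three settings correspond to constructor options on the user agent. A minimal sketch, with placeholder values for the agent string, cookie file, and timeout (not taken from the talk):
use LWP::UserAgent;
use HTTP::Cookies;
my $ua = LWP::UserAgent->new(
    agent      => 'MyBot/1.0',       # user agent string sent with each request
    timeout    => 30,                # give up on a request after 30 seconds
    cookie_jar => HTTP::Cookies->new(
        file     => 'cookies.txt',   # persist cookies between runs
        autosave => 1,
    ),
);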
Robot UA extras
• Robot rules
• Delay
• use_sleep
Implementation of Step 1
use LWP::RobotUA;
# First, create the user agent - MyBot/1.0
my $ua=LWP::RobotUA->new('MyBot/1.0', '[email protected]');
$ua->delay(15/60); # 15 seconds delay
$ua->use_sleep(1); # Sleep if delayed
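A quick usage sketch (not in the original slides): the robot UA is used like any other LWP user agent, and it fetches and obeys each site's robots.txt and applies the configured delay automatically. The URL below is just an example.
my $response = $ua->get('http://www.example.com/');
print $response->status_line, "\n";  # e.g. "200 OK", or "403 Forbidden by robots.txt" if disallowed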
Step 2 – Creating the content parser
• HTML::Parser
• Event-driven parser mechanism
• OO and function-oriented interfaces
• Hooks to functions at certain points
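As an illustration of the event-driven interface (a sketch with made-up HTML, not from the slides), handlers can be registered directly on an HTML::Parser object; the talk instead subclasses the parser, as shown next.
use HTML::Parser;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub {                 # called for every start tag
        my ($tagname, $attr) = @_;
        print "Found <$tagname>\n";
    }, 'tagname, attr' ],                  # argument spec for the handler
);
$parser->parse('<html><head><title>Hi</title></head></html>');
$parser->eof;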
Subclassing HTML::Parser
• Biggest issue is non-persistence
• CGI authors may be used to this, but it still makes for many caveats
• You must implement your own state preservation mechanism
Implementation of Step 2
package My::LinkParser; # Parser class
use base qw(HTML::Parser);
use constant START=>0; # Define simple constants
use constant GOT_NAME=>1;
sub state { # Simple access methods
return $_[0]->{STATE};
}
sub author {
return $_[0]->{AUTHOR};
}
Implementation of Step 2 (cont)
sub reset { # Clear parser state
my $self=shift;
undef $self->{AUTHOR};
$self->{STATE}=START;
return 0;
}
sub start { # Parser hook
my($self, $tagname, $attr, $attrseq, $origtext) = @_;
if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
$self->{STATE}=GOT_NAME;
$self->{AUTHOR}=$attr->{content};
}
}
Shortcut – HTML::SimpleLinkExtor
• Simple package to extract links from HTML
• Extracts many kinds of links – we only want HREF-type links (see the sketch below)
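A minimal sketch of the shortcut (the HTML string is made up): parse a document and ask only for links that came from <a> tags.
use HTML::SimpleLinkExtor;
my $extor = HTML::SimpleLinkExtor->new;
$extor->parse('<a href="http://www.example.com/">Example</a>');
my @hrefs = $extor->a;                # links from <a> tags only
print "$_\n" for @hrefs;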
Step 3 – Tying it together
• Simple application
• Instantiate objects
• Enter request loop
• Spit data to somewhere
• Add parsed links to queue
Implementation of Step 3
for (my $i=0;$i<10;$i++) { # Parse loop
my $response=$ua->get(pop @urls); # Get HTTP response
if ($response->is_success) { # If response is OK
$p->reset;
$p->parse($response->content); # Parse for author
$p->eof;
if ($p->state==1) { # If state is GOT_NAME
$authors{$p->author}++; # then add author count
} else {
$authors{'Not Specified'}++; # otherwise add default count
}
$linkex->parse($response->content); # parse for links
unshift @urls,$linkex->a; # and add links to queue
}
}
End result
#!/usr/bin/perl
use strict;
use LWP::RobotUA;
use HTML::Parser;
use HTML::SimpleLinkExtor;
my @urls; # List of URLs to visit
my %authors;
my $ua=LWP::RobotUA->new('AuthorBot/1.0','[email protected]'); # First, create & setup the user agent
$ua->delay(15/60); # 15 seconds delay
$ua->use_sleep(1); # Sleep if delayed
my $p=My::LinkParser->new; # Create parsers
my $linkex=HTML::SimpleLinkExtor->new;
$urls[0]="http://www.beamartyr.net/"; # Initialize list of URLs
End result
for (my $i=0;$i<10;$i++) { # Parse loop
my $response=$ua->get(pop @urls); # Get HTTP response
if ($response->is_success) { # If response is OK
$p->reset;
$p->parse($response->content); # Parse for author
$p->eof;
if ($p->state==1) { # If state is GOT_NAME
$authors{$p->author}++; # then add author count
} else {
$authors{'Not Specified'}++; # otherwise add default count
}
$linkex->parse($response->content); # parse for links
unshift @urls,$linkex->a; # and add links to queue
}
}
print "Results:\n"; # Print results
map {print "$_\t$authors{$_}\n"} keys %authors;
End result
package My::LinkParser; # Parser class
use base qw(HTML::Parser);
use constant START=>0; # Define simple constants
use constant GOT_NAME=>1;
sub state { # Simple access methods
return $_[0]->{STATE};
}
sub author {
return $_[0]->{AUTHOR};
}
sub reset { # Clear parser state
my $self=shift;
undef $self->{AUTHOR};
$self->{STATE}=START;
return 0;
}
End result
sub start { # Parser hook
my($self, $tagname, $attr, $attrseq, $origtext) = @_;
if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
$self->{STATE}=GOT_NAME;
$self->{AUTHOR}=$attr->{content};
}
}
What’s missing?
• Full URLs for relative links
• Non-HTTP links
• Queues & caches
• Persistent storage
• Link (and data) validation
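As an illustrative sketch of the first two points (not part of the talk), each extracted link could be resolved against the response's base URL and filtered to HTTP(S) before being queued:
use URI;
for my $link ($linkex->a) {
my $abs = URI->new_abs($link, $response->base);          # resolve relative links
next unless $abs->scheme && $abs->scheme =~ /^https?$/;  # skip mailto:, ftp:, etc.
unshift @urls, $abs->as_string;                          # queue the absolute URL
}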
In review
• Create robot user agent to crawl websites nicely
• Create parsers to extract data from sites, and links to the next sites
• Create a simple program to parse a queue of URLs
Thank you!
For more information:
Issac Goldstand [email protected]
http://www.beamartyr.net/
http://www.mirimar.net/