Creating a Web Crawler in 3 Steps
Issac Goldstand
Mirimar Networks
http://www.mirimar.net/
The 3 steps
• Creating the User Agent
• Creating the content parser
• Tying it together
Step 1 – Creating the User Agent
• libwww-perl (LWP)
• OO interface for creating user agents that interact with remote websites and web applications
• We will look at LWP::RobotUA
Creating the LWP Object
• User agent
• Cookie jar
• Timeout
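These three settings correspond to constructor options on the user agent. A minimal sketch, with placeholder values for the agent string, cookie file, and timeout (not taken from the talk):
use LWP::UserAgent;
use HTTP::Cookies;
my $ua = LWP::UserAgent->new(
    agent      => 'MyBot/1.0',       # user agent string sent with each request
    timeout    => 30,                # give up on a request after 30 seconds
    cookie_jar => HTTP::Cookies->new(
        file     => 'cookies.txt',   # persist cookies between runs
        autosave => 1,
    ),
);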
Robot UA extras
• Robot rules
• Delay
• use_sleep
Implementation of Step 1
use LWP::RobotUA;
# First, create the user agent - MyBot/1.0
my $ua=LWP::RobotUA->new('MyBot/1.0', '[email protected]');
$ua->delay(15/60); # 15 seconds delay
$ua->use_sleep(1); # Sleep if delayed
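A quick usage sketch (not in the original slides): the robot UA is used like any other LWP user agent, and it fetches and obeys each site's robots.txt and applies the configured delay automatically. The URL below is just an example.
my $response = $ua->get('http://www.example.com/');
print $response->status_line, "\n";  # e.g. "200 OK", or "403 Forbidden by robots.txt" if disallowed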
Step 2 – Creating the content parser
• HTML::Parser
• Event-driven parser mechanism
• OO and function-oriented interfaces
• Hooks to functions at certain points
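As an illustration of the event-driven interface (a sketch with made-up HTML, not from the slides), handlers can be registered directly on an HTML::Parser object; the talk instead subclasses the parser, as shown next.
use HTML::Parser;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub {                 # called for every start tag
        my ($tagname, $attr) = @_;
        print "Found <$tagname>\n";
    }, 'tagname, attr' ],                  # argument spec for the handler
);
$parser->parse('<html><head><title>Hi</title></head></html>');
$parser->eof;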
Subclassing HTML::Parser
• Biggest issue is non-persistence
• CGI authors may be used to this, but it still makes for many caveats
• You must implement your own state preservation mechanism
Implementation of Step 2
package My::LinkParser; # Parser class
use base qw(HTML::Parser);
use constant START=>0; # Define simple constants
use constant GOT_NAME=>1;
sub state { # Simple access methods
return $_[0]->{STATE};
}
sub author {
return $_[0]->{AUTHOR};
}
Implementation of Step 2 (cont)
sub reset { # Clear parser state
my $self=shift;
undef $self->{AUTHOR};
$self->{STATE}=START;
return 0;
}
sub start { # Parser hook
my($self, $tagname, $attr, $attrseq, $origtext) = @_;
if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
$self->{STATE}=GOT_NAME;
$self->{AUTHOR}=$attr->{content};
}
}
Shortcut – HTML::SimpleLinkExtor
• Simple package to extract links from HTML
• Extracts many kinds of links – we only want HREF-type links (see the sketch below)
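A minimal sketch of the shortcut (the HTML string is made up): parse a document and ask only for links that came from <a> tags.
use HTML::SimpleLinkExtor;
my $extor = HTML::SimpleLinkExtor->new;
$extor->parse('<a href="http://www.example.com/">Example</a>');
my @hrefs = $extor->a;                # links from <a> tags only
print "$_\n" for @hrefs;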
Step 3 – Tying it together
• Simple application
• Instantiate objects
• Enter request loop
• Spit data to somewhere
• Add parsed links to queue
Implementation of Step 3
for (my $i=0;$i<10;$i++) { # Parse loop
my $response=$ua->get(pop @urls); # Get HTTP response
if ($response->is_success) { # If response is OK
$p->reset;
$p->parse($response->content); # Parse for author
$p->eof;
if ($p->state==1) { # If state is GOT_NAME
$authors{$p->author}++; # then add author count
} else {
$authors{'Not Specified'}++; # otherwise add default count
}
$linkex->parse($response->content); # parse for links
unshift @urls,$linkex->a; # and add links to queue
}
}
End result
#!/usr/bin/perl
use strict;
use LWP::RobotUA;
use HTML::Parser;
use HTML::SimpleLinkExtor;
my @urls; # List of URLs to visit
my %authors;
my $ua=LWP::RobotUA->new('AuthorBot/1.0','[email protected]'); # First, create & setup the user agent
$ua->delay(15/60); # 15 seconds delay
$ua->use_sleep(1); # Sleep if delayed
my $p=My::LinkParser->new; # Create parsers
my $linkex=HTML::SimpleLinkExtor->new;
$urls[0]="http://www.beamartyr.net/"; # Initialize list of URLs
End result
for (my $i=0;$i<10;$i++) { # Parse loop
my $response=$ua->get(pop @urls); # Get HTTP response
if ($response->is_success) { # If response is OK
$p->reset;
$p->parse($response->content); # Parse for author
$p->eof;
if ($p->state==1) { # If state is GOT_NAME
$authors{$p->author}++; # then add author count
} else {
$authors{'Not Specified'}++; # otherwise add default count
}
$linkex->parse($response->content); # parse for links
unshift @urls,$linkex->a; # and add links to queue
}
}
print "Results:\n"; # Print results
map {print "$_\t$authors{$_}\n"} keys %authors;
End result
package My::LinkParser; # Parser class
use base qw(HTML::Parser);
use constant START=>0; # Define simple constants
use constant GOT_NAME=>1;
sub state { # Simple access methods
return $_[0]->{STATE};
}
sub author {
return $_[0]->{AUTHOR};
}
sub reset { # Clear parser state
my $self=shift;
undef $self->{AUTHOR};
$self->{STATE}=START;
return 0;
}
End result
sub start { # Parser hook
my($self, $tagname, $attr, $attrseq, $origtext) = @_;
if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
$self->{STATE}=GOT_NAME;
$self->{AUTHOR}=$attr->{content};
}
}
What’s missing?
• Full URLs for relative links
• Non-HTTP links
• Queues & caches
• Persistent storage
• Link (and data) validation
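As an illustrative sketch of the first two points (not part of the talk), each extracted link could be resolved against the response's base URL and filtered to HTTP(S) before being queued:
use URI;
for my $link ($linkex->a) {
my $abs = URI->new_abs($link, $response->base);          # resolve relative links
next unless $abs->scheme && $abs->scheme =~ /^https?$/;  # skip mailto:, ftp:, etc.
unshift @urls, $abs->as_string;                          # queue the absolute URL
}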
In review
• Create robot user agent to crawl websites nicely
• Create parsers to extract data from sites, and links to the next sites
• Create a simple program to parse a queue of URLs
Thank you!
For more information:
Issac Goldstand [email protected]
http://www.beamartyr.net/
http://www.mirimar.net/