10 Crawling
TRANSCRIPT
-
A. Haeberlen, Z. Ives
CIS 455/555: Internet and Web Systems
University of Pennsylvania
Crawling and Publish/Subscribe
February 15, 2012
-
Plan for today
Basic crawling ← NEXT
Mercator
Publish/subscribe
XFilter
-
Motivation
Suppose you want to build a search engine
Need a large corpus of web pages
How can we find these pages?
Idea: crawl the web
What else can you crawl? For example, a social network
-
Crawling: The basic process
What state do we need?
Q := Queue of URLs to visit
P := Set of pages already crawled
Basic process:
1. Initialize Q with a set of seed URLs
2. Pick the first URL from Q and download the corresponding page
3. Extract all URLs from the page (anchor tags, links in CSS, DTDs, and scripts; optionally image links)
4. Append to Q any URLs that a) meet our criteria, and b) are not already in P
5. If Q is not empty, repeat from step 2
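A minimal Java sketch of this loop; fetch(), extractUrls(), and meetsCriteria() are illustrative stubs (a real crawler would put an HTTP client and an HTML parser behind them):

    import java.util.*;

    // Minimal sketch of the basic crawl loop described above.
    public class BasicCrawler {
        private final Deque<String> q = new ArrayDeque<>(); // Q: URLs to visit
        private final Set<String> p = new HashSet<>();      // P: pages already crawled

        public void crawl(List<String> seeds) {
            q.addAll(seeds);                                // 1. seed Q
            while (!q.isEmpty()) {                          // 5. repeat while non-empty
                String url = q.removeFirst();               // 2. pick the first URL
                if (!p.add(url)) continue;                  //    skip if already crawled
                String page = fetch(url);                   // 2. download the page
                for (String link : extractUrls(page)) {     // 3. extract URLs
                    if (meetsCriteria(link) && !p.contains(link))
                        q.addLast(link);                    // 4. append to Q
                }
            }
        }

        private String fetch(String url) { return ""; }                     // stub
        private List<String> extractUrls(String page) { return List.of(); } // stub
        private boolean meetsCriteria(String url) { return true; }          // stub
    }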
Can one machine crawl the entire web? Of course not! Need to distribute crawling across many machines.
-
Crawling visualized
[Figure: The Web, showing the seeds, pages currently being crawled, URLs already crawled, the "frontier" of URLs that will eventually be crawled, URLs that do not fit our criteria (too deep, etc.), and the unseen web.]
-
Crawling complications
What order to traverse in? Polite to do BFS - why?
Malicious pages
Spam pages / SEO
Spider traps (incl. dynamically generated ones)
General messiness
Cycles
Varying latency and bandwidth to remote servers
Site mirrors, duplicate pages, aliases
Web masters' stipulations
How deep to crawl? How often to crawl?
Continuous crawling; freshness
Need to be robust!
-
Normalization; eliminating duplicates
Some of the extracted URLs are relative URLs
Example: /~ahae/papers/ from a page on www.cis.upenn.edu
Normalize it: http://www.cis.upenn.edu/~ahae/papers
Duplication is widespread on the web
If the fetched page is already in the index, do not process it
Can verify using a document fingerprint (hash) or shingles
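A small sketch of both steps, assuming java.net.URI for normalization and MD5 as the document fingerprint (shingles omitted):

    import java.math.BigInteger;
    import java.net.URI;
    import java.security.MessageDigest;

    public class Dedup {
        // Resolve a relative URL against the page it was extracted from,
        // and remove "." / ".." segments
        static String normalize(String baseUrl, String link) {
            return URI.create(baseUrl).resolve(link).normalize().toString();
        }

        // Fingerprint of a fetched page; if the hash was seen before,
        // the content is a duplicate and is not processed again
        static String fingerprint(byte[] page) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5").digest(page);
            return new BigInteger(1, d).toString(16);
        }

        public static void main(String[] args) {
            System.out.println(normalize("http://www.cis.upenn.edu/index.html",
                                         "/~ahae/papers/"));
            // -> http://www.cis.upenn.edu/~ahae/papers/
        }
    }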
-
Crawler etiquette
Explicit politeness
Look for meta tags; for example, ignore pages that have <meta name="robots" content="noindex"> in their header
-
Robots.txt
What should be in robots.txt?
See http://www.robotstxt.org/wc/robots.html
To exclude all robots from a server:
  User-agent: *
  Disallow: /
To exclude one robot from two directories:
  User-agent: BobsCrawler
  Disallow: /news/
  Disallow: /tmp/
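A toy sketch of how a crawler might apply such Disallow rules; real robots.txt handling also covers Allow lines, wildcards, and per-agent sections, and the helper name here is illustrative:

    import java.util.List;

    public class RobotsCheck {
        // A path is allowed unless it starts with a disallowed prefix
        static boolean allowed(String path, List<String> disallowPrefixes) {
            return disallowPrefixes.stream().noneMatch(path::startsWith);
        }

        public static void main(String[] args) {
            List<String> disallow = List.of("/news/", "/tmp/");
            System.out.println(allowed("/news/today.html", disallow)); // false
            System.out.println(allowed("/~ahae/papers/", disallow));   // true
        }
    }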
-
Recap: Crawling
How does the basic process work?
What are some of the main challenges?
Duplicate elimination
Politeness
Malicious pages / spider traps
Normalization
Scalability
-
Plan for today
Basic crawling
Mercator ← NEXT
Publish/subscribe
XFilter
-
Mercator: A scalable web crawler
Written entirely in Java
Expands a URL frontier
Avoids re-crawling the same URLs
Also considers whether a document has been seen before
Same content, different URL [when might this occur?]
Every document has signature/checksum info computed as it's crawled
Despite the name, it does not actually scale to a large number of nodes
But it would not be too difficult to parallelize
Heydon and Najork: Mercator, a scalable, extensible web crawler (WWW'99)
-
Mercator architecture
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it's new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if the URL is repeated (compare its hash)
8. Enqueue URL
Source: Mercator paper
-
Mercator's polite frontier queues
Tries to go beyond a breadth-first approach
Goal is to have only one crawler thread per server
What does this mean for the load caused by Mercator?
Distributed URL frontier queue:
One subqueue per worker thread
The worker thread is determined by hashing the hostname of the URL
Thus, only one outstanding request per web server
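A minimal sketch of this assignment rule, assuming the subqueue index is simply the hostname's hash modulo the number of worker threads:

    import java.net.URI;

    public class QueueAssignment {
        // All URLs on the same host map to the same subqueue, and thus
        // to the same worker thread
        static int subqueueFor(String url, int numThreads) {
            String host = URI.create(url).getHost();
            return Math.floorMod(host.hashCode(), numThreads);
        }

        public static void main(String[] args) {
            // same host -> same subqueue
            System.out.println(subqueueFor("http://www.cis.upenn.edu/~ahae/", 8));
            System.out.println(subqueueFor("http://www.cis.upenn.edu/news/", 8));
        }
    }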
-
Mercator's HTTP fetcher
First, needs to ensure robots.txt is followed
Caches the contents of robots.txt for various web sites as it crawls them
Designed to be extensible to other protocols
Had to write their own HTTP requestor in Java
Their Java version didn't have timeouts; today, one can use setSoTimeout()
Could use Java non-blocking I/O: http://www.owlmountain.com/tutorials/NonBlockingIo.htm
But they use multiple threads and synchronous I/O
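For illustration, the setSoTimeout() facility mentioned above (host and timeout value are arbitrary):

    import java.net.Socket;

    public class TimeoutDemo {
        public static void main(String[] args) throws Exception {
            try (Socket s = new Socket("www.example.com", 80)) {
                // bound how long a read may block; a hung server now raises
                // SocketTimeoutException instead of stalling the thread
                s.setSoTimeout(5000);
                s.getOutputStream().write("GET / HTTP/1.0\r\n\r\n".getBytes());
                System.out.println(s.getInputStream().read());
            }
        }
    }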
-
Other caveats
Infinitely long URL names (a good way to get a buffer overflow!)
Aliased host names
Alternative paths to the same host
Can catch most of these with signatures of document data (e.g., MD5)
Comparison to Bloom filters
Crawler traps (e.g., CGI scripts that link to themselves using a different name)
May need to have a way for a human to override certain URL paths - see Section 5 of the paper
-
Further considerations
May want to prioritize certain pages as being most worth crawling
Focused crawling tries to prioritize based on relevance
May need to refresh certain pages more often
-
Plan for today
Basic crawling
Mercator
Publish/subscribe ← NEXT
XFilter
-
Example: RSS
Web server publishes XML file with events
Clients periodically request the file to see if there are new events
Is this a good solution?
-
Interest-based crawling
Suppose we want to crawl XML documents based on user interests
We need several parts:
A list of interests expressed in an executable form, perhaps as XPath queries
A crawler goes out and fetches XML content
A filter / routing engine matches XML content against users' interests, and sends them the content if it matches
-
Plan for today
Basic crawling
Mercator
Publish/subscribe
XFilter ← NEXT
-
XML-Based information dissemination
Basic model (XFilter, YFilter, Xyleme):
Users are interested in data relating to a particular topic, and know the schema
/politics/usa//body   (an XPath - here used to match documents, not nodes)
A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties
-
Engine for XFilter [Altinel & Franklin 00]
-
How does it work?
Each XPath segment is basically a subset of regular expressions over element tags
Convert into finite state automata
Parse data as it comes in, using the SAX API
Match against finite state machines
Most of these systems use modified FSMs because they want to match many patterns at the same time
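To make the regular-expression view concrete, a toy analogy (not XFilter's actual machinery): match the element-tag path from the root against a pattern in which the descendant step // becomes a wildcard.

    import java.util.regex.Pattern;

    public class PathAsRegex {
        public static void main(String[] args) {
            // /politics/usa//body over root-to-node tag paths:
            // "//" turns into "(/[^/]+)*"
            Pattern p = Pattern.compile("/politics/usa(/[^/]+)*/body");
            System.out.println(p.matcher("/politics/usa/body").matches());      // true
            System.out.println(p.matcher("/politics/usa/news/body").matches()); // true
            System.out.println(p.matcher("/politics/body").matches());          // false
        }
    }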
-
Path nodes and FSMs
The XPath parser decomposes XPath expressions into a set of path nodes
These nodes act as the states of the corresponding FSM
A node in the Candidate List denotes the current state
The rest of the states are in the corresponding Wait Lists
Simple FSM for /politics[@topic="president"]/usa//body:
(start) --politics--> Q1_1 --usa--> Q1_2 --body--> Q1_3
-
Decomposing into path nodes
Each path node stores:
Query ID
Position in state machine
Relative Position (RP) in tree:
  0 for the root node if it is not preceded by //
  -1 for any node preceded by //
  Else 1 + (number of * nodes from the predecessor node)
Level:
  If the current node has a fixed distance from the root, then 1 + distance
  Else if RP = -1, then -1; else 0
Finally, NextPathNodeSet points to the next node

Q1 = /politics[@topic="president"]/usa//body

            Q1-1  Q1-2  Q1-3
Query ID     Q1    Q1    Q1
Position      1     2     3
RP            0     1    -1
Level         1     2    -1

Q2 = //usa/*/body/p

            Q2-1  Q2-2  Q2-3
Query ID     Q2    Q2    Q2
Position      1     2     3
RP           -1     2     1
Level        -1     0     0
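A hypothetical Java mirror of these fields, with the values copied from the Q1 table above (NextPathNodeSet omitted):

    import java.util.List;

    public class PathNodes {
        record PathNode(String queryId, int position, int rp, int level) {}

        static final List<PathNode> Q1 = List.of(
            new PathNode("Q1", 1, 0, 1),    // politics: root, fixed level 1
            new PathNode("Q1", 2, 1, 2),    // usa: direct child, fixed level 2
            new PathNode("Q1", 3, -1, -1)); // body: preceded by //, any level
    }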
-
Query index
Query index entry for each XML tag
Two lists: Candidate List (CL) and Wait List (WL), divided across the nodes
Live queries' states are in the CL; pending queries' states are in the WL
Events that cause state transitions are generated by the XML parser
Initial index for Q1 and Q2:

Element    CL      WL
politics   Q1-1    -
usa        Q2-1    Q1-2
body       -       Q1-3, Q2-2
p          -       Q2-3
-
Encountering an element
Look up the element name in the Query Index, and all nodes in the associated CL
Validate that we actually have a match
Example: startElement(politics)
Entry in Query Index for "politics": CL = {Q1-1}, WL = {}
Path node Q1-1: Query ID = Q1, Position = 1, Rel. Position = 0, Level = 1, plus a NextPathNodeSet pointer
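A hedged sketch of this index shape, not XFilter's actual code (PathNode as in the earlier sketch; validation and advancement are left as a comment):

    import java.util.*;

    public class QueryIndex {
        record PathNode(String queryId, int position, int rp, int level) {}

        // One entry per element name, each with a Candidate List and a Wait List
        static class Entry {
            final List<PathNode> cl = new ArrayList<>();
            final List<PathNode> wl = new ArrayList<>();
        }

        final Map<String, Entry> byElement = new HashMap<>();

        // Called for each SAX startElement event: every node currently
        // in this element's CL is a potential state transition
        void onStartElement(String name, int depth) {
            Entry e = byElement.get(name);
            if (e == null) return;
            for (PathNode node : e.cl) {
                // validate level and predicates here, then advance the query
            }
        }
    }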
-
Validating a match
We first check that the current XML depth matches the level in the user query:
If the level in the CL node is less than 1, then ignore the height
Else the level in the CL node must equal the height
This ensures we're matching at the right point in the tree!
Finally, we validate any predicates against attributes (e.g., [@topic="president"])
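A minimal sketch of this check, assuming depth is the current element's 1-based depth in the document (names are illustrative):

    public class LevelCheck {
        // level < 1 (i.e., 0 or -1) means the node may occur at any depth;
        // otherwise it must occur exactly at that depth
        static boolean levelMatches(int nodeLevel, int depth) {
            return nodeLevel < 1 || nodeLevel == depth;
        }

        public static void main(String[] args) {
            System.out.println(levelMatches(-1, 5)); // true: // step, any depth
            System.out.println(levelMatches(2, 2));  // true: fixed level matches
            System.out.println(levelMatches(2, 3));  // false: wrong depth
        }
    }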
-
Processing further elements
Queries that don't meet validation are removed from the Candidate Lists
For other queries, we advance to the next state
We copy the next node of the query from the WL to the CL, and update the RP and level
When we reach a final state (e.g., Q1-3), we can output the document to the subscriber
When we encounter an end element, we must remove that element from the CL
-
A simpler approach
Instantiate a DOM tree for each document
Traverse and recursively match XPaths
Pros and cons?
-
Recap: Publish-subscribe model
Publish-subscribe model
Publishers produce events
Each subscriber is interested in a subset of the events
Challenge: efficient implementation
Comparison: XFilter vs. RSS
XFilter
Interests are specified with XPaths (very powerful!)
Sophisticated technique for efficiently matching documents against many XPaths in parallel