Page 1:

Building Web Spiders
Web-Based Information Architectures

MSEC 20-760
Mini II

Jaime Carbonell

Page 2:

General Topic: Spidering the Web

• Motivation: Acquiring a Collection

• Bare Essentials of Graph Theory

• Web Spidering Algorithms

• Web Spidering: Current Practice

Page 3:

Acquiring a Collection (1)

Revising the Total IR Scheme

1. Acquire the collection, i.e. all the documents [Off-line process]

2. Create an inverted index (Homework 1) [Off-line process]

3. Match queries to documents (Homework 2) [On-line process, the actual retrieval]

4. Present the results to the user [On-line process: display, summarize, ...]

Page 4:

Acquiring a Collection (2)

Document Collections and Sources

• Fixed, pre-existing document collection
  e.g., the classical philosophy works

• Pre-existing collection with periodic updates
  e.g., the MEDLINE biomedical collection

• Streaming data with temporal decay
  e.g., the Wall Street financial news feed

• Distributed proprietary document collections
  see Prof. Callan's methods

• Distributed, linked, publicly-accessible documents
  e.g., the Web

Page 5:

Technical Detour: Properties of Graphs I (1)

Definitions:

Graph: a set of nodes n and a set of edges (binary links) v between the nodes.

Directed graph: a graph where every edge has a pre-specified direction.

Page 6:

Technical Detour: Properties of Graphs I (2)

Connected graph: a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other.

The web graph: the directed graph where n = {all web pages} and v = {all HTML-defined links from one web page to another}.
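As a concrete illustration of these definitions, a directed graph such as the web graph can be stored as an adjacency list mapping each node to the nodes its edges point to. A minimal sketch in Python (the URLs are invented for illustration):

# A tiny directed "web graph": n = the dict's keys,
# v = the key -> value links.
web_graph = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example"],
    "http://c.example": ["http://a.example"],  # a link closing a loop
}

for page, links in web_graph.items():
    print(page, "links to", links)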

Page 7:

Technical Detour: Properties of Graphs I (3)

Tree: a connected graph without any loops and with a unique path between any two nodes.

Spanning tree of graph G: a tree constructed by including all n in G, and a subset of v such that G remains connected, but all loops are eliminated.

Page 8:

Technical Detour: Properties of Graphs I (4)

Forest: a set of trees (without inter-tree links).

k-Spanning forest: given a graph G with k connected subgraphs, the set of k trees each of which spans a different connected subgraph.

Page 9:

Graph G = <n, v>

Page 10:

Directed Graph Example

Page 11:

Tree

Page 12:

Web Graph

[Figure: web pages are nodes; HTML references (<href …>) are the links between them.]

Page 13:

Technical Detour: Properties of Graphs II (1)

Theorem 1: For every connected graph G, there exists a spanning tree.

Proof: Depth-first search starting at any node in G, keeping only the edges that first reach each node, builds a spanning tree.
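A minimal sketch of this construction in Python, assuming the adjacency-dict representation from the earlier illustration (for an undirected connected graph, list each edge under both endpoints):

def spanning_tree(graph, root):
    # Depth-first search; each node is attached by the edge that first
    # reaches it, so the result is connected and loop-free: a spanning tree.
    visited = {root}
    tree_edges = []
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                tree_edges.append((node, neighbor))
                stack.append(neighbor)
    return tree_edges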

Page 14:

Technical Detour: Properties of Graphs II (2)

Theorem 2: For every G with k disjoint connected subgraphs, there exists a k-spanning forest.

Proof: Each connected subgraph has a spanning tree (Theorem 1), and the set of k spanning trees (being disjoint) define a k-spanning forest.
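A sketch of this proof as code, reusing the spanning_tree function from the Theorem 1 illustration (names are illustrative):

def spanning_forest(graph):
    # One spanning tree per connected subgraph: a k-spanning forest.
    remaining = set(graph)
    forest = []
    while remaining:
        root = next(iter(remaining))
        tree = spanning_tree(graph, root)  # Theorem 1 construction
        forest.append(tree)
        # Drop every node the tree reached, i.e. root's connected subgraph.
        remaining -= {root} | {child for _, child in tree}
    return forest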

Page 15:

Technical Detour: Properties of Graphs II (3)

Additional Observations

• The web graph at any instant of time contains k connected subgraphs (but we do not know the value of k, nor do we know a priori the structure of the web graph).

• If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF."

Page 16:

Graph-Search Algorithms I

PROCEDURE SPIDER1(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
        URLcurr := pop(STACK)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION

What is wrong with the above algorithm?

Page 17:

Depth-first Search

[Figure: example graph with nodes numbered 1-7; numbers = order in which nodes are visited.]

Page 18:

Graph-Search Algorithms II (1)

SPIDER1 is Incorrect

• What about loops in the web graph?

=> Algorithm will not halt

• What about convergent DAG structures?

=> Pages will be replicated in the collection

=> Inefficiently large index

=> Duplicates to annoy user

Page 19:

Graph-Search Algorithms II (2)

SPIDER1 is Incomplete

• The web graph has k connected subgraphs.

• SPIDER1 only reaches pages in the connected web subgraph where the ROOT page lives.

Page 20:

Graph-Search Algorithms III
A Correct Spidering Algorithm

PROCEDURE SPIDER2(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
|       Do URLcurr := pop(STACK)
|       Until URLcurr is not in COLLECTION
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION

Page 21:

Graph-Search Algorithms IV
A More Efficient Correct Algorithm

PROCEDURE SPIDER3(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
|   Initialize VISITED <big hash-table>
    While STACK is not empty,
|       Do URLcurr := pop(STACK)
|       Until URLcurr is not in VISITED
|       insert-hash(URLcurr, VISITED)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
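A runnable Python sketch of SPIDER3 (SPIDER2 is identical except that it tests membership in COLLECTION instead of a separate VISITED table). The look_up and extract_urls helpers are hypothetical stand-ins for a real HTTP fetch and HTML link extractor, not actual library calls:

def spider3(root, look_up, extract_urls):
    collection = {}   # COLLECTION: URL -> page contents
    visited = set()   # VISITED: hash table of URLs already fetched
    stack = [root]    # STACK := push(ROOT, STACK)
    while stack:
        url = stack.pop()
        if url in visited:        # the Do ... Until loop: skip seen URLs
            continue
        visited.add(url)          # insert-hash(URLcurr, VISITED)
        page = look_up(url)       # PAGE := look-up(URLcurr)
        collection[url] = page    # STORE(<URLcurr, PAGE>, COLLECTION)
        for next_url in extract_urls(page):
            stack.append(next_url)   # push(URLi, STACK)
    return collection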

Page 22:

Graph-Search Algorithms V
A More Complete Correct Algorithm

PROCEDURE SPIDER4(G, {SEEDS})
|   Initialize COLLECTION <big file of URL-page pairs>
|   Initialize VISITED <big hash-table>
|   For every ROOT in SEEDS
|       Initialize STACK <stack data structure>
|       Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URLcurr := pop(STACK)
            Until URLcurr is not in VISITED
            insert-hash(URLcurr, VISITED)
            PAGE := look-up(URLcurr)
            STORE(<URLcurr, PAGE>, COLLECTION)
            For every URLi in PAGE,
                push(URLi, STACK)
    Return COLLECTION
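As a sketch, SPIDER4 simply wraps the same loop in a pass over the seed set, with one shared VISITED table and one shared COLLECTION so no work is repeated across seeds (same hypothetical look_up and extract_urls helpers as above):

def spider4(seeds, look_up, extract_urls):
    collection = {}
    visited = set()
    for root in seeds:             # For every ROOT in SEEDS
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:
                continue
            visited.add(url)
            page = look_up(url)
            collection[url] = page
            for next_url in extract_urls(page):
                stack.append(next_url)
    return collection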

Page 23:

Graph-Search Algorithms VI
Completeness Observations (1)

Completeness is not guaranteed

• In a web graph G with k connected subgraphs, we do not know k

• Impossible to guarantee each connected subgraph is sampled

• Better: more seeds, more diverse seeds

Page 24:

Graph-Search Algorithms VI
Completeness Observations (2)

Search Engine Practice

• Wish to maximize subset of web indexed.

• Maintain (secret) set of diverse seeds

(grow this set opportunistically, e.g. when X complains that his/her page is not indexed).

• Register new web sites on demand

New registrations are seed candidates.

Page 25:

To Spider or not to Spider? (1)

User Perceptions

• Most annoying: Engine finds nothing (too small an index, but not an issue since 1998 or so).

• Somewhat annoying: Obsolete links
  => Refresh collection by deleting dead links (OK if index is slightly smaller)
  => Done every 1-2 weeks in best engines

• Mildly annoying: Failure to find a new site
  => Re-spider entire web
  => Done every 2-4 weeks in best engines

Page 26:

To Spider or not to Spider? (2)

Cost of Spidering

• Semi-parallel algorithmic decomposition

• Spider can (and does) run on hundreds of servers simultaneously

• Very high network connectivity (e.g. T3 line)

• Servers can migrate from spidering to query processing depending on time-of-day load

• Running a full web spider takes days even with hundreds of dedicated servers

Page 27:

Current Status of Web Spiders (1)

Historical Notes

• WebCrawler: first documented spider

• Lycos: first large-scale spider

• Top honors for most web pages spidered: first Lycos, then AltaVista, then Google...

Page 28:

Current Status of Web Spiders (2)

Enhanced Spidering

• In-link counts to pages can be established during spidering.

• Hint: In SPIDER4, store <URL, COUNT> pairs in the VISITED hash table.

• In-link counts are the basis for Google's PageRank method.
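A sketch of that hint, changing VISITED in the spider4 illustration above from a set into a URL -> count dict (as before, look_up and extract_urls are hypothetical helpers):

def spider4_with_counts(seeds, look_up, extract_urls):
    collection = {}
    visited = {}                   # <URL, COUNT> pairs
    for root in seeds:
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:     # each later pop = one more in-link seen
                visited[url] += 1
                continue
            visited[url] = 1       # sketch simplification: seeds start at 1
            page = look_up(url)
            collection[url] = page
            for next_url in extract_urls(page):
                stack.append(next_url)
    return collection, visited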

Page 29:

Current Status of Web Spiders (3)

Unsolved Problems

• Most spidering re-traverses a stable web graph
  => on-demand re-spidering when changes occur

• Completeness or near-completeness is still a major issue

• Cannot spider Java-triggered or local-DB-stored information