Web Search – Summer Term 2006

VII. Web Search - Indexing: Structure Index

(c) Wolfgang Hürst, Albert-Ludwigs-University

Structure Index (Links)

Represents the links between the indexed pages

Important for:
- Relevance calculation (PageRank, HITS, ...)
- Crawling (importance metrics, ...)
and some other applications (Web mining, etc.)

Most critical issues (again):
- Size and rate of change

Most important requirements:
- Reduce space / compression
- Support required operations (random and streaming access, add / delete)
- Speed

The Web Graph

The structure index represents the web graph:
- Node = web page
- Directed edge = link

[Figure: example directed graph with three nodes (1, 2, 3)]

Common representation techniques for graphs:
a) Adjacency matrix
b) Adjacency list
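As a rough sketch of the two techniques, the following Python snippet stores the same small directed graph both ways. The edge set is an assumption made for illustration (the figure's edges are not reproduced here), and the names `to_matrix` and `to_adjacency_list` are hypothetical.

```python
# Hypothetical three-page web graph; the edges are assumed for illustration.
nodes = [1, 2, 3]
edges = [(1, 2), (1, 3), (3, 2)]

def to_matrix(nodes, edges):
    """Adjacency matrix: entry [i][j] is 1 if page i links to page j."""
    index = {n: k for k, n in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix

def to_adjacency_list(nodes, edges):
    """Adjacency list: each page maps to the list of pages it links to."""
    adj = {n: [] for n in nodes}
    for src, dst in edges:
        adj[src].append(dst)
    return adj

matrix = to_matrix(nodes, edges)
adj = to_adjacency_list(nodes, edges)
```

The matrix needs one entry per pair of pages no matter how many links exist, while the list needs one entry per link, which matters for a sparse graph such as the web graph.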

The Structure Index

Example: The Connectivity Server [3]

Based on a data structure that supports the following operations:

- Given a URL u (or a set of URLs U), return a list of pages that point to u (U), i.e. its predecessors (back links) and a list of pages that are pointed to from u (U), i.e. its successors (forward links)

- Given a set of URLs U and a distance, return the respective neighborhood of U in the graph
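The two operations can be sketched as follows, assuming the graph is held as two in-memory dictionaries, one for forward links and one for back links. The function names, the sample page names, and the reading of "neighborhood" as every page reachable within the given number of link hops in either direction are assumptions, not the Connectivity Server's actual interface.

```python
from collections import deque

# Hypothetical link data: page -> pages it links to.
OUT = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": []}

def invert(out):
    """Build the back-link map (page -> pages linking to it)."""
    inn = {u: [] for u in out}
    for src, dsts in out.items():
        for dst in dsts:
            inn.setdefault(dst, []).append(src)
    return inn

IN = invert(OUT)

def links(urls):
    """Return (predecessors, successors) for a set of URLs."""
    preds = sorted({p for u in urls for p in IN.get(u, [])})
    succs = sorted({s for u in urls for s in OUT.get(u, [])})
    return preds, succs

def neighborhood(urls, distance):
    """Breadth-first search over forward and back links up to `distance` hops."""
    seen = set(urls)
    queue = deque((u, 0) for u in urls)
    while queue:
        u, d = queue.popleft()
        if d == distance:
            continue  # do not expand beyond the requested distance
        for v in OUT.get(u, []) + IN.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append((v, d + 1))
    return seen
```

For example, `links({"c"})` returns the back links and forward links of page `c`, and `neighborhood({"a"}, 1)` returns `a` together with its direct neighbors.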

The Connectivity Server

Nodes: Array (1 node = 1 element)

Edges:
- OUTLIST: Adjacency list (successors)
- INLIST: Inverted adjacency list (predecessors)

[Figure: the NODE TABLE holds one entry per node with three fields (PTR TO URL, PTR TO INLIST, PTR TO OUTLIST), pointing into the URL DATABASE, the INLIST TABLE, and the OUTLIST TABLE respectively]
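One way to picture this layout is as three flat arrays, sketched below. Each node entry stores offsets into shared INLIST and OUTLIST arrays, and the end of a node's list is taken to be the start of the next node's list. That end-of-list convention, the field names, and the sample data are assumptions for illustration, not details confirmed by [3].

```python
from dataclasses import dataclass

@dataclass
class Node:
    url_ptr: int      # index into the URL database
    inlist_ptr: int   # offset of the first predecessor in INLIST
    outlist_ptr: int  # offset of the first successor in OUTLIST

def build_tables(urls, edges):
    """Build node, INLIST, and OUTLIST tables from (src_id, dst_id) edges."""
    n = len(urls)
    succ = [[] for _ in range(n)]
    pred = [[] for _ in range(n)]
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)
    nodes, inlist, outlist = [], [], []
    for i in range(n):
        nodes.append(Node(i, len(inlist), len(outlist)))
        inlist.extend(sorted(pred[i]))
        outlist.extend(sorted(succ[i]))
    return nodes, inlist, outlist

def successors(nodes, outlist, i):
    """Slice node i's successors out of the shared OUTLIST table."""
    end = nodes[i + 1].outlist_ptr if i + 1 < len(nodes) else len(outlist)
    return outlist[nodes[i].outlist_ptr:end]

def predecessors(nodes, inlist, i):
    """Slice node i's predecessors out of the shared INLIST table."""
    end = nodes[i + 1].inlist_ptr if i + 1 < len(nodes) else len(inlist)
    return inlist[nodes[i].inlist_ptr:end]

# Hypothetical sample: three URLs (already sorted) and three links.
urls = ["a.example/", "b.example/", "c.example/"]
nodes, inlist, outlist = build_tables(urls, [(0, 1), (0, 2), (2, 1)])
```

Keeping all lists in two contiguous arrays supports both random access (via the pointers) and streaming access (by reading the arrays front to back).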

The Connectivity Server (cont.)

Additional data structure to map URLs to IDs (and vice versa)

ID = index in the lexicographically sorted list of all crawled URLs

Advantage: Compression, i.e. delta-encoding

Example:

ORIGINAL TEXT
WWW.FOOBAR.COM/
WWW.FOOBAR.COM/GANDALF.HTM
WWW.FOOGRAB.COM/

DELTA ENCODING
0 WWW.FOOBAR.COM/ 1
15 GANDALF.HTM 26
7 GRAB.COM/ 41
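The encoding in this example can be reproduced with a short sketch. Each URL in the sorted list is stored as the length of the prefix it shares with the previous URL, plus the remaining suffix. The trailing numbers in the example (1, 26, 41) are not modeled here, and the function names are hypothetical.

```python
import os

def delta_encode(sorted_urls):
    """Encode each URL as (shared prefix length with previous URL, suffix)."""
    encoded, prev = [], ""
    for url in sorted_urls:
        shared = len(os.path.commonprefix([prev, url]))
        encoded.append((shared, url[shared:]))
        prev = url
    return encoded

def delta_decode(encoded):
    """Rebuild the full URLs by replaying the list from the start."""
    urls, prev = [], ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        urls.append(prev)
    return urls

urls = [
    "WWW.FOOBAR.COM/",
    "WWW.FOOBAR.COM/GANDALF.HTM",
    "WWW.FOOGRAB.COM/",
]
encoded = delta_encode(urls)
```

Decoding has to start from the first entry, since every entry depends on the one before it.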

The Connectivity Server (cont.)

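Problem: Because of delta encoding, looking up a URL requires scanning all URLs before it (i.e. space is saved at the cost of speed)

Solution: Include checkpoint URLs, i.e. URLs stored in full at regular intervals, so that a scan can start at the nearest checkpoint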
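A minimal sketch of the checkpoint idea, assuming that every k-th URL is stored in full and that a lookup binary-searches the checkpoints before decoding forward. The interval `k`, the function names, and the sample URLs are assumptions for illustration; [3] may organize its checkpoints differently.

```python
import bisect
import os

def encode_with_checkpoints(sorted_urls, k):
    """Delta-encode URLs, storing every k-th URL in full as a checkpoint."""
    entries, checkpoints, prev = [], [], ""
    for i, url in enumerate(sorted_urls):
        if i % k == 0:
            checkpoints.append(url)   # full URL, usable as a search key
            entries.append((0, url))
        else:
            shared = len(os.path.commonprefix([prev, url]))
            entries.append((shared, url[shared:]))
        prev = url
    return entries, checkpoints

def url_to_id(url, entries, checkpoints, k):
    """Find a URL's ID by scanning at most k entries after a checkpoint."""
    block = bisect.bisect_right(checkpoints, url) - 1
    if block < 0:
        return None  # sorts before the first URL
    start, current = block * k, ""
    for i in range(start, min(start + k, len(entries))):
        shared, suffix = entries[i]
        current = current[:shared] + suffix
        if current == url:
            return i  # ID = index in the sorted URL list
    return None

# Hypothetical sample URLs.
urls = sorted(["a.example/", "a.example/x", "a.example/y",
               "b.example/", "b.example/z"])
entries, checkpoints = encode_with_checkpoints(urls, k=2)
```

A smaller `k` shortens each scan but stores more URLs uncompressed, which is the same space-versus-speed trade-off in adjustable form.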

Another problem: Updates are hard to do

Several other (newer) approaches exist that take into account (e.g.) the actual web structure

S-Node Representation [4]

Observations on the web structure:

- Link copying: Lots of clusters with nodes containing very similar adjacency lists

- Domain and URL locality: A significant fraction of links on a page point to pages from the same domain

- Page similarity: Pages that have very similar adjacency lists are likely to be related

Idea: Make use of these observations, e.g. by grouping related pages / similar URLs

S-Node Representation - Example

PARTITION P = {N1, N2, N3}
N1 = {P1, P2}
N2 = {P3}
N3 = {P4, P5}

[Figure: the example web graph with pages 1-5, the supernode graph over N1, N2, N3, the intranode graphs for each Ni, and the positive and negative superedge graphs]
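The pieces of the representation can be sketched in a few lines, assuming the definitions of [4] as understood here: the supernode graph has an edge from Ni to Nj when some page in Ni links to a page in Nj, an intranode graph holds the links inside one Ni, a positive superedge graph holds the links that exist from Ni to Nj, and a negative superedge graph holds the possible links from Ni to Nj that do not exist. The page-level links below are hypothetical, since the figure's edges are not reproduced here.

```python
from itertools import product

# Partition from the example; the page-level links are assumed for illustration.
partition = {"N1": {1, 2}, "N2": {3}, "N3": {4, 5}}
links = {(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5), (5, 1)}

def supernode_of(partition):
    """Map each page to the name of the partition element containing it."""
    return {p: name for name, pages in partition.items() for p in pages}

def build_snode(partition, links):
    owner = supernode_of(partition)
    intranode = {name: set() for name in partition}
    positive = {}  # (Ni, Nj) -> page links that exist between them
    for src, dst in links:
        a, b = owner[src], owner[dst]
        if a == b:
            intranode[a].add((src, dst))
        else:
            positive.setdefault((a, b), set()).add((src, dst))
    supergraph = set(positive)
    # Negative superedge graph: page pairs between Ni and Nj with no link.
    negative = {
        (a, b): set(product(partition[a], partition[b])) - pos
        for (a, b), pos in positive.items()
    }
    return supergraph, intranode, positive, negative

supergraph, intranode, positive, negative = build_snode(partition, links)

def smaller_encoding(a, b):
    """Pick whichever superedge graph has fewer edges to store."""
    pos, neg = positive[(a, b)], negative[(a, b)]
    return ("positive", pos) if len(pos) <= len(neg) else ("negative", neg)
```

In this sample every page of N1 links to every page of N2, so the negative superedge graph for that pair is empty and is the cheaper one to store.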

Creating partitions

1. Initial partition: Based on URL (top two levels of DNS), e.g.
   - www.informatik.uni-freiburg.de
   - ad.informatik.uni-freiburg.de
   - www.imtek.uni-freiburg.de

2. URL Split: Split the Ni based on URL prefixes, e.g.
   - www.informatik.uni-freiburg.de/students
   - www.informatik.uni-freiburg.de/studienberatung

3. Clustered Split: Use clustering algorithm to split partitions into groups with similar adjacency lists
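The first two steps can be sketched as simple grouping operations. The sketch assumes scheme-less URLs like those in the examples, takes "top two levels of DNS" to mean the last two labels of the host name, and splits on the host name plus the first path segment. The function names and the file names in the sample URLs are hypothetical, and the clustered split (step 3) is not covered.

```python
from collections import defaultdict

def initial_partition(urls):
    """Step 1: group URLs by the top two DNS levels of their host."""
    groups = defaultdict(list)
    for url in urls:
        host = url.split("/", 1)[0]
        groups[".".join(host.split(".")[-2:])].append(url)
    return dict(groups)

def url_split(group):
    """Step 2: split one group by host name plus first path segment."""
    parts = defaultdict(list)
    for url in group:
        prefix = "/".join(url.split("/")[:2])
        parts[prefix].append(url)
    return dict(parts)

urls = [
    "www.informatik.uni-freiburg.de/students/a.html",
    "www.informatik.uni-freiburg.de/studienberatung/b.html",
    "www.imtek.uni-freiburg.de/index.html",
]
step1 = initial_partition(urls)
step2 = url_split(step1["uni-freiburg.de"])
```

Here all three URLs start out in the single element `uni-freiburg.de`, and the URL split then separates the `students` and `studienberatung` prefixes.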

References - Indexing

[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001, Chapter 4 (Indexing)

[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998, Chapter 4 (System Anatomy)

[3] Bharat, Broder, Henzinger, Kumar, Venkatasubramanian: "The Connectivity Server: Fast Access to Linkage Information on the Web", WWW 1998

[4] Raghavan, Garcia-Molina: "Representing Web Graphs", Stanford Technical Report 2002

General Web Search Engine Architecture

[Figure: architecture diagram with the components Client, Query Engine, Ranking, Crawl Control, Crawler(s), WWW, Page Repository, Indexer Module, Collection Analysis Module, and Indexes (Structure, Utility, Text); further labels: Queries, Results, Usage Feedback]

(cf. [1] Fig. 1)

The evolution of search engines

1st generation: Use only "on page" text data
- Word frequency, language

1995-1997 (AltaVista, Excite, Lycos, etc.)

2nd generation: Use off-page, web-specific data
- Link (or connectivity) analysis
- Click-through data (what results people click on)
- Anchor text (how people refer to a page)

Since 1998 (made popular by Google, but used by everyone now)

Source: Tutorial on Search from the Web to the Enterprise, SIGIR 2002

The evolution of search engines

3rd generation: Answer the need behind the query

Still experimental

Semantic analysis - What is it about?

Focus on user need, rather than on query
- Corpus reflects user needs / expectations
- Integrates multiple sources of data
- Help the user create a good query

Context determination
- Spatial (user location / target location)
- Query stream (previous queries)
- Personal (user profile)
- Explicit (vertical search)
- Implicit (on altavista.de)

The evolution of search engines

3rd generation: Answer the need behind the query (cont.)

Helping the user
- UI, spell checking, query refinement, query suggestion, syntax-driven feedback, context help, context transfer, etc.

Integration of search and text analysis

Example: Google

Web Search Lecture - Schedule

1. Classic IR (Basics)

2. Classic IR Exercises

3. Web Search (Basics)

4. Web Search Exercises [June 28th to July 12th]

5. Web Search (Selected Topics) [July 18th to July 26th]

Web Search – Summer Term 2006

Web Search Basics - (Programming) Exercises

(c) Wolfgang Hürst, Albert-Ludwigs-University

Programming Exercises

Exercise sheet 1: Tools, Library (Lucene)

Exercise sheet 2: Database (and text index)

Exercise sheet 3: Index (structure index)

Exercise sheet 4: Search (link-based ranking)

New Lecturnity Player

Advanced replay features (developed by us)

Modification of replay speed (while preserving the pitch of the voice)
