Web Search – Summer Term 2006
VII. Web Search - Indexing: Structure Index
(c) Wolfgang Hürst, Albert-Ludwigs-University
Structure Index (Links)
Represents the links between the indexed pages
Important for
- Relevance calculation (PageRank, HITS, ...)
- Crawling (importance metrics, ...)
- Some other applications (Web mining, etc.)
Most critical issues (again):
- Size and rate of change
Most important requirements:
- Reduce space / compression
- Support required operations (random and streaming access, add / delete)
- Speed
The Web Graph
The structure index represents the web graph:
- Node = web page
- Directed edge = link
[Figure: example web graph with three nodes (1, 2, 3)]
Common representation techniques for graphs:
a) Adjacency matrix
b) Adjacency list
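The two representations can be compared on a small graph. The following is a minimal sketch; the edge set is made up for illustration, because the edges of the slide's three-node figure are not recoverable from the transcript, and the function names are hypothetical.

```python
# Hypothetical 3-page web graph; these edges are NOT taken from the slide.
edges = [(1, 2), (1, 3), (3, 2)]
num_nodes = 3


def to_adjacency_matrix(edges, n):
    """n x n matrix with matrix[i][j] == 1 if page i+1 links to page j+1."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src - 1][dst - 1] = 1
    return matrix


def to_adjacency_list(edges, n):
    """Map each page to the list of pages it links to (its successors)."""
    adj = {node: [] for node in range(1, n + 1)}
    for src, dst in edges:
        adj[src].append(dst)
    return adj


matrix = to_adjacency_matrix(edges, num_nodes)
adj_list = to_adjacency_list(edges, num_nodes)
```

The matrix always needs n * n cells, while the list needs space proportional to the number of links, which is why list-based layouts are the usual starting point for a sparse graph such as the web.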
The Structure Index
Example: The Connectivity Server [3]
Based on a data structure that supports the following operations:
- Given a URL u (or a set of URLs U), return a list of pages that point to u (U), i.e. its predecessors (back links) and a list of pages that are pointed to from u (U), i.e. its successors (forward links)
- Given a set of URLs U and a distance, return the respective neighborhood of U in the graph
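The two operations can be sketched with plain dictionaries. This is an illustration only, not the Connectivity Server's implementation; the function names are hypothetical, and the neighborhood query assumes that links are followed in both directions, which the slide does not specify.

```python
from collections import deque


def build_link_maps(edges):
    """Build successor (forward link) and predecessor (back link) maps."""
    successors, predecessors = {}, {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        predecessors.setdefault(dst, []).append(src)
        successors.setdefault(dst, [])
        predecessors.setdefault(src, [])
    return successors, predecessors


def neighborhood(start_urls, distance, successors, predecessors):
    """Breadth-first search up to `distance` hops from the start set.

    Assumption: both forward and back links count as one hop.
    """
    seen = set(start_urls)
    queue = deque((url, 0) for url in seen)
    while queue:
        url, depth = queue.popleft()
        if depth == distance:
            continue
        for nxt in successors.get(url, []) + predecessors.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


# Hypothetical link data: a -> b -> c -> d, and e -> a.
succ, pred = build_link_maps([("a", "b"), ("b", "c"), ("c", "d"), ("e", "a")])
```

With this data, `pred["b"]` gives the back links of b and `succ["b"]` its forward links; `neighborhood({"b"}, 1, succ, pred)` collects b together with both.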
The Connectivity Server
Nodes: Array (1 node = 1 element)
Edges:
- OUTLIST: Adjacency list (successors)
- INLIST: Inverted adjacency list (predecessors)
[Figure: each NODE TABLE entry holds three pointers: PTR TO URL (into the URL DATABASE), PTR TO INLIST (into the INLIST TABLE), and PTR TO OUTLIST (into the OUTLIST TABLE)]
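The table layout can be sketched with flat Python lists. This is a simplified illustration under stated assumptions: pointers are modeled as list offsets, a sentinel entry marks the end of the last list, and all names are hypothetical rather than taken from [3].

```python
def build_tables(urls, edges):
    """Build a node table plus flat INLIST and OUTLIST tables.

    Each node table entry is (url_ptr, inlist_ptr, outlist_ptr), where the
    pointers are offsets into the sorted URL list and the two flat tables.
    """
    sorted_urls = sorted(urls)
    url_to_id = {url: i for i, url in enumerate(sorted_urls)}
    n = len(sorted_urls)
    out_lists = [[] for _ in range(n)]
    in_lists = [[] for _ in range(n)]
    for src, dst in edges:
        s, d = url_to_id[src], url_to_id[dst]
        out_lists[s].append(d)
        in_lists[d].append(s)

    node_table, inlist_table, outlist_table = [], [], []
    for i in range(n):
        node_table.append((i, len(inlist_table), len(outlist_table)))
        inlist_table.extend(sorted(in_lists[i]))
        outlist_table.extend(sorted(out_lists[i]))
    # Sentinel entry (an assumption of this sketch): a node's list ends
    # where the next node's list begins.
    node_table.append((None, len(inlist_table), len(outlist_table)))
    return sorted_urls, node_table, inlist_table, outlist_table


def predecessors(node_id, node_table, inlist_table):
    return inlist_table[node_table[node_id][1]:node_table[node_id + 1][1]]


def successors(node_id, node_table, outlist_table):
    return outlist_table[node_table[node_id][2]:node_table[node_id + 1][2]]


# Hypothetical data: a -> b, a -> c, c -> b (IDs: a = 0, b = 1, c = 2).
urls, nodes, inl, outl = build_tables(
    ["a", "b", "c"], [("a", "b"), ("a", "c"), ("c", "b")]
)
```

Storing only a start offset per node keeps each node table entry at a fixed size, so the array can be indexed directly by node ID.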
The Connectivity Server (cont.)
Additional data structure to map URLs to IDs (and vice versa)
ID = index in the lexicographically sorted list of all crawled URLs
Advantage: Compression, i.e. delta encoding (neighboring URLs in the sorted list share long prefixes)
Example:
ORIGINAL TEXT
WWW.FOOBAR.COM/
WWW.FOOBAR.COM/GANDALF.HTM
WWW.FOOGRAB.COM/
DELTA ENCODING (length of prefix shared with the previous URL, remaining suffix, node ID)
0   WWW.FOOBAR.COM/   1
15  GANDALF.HTM       26
7   GRAB.COM/         41
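The delta encoding of the example can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names; it stores only the shared prefix length and the remaining suffix, and leaves out the node ID column.

```python
def delta_encode(sorted_urls):
    """Encode each URL as (shared prefix length, remaining suffix)
    relative to the previous URL in the sorted list."""
    encoded = []
    previous = ""
    for url in sorted_urls:
        shared = 0
        limit = min(len(previous), len(url))
        while shared < limit and previous[shared] == url[shared]:
            shared += 1
        encoded.append((shared, url[shared:]))
        previous = url
    return encoded


def delta_decode(encoded):
    """Rebuild the full URLs by scanning the entries in order."""
    urls = []
    previous = ""
    for shared, suffix in encoded:
        url = previous[:shared] + suffix
        urls.append(url)
        previous = url
    return urls


urls = [
    "WWW.FOOBAR.COM/",
    "WWW.FOOBAR.COM/GANDALF.HTM",
    "WWW.FOOGRAB.COM/",
]
encoded = delta_encode(urls)
# encoded == [(0, "WWW.FOOBAR.COM/"), (15, "GANDALF.HTM"), (7, "GRAB.COM/")]
```

Decoding has to start from the first entry, because each URL is defined only relative to its predecessor.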
The Connectivity Server (cont.)
Problem: Need to scan all URLs because of delta encoding (i.e. saves space at the cost of speed)
Solution: Include checkpoint URLs (stored in full, so that a scan can start at the nearest checkpoint)
Another problem: Updates are hard to do
Several other (newer) approaches exist that take into account (e.g.) the actual web structure
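The checkpoint idea mentioned above can be sketched as follows. The fixed checkpoint interval and all function names are assumptions of this sketch; the slide does not say how checkpoints are placed.

```python
def shared_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    limit = min(len(a), len(b))
    while n < limit and a[n] == b[n]:
        n += 1
    return n


def encode_with_checkpoints(sorted_urls, interval):
    """Delta-encode URLs, storing every `interval`-th URL in full."""
    entries = []
    previous = ""
    for i, url in enumerate(sorted_urls):
        if i % interval == 0:
            entries.append((0, url))  # checkpoint: full URL, no shared prefix
        else:
            k = shared_prefix_len(previous, url)
            entries.append((k, url[k:]))
        previous = url
    return entries


def url_for_id(entries, node_id, interval):
    """Decode one URL by scanning forward from the nearest checkpoint."""
    start = node_id - node_id % interval
    url = ""
    for shared, suffix in entries[start:node_id + 1]:
        url = url[:shared] + suffix
    return url


urls = [
    "WWW.FOOBAR.COM/",
    "WWW.FOOBAR.COM/GANDALF.HTM",
    "WWW.FOOGRAB.COM/",
]
entries = encode_with_checkpoints(urls, interval=2)
```

A lookup now decodes at most `interval` entries instead of the whole list; a smaller interval means faster lookups but less compression.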
S-Node Representation [4]
Observations on the web structure:
- Link copying: Lots of clusters with nodes containing very similar adjacency lists
- Domain and URL locality: A significant fraction of links on a page point to pages from the same domain
- Page similarity: Pages that have very similar adjacency lists are likely to be related
Idea: Make use of these observations, e.g. by grouping related pages / similar URLs
S-Node Representation - Example
PARTITION P = {N1, N2, N3}
N1 = {P1, P2}
N2 = {P3}
N3 = {P4, P5}
[Figure: the example web graph on pages 1-5, its INTRA-NODE graphs Ni, the SUPERNODE GRAPH on N1, N2, N3, and the POSITIVE and NEGATIVE SUPEREDGE graphs between supernodes]
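The pieces of the example can be sketched in Python. The partition is the one from the slide, but the page-level edges are made up, since the figure's edges are not recoverable from the transcript. The sketch assumes, following my reading of [4], that a positive superedge graph lists the links that exist between two supernodes, a negative one lists the links that do not exist, and the smaller of the two is stored. All names are hypothetical.

```python
from itertools import product

# Partition from the slide; the page-level edges below are hypothetical.
partition = {"N1": {1, 2}, "N2": {3}, "N3": {4, 5}}
edges = {(1, 2), (4, 5), (1, 3), (2, 3), (3, 4), (1, 4), (1, 5), (2, 4)}


def build_snode(partition, edges):
    """Split a page graph into intranode graphs, a supernode graph,
    and one positive or negative superedge graph per supernode pair."""
    owner = {page: name for name, pages in partition.items() for page in pages}
    intranode = {name: set() for name in partition}
    positive = {}
    for src, dst in edges:
        a, b = owner[src], owner[dst]
        if a == b:
            intranode[a].add((src, dst))
        else:
            positive.setdefault((a, b), set()).add((src, dst))

    # Supernode graph: Ni -> Nj if any page in Ni links to a page in Nj.
    supernode_graph = set(positive)

    superedges = {}
    for (a, b), pos in positive.items():
        neg = set(product(partition[a], partition[b])) - pos
        if len(neg) < len(pos):
            superedges[(a, b)] = ("negative", neg)  # store the missing links
        else:
            superedges[(a, b)] = ("positive", pos)  # store the existing links
    return supernode_graph, intranode, superedges


supernode_graph, intranode, superedges = build_snode(partition, edges)
```

In this made-up data, three of the four possible links from N1 to N3 exist, so recording the single missing link (2, 5) as a negative superedge is smaller than listing the three existing ones.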
Creating partitions
1. Initial partition: Based on URL (top two levels of DNS), e.g.
   - www.informatik.uni-freiburg.de
   - ad.informatik.uni-freiburg.de
   - www.imtek.uni-freiburg.de
2. URL Split: Split the elements Ni based on URL prefixes, e.g.
   - www.informatik.uni-freiburg.de/students
   - www.informatik.uni-freiburg.de/studienberatung
3. Clustered Split: Use a clustering algorithm to split partitions into groups with similar adjacency lists
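Steps 1 and 2 above can be sketched with the standard library. This is a simplified illustration: the function names, the `http://` scheme, the extra path components, and the `www.example.org` URL are assumptions added for the example, and taking the last two host labels is a naive reading of "top two levels of DNS" (it would mishandle suffixes such as `co.uk`). Step 3 is not covered.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def initial_partition(urls):
    """Step 1: group URLs by the top two levels of their host name."""
    groups = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        key = ".".join(host.split(".")[-2:])
        groups[key].append(url)
    return dict(groups)


def url_split(urls, depth):
    """Step 2: split one group by host plus the first `depth` path segments."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = "/".join([parts.hostname or ""] + segments[:depth])
        groups[prefix].append(url)
    return dict(groups)


# Hosts and the two path prefixes are from the slide; the rest is hypothetical.
urls = [
    "http://www.informatik.uni-freiburg.de/students/a.html",
    "http://www.informatik.uni-freiburg.de/studienberatung/",
    "http://ad.informatik.uni-freiburg.de/",
    "http://www.imtek.uni-freiburg.de/",
    "http://www.example.org/",
]
initial = initial_partition(urls)
refined = url_split(initial["uni-freiburg.de"], depth=1)
```

Here `initial` has two groups, `uni-freiburg.de` and `example.org`, and `refined` separates the first group into one element per host and first path segment.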
References - Indexing
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001. Chapter 4 (Indexing)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 4 (System Anatomy)
[3] Bharat, Broder, Henzinger, Kumar, Venkatasubramanian: "The Connectivity Server: Fast Access to Linkage Information on the Web", WWW 1998
[4] Raghavan, Garcia-Molina: "Representing Web Graphs", Stanford Technical Report 2002
General Web Search Engine Architecture
[Figure (cf. [1], Fig. 1): architecture diagram with the components CLIENT (QUERIES, RESULTS), QUERY ENGINE, RANKING, CRAWL CONTROL, CRAWLER(S), WWW, USAGE FEEDBACK, PAGE REPOSITORY, INDEXER MODULE, COLLECTION ANALYSIS MODULE, and INDEXES (TEXT, STRUCTURE, UTILITY)]
The evolution of search engines
1st generation: Use only "on page", text data
- Word frequency, language
1995-1997 (AltaVista, Excite, Lycos, etc.)
2nd generation: Use off-page, web-specific data
- Link (or connectivity) analysis
- Click-through data (what results people click on)
- Anchor text (how people refer to a page)
From 1998 (made popular by Google, but used by everyone now)
3rd generation: Answer "the need behind the query"
Still experimental
Semantic analysis
- What is it about?
Focus on user need, rather than on query
- Corpus reflects user needs / expectations
- Integrates multiple sources of data
- Help the user create a good query
Context determination
- Spatial (user location / target location)
- Query stream (previous queries)
- Personal (user profile)
- Explicit (vertical search)
- Implicit (on altavista.de)
Helping the user
- UI, spell checking, query refinement, query suggestion, syntax-driven feedback, context help, context transfer, etc.
Integration of search and text analysis
Example: Google
(Source: Tutorial on Search from the Web to the Enterprise, SIGIR 2002)
Web Search Lecture - Schedule
1. Classic IR (Basics)
2. Classic IR Exercises
3. Web Search (Basics)
4. Web Search Exercises [June 28th to July 12th]
5. Web Search (Selected Topics) [July 18th to July 26th]
Web Search – Summer Term 2006
Web Search Basics - (Programming) Exercises
(c) Wolfgang Hürst, Albert-Ludwigs-University
Programming Exercises
Exercise sheet 1: Tools, Library (Lucene)
Exercise sheet 2: Database (and text index)
Exercise sheet 3: Index (structure index)
Exercise sheet 4: Search (link-based ranking)
New Lecturnity Player
Advanced replay features (developed by us)
Modification of replay speed (while preserving the pitch of the voice)