authoritative sources in a hyperlinked environment hui han cse dept, psu 10/15/01

20
Authoritative Sources in a Hyperlinked Authoritative Sources in a Hyperlinked Environment Environment Hui Han Hui Han CSE dept, PSU CSE dept, PSU 10/15/01 10/15/01

Post on 19-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Authoritative Sources in a Hyperlinked EnvironmentAuthoritative Sources in a Hyperlinked Environment

Hui HanHui Han

CSE dept, PSUCSE dept, PSU

10/15/0110/15/01

Page 2: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Main IdeaMain Idea

Return more authoritative information, using the algorithm based on link structure analysis, esp. the relationship between a set of relevant authoritative pages and the set of “hub” pages that join them together in the link structure

Page 3: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

The types of queriesThe types of queries

Specific queries – scarcity problem

“Does Netscape support the JDK 1.1 code-signing API” Broad-topic queries – abundance problem

“Find information about the Java programming language.”

Solution: find a small set of the most “authoritative” relevant pages.

Similar-page queries “Find pages ‘similar’ to java.sun.com

Page 4: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Linked Based AnalysisLinked Based Analysis

Limitations of text based analysis– Text-based ranking function

Eg. Could www.harvard.edu be recognized as one of the most authoritative pages, since many other webpages contain “harvard” more often.

– Pages are not sufficiently self – descriptive Usually the term “search engine” doesn’t appear on

search engine web pages

Page 5: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Limitations of Basic Link-based AnalysisLimitations of Basic Link-based Analysis

Basic model: Of all pages containing the query string, return those with the greatest number of in-links

Large number of links are created primarily for navigational purposes

Difficulty in finding a balance between relevance and popularity– Popular sites like www.yahoo.com would be considered as highly

authoritative on any query strings it contained.

P QConferred Authority

Page 6: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Hub-Authority methodHub-Authority method

A link-based model for the conferral of authority, using the method that consistently identifies relevant, authoritative WWW pages for broad search topics.

Difference from Clustering– Distinguishing pages related to different

meanings or senses of a query term

Page 7: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Step1: Constructing a Focused Subgraph Step1: Constructing a Focused Subgraph of the WWWof the WWW

Properties of a ideal collection S of pages

– S is relatively small

– S is rich in relevant pages

– S contains most(or many) of the strongest authorities

Construction of S– Collect the t highest-ranked pages for the query from a text-

based search engine (Altavista, hotbot) as root set R– Expand R to base set S

a strong authority is quite likely to be pointed to by at least one page in R

– Keep only transverse links

Page 8: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Step2: Compute Hubs and AuthoritiesStep2: Compute Hubs and Authorities

Authoritative pages should– have large in-degree– have a considerable overlap in the sets of pages that

point to them, since they share a common topic

“Hub” pages “pull together”(link to) authorities on a common topic

Page 9: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

An Iterative AlgorithmAn Iterative Algorithm

G : the subgraph induced on the pages in S

x(p) : authority weight of Page Py(p): hub weight of Page P

Normalize: pS (x(p))2 = 1

pS (y(p))2 = 1 I operation:

x(p) = q : (q,p) Ey(p)

O operation:

y(p) = q : (q,p) Ex(p)

Hub and Authority exhibit “mutually reinforcing relationship”

Page 10: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

An Iterative Algorithm(cont.)An Iterative Algorithm(cont.)

{x(p) } is a vector x with authority weight for each page in G

{y(p) } is a vector y with hub weight for each page in G

Filter out the top c authorities and top c hubs

Page 11: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Basic knowledge of MatrixBasic knowledge of Matrix

M: symmetric n*n matrix :vector : a number

If for some vector , M = , we say,The set of all such is a subspace of Rn Eigenspace

associated with ;

These 1(M), 2(M), … are eigenvalues, while 1(M), 2(M), … are eigenvectors

i(M) belongs to the subspace of i(M)

If we assume |1(M) > 2(M)|, we refer to 1(M) as the principal eigenvector, and all other i(M) as non-principal eigenvector.

Page 12: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Convergence Proof of Iterate ProcedureConvergence Proof of Iterate Procedure

Theorem1. The sequences x1, x2, x3, … and y1, y2, y3, … converge to x* and y* respectively.

Proof: G=(V,E); V={p1, p2, …, pn}; A is the adjacency matrix of graph G; Aij = 1 if (pi, pj) is an edge of G.

I & O operations can be written as:x ATy y Ax K loops,

So, x (1) AT Ax (0); x(0) = AT z x* … x (k) (AT A)k-1 AT zy* … y (k) (AAT)k z

“if is a vector not orthogonal to the principle eigenvector 1(M), the unit vector in the direction of Mk converges to 1(M) as k increases without bound”

Page 13: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Convergence Proof of Iterate Procedure(cont.)Convergence Proof of Iterate Procedure(cont.)

A is called an orthogonal matrix if AAT = AT A = E.

Theorem2: x* is the principal eigenvector of ATA, and y* is the principal eigenvector of AAT.

Experiment finds that k=20 is sufficient for the convergence of vectors.

Page 14: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Experimental resultsExperimental results Most of these authorities do not contain any occurrences of

the initial query string , except http://www.roadhead.com

Page 15: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Similar-Page QueriesSimilar-Page Queries

The strongest authorities in the local region of the link structure near p, can server as a broad-topic summary of the pages related to p.

A small difference in initial request to the search engine: “Find t pages pointing to p”, to get R

Page 16: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Multiple sets of Hubs and AuthoritiesMultiple sets of Hubs and Authorities

Why?– The query string may have several very different

meanings. Eg.“java”– The string may arise as a term in the context of multiple

technical communities. Eg. “randomized algorithms”– The string may refer to a highly polarized issue,

involving groups that are not likely to link to one another. Eg. “abortion”

Idea:– The NON-principle eigenvectors of ATA and AAT

provide us with a natural way to extract additional densely linked collections of hubs and authorities from the base set S.

Page 17: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Multiple sets of Hubs and AuthoritiesMultiple sets of Hubs and AuthoritiesExperimental result1Experimental result1

For the query “jaguar*”, the strongest collections of authoritative sources concerned the Atari Jaguar product, the NFL football team from Jacksonville, and the automobile.

Page 18: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Multiple sets of Hubs and AuthoritiesMultiple sets of Hubs and AuthoritiesExperimental result2Experimental result2

For the query “randomized algorithms”, none of the strongest collections of hubs and authorities are precisely on the query topic. They include home pages of theoretical computer scientists, compendia of mathesmatical software and pages on wavelets.

Page 19: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

Multiple sets of Hubs and AuthoritiesMultiple sets of Hubs and AuthoritiesExperimental result3Experimental result3

For the query “abortion”, there is a natural separation.

Page 20: Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01

ConclusionConclusion

A technique for locating high-quality information related to a broad search topic on the www, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic.

Related work.– Standing, influence in social networks, scientific

citations, etc.– Hypertext and WWW rankings– …