web search algorithms - dcu school of computingasmeaton/ca652/websearchalgo.pdf•web 3.0 –ugc...

- 1 -

Web Search Algorithms

- 2 -

Why web search in this module?

• WWW is the delivery platform and the interface
• How do we find information and services on the web? We try to guess a URL that seems sensible
  – Dell Computers: www.dell.ie; Ford Ireland: www.ford.ie
  – But products? GPS devices: www.gps.ie is not OK
• Or, we use a Search Engine
  – So we rely on Search Engines; we even use them to look up spellings and as a calculator!
• Search Engines bring people to a website
  – For most, such as Google, the ranking algorithm is closely guarded, wholesome, true, uncorrupted, and not paid
    • advertisements are merely sold based on similarity to query keywords.
• This leads to the industry of Search Engine Optimisation (SEO) ... the “Google Dance”

- 3 -

Text IR - Google as example

• Google has been operational since 1998
  – Two PhD students from Stanford
• ?? billion documents
  – Early search engines competed on size of index, related to how powerful their infrastructure was. Not an issue now.
  – Stopped advertising index size after 8,168,684,336 pages in Aug 2005
  – Size now effectively unknown
• Also has ??? billion images
  – not all unique images
  – Flickr had about 2B (Nov 2007); Facebook had 4.1B at that time

- 4 -

Searching or Marketing?

• However, Search Engines must make a profit!
  – Advertisement sales
  – Marketing
  – Paid listings
  – And selling their indexes
• A lot of Search Engines are also marketing companies…
  – This is at odds with the idea that a search engine is a page you visit on the way elsewhere.
    • The less time you spend there the better!
    • But many people ‘pass through the doors’, so they sell query-focussed advertisements
  – You can estimate by looking at the main page of the search engine.

- 5 -

How do SEs help user searches

• It is known that we search for
  – people / home pages.
  – companies / company HPs (or guess from URLs).
  – a particular product or service.
  – a fact, buried in one or more documents, any one of which will do…
  – a document, an entire document, with text/image, and nothing smaller will do.
  – an overview on a broad or narrow topic
  – Media Search
    • an MPEG-4 file.
    • Through image databases.
    • Through (digital) video library, and/or through a video.
• If the SE knows the type of query, then ranking can be tailored to that query, because different search types can be satisfied by different search algorithms.

- 6 -

Search Engines

• Originally SEs were web directories
  – Manually generated (e.g. Yahoo!)
• Then automatic crawler-based Search Engines developed
  – The web got big and manual categorisation was becoming too difficult (e.g. Lycos)
  – Today the large SEs index over ?? billion web pages.
  – The first crawler-based SE was the WWWW (World Wide Web Worm) in 1994

- 7 -

Architecture of a Search Engine

- 8 -

My Google!

- 9 -

Bing

- 10 -

Facebook? Is it a Search Engine?

- 11 -

Facebook Social Graph

[Figure: social graph with clusters labelled College Friends, Friends, A Previous Class, IR Research Community]

- 12 -

TWITTER

- 13 -

The Landscape is changing

- 14 -

Web 1.0 to Web 3.0

• Web 1.0
  – Static content... Companies created content
  – We were consumers
• Web 2.0
  – User generated content
  – Communities and creators... We create, filter, recommend the content
• Web 3.0
  – UGC and... Semantic Web... Life streams?
  – Social and Location
  – What is the next big thing?

- 15 -

Web 1.0

• Search engines over prepared and planned content

• Organisations and some users

• SEO was the way to optimise WEB 1.0

• HTML and static content

- 16 -

- 17 -

Web 2.0

• User and Organisation Generated Content
• Social Graphs
• Social Filtering and Social Ranking
• Examples:
  – Social networks: Facebook, Twitter, LinkedIn
  – Shared bookmarks: Digg, Delicious, Reddit, StumbleUpon
  – Social media sharing: Flickr, YouTube
  – Blogs (MSN Spaces, WordPress, Blogger)
  – Even 3D social worlds... Social gaming?

- 18 -

- 19 -

Web 3.0

• Semantic Web
  – Many media types... Integrated for smarter uses
• Rich media integration
• Personalisation to the user context
• Life streaming of content
  – We are integrated into our own entertainment

- 20 -

What is Web 3.0 about?

- 21 -

The Search Landscape

Changing enormously

- 22 -

Continuous Partial Attention

• Be aware of Continuous Partial Attention... a kind of multitasking
• Skimming the surface of the incoming data, picking out the relevant details, and moving on to the next stream.
• Continuous, not episodic
• Cast a wider net, but never give full attention
• So... how does this impact on search?

http://www.wisegeek.com/what-is-continuous-partial-attention.htm

- 23 -

And don’t forget the twitter curve…

http://headrush.typepad.com/creating_passionate_users/2006/12/httpwww37signal.html

- 24 -

Google AdSense

- 25 -

Spamming

• Spamming is a technique based on the manipulation of content in order to affect ranking from search engines
  – Bogus meta tags, hidden text, plan text…
  – Also link spamming…
• Huge SE resources are used in defeating spamming, more than in search quality improvement!
• Getting in the top 10 is essential for businesses
  – 85% of users only look at the top 10.
  – Led to the business of Search Engine Optimisation

- 26 -

Search Engine Ranking

As we all know, simply examining web page content as text is not enough. We need to examine ranking factors, positive and negative.

- 27 -

Positive Ranking Factors : Term Location

• In the TITLE of the page, most important
• In the body of the text, but must MAKE SENSE
• In the Heading text (H1, H2…)
• In the Domain Name
  – Also in page URL
• In ALT attribute and image title
• In BOLD/STRONG tags
• Terms near the top are likely ranked higher than other terms

- 28 -

Positive Ranking Factors : Page Attributes

• Importance of the page in the Website
  – Number of links to it from the same website
• Quality of links to other pages
• Age of a document
  – Older may be more authoritative
    • We will see authorities later!
  – Newer may be better for some queries (e.g. news)
• Amount of text on the page
• Structure of the page
• Frequency of updates
• Spelling and correctness of HTML

- 29 -

SE Ranking + : Website Issues

• Linkage of the Website
  – Global link popularity of the website
    • Like a global PageRank (SiteRank)
  – Relevance of the links into the website
  – Link popularity of the site in a topical community
  – Rate of new inbound links to a website
• Age of a website (older is better)
• Freshness of a website (new pages are better)
• Relevancy of the website (as well as the page)
• Clickthrough rate for the website
• Reputation of the top-level domain
  – E.g. .GOV & .EDU … cannot easily be bought

- 30 -

SE Ranking + : Linkage Issues

• Anchor text of inbound links as a description of the WWW page
  – Also text surrounding the link into the webpage
• Topical relationship between source and target of link
• Link popularity of the page in a topical community
• Age of links
  – The older the better, i.e. long-lasting links
• PageRank of the webpage
  – Google’s PageRank algorithm
• Number of links into a web page

- 31 -

Positive Ranking Factors : Images

• Images on a web page
  – Can provide a chance to express ideas in a visual way that can convey a considerable amount of information
  – Add to the attractiveness and perceived quality of a site.
  – Recent Microsoft patent on “Scoring Relevance of a Document Based on Image Text”
  – Also, remember to name the image properly and give it an alt attribute

- 32 -

Negative Ranking Factors

• Link Farm Participation
  – Tries to artificially increase PageRank
• Proportion of links to or from known spamming sites
• Content that duplicates already indexed content
• Server errors or server down-time
• External links to low-quality content
• Low level of visitors to the website
• Attempts to include hidden text on the page

- 33 -

Using the Ranking Factors…

[Figure labels: User Query; Term Location Factors, Page Factors, Website Factors, Linkage Factors, PageRank Factors, Negative Factors; Result]

The Search Engine ranking process is a closely guarded trade secret of the search engines.

- 34 -

So let’s look in some detail at some of these ranking factors…

Linkage-based Search

- 35 -

The Shape of the WWW

This is based on a study of 200 million web pages. Scale up to WWW scale.

- 36 -

Spidering : finding WWW content

• A Search Engine needs to find WWW content for its index
  – This is done by the spidering software
• Starting from some ‘seed’ WWW pages, the spider software downloads these pages and extracts the links, thereby learning about new pages to crawl.
• WWW-scale crawling means crawling thousands of pages per second

- 37 -

A Basic Crawling Algorithm

• You need to be linked to from the main WWW. Remember the shape!
• Given a set of ‘seed’ URLs (WWW page addresses):
  – Add them to a (priority) queue of URLs
  – While the queue is not empty
    • Take the first URL (u) off the queue
    • Download the WWW page for u
    • Store the URL in a list of seen URLs
    • Index it
    • If u is an HTML page, extract the links (y)
      – For each y, add it to the queue if it has not been visited before
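The crawling loop above can be sketched in a few lines of Python. This is only a minimal illustration, not a production spider: the `fetch` callable, the `TOY_WEB` dictionary and the `example` host names are all hypothetical stand-ins so that the sketch runs without touching the network, and a plain FIFO queue is used where the slide allows a priority queue.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure."""
    queue = deque(seeds)      # URLs waiting to be downloaded
    seen = set(seeds)         # URLs already queued or visited
    index = {}                # stand-in for the real indexer: url -> page text
    while queue and len(index) < max_pages:
        u = queue.popleft()   # take the first URL off the queue
        html = fetch(u)       # download the page for u
        if html is None:
            continue
        index[u] = html       # "index it"
        parser = LinkExtractor()
        parser.feed(html)     # extract the links (y)
        for href in parser.links:
            y = urljoin(u, href)
            if y not in seen:  # only queue pages not seen before
                seen.add(y)
                queue.append(y)
    return index


# Hypothetical four-page web; d.example is never linked to, so it is never found.
TOY_WEB = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": "no links here",
    "http://d.example/": "never linked to",
}
crawled = crawl(["http://a.example/"], TOY_WEB.get)
```

The unreachable page illustrates the first bullet: a page that nothing links to cannot be discovered by following links from the seeds.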

- 38 -

Spiders must behave!

• Most crawlers/spiders will follow some rules:
  – A spider must never request large numbers of documents from the same host sequentially… change the target website as often as is feasible.
  – A spider must never (for whatever reason) repeatedly request the same document. If a document is unavailable, … its position in the queue must be penalized … Repeated failures must be taken into account and the document flagged as unavailable and taken off the queue.
  – A spider must respect authors’ wishes as expressed using the robots exclusion protocol
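One possible way to encode the first two rules is a URL frontier that rotates between hosts and counts failures. The sketch below is an assumption-laden illustration: the class name `PoliteFrontier`, the `max_failures` threshold and the "send to the back of the queue" penalty are all choices made here, not something the slides prescribe.

```python
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """URL queue that avoids hitting the same host twice in a row
    and drops documents that fail repeatedly."""

    def __init__(self, max_failures=3):
        self.queue = deque()
        self.failures = {}        # url -> number of failed downloads
        self.unavailable = set()  # urls flagged and taken off the queue
        self.last_host = None
        self.max_failures = max_failures

    def add(self, url):
        if url not in self.unavailable and url not in self.queue:
            self.queue.append(url)

    def next_url(self):
        """Prefer the first queued URL on a different host from the last one."""
        for i, url in enumerate(self.queue):
            if urlsplit(url).hostname != self.last_host:
                del self.queue[i]
                break
        else:
            if not self.queue:
                return None
            url = self.queue.popleft()  # only same-host URLs remain
        self.last_host = urlsplit(url).hostname
        return url

    def report_failure(self, url):
        """Penalise a failed URL; flag it as unavailable after repeated failures."""
        self.failures[url] = self.failures.get(url, 0) + 1
        if self.failures[url] >= self.max_failures:
            self.unavailable.add(url)
        else:
            self.queue.append(url)  # back of the queue = penalised position


frontier = PoliteFrontier()
for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    frontier.add(u)
order = [frontier.next_url() for _ in range(4)]
```

A real spider would also add per-host delays; the host rotation here only shows the "change the target website as often as is feasible" idea.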

- 39 -

Robots Exclusion

To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

To exclude a single robot
User-agent: BadBot
Disallow: /

To allow a single robot
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

The robots exclusion protocol allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot. Most good robots will process it… BUT it makes a crawler less efficient… more explorative crawling required…
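A crawler written in Python could check these rules with the standard library's `urllib.robotparser` module. The sketch below feeds two of the rule sets above to the parser directly, rather than downloading a real robots.txt; the `www.example.com` URLs and the robot names `AnyBot` and `GoodBot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# "Exclude all robots from part of the server"
partial = build_parser("User-agent: *\nDisallow: /cgi-bin/\nDisallow: /private/\n")

# "Exclude a single robot"
single = build_parser("User-agent: BadBot\nDisallow: /\n")

private_ok = partial.can_fetch("AnyBot", "http://www.example.com/private/x.html")
public_ok = partial.can_fetch("AnyBot", "http://www.example.com/index.html")
badbot_ok = single.can_fetch("BadBot", "http://www.example.com/index.html")
goodbot_ok = single.can_fetch("GoodBot", "http://www.example.com/index.html")
```

In a crawl loop, the check would sit just before the download step, with one parsed robots.txt cached per host.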

- 40 -

Robots.txt example

- 41 -

Another Example

- 42 -

And one more…

- 43 -

Simple Overview

1. Spidering

2. Indexing

3. Ranking

[Figure labels: WWW; View WWW page]

- 44 -

WWWW – the first SE

• WWWW (1994) did not use the content of a page for indexing; it used:
  – Title of the document
  – Text in the URL string
  – Any anchor text from links pointing to the page.
• Based on using the UNIX egrep program to search through disk files.

All SEs now use Linkage Analysis to exploit latent human judgement to improve retrieval performance

This is in addition to using the document content.

- 45 -

Some history… Citation Analysis

Most significant contribution to web search is the technique for how to rank journals based on quality (impact)

Citation indexing… the ‘impact factor’ measurement… based on two elements:
  – the number of citations in the current year to any articles published in the journal over the previous two years.
  – the number of articles published by the journal during these two years.
• Letting j be a journal and IFj be the Impact Factor of journal j, we have:

$$IF_j = \frac{\#\,\text{Citations (last 2 years)}}{\#\,\text{Articles published (last 2 years)}}$$

• This “impact factor” was originally applied to medical journals as a simple method of comparing journals to each other regardless of their size.

- 46 -

Hirsch Index (h-index)

• Citation Analysis is a balance between quality (number of citations) and quantity (number of papers);
• Among scientists, the h-index is becoming popular for measurement … it’s the largest number h such that h published papers each have at least h citations.
  – Alan Smeaton has 250+ papers, about 3,000 citations, and an h-index of 30;
  – Desmond Higgins (UCD) has 29,000 citations (22,500 on one paper), and an h-index of 22;
• Linkage analysis in web topology does something like this, as we’ll see.
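The h-index definition is easy to get subtly wrong, so a small sketch may help. The citation counts below are invented for illustration; they are not the real publication records of anyone named on the slide.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break
    return h


steady = h_index([10, 8, 5, 4, 3])      # four papers with at least 4 citations
one_hit = h_index([22500, 1, 1])        # one huge paper barely moves the index
```

The second call shows why a single very highly cited paper (as in the Higgins example) does not by itself produce a high h-index.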

- 47 -

Linkage Analysis

Linkage Analysis: a method of ranking web sites which is based on the exploitation of latent human judgments mined from the hyperlinks that exist between documents on the WWW.

The first generation of web search engines were effectively TF-IDF or BM25, or equivalent.

And they have addressed the engineering problems of web spidering and efficient searching for large numbers of both users and documents.

Linkage Analysis has been important since the late 90s.

Anecdotally this appears to have improved the precision of retrieval, yet there was little scientific evidence in support of this until recently.

- 49 -

Mining links can tell us that…

• Bibliographic Coupling
  – A and B are similar because they both cite C, D, E
• Co-citation Analysis
  – A and B are similar because they are both cited by C, D, E
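Both measures reduce to counting shared neighbours in the citation (or link) graph. The sketch below assumes the graph is stored as a dictionary from each document to the set of documents it cites; the documents X, Y and Z are hypothetical additions so that co-citation has something to count.

```python
def bibliographic_coupling(cites, a, b):
    """Number of documents cited by both a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def co_citation(cites, a, b):
    """Number of documents that cite both a and b."""
    return sum(1 for targets in cites.values() if a in targets and b in targets)


# doc -> set of documents it cites (toy data)
cites = {
    "A": {"C", "D", "E"},
    "B": {"C", "D", "E"},
    "X": {"A", "B"},
    "Y": {"A", "B"},
    "Z": {"A"},
}
coupling_ab = bibliographic_coupling(cites, "A", "B")   # shared out-links
cocited_ab = co_citation(cites, "A", "B")               # shared in-links
```

On the web, "cites" simply becomes "links to", so the same two functions apply to pages.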

- 50 -

What else can we do with links?

• Count them?
• Distinguish between good and bad ones?
• How we employ them is called Linkage Analysis
  – Linkage-based ranking schemes can be seen to belong to one of two distinct classes:
    • Query-independent schemes
      – A score is assigned to a document once and used for all subsequent queries, i.e. independent of a given query.
      – Fast processing at query time!
    • Query-dependent schemes
      – A linkage score is assigned to a page in the context of a given query.
      – Slower processing at query time!

- 51 -

Assumed Properties of Links

When extracting information for linkage analysis from hyperlinks on the Web, two core properties can be assumed:

  – A link between two documents on the web carries the implication of related content.
  – If different people authored the documents (different domains, therefore off-site links), then the first author found the second document valuable.
    • An author cannot be allowed to influence the linkage score of documents within his/her domain.
      – Off-site links (links between web sites) are more important than links within websites or within documents.

- 52 -

Link Types

in-links to doc F: 5, 8, 9
out-links from doc F: 4, 6, 10
self-links: 2, 11
on-site links: 6, 8, 12
off-site links: 1, 3, 4, 5, 9, 10
on-site in-links to doc F: ?
off-site out-links of doc F: ?
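A rough sketch of how a crawler might label link types, and how the two open questions can be approached as set intersections. Treating "same hostname" as "same site" is an assumption made here (real systems often compare registered domains instead), the URLs are hypothetical, and the link numbers are simply the ones listed on the slide.

```python
from urllib.parse import urldefrag, urlsplit


def link_type(source_url, target_url):
    """Classify a hyperlink as 'self', 'on-site' or 'off-site'."""
    if urldefrag(source_url).url == urldefrag(target_url).url:
        return "self"   # a link from a document back into itself
    same_host = urlsplit(source_url).hostname == urlsplit(target_url).hostname
    return "on-site" if same_host else "off-site"


# Link numbers as listed on the slide
in_links_to_f = {5, 8, 9}
out_links_from_f = {4, 6, 10}
on_site = {6, 8, 12}
off_site = {1, 3, 4, 5, 9, 10}

# Combined categories are intersections of the basic ones
on_site_in_links_to_f = in_links_to_f & on_site
off_site_out_links_of_f = out_links_from_f & off_site
```

Only the off-site sets would normally feed a linkage score, following the assumption that authors should not be able to vote for their own pages.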

- 53 -

Basic Linkage Analysis

Given a linkage graph (below), Page A is a better page than B because…

[Figure: linkage graph over pages A to J]

Off-site links only…

- 54 -

Expanding on this…

However, page B may actually be better…

[Figure: linkage graph; node labels A to J, CNN, Yahoo]

So we use iterative processes… like PageRank or Kleinberg’s

- 55 -

Generating a linkage score

Let n be some web page and Sn be the set of web pages that link into n across off-site links:

$$P_n = |S_n|$$

In this case, the Pn score (Popularity score) is based purely on the in-degree of document n…

Could be the sole source of document ranking given a set of relevant documents (boolean IR) OR could work by integrating normal document retrieval (TF-IDF / BM25 scores) to generate an overall weight. Once again, we let n be some web page and Sn be the set of pages that link into n:

$$Sc'_n = \alpha \cdot Sim(q, n) + \beta \cdot |S_n|$$

where α and β are parameters; this assumes normalisation.
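A minimal sketch of the two scores, under stated assumptions: the weights `alpha=0.7` and `beta=0.3` are arbitrary illustrative values, the normalisation divides the in-degree by the largest in-degree in the collection (one of several possible choices), and the similarity scores stand in for whatever TF-IDF or BM25 would return.

```python
def popularity(in_links, n):
    """P_n: the number of off-site pages linking into n (its in-degree)."""
    return len(in_links.get(n, set()))


def combined_score(sim, in_links, n, alpha=0.7, beta=0.3):
    """Sc'_n = alpha * Sim(q, n) + beta * normalised in-degree of n."""
    max_in = max((len(s) for s in in_links.values()), default=0) or 1
    return alpha * sim[n] + beta * popularity(in_links, n) / max_in


# Toy data: page -> set of off-site pages linking to it
in_links = {"A": {"X", "Y", "Z"}, "B": {"X"}}
# Hypothetical content-similarity scores for one query, already in [0, 1]
sim = {"A": 0.2, "B": 0.6}

score_a = combined_score(sim, in_links, "A")
score_b = combined_score(sim, in_links, "B")
```

With these particular weights the content score still dominates, so B outranks A despite having fewer in-links; raising `beta` shifts the balance towards popularity.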

- 56 -

More simple linkage techniques

• Weighted Citation Ranking
• Spreading Activation & Co-citation Analysis
  – SA: Spreads a score across out-links
  – CA: Passes a score back to hub document

- 57 -

Hubs & Authorities

• A Hub is a document that contains links to many other documents
• An Authority is a document that many documents link to
• A good Hub links to good Authorities
• A good Authority is linked to by good Hubs

[Figure: hub and authority graph; node labels W, X, Y, Z, A, C, D, E, F]

- 58 -

What makes a good Hub…?

What makes a good hub for the query “web browsers”?

[Figure: a Hub page and the browser pages Internet Explorer, Netscape, Opera, Mozilla, Amaya, Firefox, MyBrowser, NeoPlanet]

- 59 -

What Makes a good Authority

What makes a good Authority for the query “web browsers”?

[Figure: many Hub pages and the browser pages Internet Explorer, Netscape, Opera, Mozilla, Amaya, Firefox, MyBrowser, NeoPlanet]

- 60 -

And what makes these authorities good?

Good hubs that themselves link into good authorities… a self-reinforcing relationship!

[Figure: many Hub pages and the browser pages Internet Explorer, Netscape, Opera, Mozilla, Amaya, Firefox, MyBrowser, NeoPlanet]
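The self-reinforcing hub/authority relationship can be computed iteratively, in the style of Kleinberg's algorithm. The sketch below is a simplified version run on a whole toy graph rather than on a query-specific subgraph; the hub names H1 to H3, the fixed iteration count and the Euclidean normalisation are choices made for this illustration.

```python
import math


def hits(out_links, iterations=50):
    """Return (hub, authority) score dictionaries for a link graph."""
    nodes = set(out_links) | {t for ts in out_links.values() for t in ts}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in
        auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to
        hub = {n: sum(auth[t] for t in out_links.get(n, ())) for n in nodes}
        # Normalise so the scores do not grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth


# Toy "web browsers" graph: three hub pages linking to browser home pages
toy_links = {
    "H1": ["Firefox", "Opera"],
    "H2": ["Firefox", "Opera"],
    "H3": ["Firefox"],
}
hub_scores, auth_scores = hits(toy_links)
```

In this toy graph Firefox ends up as the strongest authority because every hub links to it, and H1 and H2 are better hubs than H3 because they link to more good authorities.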

- 61 -

The Influence of Links

• A document’s content can be represented by the anchor text of (all) the in-links into that doc, not by the document itself.
• More in-links means more content, and a better chance of being returned for a query.
• Very simple, but effective!
• Improved by windowing…

[Figure labels: Document, Anchor Text, Doc]
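One way to picture anchor-text indexing with windowing: each target page gets a surrogate "document" built from the anchor words of its in-links, optionally widened by a few words either side of each link. The tuple layout, the `window` parameter and the `dcu.example` URL below are assumptions made for this sketch.

```python
from collections import defaultdict


def anchor_documents(links, window=0):
    """Build a surrogate word list for each link target.

    links: iterable of (target_url, words_before, anchor_words, words_after)
    window: how many surrounding words to keep on each side of the anchor
    """
    docs = defaultdict(list)
    for target, before, anchor, after in links:
        left = before[-window:] if window else []
        right = after[:window]
        docs[target].extend(left + anchor + right)
    return dict(docs)


# Two hypothetical in-links to the same page
toy_in_links = [
    ("http://dcu.example/", ["visit", "the"], ["DCU"], ["computing", "school"]),
    ("http://dcu.example/", [], ["click", "here"], ["for", "university", "info"]),
]
plain = anchor_documents(toy_in_links)
windowed = anchor_documents(toy_in_links, window=1)
```

The second link shows why windowing helps: the anchor "click here" says nothing about the target, while the words next to it can.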

- 62 -

The Importance of Windows

- 63 -

The Importance of Windows

- 64 -

Iterative Linkage Algorithms

PageRank

- 65 -

PageRank

• Query INDEPENDENT score for every document
• An important aspect of Google ranking…? It allocates a PageRank (query-independent importance) score to every document in an index, and this score is used when ranking documents.
• Simple Iterative Algorithm
  – Until convergence
• A simulation of a random user’s behaviour when browsing the web.
  – Equivalent to a user randomly following links, or getting bored and randomly jumping to a random page anywhere on the WWW. In effect it is based on the probability of a user landing on any given page.
• This can be applied to graphs other than the WWW graph… social networks, blog comments?

- 66 -

Key points…

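• The PR of A is divided equally among its out-links
  – [Figure: PRA = 1; A has four out-links, each carrying 1/4]
• The PR of B is equal to the sum of the transferable PR of all its in-links
  – [Figure: W, X, Y and Z link to B, with PRW = 1, PRX = 1, PRY = 1, PRZ = 1; their links carry ¼, ½, ½ and 1]
  – PRB = ¼ + ½ + ½ + 1 = 2¼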

- 67 -

For Example…

The PageRank PRF of document F is equal to PRB divided by the out-degree of B summed with PRD divided by the out-degree of D.

$$PR_F = \frac{PR_B}{2} + \frac{PR_D}{3}$$

- 68 -

The Simplified Technique

1. Calculate a pre-iteration PageRank score for each document

$$\text{for all } n \in N:\quad PR_n = \frac{1}{N}$$

2. Calculate PageRank score for each document

$$PR'_n = c \sum_{m \in S_n} \frac{PR_m}{outdegree_m}$$

3. Store new PageRank scores

$$\text{for all } n \in N:\quad PR_n = PR'_n$$

4. If not converged then go to 2

…assume c = 1
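The four steps translate almost directly into code. The sketch below is an illustration on a hypothetical three-page graph (not the sample graph from the slides); the convergence test on the summed absolute change, the tolerance and the iteration cap are choices made here.

```python
def simple_pagerank(out_links, c=1.0, tol=1e-10, max_iter=1000):
    """Simplified PageRank: no rank source, no special handling of dangling pages."""
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(nodes)
    # Step 1: pre-iteration score, PR_n = 1/N
    pr = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Step 2: each page m shares PR_m equally among its out-links
        new_pr = {node: 0.0 for node in nodes}
        for m, targets in out_links.items():
            if targets:
                share = pr[m] / len(targets)
                for t in targets:
                    new_pr[t] += c * share
        # Step 3: store the new scores; step 4: stop once they no longer change
        delta = sum(abs(new_pr[x] - pr[x]) for x in nodes)
        pr = new_pr
        if delta < tol:
            break
    return pr


# Hypothetical graph in which every page has at least one out-link
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simple_pagerank(toy_graph)
```

Because every page in this toy graph has an out-link, the total rank stays at 1.0 on every iteration; a page with no out-links would leak rank out of the system on each pass.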

- 69 -

A Simple Web Graph

[Figure: web graph over pages A to G]

- 70 -

PageRank – Sample Graph

[Figure: sample graph with the PageRank of each of the 7 pages set to 1]

Total = 7.0

- 71 -

PageRank – after Iteration 1

[Figure: sample graph after iteration 1; page scores 1, 1.5, 1, 1, 0.5, 0.5, 0.5]

Total = 6.0

- 72 -

PageRank – after Iteration 2

[Figure: the sample graph after iteration 2, with document scores of 1.5, 1.5, 0.75, 0.5, 0.5, 0.5 and 0.25.]

Total = 5.5


- 73 -

PageRank – Problem 1 (Dangling Links)

?

?


- 74 -

PageRank - Problem 2 (Rank-Sink)


- 75 -

PageRank – Problem 1 (Dangling Links)

?

?


- 76 -

PageRank – Problem 1 (Dangling Links)

removed


- 77 -

PageRank - Problem 2 (Rank-Sink)


- 78 -

PageRank - Problem 2 (Rank-Sink)

[Figure: seven documents, Doc 1 to Doc 7, each labelled 15%, and a vector over all web pages with an entry of 0.14 for each document.]

Hence if all PageRanks sum to 1.0, then ||E|| = 0.15


- 79 -

The two problems…

• Dangling Links: these are links that point to a page which itself contains no outLinks…
– Docs which the system knows about (and has anchor text descriptions for) but has not downloaded yet.
– Or just docs with no links out…
– The PageRank of the pages these links point to is not redistributed at each iteration, so it is lost from the system…
– SOLUTION: Remove page
– or use Universal Document…

• Rank Sinks: these are two or more pages that have outLinks to each other, but to no other pages. Assuming we have at least one inLink into these pages from a page outside of these pages, then at each iteration rank enters these pages and never exits… accumulates rank…
– SOLUTION: using the E Vector with ||E|| = 0.15 or…
– … the inclusion of a Virtual (Universal) Document
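Both problems can be reproduced with one iteration step of the simplified technique. The sketch below is illustrative only: the helper `step` and the two tiny graphs are hypothetical, and every page starts with a score of 1 as in the sample-graph slides.

```python
def step(pr, out_links):
    """One simplified PageRank iteration with c = 1."""
    new_pr = {d: 0.0 for d in pr}
    for m, targets in out_links.items():
        for n in targets:
            new_pr[n] += pr[m] / len(targets)
    return new_pr


# Problem 1: C has no out-links, so whatever rank reaches C is lost.
dangling = {"A": ["B", "C"], "B": ["C"], "C": []}
pr = {d: 1.0 for d in dangling}
totals = [sum(pr.values())]
for _ in range(3):
    pr = step(pr, dangling)
    totals.append(sum(pr.values()))

# Problem 2: B and C link only to each other; A feeds rank into them.
sink = {"A": ["D", "B"], "D": ["A"], "B": ["C"], "C": ["B"]}
pr = {d: 1.0 for d in sink}
sink_rank = [pr["B"] + pr["C"]]
for _ in range(4):
    pr = step(pr, sink)
    sink_rank.append(pr["B"] + pr["C"])
```

In the first graph the total rank falls at every iteration until nothing is left. In the second the total stays at 4, but the share held by the pair B and C keeps growing at the expense of A and D.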


- 80 -

How to use this Vector?

• This vector has an entry for each document and is used as an indicator of how to distribute any redundant rank back into the system.
– Each document's entry in the vector (E) represents the proportion of rank to be given to that document. E is believed to be uniform, with ||E|| = 0.15 if all PageRanks sum to 1.
– But we can do personalisation… e.g. to focus on Formula 1 pages, increase their weight in E.

• Letting E_n be some vector over the Web pages that corresponds to a source of rank, where c is a constant which is maximised and ||PR|| = 1 (sum of all PageRanks = 1), we have the following formula:

$$PR'_n = c \cdot \sum_{m \in S_n} \frac{PR_m}{outdegree_m} + (1 - c) \cdot E_n$$
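A minimal Python sketch of this formula follows. It rests on assumptions that are not on the slide: E is passed as a distribution that sums to 1 and is scaled by (1 − c), c is set to 0.85 so that the redistributed rank totals 0.15, and the function name `pagerank_with_e`, the fixed iteration count and the example graph are hypothetical.

```python
def pagerank_with_e(out_links, e=None, c=0.85, iterations=50):
    """PR'_n = c * sum(PR_m / outdegree_m) + (1 - c) * E_n."""
    docs = list(out_links)
    if e is None:
        # Uniform E: every document receives the same share.
        e = {d: 1.0 / len(docs) for d in docs}
    pr = {d: 1.0 / len(docs) for d in docs}
    for _ in range(iterations):
        # Start each new score from the document's share of E.
        new_pr = {d: (1.0 - c) * e[d] for d in docs}
        for m, targets in out_links.items():
            for n in targets:
                new_pr[n] += c * pr[m] / len(targets)
        pr = new_pr
    return pr


# B and C form a rank sink; A and D sit outside it.
graph = {"A": ["D", "B"], "D": ["A"], "B": ["C"], "C": ["B"]}
uniform = pagerank_with_e(graph)
# Personalisation: raise the weight of D in E.
personal = pagerank_with_e(graph, e={"A": 0.1, "D": 0.7, "B": 0.1, "C": 0.1})
```

With a uniform E the pages outside the sink keep a non-zero score, and raising the weight of D in E raises the score of D. The example graph has no dangling pages, so the scores continue to sum to 1.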


- 81 -

Alternate Solution!

[Figure: a Universal Document (UD) connected to Doc 1 to Doc 7, with a vector that gives each document a weight of 0.14.]

Probability of a user being bored is now 1/(n+1), where n = number of outlinks… not 0.15
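One way to read the Universal Document idea, sketched here as an assumption rather than as the slide's exact construction: every real document gains one extra out-link to a virtual page UD, and UD links back to every real document with equal weight. A page with n out-links then passes 1/(n+1) of its rank to UD. The names `add_universal_document` and `step` and the example graph are hypothetical.

```python
def add_universal_document(out_links, ud="UD"):
    """Return a copy of the graph with a virtual Universal Document added.

    The name in `ud` must not clash with a real document.
    """
    docs = list(out_links)
    # Every real document gains one extra out-link, to the UD.
    augmented = {d: list(targets) + [ud] for d, targets in out_links.items()}
    # The UD links back to every real document (a uniform vector).
    augmented[ud] = docs
    return augmented


def step(pr, out_links):
    """One simplified PageRank iteration with c = 1."""
    new_pr = {d: 0.0 for d in pr}
    for m, targets in out_links.items():
        for n in targets:
            new_pr[n] += pr[m] / len(targets)
    return new_pr


# C is a dangling page in the original graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": []}
with_ud = add_universal_document(graph)
pr = {d: 1.0 / len(with_ud) for d in with_ud}
for _ in range(20):
    pr = step(pr, with_ud)
total = sum(pr.values())
```

Because every page now has at least one out-link, no rank leaks away: the total stays at 1 and the dangling page C hands all of its rank to UD.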


- 82 -

Personalised PageRank

[Figure: the Universal Document (UD) with a non-uniform vector over Doc 1 to Doc 7; the weights shown are 0.10, 0.05, 0.05, 0.35, 0.25, 0.10 and 0.10.]


- 83 -

Using PageRank…

[Figure: the Query gives ContentScore (n) and the PageRank Array gives PageRankScore (n); the two are combined by "??? Formula ???" into the Final Document Score.]


- 84 -

Kleinberg’s Algorithm

Kleinberg’s algorithm is similar to PageRank, in that it is an iterative algorithm based purely on the linkage of the documents on the web. However it does have some major differences:

• It is executed at query time, and not at indexing time, with the associated hit on performance that accompanies query-time processing.

• Is it used in SE’s… not common!

• It computes two scores per document (hub and authority) as opposed to a single score.

• It is processed on a small subset of ‘relevant’ documents, not all documents as was the case with PageRank.


- 85 -

Recall Hubs and Authorities

HUB Page: a hub page is a page that contains a number of links to pages containing information about some topic, e.g. a resource page containing links to documents on a topic such as ‘Formula 1 motor racing’. Required pages have a hub score representing their quality as a source of links.

AUTHORITY Page: an authority page is one that contains a lot of information about some topic, an ‘authoritative’ page. Consequently, many pages will link to this page, thus giving us a means of identifying it. Required pages also have an authority score representing their quality as perceived by other people.

Documents with high authority scores are expected to contain relevant content, whereas documents with high hub scores are expected to contain links to relevant documents.


- 86 -

HITS Process

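[Figure: a Root Set and an Expanded Set forming a focused subgraph of the WWW, with labels 1 to 4.]

$$Auth_p = \sum Hub_q \quad \text{for all } q \rightarrow p$$

$$Hub_p = \sum Auth_q \quad \text{for all } p \rightarrow q$$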


- 87 -

Hub Scores

$$Hub_p = \sum Auth_q \quad \text{for all } p \rightarrow q$$

[Figure: pages P, X, Y and Z.]

Q contains X, Y and Z


- 88 -

Authority Scores

$$Auth_p = \sum Hub_q \quad \text{for all } q \rightarrow p$$

[Figure: pages P, X, Y and Z.]

Q contains X, Y and Z


- 89 -

Kleinberg’s HITS Technique

• Iteratively calculates Hub & Authority scores
• Begin with all Hub & Authority scores = 1
• 10+ iterations needed until convergence
– Hub scores based on Authority scores of off-site outLink docs.
– Auth scores based on Hub scores of off-site inLink docs.
• Return top X Hubs and/or Authorities
• Once expanded set generated then no further content analysis (Topic Independent).
• Narrow Topic will diffuse to a Broader Topic
– Broad Topic may produce inaccurate results


- 90 -

Kleinberg’s Algorithm

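Hub_n = 1, Auth_n = 1 for n = 1, 2, …, N
loop
    for n = 1, 2, …, N:
        $Auth'_n = \sum_{m \in S_n} Hub_m$
        $Hub'_n = \sum_{o \in T_n} Auth_o$
    Normalise Auth'_n, obtaining Auth_n
    Normalise Hub'_n, obtaining Hub_n
end while (not converged)

Here S_n is the set of pages that link to n and T_n is the set of pages that n links to.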
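The loop above can be sketched in Python as follows. Several choices here are assumptions rather than part of the slides: the function names `hits` and `normalise`, a fixed iteration count in place of a convergence test, Euclidean normalisation, hub scores computed from the freshly updated authority scores, and an `out_links` mapping that already holds only the off-site links of the focused subgraph.

```python
import math


def normalise(scores):
    """Scale the scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {d: (v / norm if norm else 0.0) for d, v in scores.items()}


def hits(out_links, iterations=20):
    docs = list(out_links)
    # Invert the graph once so in-links can be looked up directly.
    in_links = {d: [] for d in docs}
    for p, targets in out_links.items():
        for q in targets:
            in_links[q].append(p)
    # Begin with all Hub and Authority scores = 1.
    hub = {d: 1.0 for d in docs}
    auth = {d: 1.0 for d in docs}
    for _ in range(iterations):
        # Authority of p: sum of the hub scores of pages linking to p.
        auth = normalise({p: sum(hub[q] for q in in_links[p]) for p in docs})
        # Hub of p: sum of the authority scores of pages p links to.
        hub = normalise({p: sum(auth[q] for q in out_links[p]) for p in docs})
    return hub, auth


# Hypothetical focused subgraph: two link pages and two content pages.
graph = {"H1": ["A1", "A2"], "H2": ["A1"], "A1": [], "A2": []}
hub, auth = hits(graph)
top_authorities = sorted(auth, key=auth.get, reverse=True)[:2]
```

In this example A1 is linked from both hub pages and so ends up as the top authority, while H1 links to both authorities and so has the higher hub score.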


- 91 -

Wrapping up SEs

• SEs now provide more than just searching and are “portals”: consumer-oriented gateways to web resources, with editorially controlled links to what search engines, or their paying clients, believe you may be interested in.

• Search engines are “for profit” ventures, not charities…
– Some sell their indexes
– Mostly advertising

• 10% to 15% of queries to the major search engines are on adult themes.

• Offer lots of extras including: media search, identification of names, Amazon links, related searches listing, page translation, language-specific search…
– then there is photo management, email, music…


- 92 -

Final thoughts

• Sub 1 second querying is essential
– No time for interesting algorithms, Q&A, Manual Query Expansion, …

• Belief is that searchers are happy with sub-optimal results as long as there is no delay in getting them.

• No industry standard benchmark for evaluation.