- 1 -
Web Search Algorithms
- 2 -
Why web search in this module?
• WWW is the delivery platform and the interface
• How do we find information and services on the web? … we try to generate a URL that seems sensible
– Dell Computers – www.dell.ie, Ford Ireland – www.ford.ie. But products?
• GPS Devices – www.gps.ie is not OK
• Or, we use a Search Engine
– So we rely on Search Engines - we even use them to look up spellings and as a calculator!
• Search Engines bring people to a website
– For most, such as Google, the ranking algorithm is closely guarded, wholesome, true, uncorrupted, and not paid
• advertisements are merely sold based on similarity to query keywords.
• This leads to the industry of Search Engine Optimisation (SEO) ... the “Google Dance”
- 3 -
Text IR - Google as example
• Google has been operational since 1998
– Two PhD students from Stanford
• ?? Billion documents
– Early search engines competed on the size of their index, related to how powerful their infrastructure was. Not an issue now.
– Stopped advertising its index size after 8,168,684,336 pages in Aug 2005
– Size now, effectively unknown
• Also has ??? billion images
– not all unique images
– Flickr had about 2B (Nov 2007); Facebook had 4.1B at that time
- 4 -
Searching or Marketing?
• However, Search Engines must make a profit!
– Advertisement Sales
– Marketing
– Paid Listings
– And selling their indexes
• A lot of Search Engines are also marketing companies…
– This is at odds with the idea that a search engine is a page you visit on the way elsewhere.
• The less time you spend there the better!
• But, many people ‘pass through the doors’, so they sell query-focussed advertisements
– You can estimate this by looking at the main page of the search engine.
- 5 -
How do SEs help user searches
• It is known that we search for
– people / home pages.
– companies / company HPs (or guess from URLs).
– a particular product or service.
– a fact, buried in one or more documents, any one of which will do…
– a document, an entire document, with text/image, and nothing smaller will do.
– an overview on a broad or narrow topic
– Media Search
• an MPEG-4 file.
• Through image databases.
• Through a (digital) video library, and/or through a video.
• If the SE knows the type of query, then ranking can be tailored to that query, because different search types can be satisfied by different search algorithms.
- 6 -
Search Engines
• Originally SEs were web directories
– Manually generated (e.g. Yahoo!)
• Then automatic crawler-based Search Engines developed
– The web got big and manual categorisation was becoming too difficult (e.g. Lycos)
– Today the large SEs index over ?? billion web pages.
– The first crawler-based SE was the WWWW in 1994
- 7 -
Architecture of a Search Engine
- 8 -
My Google!
- 9 -
Bing
- 10 -
Facebook? Is it a Search Engine?
- 11 -
Facebook Social Graph
[Diagram: a user’s social graph clusters into College Friends, Friends, A Previous Class, and the IR Research Community.]
- 12 -
- 13 -
The Landscape is changing
- 14 -
Web 1.0 Web 3.0
• Web 1.0
– Static content... Companies created content
– We were consumers
• Web 2.0
– User generated content
– Communities and creators... We create, filter, recommend the content
• Web 3.0
– UGC and... Semantic Web... Life streams?
– Social and Location
– What is the next big thing?
- 15 -
Web 1.0
• Search engines over prepared and planned content
• Organisations and some users
• SEO was the way to optimise Web 1.0
• HTML and static content
- 16 -
- 17 -
Web 2.0
• User and Organisation Generated Content
• Social Graphs
• Social Filtering and Social Ranking
• Examples:
– Social networks: facebook, twitter, linkedin
– Shared bookmarks: digg, delicious, reddit, stumbleupon
– Social media sharing: flickr, youtube
– Blogs (MSN space, wordpress, blogger)
– Even 3D social worlds... Social gaming?
- 18 -
- 19 -
Web 3.0
• Semantic Web
– Many media types... Integrated for smarter uses
• Rich media integration
• Personalisation to the user context
• Life streaming of content
– We are integrated into our own entertainment
- 20 -
What is Web 3.0 about?
- 21 -
The Search Landscape
Changing enormously
- 22 -
Continuous Partial Attention
• Be aware of Continuous Partial Attention... a kind of multitasking
• Skimming the surface of the incoming data, picking out the relevant details, and moving on to the next stream.
• Continuous, not episodic
• Cast a wider net, but never full attention
• So.. How does this impact on search?
http://www.wisegeek.com/what-is-continuous-partial-attention.htm
- 23 -
And don’t forget the twitter curve…
http://headrush.typepad.com/creating_passionate_users/2006/12/httpwww37signal.html
- 24 -
Google AdSense
- 25 -
Spamming
• Spamming is a technique based on the manipulation of content in order to affect ranking from search engines
– Bogus meta tags, hidden text, plain text…
– Also link spamming…
• Huge SE resources are used in defeating spamming - more than in search quality improvement!
• Getting into the top-10 is essential for businesses
– 85% of users only look at the top 10.
– This led to the business of Search Engine Optimisation
- 26 -
Search Engine Ranking
As we all know, simply examining web page content as text is not enough.. We need to examine ranking factors.. positive and negative.
- 27 -
Positive Ranking Factors : Term Location
• In the TITLE of the page, most important
• In the body of the text, but it must MAKE SENSE
• In the heading text (H1, H2…)
• In the domain name
– Also in the page URL
• In the ALT tag and image title
• In BOLD/STRONG tags
• Terms near the top are likely ranked higher than other terms
- 28 -
Positive Ranking Factors : Page Attributes
• Importance of the page in the website
– Number of links to it from the same website
• Quality of links to other pages
• Age of a document
– Older may be more authoritative
• We will see authorities later!
– Newer may be better for some queries (e.g. news)
• Amount of text on the page
• Structure of the page
• Frequency of updates
• Spelling and correctness of HTML
- 29 -
SE Ranking + : Website Issues
• Linkage of the website
– Global link popularity of the website
• Like a global PageRank (SiteRank)
– Relevance of the links into the website
– Link popularity of the site in a topical community
– Rate of new inbound links to a website
• Age of a website (older is better)
• Freshness of a website (new pages are better)
• Relevancy of the website (as well as the page)
• Clickthrough rate for the website
• Reputation of the top-level domain
– E.g. .GOV & .EDU … cannot easily be bought
- 30 -
SE Ranking + : Linkage Issues
• Anchor text of inbound links as a description of the WWW page
– Also text surrounding the link into the webpage
• Topical relationship between the source and target of a link
• Link popularity of the page in a topical community
• Age of links
– The older the better, i.e. long-lasting links
• PageRank of the webpage
– Google’s PageRank algorithm
• Number of links into a web page
- 31 -
Positive Ranking Factors : Images
• Images on a web page
– Can provide a chance to express ideas in a visual way that can convey a considerable amount of information
– Add to the attractiveness and perceived quality of a site.
– Recent Microsoft patent on “Scoring Relevance of a Document Based on Image Text”
– Also.. Remember to name the image properly and include an ALT element
- 32 -
Negative Ranking Factors
• Link Farm participation
– Trying to artificially increase PageRank
• Proportion of links to or from known spamming sites
• Duplicate content of already-indexed content
• Server errors or server down-time
• External links to low-quality content
• Low level of visitors to the website
• Trying to include hidden text on the page
- 33 -
Using the Ranking Factors…
[Diagram: a User Query is combined with Term Location Factors, Page Factors, Website Factors, Linkage/PageRank Factors and Negative Factors to produce the Result.]

The Search Engine ranking process is a closely guarded trade secret of the search engines.
- 34 -
So let’s look in some detail at some of these ranking factors…
Linkage-based Search
- 35 -
The Shape of the WWW
This is based on a study of 200 million web pages. Scale up to WWW scale.
- 36 -
Spidering : finding WWW content
• A Search Engine needs to find WWW content for its index
– This is done by the spidering software
• Starting from some ‘seed’ WWW pages, the spider software downloads these pages and extracts the links, thereby learning about new pages to crawl.
• WWW-scale crawling means crawling thousands of pages per second
- 37 -
A Basic Crawling Algorithm
• You need to be linked to from the main WWW.. Remember the shape!
• Given a set of ‘seed’ URLs (WWW page addresses):
– Add them to a (priority) queue of URLs
– While the queue is not empty (!empty)
• Take the first URL (u) off the queue
• Download the WWW page for u
• Store the URL in a list of seen URLs
• Index it
• If u is an HTML page, extract the links (y)
– For each y, add it to the queue if it has not been visited before (see the sketch below)
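A minimal sketch of this queue-driven crawl in Python. The seed URL, the regex-based link extraction and the index() stub are illustrative assumptions, not part of the original slides:

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seeds, max_pages=100):
    queue = deque(seeds)            # the (priority) queue of URLs
    seen = set(seeds)               # the list of seen URLs
    while queue and len(seen) <= max_pages:
        u = queue.popleft()         # take the first URL (u) off the queue
        try:
            html = urlopen(u).read().decode("utf-8", errors="ignore")
        except OSError:
            continue                # skip unreachable pages
        index(u, html)              # index it (stub below)
        for href in re.findall(r'href="([^"]+)"', html):
            y = urljoin(u, href)    # extract the links (y)
            if y not in seen:       # enqueue only unvisited pages
                seen.add(y)
                queue.append(y)

def index(url, html):
    print("indexed", url)           # placeholder for the real indexing step

crawl(["http://example.com/"])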
- 38 -
Spiders must behave!
• Most crawlers/spiders will follow some rules:
– A spider must never request large numbers of documents from the same host sequentially… change the target website as often as is feasible.
– A spider must never (for whatever reason) repeatedly request the same document. If a document is unavailable, … its position in the queue must be penalised … Repeated failures must be taken into account and the document flagged as unavailable and taken off the queue.
– A spider must respect authors’ wishes as expressed using the robots exclusion protocol
- 39 -
Robots Exclusion
To exclude all robots from the entire server:
User-agent: *
Disallow: /

To allow all robots complete access:
User-agent: *
Disallow:

To exclude all robots from part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

To exclude a single robot:
User-agent: BadBot
Disallow: /

To allow a single robot (and exclude all others):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

This allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot. Most good robots will process it… BUT it makes a crawler less efficient… more explorative crawling is required… (see the sketch below)
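A crawler can honour the protocol with Python’s standard urllib.robotparser; the robots.txt URL and page paths below are purely illustrative:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()                                    # fetch and parse robots.txt
# Check permission before downloading a page
print(rp.can_fetch("WebCrawler", "http://example.com/private/page.html"))
print(rp.can_fetch("*", "http://example.com/index.html"))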
- 40 -
Robots.txt example
- 41 -
Another Example
- 42 -
And one more…
- 43 -
Simple Overview
[Diagram: the WWW feeds a pipeline of 1. Spidering, 2. Indexing, 3. Ranking, after which the user views the WWW page.]
- 44 -
WWWW – the first SE
• WWWW (1994) did not use the content of a page for indexing; it used:
– The title of the document
– Text in the URL string
– Any anchor text from links pointing to the page.
• Based on using the UNIX egrep program to search through disk files.

All SEs now use Linkage Analysis to exploit latent human judgement to improve retrieval performance.
This is in addition to using the document content.
- 45 -
Some history… Citation Analysis
The most significant contribution to web search is the technique for how to rank journals based on quality (impact).

Citation indexing… the ‘impact factor’ measurement… is based on two elements:
– the number of citations in the current year to any articles published in the journal over the previous two years.
– the number of articles published by the journal during these two years.
• Letting j be a journal and IF_j be the Impact Factor of journal j, we have:

$$IF_j = \frac{\#\,\text{Citations (last 2 years)}}{\#\,\text{Articles Published (last 2 years)}}$$

• This “impact factor” was originally applied to medical journals as a simple method of comparing journals to each other regardless of their size.
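As a worked example (the numbers are invented for illustration): a journal that published 120 articles over the previous two years and received 300 citations to them in the current year has

$$IF_j = \frac{300}{120} = 2.5$$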
- 46 -
Hirsch Index (h-index)
• Citation Analysis is a balance between quality (number of citations) and quantity (number of papers);
• Among scientists, the h-index is becoming popular for measurement … it’s the number of published papers which each have a number of citations greater than or equal to that number.
– Alan Smeaton has 250+ papers, about 3,000 citations, and an h-index of 30;
– Desmond Higgins (UCD) has 29,000 citations (22,500 on one paper), and an h-index of 22;
• Linkage analysis in web topology does something like this, as we’ll see. (A sketch of the h-index computation follows.)
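A minimal sketch of the h-index computation in Python; the citation counts are invented:

def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank          # the paper at this rank still clears the bar
        else:
            break
    return h

# Five papers with these citation counts give an h-index of 4
print(h_index([10, 8, 5, 4, 3]))    # -> 4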
- 47 -
Linkage Analysis
Linkage Analysis: a method of ranking web sites which is based on the exploitation of latent human judgements mined from the hyperlinks that exist between documents on the WWW.

The first generation of web search engines were effectively TF-IDF or BM25, or equivalent.

And they had addressed the engineering problems of web spidering and efficient searching for large numbers of both users and documents.

Linkage Analysis has been important since the late 90s.

Anecdotally this appears to have improved the precision of retrieval, yet there was little scientific evidence in support of this until recently.
- 48 -
Origin : Citation Analysis
How to rank journals based on quality (impact).

Citation indexing… the ‘impact factor’ measurement… is based on two elements:
– the number of citations in the current year to any articles published in the journal over the previous two years.
– the number of articles published by the journal during these two years.
• Letting j be a journal and IF_j be the Impact Factor of journal j, we have:

$$IF_j = \frac{\#\,\text{Citations (last 2 years)}}{\#\,\text{Articles Published (last 2 years)}}$$

• This “impact factor” was originally applied to medical journals as a simple method of comparing journals to each other regardless of their size.
- 49 -
Mining links can tell us that…
• Bibliographic Coupling
– A and B are similar because they both cite C, D, E
• Co-citation Analysis
– A and B are similar because they are both cited by C, D, E
- 50 -
What else can we do with links?
• Count them?
• Distinguish between good and bad ones?
• How we employ them is called Linkage Analysis
– Linkage-based ranking schemes can be seen to belong to one of two distinct classes:
• Query-independent schemes
– A score is assigned to a document once and used for all subsequent queries.
» independent of a given query.
– Fast processing at query time!
• Query-dependent schemes
– A linkage score is assigned to a page in the context of a given query.
– Slower processing at query time!
- 51 -
Assumed Properties of Links
When extracting information for linkage analysis from hyperlinks on the Web, two core properties can be assumed:
– A link between two documents on the web carries the implication of related content.
– If different people authored the documents (different domains, therefore off-site links), then the first author found the second document valuable.
• An author cannot be allowed to influence the linkage score of documents within his/her domain.
– Off-site links (links between web sites) are more important than links within websites or within documents.
- 52 -
Link Types
in-link to doc F: 5, 8, 9
out-link from doc F: 4, 6, 10
self-links: 2, 11
on-site links: 6, 8, 12
off-site links: 1, 3, 4, 5, 9, 10
on-site in-links to doc F: ?
off-site out-links of doc F: ?
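The on-site/off-site distinction is just a hostname comparison; a small sketch in Python (the URLs are invented):

from urllib.parse import urlparse

def is_off_site(source_url, target_url):
    # An off-site link crosses a site boundary, i.e. the hosts differ
    return urlparse(source_url).netloc != urlparse(target_url).netloc

print(is_off_site("http://dcu.ie/a.html", "http://dcu.ie/b.html"))  # False: on-site
print(is_off_site("http://dcu.ie/a.html", "http://cnn.com/"))       # True: off-site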
- 53 -
Basic Linkage Analysis
Given a linkage graph [diagram: pages A–J], Page A is a better page than B because…

Off-site links only…
- 54 -
Expanding on this…
However, page B may actually be better… [diagram: the same pages A–J, but now B’s in-links come from CNN and Yahoo]

So we use iterative processes… like PageRank or Kleinberg’s
- 55 -
Generating a linkage score
Let n be some web page and $S_n$ be the set of web pages that link into n across off-site links:

$$P_n = |S_n|$$

In this case, the $P_n$ score (Popularity score) is based purely on the in-degree of document n…

This could be the sole source of document ranking given a set of relevant documents (boolean IR), OR it could work by integrating normal document retrieval (TF-IDF / BM25 scores) to generate an overall weight. Once again, we let n be some web page and $S_n$ be the set of pages that link into n:

$$Sc'_n = \alpha \cdot Sim(q, n) + \beta \cdot |S_n|$$

where $\alpha$ and $\beta$ are parameters, and both components are assumed normalised. (See the sketch below.)
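A minimal sketch of that combined score, assuming a content-similarity value and an in-link count are already available; the function name, weights and normalisation cap are illustrative:

def combined_score(sim_q_n, in_degree, alpha=0.7, beta=0.3, max_in_degree=1000):
    # Normalise the in-degree to [0, 1] before mixing it with similarity
    popularity = min(in_degree, max_in_degree) / max_in_degree
    return alpha * sim_q_n + beta * popularity

# e.g. a page with TF-IDF similarity 0.62 and 140 off-site in-links
print(combined_score(0.62, 140))    # -> 0.476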
- 56 -
More simple linkage techniques
• Weighted Citation Ranking
• Spreading Activation & Co-citation Analysis
– SA: spreads a score across outlinks
– CA: passes a score back to the hub document
- 57 -
Hubs & Authorities
• A Hub is a document that contains links to many other documents
• An Authority is a document that many documents link to
• A good Hub links to good Authorities
• A good Authority is linked to by good Hubs

[Diagram: a link graph over pages W, X, Y, Z, A, C, D, E and F illustrating hubs and authorities.]
- 58 -
What makes a good Hub…?
What makes a good hub for the query “web browsers”?

[Diagram: a hub page linking out to Internet Explorer, Netscape, Opera, Mozilla, Amaya, Firefox, NeoPlanet and MyBrowser pages.]
- 59 -
What Makes a good Authority
What makes a good Authority for the query “web browsers”?

[Diagram: many hub pages all linking to the same browser pages — Internet Explorer, Netscape, Opera, Mozilla, Amaya, Firefox, NeoPlanet, MyBrowser.]
- 60 -
And What makes these authoritiesgood?
Good hubs that themselves link into good authorities… a self-reinforcing relationship!

[Diagram: the same hubs and browser authorities, mutually reinforcing one another.]
- 61 -
The Influence of Links
• A document’s content can be represented by the anchor text of (all) the in-links into that doc, not by the document itself.
• More in-links means more content, and a better chance of getting returned for a query.
• Very simple, but effective!
• Improved by windowing… (see the sketch below)

[Diagram: anchor text in a linking document becomes part of the representation of the target doc.]
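A rough sketch of collecting anchor text with a character window around each link; the regexes are a simplification of real HTML parsing, and the window size is an arbitrary assumption:

import re

def anchor_windows(html, window=60):
    """Yield (target URL, anchor text, surrounding text window) per link."""
    for m in re.finditer(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', html, re.I | re.S):
        start = max(0, m.start() - window)
        end = min(len(html), m.end() + window)
        # Strip tags from the window so only the nearby words survive
        context = re.sub(r"<[^>]+>", " ", html[start:end])
        yield m.group(1), m.group(2).strip(), " ".join(context.split())

page = '<p>Reviews of browsers: try <a href="http://opera.com">Opera</a> today.</p>'
for url, anchor, ctx in anchor_windows(page):
    print(url, "|", anchor, "|", ctx)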
- 62 -
The Importance of Windows
- 63 -
The Importance of Windows
- 64 -
Iterative Linkage Algorithms
PageRank
- 65 -
PageRank
• A query-INDEPENDENT score for every document
• An important aspect of Google ranking…? It allocates a PageRank (query-independent importance) score to every document in an index, and this score is used when ranking documents.
• Simple iterative algorithm
– Until convergence
• A simulation of a random user’s behaviour when browsing the web.
– Equivalent to a user randomly following links, or getting bored and jumping to a random page anywhere on the WWW. In effect it is based on the probability of a user landing on any given page.
• This can be applied to graphs other than the WWW graph… social networks, blog comments?
- 66 -
Key points…
[Diagram: node A with $PR_A = 1$ and four out-links — the PR of A is divided equally among its out-links, ¼ to each. Node B with in-links from W, X, Y and Z (each with PR = 1) — the PR of B is equal to the sum of the transferable PR of all its in-links: $PR_B = ¼ + ½ + ½ + 1 = 2¼$.]
- 67 -
For Example…
The PageRank $PR_F$ of document F is equal to $PR_B$ divided by the out-degree of B, summed with $PR_D$ divided by the out-degree of D (here 2 and 3 respectively):

$$PR_F = \frac{PR_B}{2} + \frac{PR_D}{3}$$
- 68 -
The Simplified Technique
1. Calculate a pre-iteration PageRank score for each document:
   for all n in N, $PR_n = \frac{1}{N}$

2. Calculate a PageRank score for each document:
   $$PR'_n = c \cdot \sum_{m \in S_n} \frac{PR_m}{outdegree(m)}$$

3. Store the new PageRank scores:
   for all n in N, $PR_n = PR'_n$

4. If not converged then goto 2

…assume c = 1 (see the sketch below)
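A minimal sketch of this simplified iteration in Python; the toy graph is invented, and a fixed iteration count stands in for the convergence test:

def simplified_pagerank(links, iterations=20, c=1.0):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    pr = {n: 1.0 / len(pages) for n in pages}      # step 1: PR_n = 1/N
    for _ in range(iterations):                    # step 4: repeat
        new_pr = {n: 0.0 for n in pages}
        for m, outs in links.items():              # step 2: m passes on
            for n in outs:                         #   PR_m / outdegree(m)
                new_pr[n] += c * pr[m] / len(outs)
        pr = new_pr                                # step 3: store new scores
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(simplified_pagerank(graph))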
- 69 -
A Simple Web Graph
[Diagram: a simple web graph of seven pages, A–G.]
- 70 -
PageRank – Sample Graph
[Diagram: every one of the seven pages starts with a PageRank of 1.]

Total = 7.0
- 71 -
PageRank – after Iteration 1
[Diagram: after iteration 1 the scores are 1, 1.5, 1, 1, 0.5, 0.5, 0.5.]

Total = 6.0
- 72 -
PageRank – after Iteration 2
[Diagram: after iteration 2 the scores are 0.75, 1.5, 1.5, 0.5, 0.25, 0.5, 0.5.]

Total = 5.5 — notice the total keeps shrinking.
- 73 -
PageRank – Problem 1 (Dangling Links)
[Diagram: a page in the graph has no out-links — where should its rank go?]
- 74 -
PageRank - Problem 2 (Rank-Sink)
- 75 -
PageRank – Problem 1 (Dangling Links)
[Diagram: the same dangling page, its outgoing rank still unaccounted for.]
- 76 -
PageRank – Problem 1 (Dangling Links)
[Diagram: the dangling page has been removed from the graph.]
- 77 -
PageRank - Problem 2 (Rank-Sink)
- 78 -
PageRank - Problem 2 (Rank-Sink)
[Diagram: Docs 1–7; each is labelled 15% and has an entry of 0.14 in a vector over all Web pages.]

Hence if all PageRanks sum to 1.0, then ||E|| = 0.15
- 79 -
The two problems…
• Dangling Links: these are links that point to a page which itself contains no outlinks…
• Docs which the system knows about (and has anchor text descriptions for) but has not downloaded yet.
• Or just docs with no links out…
– If the PageRank of the web pages associated with the target of these links is not redistributed, then at each iteration it is lost from the system…
– SOLUTION: remove the page
• or use a Universal Document…
• Rank Sinks: these are two or more pages that have outlinks to each other, but to no other pages. Assuming we have at least one inlink into these pages from a page outside of them, then at each iteration rank enters these pages and never exits… they accumulate rank…
– SOLUTION: use the E Vector with ||E|| = 0.15 or…
– … the inclusion of a Virtual (Universal) Document
- 80 -
How to use this Vector?
• This vector has an entry for each document and is used as an indicator of how to distribute any redundant rank back into the system.
– Each document’s entry in the vector (E) represents the proportion of rank to be given to that document; it is typically uniform, with ||E|| = 0.15 if the sum of all PageRanks is 1.
– But we can do personalisation… e.g. to focus on Formula 1 pages, increase their weight in E.
• Letting $E_n$ be some vector over the Web pages that corresponds to a source of rank, c a constant which is maximised, and ||PR|| = 1 (the sum of all PageRanks = 1), we have the following formula:

$$PR'_n = c \cdot \sum_{m \in S_n} \frac{PR_m}{outdegree(m)} + (1 - c) \cdot E_n$$
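A minimal sketch of the iteration with the E vector folded in; the uniform E, the value c = 0.85 and the toy graph are illustrative assumptions:

def pagerank_with_e(links, c=0.85, iterations=50):
    """links maps each page to the pages it links out to."""
    pages = list(links)
    e = {p: 1.0 / len(pages) for p in pages}   # uniform E; personalise by reweighting
    pr = dict(e)                               # start from the E distribution
    for _ in range(iterations):
        new_pr = {p: (1.0 - c) * e[p] for p in pages}   # rank redistributed via E
        for m, outs in links.items():
            for n in outs:
                new_pr[n] += c * pr[m] / len(outs)
        pr = new_pr
    return pr

# C and D form a rank sink, but the E vector keeps rank flowing back out
graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
print(pagerank_with_e(graph))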
- 81 -
Alternate Solution!
[Diagram: a Universal Document (UD) is added and linked to and from every document (Docs 1–7, uniform vector entries of 0.14).]

The probability of a user being bored is now 1/(n+1), where n = the number of outlinks… not 0.15
- 82 -
Personalised PageRank
[Diagram: the same Universal Document construction, but with a personalised, non-uniform vector over Docs 1–7 — entries of 0.10, 0.05, 0.05, 0.35, 0.25, 0.10 and 0.10.]
- 83 -
Using PageRank…
[Diagram: at query time, a Content Score(n) and a PageRank Score(n) (looked up in the precomputed PageRank array) are combined by some formula — ??? — to give the Final Document Score.]
- 84 -
Kleinberg’s Algorithm
Kleinberg’s algorithm is similar to PageRank, in that it is an iterative algorithm based purely on the linkage of the documents on the web. However it does have some major differences:

• It is executed at query time, and not at indexing time, with the associated hit on performance that accompanies query-time processing.
• Is it used in SEs? … not commonly!
• It computes two scores per document (hub and authority) as opposed to a single score.
• It is processed on a small subset of ‘relevant’ documents, not all documents as was the case with PageRank.
- 85 -
Recall Hubs and Authorities
HUB Page: a hub page is a page that contains a number of links to pages containing information about some topic, e.g. a resource page containing links to documents on a topic such as ‘Formula 1 motor racing’. Such pages have a hub score representing their quality as a source of links.

AUTHORITY Page: an authority page is one that contains a lot of information about some topic, an ‘authoritative’ page. Consequently, many pages will link to this page, thus giving us a means of identifying it. Such pages also have an authority score representing their perceived quality by other people.

Documents with high authority scores are expected to contain relevant content, whereas documents with high hub scores are expected to contain links to relevant documents.
- 86 -
HITS Process
[Diagram: (1) a root set of pages is retrieved for the query, then (2–4) expanded by following links in and out to form the expanded set — a focused subgraph of the WWW.]

$$Auth_p = \sum_{q \to p} Hub_q$$

$$Hub_p = \sum_{p \to q} Auth_q$$
- 87 -
Hub Scores
$$Hub_p = \sum_{p \to q} Auth_q$$

[Diagram: page P links out to X, Y and Z (the set Q contains X, Y and Z); P’s hub score is the sum of their authority scores.]
- 88 -
Authority Scores
$$Auth_p = \sum_{q \to p} Hub_q$$

[Diagram: pages X, Y and Z link into page P (the set Q contains X, Y and Z); P’s authority score is the sum of their hub scores.]
- 89 -
Kleinberg’s HITS Technique
• Iteratively calculates Hub & Authority scores
• Begin with all Hub & Authority scores = 1
• 10+ iterations needed until convergence
– Hub scores based on the Authority scores of off-site outlink docs.
– Auth scores based on the Hub scores of off-site inlink docs.
• Return the top X Hubs and/or Authorities
• Once the expanded set is generated there is no further content analysis (Topic Independent).
• A narrow topic will diffuse to a broader topic
– A broad topic may produce inaccurate results
- 90 -
Kleinberg’s Algorithm
for all n: Hub_n = 1, Auth_n = 1
while not converged loop
    for n = 1, 2, …, N:
        Auth_n = Σ_{m ∈ S_n} Hub_m      (S_n : the pages linking into n)
        Hub_n = Σ_{m ∈ T_n} Auth_m      (T_n : the pages n links out to)
    Normalise Auth, obtaining Auth'
    Normalise Hub, obtaining Hub'
end (while not converged)

(see the sketch below)
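A minimal sketch of the HITS iteration in Python over a toy focused subgraph; the graph, the iteration count and the fact that the off-site restriction is ignored are all simplifying assumptions:

def hits(links, iterations=20):
    """links maps each page to the pages it links out to."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Auth_p = sum of the hub scores of the pages linking into p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, []))
                for p in pages}
        # Hub_p = sum of the authority scores of the pages p links out to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise both score vectors so they do not grow without bound
        for vec in (auth, hub):
            norm = sum(v * v for v in vec.values()) ** 0.5 or 1.0
            for p in vec:
                vec[p] /= norm
    return hub, auth

graph = {"hub1": ["ie", "opera"], "hub2": ["ie", "mozilla"]}
hubs, auths = hits(graph)
print(max(auths, key=auths.get))    # -> "ie", the strongest authority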
- 91 -
Wrapping up SEs
• SEs now provide more than just searching and are “portals” - consumer-oriented gateways to web resources, with editorially controlled links to what search engines, or their paying clients, believe you may be interested in.
• Search engines are “for profit” ventures, not charities…
– Some sell their indexes
– Mostly advertising
• 10% to 15% of queries to the major search engines are on adult themes.
• They offer lots of extras including: media search, identification of names, Amazon links, related-searches listing, page translation, language-specific search…
– then there is photo management, email, music…
- 92 -
Final thoughts
• Sub-1-second querying is essential
– No time for interesting algorithms, Q&A, Manual Query Expansion, …
• The belief is that searchers are happy with sub-optimal results as long as there is no delay in getting them.
• There is no industry-standard benchmark for evaluation.