PageRank - Information Retrieval (ir.cis.udel.edu/~carteret/cisc689/slides/lecture20.pdf)



5/4/09


PageRank

CISC489/689-010, Lecture #20, Wednesday, April 29th

Ben Carterette

Web Search

•  Problem:
    – Web search engines easily hacked
    – I want to sell something; I’ll just add a few popular keywords to my page over and over and over again
    – All the retrieval models we’ve discussed will score that page higher for those keywords
•  Other problems:
    – No hackers, but top-ranked pages are coming from deep within a site, or from pages that change often, or pages about very obscure topics
    – Not really useful


Possible Solution

•  Leverage link structure
•  Maybe if many pages are linking to a page, that page is more “important”
•  Idea:
    – Count the number of inlinks to the page
    – Assign it “importance” based on that number
•  Any problem with this?

Hacking Link Counts

•  I can just make a bunch of pages that link to my spam page
•  Inlink count will be high even though my page is not important
•  Better idea:
    – Recursively use the importance of the linking pages when calculating the importance of the page


PageRank

•  Google’s PageRank is probably the best known algorithm
•  Intuitive idea: “random surfer” model
    – If you start on a random page on the internet and just start clicking links randomly,
    – What is the probability you will land on page u?
    – If one page has a higher landing probability, the pages it links to have higher landing probabilities as well
    – Higher probability = more authority = better PageRank

Illustration


PageRank Definition

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

$R(u)$ is the PageRank of $u$

$B_u$ is the set of pages that link to $u$

$v$ is one of the pages in $B_u$

$R(v)$ is the PageRank of $v$

$N_v$ is the number of pages that $v$ links to
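As a rough illustration of the definition above, the sketch below applies the update once to a small made-up graph. The graph `links`, the function name `pagerank_step`, and the choice c = 1 are assumptions for the example, not part of the lecture.

```python
def pagerank_step(links, ranks, c=1.0):
    """Apply R(u) = c * sum over v in B_u of R(v) / N_v once."""
    new_ranks = {u: 0.0 for u in links}
    for v, outlinks in links.items():
        if not outlinks:
            continue  # a page with no outlinks passes nothing on
        share = ranks[v] / len(outlinks)  # R(v) / N_v
        for u in outlinks:
            new_ranks[u] += c * share
    return new_ranks

# Hypothetical toy graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 / len(links) for page in links}
step1 = pagerank_step(links, ranks)
```

Page C ends up with the largest value after one step because it receives a share from both A and B.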

Sinks

•  Problem:
    – I have two pages that only link to each other, plus one page that links to one of them
    – When the “random surfer” gets to one of those pages, he will just keep alternating between them
    – Their PageRank will dominate everything else
•  Solution: once in a while the random surfer just starts over at a new page
    – OK, but how do I put that in PageRank?
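The sink problem can be reproduced on the three-page situation described above. This is only a sketch: the page names X, Y, Z and the helper `propagate` are hypothetical, and the update has no restart term.

```python
def propagate(links, ranks):
    """One update without restarts: each page splits its rank over its outlinks."""
    new_ranks = {u: 0.0 for u in links}
    for v, outlinks in links.items():
        for u in outlinks:
            new_ranks[u] += ranks[v] / len(outlinks)
    return new_ranks

# X and Y link only to each other; Z links to X
links = {"X": ["Y"], "Y": ["X"], "Z": ["X"]}
ranks = {page: 1.0 / 3 for page in links}
for _ in range(10):
    ranks = propagate(links, ranks)
```

After the first update Z holds no rank at all, and the whole mass keeps bouncing between X and Y.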


Random Restarts

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$

$E(u)$ is the probability that a random surfer jumps to $u$

Calculating PageRank

•  A simple iterative algorithm:
    – First, assign a PageRank to every page
        •  E.g. $R_0(u) = 1/N$
    – Initialize $E(u) = \alpha$ for all $u$ ($0.15/N$ in original paper)
    – Over iterations $i = 1, \dots$, do
        •  Update each PageRank as: $R_{i+1}(u) = \sum_{v \in B_u} \frac{R_i(v)}{N_v}$
        •  Calculate $d$ as the sum of PageRanks from the previous iteration minus the sum of PageRanks from the current iteration
        •  Update each PageRank as: $R_{i+1}(u) = R_{i+1}(u) + d\,E(u)$
        •  Calculate $\delta$ as the sum of |PageRanks from the current iteration minus PageRanks from the previous iteration|
        •  If $\delta < \epsilon$, PageRanks have converged
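A minimal sketch of the iterative procedure, under stated assumptions. It is not a line-by-line transcription of the slide: it uses the commonly used damped variant, in which propagated rank is scaled by 1 − α and the jump distribution is normalized to sum to 1, so that the redistributed mass d keeps the total at 1. The function name `pagerank`, the parameter defaults, and the toy graph are hypothetical.

```python
def pagerank(links, alpha=0.15, eps=1e-10, max_iter=1000):
    pages = list(links)
    n = len(pages)
    ranks = {u: 1.0 / n for u in pages}   # R_0(u) = 1/N
    jump = {u: 1.0 / n for u in pages}    # uniform E, normalized (assumption)
    for _ in range(max_iter):
        # Propagate rank along links, scaled by (1 - alpha)
        new = {u: 0.0 for u in pages}
        for v, outlinks in links.items():
            for u in outlinks:
                new[u] += (1.0 - alpha) * ranks[v] / len(outlinks)
        # d = previous total minus current total (rank mass not passed on)
        d = sum(ranks.values()) - sum(new.values())
        for u in pages:
            new[u] += d * jump[u]
        # delta = sum of absolute changes between iterations
        delta = sum(abs(new[u] - ranks[u]) for u in pages)
        ranks = new
        if delta < eps:   # converged
            break
    return ranks

# Same shape as the sink example: X and Y link to each other, Z links to X
links = {"X": ["Y"], "Y": ["X"], "Z": ["X"]}
ranks = pagerank(links)
```

With restarts in place, Z keeps a small nonzero rank instead of dropping to zero, and the ranks still sum to 1.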


Scalability

•  Calculating PageRank requires a vector for every URL
•  Along with a list of the URLs that link to that URL and a list of URLs that page links to
•  Space usage is $O(N^2)$
•  Time complexity also $O(N^2)$
•  Even for small collections (like Wikipedia), it is nearly impossible to keep all of this in memory

Calculating PageRank with MapReduce

•  First: extract links
    – For every page, I need to know all the pages that link to it
    – MapReduce solution:
        •  Map operator takes page u and outputs (v, u) for every URL v in page u
        •  Reduce operator takes all tuples with key v and reduces them to a list of unique URLs $B_v$ that link to v
            – (v, (u1, u2, u3, …))
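The link-extraction step can be simulated in plain Python without a MapReduce framework. This is a sketch only: `map_links`, `reduce_links`, and the toy `pages` dictionary are hypothetical names, and the grouping by key that a real framework performs between the two phases is done here inside `reduce_links`.

```python
from collections import defaultdict

def map_links(page, outlinks):
    """Map: for page u, emit (v, u) for every URL v found in u."""
    return [(v, page) for v in outlinks]

def reduce_links(pairs):
    """Reduce: collect, for each key v, the unique pages u that link to it."""
    inlinks = defaultdict(set)
    for v, u in pairs:
        inlinks[v].add(u)
    return {v: sorted(us) for v, us in inlinks.items()}

# Hypothetical input: page -> URLs found in it (A links to B twice)
pages = {"A": ["B", "C", "B"], "B": ["C"], "C": ["A"]}
pairs = [pair for u, outs in pages.items() for pair in map_links(u, outs)]
inlinks = reduce_links(pairs)
```

The duplicate link from A to B is collapsed, so each inlink list holds unique URLs only.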


Calculating PageRank with MapReduce

•  Next: calculate PageRank iteratively
    – First, initialize all PageRanks to 1/N
    – Then iteratively MapReduce
        •  Map operator takes a page $u_i$ and the URLs $v_j$ ($j = 1..n$) that it links to, and outputs $(v_j, R(u_i)/n)$
        •  Reduce operator takes pairs $(v_j, R(u_i)/n)$ and outputs

$$\left( v_j,\ (1 - d) + d \sum_{i=1}^{m} \frac{R(u_i)}{n} \right)$$

        •  Calculate delta and determine whether converged
        •  If not, MapReduce again
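One way to sketch the iterative step in plain Python follows. The names `map_rank`, `reduce_rank`, and `mapreduce_iteration` are hypothetical, d = 0.85 is an assumed damping value (the slide does not give one), and the extra `(u, 0.0)` pair emitted by the mapper is an assumption that keeps pages without inlinks in the output.

```python
from collections import defaultdict

def map_rank(u, rank, outlinks):
    """Map: emit (v_j, R(u)/n) for each of the n pages that u links to."""
    yield (u, 0.0)  # assumption: keeps pages with no inlinks alive
    for v in outlinks:
        yield (v, rank / len(outlinks))

def reduce_rank(v, contributions, d=0.85):
    """Reduce: output (v, (1 - d) + d * sum of contributions)."""
    return v, (1.0 - d) + d * sum(contributions)

def mapreduce_iteration(links, ranks, d=0.85):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for u, outlinks in links.items():
        for v, share in map_rank(u, ranks[u], outlinks):
            grouped[v].append(share)
    return dict(reduce_rank(v, shares, d) for v, shares in grouped.items())

# Hypothetical toy graph
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {u: 1.0 / len(links) for u in links}  # initialize to 1/N
while True:
    new_ranks = mapreduce_iteration(links, ranks)
    delta = sum(abs(new_ranks[u] - ranks[u]) for u in links)
    ranks = new_ranks
    if delta < 1e-9:  # converged; otherwise MapReduce again
        break
```

With the (1 − d) term used as written, the converged values sum to the number of pages rather than to 1.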

PageRank for IR

•  PageRank is query-independent
    – The importance of a page is not related to any query
    – We cannot simply rank pages by PageRank
•  PageRank can be used to re-rank results that have been retrieved for a query
•  It can also be used as a feature in the ranking function
•  Or as a weight on anchor text features


Example Use of PageRank

From “The PageRank Citation Ranking: Bringing Order to the Web”, Page et al.

Wikipedia PageRanks

Page Title        PageRank
United States     2.9 × 10^-3
France            1.3 × 10^-3
United Kingdom    1.2 × 10^-3
England           1.0 × 10^-3
Germany           1.0 × 10^-3
Canada            0.9 × 10^-3
2007              0.8 × 10^-3
World War II      0.8 × 10^-3
Australia         0.7 × 10^-3
2008              0.7 × 10^-3

Based on links between Wikipedia pages


A Bit of Theory

•  Markov chain:
    – N states with transition probability matrix P
    – At any time we are in exactly one state
    – $P_{ij}$ indicates probability of moving from state i to state j
    – For all i, $\sum_{j=1}^{N} P_{ij} = 1$

A Bit of Theory

•  Ergodic Markov chains
    – Ergodic means there is a path between any two states
    – No matter what state you start in, the probability of being in any other state after T steps is greater than zero (after a burn-in time $T_0$)
    – Over many steps T, each state has some “visit rate”: starting from any state, we will visit each state according to its visit rate
    – This is the steady state for the chain


A Bit of Theory

•  Let $x_0$ be a vector representing our current state: 1 in position i, 0s everywhere else

$$x_0 = \begin{pmatrix} 0 & 0 & 0 & \cdots & 1 & \cdots & 0 & 0 \end{pmatrix}$$

•  What is the probability of each possible state we can transition to from $x_0$?

$$x_1 = x_0 P$$

•  And the probability of each possible state from $x_1$ (two steps from $x_0$)?

$$x_2 = x_1 P = (x_0 P)P = x_0 P^2$$

A Bit of Theory

•  After k steps, $x_k = x_0 P^k$
•  As k goes to infinity, $x_k$ converges to the steady state
•  When $x_k$ is the steady state, $x_k P = x_k$
•  The steady state is an eigenvector of P
    – As it turns out, it is the principal eigenvector
    – Eigenvalue = 1
•  If P is a matrix of links between documents, the principal eigenvector holds the PageRanks
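The convergence of $x_k = x_0 P^k$ can be checked numerically on a small chain. The 3-state matrix below is a made-up ergodic example and `step` is a hypothetical helper; no linear algebra library is assumed.

```python
def step(x, P):
    """One transition: returns the row vector x P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical ergodic chain; every row sums to 1
P = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]

x = [1.0, 0.0, 0.0]  # x_0: 1 in position 0, 0s everywhere else
for _ in range(200):
    x = step(x, P)   # x_k = x_{k-1} P
steady = x
```

Starting from a different state gives the same `steady` vector, and multiplying it by P once more leaves it unchanged, which is the eigenvector property with eigenvalue 1.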


PageRank Modifications

•  The E(u) quantities solve the sink problem, but can also be used to adjust PageRanks
•  Usually, E(u) assigned uniformly
    – Equal probability to jump to any page
•  Instead, bias to certain pages
    – One possibility: assign one page E(u) = α, all other pages 0
    – For example, Yahoo home page
    – Then when the “random surfer” restarts, she always restarts at the same place
    – Result: Yahoo gets highest PageRank, followed by pages Yahoo links to
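The effect of a biased restart can be sketched by passing the jump distribution in as a parameter. Everything named here is an assumption for illustration: the function `pagerank_with_restart`, the page names, and the convention that `jump` sums to 1 and is mixed in with weight α = 0.15. The toy graph has no pages without outlinks.

```python
def pagerank_with_restart(links, jump, alpha=0.15, iterations=100):
    """PageRank where restarts land on page u with probability jump[u]."""
    pages = list(links)
    ranks = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        # Restart term: alpha times the jump probability of each page
        new = {u: alpha * jump.get(u, 0.0) for u in pages}
        # Link-following term, scaled by (1 - alpha)
        for v, outlinks in links.items():
            for u in outlinks:
                new[u] += (1.0 - alpha) * ranks[v] / len(outlinks)
        ranks = new
    return ranks

# Hypothetical graph: nothing links to "home"
links = {"home": ["a", "b"], "a": ["b"], "b": ["c"], "c": ["a"]}
uniform = pagerank_with_restart(links, {u: 0.25 for u in links})
biased = pagerank_with_restart(links, {"home": 1.0})  # always restart at home
```

Because no page links to `home`, its rank comes entirely from restarts, so sending every restart there raises its rank from α/4 to α.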

Topic-Based PageRank

•  A different kind of random surfer:
    – First picks a category randomly
    – Then jumps to a page randomly within that category
•  Instead of calculating a single PageRank for each page, calculate M PageRanks, one for each category
    – Category PageRank = PageRank among other pages in the same category


Topic-Based PageRank for Personalization

•  Each individual user is more interested in some categories than others
•  Calculate the probability that a user is interested in a category based on the frequency they visit pages in that category
    – E.g. sports = 0.7, finance = 0.2, health = 0.1, all others = 0
•  Then the personalized PageRank for a page u is the weighted sum of category PageRanks for that page

$$R(u) = \sum_{i=1}^{M} p_i R_i(u)$$

$p_i$ is the user’s category i probability

$R_i(u)$ is the PageRank of u for category i
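The weighted sum is small enough to show directly. The interest probabilities are the ones from the example above; the per-category PageRank values for the page and the function name `personalized_rank` are invented for illustration.

```python
def personalized_rank(category_ranks, interests):
    """R(u) = sum over categories i of p_i * R_i(u)."""
    return sum(p * category_ranks.get(cat, 0.0) for cat, p in interests.items())

# User interest probabilities p_i (all other categories are 0)
interests = {"sports": 0.7, "finance": 0.2, "health": 0.1}

# Hypothetical category PageRanks R_i(u) for a single page u
category_ranks = {"sports": 0.004, "finance": 0.001, "health": 0.002}

score = personalized_rank(category_ranks, interests)
```

A category in which the user has no interest contributes nothing, however high the page ranks within it.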

Hyperlink-Induced Topic Search (HITS)

•  Another link-graph algorithm
•  Idea:
    – Some pages are authoritative: they are very informative about a topic
    – Other pages are hubs for a topic: they link to a lot of pages on the topic
•  Example:
    – CiteSeer links to a lot of computer science research papers; it’s a computer science research hub
    – The papers it links to are computer science research authorities
•  Find both hubs and authorities


Hubs and Authorities

•  Hubs are pages that link to a lot of authorities
•  Authorities are pages that are linked to by a lot of hubs
•  Another recursive definition

HITS Algorithm

•  First, get a root set of pages
    – Pages that match a query, for example
•  From the root set, construct a base set
    – Pages that link to the root set and pages that the root set link to

From “Authoritative Sources in a Hyperlinked Environment”, J. Kleinberg


HITS Algorithm

•  Initialize “hub score” h(u) = 1 and “authority score” a(u) = 1 for each page u in the base set
•  Then iteratively update h(u) and a(u) for all u:

$$h(u) = \sum_{v \in F_u} a(v) \qquad a(u) = \sum_{v \in B_u} h(v)$$

$F_u$ = set of pages that u links to

$B_u$ = set of pages that link to u

•  After each iteration, divide h(u) and a(u) by some constant
•  After only a few iterations, scores converge
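A minimal sketch of these updates on a made-up base set. Two choices here are assumptions: the "constant" used for scaling is taken to be the Euclidean norm of the score vector, and within each iteration a(u) is computed from the freshly updated h values. The function names and page names are hypothetical.

```python
import math

def normalize(scores):
    """Divide all scores by their Euclidean norm (one choice of constant)."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {u: s / norm for u, s in scores.items()}

def hits(links, iterations=20):
    pages = set(links) | {v for outs in links.values() for v in outs}
    inlinks = {u: [] for u in pages}  # B_u for every page
    for v, outs in links.items():
        for u in outs:
            inlinks[u].append(v)
    hub = {u: 1.0 for u in pages}
    auth = {u: 1.0 for u in pages}
    for _ in range(iterations):
        # h(u) = sum of a(v) over pages v in F_u (pages u links to)
        hub = normalize({u: sum(auth[v] for v in links.get(u, [])) for u in pages})
        # a(u) = sum of h(v) over pages v in B_u (pages linking to u)
        auth = normalize({u: sum(hub[v] for v in inlinks[u]) for u in pages})
    return hub, auth

# Hypothetical base set: two hub-like pages pointing at three papers
links = {"hub1": ["p1", "p2"], "hub2": ["p1", "p2", "p3"]}
hub, auth = hits(links)
```

In this graph `hub2` ends up as the stronger hub, and `p1` and `p2`, which both hubs link to, get higher authority scores than `p3`.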

HITS Example Results

From http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/Images/cars1.png


Matrix Form

•  P is the transition matrix
•  PP' is a sort of “similarity” matrix in terms of links to other pages
    – Entry i,j is higher if pages i and j link to the same pages
•  P'P is a sort of “similarity” matrix in terms of links from other pages
    – Entry i,j is higher if pages i and j are linked to from the same pages

Matrix Form

•  We can write the hub score as a matrix-vector product:
    – h = Pa (transition matrix times authority score)
•  We can write authority score as
    – a = P'h (transpose of the transition matrix times hub score)
•  Substituting, we get
    – h = PP'h
    – a = P'Pa


Matrix Eigenvectors

•  If h = PP'h, then h is an eigenvector of the “outlink similarity matrix”
    – Hub scores are a basis vector of a space defined by outlinks
•  If a = P'Pa, then a is an eigenvector of the “inlink similarity matrix”
    – Authority scores are a basis vector of a space defined by inlinks
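The eigenvector claim for hub scores can be checked numerically. In this sketch P is assumed to be a 0/1 link matrix (P[i][j] = 1 if page i links to page j) for a made-up five-page graph, scores are rescaled to unit length after each multiplication, and the helper names are hypothetical. With rescaling, the relation holds up to a constant: PP'h = λh.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def normalize(x):
    norm = sum(v * v for v in x) ** 0.5
    return [v / norm for v in x]

# Pages 0 and 1 act as hubs; pages 2, 3, 4 receive links
P = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
Pt = [list(col) for col in zip(*P)]  # P'
outlink_sim = matmul(P, Pt)          # PP': entry i,j counts shared outlinks

h = [1.0] * len(P)
for _ in range(50):
    h = normalize(matvec(outlink_sim, h))  # repeated h <- PP'h, rescaled

scaled = matvec(outlink_sim, h)
eigenvalue = scaled[0] / h[0]        # lambda in PP'h = lambda * h
```

Entry (0, 1) of `outlink_sim` is 2 because pages 0 and 1 share two outlinks, and the converged `h` is the eigenvector belonging to the largest eigenvalue of PP'.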