relating web characteristics with link-based ranking
Relating Web Characteristics
Ricardo Baeza-Yates
Carlos Castillo
Universidad de Chile
Agenda
• Introduction
• Link-based ranking
• Web structure
• Web characteristics
• Web usage
• Web dynamics
• Conclusions
Introduction: Sample
• Web sample: .CL domain in the year 2000
• 670,000 pages in 7,500 domains
• 15 KB average page size
• Collection from the TodoCL web search engine
Introduction: Emphasis
• Broder et al.: Graph Structure in the Web (2000)
  – Page-based structure based on strongly connected components
  – The Web graph is not a random graph
  – Process: cut & paste model
• Ours is mostly a site-based analysis
  – Trying to make Web structure meaningful
Introduction: The Empire
Introduction: One Map
Link ranking: Pagerank
$$\mathrm{Pagerank}(p) = \frac{q}{N} + (1 - q) \sum_{i=1}^{k} \mathrm{Pagerank}(r_i)$$

• $r_1, \dots, r_k$: pages that point to page $p$
• $q/N$: probability of a random jump over number of pages
• Currently used by Google (Brin & Page, 1998)
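A minimal sketch of how the Pagerank formula can be evaluated by repeated iteration. It assumes the standard formulation, in which each linking page divides its score by its number of out-links (a term that is not visible in the formula as transcribed on this slide) and pages without out-links spread their score uniformly. The function name `pagerank`, the toy graph, and the value q = 0.15 are illustrative assumptions, not taken from the talk.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iteratively compute Pagerank.

    links maps each page to the list of pages it points to.
    q is the probability of a random jump (assumed value).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts with the random-jump share q / N.
        new_rank = {p: q / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Assumption: a page splits its score evenly among its out-links.
                share = (1 - q) * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its score uniformly.
                share = (1 - q) * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
```

In this toy graph page `c` collects links from three pages and ends up with the highest score, while `d`, which nobody links to, keeps only the random-jump share q / N.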
Link ranking: Hubs & Authorities
• HITS algorithm (Kleinberg, 1998)
• A good authority is a page pointed to by good hubs, so we assume that it has good content
• A good hub is a page that points to good authorities, so we assume it is a good set of links
• Linear system calculated by numerical iteration
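The mutual definition of hubs and authorities can be sketched as a short numerical iteration. This is a minimal illustration that assumes the usual scheme of alternating updates with Euclidean normalization after each step; the function name `hits`, the iteration count, and the toy graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries.

    links maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: three link pages pointing to two content pages.
toy_web = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(toy_web)
```

Here `a1` is pointed to by all three link pages and becomes the best authority, while `h1` and `h2`, which point to both content pages, score higher as hubs than `h3`.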
Link ranking: Distribution
• 9% with relevant hub score
• 2-3% with relevant authority score
• <2% with relevant Pagerank
Link ranking: Correlation
Hub score, authority score and Pagerank do not seem to be correlated
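The slide does not say which correlation measure was used. As one possibility, the sketch below computes Pearson's correlation coefficient between two lists of per-page scores; a value near 0 would be consistent with the observation above. The function name `pearson` and the sample values are hypothetical and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant list has no defined correlation; report 0 by convention here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four pages, chosen so the two measures are unrelated.
hub_scores = [1.0, 2.0, 3.0, 4.0]
pagerank_scores = [1.0, -1.0, -1.0, 1.0]
uncorrelated = pearson(hub_scores, pagerank_scores)
```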
Link ranking: Sites
• Which measure to use for sites?
• Average score
  – But good sites can have lots of bad pages
• Maximum score
  – But one good page cannot be all that is needed to be a good site
• Sum of the scores of all pages
  – Natural for Pagerank
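The three candidate site measures can be compared side by side with a small aggregation sketch. It assumes that a site is identified by the host name of each page URL; the function name `site_scores`, the example URLs, and the score values are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_scores(page_scores):
    """Aggregate per-page scores into average, maximum and sum per site."""
    by_site = defaultdict(list)
    for url, score in page_scores.items():
        # Assumption: the host name of the URL identifies the site.
        by_site[urlparse(url).netloc].append(score)
    return {
        site: {
            "average": sum(scores) / len(scores),
            "maximum": max(scores),
            "sum": sum(scores),
        }
        for site, scores in by_site.items()
    }


# Hypothetical page scores.
pages = {
    "http://a.example.cl/index.html": 0.5,
    "http://a.example.cl/old.html": 0.25,
    "http://b.example.cl/": 0.125,
}
result = site_scores(pages)
```

With these values the weaker page on `a.example.cl` lowers the site's average to 0.375, while the sum (0.75) still credits the site for both pages.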
Link ranking: Sites Graph
90% relevant site-Pagerank
It’s harder to have a good hub than a good authority (site)
Web Structure: Basis
• The Web graph has structure:
[Figure: Web graph components IN, MAIN, OUT and ISLANDS]
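A rough sketch of how a link graph could be split into these components. It assumes that MAIN is the largest strongly connected component, IN is everything that can reach MAIN, OUT is everything reachable from MAIN, and all remaining nodes are lumped together as ISLANDS, which is a simplification. The function names and the toy graph are hypothetical, and the per-node search is only suitable for small graphs.

```python
from collections import deque


def reachable(start, adjacency):
    """Set of nodes reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(links):
    """Split a non-empty directed graph into MAIN, IN, OUT and ISLANDS."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    reverse = {}
    for src, targets in links.items():
        for t in targets:
            reverse.setdefault(t, []).append(src)
    # MAIN: the largest strongly connected component, i.e. the largest set of
    # nodes that can all reach each other.
    main = set()
    for node in sorted(nodes):
        if node in main:
            continue
        component = reachable(node, links) & reachable(node, reverse)
        if len(component) > len(main):
            main = component
    seed = next(iter(main))
    out = reachable(seed, links) - main      # reachable from MAIN
    inn = reachable(seed, reverse) - main    # can reach MAIN
    islands = nodes - main - out - inn       # simplification: everything else
    return {"MAIN": main, "IN": inn, "OUT": out, "ISLANDS": islands}


# Hypothetical six-node graph.
toy_web = {
    "in1": ["m1"],
    "m1": ["m2"],
    "m2": ["m1", "out1"],
    "isl1": ["isl2"],
}
parts = bow_tie(toy_web)
```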
Web Structure: Basis (cont.)
• The MAIN component has structure:
[Figure: the MAIN component subdivided into MAIN-MAIN, MAIN-IN, MAIN-OUT and MAIN-NORM, shown with IN and OUT]
Web Structure: Sketch
Web Structure: Degree
Web Structure: Sizes
Web Structure: Preferences
Web Structure: Preferences
[Figure: panels Real, ODP and TodoCL; component labels OUT, MAIN-MAIN and MAIN-OUT]
Web Structure: Various
Web Structure: Link Scores
Web Dynamics: Ages
• The kernel of the Web comes from the past
Web Dynamics: By Component
Web Dynamics: Pagerank
Pagerank is biased against newer pages
Web Dynamics: Hubs & Authorities
[Figure: authority score and hub score versus age (months)]
Conclusions
• Pagerank/HITS do not seem to be correlated
  – And Pagerank is biased toward older pages
• Site ranking can help to make good human-selected directories
• Finding good pages is not so simple
• Characterizing Web structure gives valuable insight
  – Web Graph Mining is just starting