ranking pages and the topology of the web

Upload: rockefellaz

Post on 06-Apr-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/3/2019 Ranking Pages and the Topology of the Web

    1/27

    arXiv:110

    5.1595v1[cs.DM]9May2011

    Ranking pages and the topology of the web

    Argimiro ArratiaDepartament de Llenguatges i Sistemes Informatics,

    Universitat Politecnica de Catalunya, Barcelona, Spain

    [email protected]

    Carlos Marijuan

    Departamento de Matematica Aplicada

    Universidad de Valladolid, Valladolid, Spain,

    [email protected]

    Abstract

    This paper presents our studies on the rearrangement of links from the struc-ture of websites for the purpose of improving the valuation of a page or group ofpages as established by a ranking function as Googles PageRank. We build ourtopological taxonomy starting from unidirectional and bidirectional rooted trees,and up to more complex hierarchical structures as cyclical rooted trees (obtainedby closing cycles on bidirectional trees) and PRdigraph rooted trees (digraphswhose condensation digraph is a rooted tree that behave like cyclical rooted trees).We give different modifications on the structure of these trees and its effect on thevaluation given by the PageRank function. We derive closed formulas for thePageRank of the root of various types of trees, and establish a hierarchy of thesetopologies in terms of PageRank. We show that the PageRank of the root of cycli-

    cal and PRdigraph trees basically depends on the number of vertices per leveland the number of cycles of distinct lengths among levels, and we give a closedvector formula to compute PageRank.

    Keywords: PageRank, world wide web topology, link structure.AMS Math. Subject Classification: 05C05, 05C99, 05C40, 68R10, 94C15.

    1 Introduction

    Google is still todays most popular search engine for the World Wide Web, and thekey to its success has been its PageRank algorithm [4], which ranks documents basedprimarily on the link structure of the web. Simply put, PageRank considers a link from

    a page H to another page J as a weighted vote from H in favour of the importanceof J, where the weight of the vote of H is itself determined by the number of links(or voters) to H. Therefore, part of the game of the electronic business today is tofind ways of lifting a pages link popularity, and specifically the PageRank, by eitherobtaining the vote of a very important page (which is unlikely) or manufacturing a

    Research partially supported by Spanish Government MICINN under projects: SESAAME(TIN2008-06582-C03-02) and SINGACOM (MTM2007-64007)

    Supported by Spanish Government MICINN project SINGACOM (MTM2007-64007)

    1

    http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1http://arxiv.org/abs/1105.1595v1
  • 8/3/2019 Ranking Pages and the Topology of the Web

    2/27

    large set of pages that would be willing to link to a clients page. For the lattersolution, known as link farms in the jargon of the Search Engine Optimisation (SEO)community, much care must be taken since it is widely believed that Google had tunedup its original PageRank algorithm to detect fictitious linking and similar forms ofspamming (e.g. the 2003 Florida update, see [9]).

    At the heart of the challenge of improving a page PageRank value is the roleplayed by the topology of the web. This is a widely recognised fact as there can befound in the internet many SEO analyses of link patterns, together with tips on howto rearrange these to raise the PageRank of specific pages. On the theoretical side,having acknowledged that the World Wide Web should be treated as a directed graph,there are various publications that propose different graph decompositions on regularpatterns, as a way to improve PageRank computation (e.g. [2], [11]), and news thatsuggests that newly acquired technology by Google, in the hope to enhance PageRank,is based on localisation of the computations on certain tree structures underlying theWeb ([10], [13]).

    Motivated by these graph combinatoric challenges particular to the Web, we havestudied the PageRank formula from a mathematical perspective, and its relation with

    the web sites topology, with the twofold goal of accelerating the computation of Page-Rank and maximising its value for an specific page or set of pages. We summarise hereall our findings starting from (unidirectional) rooted trees and up to more complexhierarchical structures. Ultimately, our academic goals are to disclose some of the graphcombinatorics underlying the World Wide Web and this popular ranking function, andto contribute to the mathematical foundation of many heuristics and ad hoc rulesin used by the SEO community in its attempt to tweak the valuations assigned byPageRank.

    2 Some preliminaries on Graph Theory

    In this paper we will use some standard concepts and results about directed graphs,which we detail in this section in order to fix our notation.

    By a digraph D we mean a pair D = (V, A) where V is a finite nonempty setand A V V \ {(v, v) : v V}. Elements in V and A are called vertices andarcs respectively. For an arc (u, v) we will say that u is adjacent to v, and we maysometimes also use uv to denote an arc (u, v). The order and the size of D are,respectively, Card(V) and Card(A). Ifv is a vertex, the in-degree, id(v), of v is thenumber of arcs (u, v) in A. Similarly, the out-degree, od(v), of v is the number ofarcs (v, u) in A.

    A sequence of vertices v1v2 . . . vq, q 2, such that (vi, vi+1) A for i = 1, 2, . . . , q1is a walk of length q 1 joining v1 with vq or more simply a v1vq walk. If the

    vertices of v1v2 . . . vq are distinct the walk is called a path. A cycle of length q or aq-cycle is a path v1v2 . . . vq closed by the arc vqv1. A digraph is acyclic if it has nocycle. By a semipath joining v1 with vq we mean a sequence of distinct verticesv1v2 . . . vq, q 2, such that (vi, vi+1) A or (vi+1, vi) A for i = 1, 2, . . . , q 1.

    A digraph is connected if for each pair u and v of distinct vertices, there is asemipath joining u with v. By a subdigraph of the digraph (V, A) we mean a digraph(W, B) such that W V and B A. The subdigraph is called a partial digraphwhen W = V. The induced subdigraph by the digraph (V, A) on W V is the

    2

  • 8/3/2019 Ranking Pages and the Topology of the Web

    3/27

    digraph (W, A/W) where A/W = A (W W).For an acyclic digraph there exists at least one vertex v (resp. u) such that od(v) = 0

    (resp. id(u) = 0). Such vertex will be called a maximal (resp. minimal) in thedigraph. Moreover, the vertices in an acyclic digraph (V, A) can be distributed bylevels N0, N1, . . . , where N0 = {v V : v is maximal in (V, A)} and, recursively for

    p > 0,

    Np = {v V \

    p1i=0

    Ni : v is maximal in the induced subdigraph on V \

    p1i=0

    Ni}.

    Thus one has a partition of V, V = N0 N1 Nh, h being the height of thedigraph, i.e. the last index such that Nh = .

    3 Short Introduction on PageRank

    The mathematical view of the World Wide Web is as a digraph W = (V, A), where a

    vertex represents any document posted on the web (a page), and an arc (b, a) indicatesthat there is a link from page b to page a. In this setting, Brin and Page proposed in[4] to evaluate each page in the Web with a positive real number, which they namedits PageRank, given by the formula (in its refined version from [5]):

    P(a) =1

    N+

    (b,a)A

    P(b)

    o(b)(1)

    where P(a) is the PageRank of page a, o(b) is the number of links going out of pageb (the out-degree of b), is a constant that can take any real value in the interval(0, 1) (although Brin and Page always prefer to set it to 0.85), N is the total numberof pages of the Web, and the sum is taken over all pages b that have a link to a. The

    motivation, given by the authors, is that formula (1) models the behaviour of a randomsurfer of the Web who, being at a certain page b, either follows one of the links shownin that page with probability , or jumps to any other page with probability 1 ,disregarding the contents of the pages. The probability of choosing a link in b thattakes him to page a depends on the number o(b) of links out of b; so P(b)/o(b) is thecontribution ofb to the PageRank ofa amortised by . In this setting, the PageRank ofa is the probability of a user reaching page a directly or after following all appropriatelinks, and the sum of the PageRank of all the pages is 1, and so, forms a probabilitydistribution over the Web (see [3] and [12]).

    Yet another view of PageRank is the analytical formulation given by Brinkmeier(see [7]), who conceived the PageRank function as a power series. In this setting, a

    formula is given that highlights the fact that the ranking of a vertex v, as assigned byPageRank, depends on the weighted contributions of each vertex in every walk thatleads into v, being these contributions higher in value for vertices that are nearer indistance from v.

    For a given walk = v1v2 . . . vn in the graph (V, A), define

    D() =1

    o(v1)o(v2) o(vn1)

    3

  • 8/3/2019 Ranking Pages and the Topology of the Web

    4/27

    Then, for any vertex a V, we have

    P(a) =1

    N

    wV

    :w

    a

    l()D() (2)

    where : w

    a denotes a walk starting at vertex w and ending in vertex a, andl() is the length of this walk .

    4 Ranking vertices on trees

    Our starting case study is the set of rooted trees, where a tree with root is an acyclicdigraph for which there exists a maximal vertex r (the root), such that for every vertexv = r there is a unique vr path. Then, trees are connected, its roots are unique andthe rest of the vertices have outdegree 1 and the indegree may vary. Vertices with indegree 0 are called leaves. The root is the targeted page for improving its PageRankvaluation. The height of a vertex in a rooted tree is the length of the path from the

    vertex to the root. The level k of a rooted tree is the set of vertices with height k; theroot is at level N0. The height of a rooted tree is the length of the longest path froma leaf to the root.

    Remark 4.1 Since we are interested in studying the behaviour of PageRank whenlocalised in certain subdigraphs of the Web digraph, we think, in particular, of our treesas local closed web sites. This means that the value of N in formula (1) is the numberof vertices in the tree.

    Our first result shows that to compute the PageRank of the root of a tree all weneed to do is count the number of vertices at each level of the tree.

    Theorem 4.2 If a rooted tree has N vertices and height h, then the PageRank of itsroot r is given by the formula

    P(r) =1

    N

    hk=0

    knk (3)

    where nk := |Nk| is the number of vertices of the kthlevel, Nk, of the tree.

    Proof: Below we use b Nk : b a to indicate that vertex b at level Nk has a link toa. Assume the first level of the tree N1 = {a1, . . . , an

    1}. Then, according to equation

    (1)

    P(r) =1

    N+

    aN1

    P(a) =1

    N+

    1

    N+

    bN2:ba1

    P(b)

    +

    . . . +

    1

    N+

    bN2:ban

    1

    P(b)

    4

  • 8/3/2019 Ranking Pages and the Topology of the Web

    5/27

    The index sets {b N2 : b ai}, for i = 1, . . . , n1, are pairwise disjoint; therefore,

    P(r) =1

    N(1 + n

    1) + 2

    bN2

    P(b)

    Repeating the above manipulations on levels N2, N3, and up to level Nh1, we have

    P(r) =1

    N

    h1k=0

    knk

    + hbNh

    P(b)

    At the last level Nh all vertices are leaves, which have no in-coming arcs, hence thePageRank of any b Nh is (1 )/N. Then

    hbNh

    P(b) =1

    Nhn

    h

    and the result follows.

    Remark 4.3 Theorem 4.2 shows that we can do any rearrangements of links betweentwo consecutive levels of a web set up as a rooted tree, and the PageRank of the rootwill be the same.

    Remark 4.4 Due to Theorem 4.2, we will from now on describe a rooted tree T, withh 0 levels, each of cardinality n0 = 1, n1, . . . , nh, as the string T = 1n1 . . . nh.Also the PageRank for the root r of T, or for any other vertex seemed as the root of asubtree in T, will depend on the height and the number of vertices at each level of T.Henceforth, we write PageRank of r in the tree T as a function of the height h, anddenote it P(h).

    For some regular topologies we can have nice closed formulas for their PageRank.Some examples follow below.

    4.1 m-ary trees

    For m, h 1 , let Tm(h) be the full mary tree of height h, i.e. a tree of heighth whose vertices, except by the leaves, have indegree m. The 1ary tree of heighth, T1(h), is a path of length h. For m > 1, Tm(h) has mk vertices at each levelk = 0, 1, . . . , h, and the total number of vertices is (mh+1 1)/(m 1). Using Theorem4.2 we can quickly calculate the PageRank for the root r (which depends on the height

    h and fixed arity m, and so we denote Pm(h)). This is

    Pm(h) = (1 )m 1

    mh+1 1

    hk=0

    mkk = (1 )

    m 1

    mh+1 1

    (m)h+1 1

    m 1

    and for the 1ary tree

    P1(h) =1 h+1

    h + 1

    5

  • 8/3/2019 Ranking Pages and the Topology of the Web

    6/27

    4.2 Binomial trees

    A binomial tree is at the core of fundamental data structures such as heaps, and hence,it qualifies as a good candidate for a websites topology.1

    We use Tb(h) to denote the full binomial tree of height h. We recall from [8,9.1] that Tb(0) consists of only one vertex the root and, inductively, Tb(h + 1) is

    two copies of Tb(h) joint with an arc from the root of one of the Tb(h) to the root ofthe other. At each level k = 0, 1, . . . , h, Tb(h) has

    hk

    vertices, and the total number of

    vertices in Tb(h) ish

    k=0

    hk

    = 2h. Using Theorem 4.2 we get a nice formula to easily

    calculate the PageRank for the root r of Tb(h), namely

    Pb(h) =1

    2h

    hk=0

    h

    k

    k = (1 )

    1 +

    2

    h.

    5 Rearrangements of vertices

    We begin our explorations on the kind of modifications on the tree structure that willimprove the valuation of PageRank. Our first result shows that completely erasing thevertices farthest away from the root improves the PageRank. This corroborates theknown fact that the optimal configuration is a star, i.e. a rooted tree of height 1 (seee.g. [3], [12]).

    Theorem 5.1 If in a tree T = 1n1 . . . nh of heighth 1, the last levelNh is completelyerased, then the PageRank of its root r, P(h), increases its value.

    Proof: After passing from the tree T = 1n1 . . . nh, with N = 1 + n1 + . . . + nh verticesand PageRank P(h), to the tree T = 1n1 . . . nh1 with N nh vertices and PageRankP(h), we get

    P(h) P(h) =nh

    (N nh)N(1 + n1 + . . . + nh1

    h1 (N nh)h)

    =nh

    (N nh)N(1 + n1 + . . . + nh1

    h1 (1 + n1 + . . . + nh1)h)

    =nh

    (N nh)N((1 h) + n1(

    h) + . . . + nh1(h1 h)) > 0

    because 0 < < 1 and h 1.

    Remark 5.2 Thus, in order to improve the PageRank of the root of a tree one candelete as many levels, from highest to lowest, as the context permits. Conversely, if a

    new level of vertices is added to a tree, then the PageRank of its root decreases.

    If it were the case that for practical, or any other reason, we were obliged to keepcertain height, then a natural question is how much can we prune the tree to improveon PageRank. The extreme situation is to prune all but one arc at each level, so wetake that structure as benchmark and called it queue tree.

    1Goodness as always is understood in terms of PageRank.

    6

  • 8/3/2019 Ranking Pages and the Topology of the Web

    7/27

    Definition 5.3 The queue tree of a tree T = 1n1 . . . nh is the tree

    Tq = 1n1 . . . nh12 1 . . . 1 h2+1

    Theorem 5.4 The PageRank of the root of a tree is smaller than the PageRank of theroot of its queue tree.

    Proof: We proceed recursively from the last level down to h12 .(a) The PageRank P(h) of the root r ofT = 1n1 . . . nh1nh is smaller than the PageR-ank P(h) of T = 1n1 . . . nh11. Indeed, let N = 1 + n1 + . . . + nh, then

    P(h) P(h) =(nh 1)(1 )

    (N (nh 1))N

    h1k=0

    nkk (N nh)

    h

    =(nh 1)(1 )

    (N (nh 1))N

    h1

    k=0nk(

    k h) > 0.

    Apply the same methodology for T = 1n1 . . . nh2nh11 and T = 1n1 . . . nh211, and

    so on, up to h/2. At this last step we have(b) T = 1n1 . . . nh1

    2nh+1

    2 1 . . . 1

    h2

    , and we shall see that its PageRank is less than

    that of the queue tree T = 1n1 . . . nh12 1 . . . 1 h2+1

    . We work separately the cases of h

    even or h odd.(b.i) If h = 2p 1 then T = 1n1 . . . np1np 1 . . . 1

    p1

    , T = 1n1 . . . np1 1 . . . 1 p

    and N =

    n1 + . . . + np + p. Let M =(np1)(1)(N(np1))N

    . Then

    P(h) P(h) = M

    1 + p1

    k=1

    nkk (N np)

    p +

    2p1k=p+1

    k

    = M

    (1 p) + p1

    k=1

    nk(k p) +

    2p1k=p+1

    (k p)

    = M

    (1 p) +

    p1k=1

    (nk pk)(k p)

    > 0.

    (b.ii) If h = 2p then T = 1n1 . . . np1np 1 . . . 1 p

    , T = 1n1 . . . np1 1 . . . 1 p+1

    and N =

    n1 + . . . + np + p + 1. One then shows P(h) P(h) > 0 by a similar argument as in

    (b.i).

    Remark 5.5 Theorem 5.4 can not be improved, in the sense that deleting furthervertices (but keeping the height) in a queue tree may or may not improve the PageRankof the root. For small values of h, the queue tree is the optimal pruning of a tree for

    7

  • 8/3/2019 Ranking Pages and the Topology of the Web

    8/27

    increasing PageRank. For example, if h = 4 the corresponding queue tree is Tq =1n1111 with PageRank P(h), and if n1 > 1 and we remove a vertex from level N1, weget the tree T = 1(n1 1)111 with PageRank P

    (h), and their difference is

    P(h) P(h) =1

    (n1 + 3)(n1 + 4)(1 4 + 2 + 3 + 4) < 0

    for any such that 0.27568 < < 1.For larger values of h, an improvement of PageRank will depend on and on the

    cardinalities of the levels N1, . . . , Nh12. There are also some improvements that can

    be done on queue trees of particular trees, such as mary and binomial.

    6 Hierarchies of trees by height and size

    In what follows we assume that 12 < < 1, an interval of useful values for in practice(see the analysis on this subject in [12]). We want to order the mary and binomialtrees with respect to their PageRank. Which tree structure is best for PageRank? Ourfirst result on this theme gives a hierarchy with respect to the height.

    Theorem 6.1 For values of the height h sufficiently large, we have

    P1(h) > Pb(h) > P2(h) > P3(h) > .. . > Pm(h) . . .

    Proof: We have to compute the appropriate limits:

    1. For 1 < k < m, limh

    Pk(h)

    Pm(h)=

    (k 1)(m 1)

    (m 1)(k 1)> 1, from where we conclude

    that Pk(h) > Pm(h).

    2.P2(h)

    Pb(h) = + 1

    2(2 1)

    (2)h+1 1

    ( + 1)h+12h+1

    2h+1 1h

    0.

    3.Pb(h)

    P1(h)=

    (1 )(h + 1)1+2

    h1 h+1

    h 0 .

    Next we classify the PageRank of the mary and binomial trees with respect totheir order. (Beware that we will express the order in terms of the height, and thereforewe keep the notation Pm(h) and Pb(h) that remarks the dependency of PageRank onthe height of the tree.)

    Theorem 6.2 For sufficiently large order N and 0.58, we have

    . . . Pm(h) > .. . > P5(h) > Pb(h) > P4(h) > P3(h) > P2(h) > P1(h)

    Proof: The proof splits into three cases.(a) If 1 < k < m and N >> 0 then Pm(h) > Pk(h) : A kary tree of height h

    has N =kh+1 1

    k 1many vertices. An mary tree of height h has the same number of

    vertices as in a kary tree if, and only if,kh+1 1

    k 1=

    mh+1 1

    m 1, or equivalently h+1 =

    8

  • 8/3/2019 Ranking Pages and the Topology of the Web

    9/27

    logm

    m1k1 (k

    h+1 1) + 1

    logmm1k1 (k

    h+1)

    , where the last relation indicates an

    equivalence among infinite large quantities. Then

    limh

    Pk(h)

    Pm(h)= lim

    h

    (m 1)

    (k 1)

    (k)h+1 1

    (m)h+1 1

    = limh

    (m 1)(k 1)

    1(m)logmm1

    k1

    logmk

    h+1 = 0since k < m implies logmk < 1, and in consequence

    logmk< 1.

    (b) If m 2 and N >> 0 then Pm(h) > P1(h):

    P1(h)

    Pm(h)=

    1h+1

    h+1

    (m1)(1)mh+11

    (m)h+11m1

    =(m 1)

    (1 )

    (1 mh+11m1 )

    (m)h+1 1h 0

    (c) If N >> 0 then P5(h) > Pb(h) > P4(h): A binomial tree of height h has 2h

    vertices. For m 2, an mary tree of height h has same cardinality of a binomial tree

    of height h if, and only if, mh+1 1m 1

    = 2h

    , or equivalently,

    h = log2

    mh+1 1

    m 1 log

    2

    mh+1

    m 1.

    We then show that

    limh

    Pm(h)

    Pb(h)=

    (1 + )log2 (m1)

    m 1

    m

    (1 + )log2 m

    h+1= L.

    The limit L is 0 or depending on m

    (1+)

    log2m being less than or greater than 1,

    respectively. We showed L = 0 if m 4 or m = 5 and < 0, where 0 is theirrational number solution of 50 = (1 + 0)

    log25, namely, 0 = 0.57016 . . .. On the

    other hand, L = if m 6 or m = 5 and > 0.

    6.1 Refining the hierarchy

    According to our results there are many non uniform ways in which one can improvethe PageRank in our binomial or mary trees, yet keeping the height as a constraintfor maintaining a hierarchically organised website: it is sufficient to remove verticesat farther distance from the root. However, there are also non trivial uniform treeswith better PageRank than the binomial, with respect to height (Theorem 6.1). We

    present one such possibility which can be seen as a basic backbone for a website withdifferent levels. Our intention with this example is to illustrate the following fact: tooptimise the PageRank values of certain pages of a web site is in general a hard task,as there could be exponentially many possible rearrangements. We will come back tothis point in section 7.

    We define the path tree of height h, denoted T(h), as a tree with a root fromwhich hangs h paths of h, h 1, . . . , 2, and 1 vertices. Observe that T(h) has hvertices at level 1 (connected to the root), h 1 vertices at level 2, h 2 vertices at

    9

  • 8/3/2019 Ranking Pages and the Topology of the Web

    10/27

    level 3, . . . , one vertex at level h. Hence, |T(h)| = 1 + h(h + 1)/2 and if r is the rootof T(h), we have its PageRank, P(h), is given by

    P(h) =2(1 )

    2 + (h + 1)h(1 + h + (h 1)2 + (h 2)3 + . . . + 2h1 + h)

    =2(1 )

    2 + (h + 1)h

    1 + h+1

    hk=1

    k

    1

    k

    =2(1 )

    2 + (h + 1)h

    1 +

    h+2

    (1 )2

    1 (h + 1)

    1

    h+ h

    1

    h+1

    We then show that

    P(h)

    Pb(h)=

    2

    2 + (h + 1)h

    2

    1 +

    h1 +

    h+2 + (1 )h 2

    (1 )2

    h

    Thus P(h) > Pb(h) for sufficiently large h, although we have checked the inequality

    computationally for values ofh 3 (for h < 3 the binomial tree and the chain tree arethe same).

    On the other hand, one can show thatP(h)

    P1(h)h 2, and hence the relative

    position ofP(h) and P1(h) in the hierarchy depends on whether is > 1/2 or < 1/2.

    6.2 Hierarchies for queue trees

    Surprisingly, for the queue trees ofmary and binomial trees, the same ordering of theirPageRank with respect to height and order holds. For the mary (resp. binomial) treeof height h, we use Pq,m(h) (resp. Pq,b(h)) to denote the PageRank of (the rootof) its queue tree. These values will depend on the parity of the height h, sinceby Definition 5.3, if h = 2p 1 then Tq = 1n1 . . . np1 1 . . . 1

    p

    , and if h = 2p then

    Tq = 1n1 . . . np1 1 . . . 1 p+1

    . Thus, for m > 1,

    Pq,m(2p 1) =1

    mp1m1 + p

    (m)p 1

    m 1+ p

    p 1

    1

    Pq,m(2p) =1

    mp1m1 + p + 1

    (m)p 1

    m 1+ p

    p+1 1

    1

    (the case m = 1 is trivial since the 1-ary queue tree coincides with the 1-ary tree).

    And,

    Pq,b(2p 1) =1

    22p2 + p

    p1k=0

    2p 1

    k

    k + p

    p 1

    1

    Pq,b(2p) =1

    22p1 122pp

    + p + 1

    p1k=0

    2p

    k

    k + p

    p+1 1

    1

    10

  • 8/3/2019 Ranking Pages and the Topology of the Web

    11/27

    Theorem 6.3 (i) For sufficiently large height h, we have

    Pq,1(h) > Pq,b(h) > Pq,2(h) > Pq,3(h) > . . . > Pq,m(h) . . .

    (ii) For sufficiently large order N and 0.58, we have

    . . . Pq,m(h) > .. . > Pq,5(h) > Pq,b(h) > Pq,4(h) > Pq,3(h) > Pq,2(h) > Pq,1(h)

    The proofs of (i) and (ii) are more involved and longer than previous theorems. Weshall give sufficient pointers so that readers may reproduce them.

    Part (i): We need to study separately the cases of h being even or odd. The crucialobservation is:

    Lemma 6.4 If 0 < < 1 then

    (1 + )2p2

    p1k=0

    2p 1

    k

    k + p

    p 1

    1 (1 + )2p1.

    (To show this observe that for 0 < < 1 then the quotient of (1 + )2p1 and

    p1k=0

    2p1k

    k + p p11 tends to a constant L, with 1 L 1 + , as p grows.)

    Using this lemma, we obtain the same limits 1), 2) and 3) in Theorem 6.1 for h = 2pand for h = 2p 1.

    Part (ii): As in (i) we need to study separately the cases of h being even or odd.Observe that for the same type of queue tree ( being m-ary, binomial, etc.), byTheorem 5.1, the queue tree of height 2p 1 is obtained by deleting the level N2p ofthe queue tree of height 2p, and hence,

    Pq,(2p 1) > Pq,(2p)

    Therefore, for any two queue trees of types 1 and 2 we have

    Pq,1(2p 1)

    Pq,2(2p)> max

    Pq,1(2p)

    Pq,2(2p),

    Pq,1(2p 1)

    Pq,2(2p 1)

    >

    Pq,1(2p)

    Pq,2(2p 1)

    and since all these quotients are positive, if any of them reduces to zero then allthose that are to the right hand side (the ones that are smaller) will also reduceto zero. Hence, to prove that Pq,1(h) < Pq,2(h), it will be enough to prove thatPq,1(2p 1)

    Pq,2(2p)

    p 0.

    Now to obtain the inequalities claimed in (ii) follow the scheme of Theorem 6.2:

    (a) If 1 < k < m then showPq,k(2p 1)

    Pq,m(2p)

    p 0.

    (b) If m > 1 then showPq,1(h)

    Pq,m(2p 1)

    p 0, where h = |Pq,m(2p 1)| (so this h

    depends on p).

    (c) To show Pq,5(h) > Pq,b(h) > Pq,4(h), due to the observation that Pq,(2p 1) >Pq,(2p), it will be enough to show that Pq,5(2p) > Pq,b(2p 1) and Pq,b(2p) >Pq,4(2p 1). These two inequalities are shown by taking analogous limits as inthe proof of part (c) in Theorem 6.2 and applying the same constraints about .

    11

  • 8/3/2019 Ranking Pages and the Topology of the Web

    12/27

    7 The problem of optimising the link structure

    As mentioned in the introduction, a theoretically as well as commercially importantproblem is to find a scheme for modifying the link structure of a local web in orderto improve its ranking, as set by PageRank or any other ranking function. Let usbegin with the most basic goal of designing a local web (or fixing an already existingone) with a treelike structure, where the PageRank of the main page, located at theroot of the tree, should have the highest possible value. By virtue of Theorem 4.2 thistranslates into the following optimisation problem.Main Objective: To maximise the function

    P(h, 1, n1, . . . , nh) =1

    1 + n1 + . . . + nh

    hk=0

    knk

    for fixed , such that 0 < < 1, and all trees T = 1n1 . . . nh with integer valuesh, ni 1, 1 i h. If the total number N of vertices is bounded then we can assurethat the maximum exists (P is continuous in a compact domain). The complexity of

    the problem is equivalent to the complexity of finding the positive integer partitions ofN; a problem with only known exponential solutions. This justifies approaching thesolution through heuristics. Here we give an ad hoc list of rules that clearly stem fromour theorems.Rule 1: Due to Theorem 5.1, the first thing to do is to reduce the height as much asthe context allows.Rule 2: Keep in mind that while applying Rule 1 (and deleting levels), links betweenconsecutive levels can be rearrange in any way you like, as long as the context is keptconsistent, and this has no effect on the roots PageRank value (by Theorem 4.2).Rule 3: Once the optimal height h > 1 is attained2, we delete (as much as possible)vertices from levels in the upper half of the tree, trying to get it close to its underlying

    queue tree (Theorem 5.4), and those vertices that cannot be deleted should be movedas closer to level 1 as possible (by Theorem 4.2).

    The above rules of general nature can be complemented by next working on theparticularities of the queue tree structure. For example, if the applications of rules 1to 3 give as a final result a 5ary queue tree, then Theorem 6.3(i) tell us that pruningmore vertices to convert this tree into a binary queue tree, or binomial queue tree (ofsame height) improves PageRank. The caveat is that we have proved Theorem 6.3using continuous calculus and, therefore, cannot be applied without doubt for smallvalues of the height. To remedy this deficiency, we have computed Pq,b(h) and Pq,m(h)for various m and many small integer values of h, and concluded the following facts,which strengthen Theorem 6.3(i):

    1. Pq,1(h) > Pq,b(h) > Pq,2(h), for h 17.

    2. Pq,b(h) > Pq,1(h), for 2 h 16.

    3. Pq,b(h) > Pq,2(h), for all h > 1.

    2Optimality here again depends on maintaining the context consistent. This height could mean theminimal levels of a hierarchy that we need to reflect in the web site; say, for example, of a corporationor a hypertext.

    12

  • 8/3/2019 Ranking Pages and the Topology of the Web

    13/27

    4. Pq,1(h) > Pq,2(h), for h 15.

    5. Pq,2(h) > Pq,1(h), for 2 h 14.

    6. Pq,2(h) > Pq,3(h) > Pq,4(h) > .. . > Pq,m(h) . . ., for h 9.

    7.Pq,2

    (h) 0

    1 : bh1 = 0

    Therefore,

    16

  • 8/3/2019 Ranking Pages and the Topology of the Web

    17/27

    ch1 =

    nh1 bh1 : bh1 > 0

    nh1 1 : bh1 = 0

    which is equivalent to ch1 = nh1 nh1. Then

    P(h) =1

    N

    h2k=0

    nkk + nh1

    h1 + bhh

    +

    N

    hk=1

    bkk

    P(h) =1

    N ch1

    h2k=0

    nkk + nh1

    h1 + bhh

    +

    N ch1

    hk=1

    bkk

    Their difference is

    P(h) P(h) =(1 )ch1

    (N ch1)N

    h2k=0

    nk

    k h1

    + (1 )ch1(N ch1)N

    bh

    h h1

    + ch1(N ch1)N

    hk=1

    bkk

    The last two summands can be rewritten as

    ch1(N ch1)N

    h1k=1

    bkk+1 + bh

    h1(2 1)

    and this expression is surely positive for 1/2.

    We can generalise the previous result as follows

    Theorem 9.5 Let B be a bidirectional tree of height h with only bidirectional arcs from the last level up to levelNp+1, for some 1 p < h. If the unidirectional arcsat the level Np are removed, then the PageRank of the root of B increases, providedp h 12 and = 0.85.

    Proof: We haveB = 1n1b1 . . . np1bp1npbpbp+1bp+1 . . . bhbh

    with bk 1 for k p + 1, and after the removal of unidirectional arcs at level Np weend up with

    B = 1n1b1 . . . np1bp1bpbpbp+1bp+1 . . . bhbh

    Recall that cp = np bp. Then

    P(h) =1

    N

    p1

    k=0

    nkk + np

    p +h

    k=p+1

    bkk

    +

    N

    hk=1

    bkk

    P(h) =1

    N cp

    p1

    k=0

    nkk + bp

    p +h

    k=p+1

    bkk

    +

    N cp

    hk=1

    bkk

    17

  • 8/3/2019 Ranking Pages and the Topology of the Web

    18/27

    Therefore, setting M = cp(Ncp)N, we have

    P(h) P(h) = M(1 )

    p1k=0

    nkk +

    hk=p+1

    bkk

    + M

    hk=1

    bkk + M(1 )p(np N)

    Since (np N) = (1 + n1 + . . . + np1 + np+1 + . . . + nh) and observing that byhypothesis for k p + 1, nk = bk, we have that P

    (h) P(h) is equal to

    M(1 )

    p1

    k=0

    nk(k p) +

    hk=p+1

    bk(k p)

    + M h

    k=1

    bkk =

    M(1 )p1

    k=0

    nk(k p) +

    p

    k=1

    bkk + p

    h

    k=p+1

    bk(kp + 1) .

    For = 0.85 and t > 12 the terms t + 1 are positive. Hence, for = 0.85 andvalues of p h 12, we have kp + 1 > 0, for p + 1 k h, and hence the sumis positive. 3

    Next, we wonder if we could have in the bidirectional setting a result similar toTheorem 5.1. Unfortunately we found that removing the last level does not necessarilygive us an improvement on PageRank for an arbitrary bidirectional tree. We do get apositive result, however, if we impose some regularity condition on our trees. In thefollowing theorem we consider bidirectional trees in which the number of bidirectionalarcs bk at each level Nk is at least proportional to the number of bidirectional arcs

    bh at the last level Nh. Formally, we say that a bidirectional tree of height h isproportional if

    bknk

    bhnh

    for all 1 k h.

    Theorem 9.6 If in a proportional bidirectional tree B of height h we remove the lastlevel, then the PageRank of its root increases, except for small values of h.

    Proof: We have that the tree B = 1n1b1 . . . nhbh withbknk

    bhnh , for all 1 k h,

    gets transformed into the tree B = 1n1b1 . . . nh1bh1. Takingbhnh

    = C and using that

    N = 1 + hk=1 nk we can bound the difference of the PageRanks of the roots of these3In the real world of the internet experimental studies have shown that whenever two pages are

    connected by a path, the average length of such a path is less than 19 (see [1]).

    18

  • 8/3/2019 Ranking Pages and the Topology of the Web

    19/27

    two trees as follows. Let M = nh(Nnh)N, then

    P(h) P(h)

    M(1 )

    1 +

    h1

    k=1nk

    k +h1

    k=1Cnk

    k+1

    1 (N nh)

    h + Ch+1

    1

    = M(1 )

    1

    h(1 ) + Ch+1

    1 +

    h1k=1

    nk

    k +

    Ck+1

    1 +

    h(1 ) + Ch+1

    1

    = M

    1 + (1 (1 C))

    h +

    h1k=1

    nk(k h)

    Assuming the worst case in which there is just one bidirectional arc at each level, weget the bound

    P(h) P(h) M1 + (1 (1 C))h +h1

    k=1k h

    = M

    1 + C2 h(1 (1 C))(1 + (1 )h)

    The function

    f(h) := 1 + C2 h(1 (1 C))(1 + (1 )h)

    has in h = 0 the value f(0) = C( 1) < 0, and is non decreasing since

    f(h + 1) f(h) = (1 (1 C))h(1 )2(1 + h) > 0

    Furthermore f(h) converges to 1 + C2 > 0 when h ; hence, there exists h0such that for all h h0 we have P(h) P(h) > 0. This proves the theorem. In fact,for = 0.85, the function f(h) and P(h) P(h) are positive, for h 4 and 0 C 1

    In particular, if B is a proportional bidirectional tree of height h and all the arcsat the last level Nh are bidirectional, then all arcs of the tree are bidirectional, and sowe have

    Corollary 9.7 If a bidirectional tree of height h has all its arcs bidirectional, and thelast level is removed, then the PageRank of its root increases, except for small valuesof h. (In fact, it increases for h 4 when = 0.85.)

    We have shown so far that either changing a unidirectional arc to bidirectional,or deleting unidirectional arcs from the last levels of the tree, increases the PageRankvaluation of the root. Furthermore, we have shown that for proportional bidirectionaltrees removing the last level completely also produces an improvement on PageRank.But if we are to preserve a certain height for a particular tree it is natural to considerup to what extend we can prune a tree and increase the PageRank value of the root,yet preserving the height. By the previous considerations our extreme situation shouldbe of a bidirectional queue tree consisting of only bidirectional arcs in the levels that

    19

  • 8/3/2019 Ranking Pages and the Topology of the Web

    20/27

    constitute the queue; these are the levels from one point of height to the end with onlyone bidirectional arc. Now, pruning an arbitrary bidirectional tree in such a way thatit gets transformed into a bidirectional queue tree may not result in an improvementon PageRank of the root. In order to obtain positive results we have to restrict ourmanipulations again to proportional bidirectional trees.

    Theorem 9.8 LetB be a proportional bidirectional tree of height h. If we remove fromthe last level of B all arcs but one (in order to keep the height h), then the PageRankof its root increases, except for small values of h.

    Proof: As in the proof of Theorem 9.6 we have the tree B = 1n1b1 . . . nhbh withbknk

    bhnh = C, for all 1 k h, which gets transformed into the tree B =

    1n1b1 . . . nh1bh111. For M =nh1

    (Nnh+1)N, we have that P(h) P(h) is bounded

    above by

    M1 h +

    h1

    k=1((1 )(k h) + (Ck h))nk

    and assuming the worst case in which nk = 1, for all 1 k h 1, we get the bound

    P(h) P(h) M

    1 + C

    hk=2

    k (h + 1)h

    (6)

    Since limh

    (h + 1)h = 0, for 0 < < 1, we can assure that (6) is positive for all h

    greater or equal to a certain positive ho which depends on C.(In particular, for C = 1 we are in the case of B having only bidirectional arcs

    and, for = 0.85, the corresponding expression 1(h) := 1 +h1

    k=2 k hh in equation(6) is negative for h = 2, 3 and positive for h 4. Likewise, for C > 1, setting

    C(h) := 1 + Ch

    k=2

    k (h + 1)h one can easily see that 23

    (h) > 0 for h 6,

    12

    (h) > 0 for h 8 and 13

    (h) > 0 for h 10.)

    In view of the previous result it is reasonable to work with the following form ofqueue tree. We denote by Bq a proportional bidirectional tree of height h witha queue of length q, i.e., a bidirectional tree of height h with a unique bidirectional

    arc in each of its q last levels and with bknk

    bhqnhq

    , for 1 k h q.

    Theorem 9.9 Let Bq be a proportional bidirectional tree of height h with a queue oflength q. If we remove from the level Nhq of Bq all arcs but one (in order to keep theheight h), then the PageRank of its root increases, except for small values of h.

    Proof: We have the tree Bq = 1n1b1 . . . npbp11 . . . 11 of height h and N vertices, withp = h q, and suppose we remove all but one of the arcs at level Np. Then we obtainthe tree Bq+1 = 1n1b1 . . . np1bp111 . . . 11 of height h, same root and N (np 1)

    20

  • 8/3/2019 Ranking Pages and the Topology of the Web

    21/27

    vertices. The corresponding PageRanks of the roots of these trees, namely, Pq(h) andPq+1(h), are given by the equations

    Pq(h) =1

    N

    p

    k=0nk

    k +h

    k=p+1k

    +

    N

    p

    k=1bk

    k +h

    k=p+1k

    Pq+1(h) =

    1

    N np + 1

    p1

    k=0

    nkk +

    hk=p

    k

    +

    N np + 1

    p1

    k=1

    bkk +

    hk=p

    k

    Since1

    N np + 1

    bpN

    1

    N np + 1

    npN

    =(1 np)(N np)

    (N np + 1)N

    and using that N = 1 + q +p

    k=1 nk and bk Cnk, we obtain the bound

    Pq+1(h) Pq(h)

    (1 )(np 1)

    (N np + 1)N p

    k=0(

    k

    p

    )np +

    hk=p+1

    (k

    p

    )+

    (np 1)

    (N np + 1)N

    p + p

    k=1

    (Ck p)nk +h

    k=p+1

    (k p)

    and assuming the worst case in which nk = 1, for all 1 k p = h q, we get thebound

    Pq+1(h) Pq(h)

    nhq 1

    (N nhq + 1)N1 + Chq

    k=2 k +

    h

    k=hq+1k (h + 1)hq (7)

    For q = 0, 1, 2, . . ., we have limh

    (h + 1)hq = 0, therefore, for each q there is a hq that

    depends on C such that Pq+1(h)Pq(h) is positive for all h hq. (For a more detailed

    analysis consider the expression qC(h) = 1+ Chq

    k=2 k +

    hk=hq+1

    k (h + 1)hq

    and fix = 0.85. Then, in the case that Bq consist only of bidirectional arcs, we have

    C = 1 and q1(h) = 1 +2h+1

    1 (h + 1)hq. This expression is p ositive for q = 0

    and h 4; it is also positive for q 1 and h 5. Likewise, for other values of theproportionality coefficient C (e.g. 13 ,

    23 ,

    12 , etc.) we find for each value of q small values

    of hq from which qC(h) > 0. )

    9.1 Rules for optimising bidirectional trees

    The results of the previous section can be summarised in the following set of rules foroptimising PageRank of the root of a bidirectional tree. Obviously these rules shouldapply insofar as the context allows.4

    4We are aware that some of the rules listed here (and more that could be derived from our results)are, to some extend, already in use by web masters and SEO analysts, but as far as we have seen,without much mathematical justification.

    21

  • 8/3/2019 Ranking Pages and the Topology of the Web

    22/27

    Rule 1 Make all connections bidirectional; or else

    Rule 2 Remove from the bottom level (Nh) and upwards some (or all) the unidirec-tional arcs; or else

    Rule 3 Turn your tree into a proportional bidirectional tree (by converting the ap-

    propriate number of unidirectional arcs into bidirectional, at all levels), and thenremove the last level, and successive levels bottom-up till level h = 4.

    10 More complex topologies

    In this section we carry on our studies to web configurations of higher complexitythan trees. A first step towards a generalisation of our results to other digraphs isto consider a bidirectional tree of height h > 1, on which we close permissible cyclesof any length obtained by joining vertices from level Nj with vertices from level Nk,for 0 j < k h. In this way we can transform bidirectional arcs vuv into cyclesvuvn . . . v1v of longer length, where the arc uvn close the new cycle inserted in the

    rooted tree. Also the arc uv of the bidirectional arc vuv can be substituted by a newarc ut closing a larger path t . . . v u in the tree. In Figure 2 we exhibit some examplesof these transformations.

    t t

    t

    t

    t

    r

    u

    v

    s t

    t t

    t

    t

    t

    t

    t

    r

    u

    v

    v2

    v1

    s t

    =

    t

    ttt

    t

    t

    t

    r

    u

    v

    v1

    v2

    s t

    t t

    t

    t

    t

    r

    u

    v

    s t

    t t

    t

    t

    t

    r

    u

    v

    s t

    Figure 2. Examples of cyclical trees.

    Let us call these classes of digraphs obtained by closing cycles on bidirectional treesas cyclical trees. Formally (and mimicking section 8), we define a digraph C = (V, A)as a cyclical tree with root r, if its set of arcs A can be partitioned in two disjointsets A1 and A2 such that:

    (V, A1) is a partial tree with root r (the underlying tree of C), and

    22

  • 8/3/2019 Ranking Pages and the Topology of the Web

    23/27

    if uv A2 then there is a path v0v1 . . . vmvm+1, beginning at v0 = v, ending atu = vm+1 and with intermediate vertices and arcs vivi+1 in A1, and in this casewe say that v is the origin of the cycle vv1 . . . vmuv.

    Analogously we define T the infinite tree associated to C as the tree rooted at rsuch that for each vertex v = r that is the origin of a cycle in C, substitute the arc uv

    by a countable infinite path rooted at v.Let nk be the number of vertices in the k-th level of C. Let C

    sk be the number of

    origins of cycles of length s between levels Nk+1s and Nk ofC (note that in particular,C2k = bk, for k > 0). Finally, let mk be the number of vertices in the k-th level of T,the infinite tree associated to C. Then, using (2) we have a result similar to Theorem8.1 for the case of cyclical trees.

    Theorem 10.1 IfC is a cyclical tree rooted atr withN vertices then, with the notationand provisos stated above, the PageRank of r is given by

    P(r) =1

    N

    h

    k=0

    nkk +

    N

    h

    k=1

    Ckk (8)

    where C1 = C21 , and for k > 1, Ck =

    k+1s=2 C

    sk.

    The proof follows the same line of argument as the proof of Theorem 8.1, where from

    the power series P(r) =1

    N

    k=0

    mkk, which applies to the root r of the cyclical

    tree C, we get equation (8).

    Remark 10.2 Note that from equation (8) we can recover the vector formulation forPageRank of the root of cyclical tree (cf. equation (5)), as follows: set the vector = (C0 = 0, C1, C2, . . . , C h), and as in section 8, let = (n0, n1, . . . , nh) and =(0, 1, . . . , h). Then

    P(r) =1

    N((1 ) + ) (9)

    Remark 10.3 It also follows from equation (8) that the PageRank of the root of acyclical tree only depends on the values of nk and C

    sk, for 2 s k + 1, and not on

    the distribution of links among vertices at level Nk to vertices at higher levels Nk+1s.

    From Theorem 10.1 (and equation (8)), we can derive further rules for improvingPageRank on bidirectional trees that are turned into cyclical trees.

    Rule 1 If in a bidirectional tree we build a cycle by closing the path vv1 . . . vs2uof length s 1, running from the level Nk to Nk+1s, with an arc uv whichconnects the level Nk+1s with level Nk, then the PageRank improves. Thisis due to the fact that in the PageRank formula of the new cyclical tree thatis been produced it appears the extra term 1

    N

    hj=k+1

    j = k+1h+1

    N, which

    is p ositive. Furthermore, observe that for 0 < < 1, the previous quantitydecreases as k increases; hence the PageRank gets better as shorter there are thecycles closing at level Nk+1s, been the bidirectional arcs (2cycles) the optimal.

    23

  • 8/3/2019 Ranking Pages and the Topology of the Web

    24/27

    Rule 2 The PageRank of the root of a bidirectional tree improves if each vertex uis linked with a bidirectional arc to each one of the vertices on the subtree withroot u (hence obtaining a cyclical tree). This is due to the fact that each newlink adds 1 to each term of the expression

    N(C1 + C2

    2 + . . . + Chh). One

    may interpret this action as linking an individual with all its subordinates in a

    hierarchical organisation.

    10.1 Further extensions

    The next natural step is to upgrade the preceding results on cyclical trees to finite cyclicstructures which can be modelled by our infinite trees. In order to achieve this furtherextensions we should then visualise an arbitrary digraph through its condensationdigraph as the acyclic digraph consisting of its strongly connected components.

    A digraph D = (V, A) is strongly connected if for each pair u and v of distinctvertices, there is a path joining u with v and a path joining v with u. Define u v provided there are paths joining u with v and v with u. This is an equivalencerelation and, in consequence, V is partitioned into equivalence classes V1, . . . , V p. The

    p subdigraphs Di = (Vi,A/Vi) induced on the sets Vi, i = 1, . . . , p, are the strongconnected components of D. The digraph D is strongly connected if and only if ithas exactly one strong component. The condensation digraph of the digraph D isthe acyclic digraph whose vertices are the strong connected components, or SCC, of D,and there is an arc from one SCC Di to another Dj, i = j, if a vertex of Vi is adjacentin D to a vertex of Vj .

    Now, the extension of our techniques and procedures to a more general digraphrequires that its corresponding condensation digraph be a rooted tree whose SCCbehave like the cyclical rooted trees. Not all SCC may have the required behaviour,but many do so, and the key is that each SCC should have a root through which itconnects to the rest of the tree (and by no other vertex), and two such SCC which are

    adjacent in the condensation digraph are linked by just one arc in the original digraph.The SCC which have these properties we shall called PRdigraph.Formally, a PRdigraph with root r is a strongly connected digraph with at

    least one vertex r (the root of the PRdigraph) such that for all vertex v (v = r),there is a unique path joining v with r. In this structure we would say that a vertexv is at level Nk if the path that connects v with r is of length k. Now, in essence, aPRdigraph is much like a cyclical tree in as much as it can be seen as a tree withcycles formed by adding arcs from one level up to another level down. The key is thata PRdigraph admits a corresponding infinite tree due to the fact that each vertex hasbeen assigned a unique level of the graph, or in other words, a unique path to r. Notethat the root as the rest of the vertices may have outdegree greater or equal to 1. Asan illustration of structures that can be PRdigraphs or not see Figure 3. There, inthe strongly connected digraph D the vertex 1 can not be the root of a PRdigraphsince vertex 2 has two paths towards 1. On the contrary, the vertices 2 and 3 can beroots of a PRdigraph D.

    24

  • 8/3/2019 Ranking Pages and the Topology of the Web

    25/27

    t t

    t

    1 2

    3D

    t

    t

    t

    2

    1

    3PR-digraphs of D

    t

    t

    t

    3

    2

    1

    Figure 3. Digraph D and the PR-digraphs that can be derived from it.

    Also the complete digraph {(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2)} can not be a PRdigraph for any of its vertices. But, the strongly connected digraph {(1, 2), (2, 1), (2, 3),(3, 2)} is a PRdigraph rooted at any of its three vertices.

    The condition characterising a PRdigraph must also apply to the connectionsamong SCC which are PRdigraphs. It can not b e the case that in a tree of SCCs,which are PRdigraphs, one such SCC connects to another SCC in the tree throughtwo arcs or more; that is, in the original digraph there must be a root (which itselfcould be the root of a PRdigraph) and it must be the case that each vertex v connectsto the root by a unique path. On the other hand, we must admit the possibility ofproducing cycles of length s > 1 in this structure by connecting a vertex at the levelNk+1s with a vertex at level Nk.

    We shall then define a PRdigraph tree with root r, as a digraph D = (V, A)whose set of arcs A can be partitioned in two disjoint sets A1 and A2 such that:

    (V, A1) is a partial digraph whose condensation digraph is a tree of SCCs whichare PRdigraphs, each pair of adjacent PRdigraphs are linked by a unique arcand the maximal PRdigraph contains the root r (the underlying digraph ofD);and

    if uv A2 then there is a path v0v1 . . . vmvm+1, beginning at v = v0, ending at

    u = vm+1 and with intermediate vertices and arcs vivi+1 in A1, and in this casewe say that v is the origin of the cycle vv1 . . . vmuv.

    Analogously we define T the infinite tree associated to D as the tree rooted at r,such that for each vertex v = r that is the origin of a cycle v . . . u v in D, substitutethe arc uv by a countable infinite path rooted at v. Note that this time the arc uv, aswell as the cycle v . . . u v, could be in the partial digraph (V, A1). We show in Figure4 a digraph D that can not b e a PRdigraph tree. This is due to the fact that theleftside SCC of D is not compatible with a PRdigraph tree, since the vertex u hasout degree 2 towards the root r. Also in the SCC on the right branch of D either oneof the arcs xy or zw represents a surplus that forbids D from being a PRdigraph tree.

    t

    t

    t

    t

    t

    t

    t

    t

    t

    t

    r

    u y w

    x z

    Figure 4.

    25

  • 8/3/2019 Ranking Pages and the Topology of the Web

    26/27

    Theorem 10.4 A PRdigraph tree with root r is a cyclical tree with root r.

    Proof: Let D = (V, A) be a PRdigraph tree with root r. It is sufficient to prove thatthe underlying digraph, C = (V, A1), ofD is a cyclical tree with root r. The set of arcsA1 can be partitioned in two disjoint sets: A11 = {vu : there is a path v u . . . r in C}and A12 = A1\A11. The digraph (V, A11) is a tree rooted at r because, by the definitionof the underlying digraph C, each vertex of V is joined with the root r by a uniquepath. Then (V, A11) is the underlying directed tree of C and, moreover, if uv belongsto A12 then uv is in a SCC that is a PRdigraph with some root r

    . By the strongconnection, there is a path joining v to u, and thus C is a cyclical tree.

    Therefore, with the notations as in the preceding section, relative to cyclical treestranslated to PRdigraph trees, we have

    Corollary 10.5 The PageRank of the root r of a PRdigraph tree can be computed bythe vector formula (9)

    P(r) =1

    N((1 ) + )

    where = (C0 = 0, C1, . . . , C h), = (n0, n1, . . . , nh) and = (0, 1, . . . , h).

    In Figure 5 we exhibit a PRdigraph tree D and its representation as a cyclical tree.

    t

    t

    t

    t

    t

    r

    w

    v u

    t

    t

    t

    t

    t

    t

    r

    w

    v

    u

    tFigure 5.

    The PRdigraph trees are the most general cyclical structures admitting an asso-ciated infinite tree, and on which we can apply the optimisation techniques displayedin this article by treating each SCC as one unit. This could also revert on a speed upon the PageRank calculation.

    More explicitly, the last point we want to call attention to is the following. Thereare several approaches in the literature to the task of speeding up the calculation ofPageRank, based upon the following general scheme (see, for example, [ 11, 2, 6]):

    Partition the web into local subwebs; then compute some independent rank-ing for each local subweb, which will apply to the whole subweb treated asa unit; and then compute the ranking of the graph of subwebs.

    In [2] and [6] the local splitting of the web is done in strongly connected compo-nents, and further in [6, Thm 2.1], it is shown that the PageRank can be calculatedindependently on each SCC, provided we know the PageRank of all vertices outsidethe SCC, but directly linking to vertices in the SCC. Our PRdigraph tree is the most

    26

  • 8/3/2019 Ranking Pages and the Topology of the Web

    27/27

    simple splitting of the web in the way of [2] and [6], namely as SCC, with the addi-tional strongest condition of having a single link between components, which by thepreviously mentioned result of[6] can have PageRank computed independently on eachSCC, and on a very simple way, provided we know the PageRank of their descendantsin the topological structure of the tree. This suggests computing PageRank in parallel

    and through layers, as it is proposed in [6, 3], following an iterated process on the treefrom a top level Nh down to the root at N0. The PRdigraph is a suitable structurefor the application of this process.

    References

    [1] R. Albert, H. Jeong and A.-L. Barabasi. Diameter of the World Wide Web, Nature401:130-131, Sep 1999.

    [2] A. Arasu, J. Novak, A. Tomkins and J. Tomlin, PageRank computation and thestructure of the Web: Experiments and algorithms. The Eleventh InternationalWorld Wide Web Conference, Posters. 2002.

    [3] M. Bianchini, M. Gori and F. Scarselli. Inside PageRank, ACM Transactions onInternet Technologies 4 (4), 2005.

    [4] S. Brin and L. Page, The anatomy of a large scale hypertextual web search engine.Computer Networks and ISDN Systems, 33 (1998), 107-117.

    [5] S. Brin, R. Motwami, L. Page and Terry Winograd, The PageRank citation rank-ing: Bringing order to the web. Technical Report, Comp. Sci. Dept., StanfordUniversity, 1998.

    [6] M. Brinkmeier. Distributed calculation of PageRank using strongly connectedcomponents, Proceedings of the I2CS05 in Paris (LNCS), 2005.

    [7] M. Brinkmeier. PageRank revisited, ACM Transactions on Internet Technologies,6 (3), 2006.

    [8] T. Cormen, C. Leiserson, R. Rivest and C. Stein. Introduction to Algorithms, TheMIT Press, 2nd. Edition, 2001.

    [9] P. Craven, Googles Florida Update. In: WebWorks(http://www.webworkshop.net/florida-update.html), Dec. 3, 2003.

    [10] Google Acquires Kaltix Corp. In Google Press(http://www.google.com/intl/ro/press/pressrel/kaltix.html), Sept. 30, 2003.

    [11] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Exploitingthe block structure of the Web for computing PageRank. Stanford UniversityTechnical Report, 2003.

    [12] A. N. Langville and C. D. Meyer, Deeper Inside PageRank, Internet Mathematics1 (3): 335380, 2005.

    [13] S. Olsen, Searching for the personal touch. In: CNET News.com(http://news.com.com/2100-1024-5061873.html), August 11, 2003.

    http://www.webworkshop.net/florida-update.htmlhttp://www.google.com/intl/ro/press/pressrel/kaltix.htmlhttp://news.com.com/2100-1024-5061873.htmlhttp://news.com.com/2100-1024-5061873.htmlhttp://www.google.com/intl/ro/press/pressrel/kaltix.htmlhttp://www.webworkshop.net/florida-update.html