BOSTON UNIVERSITY
GRADUATE SCHOOL OF ARTS AND SCIENCES
Thesis
NETWORK POSITIONING FOR ONLINE NEAREST
NEIGHBORS SEARCH
by
XIN QI
B.E., Beijing University of Posts and Telecommunications, 1997
M.E., Beijing University of Posts and Telecommunications, 2000
Submitted in partial fulfillment of the
requirements for the degree of
Master of Arts
2005
Approved by
First Reader
Richard West, Ph.D.
Assistant Professor of Computer Science.
Second Reader
Shang-Hua Teng, Ph.D.
Professor of Computer Science.
Third Reader
George Kollios, Ph.D.
Assistant Professor of Computer Science.
Acknowledgments
Most special thanks to my advisor Richard West for all the inspiration and freedom
during the past three years of research work. These three years of working together
have made me more appreciative of the beauty of systems research. The culture he
developed in the BOSS group has brought all of my good colleagues together. Gary
Wong and Gabriel Parmer taught me about the happy part of the work.
I would also like to thank Dihan, who initiated the idea of this thesis, and Professor
Shang-Hua Teng, who gave me many insightful comments on the work.
I could never have achieved this without the encouragement of my wife. Her
support gave me strength during all the hard times.
NETWORK POSITIONING FOR ONLINE NEAREST
NEIGHBORS SEARCH
XIN QI
ABSTRACT
Current network embedding schemes have focused on preserving a stable quantity,
thought of as “network distance”, in a corresponding embedding space. This quantity
is usually taken to be the minimum round trip time (minRTT). However, for applications
such as peer-to-peer systems and content-delivery networks, performance
generally depends on the instantaneous round trip time as influenced by current network
conditions (e.g., congestion in routers). What is more, rather than determining actual
distances between all hosts, it is usually only necessary to determine the closest hosts
among an instantaneous sub-group of the hosts sending requests.
In this thesis, we propose a network embedding scheme using Lipschitz embedding
into the L∞ norm by choosing landmark nodes from outside those in a specific group of
hosts. This scheme makes possible an on-line refinement algorithm to guide network
measurement so as to quickly find the closest host with the smallest instantaneous
RTT, while radically decreasing the amount of measurements necessary. Compared
with other embedding schemes, our scheme also achieves better accuracy for short
distances, thereby improving the likelihood of finding the true nearest neighbors.
Complementary to this scheme is an approach to compensate for violations of the
triangle inequality, which is the foundation of the metric embedding. We show that
with the aid of an off-line network embedding mechanism based on the L∞ norm,
using an on-line refinement algorithm to adjust for the instantaneous networking
fluctuation, a host in the network can effectively find its nearest neighbor, or neighbor
set, among a large number of candidates.
Contents
1 Introduction
2 Network Embedding
3 Network Positioning for Nearest Neighbors
3.1 Problem Definition and Approach
3.2 Geometric Embedding
3.2.1 Lipschitz Embedding
3.2.2 Contractive Embedding
3.2.3 An On-line Refinement Algorithm
3.3 Landmark Selection in L∞
3.3.1 The analysis for 2-d Euclidean space
3.3.2 The analysis for 3-d Euclidean space
3.4 Outside-max-distance algorithm
4 Experimental Evaluation
4.1 Data Set
4.2 Compensation for the Violation of Triangle Inequality
4.3 Contractive Embedding, Tradeoff between Structure Error and Embedding Error
4.4 Comparison with GNP
4.4.1 Pairwise Distance Predicting
4.4.2 Nearest Neighbor Searching
4.5 Comparison Using INET Data
4.5.1 Pairwise Distance Predicting
4.5.2 Nearest Neighbor Searching
5 Related Work
6 Conclusions and Future Work
1 Introduction
The growth of the Internet has stimulated the development of applications that are
largely data- rather than CPU-intensive. Examples include streaming media delivery,
interactive distance learning, web-casting, group simulations, live auctions, and stock
brokerage applications. Additionally, peer-to-peer (P2P) systems such as Gnutella [7],
Freenet [5], Kazaa [11] and more recent variations (e.g., Chord [26], CAN [23], Pas-
try [24] and Tapestry [30]) have become popular due to their ability to efficiently
locate and retrieve data of particular interest.
While the Internet continues to grow, there is an enormous untapped potential
to use distributed computing resources for large scale applications. Some projects
such as SETI@home [17] have already realized this potential, by utilizing spare com-
pute cycles on off-the-shelf computers to process application-specific data. Similarly,
grid computing technologies (e.g., Globus [3]) are attempting to coordinate and share
computing resources on the scale of the Internet, to support emerging applications in
areas such as scientific computing. However, there has been only limited research on
the construction of scalable distributed systems to support the delivery and process-
ing of high bandwidth, and potentially real-time, data streams transported over the
Internet [10]. An Internet-wide distributed system for data stream processing would
be desirable for applications such as interactive distance learning, tele-medicine, and
live video broadcasts, bringing together potentially many thousands of people located
in different regions of the world. Further, such a system would pave the way for new
applications, such as a distributed and adaptive traffic management system, and,
generally, any application that requires the dissemination of sensor data.
Essentially, applications are emerging that have requirements that extend beyond
what a classical client/server paradigm can provide. In this model, a client issues
a remote procedure call (RPC) to a server, sending the request and receiving the
reply via IP. Some applications exist that would be better served by a pipelined,
publisher/subscriber model. The delivery of QoS constrained media streams to an
extremely large population of interested end-hosts is a task ideally suited for the
publisher/subscriber model.
One of the projects the BOSS (Boston University Operating System and Services)
group is focusing on [20, 6, 22] is to present such a system based on pipelined pro-
cessing of data streams. Generally, this system can be utilized by any application
domain that requires the scalable, QoS and resource aware delivery and processing of
data-streams.
Essentially, data streams can be transferred over an overlay topology of end-hosts
on the Internet and at each hop processing can be performed on the stream by Stream
Processing Agents (SPAs). These SPAs can take the form of a QoS and resource aware
router, a filter to extract relevant information from a stream, an entity to perform
transformations on the data, an agent to build multi-cast trees, a splitting agent that
could separate the stream over two links to distribute bandwidth usage, or an agent
to merge these streams.
For example, a number of sensors, perhaps cameras, could publish the raw video
they capture onto the distributed system. It would be routed through the overlay via a
sequence of end-host intermediary nodes. At each of these intermediaries, a number of
SPAs can be applied to the data-stream to, for instance, compress and filter the data
so that a mobile device can easily receive and display the sensor output. A certain
population will subscribe to these data-streams. These subscribers are the destination
end-hosts and will display the video. They will also act as intermediaries, possibly
further routing streams to which they are currently subscribed to other interested
end-hosts, or perhaps applying SPAs to other data-streams. The distributed system
is responsible for providing certain QoS levels to each of the subscribers and each
stream will be processed and routed according to these constraints.
One of the fundamental aspects of this work is to leverage off-the-shelf computers
on the Internet to build a scalable distributed system for the routing and processing of
data streams. Of particular interest is the use of overlay networks to deliver real-time
media streams from one or more publishing nodes to potentially many thousands of
subscribers, each with their own quality-of-service (QoS) requirements. While many
existing P2P systems form overlay networks on top of the underlying physical network,
they do not yet support the timely delivery of data streams between publishers and
subscribers. For example, Pastry, Chord, and CAN all build logical topologies for the
purpose of forwarding messages to locate an object in a decentralized system, rather
than transporting data streams. Moreover, these systems make no attempt to route
data in accordance with latency and bandwidth requirements.
Most current overlay technologies depend on the random assignment of logical ids
to physical hosts, to achieve a logarithmic bound on the number of hops to locate
information of interest. One important concern is how to correlate the logical position
of a host in the overlay with its physical or geographical position. To address this
concern, one may ask, “could we provide a GPS-like system, assigning a coordinate
to each node in the Internet so that when a node wants to join an overlay, it could
make use of its geographical information for further optimization?” The answer to
this question relates to the network positioning or network embedding problem. For
this reason, the rest of this thesis focuses on the network positioning problem and
embedding schemes that capture geographic information. Specifically, we present a
Lipschitz embedding scheme in the L∞ norm to capture physical distances between
hosts in an overlay. Our approach achieves better accuracy for determining short
distances, thereby improving the likelihood of finding the true nearest neighbors
amongst a set of hosts that dynamically join and depart the overlay system.
The rest of the thesis is organized as follows. Section 2 describes further
motivation for network embedding. Our network positioning scheme is then analyzed
in Section 3, which also includes a more rigorous definition of the problem being
addressed. The experimental evaluation of our approach is outlined in Section 4. This is
followed by a discussion of related work in Section 5. Finally, conclusions and future
work are mentioned in Section 6.
2 Network Embedding
The increasing growth of the Internet, coupled with the decreasing price-performance
ratio of off-the-shelf systems, has paved the way for applications that utilize end-
system multicast [10], content-delivery networks [6] and peer-to-peer (P2P) systems [24,
26, 23]. The scale of these applications has opened new areas of research concerning
the structure of overlay networks [15, 6], as well as efficient means of trading state
versus messages to locate and retrieve information. Consequently, the problem of
identifying “nearby” hosts via which data can be retrieved or propagated is central to
many emerging Internet-based applications. For systems encompassing thousands or
even millions of hosts on the scale of the Internet, it would be impractical to directly
measure (e.g., using ping-style ICMP (Internet Control Message Protocol) messages) the “distances”
of hosts. This has led to research work that focuses on the estimation of distances
between hosts, without requiring probing messages to be sent to every other host in
the entire system.
Many emerging approaches to deriving distances between hosts rely on geometric
embedding techniques that map hosts to specific coordinates in a logical vector space.
Specifically, these “network positioning” schemes assign coordinates to nodes (e.g.,
hosts on the Internet) based on measurements to a fixed set of designated nodes, called
landmarks. Using such coordinates, the network distance (e.g., in terms of round-trip
propagation and/or transmission delay) between pairs of nodes is predicted without
explicit measurement.
Various embedding techniques to reduce the error between real and derived distances
of all nodes in a data set include: (1) triangle-based heuristic solutions [4, 9],
(2) nonlinear optimization methods [18, 25], and (3) approaches using a combination
of Lipschitz embedding and dimensionality reduction [27, 13]. While all these
techniques attempt to derive actual distances between nodes, many applications only
require knowledge of their nearest neighbors, or nearest neighbor sets. For example,
in unstructured P2P networks such as Gnutella or Napster, it is desirable for a client
to download files from the closest peer or, at least, one that is not far away. Similarly,
in a content distribution network, the aim is to minimize the hop count
and/or latency along the path over which data is exchanged. This has an impact on the
design of overlay networks that implement logical routing topologies, such as k-ary
n-cubes [6] and deBruijn graphs [15], over the underlying physical network. For such
overlay networks, especially those that attempt to route QoS-constrained data such
as voice and video to many thousands of destinations, it is important for data to be
propagated through intermediaries that are geographically close.
The contributions of this thesis are therefore concerned with the design of a network
positioning scheme that accurately determines the nearest neighbor (or set of
neighbors) for a given host, amongst a dynamic subset of hosts taken from a global
set on the scale of the Internet. As with other work on network positioning, we lever-
age embedding techniques that map host positions into coordinates within a normed
vector space. The challenges faced by such a scheme include: (1) determining the
most appropriate normed vector space for deriving host coordinates, (2) deciding on
the number and location of landmarks used for coordinate derivation, and (3) dealing
with underlying network conditions that violate geometric requirements of the chosen
embedding method.
Considering all of the above problems, we propose a contractive embedding scheme,
in which embedded distances are always less than the corresponding real distances.
As will be shown, contractive embedding enables us to use an on-line refinement al-
gorithm, to compensate for “embedding errors” and thereby guarantee we find the
real nearest neighbor. In practice, this on-line refinement method can also be used
to send probing messages with other QoS related parameters such as bandwidth and
CPU utilization for further QoS-related optimization.
With respect to the above on-line refinement algorithm, we find that L∞ is the best
normed vector space for our contractive embedding. For L∞ space, we provide a theoretical
analysis regarding the number and positions of the landmarks needed for an effective
embedding. We observe that with the L∞ normed vector space, the prediction of the
nearest neighbor depends only on a single landmark, the one that yields the smallest error.
This makes our approach more robust against failures and dynamic changes in landmark
sets. Both our analysis and experimental results also show that L∞ provides
smaller prediction errors for shorter real distances. Moreover, even though our scheme
preserves overall pairwise distances less accurately than other methods
(e.g., non-linear optimization algorithms such as GNP [18]), its prediction of nearest
neighbors is as good as theirs, and typically better when used with our on-line
refinement algorithm.
The final contribution of this work deals with underlying network conditions that
violate geometric requirements of embedding methods. Specifically, the “triangle
inequality” is a fundamental requirement of our embedding approach, and the nature
of physical networks such as the Internet occasionally leads to violations of this
constraint. We therefore propose a method to compensate for the violation of the
triangle inequality by adding a fixed offset to all of the edges of a corresponding metric.
In practice, this offset could be a global “remedy value”, which is added to the
probed “delay” between an end host and each landmark. However, if this “remedy
value” is too large, it will adversely affect the embedding errors. We therefore investigate
the size of the remedy value needed to ensure the triangle inequality holds, while
preventing a further increase in embedding errors.
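As a concrete sketch of this idea, the minimal global remedy value can be computed from a measured distance matrix as the worst triangle-inequality violation over all ordered triples. The snippet below is illustrative only (the RTT matrix is hypothetical); the thesis evaluates the actual choice of remedy value experimentally in Section 4.2.

```python
from itertools import permutations

def min_remedy_value(d):
    """Smallest global offset c such that adding c to every edge of the
    distance matrix d restores the triangle inequality.

    Adding c to all edges turns a violation d[x][z] > d[x][y] + d[y][z]
    into d[x][z] + c <= d[x][y] + d[y][z] + 2c, which holds whenever
    c >= d[x][z] - d[x][y] - d[y][z]."""
    n = len(d)
    worst = 0.0
    for x, y, z in permutations(range(n), 3):
        worst = max(worst, d[x][z] - d[x][y] - d[y][z])
    return worst

# Hypothetical RTT matrix (ms) with one triangle-inequality violation:
# d[0][2] = 50 > d[0][1] + d[1][2] = 10 + 12.
rtt = [
    [0, 10, 50],
    [10, 0, 12],
    [50, 12, 0],
]
c = min_remedy_value(rtt)          # 50 - 10 - 12 = 28
fixed = [[v + c if i != j else 0 for j, v in enumerate(row)]
         for i, row in enumerate(rtt)]
```

Note that this exact computation needs all pairwise distances, which is precisely what a large system cannot afford; in practice the remedy value would be estimated from a sample, as investigated in the evaluation.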
3 Network Positioning for Nearest Neighbors
3.1 Problem Definition and Approach
Before describing our network positioning scheme in detail, we begin by defining the
problem explicitly addressed in this thesis, as follows:
Given a set of hosts, S, in which each member, h ∈ S, has information about its
distance to all hosts in a subset L ⊂ S, find the nearest neighbor, hj, to a specific
host hi, where hi, hj ∈ S.
One efficient practical method to solve this nearest neighbor query problem is
to use a coordinate-based network positioning (or embedding) scheme [18]. However,
unlike prior approaches that preserve as much information as possible about
actual distances between all pairs of hosts in a given set, S, we need to preserve
just enough information to determine nearest neighbors. Specifically, our approach
involves both an off-line landmark selection scheme, coupled with an on-line refine-
ment method to successively eliminate candidate hosts until the nearest neighbor is
found.
Having selected a subset of hosts, L, as landmarks, each host, h, in set S is given
a vector-based coordinate that represents the distance of h to each member of L. In
other words, each host, h, is given a coordinate based on the Lipschitz embedding
scheme. In the second stage, the on-line refinement algorithm first ranks each host
in S in increasing embedded distance from hi and then probes the actual distance to
successive hosts, starting with the host predicted to be nearest. Using a contractive
embedding scheme, the refinement algorithm is able to eliminate members of S and
quickly converge on the host, hj, nearest to hi.
Having briefly outlined the problem and our approach, we now describe the actual
details of our embedding scheme, along with the landmark selection method and
refinement algorithm.
3.2 Geometric Embedding
In this thesis, we follow the terminology used in [8]. A tuple (S, d) is said to be
a finite metric space if S is a finite set of cardinality N and d : S × S → R+ is
a distance metric. A great deal of work has been done on embedding finite metric
spaces into low-dimensional real-normed spaces, which serve as the basis of a distance
metric. Usually, the norm is one of the Lp norms, ||x||_p = (Σ_i |x_i|^p)^(1/p). Distance
metrics based on such a norm are often termed Minkowski metrics when p ≥ 1. The
most common Minkowski metrics are the Euclidean distance metric (L2), the City
Block distance metric (L1), and the Chessboard distance metric (L∞).
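For concreteness, the three common Minkowski metrics can be computed as follows (a small illustrative sketch; the vectors are hypothetical):

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length coordinate vectors."""
    if p == float("inf"):                       # Chessboard (L-infinity)
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

u, v = (0.0, 0.0), (3.0, 4.0)
d1   = minkowski(u, v, 1)              # City Block: 7.0
d2   = minkowski(u, v, 2)              # Euclidean: 5.0
dinf = minkowski(u, v, float("inf"))   # Chessboard: 4.0
```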
Formally, we call a finite metric space (S, d) an lp-metric if there is a mapping
F : S → R^k such that ||F(x) − F(y)||_p = d(x, y); we will often denote this by d ∈ lp.
To denote the fact that the lp space has k dimensions, we will also write the space
as l^k_p.
Before going on, we need to clarify several terms. The word “embedding” is a generic
term for any form of mapping from one space into another; it can be a mapping
from a distance metric into a normed space, or from a high-dimensional normed space
to a low-dimensional normed space. A distance-preserving embedding will be called
isometric. However, it is very rare to find cases where an isometric embedding exists
between two spaces of interest, and hence we often have to allow the mappings
to alter distances, thereby leading to some degree of distortion (or embedding error).
3.2.1 Lipschitz Embedding
Lipschitz embedding is a kind of geometric embedding, defined in terms of
a set R of subsets of S, R = {A1, A2, ..., Ak}. The subsets Ai are termed reference
sets. Let d(o, A) be an extension of the distance function d to a subset A ⊆ S such
that d(o, A) = min_{x∈A} {d(o, x)}. An embedding with respect to R is defined as a
mapping F such that F(o) = (d(o, A1), d(o, A2), ..., d(o, Ak)). In other words, we
define a coordinate space where each axis corresponds to a subset
Ai ⊆ S of the objects, and the coordinate values of the object o are the distances from
o to the closest element in each Ai. The intuition behind the Lipschitz embedding
is that, if x is an arbitrary object in the data set S, some information about the
distance between two arbitrary objects o1 and o2 is obtained with the aid of d(o1, x)
and d(o2, x), i.e., the value |d(o1, x) − d(o2, x)|. In particular, due to the triangle
inequality, we have |d(o1, x) − d(o2, x)| ≤ d(o1, o2).
In this thesis, as with other work in network embedding [27], we only consider
Lipschitz embeddings in which each reference set Ai is a singleton; these singletons
constitute the set of landmarks. Hence, the coordinates of a node in our scheme are the
network distances between that node and the landmarks.
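A minimal sketch of this construction, with hypothetical RTT values and landmark names, might look as follows:

```python
def lipschitz_coordinate(host, landmarks, rtt):
    """Coordinate of `host`: its measured distance to each landmark.
    With singleton reference sets, d(o, {l}) is just d(o, l)."""
    return tuple(rtt(host, l) for l in landmarks)

# Hypothetical symmetric RTT table (ms) over hosts a, b and landmarks x, y.
D = {("a", "x"): 20, ("a", "y"): 35,
     ("b", "x"): 24, ("b", "y"): 30}
def rtt(u, v):
    # toy lookup; a real system would send a probe here
    return D.get((u, v)) or D.get((v, u))

coord_a = lipschitz_coordinate("a", ["x", "y"], rtt)   # (20, 35)
coord_b = lipschitz_coordinate("b", ["x", "y"], rtt)   # (24, 30)
# Embedded L-infinity distance between a and b:
delta = max(abs(p - q) for p, q in zip(coord_a, coord_b))   # max(4, 5) = 5
```

By the triangle inequality, this embedded value 5 is a lower bound on the true distance d(a, b), which is the contractive property discussed next.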
3.2.2 Contractive Embedding
An embedding induced by a mapping F is said to be contractive with respect to S
if δ(F (o1), F (o2)) ≤ d(o1, o2) for all o1, o2 ∈ S.
L∞ is, by its nature, a contractive embedding because of the triangle inequality.
For the other norms Lp, p ∈ [1, ∞), a candidate contractive embedding also exists. It should
be noted that, in theory, one can make a contractive embedding for any distance
definition as long as the distortion of the embedding is known. However, the following
contractive embedding represents the general case [14], without a-priori knowledge of
the embedding distortion:

δ(Fk(o1), Fk(o2)) = (Σ_i |d(o1, Ai) − d(o2, Ai)|^p)^(1/p) / k^(1/p)
Ai in the above definition is the reference set defined in the Lipschitz embedding. The
proof of the contractive property depends on the triangle inequality. For each Ai ∈ R,
we have |d(o1, Ai) − d(o2, Ai)| ≤ d(o1, o2); then, when δ is an arbitrary Lp distance
metric,

δ(Fk(o1), Fk(o2)) = (Σ_i |d(o1, Ai) − d(o2, Ai)|^p)^(1/p) / k^(1/p)
≤ (k · d(o1, o2)^p / k)^(1/p) = d(o1, o2)
Moreover, the δ function described above strictly increases with p. Thus, contractively
embedding a data set into L∞ space should cause the least distortion. For
the nearest neighbor problem, given a fixed reference set, L∞ space should therefore
lead to the best possible method for eliminating neighbors that are clearly not
the nearest, as discussed next.
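The two properties used above, that the normalized δ is contractive and strictly increases with p (it is a power mean, approaching the L∞ value max_i |d(o1, Ai) − d(o2, Ai)| as p → ∞), can be checked numerically; the coordinates and true distance below are hypothetical:

```python
def delta_p(c1, c2, p):
    """Contractive embedded distance: the L_p difference of two Lipschitz
    coordinate vectors, normalized by k**(1/p) (a power mean)."""
    k = len(c1)
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(c1, c2))
    return (sum(abs(a - b) ** p for a, b in zip(c1, c2)) / k) ** (1 / p)

# Hypothetical coordinates (distances to k = 3 landmarks) of two hosts
# whose true distance is 12; by the triangle inequality each coordinate
# difference |4|, |5|, |8| is at most 12.
c1, c2, true_d = (20.0, 35.0, 18.0), (24.0, 30.0, 10.0), 12.0

vals = [delta_p(c1, c2, p) for p in (1, 2, 4, float("inf"))]
assert all(v <= true_d for v in vals)   # contractive for every p
assert vals == sorted(vals)             # delta increases with p, L-infinity largest
```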
3.2.3 An On-line Refinement Algorithm
For a contractive embedding scheme, we can use the following on-line refinement
procedure to adjust for embedding errors and thus converge upon the “real” nearest neighbor.
This algorithm assumes the following prerequisites: (1) after the completion of some
off-line process, every end-host has a coordinate determined by its distance to each
of a known set of landmarks, and (2) during the on-line refinement stage, all hosts
in a set C piggyback their coordinates together with their requests for some content
(depending on the application), and some server h (which is currently serving the
content) wants to find the “real” nearest neighbor among a candidate list to serve.
1. h sorts the end-hosts in increasing order of their distances from itself in the
embedding space. Suppose that point F (a) corresponding to host a is the
closest point to F (h) at the distance of δ(F (a), F (h)).
2. h physically sends a probing message to host a to get the real distance d(a, h)
between a and h. At this point, we know that any host x with δ(F (x), F (h)) >
d(a, h) cannot be the nearest neighbor of h since the contractive property then
guarantees that d(x, h) > d(a, h). Therefore, d(a, h) now serves as an upper
bound on the nearest-neighbor search in the embedding space, and we just
remove all such x from the candidate list.
3. h then finds the next closest point F (b) corresponding to node b, and physically
sends a probing packet to host b to get d(b, h), which is compared against the
distance constraint d(a, h). If d(b, h) < d(a, h), then b and d(b, h) replace object a and
d(a, h) as the current closest object and upper bound distance, respectively;
otherwise, a and d(a, h) are retained.
4. The previous step is repeated until we either converge on the nearest neighbor,
or until N probe messages have been sent (at which point we assume that
we have the nearest neighbor), thereby avoiding a large number
of probe messages being sent to neighboring hosts.
It is the nature of contractive embedding that enables us to quickly eliminate
hosts from the search space of potentially nearest neighbors. This enables
on-line probing to avoid unnecessary message transmissions to hosts that are relatively
far away.
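The four steps above can be sketched as follows. This is an illustrative implementation, not the thesis code: `probe` stands in for a real RTT measurement, and `max_probes` corresponds to the stopping threshold N in step 4.

```python
def refine_nearest(h_coord, candidates, probe, max_probes=None):
    """On-line refinement over a contractive embedding.

    candidates: {host_id: coordinate_vector}; probe(host_id) returns the
    measured (real) distance from h. Because the embedding is contractive,
    any candidate whose embedded distance exceeds the current best probed
    distance cannot be the nearest neighbor and is pruned unprobed."""
    linf = lambda c: max(abs(a - b) for a, b in zip(c, h_coord))
    # Step 1: rank candidates by embedded L-infinity distance from h.
    ranked = sorted(candidates, key=lambda x: linf(candidates[x]))
    best, best_d, probes = None, float("inf"), 0
    for x in ranked:
        if linf(candidates[x]) > best_d:   # Step 2: prune by the upper bound;
            break                          # all remaining hosts are even farther.
        if max_probes is not None and probes >= max_probes:
            break                          # Step 4: probe budget exhausted.
        d = probe(x)                       # Steps 2-3: measure, keep the best.
        probes += 1
        if d < best_d:
            best, best_d = x, d
    return best, best_d, probes

# Toy 1-d example: landmark at position 0, h at 10; candidates at 12, 30, 50.
h_coord = (10,)
cands = {"a": (12,), "b": (30,), "c": (50,)}
real = {"a": 2, "b": 20, "c": 40}          # true distances from h
best, best_d, probes = refine_nearest(h_coord, cands, real.__getitem__)
```

In this toy run, one probe of "a" (real distance 2) immediately prunes "b" and "c", whose embedded distances already exceed 2.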
3.3 Landmark Selection in L∞
Figure 1: Different Remote Landmarks (hosts h1, h2 and candidate landmarks ℓ1, ℓ2, ℓ3 and ℓ; θ is the acute angle between ℓh1 and the line h1h2)
In this section, we tackle the second challenge raised in Section 2: deciding on
the number and location of landmarks used for coordinate derivation. [16] shows that
the intrinsic dimension of a real network data set is normally less than 7. We therefore study
the original data set intrinsically in high-dimensional Euclidean space but embedded
into a Chessboard space (l∞), since with respect to our on-line refinement algorithm
for nearest neighbor searching, we find that contractively embedding the data set into
a Chessboard space (l∞) is the most efficient. Due to space limitations, we only give
a detailed analysis for 2-d Euclidean space and concisely extend it to 3-d Euclidean
space; higher dimensions are left as future work.
One nice property of L∞ is that, by definition, the accuracy of the embedding
is determined by only one landmark: the one that gives the least triangle prediction error.
As illustrated in fig. 1, (h1, h2) is the data set to be embedded, and ℓ1, ℓ2, ℓ3 are possible
landmarks (reference sets). If the triangle inequality holds, ℓ1 and ℓ3 give us an isometric
embedding, while ℓ2 gives us the largest-distortion embedding. So, intuitively, we
should choose landmarks like ℓ, such that the intersecting acute angle θ between ℓh1 and
h1h2 is small; what is more, landmarks should preferably be far away from all of the other
hosts.
Before our main result, we need the following corollary.
Corollary 1. As illustrated in figure 1, suppose h1h2 is embedded into one-dimensional
l∞ with reference set {ℓ}. Then the distortion
ε = |δ(d(ℓ, h1), d(ℓ, h2)) − d(h1, h2)| / d(h1, h2) can be no worse than
1 − cos θ.
Proof. Denote the line segments ℓh1, ℓh2 and h1h2 by a, b and c respectively, so
δ(d(ℓ, h1), d(ℓ, h2)) = |a − b|. Then

ε = |δ(d(ℓ, h1), d(ℓ, h2)) − d(h1, h2)| / d(h1, h2) = 1 − |a − b| / |c|

From the law of cosines,

|b|² = |a|² + |c|² − 2|a||c| cos(180° − θ) = |a|² + |c|² + 2|a||c| cos θ

Then

ε = 1 − |a − √(|a|² + |c|² + 2|a||c| cos θ)| / c = 1 − |a/c − √(1 + (a/c)² + 2(a/c) cos θ)|

Let n = a/c; then

ε(n) = 1 − |n − √(1 + n² + 2n cos θ)|

ε(n) is a strictly increasing function, with lim_{n→0} ε(n) = 0 and

lim_{n→∞} ε(n) = 1 − lim_{n→∞} |n − √(1 + n² + 2n cos θ)|
= 1 − lim_{n→∞} |(−1 − 2n cos θ) / (n + √(1 + n² + 2n cos θ))|
= 1 − cos θ
Note that the above corollary is applicable in all dimensions since three nodes
define a plane.
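The proof can be spot-checked numerically from ε(n) = 1 − |n − √(1 + n² + 2n cos θ)|; at θ = 25° the bound 1 − cos θ ≈ 0.0937, which is the 10% figure used in Corollary 4 below. This is an illustrative check, with arbitrary sample values of n:

```python
import math

def distortion(n, theta):
    """epsilon(n) from the proof of Corollary 1, where n = |lh1| / |h1h2|."""
    return 1 - abs(n - math.sqrt(1 + n * n + 2 * n * math.cos(theta)))

theta = math.radians(25)
bound = 1 - math.cos(theta)                     # about 0.0937, i.e. under 10%
samples = [distortion(n, theta) for n in (0.1, 1.0, 10.0, 1000.0)]
assert all(0 <= e <= bound + 1e-9 for e in samples)   # never worse than 1 - cos(theta)
assert samples == sorted(samples)                     # epsilon increases with n
```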
3.3.1 The analysis for 2-d Euclidean space
Before considering high dimension, we consider the 2-d plane. This contains many of
the ideas inherent in the more general case.
We draw the smallest circle of radius r containing all the hosts, and assume that
there are “many” hosts scattered on or close to the circle. We use the standard notation
C(p, r) to denote a circle centered at p with radius r.
Definition 2. The arc length of a line segment with angle θ in 2-d Euclidean
space is g1 + g2, as illustrated in figure 2. The arc length of a point with angle θ in
2-d Euclidean space is defined similarly.
Now comes our main theorem in 2-d Euclidean space.
Figure 2: Arc Length (a chord through the circle with segment ab on it, the arcs g1 and g2 at each end, and angles θ, α, β, γ)
(a) General Case
Theorem 3. Given C(p, r), place two points a, b at random in C. To achieve 90% precision
of |ab|, the expected number of landmarks is 7 if landmarks are evenly distributed
on or close to C, or 43 if landmarks are randomly distributed on or close to C.
To prove our theorem 3, we need corollary 4 and theorem 5.
Corollary 4. Let θ as denoted in figure 1 equal 25°. Then the distortion
ε = |δ(d(ℓ, h1), d(ℓ, h2)) − d(h1, h2)| / d(h1, h2) can be no worse than 10%.
Proof. Simply follow the proof for corollary 1.
Theorem 5. Given C(p, r), if two points a, b are placed at random in C, the expectation of
the arc length of ab with angle θ = 25° is 0.92r.
Proof. As illustrated in figure 2, the arc length of the line segment ab with angle θ
changes with l, h and p. Let g(l, h, p) denote the arc length of a line segment with
angle θ; the meaning of the variables l, h and p is illustrated in
figure 2. It is clear that L ∼ unif(0, r). If L is fixed, H is uniformly distributed
from 0 to 2√(r² − l²), so H ∼ unif(0, 2√(r² − l²)). In the same way, P is uniformly
distributed from 0 to 2√(r² − l²) − h, that is, P ∼ unif(0, 2√(r² − l²) − h). So,
E[g(H,L, P )]
=∫ r
0
∫ 2√
r2−l2
0
∫ 2√
r2−l2−h
0g(h, l, p)f(h, l, p) dp dh dl
+∫ r
0
∫ 2√
r2−l2
0
∫ 2√
r2−l2−h
0
g(h, l, 2√
r2 − l2 − h − p)f(h, l, p)dp dh dl (1)
where f(h, l, p) is the density function,
f(h, l, p)
= P (H = h, L = l, P = p)
= P (P = p|H = h, L = l)P (H = h|L = l)P (L = l)
= 12√
r2−l2−h× 1
2√
r2−l2× 1
r(2)
where 0 ≤ l ≤ r, 0 ≤ h ≤ 2√
r2 − l2, 0 ≤ p ≤ 2√
r2 − l2 − h, the value of f(h, l, p)
doesn’t depend on p, so
E[g(H,L, P )]
=∫ r
0
∫ 2√
r2−l2
0f(h, l, p)
∫ 2√
r2−l2−h
0g(h, l, p)dpdhdl
+∫ r
0
∫ 2√
r2−l2
0f(h, l, p)
∫ 2√
r2−l2−h
0g(h, l, 2
√r2 − l2 − h − p)dpdhdl (3)
let e = 2√
r2 − l2 − h − p, then
∫ 2√
r2−l2−h
0g(h, l, 2
√r2 − l2 − h − p)dp
=∫ 0
2√
r2−l2−hg(h, l, e)d(−e)
=∫ 2
√r2−l2−h
0g(h, l, e)de (4)
19
So, we can simplify it as,
E[g(H,L, P )]
= 2∫ r
0
∫ 2√
r2−l2
0f(h, l, p)
∫ 2√
r2−l2−h
0g(h, l, p)dpdhdl
= 2∫ r
0
∫ 2√
r2−l2
01
2√
r2−l2×
1r(∫ 2
√r2−l2−h
01
2√
r2−l2−hg(h, l, p)dp)dhdl (5)
The inner integral is the expectation of the arc length when l and h are
fixed.
tan γ = (√(r²−l²) − p) / l    (6)

√((√(r²−l²) − p)² + l²) / sin(90° + θ + β + γ) = r / sin(90° + θ + γ)    (7)

√((√(r²−l²) − p)² + l²) / sin(90° − θ + α + β + γ) = r / sin(90° − θ + γ)    (8)

(7) ⇒

β + γ + θ = arccos( (√((√(r²−l²) − p)² + l²) / r) cos(θ + γ) )    (9)

(8) ⇒

α + β + γ − θ = arccos( (√((√(r²−l²) − p)² + l²) / r) cos(γ − θ) )    (10)

Let

M = √((√(r²−l²) − p)² + l²) / r    (11)

(9) − (10) ⇒

2θ − α = arccos(M cos(θ + γ)) − arccos(M cos(θ − γ))    (12)

⇒

α = 2θ − arccos(M cos(θ + γ)) + arccos(M cos(θ − γ))

So,

g(h, l, p) = α × r = (2θ − arccos(M cos(θ + γ)) + arccos(M cos(θ − γ))) × r

We set θ = 25° and use Mathematica to calculate the above integral, obtaining

E[g(H, L, P)] = 0.92 × r    (13)
Proof. According to Corollary 4 (via Corollary 1), to achieve 90% precision of |ab|, θ should be less
than or equal to 25°. According to Theorem 5, the expectation of the arc length of ab with
angle θ = 25° is 0.92 × r, so 2πr / (0.92 × r) ≈ 7 landmarks evenly distributed on or close to
the circle, or 43 landmarks randomly distributed on or close to the circle, will offer us
the expectation.
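The closed form for g(h, l, p) can also be spot-checked by Monte Carlo under the proof's sampling model (L, H, P uniform as above), summing the arcs at both endpoints of the segment as in equation (1). This is an illustrative reconstruction of equations (6)-(12), not the original Mathematica computation; if the reconstruction is faithful, the estimate should land near the 0.92 × r of equation (13):

```python
import math
import random

def arc_one_end(l, p, r, theta):
    """g(h, l, p) = alpha * r from the proof of Theorem 5; the arc at one
    endpoint of the segment depends only on that endpoint's position."""
    q = math.sqrt(r * r - l * l) - p          # endpoint's along-chord offset
    gamma = math.atan2(q, l)                  # tan(gamma) = q / l, eq. (6)
    M = math.sqrt(q * q + l * l) / r          # eq. (11), always <= 1
    clamp = lambda v: max(-1.0, min(1.0, v))  # guard rounding at the boundary
    alpha = (2 * theta
             - math.acos(clamp(M * math.cos(theta + gamma)))
             + math.acos(clamp(M * math.cos(theta - gamma))))
    return alpha * r

def expected_arc(r=1.0, theta=math.radians(25), trials=100_000, seed=1):
    """Monte Carlo version of equation (1): sample (L, H, P) as in the
    proof and sum the arcs at both endpoints of the segment."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        l = rng.uniform(0, r)
        chord = 2 * math.sqrt(r * r - l * l)
        h = rng.uniform(0, chord)
        p = rng.uniform(0, chord - h)
        total += (arc_one_end(l, p, r, theta)
                  + arc_one_end(l, chord - h - p, r, theta))
    return total / trials
```

As a sanity check of the formula itself: for an endpoint at the circle's center (M = 0), the arc reduces to 2θ · r, as the geometry requires.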
(b) Best Case
In section 2.1, we mentioned that it is unnecessary to preserve the distance of long edges if the goal is to find the nearest neighbor, since the distance between a node and its nearest neighbor is normally small. This suggests that in practice we do not need as many landmarks as theorem 3 deduces, since theorem 3 gives the expected number of landmarks when a line segment of any permitted length may be placed in C. The embedding algorithm we use, along with the landmark selection strategy, can be seen as a hierarchical scheme based on this observation: the shorter an edge is in the real network, the lower the distortion after embedding. First, let us see the best we can do.

Figure 3: Best Case
Suppose the distance between any node and its nearest neighbor is small enough compared to the diameter of C that the pair can be treated as a single point. Correspondingly, the variable h in figure 2 is small.

Lemma 6. Given C(p, r), put a point p randomly in C and draw two line segments l1, l2 intersecting at p with acute angle α; then the arc length surrounded by l1, l2 is 2αr.
Proof. As illustrated in figure 3, let p1 be the point randomly put in C, and let a1a4, a2a3 be the two line segments intersecting at p1 with acute angle α = 2θ; a1a4 and a2a3 intersect C at a1, a4 and a2, a3 respectively. l is the angle bisector of α, and p2 is any point on l. Draw b1b4 parallel to a1a4 and b2b3 parallel to a2a3, intersecting at p2. It is clear that α3 = β3, so that α1 + α2 = β1 + β2 ⇒ arc a1a2 + arc a3a4 = arc b1b2 + arc b3b4. If p2 moves rightwards until it touches C, it is obvious that then arc b1b2 = 4θr and arc b3b4 = 0. So, arc b1b2 + arc b3b4 = 4θr = 2αr. That finishes the proof.
So, in the best case, we need ⌈2πr/(4θr)⌉ = 4 landmarks when θ = 25°.
(c) Practical Case (Fixed Length Case)

However, it is rare to find applications with the idealistic property mentioned in section 2.3(b); in other words, the edge distance between a node and its nearest neighbor is not small enough to be treated as a point. In such cases, we can take a sample from the real network space and define a learning value ξ as the maximum of the distances between each node and its nearest neighbor. The following lemma gives some sense of the relation between the number of landmarks and distance distortion.

Lemma 7. Given C(p, r), randomly put two points a, b in C with fixed length ab = r/δ. Let δ = 5, which means the learning value ξ is one fifth of r. To achieve 90% precision of ab, the expected number of landmarks is 5 if landmarks are evenly distributed on or close to the circle, or 28 if landmarks are randomly distributed on or close to the circle.
Proof of lemma 7. The proof is very similar to that for theorem 3; however, here we fix h = r/5, so

E[g(L,P)] = 2 ∫₀^{r√(1−1/(4δ²))} f(l,p) ∫₀^{2√(r²−l²)−h} g(l,p) dp dl = 2 ∫₀^{r√(1−1/(4δ²))} 1/(r√(1−1/(4δ²))) × ( ∫₀^{2√(r²−l²)−r/δ} 1/(2√(r²−l²)−r/δ) · g(l,p) dp ) dl   (14)
Figure 4: 3-D cap area
where δ = 5. According to corollary 1, θ = 25° makes the distortion less than 10%, and we use Mathematica to calculate the above integral, obtaining

E[g(L,P)] = 1.51 × r   (15)

This means ⌈2πr/(1.51r)⌉ = 5 landmarks evenly distributed on or close to the circle, or 28 landmarks randomly distributed on or close to the circle, will offer us the expectation.
3.3.2 The analysis for 3-d Euclidean space

We need to extend the above analysis for the 2-d plane to higher-dimensional space. In this subsection we give a concise analysis of the general case and the best case for 3-d Euclidean space; higher dimensions are left as future work.

We draw a smallest sphere of radius r containing all the hosts, and assume that there are "many" hosts scattered on or close to the sphere. We use the standard notation B(p, r) to denote a sphere centered at p of radius r.
Definition 8. Degenerate Spherical Cone (DSC): the surface of revolution obtained by cutting a conical "wedge", with vertex at any point of a sphere, out of the sphere. It is therefore a degenerate cone plus a spherical cap.

Definition 9. Degenerate Cone Angle (DCA): illustrated in figure 4.

Definition 10. The cap area of a line segment with DCA θ in 3-d Euclidean space is g1 + g2 (the sum of the two spherical cap areas), illustrated in figure 4. The cap area of a point with DCA θ in 3-d Euclidean space is defined similarly.

Now we come to our main theorem for 3-d Euclidean space.
(a) General Case
Theorem 11. Given B(p, r), randomly put two points a, b in B. To get 90% precision of ab, the expected number of landmarks we need is 24 if landmarks are evenly distributed on or close to B, or 171 if landmarks are randomly distributed on or close to B.
To prove our main theorem for 3-d Euclidean space, we need theorem 12.
Theorem 12. Given B(p, r), if we randomly put two points a, b in B, the expectation of the cap area of ab with DCA θ = 25° is 0.54r².

Proof of theorem 12. The proof is the same as that for theorem 5; the only difference is that now

g(h,l,p) = 2πr²(1 − cos(α/2))   (16)

which is the area of the spherical cap, where α is defined in equation (12).
Proof of theorem 11. The proof is almost the same as that for theorem 3. The only difference is that in 3-d, the expectation of the cap area of the line segment with DCA θ = 25° is 0.54r², so the expected number of landmarks is ⌈4πr²/(0.54r²)⌉ = 24 if landmarks are evenly distributed on or close to B, or 171 if landmarks are randomly distributed on or close to B.
(b) Best Case

Lemma 13. Given B(p, r), put a point p randomly in B and draw the degenerate spherical cones d1, d2 at p with DCA θ in both directions; then the spherical cap area covered by d1 and d2 is 4πr²(1 − cos θ).

Proof. The proof is the same as that for lemma 6.

So, in the best case, we need ⌈4πr²/(4πr²(1 − cos θ))⌉ = 11 landmarks when θ = 25°.
3.4 Outside-max-distance algorithm
From the above theoretical analysis, we use the following heuristic for selecting landmarks for L∞:
1. Determine c candidate nodes on the outside of a ring:

(a) Randomly choose a node in the data set and find the node farthest from it; call it n1.

(b) From the data set, find the node farthest from n1; call it n2.

(c) Put n1 and n2 into a candidate list. If the number of members in the candidate list is greater than or equal to c, return the candidate list; otherwise take n1 and n2 out of the data set and go to (a).

2. Use the max-distance algorithm to find nodes evenly distributed in the outside ring. We use the max-distance heuristic from [28], which iteratively selects the set of landmarks L as follows: the first landmark L1 is chosen from the set S at random. For m (1 < m ≤ k, where k is the number of landmarks), the distance from a host h_i to the set of already chosen landmarks {L1, ..., L_{m−1}} is defined as min_j δ(F(h_i), F(L_j)). The algorithm then selects as landmark L_m the host with the maximum distance to {L1, ..., L_{m−1}}.

Intuitively, the algorithm tries to pick nodes evenly from the boundary of the data set.
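The two steps above can be sketched as follows (a minimal Python sketch under our reading of the heuristic; `dist` is a pairwise distance matrix, `c` the candidate-list size, and `k` the number of landmarks):

```python
import random

def outside_max_distance(dist, c, k):
    """Pick k landmarks: gather c candidates on the outside ring via
    repeated farthest-pair extraction, then apply the max-distance rule."""
    nodes = set(range(len(dist)))
    candidates = []
    # Step 1: candidate ring via farthest pairs.
    while len(candidates) < c and len(nodes) >= 2:
        seed = random.choice(sorted(nodes))
        n1 = max(nodes, key=lambda j: dist[seed][j])  # farthest from seed
        n2 = max(nodes, key=lambda j: dist[n1][j])    # farthest from n1
        candidates += [n1, n2]
        nodes -= {n1, n2}
    candidates = candidates[:c]
    # Step 2: greedy max-distance selection among the candidates.
    landmarks = [random.choice(candidates)]
    while len(landmarks) < k:
        rest = [x for x in candidates if x not in landmarks]
        # Distance of a candidate to the chosen set = min over landmarks.
        landmarks.append(max(rest,
                             key=lambda x: min(dist[x][L] for L in landmarks)))
    return landmarks
```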
4 Experimental Evaluation

In this section, we show experimental results evaluating our scheme for finding the nearest neighbor, in comparison with GNP [18]. We also show the tradeoff between the structure error caused by triangle inequality violations and the embedding error, with respect to contractive embeddings into different norms.
4.1 Data Set
There are several data sets available recording round-trip times between Internet nodes. [27] provides a good investigation of seven data sets, and shows that these data sets have common intrinsic dimensions (around 5-7). Six data sets are provided in the form of a pairwise distance matrix of N × M, where N >> M and M < 30. Only the AMP [1] data gives an N × N matrix, where N > 100. The normal way other embedding schemes verify their results is to choose a subset of M columns as landmarks, derive coordinates for each node, and use the remaining columns for verification. In our experiments, we use the AMP data [1] to verify the capability of our embedding scheme to find the nearest neighbor among a set of neighbors. For large data sets, we use the Inet 3.0 generator [29], which provides synthetic power-law networks. We keep all default configurations and generate 3050 nodes. These nodes are placed in a square region, the delay of a link is the Euclidean distance between its end points, and end-to-end delay is the shortest-path delay.
Figure 5: Error in finding nearest neighbor under different norms (percentage of nodes that fail to find the nearest neighbor under L1, L2 and L∞, before and after remedy)
4.2 Compensation for the Violation of Triangle Inequality
It has been found that Internet traffic does not always follow the shortest possible path
and that there is potential for violation of the triangle inequality due to the routing
policy [4]. In [27], the authors looked into the problem of whether the real networking
data obeys the triangle inequality, which is one requirement for the correctness of a
metric embedding. Particularly, for the AMP data, there is evidence that only 1.4% of
all combinations of d(i, k), d(k, j) and d(i, j) violate the triangle inequality, in which
d(i, k) + d(k, j) < d(i, j).
[13] tries to analyze the effectiveness of different norms with respect to their quality of representing "topological" information. Specifically, the authors compared the effectiveness of embedding network nodes into different norms and concluded that the accuracy of representing topological information in a data space depends heavily on the distance metric. However, they did not take violation of the triangle inequality into account.
Using the AMP data from [1], we performed the same experiment as in [13]. We used Lipschitz embedding to assign coordinates to nodes, embedding them into L1, L2 and L∞ normed spaces, and tried to calculate the nearest neighbor in the embedded space for each node. As shown in Fig. 5, the x-axis represents the different normed spaces, and the y-axis shows the percentage of nodes that fail to find the actual nearest neighbor in the embedded space. As the "before remedy" bars show, even under the L∞ norm, 52 of the 101 nodes still cannot correctly find their nearest neighbor, which theoretically should be impossible. After further inspecting the data, we found that the main reason is violation of the triangle inequality.
Recall that the main problem we are solving is to find the nearest neighbor, which
means, we only care about the relative ordering of distances to all the neighbors, but
not the absolute distances. For a particular host h, if di is the distance to its ith
nearest neighbor, for all the distances before and after embedding, we only need to
keep the same sequence as:
d1 ≤ d2 . . . ≤ dn−1
To make the triangle inequality hold for the data, we add a fixed offset to every
distance, called the remedy value. In this case, we add the largest pairwise distance
of the whole metric as the remedy value. This way, we make the AMP data obey the
triangle inequality. Then as shown in the “after remedy” curve, the error using L∞
goes down to 0 after adjustment, which complies with the theoretical analysis.
This result tells us that even though a data set may violate the triangle inequality in only a small portion of cases (here 1.4%), large errors can still occur when finding nearest neighbors. We call the error caused by violation of the triangle inequality the structure error. It also tells us that if violations of the triangle inequality are handled correctly, the L∞ norm offers the possibility of isometrically embedding the data. In practice, we cannot give each node a coordinate of dimension n, the number of nodes in the data set. This is why we need landmarks for dimension reduction.
Figure 6: Distribution of triangle inequality violation lengths (x-axis: violation length; y-axis: fraction of nodes)
However, if the remedy value is too large, it will change the intrinsic structure of the metric. Let us define another parameter, the triangle inequality violation length d = c − (a + b), for any three edges a, b and c that violate the triangle inequality. Increasing each of the two shorter edges a and b by d/2 makes the triple obey the triangle inequality. Note that a global offset of m added to every distance improves each violated inequality's margin by m (the right-hand side gains 2m while the left-hand side gains m). Fig. 6 shows the distribution of the triangle inequality violation length for the AMP data: for 90% of all violations, d is less than 10. We therefore use 5, half the 90th-percentile violation length, as a modest global offset that repairs the bulk of the violations.
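The violation length and the effect of a global offset can be checked directly on a distance matrix (a sketch; note that an offset m added to every distance repairs exactly the violations with d ≤ m, since each violated inequality's right-hand side gains 2m while its left-hand side gains m):

```python
from itertools import combinations

def violations(dist):
    """Triangle-inequality violation lengths d = d(i,j) - (d(i,k) + d(k,j))
    over all pairs (i, j) and intermediate nodes k."""
    n = len(dist)
    out = []
    for i, j in combinations(range(n), 2):
        for k in range(n):
            if k not in (i, j):
                d = dist[i][j] - (dist[i][k] + dist[k][j])
                if d > 0:
                    out.append(d)
    return out

def apply_remedy(dist, m):
    """Add a global offset m to every off-diagonal distance."""
    n = len(dist)
    return [[dist[i][j] + (m if i != j else 0) for j in range(n)]
            for i in range(n)]

D = [[0, 1, 6], [1, 0, 1], [6, 1, 0]]   # d(0,2)=6 > 1+1: violation d = 4
print(violations(D))                     # [4]
print(violations(apply_remedy(D, 5)))    # []  (offset 5 >= d)
```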
Figure 7: Proximity with maximum number of probing refinements (x-axis: norm L1, L2, L4, L∞; bars: max probing 5, 10, 20)
4.3 Contractive Embedding: Tradeoff between Structure Error and Embedding Error

As analyzed in section 3.2.2, for a contractive embedding into L_p^k,

δ(F_k(o1), F_k(o2)) = ( Σ_i |d(o1, A_i) − d(o2, A_i)|^p )^{1/p} / k^{1/p}

There exists a linear relationship among all the norms p, such that L∞ is the best one. In this part, we examine the relationship between different contractive embedding norms; specifically, we use L1, L2, L4, and L∞ as examples. We also give the tradeoff between the structure error, which is caused by violation of the triangle inequality, and the embedding error, which is caused by the embedding scheme.
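The normalized form above is a power mean of the per-landmark differences, so for fixed coordinates it is non-decreasing in p, with L∞ the largest. A small sketch of the family (the function name is ours):

```python
def embedded_dist(c1, c2, p):
    """Normalized L_p distance between coordinate vectors:
    ((sum_i |c1_i - c2_i|^p) / k)^(1/p); p = inf gives the plain max."""
    diffs = [abs(a - b) for a, b in zip(c1, c2)]
    if p == float('inf'):
        return max(diffs)
    return (sum(d ** p for d in diffs) / len(diffs)) ** (1.0 / p)

# Power-mean inequality: the normalized norms are ordered in p.
ds = [embedded_dist([0, 0, 0], [1, 2, 3], p) for p in (1, 2, 4, float('inf'))]
assert all(ds[i] <= ds[i + 1] + 1e-12 for i in range(3))  # L1 <= L2 <= L4 <= Linf
```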
First, we examine the capability of the different norms to preserve proximity. In fig. 7, the x-axis represents the different norms, and the y-axis shows their capability to preserve proximity, in other words, nearest neighbor information. We allow different maximum numbers of probings, shown as different color bars. We can see a clear linear relationship among the norms for contractive embedding in preserving proximity. All these norms allow
Figure 8: Proximity with remedy value (x-axis: norm; bars: remedy value 0, 5, 10, 15)
Figure 9: Probing number with remedy value (x-axis: norm; bars: remedy value 0, 5, 10, 15)
on-line probing to compensate for the embedding error. But since L∞ has the minimum embedding error, under a given probing limit L∞ always achieves the best proximity preservation.
Fig. 8 and fig. 9 show the tradeoff between the "remedy value" and the number of on-line probes. In the experiments for these two figures, we give all norms the maximum probing limit; that is, they are allowed to probe as many times as needed to correct the embedding errors. In Fig. 8, the x-axis represents the different norms, and the y-axis shows the final proximity after probing; theoretically, the best value is 1. The bar categories show different "remedy values". Fig. 8 shows that the L∞ norm is more sensitive to violation of the triangle inequality than the other norms, and can almost always reach the real nearest neighbor through probing. For L∞ itself, a larger "remedy value" gives a better capability of preserving proximity information. However, Fig. 9 shows that the other norms must undertake more probing steps before stopping, and for L∞ a larger "remedy value" likewise invokes more probing steps.
We define the error caused by violation of the triangle inequality as the structure error, and the error caused by the embedding scheme itself as the embedding error. These two figures illustrate the tradeoff between the two: both contribute to the final error in finding the nearest neighbor and to the number of on-line probes. However, a larger "remedy value", which compensates for the structure error, actually causes more embedding error; in turn, embedding error causes more probing steps.

Considering this tradeoff, with a "remedy value" of 5 (which is d/2 for 90% of the violations) and an embedding in the L∞ norm, an end host only needs 6.6 probes on average to find, on average, the 1.5th real nearest neighbor.
4.4 Comparison with GNP
4.4.1 Pairwise Distance Predicting
Figure 10: 10 Landmarks Predicting Error (x-axis: relative predicting error; y-axis: fraction of nodes; curves: outside-max-distance 10L, random 10L, gnp 10L 8D)
First, fig. 10 shows a comparison of the predicting error of GNP and our algorithm. The error is defined as:

error = |predicted distance − measured distance| / predicted distance

It should be noted that this definition differs from that in [18], where the denominator is the minimum of the predicted and real distances. Since the L∞ embedding is contractive, the definition in [18] would penalize our results.
We use two algorithms to choose landmarks: one is the outside-max-distance algorithm mentioned before, the other is to choose landmarks randomly. For GNP, we use the first 10 nodes as landmarks and a dimension of 8. Since the AMP data nodes are arranged alphabetically, such an arrangement can be regarded as one configuration of randomly chosen landmarks. We can see that in the ability to correctly predict the distance between any two nodes, our angle-based L∞ embedding is not as good as GNP. This is not surprising, and it complies with the results from [18].

We can also see that in our scheme, randomly choosing landmarks is almost as good as the outside-max-distance algorithm at preserving distance.
4.4.2 Nearest Neighbor Searching
Figure 11: Proximity (x-axis: relative proximity error for nearest neighbor; y-axis: fraction of nodes; curves: outside-max-distance 10L with probing and refine, outside-max-distance 10L with probing, outside-max-distance 10L, random 10L, gnp 10L 8D)
In contrast with the above results, if we consider the capability of predicting nearest neighbors, our approach performs favorably. In what follows, we define the proximity p of a node as follows: if the nearest neighbor computed in embedded space is actually the node's pth closest neighbor in real distance, the proximity is p. For example, if the calculated nearest neighbor is the real nearest neighbor, the proximity is 1.

Given the above, fig. 11 shows the capability to preserve proximity in embedded
space for the two schemes. From this figure, we can make several observations. First, even though the random landmark selection algorithm achieves results comparable to the outside-max-distance algorithm in preserving distance, it is much worse at preserving nearest neighbors. This complies with the claims in [18] that triangle-based schemes are sensitive to the positions of landmarks. The outside-max-distance algorithm preserves proximity information as well as GNP does; moreover, it is actually better at conserving the maximum proximity. Needless to say, with our contractive embedding scheme, we can use (1) on-line probing to adjust embedding errors and (2) a globally-defined "remedy value" to adjust the structure error. These are shown in green and red respectively in the figure.
4.5 Comparison Using INET Data

In this section, we compare experimental results using an Inet [29] data set, produced with the Inet 3.0 generator with all parameters set to their defaults. The data set consists of 3050 nodes and represents a synthetic power-law network, in which nodes are randomly placed in a square region and the delay of a link is the Euclidean distance between its end points. We use the Floyd-Warshall all-pairs shortest-path algorithm to generate the pairwise network distances between all nodes. Note that since this distance metric is derived from shortest paths on a 2-D plane, it obeys the triangle inequality, so we do not need to apply the triangle inequality violation remedy algorithm to adjust the data.
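The all-pairs delays can be generated with the standard Floyd-Warshall recurrence (a sketch; `adj[i][j]` is the direct link delay, infinity when no link exists). A shortest-path metric obeys the triangle inequality by construction:

```python
INF = float('inf')

def floyd_warshall(adj):
    """All-pairs shortest-path delays from a link-delay matrix."""
    n = len(adj)
    d = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

adj = [[0, 1, INF, INF],
       [1, 0, 1, 5],
       [INF, 1, 0, 1],
       [INF, 5, 1, 0]]
d = floyd_warshall(adj)
print(d[1][3])   # 2: path 1-2-3 beats the direct link of delay 5
```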
Figure 12: Predicting Error (x-axis: relative predicting error; y-axis: fraction of nodes; curves: outside-max-distance 50L, gnp median-cluster 50L 10D, gnp max-distance 50L 10D)
4.5.1 Pairwise Distance Predicting

First, we compare the pairwise distance predicting error, as defined in section 4.4.1. For landmark selection, we use our two heuristics, outside-max-distance and convex-max-distance. Fifty landmarks are used for these two algorithms, and for GNP we set the dimension to 10.

We give two sets of landmarks to GNP. The first is generated using the max-distance heuristic, while for the second we choose the first 50 nodes as landmarks. Surprisingly, using the first 50 nodes as landmarks gives much better results than the max-distance landmark selection method. After further investigating the data, we found that in Inet-generated data the nodes are ranked in decreasing order of degree. So the first nodes have higher degree and are more likely to be the median nodes of cluster groups. In [18], the authors have shown that N-cluster-median is the best landmark selection scheme for GNP; this result complies with their discovery.
In Fig. 12, we show the predicting error of GNP and of our scheme with respect to the landmark selection algorithms. We can see that if GNP is used with the N-cluster-median landmark selection algorithm, its predicting error is much less than that of our algorithm. This complies with the results from [19], which show that with 40 landmarks in an 8-dimensional Euclidean space, 90% of predictions have error less than 42%; with 10 more landmarks, our experiment does a little better, with 90% of predictions having error less than 35%. Another reason is that we use a different error calculation function, as expressed in section 4.4.1.

Neither of our two landmark selection heuristics is as good as GNP with N-cluster-median at predicting all distances, which is not surprising. However, if GNP uses the max-distance heuristic to select landmarks, its predicting capability is even worse than our scheme's. These results show that the landmark selection algorithm depends on the embedding scheme; in general, it is better to choose landmarks according to the intrinsic properties of the embedding scheme.
4.5.2 Nearest Neighbor Searching

Fig. 13 and Fig. 14 compare the capability of GNP and our algorithm in predicting the nearest neighbor. In this case, we use the calculated distance derived from each scheme's coordinates to find the nearest neighbor in the embedded space, and we consult the real distance metric to check the errors. Since we use 50 of the 3050 nodes as landmarks, each node actually tries to find the nearest neighbor among the other 3050 − 50 − 1 = 2999 nodes. For GNP, we use the median-cluster landmark selection algorithm, which has the best predicting error.

From Fig. 13, if we consider the proximity in finding the nearest neighbor, our scheme
Figure 13: Proximity Error (x-axis: relative proximity error for nearest neighbor; y-axis: fraction of nodes; curves: outside-max-distance 50L, outside-max-distance 50L probing 20, outside-max-distance 50L probing 200, gnp median-cluster 50L 10D)
is a little better than GNP, shown in Fig. 13 by the legends "gnp median-cluster" and "outside-max-distance 50L" respectively. In about 90% of cases, the nearest neighbor found in the embedded space can only be guaranteed to be within the actual 1200th closest neighbor. But if we apply the on-line algorithm to refine the embedding errors, the results are much better: allowing a maximum of 20 real probes, with 90% probability we can find a neighbor with proximity less than 1000; allowing a maximum of 200 probes, with 90% probability we can find one with proximity less than 270. Note that even with up to 200 real probes, among roughly 3000 nodes we can therefore only guarantee proximity within 270.
If we look at the problem another way, as shown in Fig. 14, and define the "distance error" as the ratio of the distance to the calculated nearest neighbor over the distance to the real nearest neighbor (so the ideal value is 1), the results are more telling.
Figure 14: Distance Error (x-axis: relative distance error for nearest neighbor; y-axis: fraction of nodes; curves: outside-max-distance 50L, outside-max-distance 50L probing 20, outside-max-distance 50L probing 200, gnp median-cluster 50L 10D)
We can see that in all four cases a large portion of the predictions have distance error less than 2. In 50% of cases, GNP finds a neighbor that is less than twice as far away as the real nearest neighbor; our algorithm does so in 60% of cases. Furthermore, if we allow a maximum of 20 probes to adjust the errors, this rises to 80%, and with a maximum of 200 probes we reach about 93%. This is due to the high degree of clustering of the Internet, which is also discussed in [28]. Using an embedding scheme to find the nearest neighbor, the actual distance is within 2 or 3 times that to the real nearest neighbor, which we believe is a reasonable approximation.
One final issue concerns the question: "could we also let GNP use probing to adjust the embedding error?" Only a contractive embedding can guarantee quick convergence to the real nearest neighbor in the refinement stage. Table 1 shows the tradeoff between the number of probes and the error in predicting nearest neighbors. From the table, we can see that even with a limit on the maximum number of probes, in a large portion of cases the on-line probing algorithm stops before the
Max Allowed Probing | Real Probing | Proximity | Distance Error
--------------------|--------------|-----------|---------------
0                   | 0            | 408       | 2.3
20                  | 13.4         | 245       | 1.8
200                 | 118.9        | 81        | 1.3
3050                | 429.8        | 1         | 1

Table 1: Probing number vs. nearest neighbor finding accuracy
maximum step, due to the contraction (and, hence, elimination of unqualified candidates) property. If we allow at most 20 probes, on average the algorithm stops after 13; if we allow at most 200, it stops on average after 119 steps. In the extreme case, even with no limit on probing, it needs only 430 steps on average, far fewer than 3050, and in that case we are guaranteed to find the real nearest neighbor.
5 Related Work
The paper by Ng and Zhang [18] was the first in the networking area to use a non-linear optimization algorithm for network positioning [25, 13, 21, 27]. Similar work has also been conducted in theoretical computer science [12] and by Tang and Crovella [27]. These bodies of work show that even though it is impossible to
represent the positions of Internet hosts in a purely 2D space, it is possible to embed
Internet hosts into a relatively-low (on the order of 5 to 7) dimensional Euclidean
space, using traditional dimension reduction algorithms like MDS (Multidimensional
Scaling) [2] and PCA (Principal Component Analysis). This makes it possible to
accurately embed the positions of hosts on the scale of the Internet using a relatively
small number of landmarks. Our work extends these ideas by focusing specifically on
the problem of determining nearest neighbors which, to our knowledge, has not been
the primary object of prior work in network positioning.
More recent work by Ng and Zhang [19], concerns the construction of a real
network-positioning system (NPS), to provide a positioning capability for hosts across
the Internet. In essence, this is similar to the Domain Name System (DNS). Princi-
pally, it involves a hierarchical network positioning architecture that maintains posi-
tion consistency while enabling decentralization and adaptation to network topology
changes. In effect, this is similar in concept to the off-line stage of our network
positioning approach.
Other work, such as the Big-Bang-Simulation [25], tries to simulate embedding
errors as force fields, and uses a multi-phase procedure to iteratively reduce such
errors. These iterative non-linear optimization-based algorithms are more sensitive to
input parameters and are more expensive to compute. For example, related work [13]
shows that under some circumstances, GNP may have non-unique coordinates which
would lead to estimation inaccuracy.
Lipschitz embedding with dimensionality reduction using PCA has been studied
by various other researchers [27, 13]. While it has been shown possible to reduce
the dimension of coordinate vectors from as much as 100 down to 20 using “virtual”
landmarks, it is still necessary for each end-host to probe as many as 100 “physical”
landmarks. In our solution, the dimension (or length) of coordinate vectors used
for positioning end-hosts is the same as the number of landmarks, implying that we
require no extra communication costs in the off-line derivation of these coordinates.
Finally, a global architecture for estimating Internet host distances, called the
Internet Distance Map Service (IDMaps), was first proposed by Francis et al. [4].
This architecture separates “tracers” (equivalent to our notion of landmarks) that
collect and distribute distance information from clients that use a corresponding dis-
tance map. A distance query interface allows an application to query IDMaps servers
to find out network distance between pairs of hosts. This is different from our two
stage service architecture in which landmark servers only participate in off-line co-
ordinate derivation, while end-hosts derive their nearest neighbors using an on-line
probing/refinement scheme.
6 Conclusions and Future work
In this thesis, we leverage geometric-based embedding techniques for the specific
objective of finding nearest neighbors. The nearest neighbor problem is of particular
importance to a large class of applications, in areas such as P2P systems, content-
distribution, overlay routing and end-system multicast. These large-scale applications
are now being deployed on scales that encompass many thousands of end-systems,
taken from a dynamic subset of all Internet hosts.
We propose a two-stage method for network positioning. In the first stage, which
is performed off-line, each host communicates with designated landmarks to derive
its coordinate. We use Lipschitz embedding in the L∞ normed vector space to assign
coordinates and derive distances between pairs of hosts. In the second stage, an on-
line refinement algorithm leverages the contractive property of L∞, to compensate
for embedding errors and quickly converge on the real nearest neighbor of a given
host. Once such a host is ascertained, it is possible to use a probe message (e.g., an ICMP ping), as used in our refinement algorithm, to capture distances, perhaps in terms of latency, which may then be used in applications such as QoS-constrained routing.
Our analysis shows that by careful selection of landmarks on the perimeter of all hosts in a given set, it is possible to determine nearest neighbors with low error rates using an L∞ embedding scheme. Although geometric embedding theory relies on the triangle inequality, in practice it may be violated by intrinsic properties of the underlying network topology. We compensate for this by offsetting pairwise distances between hosts with a "remedy value". Care must be taken with large remedy values, since they may in turn increase the very embedding errors we were trying to eliminate by asserting the triangle inequality.
Future work involves the design and implementation of a distributed version of
our embedding scheme, thereby making our method more scalable. We also intend
to investigate the landmark selection scheme for network topologies that have an
intrinsic dimensionality greater than two or three. Specifically, the Internet is known
to have a higher dimensionality than two, which impacts the number and location of
landmarks necessary for an accurate embedding scheme that preserves information
about nearest neighbors.