ba thesis : semantic routing mechanism for p2p networks

POLITEHNICA University of TIMISOARA

Semantic routing mechanism for P2P networks

Diploma project

Session: June 2006

Project supervisors :

Prof. Dr. Eng. Ioan JURCA

Prof. Dr. Eng. Sorin MOGA

Prof. Dr. Eng. Miranda NAFORNITA

Student: Paul SABOU


Contents

1 Introduction

I P2P technology & The routing process in P2P networks

2 P2P Technology : evolution & current state
  2.1 Introduction
  2.2 The Pure Models
  2.3 The Hybrid Models
  2.4 Major aspects of the current state of P2P
  2.5 Message routing in P2P networks

3 Research Question
  3.1 The objectives of the research
  3.2 A comparison between 2 routing protocols & the proposed protocol
  3.3 PlanetP Routing Algorithm
  3.4 Why we have to change PlanetP routing mechanism
  3.5 What should we change in PlanetP routing mechanism

4 A solution : the semantic routing mechanism
  4.1 General considerations regarding the semantic routing algorithm
  4.2 The algorithms used to calculate the importance of the words
    4.2.1 The algorithm used to determine the local importance of a word
    4.2.2 The algorithm used to determine the global importance of a word
  4.3 The routing table
    4.3.1 The words table
    4.3.2 The nodes table
    4.3.3 The links between the words table and the nodes table
    4.3.4 The decision table
  4.4 The initialisation of the routing table
    4.4.1 Initialisation of the words table
    4.4.2 Initialisation of the nodes table
    4.4.3 Initialisation of the links between the words table and the nodes table
    4.4.4 Initialisation of the decision table
  4.5 Operations with the routing table
    4.5.1 Operations with the words table
    4.5.2 Operations with the nodes table
    4.5.3 Operations with the links between words table and the node table
    4.5.4 Operations with the decision table
  4.6 The packets used in the routing process
    4.6.1 The query packets
    4.6.2 The response packets
    4.6.3 The feedback packets
  4.7 The routing process
    4.7.1 The search stage
    4.7.2 The response stage
    4.7.3 The feedback stage
  4.8 How to keep a constant size of the node table

5 Important constants used by the routing mechanism
  5.1 The initial multiplication factor
  5.2 The number of entries in the routing table
  5.3 The value of Time-To-Live
  5.4 The minimal match degree
  5.5 Routing factors

II The project implementation

6 Overview
  6.1 Introduction
  6.2 The PeerSim simulator
    6.2.1 A short description
    6.2.2 PeerSim main advantages
    6.2.3 How does PeerSim work
  6.3 The network overlay
  6.4 The routing mechanism
  6.5 Statistics gathering system
  6.6 Graphical visualisation system

7 Data gathering components
  7.1 Observer 1
    7.1.1 What it observes
    7.1.2 How it observes
    7.1.3 The output generated
  7.2 Observer 2
    7.2.1 What it observes
    7.2.2 How it observes
    7.2.3 The output generated

8 Graphical visualisation components
  8.1 The vrml-graph-plus package
    8.1.1 Introduction
  8.2 VRML 2.0
    8.2.1 A short description
    8.2.2 VRML design criteria
    8.2.3 VRML 2.0 - The main advantages
  8.3 Our implementation of the display environment
    8.3.1 The vrmlgraph package
    8.3.2 The vrmlgraphspecial package

9 Data representation & generation components
  9.1 The XML Writter package
    9.1.1 Description
    9.1.2 How it is used
  9.2 The Input Data Generators
    9.2.1 Description
    9.2.2 Query distribution generator

10 Statistics processing system
  10.1 Query Send & Hit Receive
    10.1.1 What it does
    10.1.2 The input processed
    10.1.3 The output generated
  10.2 Messages forwarded per cycle
    10.2.1 What it does
    10.2.2 The input processed
    10.2.3 The output generated
  10.3 Average routing table size in every cycle
    10.3.1 What it does
    10.3.2 The input processed
    10.3.3 The output generated

III Appendixes


Chapter 1

Introduction

RARE II is an improved version of the RARE project and has more ambitious objectives. The purpose of the RARE II project is to create a realistic environment for research on routing mechanisms in P2P networks. Important aspects that this project tries to address are :

• the analysis of the distribution of the content stored on the peers and of the distribution of the requests

• the analysis of the P2P networks, including a model of user behaviour

• the analysis of the efficiency of routing algorithms based on semantic routing

My work in this project consists of the design and implementation of a semantic routing protocol.

The first part of this document has 3 chapters (Chapters 2, 3 and 4). The second chapter contains an overview of the evolution and the current state of P2P networking. Its first 3 sections present an overall image of the main steps in the evolution of P2P networks. After that, 2 sections are dedicated to the current state of P2P networks and to a detailed presentation of the problem of message routing in P2P networks. The third chapter presents the objectives of the research and the starting point of the proposed routing mechanism. In this chapter, 2 popular routing protocols are analysed. These protocols serve as a baseline in the evaluation of the proposed routing mechanism. The fourth chapter presents the proposed routing mechanism. This is the most important part of our work. In this chapter we describe in detail the components and the algorithms used by the proposed routing mechanism.


The second part of this document (Chapters 5, 6, 7, 8, 9 and 10) presents an overview of the entire simulator and of the components developed for this project¹.

The last part contains the appendixes. There we present a small sample of the configuration files used to run the entire project, together with other material such as the Bloom filter system and the structure and sizes of the packets.

¹ For the complete code documentation, see the Appendix.


Part I

P2P technology & The routing process in P2P networks


Chapter 2

P2P Technology : evolution & current state

2.1 Introduction

Because of the widespread usage of P2P networks, P2P technology has developed very quickly over the last decade. The direction of this development was imposed by many factors, of which two types were the most important: social factors and technological factors. In this chapter we focus on the P2P technology that is mainly used for storage purposes (i.e. file sharing).

The next section (2.2) describes two paradigmatic models (the "pure models") for the organisation of communication in computer networks. Section 2.3 then introduces four "hybrid" organisation models, based on the pure models. Finally, Section 2.4 presents the current state and trends of P2P development and outlines its main problems.

2.2 The Pure Models

1. The Client Server Model
This was the most used model in computer networks. In this model, the communication takes place between a node considered the server and other nodes considered clients.

Main characteristics:


Figure 2.1: The Client Server Model (four clients each send a Request to the central server and receive a Reply)

(a) asymmetric relation : the client sends requests, the server executes them and sends back replies

(b) centralised control : the server node acts as a central node in the communication network, handling all the communication between client nodes

Main advantages:

(a) simple finding mechanism : the mechanism used to locate data is very simple

(b) small communication overhead : there is minimal overhead for locating and transferring data

Main weaknesses:

(a) poor scalability : the number of client nodes is limited by the communication and processing capabilities of the server node

(b) poor reliability : the reliability of the entire communication system depends on the reliability of the server node

2. The Peer-to-Peer Model
This model is an emerging paradigm that matured over the last decade and is now becoming one of the major models in computer networks.


Figure 2.2: The Peer to Peer Model

Main characteristics:

(a) symmetric relation : all the nodes have the same importance in the network, every node acting both as a client and as a server

(b) no centralised control : the communication between the nodes in the network doesn't depend on any specific node

Main advantages:

(a) extreme scalability :

• the nodes don't need global knowledge of the network entities

• the processing/communication tasks are distributed across the entire set of nodes

(b) good reliability :

• there isn't any central point of failure

• data can be easily replicated

Main weaknesses:

(a) communication overhead :

• all the nodes have to implement some technique to keep an up-to-date image of their neighbours


• nodes receive many useless messages (that are just passing by), which consume bandwidth

(b) processing overhead :

• all the nodes have to perform some minimal processing of all the incoming messages, in order to route them

• all the nodes have to implement rather computationally expensive methods of data location

(c) complex localisation of data

2.3 The Hybrid Models

The hybrid P2P systems were developed to address different problems inherent in the "pure models". As will be shown, there isn't any all-win solution, so each "hybrid model" resolves some of the problems, but not all of them. Below we present a succession of "hybrid models" that represent the main steps in the development of P2P networks:

1. Client-Server search & Peer-to-Peer file transfer

Example : The Napster Network[3]

Objectives: The Napster Network tried to give different individuals the possibility to share files. The entire system had to shut down because of copyright infringement.

Description: There was a central pool of servers that kept a centralised directory index with all the online Napster users and with the names of all their shared files. The search was centralised, but once the filename and the IP address of the remote user were known, the transfer took place directly between the respective peers.

Advantages: A huge quantity of shared files could be accessed using relatively little bandwidth/processing (on the central pool of servers).

Disadvantages:

(a) the central pool of servers limited the growth of the network

(b) the entire system depended on the central pool of servers¹

¹ After the lawsuit, the Napster network was shut down by shutting down the pool of servers that kept the directory index.


Figure 2.3: Client-Server search & P2P transfer

Observations: The system used a client-server model to solve the problem of resource localisation (the central directory index) and a peer-to-peer model to solve the problem of resource transfer.

2. Peer-to-Peer search & Peer-to-Peer file transfer

Example : The Gnutella Network[1]

Objectives : The Gnutella Network tried to offer the same services as Napster, while avoiding the dependence of the network on a central point.

Description :

• When a node wanted to join the Gnutella network, it only had to know the address of one node that was already part of the network. Once it was connected to that node, the node sent it a list of neighbours already in the network. This way, the failure of one or more nodes didn't affect the overall network. The nodes kept an up-to-date neighbour table by checking the availability of their neighbours and by asking other nodes for their respective neighbour tables.


Figure 2.4: P2P search & P2P transfer

• The search was performed using the flooding model. When a node searched for some keywords, it sent a query to k neighbours. Any node that received a search request decreased the TTL of the request and sent the request to k of its neighbours, except the source node. After that, if it found any local files that matched the request, it sent the source node a list of matched entries. When the TTL == 0, the receiving node discarded the packet.

Advantages : There isn't any single point of failure. The Gnutella network is very robust.

Disadvantages : The search inflicted heavy bandwidth/processing overhead, which made the network less scalable.

Observations : This system is the closest to the pure Peer-to-Peer model.
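The flooded search described for Gnutella can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the Gnutella protocol itself: the overlay is a plain dictionary, the function name `flood_search` is hypothetical, a query is never sent back to the node it came from, and a `seen` set suppresses duplicate queries.

```python
from collections import deque


def flood_search(neighbours, files, start, keyword, k=2, ttl=3):
    """Return {node: [matching file names]} found by a TTL-limited flood.

    neighbours: dict node -> list of neighbour nodes (hypothetical overlay)
    files:      dict node -> list of shared file names
    """
    hits = {}
    seen = {start}  # assumption: a node handles each query only once
    queue = deque((n, start, ttl) for n in neighbours[start][:k])
    while queue:
        node, sender, hops_left = queue.popleft()
        if hops_left == 0 or node in seen:
            continue  # discard expired or repeated packets
        seen.add(node)
        matches = [f for f in files.get(node, []) if keyword in f]
        if matches:
            hits[node] = matches  # stands for the reply sent to the source
        # forward to at most k neighbours, never back to the sender
        targets = [n for n in neighbours[node] if n != sender][:k]
        queue.extend((n, node, hops_left - 1) for n in targets)
    return hits


overlay = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
shared = {"c": ["rock song.mp3"], "e": ["rock live.mp3"]}
print(flood_search(overlay, shared, "a", "rock", ttl=3))  # {'c': ['rock song.mp3']}
```

With `ttl=3` the query dies before reaching node `e`, which illustrates why flooding trades search efficiency against traffic: raising the TTL or k finds more files but multiplies the number of forwarded messages.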

3. Cached Peer-to-Peer search & Peer-to-Peer file transfer

Example : The Kazaa/Morpheus Network[2]

Objective : Obtain the same level of reliability as in the Gnutella network, while considerably reducing the bandwidth/processing overhead.


Figure 2.5: Cached P2P search & P2P transfer

Description : The improvements depended on the introduction of a new type of peer, called super-peers. In this new type of network there were two types of peers: the client-peers and the super-peers. The relation between the client-peers and the super-peers was very close to the "client-server" model. The relation between the super-peers could be modelled approximately by the "peer-to-peer" model. All the requests of the client-peers were presented to the rest of the network as if they came from their corresponding super-peer.

Advantages :

(a) the network is very reliable, because it benefits from the advantages of the peer-to-peer architecture

(b) because of the small number of super-peers, the bandwidth/processing overhead could be effectively reduced

Disadvantages :

(a) the searching/routing algorithms are more complex

(b) the connection management is more complicated

4. Content Addressing Network - based search & Peer-to-Peer file transfer

Example : The Freenet Network[4]


Figure 2.6: Content Addressing search & P2P transfer

Objective : Improve the search method, by storing certain content in certain peers.

Description :

• Each peer in the network has an associated unique hash key (usually 160 bits long) and every stored file has another associated unique hash key.

• When a new file is inserted in the global storage system, it is placed in the peer which has the closest key to the key of the file.

• When a file is searched, all we need to find is the node with the key closest to the searched file key.

Advantages :

• The search mechanism is very effective, with little traffic/processing overhead.

• The system offers a high degree of content anonymity. Your files don't necessarily stay on your computer, but on many people's computers.

Disadvantages :

• When a new file is inserted in the system, it causes (unnecessary) traffic for some other peer in the network.

• Not well suited for frequent updates.
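The key-based placement described in this model can be illustrated with a short Python sketch. It is not Freenet's actual routing: the use of SHA-1 to derive 160-bit keys, the numeric-difference notion of "closest", and the names `key_of` and `closest_node` are all assumptions made for the example, and the sketch assumes every node knows all the keys.

```python
import hashlib


def key_of(name: str) -> int:
    """160-bit integer key derived with SHA-1 (an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def closest_node(node_keys: dict, file_key: int) -> str:
    """Return the node whose key is numerically closest to file_key."""
    return min(node_keys, key=lambda n: abs(node_keys[n] - file_key))


nodes = {name: key_of(name) for name in ("peer-a", "peer-b", "peer-c")}
store = {}

# Insertion: the file is placed on the peer with the closest key.
owner = closest_node(nodes, key_of("song.mp3"))
store.setdefault(owner, []).append("song.mp3")

# Search: the same computation locates that peer again, without flooding.
assert closest_node(nodes, key_of("song.mp3")) == owner
```

Because insertion and search use the same deterministic rule, a lookup needs no broadcast; the cost is that a file usually ends up on a peer other than the one that published it.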


2.4 Major aspects of the current state of P2P

Below we outline some major aspects of the current state of P2P networks that are important for our project². After that we briefly state some of the inherent problems of each solution.

Each aspect outlined can be treated separately, improving the overall performance of P2P networking.

P2P network architecture In the current state of development, there are 2 major architectural paradigms:

1. In the hierarchical P2P architecture, the set of peers is divided into 2 subsets : the usual peers and the super-peers. The relation between super-peers is peer-to-peer and the one between peers and super-peers is a client-server relation. In this network architecture, the super-peers act as caches for the usual peers.

2. In the classic architecture, the peers can link to each other in any way they want. There isn't any restriction on their topology.

Data placement in P2P networks There are 3 major paradigms in data placement:

1. data and meta-data are distributed across nodes without any specific organisation (Gnutella[1])

2. data and meta-data are distributed deterministically across the nodes (Freenet[4], Chord[8], Pastry[10], Tapestry[11])

3. meta-data is stored in the super-peers and the data is stored in the normal peers (Kazaa[2])

Message routing in P2P networks Some of the representative paradigms are:

1. message flooding : the messages are sent to k nodes (k = some constant) and in each step the TTL of the message decreases. When the TTL of the message == 0, the message is discarded.

2. directed routing : the message is sent to some of the neighbours, according to a previously implemented routing algorithm

² See [6] for an in-depth discussion.


2.5 Message routing in P2P networks

The message routing mechanism deals with the distribution of queries. The main tasks of a routing mechanism are :

1. providing an efficient search mechanism

2. adapting itself to the interests of the local node and to the global trends of the network

The design of a message routing mechanism must consider the following factors :

Node resources These are the resources used by every node to implement the routing mechanism. The resources used usually include :

1. memory used to store the routing table

2. CPU used to search/add/remove entries in the routing table

Network traffic The routing mechanism generates network traffic to update and improve the routing tables on every node. The traffic generated by the routing mechanism can be divided into the following categories:

queries traffic This is the traffic generated by the forwarding and the multiplication of the queries. Each node along the path contributes in some way to this traffic, thus the distribution of the traffic depends on the paths of the queries.

responses traffic This is the traffic generated by the responses to the initiated queries. Usually the responses are sent directly by the responding nodes to the nodes that created the queries.

feedback traffic This is the traffic generated by the feedback mechanism, used by the routing mechanism to adjust the routing parameters of the nodes that participated in the query routing.

Search efficiency The search efficiency depends on the degree of success of the queries that are inserted into the system. When the system calculates the success of an individual query, it must take into consideration factors such as the number of responses, the relevance of the responses, the preferences of the user, etc.


Chapter 3

Research Question

3.1 The objectives of the research

The research of our project will focus on the design of a routing mechanism that :

1. uses relatively few node resources

2. generates a small amount of traffic

3. has a high search efficiency

We will start with an investigation of these 3 characteristics in two popular routing mechanisms : Gnutella¹ and PlanetP. These mechanisms are used as a baseline in the evaluation of the proposed routing mechanism.

We want to propose a routing mechanism that (hopefully) will keep the advantages of both while trying to avoid some of the disadvantages.

3.2 A comparison between 2 routing protocols & the proposed protocol

Below is a synthetic view of both routing mechanisms used as a baseline. As the table shows, Gnutella relies heavily on network traffic to offer a modest search efficiency. PlanetP relies almost entirely on the heavy use of node resources and offers a high search efficiency.

¹ We refer to Gnutella v. 1.0. When we refer to other versions, that will be stated explicitly.


2 popular routing mechanisms

       Node resources   Network traffic   Search efficiency
MAX    PlanetP          Gnutella          PlanetP
       -                -                 -
MED    -                PlanetP           -
       -                -                 Gnutella
MIN    Gnutella         -                 -

The routing mechanism that we are proposing will try to keep the same search efficiency as PlanetP, but heavily reduce the node resources used. At the same time we will try to keep the network traffic low.

The proposed routing mechanism

       Node resources   Network traffic   Search efficiency
MAX    -                -                 -
       -                -                 Our mechanism
MED    Our mechanism    Our mechanism     -
       -                -                 -
MIN    -                -                 -

We keep some features of PlanetP in our routing mechanism (e.g. the summarisation method used to represent the content of a node), so we describe the proposed routing algorithm (and its improvements over PlanetP) after a brief description of the PlanetP algorithm.

3.3 PlanetP Routing Algorithm

PlanetP[5] is a hybrid system that uses Bloom filters² to summarise node contents. I will present a short description of PlanetP³, because it is the main system we use to test the semantic routing mechanism.

The P2P system uses Bloom filters in the following manner :

1. Every file that is shared on a node has an associated BloomFilter. The BloomFilter associated with the file is constructed by OR-ing all the BloomFilters associated with each of the words from the filename.

2. All the BloomFilters associated with the files are OR-ed, resulting in a new BloomFilter. This is the BloomFilter of the node.

² See the Appendix for details.
³ For a complete description of the PlanetP P2P system, see [5].


The BloomFilter of the node is used to summarise all the names of the files that are stored on the respective node.
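The two OR-ing steps can be sketched with a small Bloom filter in Python. This is an illustrative sketch only, not PlanetP's implementation: the filter size, the number of hash functions, the SHA-1 based hashing and the helper names `filter_for_file` and `filter_for_node` are all assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored as a Python int

    def _positions(self, word):
        # Derive num_hashes bit positions from salted SHA-1 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def __contains__(self, word):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(word))

    def union(self, other):
        merged = BloomFilter(self.size_bits, self.num_hashes)
        merged.bits = self.bits | other.bits  # the OR-ing step
        return merged


def filter_for_file(filename):
    """Step 1: OR together the filters of the words in the file name."""
    bf = BloomFilter()
    for word in filename.lower().split():
        bf.add(word)  # same result as OR-ing one filter per word
    return bf


def filter_for_node(filenames):
    """Step 2: OR together the filters of all the shared files."""
    node_bf = BloomFilter()
    for name in filenames:
        node_bf = node_bf.union(filter_for_file(name))
    return node_bf


node_filter = filter_for_node(["blue sky", "red car"])
print("blue" in node_filter, "car" in node_filter)  # True True
```

A word that was never added usually tests as absent, but a Bloom filter can return false positives, so a match only means that the node probably stores a file containing that word.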

The default routing mechanism in PlanetP can be summarised as follows :

The construction of the routing table:

• each node sends its nodeID (the BloomFilter) to its neighbours

• using the neighbours' BloomFilters, each node builds its own routing table

• the routing tables are exchanged between neighbours on a regular basis (this is the gossiping mechanism)

• the parameters of the routing table update mechanism are adjusted, so that once the routing tables have stabilised, the table exchanges become less frequent

The routing process:

• a user creates a request (that can be represented as a list of words)

• each word from the request is transformed into a BloomFilter

• each BloomFilter (from the words in the request) is compared with all the BloomFilters (from the routing table of the current node)

• a set of matching BloomFilters is obtained and the current node sends the request directly to all the corresponding nodes
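The routing process above amounts to a membership test of the query words against every summary in the routing table. The following Python sketch shows the idea under explicit assumptions: plain sets stand in for the Bloom filters (so there are no false positives), the function name `select_targets` is hypothetical, and whether a node must match all the query words or only one of them is not stated in the text, so it is left as a parameter.

```python
def select_targets(routing_table, query, require_all=True):
    """Return the IDs of the nodes whose summary matches the query words.

    routing_table: dict node_id -> summary supporting the `in` operator
                   (a set in this sketch; a Bloom filter in PlanetP)
    """
    words = query.lower().split()
    combine = all if require_all else any  # assumption: matching rule
    return [node_id for node_id, summary in routing_table.items()
            if combine(word in summary for word in words)]


table = {
    "node-1": {"blue", "sky", "red", "car"},
    "node-2": {"blue", "moon"},
    "node-3": {"green", "car"},
}
print(select_targets(table, "blue car"))                     # ['node-1']
print(select_targets(table, "blue car", require_all=False))  # ['node-1', 'node-2', 'node-3']
```

Note that every query scans the whole routing table, which is why the size of the table matters for both memory and CPU.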

3.4 Why we have to change PlanetP routing mechanism

The PlanetP protocol is very reliable and, after an initial synchronisation period, all the nodes in the network have global knowledge of the entire network. After this initial period, all the requests inserted into the network will be fulfilled perfectly. This happens because every node has global knowledge.

The main problem of the default routing mechanism is that it isn't scalable. It scales well only up to a network size of a few thousand nodes. When the number of nodes in the network is over 10000, each node will need an unacceptable amount of memory and too much processing power to provide the needed functionality. To make my point clear, I will provide an example :


Initial considerations:

1. P2P network size = 10 x 1000 nodes

2. 500 files/node (decent approximation)

3. 7 words/file name (decent approximation)

4. Bloom filter with 1000 words capacity : 1.9 KB (see [5], p. 3)

Conclusions:

1. 3500 words/node : each node will have a Bloom filter size of 1.9 KB x 3.5 = 6.65 KB

2. each node has a routing table with 10 x 1000 entries (BloomFilters)

3. routing table size (on each node) : 10 x 1000 x 6.65 KB = 66.5 MB
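The estimate can be reproduced with a few lines of Python. The sketch assumes, as the example does, that the Bloom filter size grows linearly with the number of words, and it takes 1 MB = 1000 KB; the variable names are illustrative.

```python
nodes = 10 * 1000
files_per_node = 500
words_per_file = 7
kb_per_1000_words = 1.9  # from [5], p. 3

words_per_node = files_per_node * words_per_file        # 3500 words
filter_kb = kb_per_1000_words * words_per_node / 1000   # about 6.65 KB per node
table_mb = filter_kb * nodes / 1000                     # about 66.5 MB per routing table

print(f"{words_per_node} words/node, {filter_kb:.2f} KB/filter, {table_mb:.1f} MB/table")
```

The table size grows linearly with the number of nodes, so a network ten times larger would need roughly 665 MB per node under the same assumptions.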

There are 2 main problems with such a table size :

consumes too much CPU The node has to spend a lot of CPU time to search for an entry in the table. If the node organises the table in some way, this management will take a lot of additional CPU time.

takes up too much memory The Bloom filters consume a lot of memory, and we should add as much as 20% for the data structures needed to build and manage the table.

3.5 What should we change in PlanetP routing mechanism

The main change we must make in PlanetP : we must keep a constant-size routing table.

The trivial solution would be to keep the routing table small with a simple replacement algorithm (e.g. LRU). The trivial solution is inefficient, because it takes into account neither the interests of the local node nor the global interests.

We propose a solution by which the routing table content adapts to the changing interests/contents of the nodes in the P2P network.

This solution will have to provide a good balance between:


"Routing traffic" vs. "Query traffic"

"Routing table size" vs. "Search efficiency"


Chapter 4

A solution : the semantic routing mechanism

4.1 General considerations regarding the semantic routing algorithm

1. the "local interests"/"global interests" distinction :

The reasons: We consider this separation useful, because :

(a) there are wide differences between the interests of the nodes of the P2P network

(b) there are some general trends in the average interests of the nodes of the P2P network

(c) we can use the average interests of the nodes as a baseline to decide the degree of the local interests

(d) the local interests of the nodes change in a different way than the average interests of the entire network

The Sources:

local interests To have a representation of the local interests of a node, we must take into account:

(a) the responses considered successful by the local node

(b) the responses considered unsuccessful by the local node

(c) the queries originating from the local node

(d) the files stored in the local node repository

global interests To have a representation of the global interests (from the perspective of a node), we must take into account:


(a) the queries forwarded by the current node

(b) the responses forwarded by the current node

2. the "local responsibility"/"global responsibility" distinction :

The reasons: The routing process in our P2P architecture is a cooperative process. The process of forwarding a request from the originating node to the responding node involves more than one routing decision. The entire (global) responsibility for a request is divided into partial responsibilities, each node in the forwarding chain having a local responsibility. The local responsibility depends on the importance of the local routing decision for the success of the entire operation.

4.2 The algorithms used to calculate the importance of the words

We use statistical methods to determine the importance of the words. The same algorithm is used for the global importance and the local importance of the words. The only thing that changes is the set of sources used as input.

4.2.1 The algorithm used to determine the local importance of a word

In the following description, the following notations will be used:

The Input Sets The sets of words used by the algorithm (occurrences are counted, so a word may appear more than once in a set):

1. The set of all the words that appear in the responses considered successful by the current node

$W_{SR}$

2. The set of all the words that appear in the responses considered unsuccessful by the current node

$W_{UR}$

3. The set of all the words that appear in the queries originating from the current node

$W_{Q}$


4. The set of all the words that appear in the names of the files contained in the current node

$W_{LF}$

5. The set of all the words related to the current node

$W_{Local} = W_{SR} \cup W_{UR} \cup W_{Q} \cup W_{LF}$

Factors Factors used to scale the importance of the sets of words:

1. The Relative Importance of word occurrences in the responses considered successful by the current node

$RI_{SR}, \quad 0 \le RI_{SR} \le 1$

2. The Relative Importance of word occurrences in the responses considered unsuccessful by the current node

$RI_{UR}, \quad 0 \le RI_{UR} \le 1$

3. The Relative Importance of word occurrences in the queries originating from the current node

$RI_{Q}, \quad 0 \le RI_{Q} \le 1$

4. The Relative Importance of word occurrences in the names of the files contained in the current node

$RI_{LF}, \quad 0 \le RI_{LF} \le 1$

5. Condition

$0 \le RI_{SR} + RI_{UR} + RI_{Q} + RI_{LF} \le 1$

Operators We will use the following operators :

1. The number of occurrences of a word in a set of words

$Occ(\alpha, S) = Card\{\beta \in S, \beta = \alpha\}$

2. The density of a word in a set of words

$\rho(\alpha, S) = \frac{Occ(\alpha, S)}{Card(S)}$

The algorithm used to determine the local importance of a word is :

for all $\alpha \in W_{Local}$ do
    $\alpha_{LI} \leftarrow RI_{SR} \cdot \rho(\alpha, W_{SR}) + RI_{UR} \cdot \rho(\alpha, W_{UR}) + RI_{Q} \cdot \rho(\alpha, W_{Q}) + RI_{LF} \cdot \rho(\alpha, W_{LF})$
end for
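The computation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each input set is modelled as a list so that repeated words are counted, an empty set is given density 0, and the function names and sample words are illustrative rather than taken from the project code.

```python
from collections import Counter


def density(word, words):
    """rho(word, S): occurrences of `word` in S divided by the size of S."""
    return Counter(words)[word] / len(words) if words else 0.0


def importance(sources):
    """Weighted sum of densities over (relative_importance, words) pairs."""
    if not 0 <= sum(ri for ri, _ in sources) <= 1 + 1e-9:
        raise ValueError("relative importances must sum to a value in [0, 1]")
    vocabulary = set().union(*(words for _, words in sources))
    return {w: sum(ri * density(w, words) for ri, words in sources)
            for w in vocabulary}


# Hypothetical inputs: W_SR, W_UR, W_Q and W_LF with their RI factors.
local_importance = importance([
    (0.4, ["rock", "live", "rock"]),          # successful responses
    (0.1, ["pop"]),                           # unsuccessful responses
    (0.3, ["rock", "jazz"]),                  # queries from this node
    (0.2, ["jazz", "jazz", "live", "rock"]),  # local file names
])
print(round(local_importance["rock"], 4))  # 0.4667
```

Since `importance` only takes a list of weighted sources, the same function covers the global case by passing the forwarded queries and forwarded responses instead.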


4.2.2 The algorithm used to determine the global importance of a word

In the following description, the following notations will be used:

The Input Sets The sets of words used by the algorithm:

1. The set of all the words that appear in the queries forwarded by the current node

$W_{QF}$

2. The set of all the words that appear in the responses forwarded by the current node

$W_{RF}$

3. The set of all the words related to the global context

$W_{Global} = W_{QF} \cup W_{RF}$

Factors Factors used to scale the importance of the sets of words:

1. The Relative Importance of word occurrences in the queries forwarded by the current node

$RI_{QF}, \quad 0 \le RI_{QF} \le 1$

2. The Relative Importance of word occurrences in the responses forwarded by the current node

$RI_{RF}, \quad 0 \le RI_{RF} \le 1$

3. Condition

$0 \le RI_{QF} + RI_{RF} \le 1$

Operators We will use the following operators :

1. The number of occurrences of a word in a set of words

$Occ(\alpha, S) = Card\{\beta \in S, \beta = \alpha\}$

2. The density of a word in a set of words

$\rho(\alpha, S) = \frac{Occ(\alpha, S)}{Card(S)}$

The algorithm used to determine the global importance of a word is :

for all $\alpha \in W_{Global}$ do
    $\alpha_{GI} \leftarrow RI_{QF} \cdot \rho(\alpha, W_{QF}) + RI_{RF} \cdot \rho(\alpha, W_{RF})$
end for


4.3 The routing table

The routing table has the following components :

1. The words table

2. The nodes table

3. The recommendation graph (which links (1) to (2))

The routing mechanism is based on the relation between the words table and the nodes table.

4.3.1 The words table

The words table is used to store all the words known by the current node. Each entry in the words table has the following components:

local importance The importance of the word from the perspective of the interests of the current node.

global importance The importance of the word from the perspective of the global trends of the network.

word The word (a string of characters).

For each node, the set of words that appear in the words table is obtained in the following way:

WTotal = WLocal ∪ WGlobal

The local and global importance of every word are obtained by applying the algorithms described previously.

4.3.2 The nodes table

The nodes table is used to store all the nodes known by the current node. Each entry in the nodes table has the following components:

Counters :

reference counter Keeps the number of times the node was recommended by a routing decision.

history counter Keeps a history (over the last n cycles) of when the node was recommended by a routing decision.

Indicators :

local importance The importance of the node, from the perspective of the local node.

global importance The importance of the node, from the perspective of the entire network.

global knowledge The global knowledge of the node, from the perspective of the local node.

Others :

Node ID The identifier of the node (e.g. the IP address, in a TCP/IP network), used to contact the node directly.

Bloom filter The Bloom filter associated with the node, which summarises all the content of the respective node.

4.3.3 The links between the words table and the nodes table

The links table is used to keep track of all the associations between the words table and the nodes table. Each entry in the links table has the following components:

Node ID The node ID of the node.

Word The word.

local importance The importance of the association between the word and the node, from the perspective of the local node.

global importance The importance of the association between the word and the node, from the perspective of the entire network.

The links have a major role in the adaptation mechanism of the semantic routing mechanism.
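The three components described so far can be pictured as plain records. The Python sketch below is only one possible layout and is not taken from the project code: all class and field names are hypothetical, and a frozenset of words stands in for a real Bloom filter.

```python
from dataclasses import dataclass, field


@dataclass
class WordEntry:
    word: str
    local_importance: float = 0.0
    global_importance: float = 0.0


@dataclass
class NodeEntry:
    node_id: str
    bloom_filter: frozenset = frozenset()  # stand-in for a real Bloom filter
    reference_counter: int = 0
    history_counter: int = 0
    local_importance: float = 0.0
    global_importance: float = 0.0
    global_knowledge: float = 0.0


@dataclass
class LinkEntry:
    node_id: str
    word: str
    local_importance: float = 0.0
    global_importance: float = 0.0


@dataclass
class RoutingTable:
    words: dict = field(default_factory=dict)  # word -> WordEntry
    nodes: dict = field(default_factory=dict)  # node_id -> NodeEntry
    links: dict = field(default_factory=dict)  # (word, node_id) -> LinkEntry
```

Keying the links by the pair (word, node ID) is a design choice of this sketch: it makes the lookup of one association a single dictionary access.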

4.3.4 The decision table

The decision table is used by every node in the feedback mechanism. Every node has its own decision table, where it keeps track of all the routing decisions made before (over a certain period of time).

Each entry in the decision table has the following components :


decision ID A unique string of characters that identifies every routing decision locally.

creation time A timestamp, indicating the moment this routing decision was made.

expiration time A timestamp, indicating the moment after which this routing decision is no longer relevant for the feedback mechanism.

LRC The proportion of local trust in the total trust of the routing decision.

GRC The proportion of global trust in the total trust of the routing decision.

word list The list of words that recommended the current routing decision. It is important to keep them as words (not as a list of indexes into the WTotal set), because WTotal can change between the moment when the routing decision is made and the arrival of the feedback packet.

global trust list The list of global trust degrees associated with the links between the words in the word list and the destination node.

local trust list The list of local trust degrees associated with the links between the words in the word list and the destination node.

4.4 The initialisation of the routing table

The initialisation of the routing table consists of initialising each component used in the routing process:

4.4.1 Initialisation of the words table

When a new node connects to the P2P network it has an empty words table. The initialisation consists of the following stages:

1. the words table is populated with the words from WLocal

2. the local importance of the words is determined with the local importance algorithm

3. the global importance of the words is initialised with a random value from the interval [GWImin, GWImax]


4.4.2 Initialisation of the nodes table

When a node connects to the P2P network it has a nodes table with only one entry. This entry describes the node through which the new node connects to the network. The fields of this entry have the following values:

Counters :

reference counter = 0 (the node has never been referenced)

history counter = 0

Indicators :

local importance = LNImax

global importance = a random value from the interval [GNImin, GNImax]

global knowledge = 0

Others :

Node ID = the identifier of the node

Bloom filter = the Bloom filter associated with the contents of the node

4.4.3 Initialisation of the links between the words table and the nodes table

When the node connects to the P2P network, the links table has no entries. We add only the links to the single existing node. (See "Adding a link" in Section 4.5.3.)

4.4.4 Initialisation of the decision table

When the node connects to the P2P network, the decision table has no entries.

4.5 Operations with the routing table

4.5.1 Operations with the words table

There are 2 operations that can be used with the words table :

Adding a word This operation takes 1 argument :

1. Word

The operation is done in the following steps:

1. construct a word entry γ with :

• γWord = Word

• γLocalImportance ← Local Importance for Word (see the algorithm above)

• γGlobalImportance ← Global Importance for Word (see the algorithm above)

2. add the word entry to the words table WTotal

WTotal ← WTotal ∪ {γ}

Deleting a word This operation takes 1 argument :

1. Word

The operation is done in the following steps:

1. locate the word entry γ

γ ∈ WTotal, γWord = Word

2. remove the word entry from the table WTotal

WTotal ← WTotal \ {γ}

4.5.2 Operations with the nodes table

There are 2 operations that can be used with the nodes table :

Adding a node This operation takes 1 argument :

1. NodeID

The operation is done in the following steps:

1. construct a node entry γ

2. add the node entry to the table NKnown

NKnown ← NKnown ∪ {γ}

Deleting a node This operation takes 1 argument :

1. NodeID

The operation is done in the following steps:

1. locate the node entry γ

γ ∈ NKnown, γNodeID = NodeID

2. remove the node entry from the table NKnown

NKnown ← NKnown \ {γ}

4.5.3 Operations with the links between the words table and the nodes table

There are 2 operations that can be used with the links table :

Adding a link This operation takes 1 argument :

1. NodeID

The operation is done in the following steps:

1. Obtain the list of words from the words table that are contained in the Bloom filter of the node

ListWords(NodeID) = {α ∈ WTotal, ∃β ∈ NKnown, βNodeID = NodeID, α ∈ βBloomFilter}

2. For each word γ in the word list ListWords(NodeID), add a link λ to the LWord/Node set

LWord/Node ← LWord/Node ∪ {λ}, where

λLocalImportance ∈ [LLImin, LLImax]

λGlobalImportance ∈ [GLImin, GLImax]

λNodeID = NodeID

λWord = γ
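The "Adding a link" operation can be sketched as follows. This is a hedged illustration rather than the project code: the function name and the dictionary layout are hypothetical, a plain set of words stands in for the Bloom filter of the node, and drawing the two importances at random from their intervals is an assumption (the text only states the intervals; the [0, 1] defaults are also assumed).

```python
import random


def add_links(node_id, node_words, known_words, links,
              lli=(0.0, 1.0), gli=(0.0, 1.0), rng=random):
    """Add one link per known word that the node's filter reports as present.

    links maps (word, node_id) -> dict holding the two importances.
    """
    for word in known_words:
        if word in node_words:  # stand-in for a Bloom filter membership test
            links[(word, node_id)] = {
                "local_importance": rng.uniform(*lli),    # assumed random draw
                "global_importance": rng.uniform(*gli),   # assumed random draw
            }
    return links
```

A real Bloom filter can report false positives, so with an actual filter a few links may be created for words the node does not hold.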

Deleting a link This operation takes 2 arguments :

1. NodeID

2. Word

The operation is done in the following steps:

1. Locate the link λ

λ ∈ LWord/Node, λNodeID = NodeID, λWord = Word

2. Remove the link λ from LWord/Node

LWord/Node ← LWord/Node \ {λ}

4.5.4 Operations with the decision table

There are 2 operations that can be used with the decision table :

Adding a decision This operation takes 3 arguments :

1. Word List

2. Global Trust List

3. Local Trust List

The operation is done in the following steps:

1. construct a decision entry γ with :

• γWordList ← Word List

• γGlobalTrustList ← Global Trust List

• γLocalTrustList ← Local Trust List

• set the timestamp with the local time

• complete the routing factors for the decision

2. add the decision entry to the decision table DecisionRouting

DecisionRouting ← DecisionRouting ∪ {γ}

Deleting a decision This operation takes 1 argument :

1. Decision ID

The operation is done in the following steps:

1. locate the decision entry γ

γ ∈ DecisionRouting, γDecisionID = DecisionID

2. remove the decision entry from the table DecisionRouting

DecisionRouting ← DecisionRouting \ {γ}

4.6 The packets used in the routing process

In the mechanism we propose, the following types of packets are used:

query packets These packets contain the search pattern issued by the requester node. They are used in the search stage of the routing process.

response packets These packets contain the responses created by the nodes that have information relevant to the search pattern. They are used in the response stage of the routing process.

feedback packets These packets contain the information needed to adjust the routing parameters of all the nodes that participated in the search stage. They are used in the feedback stage of the routing process.

4.6.1 The query packets

Each query packet has the following components:

request ID A unique string of characters that identifies every request issued in the network. This is used in the feedback stage to associate a query packet with a feedback packet.

requester node ID A unique identifier of the requester node. This is used to communicate directly with the requester node.¹

creation time The cycle in which the request was issued by the requester node.² This is used in the feedback stage to calculate the maximum time a node can wait for a feedback packet.

creating TTL The initial Time-To-Live of the query packet. This is the upper limit of the length of the forwarding path of the query packet.

TTL The Time-To-Live of the query packet. It is decreased by 1 every time the query packet is forwarded by a node.

multiplication factor This represents the number of nodes that can respond to the query packet. The most important role of this factor is to permit a fair distribution of the global success of a request among the nodes that forwarded the request.

search pattern The search pattern is represented as a list of pairs (Word, Word importance). The components of every pair have the following meaning:

Word A string of characters (e.g. "hello").

Word importance A number representing the importance of the word for the requester node.

packet history The packet history is represented as a list of entries (Node ID, Decision ID, Multiplication factor, Decision trust). The history is used in the feedback stage to adjust the routing parameters of every node along the forward path of the query packet. The components of every entry have the following meaning (for a node X):

Node ID The Node ID of the node X.

Decision ID The ID of the routing decision taken by node X.

Multiplication factor The multiplication factor assigned to the query packet by node X.

Decision trust The degree of trust associated by node X with the routing decision.

¹ In our simulation, PeerSim identifies every node uniquely by an ID. In a TCP/IP network this would be replaced by the IP address of the requester node.

² When not in a cycle-based simulation, this can be replaced by a timestamp.
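To make the layout of a query packet concrete, the sketch below models it as an immutable record together with a forwarding step. It is an illustration only: the class, the function and the field names are hypothetical, and dropping a packet whose TTL has reached 0 is an assumption about how the upper limit on the path length is enforced.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QueryPacket:
    request_id: str
    requester_node_id: str
    creation_time: int
    creating_ttl: int
    ttl: int
    multiplication_factor: float
    search_pattern: tuple        # ((word, word_importance), ...)
    packet_history: tuple = ()   # ((node_id, decision_id, mult_factor, trust), ...)


def forward(packet, node_id, decision_id, multiplication_factor, decision_trust):
    """Return the copy a node sends on, or None when the TTL is exhausted."""
    if packet.ttl <= 0:
        return None  # assumption: exhausted packets are dropped
    hop = (node_id, decision_id, multiplication_factor, decision_trust)
    return replace(
        packet,
        ttl=packet.ttl - 1,
        multiplication_factor=multiplication_factor,
        packet_history=packet.packet_history + (hop,),
    )
```

Because every forwarded copy carries its own history, the feedback stage can later walk the path backwards without any per-request state on the intermediate nodes beyond the decision table.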

4.6.2 The response packets

Each response packet has the following components:

response ID A unique string of characters that identifies every response in the network. This is used in the response stage to associate a query packet with every response packet.

request ID The request associated with this response.

responder node ID The Node ID of the responder node.

relevant documents The list of relevant documents on the responder node that satisfy the search pattern. Each element in the list is a triplet (Document ID, Document name, Document size). The components of every triplet have the following meaning:

Document ID The identifier of the document on the responder node. On the responder node, this uniquely identifies every document. This ID is needed to refer to the document on the responder node, in case it is selected for download.

Document name The name of the document on the responder node. This information is needed by the user (or user model mechanism) on the requester node. One of the main factors that determine the selection of a document is its name.

Document size The size of the document on the responder node. This information is needed by the user (or user model mechanism) on the requester node. Another main factor that determines the selection of a document is its size.

packet history The packet history is represented as a list of entries (Node ID, Decision ID, Multiplication factor, Decision trust). The history is used in the feedback stage to adjust the routing parameters of every node along the forward path of the query packet. The components of every entry have the same meaning as in the query packet (Section 4.6.1).





4.6.3 The feedback packets

Each feedback packet has the following components:

feedback ID A unique string of characters that identifies every feedback packet in the network.

request ID The ID of the request associated with this feedback packet.

success of route The total success of the current route (represented in the packet history).

packet history The packet history is represented as a list of entries (Node ID, Decision ID, Multiplication factor, Decision trust). The history is used in the feedback stage to adjust the routing parameters of every node along the forward path of the query packet. The components of every entry have the same meaning as in the query packet (Section 4.6.1).





4.7 The routing process

The routing process has 3 stages :

search stage In this stage the query packets are propagated through the network. Each node along the path makes a routing decision for every query packet.

response stage In this stage the nodes that received the query packets and have relevant documents create responses.

feedback stage In this stage the routing parameters of the nodes (along the routing path) are adjusted so that the routing process improves its efficiency.

4.7.1 The search stage

Overview

In the search stage , every node executes 3 steps:

1. receives a query packet

2. determines a list of nodes, where to forward the query packet

3. forwards the query packet

[Step 1] and [Step 3] are not explained further because they are trivial.

[Step 2] How does the node determine where to forward the query

In the following description, we will use the notations:

The Input Sets: The sets used by the algorithm:

1. The set of all the words that are contained in the search pattern (m words)

YSearch = {s1, s2, ..., sm}

2. The set of all the nodes that are contained in the node table (n nodes)

NKnown = {n1, n2, ..., nn}

3. The set of all the words that appear in the local word table (k words)

WTotal = WLocal ∪ WGlobal = {w1, w2, ..., wk}

4. The set of all the links between the words in the local words table and the nodes from the nodes table (n × k links)

LWord/Node = {l11, l12, ..., l1n, l21, ..., lkn}

5. The set where we keep all the routing decisions made on the local node (over a period of time)

DecisionRouting = {decision1, ..., decisionn}

6. The set of nodes to which a query packet must be forwarded

NForward = {n1, n2, ..., np}

7. The set of total trusts in the query packets

TForward = {t1, t2, ..., tp}

8. The set of multiplication factors for the query packets to be forwarded

MForward = {m1, m2, ..., mp}

Factors:

1. The Local Routing Coefficient.

LRC, 0 ≤ LRC ≤ 1

2. The Global Routing Coefficient.

GRC, 0 ≤ GRC ≤ 1

3. The query packet Multiplication Coefficient

QMC

Operators: We will use the following operators :

1. The set of links between the word set Y and node α:

SubsetLink(α, Y) = {λij ∈ LWord/Node, 1 ≤ i ≤ Card(WTotal), 1 ≤ j ≤ Card(NKnown), wi ∈ Y, nj = α}

2. The local trust set of the link set P:

TrustSetLocal(P) = {ti ∈ ℜ, ti = li,LocalImportance, li ∈ P, 1 ≤ i ≤ Card(P)}

3. The global trust set of the link set P:

TrustSetGlobal(P) = {ti ∈ ℜ, ti = li,GlobalImportance, li ∈ P, 1 ≤ i ≤ Card(P)}

4. Recommendation degree of trust set X:

Recomandation(X) = Σ_{i=1}^{Card(X)} xi

5. Local trust degree of word set Y, in node α

TrustLocal(α, Y) = Recomandation(TrustSetLocal(SubsetLink(α, Y)))

6. Global trust degree of word set Y, in node α

TrustGlobal(α, Y) = Recomandation(TrustSetGlobal(SubsetLink(α, Y)))

7. The total trust stored in a trust set X:

TotalTrust(X) = Σ_{i=1}^{Card(X)} xi

The algorithm by which the node determines where (and in how many copies) to forward the query:

{Calculate the total trust for all the known nodes}
for all α ∈ NKnown do
    T[α] ← LRC ∗ TrustLocal(α, Y) + GRC ∗ TrustGlobal(α, Y)
end for
{Calculate the multiplication factor for each known node}
for all α ∈ NKnown do
    M[α] ← (T[α] / TotalTrust(T)) ∗ QMC
end for
{Add all the routing decisions in the local routing table}
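The forwarding decision above can be sketched in Python as follows. This is a minimal sketch, not the project implementation: the function name and the link layout are hypothetical, the default LRC and GRC are the values from Section 5.5, the default QMC is an arbitrary placeholder, and leaving out nodes with zero trust (and forwarding nothing when the total trust is zero) is an assumption made to avoid a division by zero.

```python
def forwarding_plan(known_nodes, links, search_words,
                    lrc=0.8, grc=0.2, qmc=100.0):
    """Return {node: (total_trust, multiplication_factor)} for a search pattern.

    links maps (word, node) -> (local_importance, global_importance).
    """
    trust = {}
    for node in known_nodes:
        # SubsetLink: the links between the search words and this node.
        subset = [links[(w, node)] for w in search_words if (w, node) in links]
        trust_local = sum(local for local, _ in subset)
        trust_global = sum(glob for _, glob in subset)
        trust[node] = lrc * trust_local + grc * trust_global
    total = sum(trust.values())
    if total == 0:
        return {}  # assumption: with no trust at all, nothing is forwarded
    # Each node receives a share of QMC proportional to its total trust.
    return {n: (t, t / total * qmc) for n, t in trust.items() if t > 0}
```

By construction, the multiplication factors of the selected nodes add up to QMC.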

Adjustments for the node table entries

After the algorithm selects the set of nodes to which the received query is forwarded, we must adjust several parameters for those nodes. These parameters, which are associated with every entry, are used by the component that manages the size of the node table.

For every selected node γ, we must adjust the parameters of its entry E[γ] in the node table:

1. Adjust the reference counter

E[γ]ReferenceCounter ← E[γ]ReferenceCounter + 1

2. Shift the reference history

E[γ]ReferenceHistory ← E[γ]ReferenceHistory << 1

3. Increase the local importance of the node by the total TrustLocal associated with its routing decision

E[γ]LocalImportance ← E[γ]LocalImportance + TrustLocal

4. Increase the global importance of the node by the total TrustGlobal associated with its routing decision

E[γ]GlobalImportance ← E[γ]GlobalImportance + TrustGlobal
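The four adjustments can be sketched on a node entry stored as a dictionary. The names are hypothetical, and keeping the history in a fixed number of bits (history_bits, standing for the "last n cycles") is an assumption: the text only specifies the shift itself.

```python
def record_selection(entry, trust_local, trust_global, history_bits=8):
    """Apply the four adjustments to the entry of a selected node."""
    entry["reference_counter"] += 1
    # Shift the history left by one and keep only the last history_bits bits
    # (the fixed width is an assumption of this sketch).
    mask = (1 << history_bits) - 1
    entry["reference_history"] = (entry["reference_history"] << 1) & mask
    entry["local_importance"] += trust_local
    entry["global_importance"] += trust_global
    return entry
```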

Observations regarding the search stage

The maximum number of query packets The multiplication factor and the initial TTL control the maximum number of query packets that will be generated in the network by a certain request. The relation is :

PacketsNumber/Query = TTLInitial ∗ QMC

The multiplication factor This factor is used primarily in a mechanism for the distribution of responsibility. When a node takes a routing decision, the responsibility associated with it is directly proportional to the QMC associated with the forwarded query packet.

4.7.2 The response stage

Overview

In the response stage , every node executes 3 steps :

1. receives a query packet

2. creates a response packet (if it has relevant information for the search pattern)

3. sends the response packet

The third step is not discussed separately, because it is determined by the way we choose to execute [Step 1].

[Step 1] Receiving a query packet

The main question here is :

How does a node determine which is the correct query packet?

The problem discussed here arises because the system is open to the possibility of duplicated query packets arriving at the same responder node.

In such a situation a responder node receives (over a period of time) several copies (query packets) of the same search.

Figure 4.1: Duplicated queries

There are several ways in which the system can treat the problem:

Process everything With this solution the responder node processes every query packet as soon as it arrives.

Advantages The requester node decides which of the paths were better. The feedback mechanism is controlled completely by the requester node.

Disadvantages

1. Additional response traffic due to duplicated responses.

2. Less information relevant for the search pattern is gathered (fewer paths, fewer documents, etc.).

Process one, ignore others With this solution the responder node chooses only one packet from the query packets.

Advantages There is less overhead response traffic.

Disadvantages

1. The feedback mechanism is controlled only partially by the requester node.

2. Less information relevant for the search pattern is gathered (fewer paths, fewer documents, etc.).

Process first, forward others With this solution the responder node processes the first query and forwards the others.

Advantages

1. The requester node decides which of the paths were better. The feedback mechanism is controlled completely by the requester node.

2. The requester node has the maximum amount of information that could be gathered in the specified network conditions.

When we treat this problem, we have to take into consideration the following constraint :

the entire feedback mechanism must be controlled by the requester node

We cannot choose the second solution ("Process one, ignore others") because we want the entire feedback mechanism to be controlled by the requester node. Finally, we choose the third solution ("Process first, forward others"), because it offers almost the same trade-off as the situation in which there are no duplicated queries.

[Step 2] The creation of a response packet

The main task of this step is the construction of a response that contains a list of the resources available on the responder node. These resources must match the search pattern present in the query packet. The matching algorithm searches for the pattern only in the names of the files.

In the following description, we will use the following notations:

The Input Sets:

1. The set of locally stored content (triplets) :

DLocal = {(dID, dName, dSize)1, ..., (dID, dName, dSize)h}

where :

dID The ID of the document.

dName The name of the document.

dSize The size of the document.

2. The set of all the words that are contained in the search pattern (m words)

YSearch = {s1, s2, ..., sm}

3. The set of documents (triplets) that will be sent in the response

DResponse = {(dID, dName, dSize)1, ..., (dID, dName, dSize)q}

Operators:

1. The subset of DLocal that contains the word α

Contains(α) = {β, β ∈ DLocal, ∃γ ∈ βName, α = γ}

The algorithm by which the node determines which documents are to be included in the response packet:

for all α ∈ YSearch do
    DResponse ← DResponse ∪ (Contains(α) − (Contains(α) ∩ DResponse))
end for
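A possible Python rendering of this selection is given below. It is a sketch under assumptions: the function name is hypothetical, document names are assumed to be split into words on whitespace, and the comparison is assumed to be case-insensitive; the text does not fix either detail.

```python
def build_response(local_docs, search_words):
    """Select the local documents whose name contains a search word.

    local_docs is a list of (doc_id, name, size) triplets. Each matching
    document appears only once in the result.
    """
    response = []
    for word in search_words:
        for doc in local_docs:
            name_words = doc[1].lower().split()  # assumed tokenisation
            if word.lower() in name_words and doc not in response:
                response.append(doc)
    return response
```

For example, with the documents "Blue Train" and "Kind of Blue" stored locally, the search pattern (blue, train) selects both of them, and "Blue Train" is included only once even though it matches two words.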

4.7.3 The feedback stage

Overview

The feedback stage consists of 4 steps :

1. create feedback packet

2. receive feedback packet

3. adjust routing parameters

4. forward the feedback packet

The first step of the feedback stage is performed only by the requester node. The following 3 steps (steps 2, 3 and 4) are executed by every node that appears in the forward list of the feedback packet.

[Step 1] The creation of the feedback packet

For every response packet received, the requester node sends a feedback packet along the forwarding path of the query packet. Every feedback packet contains a "success of route" field. This success of route depends on the user model.

We will use the following notations :

The Input Sets:

1. The set of documents (triplets) contained in the response packet


2. The set of documents selected (as successful) by the user model

DFeedback = {(dID, dName, dSize)1, ..., (dID, dName, dSize)k}

3. The set of feedback packets constructed by the requester node

PFeedback = {PacketFeedback(1), ..., PacketFeedback(n)}

Factors:

1. Success factor associated with the (selected) document x

Succesx

2. The maximum success factor

MAX SUCCESS

The user model used at this time is:³

1. Consider the entire set of documents, that appear in the response packet


2. Select the first 2 documents from DResponse

DFeedback = {(dID, dName, dSize)1, (dID, dName, dSize)2}

3. Associate the maximum success degree with each selected document

Succes1 = MAX SUCCESS,

Succes2 = MAX SUCCESS

4. Construct a feedback packet for each selected document

PacketFeedback(1) = {feedbackID1, requestID, Succes1, packethistory}

PacketFeedback(2) = {feedbackID2, requestID, Succes2, packethistory}

5. Construct the set of feedback packets for the current response

PFeedback = {PacketFeedback(1), PacketFeedback(2)}
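The simple user model above can be sketched as follows. The function and field names are hypothetical, generating the feedback IDs with UUIDs is an assumption (the text only asks for a unique string), and MAX_SUCCESS = 1 is the value given in Section 5.5.

```python
import uuid

MAX_SUCCESS = 1.0


def build_feedback(request_id, response_docs, packet_history, selected=2):
    """Build one feedback packet per selected document (the first `selected`)."""
    packets = []
    for _doc in response_docs[:selected]:
        packets.append({
            "feedback_id": uuid.uuid4().hex,       # assumed ID scheme
            "request_id": request_id,
            "success_of_route": MAX_SUCCESS,       # every selected document is a full success
            "packet_history": list(packet_history),
        })
    return packets
```

If the response contains fewer than two documents, this sketch simply produces fewer feedback packets.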


Figure 4.2: Conflicting feedback packets

[Step 2] Receiving the feedback packet

The problem discussed here arises because the current node may receive several different feedback packets for the same routing decision.

It is very important to treat this problem, because it affects the feedback mechanism and thus the way the routing parameters are adjusted. When we treat this problem, we have to take into consideration the following constraint :

there is no limit on the time interval between successive feedback packets

Because of this constraint, we cannot assume any correlation between successive feedback packets, neither on the requester node nor on any of the nodes along the forwarding path.

To make the point clear, consider the following example (from the figure above):

³ See the Appendix for observations regarding an improved version of the user model.

• N1 issues a search

• different query packets arrive at N3 and N5

• both N3 and N5, create response packets (different)

• the response packets arrive at N1

• at t1, N1 sends a feedback message (F1) along the forward path (P1) of the successful query packet

• at t2, N1 sends another feedback message (F2) along the forward path (P2) of the successful query packet

• t2 − t1 = ?

Thus we have to treat every feedback packet independently of the others. The problem of successive feedback packets cannot be solved under the constraint mentioned above.

We will treat separately every feedback packet that arrives on a node.

[Step 3] Adjustment of the routing parameters

This is the core component of the routing mechanism. The goal of this step is to adjust the local routing parameters so that the efficiency of the routing mechanism improves over time.

The Input Sets:

1. The feedback packet processed

PFeedback = {feedbackID, requestID, successRoute, packetHistory}

where the packetHistory is a list of triplets, each with the following fields



Multiplication factor The multiplication factor assigned to the query packet by node X.

Decision trust The degree of trust associated by node X with the routing decision.

2. The set where we keep all the routing decisions made on the local node (over a period of time)

DecisionRouting = {decision1, ..., decisionn}

3. The set of all the links between the words in the local words table and the nodes from the nodes table (n × k links)

LWord/Node = {l11, l12, ..., l1n, l21, ..., lkn}

Operators:

1. Cumulative multiplication factor of the feedback packet

Λ() = Σ_{i=1}^{Card(packetHistory)} packetHistory[i](Multiplicationfactor)

2. Total Success for node α

Γ(α) = (packetHistory[α](Multiplicationfactor) / Λ()) ∗ successRoute

3. Total Success for node α in routing decision β

Υ(α, β) = Γ(α) ∗ decisionIDβ(Decisiontrust)

4. Total Local Success for node α in routing decision β

Ψ(α, β) = Υ(α, β) ∗ decisionβ(LRC)

5. Total Global Success for node α in routing decision β

Φ(α, β) = Υ(α, β) ∗ decisionβ(GRC)

6. Local Success of word α in decision β on node γ

∆(α, β, γ) = Ψ(γ, β) ∗ decisionβ(LocalTrust[α])

7. Global Success of word α in decision β on node γ

Θ(α, β, γ) = Φ(γ, β) ∗ decisionβ(GlobalTrust[α])

The algorithm by which the node adjusts its routing parameters:

γ ← nextNode
for all α ∈ decisionpacketHistory[currentnode](DecisionID)(wordlist) do
    for all λ ∈ LWord/Node do
        λαγ(LocalImportance) ← ∆(α, DecisionID, γ) ∗ decisionDecisionID(LocalTrust[α])
        λαγ(GlobalImportance) ← Θ(α, DecisionID, γ) ∗ decisionDecisionID(GlobalTrust[α])
    end for
end for
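The chain of operators Λ, Γ, Υ, Ψ, Φ, ∆ and Θ can be followed more easily in code. The sketch below covers the operators only, not the final update of the links. It rests on several assumptions: the function and key names are hypothetical, the packet history is a list of dictionaries with one entry per node, the decision trust is read from the packet history entry of the node, and the trust lists of a decision are aligned with its word list.

```python
def node_success(packet_history, node_id, success_route):
    """Gamma: the share of the route success attributed to node_id."""
    total = sum(h["multiplication_factor"] for h in packet_history)  # Lambda
    own = next(h for h in packet_history if h["node_id"] == node_id)
    return own["multiplication_factor"] / total * success_route


def word_success(packet_history, node_id, success_route, decision, word):
    """Return (Delta, Theta): local and global success of a word in a decision."""
    own = next(h for h in packet_history if h["node_id"] == node_id)
    upsilon = node_success(packet_history, node_id, success_route) * own["decision_trust"]
    psi = upsilon * decision["lrc"]   # total local success
    phi = upsilon * decision["grc"]   # total global success
    i = decision["word_list"].index(word)
    return psi * decision["local_trust"][i], phi * decision["global_trust"][i]
```

A node that carried 60% of the cumulative multiplication factor is thus credited with 60% of the success of the route, which is then split between the local and the global side by LRC and GRC.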


4.8 How to keep a constant size of the node table

This is the main goal of our intervention in the PlanetP protocol. We therefore use node entry replacement algorithms that take into account the importance and the frequency of use of every node in the node table. When we add a node entry and the maximum size of the node table would be exceeded, we replace an old node with the new one.

At this moment we use the Aging Replacement Algorithm. This algorithm compares the usage over the last n cycles for every entry in the node table. More recent uses take precedence over older ones.

The aging algorithm selects for deletion a node entry from the node table. The algorithm has the following step:

1. Find the node α with the minimal history counter

NodeSelected = {α ∈ NKnown, ∀β ∈ NKnown, αHistoryCounter < βHistoryCounter}

The replacement algorithm is applied on a subset of NKnown. This subset has a size of 10% of NKnown. The subset is obtained with the following algorithm:

1. Sort NKnown in ascending order by (αLocalImportance + αGlobalImportance) / 2

2. Select a subset Σ of NKnown with the first 10% of the entries

Card(Σ) = (1 / 10) ∗ Card(NKnown)

So the complete replacement algorithm we used has the following steps:

1. Obtain Σ, a subset of NKnown

Card(Σ) = (1 / 10) ∗ Card(NKnown)

2. Select the node to be deleted from Σ with the Aging algorithm

NodeSelected = {α ∈ Σ, ∀β ∈ Σ, αHistoryCounter < βHistoryCounter}
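The complete replacement algorithm can be sketched in a few lines of Python. The function and key names are hypothetical, and two details are assumptions of this sketch: at least one candidate is always kept, so that tables with fewer than 10 entries still yield a victim, and ties on the history counter are broken by the sort order.

```python
def select_victim(nodes):
    """Pick the node entry to replace from a non-empty list of entries.

    Each entry is a dict with local_importance, global_importance and
    history_counter keys.
    """
    # Step 1: the 10% of entries with the lowest average importance.
    ranked = sorted(
        nodes,
        key=lambda n: (n["local_importance"] + n["global_importance"]) / 2,
    )
    size = max(1, len(ranked) // 10)  # assumption: at least one candidate
    candidates = ranked[:size]
    # Step 2: among them, the entry with the smallest history counter.
    return min(candidates, key=lambda n: n["history_counter"])
```

Restricting the aging step to the least important entries means that a node that is rarely used but highly trusted is never considered for removal.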


Chapter 5

Important constants used by the routing mechanism

Several constants are of crucial importance for the optimal functioning of the proposed routing mechanism. Below, we give a detailed description of the way the constants are determined.

5.1 The initial multiplication factor

The multiplication factor is used in a responsibility management mechanism. The responsibility management mechanism is an important factor in the adaptability of our routing mechanism.

There are 3 important factors that influence the value of the multiplication factor:

1. TTL (Time-To-Live)

2. DegreeLinkAverage

3. MultiplicationMinimum

To see the importance of the TTL, we provide an example in which the multiplication factor is not correctly adjusted to the TTL of the query packets.

In the case shown in the picture, after TTL = 3 the multiplication factor no longer reflects the trust of the decision of the nodes (N3 and N4), because it has reached the minimum value of 1. At this point the multiplication factor cannot be divided further (into fractional values).

Figure 5.1: A case when the multiplication factor is too small

For each routing decision we must take into account the overall trust factor of the decision. Thus it is vital to give a minimum interval of multiplication for each routing decision.

The link degree expresses the number of neighbours a node has. The number of neighbours of a node is determined by the size of the routing table of that node. We must take this factor into consideration because, in the worst case, every node on the packet path decides to distribute the received packet among all its neighbours, with an equal multiplication factor. In this case we must be assured that the last nodes (i.e. TTL 5 level nodes) will receive a packet with the minimum multiplication factor.

If this condition is fulfilled, the multiplication factor associated with the search packets received by the last nodes will be able to represent accurately the trust associated (by the last nodes in the forward chain) with the respective routing decision.

Thus we have the equation:

MultiplicationInitial = TTL ∗ DegreeLinkAverage ∗ MultiplicationMinimum

We choose MultiplicationMinimum = 100. We consider that such a scale is sufficient for an optimal representation of the trust in the routing decision.

In this situation, we can calculate the value of MultiplicationInitial :

MultiplicationInitial = 10 ∗ 500 ∗ 100 = 500000

5.2 The number of entries in the routing table

This factor directly influences the resource usage of every node in the network. The size of the routing table depends mostly on the following two factors (the word table, and the links between words and nodes, have less than 1 …):

1. the size of an entry in the node table

2. the number of entries in the node table

The studies made with PlanetP have shown that a Bloom filter associatedwith a node has an average size of 1.9KB. Adding other flags/counters wecan estimate that the average size of an entry in the node table is 2.0KB. Atthe same time we appreciate that a P2P application shuould not use morethan 3 MB of RAM. In this case we should set as a default size of 1MB forthe routing table. In this case :

$Entries_{NodeTable} = \frac{TotalSize}{EntrySize} = \frac{1\,MB}{2\,KB} = 500$

In this case we must limit the number of nodes in the node table to 500.
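As a quick check of this limit, the division can be written out as below. The sketch assumes 1 MB = 1000 KB, which is the convention that yields the value 500 used in the text; the function name is hypothetical.

```python
def max_node_table_entries(table_size_kb: int, entry_size_kb: int) -> int:
    """Number of whole node-table entries that fit in the routing table."""
    return table_size_kb // entry_size_kb


# 1 MB routing table (taken as 1000 KB) and 2 KB per node-table entry.
print(max_node_table_entries(1000, 2))  # 500
```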

5.3 The value of Time-To-Live

This factor determines the number of nodes that can forward a query packet. Statistical studies of P2P networks have shown that a TTL of 10 is sufficient to provide an efficient search in large scale P2P networks.

At this moment, we will use a TTL=10.

5.4 The minimal match degree

The comparison of Bloom filters is a central operation in our system. When 2 Bloom filters are compared, the algorithm returns a match degree. We must define a threshold that will allow us to make the distinction between:

1. the filters that don’t match (DegreeMatch < DegreeThreshold)

2. the filters that match (DegreeMatch ≥ DegreeThreshold)

We propose a statistical method for the determination of this factor :


1. Take each word α from $W_{Total}$:

$\alpha \in W_{Total}$

2. Construct an arbitrary, but typical filename¹:

$Name_{AverageFile} = \{\alpha, w_2, w_3, \ldots, w_6\}$, where $w_i \neq w_j$ for $i \neq j$, and $w_i \neq \alpha$

3. Construct a Bloom filter associated with α:

$FB_{\alpha}$

4. Construct a Bloom filter associated with $Name_{AverageFile}$:

$FB_{AverageFile}$

5. Compare $FB_{\alpha}$ with $FB_{AverageFile}$ and obtain the match degree $\gamma_i$:

$\gamma_i = Compare(FB_{\alpha}, FB_{AverageFile})$

6. Compute the average match degree β:

$DegreeThreshold = \beta = \frac{\sum_{i=1}^{Card(W_{Total})} \gamma_i}{Card(W_{Total})}$

At this stage of the project we calculate the DegreeThreshold in the word table initialisation phase.
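The statistical method can be sketched as follows. Several points are assumptions made only for illustration: a filter is modelled as the set of its set bit positions, the filter size and the number of hash functions are arbitrary, salted SHA-256 stands in for the hash functions, the other words of the filename are taken from the same vocabulary, and `compare` is assumed to return the share of the filename's bits that are also set in the word's filter (this section does not define Compare).

```python
import hashlib

FILTER_SIZE = 1024  # assumed number of bits in a filter
HASH_COUNT = 3      # assumed number of hash functions


def bloom_filter(words):
    """Return the set of bit positions set by the given words."""
    bits = set()
    for word in words:
        for k in range(HASH_COUNT):
            digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % FILTER_SIZE)
    return bits


def compare(fb_a, fb_b):
    """Assumed match degree: share of the bits set in fb_b that are also set in fb_a."""
    if not fb_b:
        return 0.0
    return len(fb_a & fb_b) / len(fb_b)


def degree_threshold(w_total, words_per_name=6):
    """Average match degree between each word and a typical filename containing it."""
    words = list(w_total)
    total = 0.0
    for index, alpha in enumerate(words):
        # The other words of the filename are picked deterministically from the
        # vocabulary; they differ from alpha when the vocabulary is large enough.
        others = [words[(index + offset) % len(words)] for offset in range(1, words_per_name)]
        total += compare(bloom_filter([alpha]), bloom_filter([alpha] + others))
    return total / len(words)


vocabulary = ["the", "doors", "light", "my", "fire", "mp3", "love", "times"]
threshold = degree_threshold(vocabulary)
print(threshold)
```

Under these assumptions the threshold is the match degree that a single query word obtains against a typical filename that contains it.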

5.5 Routing factors

The relative importance factors for the local words are the following:

1. RISR = 0.5

2. RIUR = 0.2

3. RIQ = 0.2

4. RILF = 0.1

¹ An average filename has 7 different words.

The relative importance factors for the global words are the following:

1. RIFQ = 0.2

2. RIRF = 0.8

The relative importance of the routing recommendations:

1. LRC = 0.8

2. GRC = 0.2

3. MAXSUCCESS = 1

Intervals used to initialise different structures:

1. GWImin = 0, GWImax = 1

2. LNImin = 0, LNImax = 1

3. GNImin = 0, GNImax = 1
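The factors listed above can be kept together in one structure, as in the sketch below. The dictionary layout and the helper name are hypothetical. The check that the weights of each group sum to 1 is an observation about the listed values, not a requirement stated in the text; MAXSUCCESS and the initialisation intervals are left out because they are not weights.

```python
import math

# Values copied from the lists above; the grouping keys are illustrative.
ROUTING_FACTORS = {
    "local_words": {"RISR": 0.5, "RIUR": 0.2, "RIQ": 0.2, "RILF": 0.1},
    "global_words": {"RIFQ": 0.2, "RIRF": 0.8},
    "recommendations": {"LRC": 0.8, "GRC": 0.2},
}


def weights_are_normalised(group):
    """True when the weights of one group sum to 1 (within rounding error)."""
    return math.isclose(sum(group.values()), 1.0)


for name, group in ROUTING_FACTORS.items():
    print(name, weights_are_normalised(group))
```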


Part II

The project implementation


Chapter 6

Overview

6.1 Introduction

The development process of this project has modularity as a major goal. We want to have a modular project because:

• modularity will improve the development process: code writing, debugging, documenting

• modularity will allow easy replacement of different parts (very useful for testing different approaches to the same problem)

• modularity will allow us to reuse code in other projects

Across the entire project we try to keep the input data (configuration files, input files) and output data (log files, statistics files, graphical files, etc.) as modular as possible.

We will try to have portable, modular and expressive content for all our files (configuration/input/output files), so we will use XML as standard.

At this stage of development the RARE II project has 5 main components. Below, we briefly describe the relation between the components.

The PeerSim component [2] is a simulation engine for large scale networks. At this stage of the project we use the latest PeerSim version (version 1.0). PeerSim is very minimalistic, accepting on top of it a modular stack of protocols. On top of PeerSim we have the network overlay (Gnutella overlay), consisting of several stacked protocols. Some of those protocols are Gnutella specific, others are implemented for logging/observing purposes. The routing mechanism links the PeerSim module and the network overlay module. We chose a separate routing module because most of our research is localised here.

Apart from these, we have two more components :


1. the statistics gathering system

2. the graphical visualisation system

The statistics gathering system is nothing more than a collection of probes that report various parameters from our network at every cycle of the simulation. The graphical visualisation system processes the outputs provided by the statistics gathering system and generates different graphics.


Figure 6.1: Overview of the project’s components


6.2 The PeerSim simulator

6.2.1 A short description

PeerSim is a network simulator. It is very flexible, allowing both event driven and cycle driven simulation. We use it to simulate large scale networks (usually >1500 nodes).

6.2.2 PeerSim main advantages

PeerSim was chosen from a wide variety of available simulators, for the following reasons:¹

1. implemented in a high level language (Java)

2. it is very modular

3. it is parametrisable

4. the development of protocols is relatively easy

5. relatively well documented

6.2.3 How does PeerSim work

In PeerSim, we have 2 types of entities:²

Protocols The protocol entities define the behaviour of the nodes (in each cycle/event). The protocols are defined in the protocol section of the configuration file. The most important parameters of every protocol are:

name Each protocol is identified by a String

lnk The name of the protocol that is below the current protocol

class The class that implements the functionality of that protocol

Dynamics The dynamic³ entities are used to implement the initialiser components and the observers. Those components are defined in the control and the init sections of the PeerSim configuration file. Their parameters are customised for each component.

¹ For the criteria used to evaluate network simulators, see [10].

² To understand in detail how PeerSim works and how to implement new Protocols/Dynamics entities, see the tutorials [8], [9].

³ There have been major changes from PeerSim-0.4 to PeerSim-1.0 in this section. For a detailed description of the major changes, see [8].


The behaviour of the entire PeerSim is defined in the global configuration file⁴. In the configuration file we define:

• simulation related parameters (number of cycles, simulation type, etc.)

• the protocol stack for each node

• protocol specific parameters

• network initialisers (that define the initial state of the network: architecture, documents, queries & replies)
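To make the structure of such a configuration file concrete, the sketch below reads a few "key value" lines into a dictionary. It is not PeerSim's own configuration loader; `parse_config` is a hypothetical helper, and the sample lines are taken from the global configuration file shown in the appendix.

```python
def parse_config(text):
    """Read 'key value' lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings


SAMPLE = """
# Global variables
simulation.cycles 20
network.size 20
protocol.0 coucheReseauLogique.ProtocolSemanticReseauLogique
init.0.protocol 0
"""

config = parse_config(SAMPLE)
# Keys of the form protocol.N name the classes of the protocol stack.
protocols = sorted(k for k in config if k.startswith("protocol.") and k.count(".") == 1)
print(config["simulation.cycles"], protocols)
```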

6.3 The network overlay

The network overlay was developed in the previous stage of the RARE II project. It was developed according to the Gnutella specifications. This component has 3 main subcomponents:

the overlay subcomponent This subcomponent stays at the base of the network overlay, implementing the routing mechanism by providing, on request, a list of neighbours for the current node of the network.

the query subcomponent This subcomponent manages all the message queues for the current node (incoming, outgoing and forwarding). It uses the overlay subcomponent to forward/send messages.

the query hit subcomponent This subcomponent stays at the top of the network overlay, implementing the downloading of files and other less important functionality.

On initialisation it will have to read the following configuration files:

documents definitions A list of documents that will be spread across the entire network.

documents distribution A list of (document ID, peer ID) pairs, used to specify the distribution of the documents on the peers.

queries definitions A list of queries & responses (with their associated cycle number) that will be inserted in the network.

queries distribution A list of (query ID, peer ID) pairs, used to specify the distribution of the queries on the peers.

⁴ See Appendix X for a sample configuration file.


6.4 The routing mechanism

The routing subcomponent interacts with the network overlay, providing a list of neighbours that are optimal for the routing process. Through this component, every node has a routing table. We will implement different routing mechanisms, that will manage:

• updating an entry in the routing table

• deleting an entry in the routing table

• adding an entry in the routing table

On initialisation, we can select from the different routing algorithms available. This is one core subcomponent of our project. An important proportion of the development will be done here.

6.5 Statistics gathering system

The statistics gathering system is developed in close relation to the observing mechanisms provided by PeerSim. All the statistics from the system are implemented as observers⁵ (PeerSim terminology).

Each observer, to work properly, has to have a corresponding protocol in the protocol stack of each node. This gives it the ability to gather information from each node, at each cycle.

The initialisation of the observers is defined in the global configuration file of PeerSim.

6.6 Graphical visualisation system

This component is somewhat separate from the rest of the project. It has different subcomponents, for different types of graphical representation. It uses as input different XML files, which are obtained from the translators (that take as input the reports obtained from the observers). At this stage of the project we support 2 types of graphical outputs:

1. VRML format (.wrl files)

2. PNG format

⁵ For practical documentation on how to implement an observer in PeerSim, see [8].


Chapter 7

Data gathering components

7.1 Observer 1

7.1.1 What it observes

This observer is run in every cycle and collects the following information from each node:

total forwarded messages the sum of the numbers of messages that were forwarded in the cycle by every node

cycle number the number of the cycle in which the forwarded messages were counted

7.1.2 How it observes

In every cycle it uses the myGNutella Variables class to obtain the total number of forwarded messages in that cycle.

This observer must be run as a control in PeerSim, so it must be declared in the control section of the PeerSim configuration file.

7.1.3 The output generated

After the last cycle, this observer generates an XML file with the couples (total forwarded messages, cycle number).¹
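A report of this kind could be produced along the following lines. The element and attribute names (`forwarded_messages_report`, `sample`) are hypothetical and the sample values are made up for the illustration; the real output format is the one shown in the appendix referenced above.

```python
import xml.etree.ElementTree as ET


def build_report(samples):
    """Build an XML tree with one element per (cycle number, total forwarded messages) couple."""
    root = ET.Element("forwarded_messages_report")
    for cycle_number, total_forwarded in samples:
        ET.SubElement(
            root,
            "sample",
            cycle_number=str(cycle_number),
            total_forwarded_messages=str(total_forwarded),
        )
    return root


# Made-up values for three cycles.
report = build_report([(0, 12), (1, 30), (2, 27)])
print(ET.tostring(report, encoding="unicode"))
```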

¹ See Appendix Xxx for an output example.


7.2 Observer 2

7.2.1 What it observes

This observer is run in every cycle and collects the following information from each node:

average routing table size the average size of the routing tables of all the nodes in the network

cycle number the cycle number

7.2.2 How it observes

In every cycle it uses the RoutingTableList from the myGnutella Variable class. From this it finds out the size of the routing table, in the current cycle, of all the peers in the network.

This observer must be run as a control in PeerSim, so it must be declared in the control section of the PeerSim configuration file.


7.2.3 The output generated

After the last cycle, this observer generates an XML file with the couples (routing table average size, cycle number).



Chapter 8

Graphical visualisation components

8.1 The vrml-graph-plus package

8.1.1 Introduction

Because in our experiments we use large scale networks (with thousands of nodes), a graphical representation of our architectures gives us the possibility to make intuitive judgements regarding the results of our work. To achieve a high level of expressiveness, we propose the development of a 3D display module. The graphical representations generated by our module should allow the representation of:

1. the nodes in the network (with the following characteristics):

• shape

• size

• transparency

• colour

• caption

2. links between the defined nodes (with the following characteristics):

• shape

• size

• colour

• transparency


After studying different ways to implement the 3D display environment, we chose VRML.

8.2 VRML 2.0

8.2.1 A short description

The VRML 2.0 standards¹ were specified in August 1996. Those standards describe how to construct a 3D world, where both static and dynamic aspects can be specified. All the specifications of the 3D world are described in a .wrl file, in a tree-like manner.

To be able to view and explore the VRML world, the .wrl file must be loaded in a VRML-aware browser (e.g. Cosmo Player [1], White Dune [7], etc.).

8.2.2 VRML design criteria

Authorability² Enable the development of computer programs capable of creating, editing, and maintaining VRML files, as well as automatic translation programs for converting other commonly used 3D file formats into VRML files.

Composability Provide the ability to use and combine dynamic 3D objects within a VRML world and thus allow re-usability.

Extensibility Provide the ability to add new object types not explicitly defined in VRML.

Be capable of implementation Capable of implementation on a wide range of systems.

Performance Emphasize scalable, interactive performance on a wide variety of computing platforms.

Scalability Enable arbitrarily large dynamic 3D worlds.

8.2.3 VRML 2.0 - The main advantages

Platform independence VRML is platform independent. All you need is the file that describes the VRML world (the .wrl file) and a VRML browser that runs on your OS.

¹ See the VRML specifications at [4].

² The VRML design criteria are taken from the VRML 1996 specifications at [3].


Interpreted script VRML is an interpreted script language. Thus, it is very easy to understand, to debug and to extend.

Very powerful VRML is a very powerful graphical environment, permitting:

1. implementation of static and dynamic graphics

2. shape rendering => graphical expressiveness

3. easily navigable 3D environment

Maturity VRML is a very mature technology. It is well understood and many applications are using it.

Free VRML is free to use.

8.3 Our implementation of the display environment

8.3.1 The vrmlgraph package

As a starting point, we used vrmlgraph³, which is a 3-D VRML graph drawing package in Java. This package offers the following facilities:⁴

• Storage of Nodes and Edges of a 3-D graph in a single GraphData object.

• Parsing descriptions of connected nodes from a graph program text file and populating a GraphData object with them. The nodes do not need to specify their x,y,z locations.

• Performing 3-D spring embedding calculations to produce (often) aesthetically pleasing graphs from any input.

• Center any 3-D graph about the origin.

• Output a text file that describes the currently stored 3-D graph.

• Output a VRML file that shows a 3-D view of the current graph.

³ The homepage of this package is [5].

⁴ The list of vrmlgraph facilities was taken from the vrmlgraph homepage, at [5].


Although the package is versatile, we needed to create graphs with more customisation. We extended the vrmlgraph package and obtained a new package, vrmlgraphspecial.⁵

8.3.2 The vrmlgraphspecial package

The vrmlgraphspecial package has the following additional functionality:

• Storage of the description of the nodes' appearance in an XML configuration file. For each node, we can specify the following:

– shape

– size

– transparency

– colour

– caption

• Storage of the edges' description in an XML configuration file. For each edge, we can specify the following:

– shape

– size

– colour

– transparency

• Storage of the default appearance in XML configuration files, for the nodes and the edges not explicitly defined.

Due to these extensions we can produce very expressive graphical representations.⁶

For detailed documentation of the vrmlgraphspecial package (including configuration examples) see Appendix Y.

⁵ Another extension of the vrmlgraph package has been made by Ramon Wartala. For his implementation see [6]. But it is not for public use, so we had to create an extension from scratch.

⁶ See Appendix X for screenshots.


Chapter 9

Data representation & generation components

9.1 The XML Writer package

9.1.1 Description

This component implements a tree, where each node represents an XML element. The tree can be written out to an XML file. The component is very useful, because it allows us to write XML files very conveniently.

9.1.2 How it is used

The component implements different methods for tree data structure management. Because we use this component mainly for XML file generation, we use only the method for adding nodes (AddNode). After the tree is constructed, we specify the path of the XML file (SetFilePath). Finally we call the method for XML generation, specifying as argument the tree constructed in the previous step (TreeNodeParser).
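The idea of a tree whose nodes are XML elements can be illustrated with the minimal sketch below. It is not the package's code: `TreeNode`, `add_node` and `to_xml` are hypothetical stand-ins for the tree structure and the AddNode and generation steps described above, and writing the result to the path given with SetFilePath is left out.

```python
from xml.sax.saxutils import escape


class TreeNode:
    """One XML element: a name, optional text and a list of child elements."""

    def __init__(self, name, text=""):
        self.name = name
        self.text = text
        self.children = []

    def add_node(self, name, text=""):
        """Append a child element and return it, so that calls can be nested."""
        child = TreeNode(name, text)
        self.children.append(child)
        return child


def to_xml(node, depth=0):
    """Serialise the tree depth-first, indenting by two spaces per level."""
    indent = "  " * depth
    if not node.children:
        return f"{indent}<{node.name}>{escape(node.text)}</{node.name}>\n"
    inner = "".join(to_xml(child, depth + 1) for child in node.children)
    return f"{indent}<{node.name}>\n{inner}{indent}</{node.name}>\n"


root = TreeNode("report")
sample = root.add_node("sample")
sample.add_node("cycle_number", "3")
sample.add_node("total_forwarded_messages", "42")
print(to_xml(root), end="")
```

In this sketch the text of a node that has children is ignored, which keeps the serialisation simple.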

9.2 The Input Data Generators

9.2.1 Description

This component generates the input data for our simulations. This way we can control very well how data is distributed on the peers of the network, and simulate user behaviour. This will prove more useful in a future stage of development, when we will develop different models of user behaviour.


9.2.2 Query distribution generator

This subcomponent creates a queries distribution input file. It is highly customizable. Its parameters are:

outputFile the path of the file to which the generated distribution is written

cycleNumber defines the cycle interval [0,cycleNumber] in which it can distribute the queries

networkSize defines the peer interval [0,networkSize] from which it can choose the peer IDs that receive queries

queriesNumber defines the query interval [0,queriesNumber] from which it can choose the query IDs to distribute

minQueriesPerNode defines the minimal number of queries distributed on a node

maxQueriesPerNode defines the maximal number of queries distributed on a node

minQueryTTL defines the minimal TTL of a query

maxQueryTTL defines the maximal TTL of a query

Using those parameters the generator outputs a file that describes the way the queries are distributed on the peers of the network.
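A generator driven by these parameters might look like the sketch below. It makes several assumptions that the text does not settle: values are drawn uniformly, peer IDs, query IDs and cycles are taken from half-open ranges starting at 0, a fixed seed is used for reproducibility, and the result is returned as a list of records instead of being written to outputFile. All names are illustrative.

```python
import random


def generate_query_distribution(cycle_number, network_size, queries_number,
                                min_queries_per_node, max_queries_per_node,
                                min_query_ttl, max_query_ttl, seed=0):
    """Assign a random number of queries, with random cycle and TTL, to every peer."""
    rng = random.Random(seed)
    entries = []
    for peer_id in range(network_size):
        for _ in range(rng.randint(min_queries_per_node, max_queries_per_node)):
            entries.append({
                "peer_id": peer_id,
                "query_id": rng.randrange(queries_number),
                "cycle": rng.randrange(cycle_number),
                "ttl": rng.randint(min_query_ttl, max_query_ttl),
            })
    return entries


entries = generate_query_distribution(
    cycle_number=20, network_size=5, queries_number=10,
    min_queries_per_node=1, max_queries_per_node=3,
    min_query_ttl=2, max_query_ttl=7,
)
print(len(entries), entries[0])
```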


Chapter 10

Statistics processing system

10.1 Query Send & Hit Receive

10.1.1 What it does

It uses the logs describing the simulation, which were generated by the LogHandler component. From this log file, it extracts the couple (query send time, hit receive time) for each successful query.

10.1.2 The input processed

The log file processed is an XML file. In this file, every query has an associated time of sending and an associated time of response (hit time). This component processes only the successful queries. It constructs a two-dimensional representation of the set of couples.

It uses the GnuplotWrapper to generate files that can be parsed by Gnuplot. The generated files contain lines that will be drawn by Gnuplot.

The graphical representation has the following characteristics:

OX axis it represents the sending time for the query

OY axis it represents the receiving time for the hit




10.2 Messages forwarded per cycle

10.2.1 What it does

It uses the logs describing the simulation, which were generated by the LogHandler component. From this log file, it extracts the couple (messages forwarded, cycle number) for each cycle.

The log file processed is an XML file. From this file we extract the number of messages that were forwarded in every cycle (obtained with an observer). It constructs a two-dimensional representation of the set of couples.




OX axis it represents the number of messages forwarded in the current cycle

OY axis it represents the number of the cycle


10.3 Average routing table size in every cycle

10.3.1 What it does

It uses the logs describing the simulation, which were generated by the LogHandler component. From this log file, it extracts the couple (average routing table size, cycle number) for each cycle.

The log file processed is an XML file. From this file we extract the average routing table size in every cycle (obtained with an observer). It constructs a two-dimensional representation of the set of couples.






OX axis it represents the average size of the routing table

OY axis it represents the number of the cycle




Bibliography - Overall Presentation

[1] Cosmo player homepage. http://www.ca.com/.

[2] The peersim project page at sourceforge.net. http://sourceforge.net/projects/peersim/.

[3] The virtual reality modeling language - design criteria. http://tecfa.unige.ch/guides/vrml/vrml97/spec/part1/introduction.html.

[4] The virtual reality modeling language specification. http://www.graphcomp.com/info/specs/sgi/vrml/spec/.

[5] Vrmlgraph package homepage. http://vrmlgraph.i-scream.org.uk/.

[6] Vrmlgraphplus package homepage. http://www.wartala.de/projects.html.

[7] White dune player homepage. http://www.csv.ica.uni-stuttgart.de/vrml/dune/.

[8] Gian Paolo Jesi. Peersim howto: Build a new protocol for the peersim 1.0 simulator. PeerSim Homepage - Documentation.

[9] Gian Paolo Jesi. Peersim howto: Build a topology generator for peersim 1.0. PeerSim Homepage - Documentation.

[10] Abbas Slimani. Génération d'un jeu de données pour un simulateur pair-à-pair modulaire. Rapport de stage - Master 2 Recherche, Université Paris XI - Orsay.


Bibliography - P2P Technology & The routing process in P2P networks

[1] The gnutella homepage. http://www.gnutella.com/.

[2] The kazaa homepage. http://www.kazaa.com.

[3] The napster homepage. http://www.napster.com.

[4] Ian Clarke. A distributed decentralised information storage and retrieval system, October 02 1999.

[5] Francisco Matias Cuenca-Acuna, Christopher Peery, Richard P. Martin, and Thu D. Nguyen. PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities. In Twelfth IEEE International Symposium on High Performance Distributed Computing (HPDC-12), pages 236–246. IEEE Press, June 2003.

[6] Neil Daswani, Hector Garcia-Molina, and Beverly Yang. Open problems in data-sharing peer-to-peer, October 24 2002.

[7] L. Garcés-Erice, E. W. Biersack, P. A. Felber, and K. W. Ross. Hierarchical peer-to-peer systems, June 02 2003.

[8] Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications, June 18 2001.

[9] Beverly Yang and Hector Garcia-Molina. Designing a super-peer network.

[10] Antony Rowstron. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems, September 23 2001.

[11] Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John D. Kubiatowicz. Tapestry: A resilient global-scale overlay for service deployment, June 04 2003.


Part III

Appendixes


Appendix A

The Bloom filter

What is a Bloom filter?

A Bloom filter is a very efficient way to make searchable summaries of texts. The particularity of this technique is that it cannot be used to retrieve the stored data. The only ways you can interact with a Bloom filter are:

1. add a word in the Bloom filter

2. check if a word exists in the Bloom filter

The advantages of Bloom filters :

1. a constant number (= size of Bloom filter) of steps is needed to add a word to the filter

2. a constant number (= size of Bloom filter) of steps is needed to check if a word is in the filter

3. the filter cannot give false negatives (say that a word that was added is not in the Bloom filter)

There are 2 factors, that characterise any Bloom filter:

filter size the size of the bit array, used to represent the Bloom filter

saturation factor the number of set bits out of the size of the array⁴

⁴ If this factor gets too high, the Bloom filter will become unusable, because of the high rate of false positives.


Algorithms used for Bloom filters management

Creating a new Bloom filter

F = array<bits>   {the array of bits associated with the Bloom filter}
for i = 0 to Card(F) − 1 do
    F[i] = 0   {mark as empty all the bits in the array}
end for

Adding a word in a Bloom filter

α   {the word to be added in the filter}
H = array<HASHFUNCTIONS>   {an array of hash functions}
V = array<bytes>   {a temporary array, used for calculations}
for i = 0 to Card(H) − 1 do
    V[i] = H_i(α)   {calculate the output of each hash function, applied on the word}
end for
for i = 0 to Card(H) − 1 do
    F[V[i] % Card(F)] ← 1   {set the appropriate bits in the Bloom filter}
end for

Check if a word is in a Bloom filter

α   {the word to be looked for in the filter}
T = array<bits>   {a temporary Bloom filter used to store the word we are looking for}
H = array<HASHFUNCTIONS>   {an array of hash functions}
V = array<bytes>   {a temporary array, used for calculations}
for i = 0 to Card(H) − 1 do
    V[i] = H_i(α)   {calculate the output of each hash function, applied on the word}
end for
for i = 0 to Card(H) − 1 do
    T[V[i] % Card(F)] ← 1   {set the appropriate bits in the Bloom filter associated with the searched word}
end for
counterBoth ← 0
counterWord ← 0
for i = 0 to Card(F) − 1 do
    if F[i] = T[i] = 1 then
        counterBoth ← counterBoth + 1   {how many bits are set in both filters}
    end if
    if T[i] = 1 then
        counterWord ← counterWord + 1
    end if
end for
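The three algorithms above can be tried out with the following sketch. Salted SHA-256 digests stand in for the array of hash functions, and the default of 3 hash functions is arbitrary; both are assumptions. The pseudocode stops after computing the two counters, so the final decision used here (the word is reported as present when counterBoth equals counterWord) is also an assumption.

```python
import hashlib


def hash_positions(word, hash_count, filter_size):
    """Bit positions selected for a word by hash_count salted hash functions."""
    positions = []
    for k in range(hash_count):
        digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % filter_size)
    return positions


def new_filter(filter_size):
    """Create an empty Bloom filter: every bit is 0."""
    return [0] * filter_size


def add_word(bits, word, hash_count=3):
    """Set the bits selected by every hash function for the word."""
    for position in hash_positions(word, hash_count, len(bits)):
        bits[position] = 1


def contains_word(bits, word, hash_count=3):
    """Compare the filter with a temporary filter that holds only the word."""
    temp = new_filter(len(bits))
    add_word(temp, word, hash_count)
    counter_both = sum(1 for f, t in zip(bits, temp) if f == 1 and t == 1)
    counter_word = sum(temp)
    return counter_both == counter_word


bloom = new_filter(1024)
for w in ["the", "doors", "light", "my", "fire", "mp3"]:
    add_word(bloom, w)
print(contains_word(bloom, "doors"))   # True: an added word is always found
print(sum(bloom) / len(bloom))         # saturation factor of the filter
```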


Appendix B

The query packet : components type & size

Component name       Data type       Fixed/Variable size   Data size
request ID           string          Fixed size            32 bytes
requester node ID    unsigned long   Fixed size            32 bytes
creation time        unsigned long   Fixed size            32 bytes
TTL                  byte            Fixed size            1 byte
search pattern       list            Variable size         -
packet history       list            Variable size         -

The response packet : components type & size

Component name       Data type       Fixed/Variable size   Data size
response ID          string          Fixed size            32 bytes
request ID           string          Fixed size            32 bytes
responder node ID    unsigned long   Fixed size            32 bytes
relevant documents   list            Variable size         -
packet history       list            Variable size         -

The feedback packet : components type & size

Component name       Data type       Fixed/Variable size   Data size
feedback ID          string          Fixed size            32 bytes
request ID           string          Fixed size            32 bytes
success of route     unsigned long   Fixed size            32 bytes
packet history       list            Variable size         -


Appendix C

The global configuration file

#----------------------- Global variables------------------

simulation.cycles 20

network.size 20

random.seed 20

requete.classe coucheRequeteReponse.GeneraleRequete

reponse.classe coucheRequeteReponse.GeneraleReponse

information.classe coucheReseauLogique.BloomFilter

# ------------------------ Protocols ------------------------

# Semantic logical network management

protocol.0 coucheReseauLogique.ProtocolSemanticReseauLogique

# Document management

protocol.1 coucheGestionDocuments.ProtocolSemanticPairDocument

protocol.1.semanticReseauLogique 0

# Query and response management

protocol.2 coucheRequeteReponse.ProtocolGestionRequeteReponse

# Gossiping protocol

protocol.3 coucheReseauLogique.gossiping.ProtocolGossiping

protocol.3.from 0

protocol.3.until 19

protocol.3.step 1

protocol.3.nodesByGossiping 1

protocol.3.gestionReseauLogiqueSemantic 0

# Information search

protocol.4 com.get.peersim.SystemesRechercheP2P.PlanetP2

protocol.4.degree 3

protocol.4.gestionRequeteReponse 2

protocol.4.gestionDocuments 1

protocol.4.gestionReseauLogique 0

# ------------------------ Initializers ----------------------

# Initialise the logical network

init.0 peersim.dynamics.WireFromFile

init.0.protocol 0

init.0.file network.txt

# Initialise the documents

init.1 coucheGestionDocuments.InitPairDocument

init.1.documentsDefinitionFile documents_definition.xml

init.1.documentsDistributionFile documents_distribution.xml

init.1.protocol 1

# Initialise the queries

init.2 coucheRequeteReponse.InitRequeteReponse

init.2.requetesDefinitionFile queries_definition.xml

init.2.requetesDistributionFile queries_distribution.xml

init.2.protocol 2

# Initialise the local index

init.3 coucheReseauLogique.InitReseauLogiqueSemantic

init.3.protocolReseauLogiqueSemantic 0

init.3.gestionDocuments 1

include.init 0 1 2 3

#----------------------Dynamics----------

# Manage changes at the physical network level

control.0 coucheReseauPhysique.DynamicReseauPhysique

control.0.file noeuds_modifications.xml

# Manage changes at the logical network level

control.1 coucheReseauLogique.DynamicReseauLogique

control.1.protocol 0


control.1.file voisins_modifications.xml

# Manage changes to the local resources

control.2 coucheGestionDocuments.DynamicsPairDocument


control.2.file documents_modifications.xml

# Manage changes to the queries and responses

control.3 coucheRequeteReponse.DynamicRequeteReponse

# Manage random changes to the local resources

#control.4 coucheGestionDocuments.DynamicsAleatoirePairDocument


control.4.randomNodes coucheGestionDocuments.GeneralRandomNodes

control.4.randomNodes.noeudParCycle 5

control.4.randomDocuments coucheGestionDocuments.GeneralRandomDocument

control.4.randomDocuments.documentRemoved 5

control.4.randomDocuments.documentAdded 5

#----------------------Observers-------------

# Display information about the routing of queries and responses

control.5 coucheRequeteReponse.ObserverRequeteReponse2


control.5.file ResultatsSimulationGnutella.xml

control.5.from 20

control.5.FINAL

control.5.distanceLimit 0.5

# Display information about the evolution of the logical network

control.6 coucheReseauLogique.ObserverReseauLogique


control.6.file ResultatsEvolutionReseauLogique.txt

control.6.FINAL

#----------------------------------------------------

The documents definition file

<?xml version="1.0" encoding="UTF-8"?>


<documents xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:noNamespaceSchemaLocation='./schemas/listeDocuments.xsd'>

<document document_id="1">

<words>

<word>the</word>

<word>doors</word>

<word>light</word>

<word>my</word>

<word>fire</word>

<word>mp3</word>

</words>

<bloom_filter>

000000100000000000000000001000000

0000000000000010000000000000000000000001

000000000000000000000000000000000000000

0000000000000000000000000000000000000000000000

000000000000000000000000000000000000

0000000000000000000000000000000000000000000000000

0000000000000000000000000000100010

000000000000000000000000000000000000000000000000000

0000000000000000000000000000000

000000000000000000000000000000000010000000000000000000

0000000000000010000000000000000

000000000000000000000000000000010000000000000000000000

00000000000000000000000000000000

00000000000000000100000000000000000000000000000000000

00010000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000100000000000000000000000000000000000000000000000

</bloom_filter>

</document>

<document document_id="2">

<words>

<word>the</word>


<word>doors</word>

<word>love</word>

<word>me</word>

<word>two</word>

<word>times</word>

<word>mp3</word>

</words>

<bloom_filter>

000000100000000000000000001000000

0000000000000010000000000000000000000001

000000000000000000000000000000000000000

0000000000000000000000000000000000000000000000

000000000000000000000000000000000000

0000000000000000000000000000000000000000000000000

0000000000000000000000000000100010

000000000000000000000000000000000000000000000000000

0000000000000000000000000000000

000000000000000000000000000000000010000000000000000000

0000000000000010000000000000000

000000000000000000000000000000010000000000000000000000

00000000000000000000000000000000

00000000000000000100000000000000000000000000000000000

00010000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000100000000000000000000000000000000000000000000000

</bloom_filter>

</document>

........... lots of document entries omitted ...........

</documents>


The documents distribution file


<peers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:noNamespaceSchemaLocation='./schemas/documentsDistrib.xsd'>

<peer peer_id="0">

<document document_id="6"/>

</peer>

<peer peer_id="1">


</peer>

<peer peer_id="14">



............ lots of entries omitted ........



</peer>


<peer peer_id="14">






</peer>

</peers>

The documents modification file


<documents xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:noNamespaceSchemaLocation="./schemas/documents_modifications.xsd">

<documents_added>

<document document_id="24" peer_id="4" cycle_number="1"/>

</documents_added>

<documents_removed>

<document document_id="24" peer_id="4" cycle_number="8"/>

</documents_removed>

</documents>
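The documents modification file above can be read as a list of timed events. The following Python sketch shows one possible way to do this; it assumes that `cycle_number` is the simulation cycle at which the document is added to or removed from the peer. The function name `load_document_events` and the tuple layout are hypothetical, not part of the project.

```python
import xml.etree.ElementTree as ET

# The sample reproduces the listing above, with straight quotes.
SAMPLE = """<documents xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="./schemas/documents_modifications.xsd">
  <documents_added>
    <document document_id="24" peer_id="4" cycle_number="1"/>
  </documents_added>
  <documents_removed>
    <document document_id="24" peer_id="4" cycle_number="8"/>
  </documents_removed>
</documents>"""


def load_document_events(xml_text):
    """Return (cycle, action, peer_id, document_id) tuples sorted by cycle."""
    root = ET.fromstring(xml_text)
    events = []
    for section, action in (("documents_added", "add"),
                            ("documents_removed", "remove")):
        for doc in root.findall(f"{section}/document"):
            events.append((
                int(doc.get("cycle_number")),
                action,
                int(doc.get("peer_id")),
                int(doc.get("document_id")),
            ))
    return sorted(events)


events = load_document_events(SAMPLE)
for cycle, action, peer_id, document_id in events:
    print(f"cycle {cycle}: {action} document {document_id} on peer {peer_id}")
```

Sorting by cycle lets a simulation loop consume the events in order, one cycle at a time.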


The queries definition file


<queries xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:noNamespaceSchemaLocation="./schemas/listeQueries.xsd">

<query query_id="1">

<words>

<bloom_filter>

000000100000000000000000001000000

0000000000000010000000000000000000000001

000000000000000000000000000000000000000

0000000000000000000000000000000000000000000000

000000000000000000000000000000000000

0000000000000000000000000000000000000000000000000

0000000000000000000000000000100010

000000000000000000000000000000000000000000000000000

0000000000000000000000000000000

000000000000000000000000000000000010000000000000000000

0000000000000010000000000000000

000000000000000000000000000000010000000000000000000000

00000000000000000000000000000000

00000000000000000100000000000000000000000000000000000

00010000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000100000000000000000000000000000000000000000000000

</bloom_filter>

.... lots of entries omitted ......

<word>ann</word>

<bloom_filter>

000000100000000000000000001000000

0000000000000010000000000000000000000001

000000000000000000000000000000000000000

0000000000000000000000000000000000000000000000

000000000000000000000000000000000000


0000000000000000000000000000000000000000000000000

0000000000000000000000000000100010

000000000000000000000000000000000000000000000000000

0000000000000000000000000000000

000000000000000000000000000000000010000000000000000000

0000000000000010000000000000000

000000000000000000000000000000010000000000000000000000

00000000000000000000000000000000

00000000000000000100000000000000000000000000000000000

00010000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000100000000000000000000000000000000000000000000000

</bloom_filter>

</words>

<responses>

<response distance="0.020000" document_id="24"/>

















</responses>

<bloom_filter>


000000100000000000000000001000000

0000000000000010000000000000000000000001

000000000000000000000000000000000000000

0000000000000000000000000000000000000000000000

000000000000000000000000000000000000

0000000000000000000000000000000000000000000000000

0000000000000000000000000000100010

000000000000000000000000000000000000000000000000000

0000000000000000000000000000000

000000000000000000000000000000000010000000000000000000

0000000000000010000000000000000

000000000000000000000000000000010000000000000000000000

00000000000000000000000000000000

00000000000000000100000000000000000000000000000000000

00010000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

00000000000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000000000000000000000000000000000000000000000

00000000000000000000000000000000

0000100000000000000000000000000000000000000000000000

</bloom_filter>

</query>

</queries>

The queries distribution file


<peers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:noNamespaceSchemaLocation="./schemas/queriesDistrib.xsd">

<peer peer_id="4">

<query ttl="20" cycle_number="11" query_id="1"/>

</peer>

</peers>
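A queries distribution file such as the one above tells which peer launches which query, when, and with what TTL. The sketch below groups the entries by cycle; it assumes that `cycle_number` is the cycle in which the peer issues the query and that `ttl` is the initial TTL of the query packet. The function name `load_query_schedule` and the dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The sample reproduces the listing above, with the namespace attribute
# written on a single line and with straight quotes.
SAMPLE = """<peers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="./schemas/queriesDistrib.xsd">
  <peer peer_id="4">
    <query ttl="20" cycle_number="11" query_id="1"/>
  </peer>
</peers>"""


def load_query_schedule(xml_text):
    """Map each cycle number to the list of queries launched in that cycle."""
    root = ET.fromstring(xml_text)
    schedule = defaultdict(list)
    for peer in root.findall("peer"):
        for query in peer.findall("query"):
            schedule[int(query.get("cycle_number"))].append({
                "peer_id": int(peer.get("peer_id")),
                "query_id": int(query.get("query_id")),
                "ttl": int(query.get("ttl")),
            })
    return dict(schedule)


schedule = load_query_schedule(SAMPLE)
```

With this layout, `schedule.get(cycle, [])` gives the queries to launch in a given cycle, and an empty list for cycles in which no peer asks anything.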


The nodes modification file


<connections xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:noNamespaceSchemaLocation="./schemas/voisins_modifications.xsd">

<connections_added>

<connection peer1_id="0" peer2_id="11" cycle_number="1"/>

</connections_added>

<connections_removed>

<connection peer1_id="2" peer2_id="4" cycle_number="1"/>

</connections_removed>

</connections>
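The nodes modification file above describes how the overlay topology changes over time. The following sketch applies the changes of one cycle to a set of links. It assumes that connections are undirected and that, within one cycle, removals are applied before additions; neither point is stated by the file format itself. The function name `apply_cycle` and the starting topology in the usage lines are hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample reproduces the listing above, with straight quotes.
SAMPLE = """<connections xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="./schemas/voisins_modifications.xsd">
  <connections_added>
    <connection peer1_id="0" peer2_id="11" cycle_number="1"/>
  </connections_added>
  <connections_removed>
    <connection peer1_id="2" peer2_id="4" cycle_number="1"/>
  </connections_removed>
</connections>"""


def _link(conn):
    """Build an undirected link (assumption) from a <connection> element."""
    return frozenset((int(conn.get("peer1_id")), int(conn.get("peer2_id"))))


def apply_cycle(links, xml_text, cycle):
    """Return a new set of links after applying the changes of one cycle."""
    root = ET.fromstring(xml_text)
    updated = set(links)
    # Assumption: removals first, then additions.
    for conn in root.findall("connections_removed/connection"):
        if int(conn.get("cycle_number")) == cycle:
            updated.discard(_link(conn))
    for conn in root.findall("connections_added/connection"):
        if int(conn.get("cycle_number")) == cycle:
            updated.add(_link(conn))
    return updated


# Hypothetical starting topology, for illustration only.
start = {frozenset((2, 4)), frozenset((0, 1))}
after_cycle_1 = apply_cycle(start, SAMPLE, 1)
```

Using `frozenset` for a link means that the pair (2, 4) and the pair (4, 2) denote the same connection, which matches the undirected assumption.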


List of Figures

2.1 The Client Server Model . . . . . . . . . . . . . . . . . . . . . 10
2.2 The Peer to Peer Model . . . . . . . . . . . . . . . . . . . . . 11
2.3 Client-Server search & P2P transfer . . . . . . . . . . . . . . . 13
2.4 P2P search & P2P transfer . . . . . . . . . . . . . . . . . . . . 14
2.5 Cached P2P search & P2P transfer . . . . . . . . . . . . . . . 15
2.6 Content Addressing search & P2P transfer . . . . . . . . . . . . 16

4.1 Duplicated queries . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Conflicting feedback packets . . . . . . . . . . . . . . . . . . . 47

5.1 A case when the multiplication factor is too small . . . . . . . 52

6.1 Overview of the project’s components . . . . . . . . . . . . . . 59


Index

3D, 65

advantages, 66

cached search, 14
CAN search, 15
Client Server model, 9

expressive, 65

feedback mechanism, 18, 29, 30, 42–44, 47
feedback packet, 30, 35, 37, 45–49, 81, 93
feedback stage, 35–38, 45

hash key, 16

multiplication factor, 36–39, 41, 48, 49

packet history, 36, 37
Peer-to-Peer Model, 10

query packet, 35–45, 48, 81, 93

requester node, 35–37, 42–47, 81
responder node, 36, 37, 42–44, 81
response packet, 35, 36, 42, 44–46, 48, 81, 93
response stage, 35, 36, 38, 42
routing mechanism, 18–21, 28, 29, 48

search mechanism, 10, 16, 18
search pattern, 35, 36, 39, 42–44
search stage, 35, 38, 41
super peers, 15

TTL, 14, 17, 35, 41

user model, 37, 45–47

VRML 2.0, 66
vrmlgraph, 67
vrmlgraphspecial, 68
