Trustworthy Distributed Search and Retrieval over the Internet

Yung-Ting Chuang Electrical and Computer Engineering University of California, Santa Barbara May 3, 2013 Committee Members: Professor P. Michael Melliar-Smith, Chair Professor Louise E. Moser Professor Timothy P. Sherwood Professor Volkan Rodoplu 5/3/2013 1 Yung-Ting Chuang's Ph.D. Defense


Page 1: Trustworthy Distributed Search  and Retrieval over the Internet

Yung-Ting Chuang

Electrical and Computer Engineering

University of California, Santa Barbara

May 3, 2013

Committee Members:

Professor P. Michael Melliar-Smith, Chair

Professor Louise E. Moser

Professor Timothy P. Sherwood

Professor Volkan Rodoplu

5/3/2013 1 Yung-Ting Chuang's Ph.D. Defense

Page 2: Trustworthy Distributed Search  and Retrieval over the Internet

Outline

- Motivation
- Trustworthy Distributed Search and Retrieval
- Protecting against Malicious Attacks in iTrust
- Membership Management for iTrust
- Statistical Inference and Dynamic Adaptation for iTrust
- Conclusions and Future Work

Page 3: Trustworthy Distributed Search  and Retrieval over the Internet

Motivation

- Information is accessed over the Internet using centralized search engines
  - Benefits: efficient, robust, and scalable
  - Drawbacks: depends on administrators remaining benign
- Thus, we present a decentralized and distributed search and retrieval system
  - Benefits: prevents censorship and filtering of information
  - Drawbacks:
    - Needs more network bandwidth
    - Difficult to infer membership size and malicious nodes

Page 4: Trustworthy Distributed Search  and Retrieval over the Internet

1. Related Work

2. Design of iTrust

3. Implementation of iTrust

4. User Interface of iTrust

5. Performance Evaluation of iTrust

6. Summary


Page 5: Trustworthy Distributed Search  and Retrieval over the Internet

1. Related Work

- Survey by Mischeke and Risson on distributed search:
  - Structured: requires nodes to be organized in an overlay network
    - Distributed Hash Table (DHT), Ring, Tree, Skip Lists
  - Unstructured: typically gossip-based, and uses randomization
    - Flooding / Broadcast => Gnutella
    - Random walk and data replication => Sarshar, GIA, Lv
    - Key-based routing => Freenet
    - Direct routing => Pub-2-Sub
    - Square root function => Cohen, Zhong, Ferreira
- P2P systems concerned with security, privacy, and trust:
  - Quasar: uses a structured overlay and protects users’ sensitive information
  - OneSwarm: uses a combination of trusted and untrusted peers and protects the privacy of the users
  - GOSSPLE: fully decentralized system for social acquaintances using a gossip protocol

Page 6: Trustworthy Distributed Search  and Retrieval over the Internet

2. Design of iTrust
a) Distribution of Metadata

[Figure labels: Source of Information]

Page 7: Trustworthy Distributed Search  and Retrieval over the Internet

2. Design of iTrust
b) Distribution of a Request

[Figure labels: Source of Information, Requester of Information, Request Encounters Metadata]

Page 8: Trustworthy Distributed Search  and Retrieval over the Internet

2. Design of iTrust
c) Retrieval of Information

[Figure labels: Source of Information, Requester of Information, Request Matched]

Page 9: Trustworthy Distributed Search  and Retrieval over the Internet

3. Implementation of the iTrust System

[Figure: iTrust system components in three panels (a), (b), (c). Component labels: Apache, PHP, public interface, delete nodes, leave membership, query, search, inbox, statistics, user settings, tools, metadata inbox, Tika / Lucene / dictionary, metadata functions, metadata XML engine, register metadata list, apply XML, publish XML list, helper functions, nodes wrapper, keywords wrapper, resource wrapper, tag keyword resource, search functions, globals / navigation, cURL, SQLite, session, log, PECL http]

Page 10: Trustworthy Distributed Search  and Retrieval over the Internet

4. User Interface of iTrust


Page 11: Trustworthy Distributed Search  and Retrieval over the Internet

4. User Interface of iTrust


Page 12: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation of iTrust
a) Analytical Model

Notation:
- Membership contains n participating nodes
- x is the proportion of participating nodes that are operational
- Metadata are distributed to m nodes
- Requests are distributed to r nodes
- k nodes report matches to a requesting node (for the same metadata and the same request)

Page 13: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation of iTrust
a) Analytical Model

- Probability of k matches is:
- Probability of one or more matches is:
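The two probabilities named on this slide can be illustrated with a rough Python sketch. It assumes a hypergeometric model built from the notation on the previous slide: of the m nodes that hold the metadata, round(m * x) are operational, and a request sent to r of the n nodes produces one match for each operational holder it reaches. The function names, the rounding of m * x, and the model itself are assumptions made for illustration; they are not taken from the slides.

```python
from math import comb


def prob_k_matches(n: int, m: int, r: int, x: float, k: int) -> float:
    """P(exactly k matches) under an assumed hypergeometric model."""
    if not 0 <= r <= n or not 0 <= m <= n:
        raise ValueError("need 0 <= r <= n and 0 <= m <= n")
    if k < 0 or k > r:
        return 0.0
    holders = round(m * x)  # assumption: operational nodes that hold the metadata
    # comb(a, b) returns 0 when b > a, so impossible cases contribute nothing.
    return comb(holders, k) * comb(n - holders, r - k) / comb(n, r)


def prob_at_least_one_match(n: int, m: int, r: int, x: float) -> float:
    """P(one or more matches) = 1 - P(no match)."""
    return 1.0 - prob_k_matches(n, m, r, x, 0)
```

For example, with n = 4, m = 2, r = 2, and x = 1.0, exactly one of the six possible request sets misses both holders, so this model gives a match probability of 5/6.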

Page 14: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation of iTrust
a) Analytical Model

Page 15: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation of iTrust
b) Analysis vs. Emulation

Page 16: Trustworthy Distributed Search  and Retrieval over the Internet

6. Summary

- Problem we are trying to solve:
  - Centralized search engines can be tampered with to bias the results, or to conceal or censor information
- Our solutions and contributions:
  - We have implemented iTrust, which is a decentralized distributed search and retrieval system with no centralized mechanisms and no centralized control
  - We have demonstrated that the match probability is high, even if some participating nodes are subverted or non-operational

Page 17: Trustworthy Distributed Search  and Retrieval over the Internet

1. Background

2. Related Work

3. Foundations

4. Detecting Malicious Attacks

5. Defending against Malicious Attacks

6. Performance Evaluation

7. Summary


Page 18: Trustworthy Distributed Search  and Retrieval over the Internet

1. Background

- Potential attacks:
  - Nodes do not match requests
  - Nodes do not return responses to the requester
- Effect of such attacks:
  - Probability of a match is decreased
- Existing work that addresses attacks:
  - Place nodes on a blacklist (Jesi)
  - Maintain a reputation or trust score (Condie)
- Our solution to such attacks:
  - Estimate the proportion of malicious nodes
  - Increase the number of nodes to which requests are distributed in order to restore the match probability

Page 19: Trustworthy Distributed Search  and Retrieval over the Internet

2. Related Work

- Work related to our detection algorithm:
  - Exponential Weighted Moving Average (EWMA)
    - Roberts et al.: for discovering anomalies and issuing alerts
  - Chi-squared test
    - Goonatilake: for detecting intrusions
    - Press et al.: for balancing weights of buckets
    - Belen and Heckert: for determining similarity between two models
  - EWMA and Chi-squared test
    - Ye and Chen: for anomaly detection and intrusion detection
- Work related to our defensive adaptation algorithm:
  - Morselli: uses a feedback mechanism to adjust the replicas to improve search results
  - Leng: uses a maintainer to determine, update, and eliminate the data replicas

Page 20: Trustworthy Distributed Search  and Retrieval over the Internet

3. Foundations
a) Normalization

- We cannot use requests that return k=0 responses
  - Because there might be no metadata to match
- Probability of k matches is negligibly small when k is large
- Thus, we exclude requests for k=0 and for k > K
- Our normalization equation is:

  where
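One plausible reading of the normalization step, sketched in Python: drop the k = 0 bucket and every bucket with k > K, then rescale the remaining probabilities so that they sum to 1. The function name and the list-based representation are assumptions; the slide's own equation is not available here.

```python
def normalize(probabilities: list[float], K: int) -> list[float]:
    """Renormalize P(k) over the buckets k = 1..K.

    probabilities[k] is the (empirical or analytical) probability of k matches,
    for k = 0, 1, 2, ...  The k = 0 bucket and all buckets above K are excluded.
    """
    kept = probabilities[1:K + 1]
    total = sum(kept)
    if total == 0:
        raise ValueError("no probability mass in buckets 1..K")
    return [p / total for p in kept]
```

With input [0.5, 0.3, 0.1, 0.1] and K = 2, the buckets kept are 0.3 and 0.1, which rescale to roughly 0.75 and 0.25.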

Page 21: Trustworthy Distributed Search  and Retrieval over the Internet

3. Foundations
b) Exponential Weighted Moving Average

- The EWMA method is computed as follows:

  where c is the weighting factor for the EWMA method
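As an illustration, the textbook EWMA recurrence can be applied bucket by bucket to the counts of responses. The sketch below assumes that the weighting factor c multiplies the newest observation and (1 - c) multiplies the previous average; the slide does not show which term carries c, so that choice and the function name are assumptions.

```python
def ewma_update(previous: list[float], observed: list[float], c: float) -> list[float]:
    """One EWMA step per bucket: new = c * observed + (1 - c) * previous."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("weighting factor c must lie in [0, 1]")
    if len(previous) != len(observed):
        raise ValueError("previous and observed must have the same length")
    return [c * o + (1.0 - c) * p for p, o in zip(previous, observed)]
```

A larger c makes the average react faster to recent requests, while a smaller c smooths out noise at the cost of slower detection.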

Page 22: Trustworthy Distributed Search  and Retrieval over the Internet

3. Foundations
c) Chi-Squared vs. Modified Chi-Squared

- Pearson’s chi-squared statistic:
- Pearson’s modified chi-squared statistic:

  where:
  - o_k: the actual number of observations that fall into the kth bucket
  - e_k: the expected number of observations for the kth bucket
  - K: the number of buckets into which the observations fall

Page 23: Trustworthy Distributed Search  and Retrieval over the Internet

3. Foundations
d) Chi-Squared vs. Modified Chi-Squared

Page 24: Trustworthy Distributed Search  and Retrieval over the Internet

4. Detecting Malicious Attacks
a) Detection Algorithm

A requesting node:
1. Collects the responses to its requests using the EWMA method
2. Normalizes the empirical probabilities
3. Uses the modified chi-squared test to compare the empirical probabilities against the analytical probabilities for x=1.0, 0.7, 0.4, and 0.2
4. Chooses the x with the smallest chi-squared value as the estimate x’
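Steps 2 to 4 of the detection algorithm can be sketched as follows. This is only an illustration under stated assumptions: the analytical probabilities come from an assumed hypergeometric model, the standard Pearson statistic stands in for the modified chi-squared statistic (whose exact form is not available here), and all function names are hypothetical.

```python
from math import comb

CANDIDATE_X = (1.0, 0.7, 0.4, 0.2)  # the candidate values listed on the slide


def analytic_distribution(n: int, m: int, r: int, x: float, K: int) -> list[float]:
    """Assumed hypergeometric P(k matches), renormalized over k = 1..K.

    Requires K <= r and round(m * x) >= 1, otherwise no bucket has any mass.
    """
    holders = round(m * x)  # assumption: operational nodes holding the metadata
    raw = [comb(holders, k) * comb(n - holders, r - k) / comb(n, r)
           for k in range(1, K + 1)]
    total = sum(raw)
    return [p / total for p in raw]


def chi_squared(observed: list[float], expected: list[float]) -> float:
    """Standard Pearson statistic, used here as a stand-in for the modified form."""
    stat = 0.0
    for o, e in zip(observed, expected):
        if e > 0:
            stat += (o - e) ** 2 / e
        elif o > 0:
            return float("inf")  # observed mass where the model allows none
    return stat


def estimate_x(observed: list[float], n: int, m: int, r: int, K: int) -> float:
    """Return the candidate x whose analytical distribution fits best."""
    scores = {x: chi_squared(observed, analytic_distribution(n, m, r, x, K))
              for x in CANDIDATE_X}
    return min(scores, key=scores.get)
```

Here `observed` is the normalized empirical distribution over k = 1..K. Restricting the search to a few candidate values of x keeps the comparison cheap, at the cost of a coarse estimate.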

Page 25: Trustworthy Distributed Search  and Retrieval over the Internet

4. Detecting Malicious Attacks
b) Example

Page 26: Trustworthy Distributed Search  and Retrieval over the Internet

5. Defending against Malicious Attacks
a) Defensive Adaptation Algorithm

1. Initialize r ← 0
2. Calculate yo based on the current r with the given n, m, and x.
3. Determine whether yo is greater than the expected match probability.
   A. If not, increase r by 1 and go back to step 2
   B. If so, return r
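The loop in steps 1 to 3 can be sketched in Python as below. The match probability is computed from an assumed hypergeometric model (probability that at least one of the r contacted nodes is an operational holder of the metadata), the comparison uses "at least" rather than "greater than" so that a target of exactly 1.0 can terminate, and the names `match_probability` and `required_r` are hypothetical.

```python
from math import comb


def match_probability(n: int, m: int, r: int, x: float) -> float:
    """Assumed model: P(at least one of r contacted nodes holds the metadata)."""
    holders = round(m * x)  # assumption: operational nodes holding the metadata
    if r > n - holders:
        return 1.0  # the request must reach at least one operational holder
    return 1.0 - comb(n - holders, r) / comb(n, r)


def required_r(n: int, m: int, x: float, target: float) -> int:
    """Smallest r whose match probability reaches the target."""
    r = 0
    while r <= n:
        if match_probability(n, m, r, x) >= target:
            return r
        r += 1
    raise ValueError("target match probability cannot be reached")
```

A linear search is enough here because the match probability never decreases as r grows; a lower estimate of x simply pushes the returned r upward.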

Page 27: Trustworthy Distributed Search  and Retrieval over the Internet

5. Defending against Malicious Attacks
b) Example

Page 28: Trustworthy Distributed Search  and Retrieval over the Internet

6. Performance Evaluation
a) Varying the number of nodes

Page 29: Trustworthy Distributed Search  and Retrieval over the Internet

6. Performance Evaluation


Page 30: Trustworthy Distributed Search  and Retrieval over the Internet

7. Summary

- Problem we are trying to solve in this chapter:
  - Absence of centralized control makes it difficult to determine the proportion of non-operational nodes in the network
- Our solution and contributions:
  - A node can estimate the proportion of non-operational nodes in the network based on the responses to its requests
  - A node calculates the number of nodes to which the requests are distributed to maintain a high match probability
  - A node infers useful but unobservable information about the network as a whole by observing aspects of the behaviors of individual nodes that are visible to it

Page 31: Trustworthy Distributed Search  and Retrieval over the Internet

1. Background

2. Related Work

3. iTrust Membership Protocols

4. Foundations

5. Performance Evaluation

6. Extended Scenario

7. Summary


Page 32: Trustworthy Distributed Search  and Retrieval over the Internet

1. Background

- Churn: nodes joining and leaving the membership
- Challenging tasks:
  - Estimating membership and membership size
  - Estimating churn
- Existing work that addresses churn:
  - Passive Monitoring (Sen et al., Gummadi et al.)
  - Active Probing (Chu et al., Liang, Bhagwan et al.)
  - Gossiping (Bizenhofer, Pruteanu et al.)
- Our approach to address churn:
  - Nodes don’t predict churn characteristics in advance
  - Each node maintains its local view of the membership and uses statistical inference to update its view

Page 33: Trustworthy Distributed Search  and Retrieval over the Internet

2. Related Work

- Work related to membership management:
  - Zage: biases neighbor selections toward beneficial nodes
  - SCAMP: nodes discover joining and leaving nodes through gossiping
  - CYCLON: nodes maintain a small and fixed-size neighbor list, with a shuffling protocol for large networks
  - Newcast: each node periodically selects a peer to exchange and update its membership list
- Work related to churn:
  - Bizenhofer and Pruteanu et al.: estimate the churn rate through gossiping
  - Stutzbach & Rejaie: study churn characteristics and highlight problems that cause biased peer selections
  - Paulo et al.: maintain a dynamic mapping of flows according to the current set of neighbors
  - Liu: presents an age-based membership protocol with a conservative neighbor maintenance scheme under churn
  - Horowitz et al.: rely on the departure and arrival of nodes to estimate the current network size, without requiring any additional communication

Page 34: Trustworthy Distributed Search  and Retrieval over the Internet

3. iTrust Membership Protocols
a) Joining the Membership

[Figure labels: Joining Node, Bootstrapping Node]

Page 35: Trustworthy Distributed Search  and Retrieval over the Internet

3. iTrust Membership Protocols
b) Leaving the Membership

[Figure labels: Leaving Node]

Page 36: Trustworthy Distributed Search  and Retrieval over the Internet

3. iTrust Membership Protocols
c) Distributing Metadata

[Figure labels: Source Node, Discover Leaving Node, Discover New Node]

Page 37: Trustworthy Distributed Search  and Retrieval over the Internet

3. iTrust Membership Protocols
d) Distributing Requests

[Figure labels: Requesting Node, Discover Leaving Node, Discover New Node, Redistribute Metadata]

Page 38: Trustworthy Distributed Search  and Retrieval over the Internet

4. Foundations
a) Metrics

- LND: Leaves Not Detected
- JND: Joins Not Detected
- MA: Membership Accuracy
- MP: Match Probability for a request
- RT: Response Time required for a request
- MC: Message Cost per time unit

Page 39: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation
a) Retry R Membership Protocol

- Motivation:
  - When a node distributes a request message to R nodes, it might detect some leaving nodes. Therefore, it might not receive exactly R responses.
- Solution:
  - We allow a node to keep sending its message to more than R nodes until it receives exactly R responses.
- Our input variables for the Retry R Membership Protocol:
  - Try: The number of times that a requesting node sends its request message in an attempt to receive R responses.
  - TryMax: The maximum Try value.
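A small simulation can make the retry idea concrete. The sketch below is an assumption-laden illustration, not the iTrust implementation: `membership` is the node's local view, `is_alive` is a hypothetical callback that stands in for "the node responded", and each round contacts only as many fresh nodes as there are responses still missing.

```python
import random


def retry_r(membership, is_alive, R, try_max, rng):
    """Contact nodes in rounds until R responses arrive or try_max rounds pass.

    Returns (responded, contacted, tries). Nodes that were contacted but did not
    respond are the ones a requester would treat as having left.
    """
    responded = []
    contacted = set()
    tries = 0
    while tries < try_max and len(responded) < R:
        needed = R - len(responded)
        pool = [node for node in membership if node not in contacted]
        targets = rng.sample(pool, min(needed, len(pool)))
        if not targets:
            break  # local view exhausted before R responses arrived
        contacted.update(targets)
        responded.extend(node for node in targets if is_alive(node))
        tries += 1
    return responded, contacted, tries
```

For instance, `retry_r(list(range(10)), lambda node: node % 2 == 0, 3, 10, random.Random(0))` keeps drawing fresh nodes until three even-numbered nodes have answered. Raising `try_max` improves the chance of reaching R responses but increases the message cost.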

Page 40: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation
b) Adaptive RR Membership Protocol

- Our Churn Estimator is:

  where
  - Left: Number of nodes that were detected as non-operational
  - Joined: Number of nodes that were discovered to have joined
  - NumNodes: Number of requests that a requesting node sent
- The Requesting Rate (RR) is:
    if CE > RRMin / RRMax then
      RR ← RRMax x CE
    else
      RR ← RRMin
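The requesting-rate rule can be written out directly. For the churn estimator, the sketch assumes CE = (Left + Joined) / NumNodes, by analogy with the CE expression in the Combined Adaptive pseudocode later in the deck; the function names are hypothetical.

```python
def churn_estimator(left: int, joined: int, num_nodes: int) -> float:
    """Assumed form: fraction of contacted nodes that had left or newly joined."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return (left + joined) / num_nodes


def requesting_rate(ce: float, rr_min: float, rr_max: float) -> float:
    """RR = RRMax x CE when CE exceeds RRMin / RRMax, otherwise RRMin."""
    if ce > rr_min / rr_max:
        return rr_max * ce
    return rr_min
```

With RRMin = 1 and RRMax = 100, a churn estimate of 0.5 yields a requesting rate of 50, while an estimate of 0.001 falls back to the minimum rate of 1. The threshold RRMin / RRMax is exactly the point at which RRMax x CE would drop below RRMin.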

Page 41: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation
c) Message Cost vs. Membership Accuracy

Page 42: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation
d) Combined Adaptive Membership

Start infinite loop
  if current time reaches nextTime
    while Try <= 2 and resRec < R
      make request to (R - resRec) nodes and get responses array
      determine left, joined, N, responded from responses array
      resRec = resRec + responded
      Try = Try + 1
    CE = (left + joined) / (R + R - resRec)
    if CE > 1 / RRMax
      RR = RRMax x CE
    else
      RR = 1

Page 43: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation
e) Performance Tuning

Combined Adaptive with Try=2, RRMax = 100, 50, 30

Page 44: Trustworthy Distributed Search  and Retrieval over the Internet

5. Performance Evaluation
e) Message Cost vs. Membership Accuracy

Page 45: Trustworthy Distributed Search  and Retrieval over the Internet

6. Extended Scenario
a) Combined Adaptive Membership Protocol

Page 46: Trustworthy Distributed Search  and Retrieval over the Internet

7. Summary

- Problem we are trying to solve in this chapter:
  - We cannot accurately estimate the joining or leaving rates, or maintain an accurate view of the membership, when the system has high membership churn
- Our solution and contributions:
  - We presented an adaptive membership management protocol, which uses random sampling to discover newly joining and leaving nodes
  - Based on the responses it received to its request, a node calculates the churn estimator and dynamically adjusts its requesting rate to update its local view of the membership
  - Our membership protocol exploits the messages already required by the messaging protocol

Page 47: Trustworthy Distributed Search  and Retrieval over the Internet

1. Background

2. Model for iTrust

3. Dynamic Adaptation Algorithm

4. Performance Evaluation

5. Summary


Page 48: Trustworthy Distributed Search  and Retrieval over the Internet

1. Background

- Problems that co-exist in a fully distributed system:
  - High membership churn
  - Large proportion of malicious nodes
- Our approach to address both problems:
  - Use random sampling
  - Apply statistical inference techniques to estimate:
    - Membership churn with a large proportion of malicious nodes
    - Proportion of malicious nodes in the presence of high membership churn

Page 49: Trustworthy Distributed Search  and Retrieval over the Internet

2. Model for iTrust
a) System and Fault Model

- We consider the following scenarios:
  - A node leaves the membership voluntarily
  - A node leaves the membership involuntarily
  - A malicious node responds to a request but it does not report a match
- Parameters for membership churn:
  - JR: Joining Rate
  - LR: Leaving Rate
- Parameters for detecting malicious nodes:
  - X: Proportion of non-malicious nodes

Page 50: Trustworthy Distributed Search  and Retrieval over the Internet

3. Dynamic Adaptation Algorithm
a) Parameters and Variables

- n: Size of the node’s current view of the membership
- m: Number of nodes to which the metadata are distributed
- r: Number of nodes to which the requests are distributed
- IE: Intersection estimator obtained by random sampling:
- nIE: Estimate of n in I
- mIE: Estimate of m in I
- rIE: Estimate of r in I
- left: Number of nodes that were detected as non-operational
- numNodes: Number of nodes to which a requesting node sent its request

Page 51: Trustworthy Distributed Search  and Retrieval over the Internet

3. Dynamic Adaptation Algorithm

1. Newly joining node distributes join messages
2. Start infinite loop
     if current time reaches nextTime
       if a node is a source node
         distribute metadata to m nodes
       if a node is a requesting node
         distribute request messages to r nodes
         calculate empirical array O based on the responses it obtained
         calculate estimator IE, then nIE, mIE, rIE, then update n
         estimate x’ based on nIE, mIE, rIE, kMax, O
         estimate r’ based on x’, n, m, yo
         if a node is a source node, calculate and send more metadata
         calculate CE and rmr

Page 52: Trustworthy Distributed Search  and Retrieval over the Internet

4. Performance Evaluation


Page 53: Trustworthy Distributed Search  and Retrieval over the Internet

4. Performance Evaluation


Page 54: Trustworthy Distributed Search  and Retrieval over the Internet

5. Summary

- Problem we are trying to solve in this chapter:
  - Inferring the proportion of malicious nodes and the size of the membership when the network has a lot of churn
- Our solution and contributions:
  - We use random sampling and statistical inference for iTrust in the presence of both membership churn and malicious nodes, which are not directly observable
  - We have demonstrated that the dynamic adaptation algorithm is sufficiently accurate and timely to allow it to be used to estimate both metrics

Page 55: Trustworthy Distributed Search  and Retrieval over the Internet

1. Trustworthy Distributed Search and Retrieval

2. Protecting against Malicious Attacks in iTrust

3. Membership Management for iTrust

4. Statistical Inference and Dynamic Adaptation for iTrust


Page 56: Trustworthy Distributed Search  and Retrieval over the Internet

1. Trustworthy Distributed Search and Retrieval

- Conclusion:
  - We presented iTrust, a distributed search and retrieval system for the Internet to allow people to share information without worrying about censorship of information
  - We have demonstrated that, for appropriate choice of the parameters, the probability of obtaining a match is high
- Future Work:
  - Investigate the efficiency, scalability, and reliability in Emulab
  - Investigate different classes of nodes, effects of geographical location, and network and processing loads
  - Evaluate the ease of installation and use of iTrust
  - Apply the ideas of iTrust to other applications

Page 57: Trustworthy Distributed Search  and Retrieval over the Internet

2. Protecting against Malicious Attacks in iTrust

- Conclusion:
  - We have presented novel statistical algorithms for detecting and defending against malicious attacks
  - We recognize that multiple responses to a request provide valuable information about the network
  - We use statistical inference techniques to infer the characteristics of the network that are not measurable directly
  - Experimental results show the effectiveness of the algorithms for detecting and defending against malicious attacks
- Future Work:
  - Investigate other kinds of malicious attacks
  - Develop other detection and defensive algorithms
  - Investigate the detection algorithm with different sets of metadata

Page 58: Trustworthy Distributed Search  and Retrieval over the Internet

3. Membership Management for iTrust

- Conclusion:
  - We have presented membership algorithms that allow each member to maintain its own local view of the membership and keep that view close to the actual membership
  - We exploit messages already required by the messaging protocol, rather than requiring extra messages for membership
  - A requesting node discovers newly joining nodes and leaving nodes, and adjusts its requesting rate accordingly
  - We have demonstrated that our membership algorithm is effective in estimating churn
- Future Work:
  - Refine the algorithms to handle millions of nodes
  - Investigate the performance of the membership protocols in other scenarios

Page 59: Trustworthy Distributed Search  and Retrieval over the Internet

4. Statistical Inference and Dynamic Adaptation for iTrust

- Conclusion:
  - We have presented a dynamic adaptive algorithm that uses random sampling and statistical inference to infer information that is not easy to detect or that is expensive to collect
  - The algorithm dynamically adjusts r and rmr to obtain reasonable accuracy, response time, message cost, and match probability
  - We have demonstrated that our dynamic adaptive algorithm is effective in maintaining a high match probability and reasonable membership accuracy
- Future Work:
  - Apply these statistical inference and dynamic adaptation techniques to other fields
  - Create other dynamic adaptation algorithms using random sampling and statistical inference for distributed systems and computer networks

Page 60: Trustworthy Distributed Search  and Retrieval over the Internet

Questions? Comments?

Our iTrust Web Site: http://itrust.ece.ucsb.edu

Contact information: Yung-Ting Chuang: [email protected]

Our project is supported by NSF CNS 10-16193