Page 1:

An Approach to Evaluate Data Trustworthiness Based on Data Provenance

Department of Computer Science

Purdue University

Page 2:

Outline

Motivation
Application Scenario and Problem Definition
A Trust Model for Data Provenance
Performance Study
Conclusion

Page 3:

Motivation

Data integrity is critical for making effective decisions

Evaluating the trustworthiness of data provenance is essential for ensuring data integrity

Few efforts have been devoted to investigating approaches for assessing how trustworthy data are

No existing techniques attempt to protect data against data deception

Page 4:

Motivation

To evaluate the trustworthiness of data provenance, we need to answer the following questions:

Where did the data come from?

How trustworthy is the original data source?

Who handled the data?

Are the data managers trustworthy?

Page 5:

Application Scenario and Problem Definition

In our scenario, parties are characterized as:

- Data source providers: sensor nodes or agents that collect data items
- Intermediate agents: computers that pass data items along or generate knowledge items
- Data users: people or computers that use items to make decisions

Items (data items and knowledge items) describe the properties of certain entities or events:

- Data items are generated or collected by data source providers
- Knowledge items refer to new information generated by intermediate agents through inference techniques

Page 6:

Application Scenario and Problem Definition

Our goal is to evaluate the trustworthiness of data items, knowledge items, source providers and intermediate agents.

Aspects that need to be considered include:

- Data similarity: two items similar to each other can be considered to support each other
- Path similarity: two items coming from different paths (source nodes) can be considered more trustworthy
- Data conflict: two items that contradict each other based on certain prior knowledge defined by the users
- Data deduction: knowledge deduced by the intermediate agents from the items they received

Page 7:

Application Scenario and Problem Definition

We model an item (denoted as r) as a row in a relational table; each item has k attributes A1, ..., Ak.

As shown in the table, there are five items, each of which has seven attributes: RID, SSN, Name, Gender, Age, Location, and Date. RID is the identifier of each item. The information represents the location of a person at a certain time.
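As a minimal sketch, an item can be modeled as a small record type with the seven attributes named above; the field values in the example below are hypothetical, not taken from the paper's table.

```python
from dataclasses import dataclass

# An item r is a row with k attributes; here the seven attributes
# named on the slide: RID, SSN, Name, Gender, Age, Location, Date.
@dataclass
class Item:
    rid: int
    ssn: str
    name: str
    gender: str
    age: int
    location: str
    date: str

# Hypothetical example row: where a person was at a certain time.
r = Item(rid=1, ssn="000-00-0000", name="Alice", gender="F",
         age=30, location="West Lafayette", date="2007-05-21")
```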

Page 8:

A Trust Model for Data Provenance

How to Compute Data Similarity

Employ a clustering algorithm to group items describing the same event.

The purpose of the clustering is to eliminate minor errors like typos.

After clustering, we obtain sets of items and each set represents a single event.

For each item r, the effect of data similarity on its trust score, denoted as sim(r), is determined by the number of items in the same cluster and the size of the cluster.

Formal definition:

$sim(r) = N_C \cdot e^{-d(C)}$

where $d(C)$ is the diameter of cluster $C$ and $N_C$ is the number of items in the cluster.
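As a minimal sketch, the score sim(r) = N_C · e^{-d(C)} can be computed from a cluster's size and diameter; the distance function and the toy one-dimensional items below are assumptions for illustration.

```python
import math
from itertools import combinations

def diameter(cluster, dist):
    """d(C): the largest pairwise distance within cluster C."""
    if len(cluster) < 2:
        return 0.0
    return max(dist(a, b) for a, b in combinations(cluster, 2))

def data_similarity(cluster, dist):
    """sim(r) = N_C * exp(-d(C)): every item in a large, tight
    cluster receives a high similarity score."""
    return len(cluster) * math.exp(-diameter(cluster, dist))

# Toy example: one-dimensional items with absolute-difference distance.
cluster = [1.0, 1.1, 0.9]
score = data_similarity(cluster, lambda a, b: abs(a - b))
```

Larger clusters (more corroborating items) raise the score, while a larger diameter (looser agreement) lowers it.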

Page 9:

A Trust Model for Data Provenance

Path Similarity

Given two items r1 and r2, suppose their paths are P1 and P2, respectively. The path similarity between P1 and P2 is defined as the edit distance between their identifiers.

Formal definition:

$sim^*(r) = \phi_C \cdot N_C \cdot e^{-d(C)}$

where $\phi_C$ is a parameter ranging from $1/N_C$ to 1: when no two items share one path, it equals one; when all items share one path, it equals $1/N_C$.
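The path similarity itself is an edit distance over the node identifiers along each path; a straightforward Levenshtein implementation is sketched below, with hypothetical node ids. The `path_factor` function is one possible instantiation (an assumption, not the paper's definition) of a parameter that is 1 when no two items share a path and $1/N_C$ when all items share one path.

```python
def edit_distance(p1, p2):
    """Levenshtein distance between two paths, each a sequence of
    node identifiers (source provider and intermediate agents)."""
    m, n = len(p1), len(p2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p1[i - 1] == p2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # substitute
    return dp[m][n]

def path_factor(paths):
    """Hypothetical instantiation of phi_C: the fraction of distinct
    paths in the cluster -- 1 when all paths differ, 1/N when all
    items arrived over the same path."""
    return len({tuple(p) for p in paths}) / len(paths)

# Two paths that differ in one intermediate agent (hypothetical ids).
d = edit_distance(["s1", "a2", "a5"], ["s1", "a3", "a5"])
```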

Page 10:

A Trust Model for Data Provenance

Data Conflict

Data conflict refers to inconsistent descriptions or information about the same entity or event. A simple example of a data conflict is the same person appearing at different locations during the same time period.

Prior knowledge is used to define data conflicts.

The data conflict score of one cluster against another cluster is determined by the distance between the two clusters and the number of items in the second cluster, taking path similarity into account.

Formal definition:

$con(C_1, C_2) = e^{-\frac{1}{d(C_1, C_2)}} \cdot \phi_{C_2} \cdot N_{C_2}$

where $d(C_1, C_2)$ is the distance between the two clusters, and $\phi_{C_2}$ and $N_{C_2}$ are the path-similarity parameter and the number of items of the second cluster $C_2$.
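Reading the conflict score as con(C1, C2) = e^{-1/d(C1,C2)} · φ_{C2} · N_{C2} (a reconstruction from the slide), it might be sketched as follows; the single-link cluster distance and the numeric inputs are assumptions.

```python
import math

def cluster_distance(c1, c2, dist):
    """d(C1, C2): taken here as the minimum pairwise (single-link)
    distance between the two clusters -- an assumption."""
    return min(dist(a, b) for a in c1 for b in c2)

def conflict_score(c1, c2, dist, phi2):
    """con(C1, C2) = exp(-1/d(C1, C2)) * phi_2 * N_2: the second
    cluster weighs more heavily against the first when it is large,
    arrives over diverse paths (phi2 near 1), and lies far away."""
    d = cluster_distance(c1, c2, dist)
    return math.exp(-1.0 / d) * phi2 * len(c2)

# Toy example: two well-separated clusters of one-dimensional items.
c = conflict_score([0.0, 0.1], [5.0, 5.2],
                   dist=lambda a, b: abs(a - b), phi2=1.0)
```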

Page 11:

A Trust Model for Data Provenance

Data Deduction

The trust score of a knowledge item is computed based on all of its input items and the inference techniques used by the intermediate agent. A weighted function is used to compute the score:

$t(k) = w_i \cdot \frac{t(a) + \frac{1}{n} \sum_{j=1}^{n} t(r_j)}{2}$

Here, $w_i$ is a parameter based on the operation the intermediate agent performs and its impact on the trustworthiness of knowledge item k, $t(a)$ is the trustworthiness of agent a, and $t(r_j)$ is the trustworthiness of the j-th input item.
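Assuming the weighted form t(k) = w · (t(a) + (1/n)·Σ t(r_j)) / 2, reconstructed from the slide's tokens, the deduction score can be sketched as:

```python
def deduction_score(w, t_agent, input_scores):
    """t(k): average the agent's trustworthiness with the mean
    trustworthiness of the n input items, then scale by the
    operation-dependent weight w."""
    mean_inputs = sum(input_scores) / len(input_scores)
    return w * (t_agent + mean_inputs) / 2.0

# Hypothetical case: a fully trusted operation (w = 1) performed by a
# 0.8-trust agent on two input items with trust 0.6 and 1.0.
t_k = deduction_score(1.0, 0.8, [0.6, 1.0])
```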

Page 12:

A Trust Model for Data Provenance

Computing Trust Scores

We compute the trust score of a data item by taking the above four aspects into account:

$t(f) = 1 - \prod_{r \in C} \left( 1 - t(r) \cdot sim^*(f) \right)$

The above equation is chosen based on probability theory, where $t(f)$ is the probability of fact f being true and $t(r)$ is the probability of item r being true; f and r belong to the same cluster.

The factor $sim^*(f)$ takes the similarity between two items into account: the more similar two items are, the more likely they represent the same event.
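If each item in the cluster is treated as independent supporting evidence, the combination rule t(f) = 1 − Π_{r∈C}(1 − t(r)·sim*(f)) can be sketched as follows; the numeric inputs are hypothetical.

```python
def fact_trust(item_scores, sim_star):
    """t(f) = 1 - prod(1 - t(r) * sim*(f)): the fact is false only if
    every supporting item fails to establish it, so each additional
    similar item pushes the score closer to 1."""
    p_false = 1.0
    for t_r in item_scores:
        p_false *= 1.0 - t_r * sim_star
    return 1.0 - p_false

# Two items of trust 0.5 that perfectly match the fact (sim* = 1).
t_f = fact_trust([0.5, 0.5], sim_star=1.0)
```

Two independent 0.5-trust supporters leave only a 0.25 chance that both fail, so the fact's score exceeds either item's alone.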

Page 13:

A Trust Model for Data Provenance

Computing Trust Scores (cont'd)

Similar equations are used to take the conflict between items into account.

The trustworthiness of intermediate agents and source nodes is computed as the average of the trust scores of the items belonging to them.

The complexity of our algorithm is dominated by the cost of computing the data similarity, path similarity and data conflict, which are all O(n^2).

An overview of our algorithm is listed on the next slide.

Page 14:

A Trust Model for Data Provenance

1. cluster data facts and knowledge items
2. for each cluster
3.     compute data similarity
4.     compute path similarity
5.     compute data conflict
6. assign initial trust scores to all the source providers and intermediate agents
7. repeat
8.     for each data fact and knowledge item
9.         compute its trust score
10.    for each knowledge item
11.        compute data deduction
12.        recompute the trust score of the knowledge item by combining the effect of data deduction
13.    compute trust scores for all the source providers and intermediate agents
14. until the change in trust scores is negligible
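The repeat-until-convergence structure of steps 7-14 can be illustrated with a deliberately simplified fixed point: here an item's score averages its base evidence with its provider's score, and the provider's score averages its items' scores. The update rules are illustrative assumptions; the real algorithm combines the four aspects above.

```python
def iterate_trust(base_scores, init=0.5, eps=1e-6, max_iter=1000):
    """base_scores: per-item evidence in [0, 1] for a single provider.
    Alternate item and provider updates until the provider's trust
    score stops changing (step 14: the change is negligible)."""
    provider = init
    for _ in range(max_iter):
        # Each item's score blends its own evidence with provider trust.
        items = [(b + provider) / 2.0 for b in base_scores]
        # The provider's trust is the average of its items' scores.
        new_provider = sum(items) / len(items)
        if abs(new_provider - provider) < eps:
            return new_provider, items
        provider = new_provider
    return provider, items

provider_trust, item_scores = iterate_trust([0.8, 0.6])
```

For evidence scores 0.8 and 0.6 the update contracts toward the fixed point 0.7, mirroring how the full algorithm's scores converge to stable values.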

Page 15:

Performance Study

In the performance study, we simulate a network containing 100 source providers and 100 intermediate agents.

As shown in Figure (a), the running time of the initialization phase increases as the dataset size grows.

This is because, in the worst case, the complexity of the clustering algorithm and of the computation of data similarity, path similarity and data conflict are all O(n^2).

Page 16:

Performance Study

Compared to the initialization phase, the iteration phase is much faster (see Figure (d)).

This is because the iteration phase simply computes score functions based on the results obtained from the initialization phase, and the trust scores converge to stable values in a short time.

Page 17:

Performance Study

As shown in Figures (c) and (f), the running time of both phases increases with the path length.

Page 18:

Conclusion

Formulated and introduced the problem of evaluating the trustworthiness of data provenance.

Proposed a trust model that takes into account four important factors influencing trustworthiness.

Evaluated the efficiency of our approach.

Our proposed method can deal with both unintentional errors and malicious attacks without collusion.

Page 19:

Future Work

Develop an approach to estimate the confidence of query results.

Develop a policy language to specify the minimum confidence level that a query result must have in order to be used by users in certain roles.

Dynamically adjust our trust model as information keeps streaming into the system.

Certify data provenance so as to achieve a certified data lineage.