


Truth Discovery to Resolve Object Conflicts in Linked Data

ABSTRACT

In the community of Linked Data, anyone can share information as Linked Data on the web because of the openness of the Semantic Web. As a result, RDF (Resource Description Framework) triples that describe the same real-world entity can be obtained from multiple sources, which inevitably results in conflicting objects for a certain predicate of a real-world entity. The objective of this study is to identify one truth from multiple conflicting objects for a certain predicate of a real-world entity. An intuitive principle based on common sense is that an object from a reliable source is trustworthy, and a source that provides trustworthy objects is reliable. Many truth discovery methods based on this principle have been proposed to estimate source reliability and identify the truth. However, the effectiveness of existing truth discovery methods is significantly affected by the number of objects provided by each source. Therefore, these methods cannot be trivially extended to resolve conflicts in Linked Data with a scale-free property, i.e., most of the sources provide few conflicting objects, whereas only a few sources have many conflicting objects. To address this challenge, we propose a novel approach called TruthDiscover to identify the truth in Linked Data with a scale-free property. Two strategies are adopted in TruthDiscover to reduce the effect of the scale-free property on truth discovery. First, this approach leverages the topological properties of the Source Belief Graph to estimate the priori beliefs of sources, which are used to smooth the trustworthiness of sources. Second, the Hidden Markov Random Field is used to model the interdependencies between objects so that the trust values of objects can be estimated accurately. Experiments are conducted on six datasets, covering persons, locations, organizations, descriptors, films and music, to evaluate TruthDiscover. Experimental results show that TruthDiscover outperforms TruthFinder, F-Quality Assessment and Voting in terms of accuracy when confronted with data having a scale-free property, and that it is robust and consistent across domains.

1. INTRODUCTION

Linked Data has gained considerable attention in recent years. The number of available Linked Data sources increased from 12 in 2007 to 1,014 in 2014 [1]. The data model of Linked Data is RDF, which encodes a resource in the form of (subject, predicate, object) triples. The subject denotes the resource, and the predicate expresses the relationship between the subject and the object. An important characteristic of Linked Data is that, because of the openness of the Semantic Web, anyone can publish data as Linked Data on the web by following certain rules [2]. As a result, RDF triples that describe the same real-world entity can be obtained from multiple sources.

Many Linked Data sources on the web have been created from semi-structured datasets (e.g., Wikipedia) and unstructured ones [3]. Thus, errors inevitably occur during the creation process. As a result, many Linked Data sources contain noisy, out-of-date, missing or erroneous data. Worse, multiple Linked Data sources often provide conflicting data. Conflicts in Linked Data can be classified into three categories, namely, identity, schema, and object conflicts [4]. Identity conflicts arise when different subjects from various sources denote the same real-world entity, for example, dbpedia:Beijing and freebase:m.01914 (we use prefixes in this paper, instead of full URIs, to save space). Schema conflicts arise when different schemata are used to describe the same predicate, for example, dbo:populationTotal and gn:population. Object conflicts occur when multiple inconsistent objects exist for a certain predicate of a real-world entity. For example, Table 1 shows conflicting objects for four predicates about the place Beijing. The six sources provide five different objects for the predicate dbo:populationTotal (one source provides none) and six conflicting objects each for the predicates geo:lat and geo:long. This study focuses on resolving object conflicts.

Table 1. Conflicting objects about Beijing in six datasets.

Source     dbo:populationTotal   geo:lat    geo:long   rdfs:label
DBpedia    21,150,000            39.90638   116.3916   “Beijing”
Freebase   20,180,000            39.91666   116.3833   “Peipingshih”
Yago       19,612,368            39.9       116.38     “Beijing”
Opencyc    13,133,000            39.90749   116.3972   “Peiping”
Geoname    14,933,274            39.91691   116.3970   “Beijing Shi”
BBC        Null                  39.908     116.397    “Beijing”
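To make the notion of an object conflict concrete, the following minimal Python sketch groups triples by (subject, predicate) and reports every pair for which the sources disagree. It is an illustration only: the tuple layout, the function name find_object_conflicts and the shared subject key "Beijing" are assumptions (identity conflicts are taken to be already resolved), and only a few rows of Table 1 are included.

```python
from collections import defaultdict

# Each record is (source, subject, predicate, object); values come from Table 1.
# Assumption: identity conflicts are already resolved, so all sources share
# the hypothetical subject key "Beijing".
TRIPLES = [
    ("DBpedia", "Beijing", "rdfs:label", "Beijing"),
    ("Freebase", "Beijing", "rdfs:label", "Peipingshih"),
    ("Yago", "Beijing", "rdfs:label", "Beijing"),
    ("Opencyc", "Beijing", "rdfs:label", "Peiping"),
    ("Geoname", "Beijing", "rdfs:label", "Beijing Shi"),
    ("BBC", "Beijing", "rdfs:label", "Beijing"),
    ("DBpedia", "Beijing", "geo:lat", "39.90638"),
    ("Yago", "Beijing", "geo:lat", "39.9"),
]


def find_object_conflicts(triples):
    """Return {(subject, predicate): {object: [sources]}} for every
    (subject, predicate) pair that has more than one distinct object."""
    grouped = defaultdict(lambda: defaultdict(list))
    for source, subject, predicate, obj in triples:
        grouped[(subject, predicate)][obj].append(source)
    return {
        key: dict(objects)
        for key, objects in grouped.items()
        if len(objects) > 1
    }


conflicts = find_object_conflicts(TRIPLES)
for (subject, predicate), objects in conflicts.items():
    print(subject, predicate, objects)
```

Each reported group corresponds to one set of conflicting objects for which a single truth has to be chosen.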

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). SIGMOD’16.

1.1 Problems of Object Conflicts in Linked Data

In this study, four questions are addressed through an empirical analysis of six datasets that belong to six domains: persons, locations, organizations, descriptors, films and music. These datasets are described in detail in Section 4.1.

(1) Are object conflicts a serious problem for the Linked Data community?

The answers obtained by observing the six datasets are surprising. Approximately 45% of predicates have conflicting objects provided by multiple sources, and the average number of conflicting objects for a certain predicate of a real-world entity is 11. This indicates that object conflicts are a serious issue in the Linked Data community.

Normalized entropy [5] is used to measure the inconsistency of the conflicting objects for a certain predicate of a real-world entity, and thus the degree of inconsistency of Linked Data. Generally, the higher the normalized entropy, the higher the degree of inconsistency. Let $O = \{o_1, o_2, \dots, o_n\}$ denote the set of conflicting objects for a certain predicate of a real-world entity, and let $p(o_i)$ represent the percentage of occurrences of $o_i$. The corresponding normalized entropy can be defined as follows:

$$H(O) = -\frac{\sum_{i=1}^{n} p(o_i) \log p(o_i)}{\log n}. \quad (1)$$
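As a worked illustration, normalized entropy can be computed as the Shannon entropy of the object distribution divided by the logarithm of the number of distinct objects. The sketch below applies this to two columns of Table 1; the function name normalized_entropy and the convention of returning 0 when there is only one distinct object are assumptions made for this example.

```python
import math
from collections import Counter


def normalized_entropy(objects):
    """Shannon entropy of the object distribution divided by log(n),
    where n is the number of distinct objects."""
    counts = Counter(objects)
    n = len(counts)
    if n <= 1:
        # Assumed convention: a single distinct object means no inconsistency.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n)


# rdfs:label column of Table 1: "Beijing" appears three times.
labels = ["Beijing", "Peipingshih", "Beijing", "Peiping", "Beijing Shi", "Beijing"]
# dbo:populationTotal column of Table 1 (the Null entry is left out).
populations = ["21,150,000", "20,180,000", "19,612,368", "13,133,000", "14,933,274"]

print(round(normalized_entropy(labels), 3))       # 0.896
print(round(normalized_entropy(populations), 3))  # 1.0
```

The population values are all distinct, so their normalized entropy reaches the maximum of 1, whereas the repeated label “Beijing” lowers the value for rdfs:label.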

Our observations of the six datasets indicate that the average normalized entropy is 0.75. Approximately 80% of predicates have a normalized entropy of more than 0.8; this result indicates a high degree of inconsistency.

(2) What are the causes of object conflicts in Linked Data?

We analyzed all conflicting objects in the six datasets and discovered four distinct reasons for the inconsistency of Linked Data. The first reason is multi-values (32%): the predicate inherently has multiple objects (e.g., the predicate owl:sameAs). The second reason is out-of-date data (13%): because the predicate is time sensitive, the corresponding object tends to change over time (e.g., dbo:populationTotal in Table 1). The third reason is variety (43%): conflicting objects may be presented in different ways or with different data precisions (e.g., rdfs:label in Table 1). The fourth reason is pure errors (12%). In this study, we focus on resolving the three reasons (68%) for which a certain predicate of a real-world entity has only one truth, namely out-of-date data, variety and pure errors.

(3) Can we just trust an authoritative source?

Although several well-known authoritative sources, such as Freebase¹ and DBpedia², provide reasonably accurate information, they are not suitable for every domain. In addition, objects from different well-known sources for the same predicate are not always consistent. For example, six well-known sources provide five different objects for the predicate dbo:populationTotal in Table 1. Selecting one of these well-known sources as the trustworthy source is difficult when confronted with the object conflict problem. Therefore, relying on a single authoritative source fails to resolve conflicts in Linked Data.

(4) Can previous methods be trivially extended to resolve conflicts in Linked Data?

A straightforward method to resolve object conflicts is majority voting, where the object with the maximum number of occurrences is regarded as the truth [6]. However, we find that this method achieves relatively low accuracy (ranging from 0.3 to 0.45) on the six datasets. There are two reasons why majority voting performs poorly on Linked Data.

Firstly, approximately 50% of predicates have no dominant object. Consider dbo:populationTotal in Table 1.

¹ https://www.freebase.com/
² http://wiki.dbpedia.org/

Five different objects receive equal votes from six sources. In this case, majority voting can only select one object at random to break the tie. To understand the underlying reasons, we examine the correlation between the dominance factor and accuracy, as shown in Figure 1. We find that majority voting achieves satisfactory accuracy only when the dominance factor exceeds 0.7. However, this requirement is too stringent to satisfy in Linked Data. The dominance factor of a certain predicate is defined as:

$$F(O) = \frac{\max_{o_i \in O} N(o_i)}{\sum_{o_j \in O} N(o_j)}, \quad (2)$$

where $O$ represents the set of conflicting objects for a certain predicate of a real-world entity, and $N(o_i)$ denotes the number of occurrences of $o_i$.
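The following minimal sketch shows majority voting on two columns of Table 1. It assumes that the dominance factor is the share of all occurrences held by the most frequent object; the function name majority_vote and the choice to return all tied winners rather than picking one at random are illustrative assumptions.

```python
from collections import Counter


def majority_vote(objects):
    """Return (winners, dominance), where winners lists every object that
    reaches the maximum number of occurrences and dominance is that
    maximum divided by the total number of occurrences (assumed definition)."""
    counts = Counter(objects)
    top = max(counts.values())
    winners = sorted(obj for obj, c in counts.items() if c == top)
    return winners, top / sum(counts.values())


labels = ["Beijing", "Peipingshih", "Beijing", "Peiping", "Beijing Shi", "Beijing"]
populations = ["21,150,000", "20,180,000", "19,612,368", "13,133,000", "14,933,274"]

print(majority_vote(labels))       # (['Beijing'], 0.5)
print(majority_vote(populations))  # five tied winners, dominance 0.2
```

For dbo:populationTotal every object ties, so a voting scheme has to guess; even for rdfs:label the dominance of 0.5 is below the 0.7 level at which voting is reported to become accurate.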

Figure 1. Correlation between dominance factor and accuracy.

Secondly, majority voting assumes that all sources are equally reliable and does not distinguish between them. Recent research [7] has pointed out that different Linked Data sources differ in quality. Therefore, this method is not applicable to Linked Data. To address the limitations of majority voting, many truth discovery methods [6, 8-15] have been proposed that estimate source reliability and infer the trust values of objects simultaneously. In these methods, the truth for an entity is the object that is assigned the maximum trust value among all conflicting objects. A common principle of these methods is that a source which provides trustworthy objects more often is more reliable, and an object from a reliable source is more trustworthy. The trustworthiness of each source can be approximated as the percentage of true objects provided by that source. Consequently, the more objects a source provides, the closer its estimated trustworthiness is likely to be to its real reliability. However, for “small” sources that provide very few objects, it is difficult to evaluate reliability. Consider the extreme case in which a source provides only one object: its trustworthiness is one if the object is correct, and the source is regarded as highly reliable; otherwise, the source is considered highly unreliable. Inaccurate estimation of source reliability inevitably has negative effects on identifying trustworthy objects. Therefore, the effectiveness of many truth discovery methods is significantly affected by the number of objects provided by each source.

[Figure 1 plot: accuracy (Y-axis, 0.4 to 1) against dominance factor (X-axis, 0.1 to 0.9) for the Persons, Locations, Organizations, Descriptors, Films and Music datasets.]


In this study, we find that the total number of conflicting objects provided by each source approximately follows a power law in Linked Data. This finding indicates that Linked Data has a scale-free property. This property is characterized by $\Pr(k)$, the fraction of the sources having $k$ conflicting objects, following the power law $\Pr(k) \propto k^{-\gamma}$, where $\gamma$ is the exponent of the power law and ranges from 2.12 to 3.1 in the six datasets, as shown in Figure 2. In the six plots, the X-coordinate represents the number of conflicting objects $k$ provided by a source, and the Y-coordinate represents the complementary cumulative distribution function, labeled $\Pr(k)$.

[Figure 2 plots: six log-log panels of Pr(k) (Y-axis, 10^-3 to 10^0) against k (X-axis, 10^0 to 10^4). Persons: γ = 2.42; Locations: γ = 3.1; Organizations: γ = 2.56; Descriptors: γ = 2.67; Films: γ = 2.12; Music: γ = 2.68.]

Figure 2. Distributions of conflicting objects in the six datasets.

Figure 2 shows that the number of conflicting objects provided by most of the sources ranges from 1 to 10, and only a few sources have many conflicting objects. As discussed above, many truth discovery methods are sensitive to the number of objects provided by each source. Therefore, these methods cannot be trivially extended to resolve conflicts in Linked Data with a scale-free property.
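A check of this kind can be sketched as follows: compute the empirical complementary cumulative distribution of the per-source object counts and estimate the power-law exponent. The text does not state how the exponents in Figure 2 were fitted, so the estimator below is a swapped-in standard technique (the continuous maximum-likelihood estimate, 1 + n / Σ ln(k_i / k_min)), and the function names, the k_min parameter and the synthetic sample are all assumptions made for illustration.

```python
import math


def ccdf(counts):
    """Empirical complementary CDF: for each distinct k, the fraction of
    sources that provide at least k conflicting objects."""
    n = len(counts)
    return [(k, sum(1 for c in counts if c >= k) / n) for k in sorted(set(counts))]


def power_law_exponent(counts, k_min=1.0):
    """Continuous maximum-likelihood estimate of the power-law exponent.
    This is an approximation when applied to integer counts."""
    tail = [c for c in counts if c >= k_min]
    log_sum = sum(math.log(c / k_min) for c in tail)
    if log_sum == 0.0:
        raise ValueError("need at least one count strictly above k_min")
    return 1.0 + len(tail) / log_sum


# Synthetic, hypothetical data: a deterministic sample from a continuous
# power law with exponent 2.5, drawn by inverting the CDF at evenly
# spaced quantiles. Real per-source counts would replace this list.
true_gamma = 2.5
n_sources = 10_000
sample = [
    (1.0 - (i + 0.5) / n_sources) ** (-1.0 / (true_gamma - 1.0))
    for i in range(n_sources)
]

estimate = power_law_exponent(sample)
print(round(estimate, 2))
print(ccdf([1, 1, 2, 5]))
```

On a log-log plot such as Figure 2, a power law appears as an approximately straight line in the complementary cumulative distribution.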

1.2 Overview of Our Approach

A simple way to address the issues resulting from the scale-free property is to remove “small” sources [12]. However, removing “small” sources results in limited coverage and sparse data because most Linked Data sources are “small.” In this study, we propose a truth discovery approach called TruthDiscover to resolve object conflicts in Linked Data with a scale-free property. TruthDiscover involves the following steps.

(i) Priori belief estimation: the non-uniform priori beliefs of all sources are computed by leveraging the topological properties of the Source Belief Graph.

(ii) Computing the trustworthiness of sources: the trustworthiness of each source is automatically computed based on the trust scores of objects. Thereafter, the priori beliefs of sources are incorporated through an averaging strategy to smooth the trustworthiness of sources.

(iii) Computing the trust values of objects: the trust values of objects are computed based on a Hidden Markov Random Field (HMRF) model. If the changes in the trust values of all objects after an iteration are less than the threshold, the object with the maximum trust score is regarded as the truth; otherwise, the procedure returns to step (ii).

We conducted three experiments on six real datasets from the persons, locations, organizations, descriptors, films and music domains. The experimental results show that TruthDiscover outperforms existing approaches for resolving object conflicts in Linked Data with a scale-free property.

1.3 Contributions and Organization

The main contributions of this study are as follows:

(i) We find that the number of conflicting objects provided by multiple Linked Data sources approximately follows a power law. This finding indicates that only a few sources have many conflicting objects, whereas most of the sources provide few objects. We identify the challenges that the scale-free property poses for the task of truth discovery.

(ii) A truth discovery approach called TruthDiscover is proposed to identify the truth in Linked Data with a scale-free property. Two strategies are adopted in TruthDiscover to address the challenges resulting from the scale-free property. First, this approach leverages the topological properties of the Source Belief Graph to estimate the priori beliefs of sources, which are used to smooth the trustworthiness of sources. Second, a method based on HMRF is proposed to infer the trust values of objects accurately by modeling the interdependencies between objects. The effectiveness of our approach is validated by three experiments on six real datasets.

The remainder of this paper is organized as follows. Related work is discussed in Section 2. Section 3 presents the problem formulation and the derivation of our method. The experimental results are discussed in Section 4. Section 5 presents the conclusions and future work.

2. RELATED WORK

The resolution of conflicts from multiple sources has been investigated for many years [10, 16, 17]. Existing methods can be grouped into two categories according to the underlying data model: methods for relational databases and methods for Linked Data.

2.1 Conflicts in Relational Databases

Relational databases have a formal, structured data model. Resolving conflicts in relational databases refers to resolving contradictory attribute values from different sources when integrating data [10]. This problem was first mentioned by Dayal et al. [18] in 1983. However, the problem did not receive much attention at that time because many applications adopted conflict-avoiding or conflict-ignoring strategies [10]. Later on, many methods inspired by the measurement of web page authority were proposed, such as Authority-Hub analysis [19]. However, authority does not imply high accuracy [11]. Recently, methods based on truth discovery have gained increasing attention because of their ability to estimate source reliability and infer the trust values of objects simultaneously. These methods can be divided into three groups [20], namely, iterative methods, optimization based methods, and probabilistic graphical model based methods.

Iterative methods. These methods usually exploit the interdependency between the trust values of objects and the trustworthiness of sources to find true objects. Yin’s research [11] played an important role in this subfield. This method used Bayesian analysis and the relationship between the trustworthiness of sources and the probability of each claim being true to identify the truth. Since then, several methods have been proposed for specific scenarios based on Yin’s seminal work. For example, Dong et al. [21] proposed an iterative method that analyzes the dependency between source reliability estimation and truth computation; it differs from Yin’s work in that it considers dependence between data sources. Pasternack et al. [22] introduced a truth discovery framework that incorporates prior knowledge of facts into an iterative procedure.

Optimization based methods. These methods find the truth by minimizing the distance between the information provided by sources and the identified truth. For example, Li et al. [6] proposed an optimization framework over multiple sources of heterogeneous data types, where the trust values of objects and the trustworthiness of sources are defined as two sets of unknown variables. The truth is found by minimizing an objective function.

Probabilistic graphical model based methods. These methods automatically infer the truth and source reliability by means of a probabilistic graphical model. For instance, Zhao et al. [9] developed a probabilistic graphical model to address the truth finding problem by modeling two aspects of source reliability, namely, sensitivity and specificity. This approach is also the first to address the problem of multi-valued attribute types.

2.2 Conflicts in Linked Data

Linked Data is not organized into a pre-defined data model, but it nevertheless carries associated information, such as metadata or other markers, that separates semantic elements. The problem of conflict resolution has also been studied in the field of Linked Data. As discussed in Section 1, the three types of conflicts in Linked Data are identity, schema, and object conflicts. Accordingly, existing methods to resolve conflicts in Linked Data can be divided into three groups.

Identity conflicts. The task of resolving identity conflicts is also known as entity resolution or object co-reference resolution. Two types of methods are generally adopted to resolve identity conflicts. One is based on Web Ontology Language (OWL) semantics inference. For instance, Glaser et al. [23] implemented a co-reference resolution service based on owl:sameAs. Hogan et al. [24] proposed a method based on inverse functional properties (IFPs) to conduct large-scale co-reference resolution. The other is based on the assumption that two subjects denote the same real-world entity if they share several common property-value pairs. For instance, Wang et al. [25] proposed a concept mapping method based on the similarities between concept instances. Li et al. [26] presented a dynamic entity resolution framework that computes similarities between instances.

Schema conflicts. Many methods have been introduced to solve schema conflicts through schema mapping. These methods are further divided into two categories, namely, linguistic matching-based and structural matching-based. Linguistic matching-based methods usually compute string similarity over names, labels, and several other descriptions. For instance, Qu et al. [27] presented a method to resolve schema conflicts by computing the similarity between documents of a domain entity (e.g., a class or a property). Structural matching-based methods usually employ graphs to represent different schemata and measure the structural similarity between graphs. For example, Hu et al. [28] proposed a method based on RDF bipartite graphs to resolve schema conflicts. This method computes structural similarities between domain entities and between statements by using a propagation procedure over the bipartite graphs.

Object conflicts. Research on resolving object conflicts has attracted less attention than research on identity and schema conflicts. In the early stage of Linked Data, conflict-avoiding and conflict-ignoring strategies were frequently employed for simplicity. Later on, methods based on manual rules were proposed. For instance, Mendes et al. [29] presented a Linked Data quality assessment framework called Sieve. The core of this framework is the rule that more recent data are closer to the true value. Thereafter, Michelfeit et al. [4] presented an assessment model that leverages the quality of the source, data conflicts, and confirmation of values to decide which value should be the true value.

Although previous works perform well in certain scenarios or applications, it remains difficult to evaluate the reliability of “small” sources, as discussed in Section 1.1. More research on resolving object conflicts in Linked Data with a scale-free property should be conducted.

3. METHODOLOGY

3.1 Problem Formulation

Several important notations used in this study are introduced in this subsection. Thereafter, the problem is formally defined.

Definition 1 (RDF Triple) [30]. Let $I$ denote the set of IRIs (Internationalized Resource Identifiers), $B$ the set of blank nodes, and $L$ the set of literals (denoted by quoted strings, e.g., “Beijing City”). An RDF triple can be represented by $(s, p, o)$, where $s \in I \cup B$ is a subject, $p \in I$ is a predicate, and $o \in I \cup B \cup L$ is an object.

Definition 2 (SameAs Triple). A sameAs triple can be represented by $(s, \text{owl:sameAs}, o)$, which connects two RDF resources through the owl:sameAs predicate.

Definition 3 (SameAs Graph). Given a set of sameAs triples $T$, a SameAs Graph can be represented by $G = (V, E)$, where $V$ is a set of vertices (i.e., subjects and objects) and $E$ is a set of directed edges, with each edge corresponding to a sameAs triple in $T$.

Definition 4 (Trustworthiness of Sources) [11]. The trustworthiness of a source is the expected confidence of the objects provided by that source.

Definition 5 (Trust Values of Objects) [11]. The trust value of an object is the probability of the object being correct.

Let $O = \{o_1, o_2, \dots, o_n\}$ denote a set of conflicting objects for a certain predicate of a real-world entity. The process of resolving object conflicts in Linked Data is formally defined as follows: given a set of conflicting objects $O$, TruthDiscover produces one truth for a certain predicate of a real-world entity. The truth is represented by $o^*$.

3.2 Basic Ideas

This subsection introduces three assumptions that serve as the basis of our method, followed by the framework of our method.

Assumption 1: A certain predicate of an entity has only one true object.

In this study, we only consider the case in which a certain predicate of a real-world entity has only one truth. The three reasons for inconsistency have different definitions of truth. For out-of-date data, the truth is the most recent object. For variety, the truth is the object that is presented in the most standard manner. For pure errors, the truth is the object whose value is correct. These definitions of truth also serve as the annotation guidelines in Section 4.1.

Assumption 2: An object is likely to be true if it is provided by a reliable source; thus, a source that provides trustworthy objects is reliable.

We derive this intuitive assumption from common sense and our observations of the six datasets. This assumption also serves as a basic principle for many truth discovery methods [3, 6, 11-13, 21, 31] to estimate source reliability and identify the truth.

Assumption 3: The true objects appear to be similar in different sources; the false objects are less likely to be similar.

In practice, the true objects provided by different sources may be presented in slightly different ways or with different data precisions, such as “Beijing” and “Beijing Shi.” This indicates that the true objects appear to be similar. Conversely, different false objects are less likely to be similar because different sources often make different mistakes. To validate this assumption, two similarity functions are adopted in this study to measure the similarity between objects $o_i$ and $o_j$.

For numerical data, the most commonly used similarity function is defined as:

, (3)

. (4)

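For string data, the Levenshtein distance [32] is adopted to describe the similarities of objects. The similarity function is defined as follows:

$$S(o_i, o_j) = 1 - \frac{ld(o_i, o_j)}{\max(|o_i|, |o_j|)}, \quad (5)$$

where $ld(o_i, o_j)$ denotes the Levenshtein distance between objects $o_i$ and $o_j$, and $|o_i|$ and $|o_j|$ are the lengths of $o_i$ and $o_j$, respectively.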
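A Levenshtein-based string similarity can be sketched as follows. The normalization used here, one minus the edit distance divided by the length of the longer string, is a common choice and is an assumption for this example; the function names are likewise illustrative.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution,
    computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("Beijing", "Beijing Shi"))                  # 4
print(round(string_similarity("Beijing", "Beijing Shi"), 3))  # 0.636
print(round(string_similarity("Beijing", "Peiping"), 3))      # 0.714
```

Under this normalization, identical strings score 1 and strings with no characters in common score 0.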

The distribution of average similarities between true objects in the six datasets is shown in Figure 3 (a). Approximately 90% of predicates have average similarities of more than 0.8. Figure 3 (b) shows the average similarities between false objects in the six datasets. The average similarities range from 0 to 0.4, and approximately 80% of predicates have average similarities of less than 0.2. This finding indicates that the truths provided by different sources appear to be similar, and false objects are generally less consistent than true objects.

(a) Average similarities between true objects.
(b) Average similarities between false objects.
Figure 3. Distributions of average similarities between objects in the six datasets.

Based on these assumptions, we develop a method called TruthDiscover to resolve object conflicts in Linked Data with a scale-free property. Given a set of conflicting objects, Figure 4 illustrates the framework of generating truth by TruthDiscover, which mainly includes the following three modules.

(1) Module I. Priori Belief Estimation (described in Section 3.3): This module produces priori belief for each source by leveraging the topological properties of the Source Belief Graph.

(2) Module II. Computing the Trustworthiness of Sources (presented in Section 3.4): This module computes the trustworthiness of each source according to the trust scores of objects. Then, by using an averaging strategy, the priori beliefs of sources are added to smooth the trustworthiness of sources.

(3) Module III. Computing the Trust Values of Objects (described in Section 3.4): According to Assumption 3, the trust value of an object can propagate to other objects. Therefore, this module adopts HMRF to model the relationships between objects for computing trust values of objects accurately. In this study, the loopy belief propagation algorithm [33] is applied to estimate the marginal probabilities of each hidden variable in HMRF. If the changes in all objects after each iteration are less than the preset threshold, then the object with the maximum trust score is regarded as the truth.

Figure 4. Framework of TruthDiscover.

Algorithm I highlights the main steps in generating a truth.

Algorithm I. TruthDiscover
Input: a set of conflicting objects and the set of sources that provide them
Output: the truth
// The purpose of 1~2 is to generate the priori beliefs of sources
1: Priori belief estimation: for each source, compute its priori belief through BeliefRank (described in Section 3.3);
2: Initialize the trustworthiness of each source by its normalized priori belief (described in Section 3.4);
3: repeat
4:   for each source, compute its smoothed trustworthiness according to Equation 8;
5:   for each object, compute trust values according to Equation 10;
6:   for each object, update its trust value according to Algorithm II;
7:   for each source, update trustworthiness of sources according to Equation 7;
8: until the convergence criterion is satisfied;
9: select the object with the maximum trust value as the truth;
10: return the truth.
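The control flow of Algorithm I can be sketched as a fixed-point iteration. This is only a skeleton under stated assumptions: Equations 8 and 10 and Algorithm II are passed in as hypothetical callables (`smooth`, `score_objects`, `refine_objects`), the source update follows the averaging described for Equation 7, and the stopping rule (largest change in any object's trust value below `tol`) is one possible convergence criterion. All names and default values are illustrative.

```python
from typing import Callable, Dict, Hashable, List

Trust = Dict[Hashable, float]


def truth_discover(
    objects_by_source: Dict[Hashable, List[Hashable]],
    initial_source_trust: Trust,
    smooth: Callable[[Trust], Trust],
    score_objects: Callable[[Trust], Trust],
    refine_objects: Callable[[Trust], Trust],
    max_iters: int = 100,
    tol: float = 1e-3,
):
    """Alternate between object trust and source trust until stable."""
    source_trust = dict(initial_source_trust)  # step 2: normalized priors
    object_trust: Trust = {}
    for _ in range(max(1, max_iters)):  # steps 3-8
        smoothed = smooth(source_trust)  # step 4 (stand-in for Equation 8)
        # steps 5-6 (stand-ins for Equation 10 and Algorithm II)
        new_object_trust = refine_objects(score_objects(smoothed))
        # step 7: average trust of the objects each source provides
        source_trust = {
            s: sum(new_object_trust[o] for o in objs) / len(objs)
            for s, objs in objects_by_source.items()
        }
        change = max(
            abs(new_object_trust[o] - object_trust.get(o, 0.0))
            for o in new_object_trust
        )
        object_trust = new_object_trust
        if change < tol:  # step 8: assumed convergence criterion
            break
    truth = max(object_trust, key=object_trust.get)  # step 9
    return truth, object_trust
```

Passing the equations as callables keeps the loop independent of the exact formulas, which can then be swapped without touching the iteration itself.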

3.3 Priori Belief Estimation
This subsection describes a method called BeliefRank to estimate the priori beliefs of all sources by leveraging the topological properties of the Source Belief Graph.

The owl:sameAs property in OWL [34] indicates that two subjects actually refer to the same thing. The use of this property further enriches the Linked Data space by declaratively interconnecting “equivalent” things across distributed Linked Data sources [35]. In recent years, the owl:sameAs property has been extensively utilized in many Linked Data sources, such as DBpedia, Freebase, Yago [36] and GeoNames (footnote 3). Figure 5 shows a fragment of owl:sameAs triples in dbpedia:Beijing. When many of these owl:sameAs triples are taken together, they form a directed graph called SameAs Graph [37], as defined in Definition 3. Owing to the importance of owl:sameAs in Linked Data integration, many researchers have investigated owl:sameAs triples and the SameAs Graph [37, 38]. However, to the best of our knowledge, no attempt has been made to estimate the reliability degree of Linked Data sources through SameAs Graph analysis.

Figure 5. Fragment of owl:sameAs triples in dbpedia:Beijing.

When data publishers publish their data as Linked Data on the web, they usually add new owl:sameAs triples pointing to the external equivalent subject. As dictated by logic, they select a subject provided by the source they trust. That is, the owl:sameAs property indicates that the data publishers place their attention and trust in the subject provided by a source they trust. Typically, the data publisher of a subject can be represented by the name of the source [37]. For example, “DBpedia” is an abstract representation of the data publisher for dbpedia:Beijing. That is, the SameAs Graph can be converted to a directed multigraph called the Source Belief Graph, which represents the relationship between sources. Formally, the Source Belief Graph is a pair of sets: a set of vertices, with each vertex corresponding to the source name of a vertex in the SameAs Graph, and a multiset [39] of edges formed by pairs of vertices, with each pair corresponding to an edge in the SameAs Graph. Figure 6 shows a fragment of a SameAs Graph and the corresponding Source Belief Graph.

Figure 6. Example of SameAs Graph and its corresponding Source Belief Graph.
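The conversion from a SameAs Graph to a Source Belief Graph can be sketched as follows. This is a minimal illustration, assuming each owl:sameAs triple is given as a (subject URI, object URI) pair and that the host name of a URI serves as its source name; that mapping, the function names, and the sample URIs in the usage note are assumptions made for this example.

```python
from collections import Counter
from urllib.parse import urlparse


def source_name(uri: str) -> str:
    """ASSUMPTION: the host name of a URI stands for its source."""
    return urlparse(uri).netloc.lower()


def source_belief_graph(same_as_edges):
    """Collapse subject-level sameAs edges into a source-level multigraph.

    Returns the vertex set and a Counter that maps each (from_source,
    to_source) pair to the number of parallel edges (the multiset of edges).
    """
    edges = Counter()
    for subject, obj in same_as_edges:
        edges[(source_name(subject), source_name(obj))] += 1
    vertices = {name for pair in edges for name in pair}
    return vertices, edges
```

For instance, two owl:sameAs triples that link different DBpedia subjects to GeoNames subjects would yield a single source pair with multiplicity 2.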

The Source Belief Graph indicates that the trustworthiness of different sources can be propagated through the edges. The edge structure of the Source Belief Graph is utilized to produce a global reliability ranking of each source. Generally, a highly

3 www.geonames.org/


linked source is more reliable than sources with a few edges. For each source, we consider the set of sources that point to it, the number of edges going out of a source, and the number of edges that point to a source. The priori belief of a source can be defined as follows:

, (6)

where the parameter is a damping factor. The damping factor is necessary because the graph contains loops.

Recent research [38] has pointed out that in practice, the owl:sameAs property does not always mean that the two subjects refer to the same thing. Four incorrect usages of owl:sameAs have been identified in Linked Data: Same Thing As But Different Context, Same Thing As But Referentially Opaque, Represents, and Very Similar To. Intuitively, the damping factor in BeliefRank can be considered the probability that the usage of owl:sameAs is correct. The experimental results of [38] show that approximately 51% of the usages of owl:sameAs are correct. Therefore, the damping factor is set to 0.51 in this study.

The effectiveness of BeliefRank is significantly affected by the total number of sameAs triples. We extracted eighteen million sameAs triples from BTC2012 [40], which covers a significant portion of Linked Data, to produce a global reliability of sources. In practice, BeliefRank reaches a stable stage after fourteen iterations when the threshold is set to 0.001. By using BeliefRank, we obtain the priori beliefs of 1,402 sources (footnote 4). Table 2 lists the top 15 results.

Table 2. Top fifteen results obtained by using BeliefRank.

Source                      Priori belief of source
DBpedia                     10.0759
www.advogato.org            6.5998
Freebase                    6.1583
www.deri.ie                 3.1503
FOAF                        2.8472
semanticweb                 2.5908
DBLP                        2.5776
data.semanticweb.org        2.2072
identi.ca                   1.9278
olafhartig.de               1.8668
www.w3.org                  1.7666
semantictweet.com           1.7619
revyu.com                   1.6767
mud.cz                      1.6731
www4.wiwiss.fu-berlin.de    1.6604
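The iterative computation behind BeliefRank can be sketched with a PageRank-style update on the source-level multigraph. The exact form of Equation 6 is not reproduced here, so the update rule below is an assumption: each source receives a base value of (1 - damping) plus the damped share of belief from the sources pointing to it, with shares proportional to edge multiplicity. The function name, the initial value of 1.0, and the stopping rule are likewise illustrative; the damping factor 0.51 and the threshold 0.001 are the values reported in this subsection.

```python
from collections import Counter, defaultdict


def belief_rank(edges, damping=0.51, tol=1e-3, max_iters=100):
    """PageRank-style sketch on a multigraph.

    edges: mapping {(from_source, to_source): multiplicity}.
    Returns (beliefs, number_of_iterations_run).
    """
    sources = {s for pair in edges for s in pair}
    if not sources:
        return {}, 0
    out_degree = Counter()
    incoming = defaultdict(list)
    for (src, dst), count in edges.items():
        out_degree[src] += count
        incoming[dst].append((src, count))
    belief = {s: 1.0 for s in sources}
    iteration = 0
    for iteration in range(1, max_iters + 1):
        updated = {}
        for s in sources:
            # ASSUMED update: damped share of each in-neighbor's belief
            received = sum(
                belief[src] * count / out_degree[src]
                for src, count in incoming[s]
            )
            updated[s] = (1.0 - damping) + damping * received
        change = max(abs(updated[s] - belief[s]) for s in sources)
        belief = updated
        if change < tol:
            break
    return belief, iteration
```

In this sketch a source with no incoming edges settles at the base value, while heavily linked sources accumulate larger beliefs, which matches the intuition that a highly linked source is more reliable.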

3.4 Truth Computation
This subsection reveals how to accurately infer the trustworthiness of the source and the trust value of an object in Linked Data with a scale-free property.

3.4.1 Computing the Trustworthiness of Sources
A naive method to compute the accuracy of a source is to regard the trustworthiness of a source as the percentage of true objects provided by this source. However, we do not know for sure which objects are the truths. Therefore, we instead compute the trustworthiness of a source as the average probability of the objects provided by the source being true, defined as follows:

4 http://1drv.ms/1M2PHoG

, (7)

where the average is taken over the set of objects provided by the source.

Considering the scale-free property of Linked Data, it is difficult for Equation 7 to estimate the real reliability degree of a source accurately when the number of objects it provides is “small,” as discussed in Section 1.1. In this study, the trustworthiness of a source is smoothed by its priori belief based on the averaging strategy, defined as follows:

, (8)

, (9)

where the normalized priori belief of a source is computed from its priori belief and from the maximum and minimum values of all priori beliefs.
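The computations described in this subsection can be sketched as three small functions. The averaging in `source_trust` follows the description of Equation 7. The other two are assumptions: `normalize_priors` uses min-max scaling, which is one natural reading of a normalization based on the maximum and minimum priori beliefs, and `smoothed_trust` takes an equal-weight average, which is one plausible reading of the averaging strategy. All function names are hypothetical.

```python
def normalize_priors(priors):
    """ASSUMED min-max scaling of priori beliefs into [0, 1]."""
    lo, hi = min(priors.values()), max(priors.values())
    if hi == lo:
        # Degenerate case: all sources share one prior (arbitrary choice).
        return {s: 1.0 for s in priors}
    return {s: (b - lo) / (hi - lo) for s, b in priors.items()}


def source_trust(object_trust, objects_of_source):
    """Average trust value of the objects a source provides (Equation 7)."""
    return sum(object_trust[o] for o in objects_of_source) / len(objects_of_source)


def smoothed_trust(trust, normalized_prior):
    """ASSUMED equal-weight average of trust and normalized prior."""
    return 0.5 * (trust + normalized_prior)
```

With smoothing of this kind, a source that provides a single object is no longer judged on that object alone, because its priori belief contributes to the final score.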

3.4.2 Computing the Trust Values of Objects
This subsection describes how the trust values of objects are computed. First, we analyze a simple case where all objects are independent. The trust value of an object can be defined as follows:

, (10)

where the computation ranges over the set of sources that provide the object.

However, Assumption 3 shows that the true objects appear to be similar on different sources, and the false objects are less likely to be similar. That is, the trust value of an object can propagate to other objects through the similarity relation. In this study, we model the relationship between objects by adopting a method based on HMRF.

The concept of HMRF is derived from the Hidden Markov Model (HMM) [41]. HMRF is a powerful formalism used to model real-world events based on the Markov chain and knowledge of soft constraints. In this study, the relationship between different objects is denoted by soft constraints, not the causal relation. Moreover, the trust value of an object is only affected by its neighbors. These conditions motivated us to select a method based on HMRF. HMRF is mainly composed of three components: a hidden field of random variables, an observable set of random variables, and the neighborhoods between each pair of variables in the hidden field. The observation variables are the set of conflicting objects for a certain predicate of a real-world entity. The hidden variables are the labels of the observation variables. Each hidden variable indicates whether the corresponding object is a truth. A similarity value is defined for each pair of objects. As shown in Figure 3(a), approximately 90% of predicates have average similarities of more than 0.8 between true objects. Given that, an intuitive strategy is adopted in this study: two hidden variables are neighbors if the similarity between their corresponding objects is more than 0.8. The HMRF model is illustrated in Figure 7.
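The neighborhood rule described above (two hidden variables are neighbors when the similarity of their objects exceeds 0.8) can be sketched as follows. The function names are hypothetical, objects are assumed to be hashable and distinct, and `toy_numeric_similarity` is only a stand-in used to make the example runnable; it is not the numerical similarity function of Equations 3 and 4.

```python
from itertools import combinations


def build_neighbors(objects, similarity, threshold=0.8):
    """Link two objects when their similarity exceeds the threshold."""
    neighbors = {o: set() for o in objects}
    for a, b in combinations(objects, 2):
        if similarity(a, b) > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def toy_numeric_similarity(a: float, b: float) -> float:
    """Hypothetical stand-in: 1 minus the relative difference."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0
    return 1.0 - abs(a - b) / largest
```

Objects that have no sufficiently similar counterpart end up with an empty neighbor set and are therefore unaffected by propagation.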


Figure 7. Illustration of the HMRF model.

The probability distribution of the hidden variables in HMRF obeys the Markov property. Thus, the probability distribution of the value of a hidden variable is independent of all other hidden variables given its neighbors, i.e., each hidden variable is only affected by its neighbors. Each hidden variable follows the Bernoulli distribution defined as follows.

(11)

Consider the set of all maximal cliques of the graph; Figure 7 shows an example of a maximal clique. Each maximal clique has an associated set of variables. The joint distribution of variables in HMRF is factorized as follows:

,

,(12)

where the constant is selected to ensure that the distribution is normalized, and the remaining factor is a potential function in HMRF.

The belief propagation algorithm [42] is proven to be an exact solution for estimating the marginal probabilities of hidden variables when the graph has no loops. Loopy belief propagation is an approximate algorithm for a loopy graph. In this study, we design a loopy belief propagation process to estimate the marginal probabilities of the hidden variables in consideration of the loops. In belief propagation, estimating the marginal probabilities of the hidden variables is a process of minimizing the graph energy. The key steps of the propagation process are shown as follows.

· Step I: Initialization. The trust value of each object is initialized with Equation 10, and the probability distribution of each hidden variable is initialized with Equation 11.

· Step II: Spreading the Belief Message. A message is sent from each hidden variable to each of its neighbors. A high message value indicates that the corresponding marginal value is high. The message is defined as follows:

, (13)

where the formula includes a unary energy function. This function indicates that when the quantity it is evaluated on is 1, the propagation requires low energy (easy to propagate). Otherwise, high energy (difficult to propagate) is required.

· Step III: Belief Assignment. The marginal probability of each hidden variable is updated according to its neighbors, and is defined as follows:

, (14)

where the update ranges over the set of neighbors of the hidden variable.

The algorithm updates all messages in parallel and assigns the label. Given only one truth for a certain predicate of a real-world entity, a value of 1 is assigned to the hidden variable whose marginal probability is the maximum; otherwise, a value of 0 is assigned. The algorithm stops when the assigned value does not change for any hidden variable between iterations. This condition indicates that the assignment will converge after a sufficient number of iterations. The pseudo code of this method is shown in Algorithm II.

Algorithm II. Computing the trust values of objects
Input: a set of conflicting objects
Output: the trust value ,
// The purpose of 1~4 is to generate the trust value of objects
// and distribution of each hidden variable
1: for do
2:   Compute trust value with Equation 10;
3:   is initialized with Equation 11;
4: end for
5: : Calculating their similarity ;
6: repeat
// The purpose of 7~15 is to spread the belief message
7:   for to do
8:     ;
9:     for to do
10:      if then
11:        Compute message with Equation 13;
// The purpose of 12~13 is to assign Belief
12:        ;
13:        ;
14:      end if
15:    end for
16:    ;
17:  end for
18: until the convergence criterion is satisfied;
19: return , .
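The propagation process can be illustrated with a generic sum-product loopy belief propagation over binary hidden variables. This sketch is not a reconstruction of Equations 13 and 14: the pairwise potential is a hypothetical Potts-style term (`coupling` > 1 rewards neighbors that take the same label), the stopping rule compares marginals against `tol`, and priors are assumed to lie strictly between 0 and 1. It does follow the parallel message update and the single-truth label assignment described above.

```python
def loopy_bp(prior, neighbors, coupling=2.0, max_iters=50, tol=1e-3):
    """prior[i]: initial P(y_i = 1); neighbors[i]: set of neighbors of i."""
    phi = {i: (1.0 - p, p) for i, p in prior.items()}  # unary potentials

    def psi(yi, yj):
        # Hypothetical pairwise potential: equal labels are favored.
        return coupling if yi == yj else 1.0

    # messages[(i, j)][y] is the message from i to j about y_j = y
    messages = {(i, j): (0.5, 0.5) for i in neighbors for j in neighbors[i]}

    def belief(i):
        b = list(phi[i])
        for k in neighbors[i]:
            b = [b[y] * messages[(k, i)][y] for y in (0, 1)]
        return b[1] / (b[0] + b[1])

    marginals = {i: belief(i) for i in prior}
    for _ in range(max_iters):
        updated = {}
        for (i, j) in messages:
            out = []
            for yj in (0, 1):
                total = 0.0
                for yi in (0, 1):
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:
                            incoming *= messages[(k, i)][yi]
                    total += phi[i][yi] * psi(yi, yj) * incoming
                out.append(total)
            norm = out[0] + out[1]
            updated[(i, j)] = (out[0] / norm, out[1] / norm)
        messages = updated  # all messages are replaced in parallel
        new_marginals = {i: belief(i) for i in prior}
        change = max(abs(new_marginals[i] - marginals[i]) for i in prior)
        marginals = new_marginals
        if change < tol:
            break
    truth = max(marginals, key=marginals.get)  # only one truth is allowed
    labels = {i: int(i == truth) for i in prior}
    return marginals, labels
```

In this sketch an object whose neighbor has a high prior sees its own marginal rise, while an isolated object keeps its prior.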

The time complexity analysis uses three quantities: the total number of conflicting objects, the number of iterations of Algorithm II, and the number of iterations of TruthDiscover. The time complexity of Algorithm II is expressed in terms of the first two quantities. BeliefRank can produce a global reliability ranking of each source through an offline process. Therefore, the time complexity of TruthDiscover is expressed in terms of all three quantities, and it is experimentally validated in Section 4.2.

4. EXPERIMENTS
Three experiments are conducted in six real datasets to validate the effectiveness of our approach. The experimental results show that TruthDiscover outperforms existing approaches in resolving object conflicts when confronted with the challenge of data having a scale-free property. The experiment setup is discussed in Section 4.1, and the experimental results are presented in Section 4.2.

4.1 Experiment Setup
Six datasets and three baseline methods are introduced in this subsection.

Datasets: Three experiments are conducted in the six datasets including persons, locations, organizations, descriptors, films and music. The first four datasets are constructed based on the OAEI2011 New York Times dataset (footnote 5), which is a well-known and carefully created dataset of Linked Data. In order to draw more robust conclusions, two other domains including films and music are constructed through SPARQL queries over DBpedia. The construction process of datasets mainly involves the following steps:

(i) For each entity of the six domains, we perform entity co-reference resolution through the API of sameas.org (footnote 6), which is a well-known tool, to identify subjects for the same real-world entities.

(ii) For each of the six domains, we crawl every entity from the start position to a depth of 1 by using LDspider [43].

The statistics of the six datasets are shown in Table 3. The row “#Subjects” indicates the total number of subjects returned by sameas.org. The row “#Predicates” describes the total number of predicates. The row “#Conflicting Predicates” represents the total number of predicates that have conflicting objects and the row “#Entities” represents the number of entities for each of the six domains.

Table 3. Statistics of the six datasets.

Datasets         #Entities   #Subjects   #Predicates   #Conflicting Predicates
Persons          4,978       130,174     16,245        7,506
Locations        1,910       74,015      14,162        6,870
Organizations    3,044       25,051      13,956        6,360
Descriptors      498         10,362      6,980         3,250
Films            7,542       115,172     15,452        9,271
Music            7,131       124,456     16,896        9,123

One truth was selected from multiple conflicting objects for experimental verification. A strict process was established to ensure the quality of the annotation. This process mainly involved the following steps:

(i) The annotators were provided annotated examples and annotation guidelines.

(ii) Every two annotators were asked to label the same predicate on the same entity independently.

(iii) The annotation results from two annotators were measured by using Cohen’s kappa coefficient [44]. The agreement coefficient of the six datasets was set to be at least 0.75. When an agreement could not be reached, a third annotator was asked to break the tie.
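The agreement check in step (iii) relies on Cohen's kappa, which compares the observed agreement between two annotators with the agreement expected by chance. The sketch below uses the standard definition, kappa = (p_o - p_e) / (1 - p_e); the function name and the input format (two equally long lists of labels) are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A pair of annotators would pass the threshold stated above when the returned value is at least 0.75.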

5 http://data.nytimes.com/#
6 http://sameas.org/

The manually labeled results were regarded as the ground truth used in the evaluation.

Multi-values Filtering: As discussed in Section 1.1, TruthDiscover focuses on three reasons for object conflicts, whereas the fourth (multi-valued predicates) is left for the future. A method to distinguish multi-valued predicates is needed so that the applicability of TruthDiscover can be assessed. In this study, multi-valued predicates are filtered automatically with the following rule: if a source provides more than one object for a predicate of a real-world entity, this predicate is a multi-valued predicate. The method based on this rule achieves relatively high accuracy (ranging from 0.96 to 0.98) in the six datasets. Therefore, this method meets the desired objectives compared with the manual annotation method.

Baseline methods: We select three well-known state-of-the-art truth discovery methods as baselines. These methods are evaluated using the same datasets in the experiments.

Vote: Voting regards the object with the maximum number of occurrences as the truth. Moreover, voting is a straightforward method.

TruthFinder [11]: It is a seminal work that resolves conflicts based on source reliability estimation. It adopts Bayesian analysis to infer the trustworthiness of sources and the probabilities of a value being true.

F-Quality Assessment [4]: This method is a popular algorithm used to resolve conflicts in Linked Data. Three factors, namely, the quality of the source, data conflicts, and confirmation of values from multiple sources, are leveraged to decide which value should be the true value.

The parameters of the baseline methods were set according to the authors’ suggestions. The experiments were performed on a desktop computer with an Intel Core i5-3470 CPU at 3.2 GHz, 4 GB main memory, and the Microsoft Windows 7 Professional operating system. All baseline methods were executed in the Eclipse (Java) platform (footnote 7) by a single thread.
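Two of the simpler procedures in this subsection, the multi-valued predicate rule and the Vote baseline, can be sketched in a few lines. The input formats are assumptions made for this example: claims are given as (source, entity, predicate, object) tuples for the filter and as (source, object) pairs for voting. The function names are hypothetical, and ties in voting are broken by first occurrence, which is a choice of this sketch.

```python
from collections import Counter, defaultdict


def multi_valued_predicates(triples):
    """Flag (entity, predicate) pairs for which a single source
    provides more than one distinct object."""
    seen = defaultdict(set)
    for source, entity, predicate, obj in triples:
        seen[(source, entity, predicate)].add(obj)
    return {
        (entity, predicate)
        for (source, entity, predicate), objs in seen.items()
        if len(objs) > 1
    }


def vote(claims):
    """Return the object claimed by the largest number of sources."""
    counts = Counter(obj for _, obj in claims)
    return counts.most_common(1)[0][0]
```

Predicates flagged by the filter would be set aside before any single-truth method, including the baselines, is applied.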

4.2 Experimental Results
The experimental results for the six datasets show that TruthDiscover outperforms the baseline methods in determining the truth from multiple conflicting objects in Linked Data with a scale-free property.

4.2.1 Accuracy Evaluation
In the experiments, we have two types of data in our datasets: numerical data and string data. For these two types of data, only one truth is selected from multiple conflicting objects. Therefore, accuracy is adopted as a unified measure in the experiments for the two types of data, and is measured by computing the percentage of matched values between the output of each method and the ground truths. In this sub-subsection, two experiments are described. The first experiment evaluates the accuracy of TruthDiscover against five baseline methods. The second experiment evaluates the effectiveness of TruthDiscover with regard to the three reasons for inconsistency.

In the first experiment, in addition to the three baseline methods discussed in Section 4.1, two other baseline methods, including

7 https://www.eclipse.org/


Baseline1 and Baseline2 are selected in order to evaluate the effectiveness of the two strategies adopted in TruthDiscover. Baseline1 removes the priori belief of all sources, and Baseline2 ignores the interdependencies between objects used in TruthDiscover. The following observations are drawn from the statistical data presented in Figure 8.

1) TruthDiscover outperforms the three baseline methods, including Vote, TruthFinder and F-Quality Assessment, in terms of accuracy. The main reason why these three baseline methods achieve low accuracy is that it is difficult to estimate the reliability degree of “small” sources accurately in Linked Data. In TruthDiscover, two strategies are adopted to reduce the effect of the scale-free property. One strategy is leveraging the topological properties of the Source Belief Graph to estimate the priori beliefs of sources for smoothing the trustworthiness of sources. The other strategy is using HMRF to infer the trust values of objects accurately by modeling the interdependencies between objects.

2) In addition, the accuracy of Baseline2 and Baseline1 is lower than that of TruthDiscover, which indicates that BeliefRank and Algorithm II are effective in reducing the effect of “small” sources.

3) Baseline1 has higher accuracy than TruthFinder. In fact, Baseline1 adopts Bayesian analysis to infer the trustworthiness of sources in the same way as TruthFinder does. The most important difference between Baseline1 and TruthFinder is that two different strategies are adopted to model the interdependencies between objects. TruthFinder uses a fixed parameter to control the influence of related facts; however, an appropriate fixed parameter for all objects is hard to determine. Therefore, TruthFinder is not necessarily effective. Baseline1 considers influence in a principled fashion, and can automatically adjust the influence between objects depending on the HMRF model. Therefore, Baseline1 outperforms TruthFinder in the six datasets.

Figure 8. Performance comparison in six datasets.

The second experiment is conducted to validate the effectiveness of TruthDiscover and the three baseline methods, including Vote, TruthFinder and F-Quality Assessment, with regard to the three reasons for inconsistency. The following observations are drawn from the statistical data presented in Table 4.

1) The average accuracy of the four methods varies across the different reasons. These methods achieve the lowest accuracy for out-of-date conflicts, which indicates that methods based only on source reliability estimation are insufficient to resolve out-of-date conflicts, and extra information is required.

2) For all three reasons, TruthDiscover outperforms the three baseline methods in terms of accuracy because two effective strategies are adopted.

Table 4. Performance comparison with regard to the three reasons for inconsistency.

               Out-of-date                        Variety                            Pure errors
Datasets       TruthD.  TruthF.  F-Quality  Vote  TruthD.  TruthF.  F-Quality  Vote  TruthD.  TruthF.  F-Quality  Vote
Persons        0.45     0.42     0.31       0.20  0.90     0.73     0.57       0.32  0.93     0.78     0.63       0.41
Locations      0.39     0.29     0.32       0.23  0.89     0.68     0.48       0.44  0.92     0.81     0.68       0.45
Organizations  0.41     0.35     0.33       0.17  0.88     0.59     0.42       0.37  0.95     0.81     0.71       0.48
Descriptors    0.48     0.33     0.24       0.23  0.87     0.78     0.51       0.31  0.90     0.83     0.84       0.46
Films          0.49     0.35     0.18       0.19  0.91     0.72     0.53       0.31  0.96     0.93     0.91       0.50
Music          0.43     0.35     0.36       0.25  0.83     0.61     0.42       0.37  0.89     0.88     0.75       0.46

The columns TruthD., TruthF., and F-Quality indicate TruthDiscover, TruthFinder and F-Quality Assessment, respectively.

4.2.2 Convergence Analysis
In this sub-subsection, two experiments are conducted to validate the convergence of TruthDiscover. The first experiment is conducted to analyze the convergence of TruthDiscover. The second experiment is performed to show the relation between accuracy and iteration.

We formulate the problem of resolving conflicts as an iterative computation problem because of the interdependencies between the trust value of objects and the trustworthiness of sources. Therefore, convergence significantly affects the performance of TruthDiscover. Figure 9 shows the average change in the trust value of objects after each iteration. The change decreases rapidly in the first five iterations, and then remains stable until the convergence criterion is satisfied. The average numbers of iterations for persons, locations, organizations, descriptors, films and music are 23, 24, 25, 13, 28 and 29, respectively.


Figure 9. Change in the trust values of objects after each iteration.

The second experiment is conducted to analyze the relationship between accuracy and iteration. The results are shown in Figure 10. The accuracy of TruthDiscover increases as the number of iterations increases and then remains stable until the convergence criterion is satisfied.

Figure 10. Relation between accuracy and iteration.

4.2.3 Time Efficiency Evaluation
We sample different numbers of conflicting objects to determine the computational complexity of TruthDiscover in a single machine. Figure 11 shows the running time for conflicting objects. The power law function is adopted to fit the relationship between running time and number of conflicting objects. We find that the relationship between running time and the number of conflicting objects typically follows a power law with a coefficient of 39.844 and an exponent of 2.037, which verifies the analysis of the time complexity of TruthDiscover discussed in Section 3.4.
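A power-law fit of this kind can be obtained with ordinary least squares in log-log space, since y = a * x^b becomes log y = log a + b * log x. The sketch below shows the technique only; the function name is hypothetical, and any data passed to it in examples is synthetic rather than the measurements behind Figure 11.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via regression on logarithms.

    Requires at least two distinct positive x values and positive y values.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx                         # slope = exponent
    a = math.exp(mean_y - b * mean_x)     # intercept = log of coefficient
    return a, b
```

An exponent close to 2 in such a fit is what one would expect from a procedure whose cost grows quadratically with the number of conflicting objects.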

Figure 11. Running time of different numbers of entities.

The experimental results in Section 4.2 show that (1) the three assumptions are reasonable for automatically resolving conflicts in Linked Data, and (2) the two strategies are effective in reducing the effect of the scale-free property. These results indicate that the performance of TruthDiscover is robust and consistent in various domains.

5. CONCLUSION AND FUTURE WORK
In this study, observations on six datasets reveal that Linked Data has a scale-free property. This property means that only a few sources have many conflicting objects, and most of the sources provide very few objects. Owing to this property, existing work cannot be trivially extended to resolve object conflicts in Linked Data. In this study, the problem of resolving object conflicts in Linked Data is formulated as a truth discovery problem. A truth discovery approach called TruthDiscover is proposed to determine the most trustworthy object, which leverages the topological properties of the Source Belief Graph and the interdependencies between objects to infer the trustworthiness of sources and the trust values of objects. TruthDiscover (footnote 8) is evaluated in six real-world datasets. The experimental results show that TruthDiscover exhibits satisfactory accuracy.

A potential direction for future research is to focus on resolving out-of-date conflicts by leveraging truth discovery and provenance information. Another potential future direction is to identify the copying relations of different sources to improve performance.

6. REFERENCES
[1] C. B. Max Schmachtenberg, Anja Jentzsch and Richard Cyganiak. “State of the LOD Cloud 2014,” URL: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/.

[2] C. Bizer, T. Heath, and T. Berners-Lee, “Linked data-the story so far,” International Journal on Semantic Web and Information Systems, vol. 5(3), pp. 1-22, 2009.

[3] A. Dutta, C. Meilicke, and S. P. Ponzetto, “A probabilistic approach for integrating heterogeneous knowledge sources,” In ESWC, Crete, Greece, 2014, pp. 286-301.

8 For researchers who are interested in our method, six datasets and codes are available at http://1drv.ms/1M2PHoG


[4] J. Michelfeit, T. Knap, and M. Nečaský, “Linked DataIntegration with Conflicts,” arXiv preprint arXiv:1410.7990,2014.

[5] D. Srivastava, and S. Venkatasubramanian, “Informationtheory for data management,” In SIGMOD, Indianapolis,Indiana, 2010, pp. 1255-1256.

[6] Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, and J. Han, “Resolvingconflicts in heterogeneous data by truth discovery and sourcereliability estimation,” In ACM SIGMOD, Utah, USA, 2014,pp. 1187-1198.

[7] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann,and S. Auer, “Quality assessment for linked open data: Asurvey,” Semantic Web Journal, 2013.

[8] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava,“Truth finding on the deep web: is the problem solved?,” InPVLDB, Istanbul, Turkey, 2012, pp. 97-108.

[9] B. Zhao, B. I. Rubinstein, J. Gemmell, and J. Han, “Abayesian approach to discovering truth from conflictingsources for data integration,” In PVLDB, Istanbul, Turkey,2012, pp. 550-561.

[10] J. Bleiholder, and F. Naumann, “Data fusion,” ACMComputing Surveys vol. 41, no. 1, pp. 1, 2008.

[11] X. Yin, J. Han, and P. S. Yu, “Truth discovery with multipleconflicting information providers on the web,” IEEETransactions on Knowledge and Data Engineering, vol. 20,no. 6, pp. 796-808, 2008.

[12] Q. Li, Y. Li, J. Gao, L. Su, B. Zhao, M. Demirbas, W. Fan,and J. Han, “A Confidence-Aware Approach for TruthDiscovery on Long-Tail Data,” In PVLDB, Hangzhou, China,2014.

[13] X. L. Dong, E. Gabrilovich, K. Murphy, V. Dang, W. Horn,C. Lugaresi, S. Sun, and W. Zhang, “Knowledge-Based Trust:Estimating the Trustworthiness of Web Sources,” In PVLDB,Hawai'i, USA, 2015, pp. 938-949.

[14] Y. Li, Q. Li, J. Gao, L. Su, B. Zhao, W. Fan, and J. Han, “Onthe Discovery of Evolving Truth,” In ACM SIGKDD, Sydney,Australia, 2015, pp. 675-684.

[15] V. Vydiswaran, C. Zhai, and D. Roth, “Content-driven trustpropagation framework,” In ACM SIGKDD, CA, USA, 2011,pp. 974-982.

[16] J. Bleiholder, and F. Naumann, “Conflict handling strategiesin an integrated information system,” In WWW, Scotland,United Kingdom, 2006.

[17] X. L. Dong, and F. Naumann, “Data fusion: resolving dataconflicts for integration,” In PVLDB, Lyon, France, 2009, pp.1654-1655.

[18] U. Dayal, and F. C. Center, “Processing queries overgeneralization hierarchies in a multidatabase system,” InPVLDB, Florence, Italy, 1983.

[19] J. M. Kleinberg, “Authoritative sources in a hyperlinkedenvironment,” Journal of the ACM, vol. 46, no. 5, pp. 604-632, 1999.

[20] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J.Han, “A Survey on Truth Discovery,” arXiv preprintarXiv:1505.02463, 2015.

[21] X. L. Dong, L. Berti-Equille, and D. Srivastava, “Integratingconflicting data: the role of source dependence,” In PVLDB,2009, pp. 550-561.

[22] J. Pasternack, and D. Roth, “Knowing what to believe (whenyou already know something),” In Proceedings of the 23rdInternational Conference on Computational Linguistics,Uppsala, Sweden, 2010, pp. 877-885.

[23] H. Glaser, A. Jaffri, and I. Millard, “Managing co-referenceon the semantic web,” In WWW Madrid, Spain, 2009.

[24] A. Hogan, A. Harth, and S. Decker, “Performing objectconsolidation on the semantic web data graph,” In WWW,Alberta, Canada, 2007.

[25] S. Wang, G. Englebienne, and S. Schlobach, “Learning concept mappings from instance similarity,” In ISWC, Karlsruhe, Germany, 2008.

[26] J. Li, J. Tang, Y. Li, and Q. Luo, “RiMOM: A dynamic multistrategy ontology alignment framework,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 8, pp. 1218-1232, 2009.

[27] Y. Qu, W. Hu, and G. Cheng, “Constructing virtual documents for ontology matching,” In WWW, Edinburgh, Scotland, United Kingdom, 2006, pp. 23-31.

[28] W. Hu, N. Jian, Y. Qu, and Y. Wang, “GMO: A graph matching for ontologies,” In K-CAP, Banff, Canada, 2005, pp. 41-48.

[29] P. N. Mendes, H. Mühleisen, and C. Bizer, “Sieve: linked data quality assessment and fusion,” In EDBT/ICDT, Berlin, Germany, 2012, pp. 116-123.

[30] F. Manola, E. Miller, and B. McBride, “RDF 1.1 primer,” URL: http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/.

[31] X. Yin, and W. Tan, “Semi-supervised truth discovery,” In WWW, Lyon, France, 2011, pp. 217-226.

[32] G. Navarro, “A guided tour to approximate string matching,” ACM Computing Surveys, vol. 33, no. 1, pp. 31-88, 2001.

[33] K. P. Murphy, Y. Weiss, and M. I. Jordan, “Loopy belief propagation for approximate inference: An empirical study,” In UAI, Stockholm, Sweden, 1999, pp. 467-475.

[34] M. Dean, G. Schreiber, S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, and L. A. Stein, OWL web ontology language reference, http://www.w3.org/TR/owl-ref/#sameAs-def, 2004.

[35] L. Ding, J. Shinavier, T. Finin, and D. L. McGuinness, “owl:sameAs and Linked Data: An empirical study,” In WebSci, NC, USA, 2010.

[36] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: a core of semantic knowledge,” In WWW, Alberta, Canada, 2007, pp. 697-706.

[37] L. Ding, J. Shinavier, Z. Shangguan, and D. L. McGuinness, “SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data,” In ISWC, Shanghai, China, 2010, pp. 145-160.

[38] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson, “When owl:sameAs isn’t the same: An analysis of identity in linked data,” In ISWC, Shanghai, China, 2010, pp. 305-320.

[39] W. D. Blizard, “Multiset theory,” Notre Dame Journal of Formal Logic, vol. 30, no. 1, pp. 36-66, 1989.

[40] A. Harth, Billion Triples Challenge data set, downloaded from http://km.aifb.kit.edu/projects/btc-2012/, 2012.

[41] Z. Ghahramani, and M. I. Jordan, “Factorial hidden Markov models,” Machine Learning, vol. 29, no. 2-3, pp. 245-273, 1997.

[42] J. Pearl, “Reverend Bayes on inference engines: A distributed hierarchical approach,” In AAAI, Pennsylvania, USA, 1982, pp. 133-136.

[43] R. Isele, J. Umbrich, C. Bizer, and A. Harth, “LDspider: An open-source crawling framework for the Web of Linked Data,” In ISWC, Shanghai, China, 2010.

[44] J. Carletta, “Assessing agreement on classification tasks: the kappa statistic,” Computational Linguistics, vol. 22, no. 2, pp. 249-254, 1996.