
Top-K Query Answering for Probabilistic Data Integration Systems in Pervasive Computing Environment

Peng Pan, Qizhong Li, YuQing Sun, ZhiYong Chen, ZhongMin Yan, YongQuan Dong
School of Computing Science and Technology, ShanDong University

ppan, lqz, sun_yuqing, chenzy, yzm@sdu.edu.cn, dongyongquan@mail.sdu.edu.cn

Abstract

In a pervasive computing environment, one challenge is how to exchange and share information across heterogeneous devices. The mediator-based approach to data integration provides a uniform view by creating semantic relationships between the sources and the mediated schema. Since manual matching is infeasible, many automatic approaches have been developed. The probabilities of the resulting mappings must therefore be taken into account when a query is posed on the global schema of the mediator. Since each tuple in the result has a different probability, we can adapt a top-k algorithm to return the k most probable answers. In this paper, based on a given pervasive computing scenario, we formulate a theoretical model for data integration with probability, and demonstrate the probabilities for schema mappings and query answering. We also describe a distributed top-k algorithm that cooperates with a local top-k algorithm.

Keywords: Pervasive computing, Data integration, Data probability, Schema mapping, Top-k query

1. Introduction

Recently, as the related technologies and theories have developed and matured, pervasive computing has been attracting more and more attention from researchers. The ultimate objective of pervasive computing is to integrate the information space constructed by communication and computing with the physical space in which people live and work [1].

In a pervasive computing environment, one challenge is how to exchange and share information across heterogeneous devices [2][3]. Data integration [4][5][6], which aggregates distributed heterogeneous data sources and provides a uniform view to users, may be the most applicable solution. By creating semantic relationships between the schemas of the sources and the mediator [7], the mediator-based approach maps all the sources to a uniform domain of concepts. Since manually specifying schema matches and mappings is a tedious, time-consuming, error-prone, and therefore expensive process, many faster and less labor-intensive integration approaches and tools supporting automatic matching have been developed [E.Rahn2001]. In reality, however, users are often not skilled enough to provide precise mappings, and the scale of the data sources (for example, when integrating web-scale data) prevents generating and maintaining precise mappings, so the mappings are frequently inaccurate [8] and come with probabilities.

The problem of probabilistic data is becoming increasingly severe. In many areas, such as data integration, scientific data, information retrieval (IR), and sensor networks [13][14][15], the amount of data with probabilities is growing. These uncertainties may result from the data themselves, from the mappings between data instances or schemas (as mentioned above), and from the query processes. As a result, more and more organizations and users face expensive data cleaning (as in the integration of web data), or data that cannot be cleaned at all (such as sensor data or RFID data). In [17], Dalvi and Suciu present the challenges and main problems of probabilistic data.

In a data integration system, if the mappings are created by an automatic or semi-automatic matching approach, the probabilities of the mappings must be considered. When a query is submitted on the global schema of the mediator, it gives rise to one or more query reformulations, each with a certain probability. Since each tuple in the result has a different probability, we can adapt a top-k algorithm to return the k most probable answers.

The paper is organized as follows. Section 2 describes a scenario in a pervasive computing environment that motivates our research. Section 3 discusses related work. Section 4 redefines the theoretical model for data integration by taking probability into account. Based on the scenario of Section 2, Section 5 focuses on the probability of schema mappings and describes an algorithm for query answering. Section 6 describes the distributed and local top-k algorithms, inspired by the algorithms in [8] and [12], and Section 7 concludes.

Our main contributions in this paper are:

1) Considering uncertainty, we define a theoretical model combining the traditional data integration system with probability.
2) We describe distributed and local top-k algorithms applicable to query answering for distributed data integration systems with probabilities.

2. Motivating Scenario

978-1-4244-2020-9/08/$25.00 ©2008 IEEE


To illustrate the research focus and motivate our discussion, consider the following scenario:

Rose is walking in a shopping centre in which many adjacent marketplaces and shops are located. She wants to look through the nearby shops for athletic-leisure shoes that interest her, so she runs an application on her wireless handheld device that sends a request to the surrounding sensors of the shops. In collaboration with the catalog servers, which receive the relayed request, a top-k monitor takes charge of processing the query and responds to Rose with the top-k results.

Figure 1 presents the architecture for the scenario.

Figure 1. The architecture of the scenario

We now describe the data processing in Figure 1:

1) The request contains the schema of the product catalog on Rose's handheld device.

2) The sensors receive the request and relay it to their master catalog servers.

3) Applying automatic schema matching tools, the catalog servers create the schema mappings between the schemas of their product catalogs and that of the handheld device.

4) Referring to the mappings, the catalog servers reformulate the query, whose schema is based on the handheld device's catalog, into new queries on their own schemas.

5) After running the local top-k algorithm to finish processing the query, the catalog servers translate the results into the schema of the handheld device and send their top-k results to the top-k monitor, which completes the main task.

6) The top-k monitor runs the distributed algorithm to select the final top-k results.

7) The handheld device gets the results from the top-k monitor and presents them to Rose.

In 3), since matching is performed automatically, it is unavoidable that an original query yields more than one reformulation on each catalog server. As each reformulation has its own probability, the tuples in its query answer have probabilities that differ from those of other reformulations. Since a tuple may appear in the answers of more than one reformulation, each tuple has its own probability value. We can select the top-k tuples by probability with an appropriate top-k algorithm. In 5), each catalog server accomplishes the local top-k selection. Several groups of top-k tuples are thus produced, and the top-k monitor needs to execute a distributed top-k algorithm to get the final top-k selection among these groups. This process is carried out in cooperation with the catalog servers, as described in 6).

3. Related Work

In data integration research, [10] describes several languages for describing the contents of data sources, the tradeoffs between them, and the associated reformulation algorithms. [4] discusses a series of problems theoretically, including modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries. [22] discusses an abstract view of a data integration system in which the global view is an ontology expressed in a class-based formalism. However, none of them takes uncertainty into consideration. As the first comprehensive discussion of probabilistic data integration, [8] introduces the concept of probabilistic schema mappings and analyzes their formal foundations.

Current research on probabilistic data focuses on probabilistic databases and on the management of probabilistic data in general. Work on probabilistic mappings has started only in recent years [17][20][16]. [18] uses the top-k schema mappings obtained by a semi-automatic mapper to improve the precision of the top mapping. [19] combines the fields of IR and machine learning to find suitable mapping candidates. However, they do not appear to treat the relationship between mappings and uncertainty as a whole. [8] presents two possible semantics for probabilistic mappings, BY-TABLE and BY-TUPLE, analyzes the complexity of answering queries in the presence of approximate schema mappings and proposes algorithms for it, and describes an algorithm for efficiently computing the top-k answers to queries in such a setting.

In the research on top-k queries, the representative algorithm is the Threshold Algorithm (TA) [21]. It is generally applicable in database applications, but it is inefficient in terms of bandwidth consumption when used to answer top-k queries in large distributed networks. Hence, [11] proposes the first algorithm that calculates the top-k objects in distributed systems in a constant number of rounds, referred to as the Three-Phase Uniform-Threshold algorithm (TPUT). Since TPUT does not take data distributions into account, [12] proposes different algorithms that also calculate top-k queries in a constant number of rounds and further enhance performance by accounting for varying data distributions. They are referred to as the Three-Phase Adaptive-Threshold algorithm (TPAT), the Three-Phase Object-Ranking based algorithm (TPOR) and the Hybrid-Threshold algorithm (HT). In [23], R. Akbarinia et al. propose two new algorithms, BPA and BPA2, which stop much sooner than TA.

4. The model for probabilistic data integration

Figure 2. The model of probabilistic data integration system (figure labels: probabilistic mapping, Local Schema)

Based on the theoretical model for data integration proposed in [4], this paper defines a data integration model with probability (see Figure 2), with the following formalization:

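A data integration system $A$ is a quadruple $\langle G, S, D_p, M_p \rangle$, where

$G$ is the global schema, expressed by a logic theory over a related alphabet $A_G$.

$S$ is the source schema, expressed by a logic theory over a related alphabet $A_S$.

$D_p$ is a pair $\langle D, P_d \rangle$, where $D = \{ I_i \mid I_i \text{ is a database satisfying } S,\ i \in [1, N] \}$ and $P_d = \{ \langle I_i, \Pr(I_i) \rangle \mid i \in [1, N],\ I_i \in D,\ \Pr(I_i) \in (0, 1],\ \sum_{i=1}^{N} \Pr(I_i) = 1 \}$.

$M_p$ is a pair $\langle M, P_m \rangle$, where $M = \{ m_i \mid m_i \text{ is a mapping between } S \text{ and } G \}$ and $P_m = \{ \langle m_i, \Pr(m_i) \rangle \mid i \in [1, l],\ m_i \in M,\ \Pr(m_i) \in (0, 1],\ \sum_{i=1}^{l} \Pr(m_i) = 1 \}$.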

A mapping $m_i \in M$ can be defined as a set of correspondences between attributes, each expressed as $c_{ij} = (s_i, t_j)$, where $s_i$ is a source attribute of schema S and $t_j$ is a target attribute of the target schema T (here, the global schema G).
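The pair of mappings and probabilities can be sketched as a small data structure. This is a minimal illustration only: the class names `Mapping` and `ProbabilisticMapping` and the attribute names `s1`, `t1`, and so on are hypothetical and do not come from the paper. The sketch simply enforces the two conditions stated in the definition, namely that every probability lies in (0, 1] and that the probabilities sum to 1.

```python
from dataclasses import dataclass
from math import isclose


@dataclass(frozen=True)
class Mapping:
    """One candidate mapping m_i, as (source attribute, target attribute) pairs."""
    correspondences: tuple


@dataclass(frozen=True)
class ProbabilisticMapping:
    """M_p = <M, P_m>: candidate mappings paired with their probabilities."""
    entries: tuple  # tuple of (Mapping, probability) pairs

    def __post_init__(self):
        probs = [p for _, p in self.entries]
        # Each Pr(m_i) must lie in (0, 1].
        if not all(0.0 < p <= 1.0 for p in probs):
            raise ValueError("each Pr(m_i) must lie in (0, 1]")
        # The probabilities of all candidate mappings must sum to 1.
        if not isclose(sum(probs), 1.0):
            raise ValueError("the probabilities Pr(m_i) must sum to 1")


# Hypothetical attribute names, for illustration only.
m1 = Mapping((("s1", "t1"), ("s2", "t2")))
m2 = Mapping((("s2", "t1"), ("s1", "t2")))
pm = ProbabilisticMapping(((m1, 0.6), (m2, 0.4)))
```

Validating at construction time keeps an ill-formed set of mappings from reaching query answering, where the probabilities are later summed per tuple.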

There are three approaches to creating mappings in a traditional integration system: local-as-view (LAV), global-as-view (GAV) [10] and GLAV [9]. [4] and [10] give comprehensive comparisons of the characteristics of LAV and GAV. LAV makes it easy to describe the sources but requires a complex query process, while GAV makes query processing straightforward but is inconvenient for expressing and maintaining the source schemas. GLAV combines characteristics of both. In the above scenario, since the handheld device uses a light-weight database, it is necessary to simplify the query process. At the same time, the application is only concerned with queries based on its own schema, and need not adjust its schema to stay consistent with the varying pervasive computing environment. As a result, the GAV approach is the appropriate candidate for creating the mappings in this paper, with the following general formalization:

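$\forall \mathbf{x}\,(q_S(\mathbf{x}) \rightarrow g(\mathbf{x}))$ (when $g \supseteq q_S$, i.e. a sound view)

$\forall \mathbf{x}\,(q_S(\mathbf{x}) \leftrightarrow g(\mathbf{x}))$ (when $g = q_S$, i.e. an exact view)

where $g$ is an element of the global schema $G$, which is expressed by a logic theory over a related alphabet $A_G$, and $q_S$ is a query over $S$ of the same arity as $g$.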
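The difference between a sound and an exact view can be checked on concrete extensions. The following is a small sketch under the assumption that both the answers of the source query and the extension of the global relation are available as finite collections of tuples; the function name `view_kind` is hypothetical.

```python
def view_kind(q_s_answers, g_extension):
    """Classify a GAV view by comparing the answers of q_S with the extension of g.

    "exact":   q_S and g contain the same tuples (an exact view is also sound).
    "sound":   every answer of q_S is in g, but g holds further tuples.
    "neither": some answer of q_S is missing from g.
    """
    q, g = set(q_s_answers), set(g_extension)
    if q == g:
        return "exact"
    if q <= g:
        return "sound"
    return "neither"


print(view_kind([("Casual",)], [("Casual",), ("Fitness",)]))
print(view_kind([("Casual",)], [("Casual",)]))
```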

5. Probability for the schema mappings

5.1. The probability of mappings

If the schema mappings are created automatically, it is highly probable that one or more candidates appear, each with a probability. [8] distinguishes two semantics for how tuples relate to the mappings:

1) BY-TABLE: all the tuples follow the same schema mapping.

2) BY-TUPLE: in the source S, there is more than one subset of tuples, grouped by different mappings.

In this paper our discussion has these limitations:

1) Our discussion only covers the relational data model, in which a schema is considered as a set of relations, and a relation as a set of attributes.

2) Only select-project-join (SPJ) queries in SQL are considered.

3) The GAV mappings take the form in which a single relation of T is expressed by only one projection query over S.

Example 1

In the schema G of the handheld device, there is a relation DIRECTORY (see Table 1); in the catalog database of a shop's server CS1, there is a relation PRODUCT (see Table 2); when matching is finished on CS1, the possible mappings with their probabilities are those shown in Table 3.

Table 1. The relation DIRECTORY

DIRECTORY
categories | brand | name | pro_id | price | size | lifestyle

Table 2. An instance of PRODUCT

PRODUCT
class | mark | pname | code | price | width | series
Casual | Nike | Track Racer | 74129641 | 84 | B | Athlesisure
Fitness | Adidas | LS-Move | 73932941 | 79 | A | Fitness
Fitness | Reebok | Zan Chi Ballina | 7216702 | 79 | B | Athlesisure

Table 3. A probabilistic schema mapping between DIRECTORY and PRODUCT

Possible mapping | Probability
PRODUCT(class, mark, pname, code, price, width, series) => DIRECTORY(categories, brand, name, pro_id, price, size, lifestyle) | 0.5
PRODUCT(series, pname, mark, code, price, width, class) => DIRECTORY(categories, brand, name, pro_id, price, size, lifestyle) | 0.2
PRODUCT(mark, series, pname, code, price, width, class) => DIRECTORY(categories, brand, name, pro_id, price, size, lifestyle) | 0.2
PRODUCT(pname, mark, class, code, price, width, series) => DIRECTORY(categories, brand, name, pro_id, price, size, lifestyle) | 0.1

5.2. Query answering

When a query is posed over the schema of G, several reformulations over the schema of S are formed according to the mappings, and the probabilities of the returned tuples may differ.

Continuing Example 1, suppose the following SQL query is submitted on the handheld device and reformulated with the query reformulation algorithm in [8]:

select categories from DIRECTORY

Since the schema of G is represented by views over the source schema S, it is convenient to unfold each relation of G according to the corresponding mapping, which gives the following query reformulations:

Q1: select class from PRODUCT
Q2: select series from PRODUCT
Q3: select mark from PRODUCT
Q4: select pname from PRODUCT
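The unfolding step can be sketched in a few lines. The sketch assumes that each candidate mapping in Table 3 lists the PRODUCT attributes in positional alignment with the DIRECTORY attributes, and it only handles single-attribute projection queries; the names `DIRECTORY`, `P_MAPPING` and `reformulate` are illustrative choices, not part of the paper or of [8].

```python
# DIRECTORY attributes in order, as in Table 1.
DIRECTORY = ("categories", "brand", "name", "pro_id", "price", "size", "lifestyle")

# Table 3: PRODUCT attributes positionally aligned with DIRECTORY, plus probability.
P_MAPPING = [
    (("class", "mark", "pname", "code", "price", "width", "series"), 0.5),
    (("series", "pname", "mark", "code", "price", "width", "class"), 0.2),
    (("mark", "series", "pname", "code", "price", "width", "class"), 0.2),
    (("pname", "mark", "class", "code", "price", "width", "series"), 0.1),
]


def reformulate(target_attr, p_mapping=P_MAPPING):
    """Unfold `select target_attr from DIRECTORY` under each candidate mapping.

    Returns a list of (SQL text, probability) pairs, one per mapping.
    """
    pos = DIRECTORY.index(target_attr)
    return [(f"select {src[pos]} from PRODUCT", prob) for src, prob in p_mapping]


for sql, prob in reformulate("categories"):
    print(prob, sql)
```

Each reformulation inherits the probability of the mapping it was unfolded with, which is what the subsequent answer computation relies on.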

The probabilities of the above query reformulations are therefore 0.5, 0.2, 0.2 and 0.1.

By executing these queries, we get the results shown in Table 4. Since the item 'Fitness' is returned by both Q1 and Q2, its probability is the sum of the probabilities of the two reformulations (0.7).

Table 4. The answers of Q over PRODUCT

Tuple | Probability
Casual | 0.5
Fitness | 0.5 + 0.2
Athlesisure | 0.2
Nike | 0.2
Adidas | 0.2
Reebok | 0.2
Track Racer | 0.1
LS-Move | 0.1
Zan Chi Ballina | 0.1
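The way Table 4 is obtained can be reproduced with a short sketch. It assumes set semantics for each reformulation (a value returned several times by one reformulation counts once) and keeps only the four PRODUCT columns that the reformulations project on; the names `COLUMNS`, `PRODUCT`, `REFORMULATIONS` and `answer` are illustrative.

```python
from collections import defaultdict

# The columns of Table 2 that Q1..Q4 project on.
COLUMNS = ("class", "mark", "pname", "series")
PRODUCT = [
    ("Casual", "Nike", "Track Racer", "Athlesisure"),
    ("Fitness", "Adidas", "LS-Move", "Fitness"),
    ("Fitness", "Reebok", "Zan Chi Ballina", "Athlesisure"),
]
# (projected attribute, probability) for Q1..Q4.
REFORMULATIONS = [("class", 0.5), ("series", 0.2), ("mark", 0.2), ("pname", 0.1)]


def answer(reformulations, rows, columns):
    """Sum, for every returned value, the probabilities of the reformulations returning it."""
    scores = defaultdict(float)
    for attr, prob in reformulations:
        idx = columns.index(attr)
        # Distinct values only: a value counts once per reformulation.
        for value in {row[idx] for row in rows}:
            scores[value] += prob
    return dict(scores)


table4 = answer(REFORMULATIONS, PRODUCT, COLUMNS)
for value, prob in sorted(table4.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{value}: {prob:.1f}")
```

'Fitness' is returned by both the `class` and the `series` projections, so it accumulates 0.5 + 0.2, while every other value is returned by exactly one reformulation.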

6. Distributed Top-K query answering

6.1. Top-K query answering

In the above scenario, owing to the probabilistic mappings and to the light-weight devices, which cannot and need not accept all results, it is practical to return to Rose only the top-k elements (i.e. tuples here) whose scores (i.e. probabilities here) are the highest. Since there may be more than one node (i.e. catalog server here), the top-k elements have to be selected among these nodes. To achieve this, the top-k elements are first picked out locally on each node, and then a distributed top-k algorithm runs on the top-k monitor to decide the final top-k elements by aggregating the local top-k elements of each node. Based on the top-k algorithm for BY-TABLE mappings [8], we design the algorithm for local top-k query answering. The algorithm of [8] ends as soon as the top-k elements have been picked out, at the cost of not knowing their final scores. We set a threshold value δ that keeps the top-k process running at each node, so that the top-k scored elements needed by the monitor are produced.

We now introduce the top-k algorithm at a single node (i.e. the local top-k algorithm).

For each tuple t of the results, there are an upper bound and a lower bound:

Pmax(t) represents the highest probability still available to t in the rest of the query process, with the initial value 0.

Pmin(t) represents the total probability of t up to the current point of the query process; the top-k algorithm decides the top-k elements in terms of this value.

During the whole query process, five variables are kept:

PMax: the highest probability available for the whole query.

th: the k-th largest Pmin among the tuples in the answer list.

L: a list whose elements are tuples.

i: the index of the round (of Qi) at which the algorithm ends.

δ: a count threshold. When PMax < th, no new tuple t from the results of the remaining reformulations would be added to the list of candidates, because even if its Pmin reached PMax, it would still be below th (the Pmin of the k-th candidate). The determination of the top-k candidates can therefore end at this point. However, in the remaining reformulations' query processes the Pmin of some candidates may still increase, so the final order of the top-k candidates may differ from the order at the moment PMax drops below th. Therefore, a count threshold δ (δ <= n, where n is the number of reformulations) is set to guarantee the right order of the top-k candidates without increasing the complexity. In this case, if PMax < th, the query process continues up to the δ-th reformulation, and the effect of the remaining reformulations Qδ+1, ..., Qn can be ignored because of their very small probabilities.

Algorithm 1:

Input: a query Q whose schema is based on G
Output: the top-k tuples, i.e. the k tuples with the highest probabilities on one node.

1. Considering the possible combinations of mappings for Q, reformulate Q into Q1, ..., Qn; set i to zero.

2. Execute Q1, ..., Qn in descending order of their probabilities, and after each Qi update the results as follows:

1) PMax = PMax - Pr(Qi)
2) Set th to the k-th largest Pmin in L
3) For each tuple t of L:
   if t appears in the result of Qi, Pmin(t) = Pmin(t) + Pr(Qi)
   else Pmax(t) = Pmax(t) - Pr(Qi), which means that t's available value decreases
4) If PMax >= th and t is not in L, then add t into L, with Pmin(t) = Pr(Qi) and Pmax(t) = PMax
5) When th > PMax, end the cycle of executing the Qi; i then holds the index of the round at which the algorithm ends

3. Filter out of L the tuples for which th > Pmax(t), get the top-k elements, and rank them in descending order.
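The local selection can be sketched as follows. This is an illustrative sketch rather than the paper's exact procedure: it assumes each reformulation is given as a pair of its probability and the set of values it returns, it initialises a new tuple's upper bound before subtracting the current round's probability (one consistent bookkeeping order chosen for the sketch), and it keeps running until at least `delta` rounds have been executed. The function and variable names are hypothetical. When the loop stops early, the returned lower bounds are approximations of the final scores, which matches the idea of ignoring the low-probability tail.

```python
def local_top_k(reformulations, k, delta):
    """Return ([(tuple, p_min, p_max)], rounds) for the k best tuples on one node.

    reformulations: list of (probability, set of returned tuples).
    delta: minimum number of rounds to execute before stopping early.
    """
    ordered = sorted(reformulations, key=lambda r: r[0], reverse=True)
    p_unseen = 1.0         # PMax: best total a not-yet-seen tuple could still reach
    p_min, p_max = {}, {}  # per-tuple lower and upper bounds
    th = 0.0               # k-th largest lower bound so far
    rounds = 0
    for prob, answers in ordered:
        rounds += 1
        closed = p_unseen < th  # no new tuple can enter the top-k any more
        for t in answers:
            if t in p_min:
                p_min[t] += prob
            elif not closed:
                p_min[t] = prob
                p_max[t] = p_unseen  # t missed every earlier reformulation
        for t in p_min:
            if t not in answers:
                p_max[t] -= prob     # t can no longer gain this probability
        p_unseen -= prob
        ranked = sorted(p_min.values(), reverse=True)
        th = ranked[k - 1] if len(ranked) >= k else 0.0
        if p_unseen < th and rounds >= delta:
            break
    best = sorted(p_min, key=lambda t: p_min[t], reverse=True)[:k]
    return [(t, p_min[t], p_max[t]) for t in best], rounds


# The four reformulations of Example 1 with the values they return.
example = [
    (0.5, {"Casual", "Fitness"}),
    (0.2, {"Athlesisure", "Fitness"}),
    (0.2, {"Nike", "Adidas", "Reebok"}),
    (0.1, {"Track Racer", "LS-Move", "Zan Chi Ballina"}),
]
top, rounds = local_top_k(example, k=2, delta=2)
print(top, rounds)
```

On Example 1 with k = 2 the loop can stop after the second reformulation, because the remaining probability mass (0.3) is below the second-best lower bound (0.5). Setting `delta` to the number of reformulations forces all rounds to run, so the lower bounds become the exact scores.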

6.2. Distributed Top-K algorithm

Since Rose searches more than one catalog server to find the k tuples with the highest probabilities, the top-k elements have to be selected among several groups, each holding the top-k elements provided by one node. [11] proposes the Three-Phase Uniform-Threshold algorithm to solve the distributed top-k problem; based on [11], [12] presents three improved algorithms, and the method in this paper is motivated by one of them, the Hybrid-Threshold algorithm (HT).

Algorithm 2:

Input: n groups of top-k elements
Output: the new top-k elements whose probabilities are the highest among all nodes.

1. Each node sends its top-k elements to the top-k monitor; this relies on each node having run Algorithm 1 in advance. The monitor calculates the partial sums (i.e. the sums of probabilities here) of all the elements seen so far, and puts the elements with the k highest partial sums into the list Ld. The k-th partial sum is set as the threshold τ1.

2. The monitor broadcasts Ld and T = τ1/m (m is the number of nodes) to all nodes; each node executes:

1) For each tuple t of Ld, set Pmin(t) to its partial sum. Starting from round i and until i >= δ, continue with the remaining queries Qi+1, ...; at each round, for each tuple t of Qi: if t ∈ Ld, Pmin(t) = Pmin(t) + Pr(Qi).
2) Reorder the tuples in terms of Pmin.
3) Determine the lowest local Pmin, S_lowest, among all the k elements in Ld. Then each node sends the list of tuples whose Pmin is no less than Ti = max(S_lowest, T) to the monitor.

3. The monitor recalculates the partial sum for each tuple, identifies the k highest elements, and sets τ2 to the k-th partial sum.

4. The monitor checks whether the value of Ti from node i is more than Tpatch = τ2/2. If so, the monitor sends Tpatch to node i, requesting the elements whose score is no less than Tpatch. Then the monitor calculates the partial sums of all elements to identify the k highest.

5. The monitor sends the k highest candidates to all nodes, each node returns the Pmin of these elements, and the monitor then calculates the final sums and decides the final order.
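The monitor's side of the exchange can be sketched in simplified form. The sketch below is deliberately closer to the uniform-threshold scheme of [11] than to the full hybrid-threshold procedure: it uses the single threshold T = τ1/m for every node and omits the per-node thresholds Ti and the patch step. It also assumes that every node holds exact, non-negative local probabilities for its tuples, and all function and variable names are hypothetical.

```python
from collections import defaultdict


def distributed_top_k(nodes, k):
    """nodes: list of dicts {tuple: local probability}. Returns the global top-k."""
    m = len(nodes)

    def kth_partial_sum(reports):
        sums = defaultdict(float)
        for report in reports:
            for t, s in report:
                sums[t] += s
        ranked = sorted(sums.values(), reverse=True)
        return sums, (ranked[k - 1] if len(ranked) >= k else 0.0)

    # Phase 1: every node reports its local top-k; the monitor derives tau1.
    local_tops = [
        sorted(n.items(), key=lambda kv: kv[1], reverse=True)[:k] for n in nodes
    ]
    _, tau1 = kth_partial_sum(local_tops)
    threshold = tau1 / m

    # Phase 2: every node reports all tuples scoring at least T = tau1 / m.
    reports = [[(t, s) for t, s in n.items() if s >= threshold] for n in nodes]
    sums, tau2 = kth_partial_sum(reports)

    # Keep tuples whose upper bound (partial sum + T per silent node) reaches tau2.
    candidates = []
    for t, partial in sums.items():
        silent = sum(1 for r in reports if all(x != t for x, _ in r))
        if partial + silent * threshold >= tau2:
            candidates.append(t)

    # Phase 3: the nodes return exact scores for the candidates only.
    totals = {t: sum(n.get(t, 0.0) for n in nodes) for t in candidates}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Illustrative local scores on three catalog servers (made-up values).
servers = [
    {"Fitness": 0.7, "Casual": 0.5, "Nike": 0.2},
    {"Fitness": 0.5, "Nike": 0.6, "Casual": 0.1},
    {"Casual": 0.2, "Nike": 0.35, "Adidas": 0.9},
]
print(distributed_top_k(servers, k=2))
```

Because scores are non-negative, a tuple whose total reaches τ1 must score at least τ1/m on some node, so every true top-k tuple is reported in the second phase; the third phase then replaces partial sums by exact totals.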

7. Conclusion and future work

This paper discusses problems related to probabilistic data integration, including the theoretical model, probabilistic mappings, probabilistic queries, and distributed top-k queries. Based on a given scenario in a pervasive computing environment, we formulate a theoretical model for data integration with probability, and demonstrate the probabilities for schema mappings and query answering. Finally, we describe a distributed top-k algorithm that cooperates with the local top-k algorithm.

Many other problems remain to be solved and studied further. Having discussed only the probabilities of BY-TABLE mappings, we will investigate BY-TUPLE mappings in the future, since they are more complex than BY-TABLE mappings. We also need to improve the top-k algorithms.

References

[1] Mark Weiser, The Computer for the 21st Century, Scientific American, Sep. 1991, 265(3): 94-104.

[2] Debashis Saha and Amitava Mukherjee, Pervasive computing: a paradigm for the 21st century, Computer, vol. 36, no. 3, Mar. 2003, pp. 25-31.

[3] K. Henricksen, J. Indulska and A. Rakotonirainy, Infrastructure for Pervasive Computing: Challenges, Workshop on Pervasive Computing and Information Logistics at Informatik 2001, Vienna, September 25-28, 2001.

[4] Maurizio Lenzerini, Data integration: A theoretical perspective, The 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002), Madison, WI, United States, June 03-05, 2002, pp. 233-246.

[5] A. Halevy, A. Rajaraman, and J. Ordille, Data integration: The teenage years, In VLDB, 2006, pp. 9-16.

[6] Patrick Ziegler and Klaus R. Dittrich, Three decades of data integration - all problems solved?, In R. Jacquart (ed.), 18th IFIP World Computer Congress (WCC 2004), Volume 12, Building the Information Society, volume 156 of IFIP International Federation for Information Processing, pp. 3-12, Toulouse, France, August 22-27, 2004, Kluwer.

[7] Renee J. Miller, Laura M. Haas, and Mauricio A. Hernandez, Schema Mapping as Query Discovery, Proceedings of the 26th International Conference on Very Large Data Bases, 2000, pp. 77-88.

[8] Xin Dong, Alon Y. Halevy, and Cong Yu, Data integration with uncertainty, Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, 2007, pp. 687-698.

[9] M. Friedman, A. Levy, and T. Millstein, Navigational plans for data integration, In Proc. of the 16th Nat. Conf. on Artificial Intelligence (AAAI'99), AAAI Press/The MIT Press, 1999, pp. 67-73.

[10] Alon Y. Levy, Logic-based techniques in data integration, in J. Minker (ed.), Logic-based artificial intelligence, Kluwer Academic, Dordrecht, 2000, pp. 575-595.

[11] Pei Cao and Zhe Wang, Efficient Top-K query calculation in distributed networks, Proceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing, 2004, pp. 206-215.

[12] Hailing Yu, Hua-Gang Li, Ping Wu, Divyakant Agrawal, and Amr El Abbadi, Efficient Processing of Distributed Top-k Queries, Database and Expert Systems Applications: 16th International Conference, DEXA 2005, Proceedings, Lecture Notes in Computer Science, v 3588, 2005, pp. 65-74.

[13] Reynold Cheng and Sunil Prabhakar, Managing uncertainty in sensor databases, SIGMOD Record, v 32, n 4, December 2003, pp. 41-46.

[14] AH. Doan, R. Ramakrishnan, and S. Vaithyanathan, Managing information extraction: State of the art and research directions, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2006, pp. 799-800.

[15] D. Florescu, D. Koller, and A. Levy, Using probabilistic information in data integration, In VLDB, 1997.

[16] Dan Suciu and Nilesh Dalvi, Foundations of probabilistic answers to queries, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2005, pp. 963.

[17] Nilesh Dalvi and Dan Suciu, Management of probabilistic data: Foundations and challenges, Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007, pp. 1-12.

[18] Avigdor Gal, Managing uncertainty in schema matching with top-K schema mappings, Journal on Data Semantics VI - Special Issue on Emergent Semantics, Lecture Notes in Computer Science, v 4090, 2006, pp. 90-114.

[19] Henrik Nottelmann and Umberto Straccia, Information retrieval and machine learning for probabilistic schema matching, Information Processing and Management, v 43, n 3, May 2007, pp. 552-576.

[20] Anish Das Sarma, Omar Benjelloun, Alon Halevy, and Jennifer Widom, Working models for uncertain data, Proceedings of the 22nd International Conference on Data Engineering, ICDE '06, 2006, pp. 7.

[21] R. Fagin, A. Lotem, and M. Naor, Optimal aggregation algorithms for middleware, Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2001, pp. 102-113.

[22] Diego Calvanese and Giuseppe De Giacomo, Data integration: A logic-based perspective, AI Magazine, v 26, n 1, Spring 2005, pp. 59-70.

[23] R. Akbarinia, E. Pacitti, and P. Valduriez, Best position algorithms for top-k queries, Proceedings of the 33rd International Conference on VLDB, 2007, pp. 495-506.
