
A Privacy Leakage Upper Bound Constraint-Based Approach for
Cost-Effective Privacy Preserving of Intermediate Data Sets in Cloud

Xuyun Zhang, Chang Liu, Surya Nepal, Suraj Pandey, and Jinjun Chen, Member, IEEE

Abstract—Cloud computing provides massive computation power and storage capacity which enable users to deploy computation- and data-intensive applications without infrastructure investment. During the processing of such applications, a large volume of intermediate data sets will be generated, and often stored to save the cost of recomputing them. However, preserving the privacy of intermediate data sets becomes a challenging problem because adversaries may recover privacy-sensitive information by analyzing multiple intermediate data sets. Encrypting all data sets in cloud is widely adopted in existing approaches to address this challenge. But we argue that encrypting all intermediate data sets is neither efficient nor cost-effective because it is very time consuming and costly for data-intensive applications to en/decrypt data sets frequently while performing any operation on them. In this paper, we propose a novel upper bound privacy leakage constraint-based approach to identify which intermediate data sets need to be encrypted and which do not, so that privacy-preserving cost can be saved while the privacy requirements of data holders can still be satisfied. Evaluation results demonstrate that the privacy-preserving cost of intermediate data sets can be significantly reduced with our approach over existing ones where all data sets are encrypted.

Index Terms—Cloud computing, data storage privacy, privacy preserving, intermediate data set, privacy upper bound


1 INTRODUCTION

Technically, cloud computing is regarded as an ingenious combination of a series of technologies, establishing a novel business model by offering IT services and using economies of scale [1], [2]. Participants in the business chain of cloud computing can benefit from this novel model. Cloud customers can save huge capital investment in IT infrastructure and concentrate on their own core business [3]. Therefore, many companies and organizations have been migrating or building their business into cloud. However, numerous potential customers are still hesitant to take advantage of cloud due to security and privacy concerns [4], [5].

The privacy concerns caused by retaining intermediate data sets in cloud are important but have received little attention. Storage and computation services in cloud are equivalent from an economical perspective because they are charged in proportion to their usage [1]. Thus, cloud users can store valuable intermediate data sets selectively when processing original data sets in data-intensive applications like medical diagnosis, in order to curtail the overall expenses by avoiding frequent recomputation to obtain these data sets [6], [7]. Such scenarios are quite common because data users often reanalyze results, conduct new analysis on intermediate data sets, or share some intermediate results with others for collaboration. Without loss of generality, the notion of intermediate data set herein refers to intermediate and resultant data sets [6]. However, the storage of intermediate data enlarges attack surfaces, so that the privacy requirements of data holders are at risk of being violated. Usually, intermediate data sets in cloud are accessed and processed by multiple parties, but rarely controlled by the original data set holders. This enables an adversary to collect intermediate data sets together and menace privacy-sensitive information from them, bringing considerable economic loss or severe social reputation impairment to data owners. Yet little attention has been paid to such a cloud-specific privacy issue.

Existing technical approaches for preserving the privacy of data sets stored in cloud mainly include encryption and anonymization. On one hand, encrypting all data sets, a straightforward and effective approach, is widely adopted in current research [8], [9], [10]. However, processing encrypted data sets efficiently is quite a challenging task, because most existing applications only run on unencrypted data sets. Although recent progress has been made in homomorphic encryption, which theoretically allows performing computation on encrypted data sets, applying current algorithms is rather expensive due to their inefficiency [11]. On the other hand, partial information of data sets, e.g., aggregate information, must be exposed to data users in most cloud applications like data mining and analytics. In such cases, data sets are anonymized rather than encrypted to ensure both data utility and privacy preserving. Current privacy-preserving techniques like generalization [12] can withstand most privacy attacks on a single data set, while preserving privacy for multiple data sets is still a challenging problem [13]. Thus, for preserving the privacy of multiple data sets, it is promising to anonymize all data sets first and then encrypt them before storing or sharing them in cloud. Usually, the volume of intermediate data sets is huge [6]. Hence, we argue that encrypting all intermediate data sets will lead to high overhead and low efficiency when they are frequently accessed or processed. As such, we propose to encrypt only part of the intermediate data sets, rather than all of them, to reduce the privacy-preserving cost.

In this paper, we propose a novel approach to identify which intermediate data sets need to be encrypted while others do not, in order to satisfy the privacy requirements given by data holders. A tree structure is modeled from the generation relationships of intermediate data sets to analyze the privacy propagation of data sets. As quantifying the joint privacy leakage of multiple data sets efficiently is challenging, we exploit an upper bound constraint to confine privacy disclosure. Based on such a constraint, we model the problem of saving privacy-preserving cost as a constrained optimization problem. This problem is then divided into a series of subproblems by decomposing the privacy leakage constraints. Finally, we design a practical heuristic algorithm accordingly to identify the data sets that need to be encrypted. Experimental results on real-world and extensive data sets demonstrate that the privacy-preserving cost of intermediate data sets can be significantly reduced with our approach over existing ones where all data sets are encrypted.

The major contributions of our research are threefold. First, we formally demonstrate the possibility of ensuring privacy leakage requirements without encrypting all intermediate data sets when encryption is incorporated with anonymization to preserve privacy. Second, we design a practical heuristic algorithm to identify which data sets need to be encrypted for preserving privacy while the rest do not. Third, experimental results demonstrate that our approach can significantly reduce privacy-preserving cost over existing approaches, which is quite beneficial for cloud users who utilize cloud services in a pay-as-you-go fashion.

This paper is a significantly improved version of [14]. Based on [14], we mathematically prove that our approach can ensure privacy-preserving requirements. Further, the heuristic algorithm is redesigned by considering more factors. We extend the experiments over real data sets. Our approach is also extended to a graph structure.

The remainder of this paper is organized as follows: The related work is reviewed in the next section. A motivating example and problem analysis are given in Section 3. In Section 4, we present the fundamental privacy representation of data sets and derive privacy leakage upper bound constraints. Section 5 formulates our approach. In Section 6, we evaluate the proposed approach by conducting experiments on both real-world data sets and extensive data sets. Finally, we conclude this paper and discuss our future work in Section 7.

2 RELATED WORK

We briefly review the research on privacy protection in cloud, intermediate data set privacy preserving, and Privacy-Preserving Data Publishing (PPDP).

Currently, encryption is exploited by most existing research to ensure data privacy in cloud [8], [9], [10]. Although encryption works well for data privacy in these approaches, it is necessary to encrypt and decrypt data sets frequently in many applications. Encryption is usually integrated with other methods to achieve cost reduction, high data usability, and privacy protection. Roy et al. [15] investigated the data privacy problem caused by MapReduce and presented a system named Airavat which incorporates mandatory access control with differential privacy. Puttaswamy et al. [16] described a set of tools called Silverline that identifies all functionally encryptable data and then encrypts them to protect privacy. Zhang et al. [17] proposed a system named Sedic which partitions MapReduce computing jobs in terms of the security labels of the data they work on and then assigns the computation without sensitive data to a public cloud. The sensitivity of data is required to be labeled in advance to make the above approaches applicable. Ciriani et al. [18] proposed an approach that combines encryption and data fragmentation to achieve privacy protection for distributed data storage by encrypting only part of the data sets. We follow this line, but integrate data anonymization and encryption together to fulfill cost-effective privacy preserving.

The importance of retaining intermediate data sets in cloud has been widely recognized [6], [7], but the research on privacy issues incurred by such data sets has just commenced. Davidson et al. [19], [20], [21] studied the privacy issues in workflow provenance, and proposed to achieve module privacy preserving and high utility of provenance information by carefully hiding a subset of intermediate data. This general idea is similar to ours, yet our research mainly focuses on data privacy preserving from an economical cost perspective while theirs concentrates majorly on the functionality privacy of workflow modules rather than data privacy. Our research also differs from theirs in several aspects such as data hiding techniques, privacy quantification, and cost models. But our approach can be complementarily used for the selection of hidden data items in their research if economical cost is considered.

The PPDP research community has investigated privacy-preserving issues extensively and made fruitful progress with a variety of privacy models and preserving methods [13]. Privacy principles such as k-anonymity [22] and l-diversity [23] are put forth to model and quantify privacy, yet most of them are only applied to a single data set. Privacy principles for multiple data sets have also been proposed, but they aim at specific scenarios such as continuous data publishing or sequential data releasing [13]. The research in [24], [25] exploits information theory to quantify privacy via the maximum entropy principle [26]. The privacy quantification herein is based on the work in [24], [25]. Many anonymization techniques like generalization [12] have been proposed to preserve privacy, but these methods alone fail to solve the problem of preserving privacy for multiple data sets. Our approach integrates anonymization with encryption to achieve privacy preserving of multiple data sets. Moreover, we consider the economical aspect of privacy preserving, adhering to the pay-as-you-go feature of cloud computing.

3 MOTIVATING EXAMPLE AND PROBLEM ANALYSIS

Section 3.1 presents a motivating example to drive our research. The problem of reducing the privacy-preserving cost incurred by the storage of intermediate data sets is analyzed in Section 3.2.

3.1 Motivating Example

A motivating scenario is illustrated in Fig. 1, where an online health service provider, e.g., Microsoft HealthVault [27], has moved data storage into cloud for economical benefits. Original data sets are encrypted for confidentiality. Data users like governments or research centres access or process part of the original data sets after anonymization. Intermediate data sets generated during data access or processing are retained for data reuse and cost saving. Two independently generated intermediate data sets (Fig. 1a and Fig. 1b) are anonymized to satisfy 2-diversity, i.e., at least two individuals own the same quasi-identifier and each quasi-identifier corresponds to at least two sensitive values [23]. Knowing that a lady aged 25 living in zip code area 21400 (corresponding quasi-identifier ⟨214*, female, young⟩) is in both data sets, an adversary can infer that this individual suffers from HIV with high confidence if Fig. 1a and Fig. 1b are collected together. Hiding Fig. 1a or Fig. 1b by encryption is a promising way to prevent such a privacy breach. Assume Fig. 1a and Fig. 1b are of the same size, the frequency of accessing Fig. 1a is 10, and that of Fig. 1b is 100. We hide Fig. 1a to preserve privacy because this incurs less expense than hiding Fig. 1b.
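As a minimal arithmetic sketch of this choice (the unit price and the sizes are illustrative assumptions; only the access frequencies 10 and 100 come from the example):

```python
PR = 1.0                  # assumed combined en/decryption price per GB per execution
size_a = size_b = 1.0     # assumed equal sizes (GB), as in the example
freq_a, freq_b = 10, 100  # access frequencies of Fig. 1a and Fig. 1b

cost_hide_a = size_a * PR * freq_a  # 10.0
cost_hide_b = size_b * PR * freq_b  # 100.0
# Hiding either data set breaks the inference channel; Fig. 1a is 10x cheaper.
assert cost_hide_a < cost_hide_b
```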

In most real-world applications, a large number of intermediate data sets are involved. Hence, it is challenging to identify which data sets should be encrypted so that the privacy leakage requirements are satisfied while the hiding expenses are kept as low as possible.

3.2 Problem Analysis

3.2.1 Sensitive Intermediate Data Set Management

Similar to [6], data provenance is employed to manage intermediate data sets in our research. Provenance is commonly defined as the origin, source, or history of derivation of some object or data, which can be reckoned as information on how the data were generated [28]. Reproducibility of data provenance can help to regenerate a data set from its nearest existing predecessor data sets rather than from scratch [6], [20]. We assume herein that the information recorded in data provenance is leveraged to build up the generation relationships of data sets [6].

We define several basic notations below. Let d_o be a privacy-sensitive original data set. We use D = {d_1, d_2, ..., d_n} to denote a group of intermediate data sets generated from d_o, where n is the number of intermediate data sets. Note that the notion of intermediate data herein refers to both intermediate and resultant data [6]. A Directed Acyclic Graph (DAG) is exploited to capture the topological structure of the generation relationships among these data sets.

Definition 1 (Sensitive intermediate data set graph). A DAG representing the generation relationships of intermediate data sets D from d_o is defined as a Sensitive Intermediate data set Graph, denoted as SIG. Formally, SIG = ⟨V, E⟩, where V = {d_o} ∪ D and E is a set of directed edges. A directed edge ⟨d_p, d_c⟩ in E means that part or all of d_c is generated from d_p, where d_p, d_c ∈ {d_o} ∪ D.

In particular, an SIG becomes a tree structure if each data set in D is generated from only one parent data set. We then have the following definition for this situation.

Definition 2 (Sensitive intermediate data set tree (SIT)). An SIG is defined as a Sensitive Intermediate data set Tree if it is a tree structure. The root of the tree is d_o.

An SIG or SIT not only represents the generation relationships of an original data set and its intermediate data sets, but also captures the propagation of privacy-sensitive information among such data sets. Generally, the privacy-sensitive information in d_o is scattered into its offspring data sets. Hence, an SIG or SIT can be employed to analyze the privacy disclosure of multiple data sets. In this paper, we first present our approach on an SIT, and then extend it to an SIG with minor modifications in Section 5.

An intermediate data set is assumed to have been anonymized to satisfy certain privacy requirements. However, putting multiple data sets together may still invoke a high risk of revealing privacy-sensitive information, resulting in violation of the privacy requirements. The privacy leakage of a data set d is denoted as PL_s(d), meaning the privacy-sensitive information obtained by an adversary after d is observed. The value of PL_s(d) can be deduced directly from d, as described in Section 4.1. Similarly, the privacy leakage of multiple data sets in D is denoted as PL_m(D), meaning the privacy-sensitive information obtained by an adversary after all data sets in D are observed. It is challenging to acquire the exact value of PL_m(D) due to the inference channels among multiple data sets [24].

3.2.2 Privacy-Preserving Cost Problem

The privacy-preserving cost of intermediate data sets stems from frequent en/decryption with charged cloud services. Cloud service vendors have set up various pricing models to support the pay-as-you-go model, e.g., the Amazon Web Services pricing model [29]. Practically, en/decryption needs computation power, data storage, and other cloud services. To avoid pricing details and focus on the discussion of our core ideas, we combine the prices of the various services required by en/decryption into one. This combined price is denoted as PR. PR indicates the overhead of en/decryption per GB of data per execution.


Fig. 1. A scenario showing privacy threats due to intermediate data sets.


Similar to [6], an attribute vector is employed to frame several important properties of the data set d_i. The vector is denoted as ⟨S_i, Flag_i, f_i, PL_i⟩. The term S_i represents the size of d_i. The term Flag_i, a binary label, signifies whether d_i is hidden. The term f_i indicates the frequency of accessing or processing d_i. If d_i is labeled as hidden, it will be en/decrypted every time it is accessed or processed. Thus, the larger f_i is, the more cost will be incurred if d_i is hidden. Usually, f_i is estimated from the data provenance. The term PL_i is the privacy leakage through d_i, and is computed by PL_s(d_i).
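A minimal sketch of this attribute vector as a data structure, with generation edges for an SIT (all names and values are illustrative, not from the paper):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataSetNode:
    """An intermediate data set d_i with attribute vector <S_i, Flag_i, f_i, PL_i>."""
    name: str
    size_gb: float                 # S_i: size of d_i
    encrypted: bool                # Flag_i: whether d_i is hidden (encrypted)
    freq: float                    # f_i: access/processing frequency, from provenance
    leakage: float                 # PL_i = PL_s(d_i): privacy leakage through d_i
    children: List["DataSetNode"] = field(default_factory=list)  # generation edges

# A tiny SIT: root d_o with two directly generated intermediate data sets.
d1 = DataSetNode("d1", size_gb=5.0, encrypted=False, freq=10, leakage=2.1)
d2 = DataSetNode("d2", size_gb=2.0, encrypted=True, freq=100, leakage=3.4)
d_o = DataSetNode("d_o", size_gb=20.0, encrypted=True, freq=1, leakage=9.0,
                  children=[d1, d2])
```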

Data sets in D can be divided into two sets. One is for encrypted data sets, denoted as D_enc. The other is for unencrypted data sets, denoted as D_une. Then, the equations D_enc ∪ D_une = D and D_enc ∩ D_une = ∅ hold. We define the pair ⟨D_enc, D_une⟩ as a global privacy-preserving solution.

The privacy-preserving cost incurred by a solution ⟨D_enc, D_une⟩ is denoted as C_pp(⟨D_enc, D_une⟩). With the notations framed above, the cost C_pp(⟨D_enc, D_une⟩) in a given period [T_0, T] can be deduced by the following formula:

$$C_{pp}(\langle D_{enc}, D_{une} \rangle) = \int_{t=T_0}^{T} \left( \sum_{d_i \in D_{enc}} S_i \cdot PR \cdot f_i \cdot t \right) dt. \qquad (1)$$

The privacy-preserving cost rate for C_pp(⟨D_enc, D_une⟩), denoted as CR_pp, is defined as follows:

$$CR_{pp} \triangleq \sum_{d_i \in D_{enc}} S_i \cdot PR \cdot f_i. \qquad (2)$$

In the real world, S_i and f_i may vary over time, but we assume herein that they are static so that we can concisely present the core ideas of our approach. The dynamic case will be explored in our future work. With this assumption, CR_pp determines C_pp(⟨D_enc, D_une⟩) in a given period. Thus, we blur the distinction between them subsequently.
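Under the static assumption, CR_pp of (2) is a one-line fold over the encrypted set; a sketch reusing the DataSetNode structure above:

```python
def cost_rate(nodes, PR):
    """CR_pp = sum of S_i * PR * f_i over encrypted data sets, per Eq. (2)."""
    return sum(n.size_gb * PR * n.freq for n in nodes if n.encrypted)

# cost_rate([d_o, d1, d2], PR=0.001) charges d_o and d2, skips the unencrypted d1.
```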

The problem of how to make the privacy-preserving cost as low as possible given an SIT can be modeled as an optimization problem on CR_pp:

$$\text{Minimize} \quad CR_{pp} = \sum_{d_i \in D_{enc}} S_i \cdot PR \cdot f_i, \quad D_{enc} \subseteq D. \qquad (3)$$

Meanwhile, the privacy leakage caused by the unencrypted data sets in D_une must stay under a given threshold.

Definition 3 (Privacy leakage constraint). Let ε be the privacy leakage threshold allowed by a data holder; then a privacy requirement can be represented as PL_m(D_une) ≤ ε, D_une ⊆ D. This privacy requirement is defined as a Privacy Leakage Constraint, denoted as PLC.

With a PLC, the problem defined in (3) becomes a constrained optimization problem, so we can save privacy-preserving cost by minimizing it. As it is challenging to obtain the exact value of PL_m(D_une), as formulated in Section 4.2, our approach addresses the problem by substituting the PLC with one of its sufficient conditions.

4 PRIVACY REPRESENTATION AND PRIVACY LEAKAGE UPPER BOUND CONSTRAINT

It is fundamental to measure the privacy leakage of anonymized data sets to quantitatively describe how much privacy is disclosed. The privacy quantification of a single data set is stated in Section 4.1. We point out the challenge of privacy quantification of multiple data sets in Section 4.2 and then derive a privacy leakage upper bound constraint accordingly in Section 4.3.

4.1 Single Intermediate Data Set Privacy Representation

Privacy-sensitive information is essentially regarded as the association between sensitive data and individuals [13]. We denote an original sensitive data set as d_o, an anonymized intermediate data set as d*, the set of sensitive data as SD, and the set of quasi-identifiers as QI. Quasi-identifiers, which represent the groups of anonymized data, can lead to privacy breach if they are so specific that only a small group of people is linked to them [13]. Let S denote a random variable ranging over SD, and Q be a random variable ranging over QI. Suppose s ∈ SD and q ∈ QI. The joint probability of an association ⟨s, q⟩, denoted as p(S = s, Q = q) (abbr. p(s, q)), is the information that adversaries intend to recover [13]. When an adversary has observed d* and a quasi-identifier q, the conditional probability p(S = s | Q = q), representing intrinsic privacy-sensitive information of an individual, can be inferred. If p(S = s | Q = q) is deduced to be a high value or even 1.0, the privacy of the individual with q will be awfully breached.

We employ the approach proposed in [25] to compute the probability distribution P*(S, Q) of ⟨s, q⟩ in d_o after observing d*. More details can be found in Appendix A.1, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.238 (all appendices are included in the supplemental file). Then, the privacy quantification of the data set d* can be fulfilled. Formally, PL_s(d*) is defined as

$$PL_s(d^*) \triangleq H(S, Q) - H^*(S, Q). \qquad (4)$$

H(S, Q) is the entropy of the random variable ⟨S, Q⟩ before d* is observed, while H*(S, Q) is that after observation. P(Q, S) is estimated as a uniform distribution according to the maximum entropy principle [25]. Based on this, H(S, Q) can be computed by H(S, Q) = log(|QI| · |SD|). H*(S, Q) is calculated from the distribution P*(S, Q) by

$$H^*(S, Q) = -\sum_{q \in QI,\, s \in SD} p(s, q) \cdot \log(p(s, q)). \qquad (5)$$
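A sketch of this quantification (base-2 logarithm assumed, which matches E_max = 18.20 for |QI| · |SD| = 300,000 in Section 6.2.2; the example distribution is invented):

```python
import math

def entropy(joint):
    """Shannon entropy (base 2) of a joint distribution given as {(s, q): p}."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

def single_leakage(n_qi, n_sd, joint_after):
    """PL_s(d*) = H(S,Q) - H*(S,Q), per Eq. (4).

    H(S,Q) = log(|QI| * |SD|) under the maximum-entropy (uniform) prior;
    joint_after is the distribution P*(S,Q) estimated after observing d*.
    """
    return math.log2(n_qi * n_sd) - entropy(joint_after)

# Example: 2 sensitive values x 2 quasi-identifiers; observation skews the joint.
p_star = {("HIV", "q1"): 0.4, ("flu", "q1"): 0.1,
          ("HIV", "q2"): 0.1, ("flu", "q2"): 0.4}
print(single_leakage(2, 2, p_star))  # about 0.28 bits of disclosed privacy
```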

4.2 Joint Privacy Leakage of Multiple Intermediate Data Sets

The value of the joint privacy leakage incurred by the multiple data sets in D = {d_1, d_2, ..., d_n}, n ∈ ℕ, is defined by

$$PL_m(D) \triangleq H(S, Q) - H_D(S, Q). \qquad (6)$$

H(S, Q) and H_D(S, Q) are the entropy of ⟨S, Q⟩ before and after the data sets in D are observed, respectively, with H(S, Q) = log(|QI| · |SD|). H_D(S, Q) can be calculated once P(S, Q) is estimated after the data sets in D are observed. Given the relationship between ε and PL_m(D_une) in a PLC, ε ranges in the interval [max_{1≤i≤n}{PL_s(d_i)}, log(|QI| · |SD|)].

Zhu et al. [24] proposed an approach to indirectly estimate P(S, Q) for multiple data sets via the maximum entropy principle. But this approach becomes inefficient when many data sets are involved, because the number of variables and constraints can increase sharply as the number of data sets grows. According to the experiments in [24], it takes more than 200 minutes to quantify the privacy of two data sets with 6,000 records. Further, since D_une is uncertain before a solution is found, we would need to try different D_une, where D_une ∈ 2^D. So, the inefficiency becomes unacceptable in applications where a large number of intermediate data sets are involved.

Fortunately, the PLC can be achieved without exactly acquiring PL_m(D_une), because our goal is to control the privacy disclosure caused by multiple data sets. A promising approach is to substitute the PLC with its sufficient conditions. Specifically, our approach replaces the exact value of PL_m(D_une) with one of its upper bounds, which can be calculated efficiently.

4.3 Upper Bound Constraint of Joint Privacy Leakage

We attempt to derive an upper bound of PL_m(D_une) that can be easily computed. Intuitively, if an upper bound B(PL_m(D_une)) is found, the stronger privacy leakage constraint B(PL_m(D_une)) ≤ ε is a sufficient condition of the PLC. Accordingly, PL_m(D_une) never exceeds the threshold ε if B(PL_m(D_une)) ≤ ε holds.

Let d_u and d_v be two data sets whose privacy leakages are PL_s(d_u) and PL_s(d_v), respectively. The joint privacy leakage caused by them together is PL_m({d_u, d_v}). As information gain is never negative, PL_m({d_u, d_v}) is not less than either PL_s(d_u) or PL_s(d_v). Further, PL_m({d_u, d_v}) will not exceed the sum of PL_s(d_u) and PL_s(d_v), i.e., PL_m({d_u, d_v}) ≤ PL_s(d_u) + PL_s(d_v), where equality holds if and only if the information provided by d_u and d_v does not overlap. This property of joint privacy leakage extends to the multiple unencrypted data sets in D_une:

$$PL_m(D_{une}) \le \sum_{d_i \in D_{une}} PL_s(d_i). \qquad (7)$$

Hence, the sum of the privacy leakage of the unencrypted data sets can be deemed an upper bound of PL_m(D_une). By replacing the PLC with such an upper bound constraint, we propose an approach to address the optimization problem in (3). See Section 5 for details.

5 PRIVACY LEAKAGE UPPER BOUND CONSTRAINT-BASED APPROACH FOR PRIVACY PRESERVING

We propose an upper bound constraint-based approach to select the necessary subset of intermediate data sets that needs to be encrypted for minimizing the privacy-preserving cost. In Section 5.1, we specify relevant basic notations and elaborate two useful properties of an SIT. The privacy leakage upper bound constraint is decomposed layer by layer in Section 5.2. The constrained optimization problem with the PLC is then transformed into a recursive form in Section 5.3. In Section 5.4, a heuristic algorithm is designed for our approach. We extend our approach to an SIG in Section 5.5.

5.1 Basic Notations and Properties on an SIT

Let d_r ∈ D denote an intermediate data set in an SIT. Its directly generated data sets constitute a set CD(d_r) ≜ {d_r1, ..., d_ri}, where d_r1, ..., d_ri ∈ D. For any d ∈ CD(d_r), PL_s(d) ≤ PL_s(d_r), because all information in d comes from d_r. Further, let PD(d_r) ≜ {d_r1, ..., d_rj} be all posterity data sets generated from d_r, where d_r1, ..., d_rj ∈ D. Similar to CD(d_r), PL_s(d) ≤ PL_s(d_r) for any d ∈ PD(d_r). A subtree of an SIT with d_r ∈ D as its root is denoted as SDT(d_r).

Lemma 1. If d_r ∈ D_une, then for any d ∈ PD(d_r), d ∈ D_une.

The proof of Lemma 1 can be found in Appendix B.1, available in the online supplemental material. Lemma 1 defines a useful property named the Root Privacy Coverage (RPC) property. This property means that it is unnecessary to check the data sets in PD(d_r) if d_r is unencrypted. Thus, SDT(d_r) is somewhat equivalent to d_r from the privacy-preserving perspective. Such a subtree is named an unencrypted subtree, denoted as UST. USTs have the RPC property. Due to the RPC property, we only need to consider part of the intermediate data sets, rather than all of them, when identifying which data sets should be encrypted.
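The RPC property gives a simple pruning rule; a sketch over the DataSetNode structure above (illustrative):

```python
def mark_subtree_unencrypted(d_r):
    """RPC property: once d_r is unencrypted, every data set in PD(d_r) can stay
    unencrypted too, so the whole subtree SDT(d_r) is skipped during selection."""
    d_r.encrypted = False
    for child in d_r.children:
        mark_subtree_unencrypted(child)
```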

Lemma 2. Given a privacy-preserving solution ⟨D_enc, D_une⟩ with the minimal cost, the graph denoted as EDG = ⟨D_enc ∪ {d_o}, E′⟩ must be a tree structure, where E′ ⊆ E.

The proof of Lemma 2 can be found in Appendix B.2, available in the online supplemental material. This lemma defines a property named the Encrypted Data Set Tree (EDT) property. The property means that all encrypted data sets in an SIT constitute a tree structure if the privacy-preserving cost is minimal. Consequently, the data sets in D_enc are always connected, regardless of which part of D constitutes D_enc. Moreover, any subtree with its root connected to the EDG is a UST because of the RPC property. According to the EDT property, the descendant intermediate data sets of a data set have the possibility of being encrypted only if this data set is encrypted. Hence, it is feasible to construct a privacy-preserving solution with the minimal cost layer by layer, from the root of an SIT to its leaves.

Let H be the height of an SIT and L_i denote the layer where data sets with depth i are located, 1 ≤ i ≤ H. Let D_i denote the set of data sets in L_i. The set of encrypted data sets in L_i is denoted as ED_i. Further, the set of all children data sets of data sets in ED_i is denoted as CDE_{i+1}, i.e., CDE_{i+1} = ∪_{d∈ED_i} CD(d). Then, the set of unencrypted data sets in CDE_i is denoted as UD_i. Note that UD_i is not required to be equal to the set of unencrypted data sets in L_i.

Suppose that the encrypted data sets before layer L_i have been identified. Then, the encrypted data sets in L_i are selected from CDE_{i-1}, rather than from all data sets in D_i. A possible choice of encryption is defined as a local encryption solution in L_i, denoted as φ_i = ⟨ED_i, UD_i⟩. The set of all potential local solutions in layer L_i is denoted as Φ_i = {φ_ij}, where j is the number of solutions. Further, a group of local encryption solutions, constructed layer by layer, constitutes a global encryption solution, denoted as ψ_k = ⟨φ_{1j_1}, ..., φ_{Hj_H}⟩, where φ_{ij_i} ∈ Φ_i, 1 ≤ i ≤ H, 1 ≤ k ≤ ∏_{i=1}^{H} |Φ_i|. ψ_k determines a global privacy-preserving solution ⟨D_enc, D_une⟩, with D_enc = ∪_{i=1}^{H} ED_i.


5.2 Recursive Privacy Leakage Constraint Decomposition

To satisfy the PLC, we decompose the PLC recursively into different layers in an SIT. Then, the problem stated in (3) can be addressed by tackling a series of small-scale optimization problems. Let the privacy leakage threshold required in layer L_i be ε_i, 1 ≤ i ≤ H. The privacy leakage incurred by UD_i in the solution φ_i can never be larger than ε_i, i.e., PL_m(UD_i) ≤ ε_i. The threshold ε_i can be regarded as the privacy leakage threshold of the remainder of an SIT after layer L_{i-1}. In terms of the basic idea of our approach, the privacy leakage constraint PL_m(UD_i) ≤ ε_i is substituted by one of its sufficient conditions. According to (7), the PLC can be substituted by a set of privacy leakage constraints, named PLC1:

$$\sum_{d \in UD_i} PL_s(d) \le \varepsilon_i, \quad 1 \le i \le H. \qquad (8)$$

The threshold ε_i, 1 ≤ i ≤ H, is calculated by

$$\varepsilon_i = \varepsilon_{i-1} - \sum_{d \in UD_{i-1}} PL_s(d), \qquad \varepsilon_1 = \varepsilon. \qquad (9)$$
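A sketch of this layer-wise decomposition, taking the PL_s values of the data sets left unencrypted in each layer (names and numbers illustrative):

```python
def layer_thresholds(eps, ud_leakages_per_layer):
    """Per-layer thresholds via Eq. (9): eps_1 = eps and
    eps_i = eps_{i-1} - sum of PL_s(d) over UD_{i-1}."""
    thresholds = [eps]
    for leaks in ud_leakages_per_layer[:-1]:
        thresholds.append(thresholds[-1] - sum(leaks))
    return thresholds

def satisfies_plc1(eps, ud_leakages_per_layer):
    """Check the sufficient condition (8) in every layer."""
    return all(sum(leaks) <= eps_i
               for eps_i, leaks in zip(layer_thresholds(eps, ud_leakages_per_layer),
                                       ud_leakages_per_layer))

# Layers L1..L3 with unencrypted leakages 3.5, 3.0, and 1.0 against eps = 10.
print(satisfies_plc1(10.0, [[2.0, 1.5], [3.0], [1.0]]))  # True
```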

A local encryption solution in layer L_i is feasible if it satisfies the PLC1 in (8). The set of feasible solutions in L_i is denoted as Φ_i^f ≜ {φ_ij | φ_ij ∈ Φ_i}, where j is the number of feasible solutions. Similarly, a feasible global encryption solution can be denoted as ψ_k^f ≜ ⟨φ_{1j_1}, ..., φ_{Hj_H}⟩, where φ_{ij_i} ∈ Φ_i^f, 1 ≤ i ≤ H, 1 ≤ k ≤ ∏_{i=1}^{H} |Φ_i|.

Given a feasible global solution ψ_k^f for an SIT, we compress the SIT into a "compressed" tree layer by layer from L_1 to L_H, denoted as CT(ψ_k^f), where H is the height of the EDG in the SIT. The construction of CT(ψ_k^f) is achieved in three steps. First, the data sets in ED_i are "compressed" into one encrypted node. According to the EDT property, these compressed nodes together with the original data set appear as a string of length H. Second, all offspring data sets of the data sets in UD_i are omitted. This does not affect privacy preserving, in terms of the RPC property. Third, the data sets in UD_i are compressed into one node. Note that the "compress" action here is imaginary, from a logical perspective, for demonstration purposes. The construction process of CT(ψ_k^f) is illustrated in Fig. 2, where the first i layers have been built up according to the above steps.

Theorem 1. Under the PLC1 constraint in (8), the construction process of a compressed tree on an SIT corresponds to a feasible global encryption solution ψ_k^f, i.e., the privacy leakage PL_m(D) with regard to ψ_k^f is ensured to stay beneath the given threshold ε.

The proof of Theorem 1 can be found in Appendix B.3, available in the online supplemental material. Complying with the PLC1, we can obtain a feasible global encryption solution during the construction of a compressed tree.

5.3 Minimum Privacy-Preserving Cost

Usually, more than one feasible global encryption solution exists under the PLC1 constraints, because there are many alternative local solutions in each layer. Further, each intermediate data set has its own size and frequency of usage, leading to different overall costs for different solutions. Therefore, it is desirable to find a feasible solution with the minimum privacy-preserving cost under the privacy leakage constraints. Note that the minimum solution mentioned herein is somewhat a pseudominimum, because an upper bound of the joint privacy leakage is just an approximation of its exact value. But a solution can be exactly minimal in the sense of the PLC1 constraints. We derive the recursive minimal cost formula as follows.

The minimum cost for privacy preserving of the data sets after L_{i-1} under the privacy leakage threshold ε_i is represented as CM_i(ε_i), 1 ≤ i ≤ H. Given a feasible local encryption solution φ_i = ⟨ED_i, UD_i⟩ in L_i, the cost incurred by the encrypted data sets in L_i is denoted as C_i(φ_i):

$$C_i(\varphi_i) \triangleq \sum_{d_k \in ED_i} S_k \cdot PR \cdot f_k, \quad 1 \le i \le H. \qquad (10)$$

Then, CM_i(ε_i) is calculated by the recursive formula:

$$CM_i(\varepsilon_i) = \min_{\varphi_{ij} \in \Phi_i^f} \left\{ \sum_{d_k \in ED_{ij}} (S_k \cdot PR \cdot f_k) + CM_{i+1}\!\left(\varepsilon_i - \sum_{d_k \in UD_{ij}} PL_s(d_k)\right) \right\}, \qquad CM_{H+1}(\varepsilon_{H+1}) = 0. \qquad (11)$$

As a result, CM_1(ε) is the minimum privacy-preserving cost required in the optimization problem specified by (3) and (8). The privacy-preserving solution ⟨D_enc, D_une⟩ can be determined during the process of acquiring CM_1(ε).

According to the specification of CM_i(ε_i), an optimal algorithm can be designed to identify the optimal privacy-preserving solution. The algorithm details can be found in Appendix C.1, available in the online supplemental material. However, this optimal algorithm suffers from poor efficiency due to its huge state-search space if a large number of intermediate data sets are involved. The time complexity is analyzed in Appendix C.2, available in the online supplemental material. As such, it is necessary to turn to heuristic algorithms for scenarios with a large number of intermediate data sets, in order to obtain a near-optimal solution with higher efficiency than the optimal algorithm.

5.4 Privacy-Preserving Cost Reducing Heuristic Algorithm

In this section, we design a heuristic algorithm to reduce privacy-preserving cost. In the state-search space for an SIT, a state node SN_i in layer L_i refers to a vector of partial local solutions, i.e., SN_i corresponds to ⟨φ_{1j_1}, ..., φ_{ij_i}⟩, where φ_{kj_k} ∈ Φ_k, 1 ≤ k ≤ i. Note that the state-search tree generated according to an SIT is different from the SIT itself, but their heights are the same. Appropriate heuristic information is vital to guide the search path to the goal state. The goal state in our algorithm is a near-optimal solution found in a limited search space.


Fig. 2. Construction of a compressed tree.


Heuristic values are obtained via heuristic functions. A heuristic function, denoted as f(SN_i), is defined to compute the heuristic value of SN_i. Generally, f(SN_i) consists of two parts of heuristic information, i.e., f(SN_i) = g(SN_i) + h(SN_i), where g(SN_i) is the information gained from the start state to the current state node SN_i and h(SN_i) is the information estimated from the current state node to the goal state.

Intuitively, the heuristic function is expected to guide the algorithm to select data sets with small cost but high privacy leakage to encrypt. Based on this, g(SN_i) is defined as g(SN_i) ≜ C_cur/(ε − ε_{i+1}), where C_cur is the privacy-preserving cost that has been incurred so far, ε is the initial privacy leakage threshold, and ε_{i+1} is the privacy leakage threshold for the layers after L_i. Specifically, C_cur is calculated by C_cur = Σ_{d_j ∈ ∪_{k=1}^{i} ED_k} (S_j · PR · f_j). The smaller C_cur is, the smaller the total privacy-preserving cost will be. A larger (ε − ε_{i+1}) means more data sets before L_{i+1} remain unencrypted in terms of the RPC property, i.e., more privacy-preserving expense can be saved.

The value of h(SN_i) is defined as h(SN_i) = (ε_{i+1} − C_des · BF_AVG)/PL_AVG. Similar to the meaning of (ε − ε_{i+1}) in g(SN_i), a smaller ε_{i+1} in h(SN_i) implies that more data sets before L_{i+1} are kept unencrypted. If a data set with smaller depth in an SIT is encrypted, more data sets can possibly remain unencrypted than with a larger depth, because the former possibly has more descendant data sets. For a state node SN_i, the data sets in its corresponding ED_k are the roots of a variety of subtrees of the SIT. These trees constitute a forest, denoted as F_{SN_i}. In h(SN_i), C_des represents the total cost of the data sets in F_{SN_i}, and is computed via C_des = Σ_{d_l∈ED_k} Σ_{d_j∈PD(d_l)} (S_j · PR · f_j). Potentially, the less C_des is, the fewer data sets in the following layers will be encrypted. BF_AVG is the average branch factor of the forest F_{SN_i}, and can be computed by BF_AVG = N_E/N_I, where N_E is the number of edges and N_I is the number of internal data sets in F_{SN_i}. A smaller BF_AVG means the search space for subsequent layers will be smaller, so that we can find a near-optimal solution faster. The value of PL_AVG indicates the average privacy leakage of data sets in F_{SN_i}, calculated by PL_AVG = Σ_{d_l∈ED_k} Σ_{d_j∈PD(d_l)} PL_s(d_j)/N_I. Heuristically, the algorithm prefers to encrypt data sets that incur less cost but disclose more privacy-sensitive information. Thus, a higher PL_AVG means more data sets in F_{SN_i} should be encrypted to preserve privacy from a global perspective. Based on the above analysis, the heuristic value of the search node SN_i can be computed by the formula:

$$f(SN_i) = C_{cur}/(\varepsilon - \varepsilon_{i+1}) + (\varepsilon_{i+1} - C_{des} \cdot BF_{AVG})/PL_{AVG}. \qquad (12)$$
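A direct transcription of (12); the zero-denominator guards are an assumption of this sketch, not from the paper:

```python
def heuristic_value(c_cur, eps, eps_next, c_des, bf_avg, pl_avg):
    """f(SN_i) = g(SN_i) + h(SN_i), per Eq. (12):
    g = C_cur / (eps - eps_{i+1}), h = (eps_{i+1} - C_des * BF_AVG) / PL_AVG."""
    g = c_cur / (eps - eps_next) if eps != eps_next else 0.0
    h = (eps_next - c_des * bf_avg) / pl_avg if pl_avg else 0.0
    return g + h
```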

Based on this heuristic, we design a heuristic privacy-preserving cost reduction algorithm, denoted as H_PPCR. The basic idea is that the algorithm iteratively selects the state node with the highest heuristic value and then expands its child state nodes until it reaches a goal state node. The privacy-preserving solution and the corresponding cost are derived from the goal state.

Algorithm 1 specifies the details of our heuristic algorithm. A priority queue is exploited to keep state nodes. Only qualified state nodes are added to the priority queue, i.e., those whose corresponding partial global solutions are feasible. To prevent the size of the priority queue from increasing dramatically, the algorithm retains only the state nodes with the top K highest heuristic values. When determining whether to add child search nodes in layer L_{i+1} into the priority queue, the algorithm first generates a local encryption solution from CDE_i. The algorithm would suffer from poor efficiency if it had to check all combinations of data sets in CDE_i. To circumvent this, the algorithm sorts the data sets in CDE_i in ascending order of the value C_k/PL_s(d_k), where d_k ∈ CDE_i and C_k = S_k · PR · f_k. If |CDE_i| is larger than a threshold M, only the first M data sets in the sorted CDE_i are examined while the remaining ones are set to be encrypted. Intuitively, data sets with higher privacy-preserving cost and lower privacy leakage are expected to remain unencrypted. The value C_k/PL_s(d_k) helps guide the algorithm to find these data sets with higher probability. Hence, the algorithm is guided to approach the goal state in the state space as closely as possible. Above all, in light of the heuristic information, the proposed algorithm can achieve a near-optimal solution practically. SORT and SELECT are two simple external functions, as their names signify.

Algorithm 1. Privacy-Preserving Cost Reducing Heuristic
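A minimal sketch of the H_PPCR search loop as described above (best-first search with top-K pruning of the priority queue and the M-threshold sort of the candidates). Using only the g(SN_i) term of Eq. (12) as the queue priority and fixing the per-layer candidate lists, rather than deriving them from CDE_i, are simplifications of this sketch:

```python
import heapq

def h_ppcr(layers, eps, K=10_000, M=20, PR=1.0):
    """Near-optimal privacy-preserving cost via heuristic best-first search (sketch)."""
    best = float("inf")
    queue = [(0.0, 0, 0.0, eps)]    # (priority, next layer index, cost so far, eps_i)
    while queue:
        _, i, cost, eps_i = heapq.heappop(queue)
        if i == len(layers):        # goal state: every layer has a local solution
            best = min(best, cost)
            continue
        # Sort candidates ascendingly by C_k / PL_s(d_k); beyond M, encrypt outright.
        cands = sorted(layers[i], key=lambda n: (n.size_gb * PR * n.freq) / n.leakage)
        head, tail = cands[:M], cands[M:]
        forced = sum(n.size_gb * PR * n.freq for n in tail)
        for mask in range(1 << len(head)):       # enumerate local solutions over head
            une = [n for b, n in enumerate(head) if mask >> b & 1]
            leak = sum(n.leakage for n in une)
            if leak > eps_i:                     # infeasible under PLC1, Eq. (8)
                continue
            enc_cost = forced + sum(n.size_gb * PR * n.freq
                                    for n in head if n not in une)
            spent = eps - (eps_i - leak)         # eps - eps_{i+1}
            g = (cost + enc_cost) / spent if spent > 0 else float("inf")
            heapq.heappush(queue, (g, i + 1, cost + enc_cost, eps_i - leak))
        queue = heapq.nsmallest(K, queue)        # retain only the top-K state nodes
        heapq.heapify(queue)
    return best                                  # inf if no feasible solution exists
```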

5.5 Extension to SIG

Although SITs suit many applications, SIGs are also common, i.e., an intermediate data set can originate from more than one parent data set. An example of an SIG is illustrated in Fig. 3a. The data set d_5 is generated from d_2, d_3, and d_4, where d_4 is from d_2. Thus, it is possible that PL_s(d_5) is larger than PL_s(d_2) or PL_s(d_3), invalidating Lemma 1 and the RPC property. As a result, our approach cannot be directly applied to an SIG. However, we can adapt it to an SIG with minor modifications.

Let d_m denote a merging data set that inherits data from more than one predecessor, e.g., d_5 in Fig. 3. As only one root data set is assumed to exist in an SIG, all paths from the root data set d_o to d_m must converge at one point besides d_m itself. Let d_s denote this source data set, e.g., d_s = d_o in Fig. 3. The inequality PL_s(d_m) ≤ PL_s(d_s) holds because all privacy information of d_m comes from d_s. Let PD_i(d_s) be the set of the offspring data sets of d_s in layer L_i; then PD_i(d_s) ⊆ PD(d_s). Data sets in PD_i(d_s) are split into ED_i and UD_i when determining which data sets are encrypted.

We discuss three cases where the graph structure can affect the applicability of our approach to an SIG. The first one is PD_i(d_s) ⊆ UD_i, i.e., all data sets in PD_i(d_s) remain unencrypted, e.g., {d_1, d_2} ⊆ UD_1 as shown in Fig. 3b. All ancestor data sets of d_m after layer L_i then remain unencrypted according to the RPC property. So, the data set d_m poses little influence on applying our algorithm to an SIG, because d_m is not considered in the following steps. The second one is PD_i(d_s) ⊆ ED_i, i.e., all data sets in PD_i(d_s) are encrypted. If d_m is a child of a data set in PD_i(d_s), d_m is added to CDE_{i+1} for the next round. Assume the parent data set is d_p. Then, we delete the edges pointing to d_m from parents other than d_p, e.g., ⟨d_2, d_5⟩ is retained while ⟨d_3, d_5⟩ and ⟨d_4, d_5⟩ are deleted in Fig. 3c. Logically, d_m can be deemed a "compressed" candidate data set of several imaginary data sets in CDE_{i+1}, which is similar to the construction of a compressed tree. The last one is that D_x ⊆ UD_i and D_y ⊆ ED_i, where D_x ∩ D_y = ∅ and D_x ∪ D_y = PD_i(d_s), i.e., part of the data sets in PD_i(d_s) are encrypted while the remainder stay unencrypted. According to the RPC property, it is safe to expose the part of the privacy information in d_m that comes from D_x. The edges pointing to d_m from the data sets in D_x and their offspring are deleted, e.g., ⟨d_3, d_5⟩ in Fig. 3d. Further, the value of PL_s(d_m) is substituted by the maximum value among its direct parents that are data sets in D_y or offspring of D_y, if PL_s(d_m) is larger than that maximum.

To make the approach for an SIT applicable to an SIG as well, three minor modifications are required. The first is to identify all merging data sets. The second is to adjust the SIG according to the third case discussed above if UD_φ ≠ ∅ after we obtain a local solution φ = ⟨ED_φ, UD_φ⟩. The third is to label the data sets that have been processed. In this way, it is unnecessary to explicitly delete the edges discussed in the second case.

6 EVALUATION

6.1 Overall Comparison

Encrypting all data sets for privacy preserving is widely adopted in existing research [8], [9], [10]. This category of approach is denoted as ALL_ENC. The privacy-preserving cost of ALL_ENC is denoted as C_ALL. For data sets in the set D of an SIT, C_ALL is computed by

$$C_{ALL} = \sum_{d_k \in D} S_k \cdot PR \cdot f_k. \qquad (13)$$

Compared with ALL_ENC, our approach selects only the necessary intermediate data sets to encrypt while keeping the others unencrypted. Intuitively, the cost of our approach is lower than that of the ALL_ENC approach. To further evaluate how significantly our approach can prevail against the ALL_ENC approach in saving expense, we conduct experiments on real-world data sets and then extend the experiments to data sets of large amounts. To facilitate the comparison, the privacy-preserving cost of H_PPCR is denoted as C_HEU. Given a global privacy-preserving solution with the encrypted set ED, C_HEU is computed by

$$C_{HEU} = \sum_{d_k \in ED} S_k \cdot PR \cdot f_k. \qquad (14)$$

The difference between C_ALL and C_HEU is denoted as C_SAV ≜ C_ALL − C_HEU. We run both approaches on SITs with different ε values. According to the analysis in Section 4.2, the threshold ε ranges in the interval [E_min, E_max], where E_max = log(|QI| · |SD|) and E_min = max_{1≤i≤n}{PL_s(d_i)}. For convenience, we define the privacy leakage degree, denoted as ε_d ≜ (ε − E_min)/(E_max − E_min), to indicate the privacy leakage incurred by unencrypted intermediate data sets.
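A sketch of these evaluation quantities (base-2 logarithm, as in Section 6.2.2; names illustrative):

```python
import math

def eps_interval(leakages, n_qi, n_sd):
    """[E_min, E_max]: E_min = max PL_s(d_i), E_max = log(|QI| * |SD|)."""
    return max(leakages), math.log2(n_qi * n_sd)

def leakage_degree(eps, e_min, e_max):
    """eps_d = (eps - E_min) / (E_max - E_min), the normalized leakage threshold."""
    return (eps - e_min) / (e_max - e_min)

# With |QI| * |SD| = 300,000, E_max is about 18.20, matching Section 6.2.2.
```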

6.2 Experiment Evaluation

6.2.1 Experiment Environment

U-Cloud is a cloud computing environment at the University of Technology Sydney (UTS). The system overview of U-Cloud is depicted in Fig. 4. The computing facilities of this system are located in several labs at UTS. On top of the hardware and the Linux operating system, we install KVM virtualization software [30], which virtualizes the infrastructure and provides unified computing and storage resources. To create virtualized data centres, we install the OpenStack open-source cloud environment [31] for global management, resource scheduling, and interaction with users. Further, Hadoop [32] is installed on the cloud built via OpenStack to facilitate massive data processing. Our experiments are conducted in this cloud environment.

Fig. 3. Example for extending our approach to an SIG.

Fig. 4. System overview of U-Cloud.

6.2.2 Experiment Process

To demonstrate the effectiveness and scalability of our approach, we run H_PPCR and ALL_ENC on real-world data sets and extensive data sets. We first run them on real-world data sets with ε_d varying in [0.05, 0.9], then on extensive intermediate data sets with the number ranging in [50, 1,000] under certain ε_d values. For each group of experiments, we repeat H_PPCR and ALL_ENC 50 times. The mean cost in each experiment group is regarded as the representative. The margin of error with 95 percent confidence is also measured and shown in the results.

We first conduct our experiments on the Adult data set [33], which is a commonly used data set in the privacy research community. Intermediate data sets are generated from the original data set and anonymized by the algorithm proposed in [12]. The privacy leakage of each data set is quantified as formulated in Section 4.1. The details about these data sets are described in Appendix D.1, available in the online supplemental material.

Further, we extend the experiments to intermediate data sets of large amounts. SITs are generated via a random spanning tree algorithm [34]. The values of data size and usage frequency are randomly produced in the interval [10, 100] according to the uniform distribution. Without influencing the trend analysis, we set the price PR to 0.001. The privacy leakage values are generated randomly in an interval [0, E_up], where E_up is a value no larger than the maximum entropy E_max of the original data sets. In our experiments, E_max is set to 18.20, meaning |QI| · |SD| = 300,000. The thresholds M and K in H_PPCR are set to 20 and 10,000, respectively.

6.2.3 Experiment Results and Analysis

The experimental results on real-world data sets are depicted in Fig. 5, from which we can see that C_HEU is much lower than C_ALL across different privacy leakage degrees. Even the smallest cost saving of C_HEU over C_ALL, at the left side of Fig. 5, is more than 40 percent. Therefore, Fig. 5 demonstrates that our approach H_PPCR can save privacy-preserving cost significantly over the ALL_ENC approach. Further, we can see that the difference C_SAV between C_ALL and C_HEU increases as the privacy leakage degree increases. This is because looser privacy leakage constraints imply that more data sets can remain unencrypted.

While Fig. 5 shows the difference between C_HEU and C_ALL under different privacy leakage degrees, Fig. 6 illustrates how the difference changes with different numbers of extensive data sets for a fixed ε_d. In most real-world cases, data owners prefer the data privacy leakage to be quite low. Thus, we select four low privacy leakage degrees, 0.01, 0.05, 0.1, and 0.2, to conduct our experiments. The selection of these specific values is rather arbitrary and does not affect our analysis, because what we want to see is the trend of C_HEU against C_ALL. Similarly, we set the number of ε_d values to four: as Fig. 5 has shown the trend over different ε_d values, we do not have to try all values, while four values still make the experiments informative. Interested readers can try 3, 5, or another number of values; the conclusions will be similar.

From Fig. 6, we can see that both C_ALL and C_HEU go up as the number of intermediate data sets gets larger. That is, the larger the number of intermediate data sets, the more privacy-preserving cost will be incurred. C_ALL increases notably because it is proportional to the number of intermediate data sets. Given ε_d, C_HEU also increases with the number of data sets because more data sets are required to be encrypted. Moreover, Fig. 6 also shows that C_HEU drops when the privacy leakage degree becomes larger while C_ALL stays invariant. This tendency complies with that shown in Fig. 5.

Most importantly, we can see from Fig. 6 that the difference C_SAV between C_ALL and C_HEU becomes bigger and bigger as the number of intermediate data sets increases. That is, more expense can be saved when the number of data sets becomes larger. This trend results from the dramatic rise in C_ALL and the relatively slower increase in C_HEU as the number of data sets grows. In the context of Big Data, the number and sizes of data sets and their intermediate data sets in cloud are quite large. Thus, this trend means our approach can reduce the privacy-preserving cost significantly in real-world scenarios.


Fig. 5. Experiment results about real-world data sets: change in privacy-preserving cost in relation to privacy leakage degree.

Fig. 6. Experiment results on a large number of data sets: change in privacy-preserving cost in relation to the number of data sets and the privacy leakage degree.


In conclusion, both experimental results demonstrate that the privacy-preserving cost of intermediate data sets can be saved significantly through our approach over existing ones where all data sets are encrypted.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed an approach that identifies which part of intermediate data sets needs to be encrypted while the rest does not, in order to save the privacy-preserving cost. A tree structure has been modeled from the generation relationships of intermediate data sets to analyze privacy propagation among data sets. We have modeled the problem of saving privacy-preserving cost as a constrained optimization problem, which is addressed by decomposing the privacy leakage constraints. A practical heuristic algorithm has been designed accordingly. Evaluation results on real-world data sets and larger extensive data sets have demonstrated that the cost of preserving privacy in cloud can be reduced significantly with our approach over existing ones where all data sets are encrypted.
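As an illustration of the encrypt-or-not decision at the heart of this approach, the sketch below greedily leaves unencrypted the data sets with the best cost saving per unit of leakage, subject to a leakage budget. It deliberately simplifies: H_PPCR itself operates on the SIT with decomposed constraints and the thresholds M and K, and joint privacy leakage is not additive in general, so treating it as a simple sum here is an assumption for illustration. The code reuses the helpers defined in the experiment sketches above.

```python
def choose_unencrypted(data_sets, leakage_budget):
    """Greedy illustration (not H_PPCR): leave unencrypted the data sets
    that save the most cost per unit of leakage, while the total leakage
    of unencrypted sets stays within the budget."""
    candidates = sorted(
        data_sets,
        key=lambda ds: encryption_cost(ds) / max(ds["leakage"], 1e-9),
        reverse=True,  # best cost saving per unit leakage first
    )
    unencrypted, leaked = set(), 0.0
    for ds in candidates:
        if leaked + ds["leakage"] <= leakage_budget:
            unencrypted.add(ds["id"])
            leaked += ds["leakage"]
    # Everything not selected must be encrypted.
    return {ds["id"] for ds in data_sets} - unencrypted

# Budget expressed, as an assumption, as a fraction of the maximum entropy.
encrypted_ids = choose_unencrypted(sit, leakage_budget=0.1 * E_MAX)
print(c_sav(sit, encrypted_ids))  # cost saved versus encrypting everything
```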

In accordance with various data- and computation-intensive applications in cloud, intermediate data set management is becoming an important research area. Privacy preserving for intermediate data sets is one of the important yet challenging research issues and needs intensive investigation. Building on the contributions of this paper, we plan to further investigate privacy-aware efficient scheduling of intermediate data sets in cloud, taking privacy preserving as a metric together with other metrics such as storage and computation. Optimized balanced scheduling strategies are expected to be developed toward overall highly efficient privacy-aware data set scheduling.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A View of Cloud Computing," Comm. ACM, vol. 53, no. 4, pp. 50-58, 2010.
[2] R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the Fifth Utility," Future Generation Computer Systems, vol. 25, no. 6, pp. 599-616, 2009.
[3] L. Wang, J. Zhan, W. Shi, and Y. Liang, "In Cloud, Can Scientific Communities Benefit from the Economies of Scale?," IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 2, pp. 296-303, Feb. 2012.
[4] H. Takabi, J.B.D. Joshi, and G. Ahn, "Security and Privacy Challenges in Cloud Computing Environments," IEEE Security & Privacy, vol. 8, no. 6, pp. 24-31, Nov./Dec. 2010.
[5] D. Zissis and D. Lekkas, "Addressing Cloud Computing Security Issues," Future Generation Computer Systems, vol. 28, no. 3, pp. 583-592, 2011.
[6] D. Yuan, Y. Yang, X. Liu, and J. Chen, "On-Demand Minimum Cost Benchmarking for Intermediate Data Set Storage in Scientific Cloud Workflow Systems," J. Parallel and Distributed Computing, vol. 71, no. 2, pp. 316-332, 2011.
[7] S.Y. Ko, I. Hoque, B. Cho, and I. Gupta, "Making Cloud Intermediate Data Fault-Tolerant," Proc. First ACM Symp. Cloud Computing (SoCC '10), pp. 181-192, 2010.
[8] H. Lin and W. Tzeng, "A Secure Erasure Code-Based Cloud Storage System with Secure Data Forwarding," IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 6, pp. 995-1003, June 2012.
[9] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, "Privacy-Preserving Multi-Keyword Ranked Search over Encrypted Cloud Data," Proc. IEEE INFOCOM '11, pp. 829-837, 2011.
[10] M. Li, S. Yu, N. Cao, and W. Lou, "Authorized Private Keyword Search over Encrypted Data in Cloud Computing," Proc. 31st Int'l Conf. Distributed Computing Systems (ICDCS '11), pp. 383-392, 2011.
[11] C. Gentry, "Fully Homomorphic Encryption Using Ideal Lattices," Proc. 41st Ann. ACM Symp. Theory of Computing (STOC '09), pp. 169-178, 2009.
[12] B.C.M. Fung, K. Wang, and P.S. Yu, "Anonymizing Classification Data for Privacy Preservation," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 5, pp. 711-725, May 2007.
[13] B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys, vol. 42, no. 4, pp. 1-53, 2010.
[14] X. Zhang, C. Liu, J. Chen, and W. Dou, "An Upper-Bound Control Approach for Cost-Effective Privacy Protection of Intermediate Data Set Storage in Cloud," Proc. Ninth IEEE Int'l Conf. Dependable, Autonomic and Secure Computing (DASC '11), pp. 518-525, 2011.
[15] I. Roy, S.T.V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, "Airavat: Security and Privacy for MapReduce," Proc. Seventh USENIX Conf. Networked Systems Design and Implementation (NSDI '10), p. 20, 2010.
[16] K.P.N. Puttaswamy, C. Kruegel, and B.Y. Zhao, "Silverline: Toward Data Confidentiality in Storage-Intensive Cloud Applications," Proc. Second ACM Symp. Cloud Computing (SoCC '11), 2011.
[17] K. Zhang, X. Zhou, Y. Chen, X. Wang, and Y. Ruan, "Sedic: Privacy-Aware Data Intensive Computing on Hybrid Clouds," Proc. 18th ACM Conf. Computer and Comm. Security (CCS '11), pp. 515-526, 2011.
[18] V. Ciriani, S.D.C.D. Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati, "Combining Fragmentation and Encryption to Protect Privacy in Data Storage," ACM Trans. Information and System Security, vol. 13, no. 3, pp. 1-33, 2010.
[19] S.B. Davidson, S. Khanna, T. Milo, D. Panigrahi, and S. Roy, "Provenance Views for Module Privacy," Proc. 30th ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS '11), pp. 175-186, 2011.
[20] S.B. Davidson, S. Khanna, S. Roy, J. Stoyanovich, V. Tannen, and Y. Chen, "On Provenance and Privacy," Proc. 14th Int'l Conf. Database Theory, pp. 3-10, 2011.
[21] S.B. Davidson, S. Khanna, V. Tannen, S. Roy, Y. Chen, T. Milo, and J. Stoyanovich, "Enabling Privacy in Provenance-Aware Workflow Systems," Proc. Fifth Biennial Conf. Innovative Data Systems Research (CIDR '11), pp. 215-218, 2011.
[22] P. Samarati, "Protecting Respondents' Identities in Microdata Release," IEEE Trans. Knowledge and Data Eng., vol. 13, no. 6, pp. 1010-1027, Nov. 2001.
[23] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, "L-Diversity: Privacy Beyond K-Anonymity," ACM Trans. Knowledge Discovery from Data, vol. 1, no. 1, article 3, 2007.
[24] G. Wang, Z. Zhu, W. Du, and Z. Teng, "Inference Analysis in Privacy-Preserving Data Re-Publishing," Proc. Eighth IEEE Int'l Conf. Data Mining (ICDM '08), pp. 1079-1084, 2008.
[25] W. Du, Z. Teng, and Z. Zhu, "Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '08), pp. 459-472, 2008.
[26] E.T. Jaynes, "Information Theory and Statistical Mechanics," Physical Rev., vol. 106, no. 4, pp. 620-630, 1957.
[27] Microsoft HealthVault, http://www.microsoft.com/health/ww/products/Pages/healthvault.aspx, July 2012.
[28] K.-K. Muniswamy-Reddy, P. Macko, and M. Seltzer, "Provenance for the Cloud," Proc. Eighth USENIX Conf. File and Storage Technologies (FAST '10), pp. 197-210, 2010.
[29] Amazon Web Services, "AWS Service Pricing Overview," http://aws.amazon.com/pricing/, July 2012.
[30] KVM, http://www.linux-kvm.org/page/Main_Page, July 2012.
[31] OpenStack, http://openstack.org/, July 2012.
[32] Hadoop, http://hadoop.apache.org, June 2012.
[33] UCI Machine Learning Repository, ftp://ftp.ics.uci.edu/pub/machine-learning-databases/, July 2012.
[34] J.A. Kelner and A. Madry, "Faster Generation of Random Spanning Trees," Proc. 50th Ann. IEEE Symp. Foundations of Computer Science (FOCS '09), pp. 13-21, 2009.


Xuyun Zhang is currently working toward the PhD degree at the University of Technology Sydney. His research interests include cloud computing, privacy and security, MapReduce, and OpenStack.

Chang Liu is currently working toward the PhD degree at the University of Technology Sydney, Australia. His research interests include cloud computing, resource management, cryptography, and data security.

Surya Nepal is a principal research scientist at CSIRO ICT Centre, Australia. His main research interests include security, privacy, and trust in emerging science areas such as service-oriented architectures (SOA), web services, cloud computing, and social computing. He has several journal and conference papers in these areas to his credit.

Suraj Pandey holds a research scientist position at IBM Research Australia. His research spans many areas of high-performance computing, including data-intensive applications, workflow scheduling, resource provisioning, and accelerating scientific applications, as evident from his publications in ERA A* and ERA A ranked journals and conferences.

Jinjun Chen is an associate professor and the director of the Lab of Cloud Computing and Distributed Systems at the University of Technology Sydney, Australia. His research interests include cloud computing, social computing, green computing, service computing, e-science/e-research, and workflow management. His research results have been published in more than 100 papers in high-quality journals and conferences, including TOSEM, TSE, and ICSE. He is a member of the IEEE.

