
Int. J. Inf. Secur. (2016) 15:539–547 DOI 10.1007/s10207-015-0302-0

REGULAR CONTRIBUTION

Efficient wildcard search over encrypted data

Changhui Hu1 · Lidong Han2

Published online: 4 September 2015 © Springer-Verlag Berlin Heidelberg 2015

Abstract Searchable encryption is an important technique that allows data owners to store their encrypted data in the cloud while retaining the ability to search for a keyword over the encrypted data. In practice, searchable encryption schemes that support wildcard search are important and widely used. In this paper, we propose a new wildcard search technique in which one wildcard represents any number of characters. Based on a Bloom filter with a novel specified-character-position technique, we construct a new searchable symmetric scheme that supports wildcard search over encrypted data. This scheme is more efficient than prior schemes, and it can be strengthened to be secure against an adaptive attacker (CKA-2 security). Moreover, the scheme can be made dynamic to support file addition and deletion. Our wildcard search technique is of independent interest.

Keywords Searchable symmetric encryption · Cloud computing · Wildcard search

1 School of Mathematical Sciences, Dalian University of Technology, 116024 Dalian, People's Republic of China

2 School of Information Science and Engineering, Hangzhou Normal University, 310036 Hangzhou, People's Republic of China

1 Introduction

In cloud computing, users outsource their data to a third party for cost savings [1]. However, the outsourced data have to be encrypted, since they can no longer be controlled locally by the data owners and may be intercepted by malicious attackers. Besides storage, the outsourced data need to support additional applications in practice, such as various kinds of keyword search. Traditional encryption schemes protect the data well but do not make them searchable. One practical solution is searchable encryption, which supports keyword search without decryption.

In general, there exist two main techniques to achieve searchable encrypted data: searchable symmetric encryption and asymmetric searchable encryption. Compared to asymmetric searchable encryption, searchable symmetric encryption (SSE) is more efficient and easier to implement in practice, because it uses a symmetric-key algorithm to encrypt data. In order to support keyword search, SSE builds an index of keywords and their corresponding files. Many searchable symmetric encryption schemes have been proposed to date.

In terms of applications, there are various types of searchable symmetric encryption that achieve different functions, for example similarity keyword search, wildcard search and fuzzy search. In practice, dynamic operations on the data file collection (e.g., addition and removal) should also be supported. Among these applications, searchable symmetric encryption with wildcard search is very important. In general, a wildcard in a keyword represents any character. The paper [2] gave a method using Bloom filters that check characters in specified positions to deal with the wildcard. However, that method can only handle the case where one wildcard represents exactly one character. There are solutions in which one wildcard represents any number of characters, such as [3]. However, [3] simply enumerates all possibilities of the wildcard and does not work well. In this paper, we present an efficient wildcard searchable symmetric encryption scheme. Our new scheme overcomes the above limitation, and it works well when one wildcard represents any number of characters. Moreover, the scheme can be made dynamic and is proven to be secure against an adaptive attacker (CKA-2 security).

1.1 Our contribution

In this paper, we propose a new wildcard search technique to deal with the case where one wildcard represents any number of characters. In our method, we build one set for each keyword, in which we record each character of the keyword (except the wildcard) with its normal order, its reverse order and its existence. If the querying keyword matches a keyword in the index, then its character set is a subset of the set of that keyword in the index. This method is very efficient but has a small error probability, so we also propose a complete solution to overcome this limitation. With this wildcard search technique, we can build one Bloom filter for each keyword. Based on the Bloom filter method used by Goh [4] and Suga et al. [2], we construct a new searchable symmetric encryption scheme. It is shown that our new scheme is more efficient than prior schemes. Moreover, it can be combined with a dynamic structure to achieve security against adaptive chosen-keyword attacks (CKA-2 security). This wildcard search technique is of independent interest.

1.2 Related work

The earliest solution for searching encrypted data was invented by Goldreich and Ostrovsky [5] in 1996. This scheme uses oblivious RAMs to hide all information from a malicious server. However, it requires a logarithmic number of interactions between the user and the server and is inefficient in practice. In 2000, Song et al. [6] presented a practical symmetric scheme for searching encrypted data. They use a special two-layered encrypted structure. However, their scheme is not secure against statistical attacks, and the searching time is linear in the length of the data file collection.

Some of the above weaknesses were addressed by Goh [4], who proposed to use a Bloom filter for each document in the file collection. The complexity of a query is proportional to the number of documents in the collection. Such a method, called a forward index, costs more storage. Chang and Mitzenmacher [7] constructed two index schemes with pre-built dictionaries, similar to [4]. Their schemes are independent of the encryption methods, and they also use one index per document. Watanabe et al. [8] in 2009 proposed a searchable symmetric encryption scheme for relational databases based on Bloom filters.

The inverted index scheme was introduced by Curtmola et al. [9] in 2006. They proposed two new adversarial models, a non-adaptive model (CKA-1) and an adaptive model (CKA-2), and they also designed two schemes that are secure in these two models. Based on the inverted index technique in [9], Kamara et al. [10] constructed an SSE scheme which is explicitly dynamic. Kamara et al.'s scheme is efficient: the searching time is linear in the number of documents containing the keyword, and the scheme is CKA-2 secure. Recently, some improved dynamic searchable encryption schemes [11–14] were also proposed. Meanwhile, various functional searches need to be performed for data applications, such as wildcard search [2,3,15], similarity keyword search [16] and multi-keyword fuzzy search [17].

In 2010, Sedghi et al. [15] proposed a searchable encryption scheme that supports wildcard search, which is based on hidden vector encryption (in the public key setting). In 2011, Bosch et al. [3] proposed a searchable symmetric encryption scheme to support wildcard search (one wildcard stands for any number of characters). They use pseudo-random functions and Bloom filters to construct a wildcard searchable encryption scheme. However, they simply enumerate all the possibilities of the wildcard and add these possibilities to one Bloom filter. So, the size of their Bloom filter is very large, and it is difficult to deal with the case where one keyword includes several wildcards. In 2012, Suga et al. [2] introduced the technique of creating one Bloom filter for each keyword with specified character positions, which supports fuzzy keyword search and wildcard search. Their scheme is efficient, and the number of their search indexes is proportional to the number of distinct keywords in the collection. However, in their schemes, one wildcard can only represent one character. We focus on the problem of how to deal with the case where one wildcard represents any number of characters, and we construct an efficient wildcard searchable encryption scheme.

2 Preliminaries

2.1 Notations

– m: the number of distinct keywords in the data collection.
– n: the number of data files.
– t: the size of the Bloom filter.
– I: the index for keyword searching.
– T_w: the trapdoor of the keyword w.
– r: the number of different pseudo-random functions for the Bloom filter.
– V_0 △ V_1: the symmetric set difference of two sets V_0 and V_1, defined as V_0 △ V_1 = (V_0 − V_1) ∪ (V_1 − V_0).

2.2 System model

In our system model, there are three different entities: the data owner, the server (e.g., a cloud server) and the data users, as shown in Fig. 1. The data owner has a collection of n plaintext files, denoted by F, which he wants to outsource to the cloud server in the encrypted form D. To make D searchable, the data owner first builds a secure index I from F and then outsources both the index I and D to the server. If a user wants to search the document collection for a given keyword w, he must be authorized and must acquire a corresponding trapdoor T_w from the owner. After receiving T_w, the server is responsible for searching the index I and returns the corresponding set of encrypted documents.

Fig. 1 Model of searchable symmetric encryptions

In general, a searchable symmetric encryption scheme consists of the following four polynomial-time algorithms.

– KeyGen(s): a probabilistic key generation algorithm that generates the private key K under a security parameter s. This function is called by the data owner to generate the key.

– BuildIndex(K, F): an algorithm that takes the private key K and the document collection F as input and outputs a search index I. This function is called by the data owner to build the index.

– Trapdoor(K, w): a polynomial-time algorithm that takes a keyword w and a private key K as input and outputs the trapdoor T_w. If the user wants to search for the keyword w, he sends w to the data owner, and the data owner calls this function to generate the trapdoor.

– SearchIndex(T_w, I): a deterministic algorithm that searches for the documents in D that contain the keyword w. The inputs are a trapdoor T_w for a keyword w and an index I, and the output is a set of identifiers of documents (in encrypted form). This function is called by the server to perform the search.

We assume that the server is "honest-but-curious" in our model. Specifically, the server acts in an "honest" fashion and follows the designed protocol. However, it is also "curious": it tries to infer and analyze the data received during the protocol so as to learn additional private and valuable information. The data owner and the data users are trusted and honest.

2.3 Security model

In the following, we describe the security model of our searchable symmetric encryption schemes. To describe the security model more formally, we introduce the concept of a history, which is defined as an instantiation of an interaction between the user and the server. Such an interaction is determined by a document collection and the keywords that the client wants to search; moreover, we wish to hide the keywords from the adversary (the server).

Before introducing the security definitions for SSE, we assume that the adversary in the definition generates the history all at once. That is to say, the adversary is not allowed to see the index of the document collection or the trapdoor of any querying keyword before it has finished generating the history. We call such an adversary non-adaptive.

The security model we use in this context is based on IND-CKA, introduced by Goh [4]. It is defined by a game between a challenger C and an adversary A as follows.

Setup. A challenger C creates a set S of q pairs of keywords, which is sent to an adversary A. A chooses some keyword subsets S∗ ⊂ S and gives them to C. After receiving S∗, C executes KeyGen(s) to generate a secret key K, and C computes the indexes for all subsets of S∗ with BuildIndex(K, F). Finally, C gives all indexes and the related subsets S∗ to A.

Query. A can query C for a trapdoor T_x for a keyword x. For each index I, A can execute SearchIndex(T_x, I) to check whether I matches x.

Challenge. A chooses two non-empty subsets V_0, V_1 ∈ S∗ such that |V_0 − V_1| ≠ 0, |V_1 − V_0| ≠ 0 and |V_0| = |V_1|. A must not have queried C for the trapdoor of any keyword in V_0 △ V_1, where V_0 △ V_1 = (V_0 − V_1) ∪ (V_1 − V_0). A gives V_0 and V_1 to C. C chooses b from {0, 1} uniformly at random, computes BuildIndex(K, V_b) to get an index corresponding to V_b and returns it to A.

Response. A outputs a bit b′, representing its guess for b. The advantage of A is defined as Adv_A = |Pr[b = b′] − 1/2|.

An adversary A is said to (t, ε, q)-break our scheme if A has advantage Adv_A of at least ε after running in time at most t and making at most q trapdoor queries to C. If there is no adversary who can (t, ε, q)-break I, we say that the symmetric searchable encryption I is (t, ε, q)-IND-CKA secure.

Definition 1 (IND-CKA or CKA-1) An SSE scheme is secure in the sense of indistinguishability against non-adaptive chosen-keyword attacks if, for any two histories with equal length and trace, no (probabilistic polynomial-time) adversary can distinguish them with probability non-negligibly better than 1/2.

Note that we can strengthen the security model to attain a stronger one, i.e., security against an adaptive attacker (IND2-CKA or CKA-2). It is similar to the above game except for the choice of the subsets V_0, V_1 in the Challenge phase. In the new game, the adversary is able to choose two non-empty subsets V_0, V_1 ∈ S∗ such that |V_0 △ V_1| ≠ 0. The rest of the game is the same as in IND-CKA. The reader is referred to the paper [9] for the formal definition of IND2-CKA or CKA-2.

2.4 Bloom filter

A Bloom filter is a probabilistic data structure used to answer membership queries. Bloom filters allow false positives, but not false negatives. That is, if the Bloom filter returns true, the element may or may not be in the set (a false positive is possible); if the Bloom filter returns false, the element is definitely not in the set (a false negative is impossible).

In general, a Bloom filter is defined as a t-bit array which is initialized to all 0s. We first choose r independent hash functions h_1, h_2, ..., h_r, which map each element to a random number in the range [1, t]. To add the elements of a set S = {x_1, x_2, ..., x_δ}, for each x ∈ S the positions h_i(x) for 1 ≤ i ≤ r in the Bloom filter are set to 1. To determine whether the Bloom filter contains an element y, one checks whether all positions h_i(y) for 1 ≤ i ≤ r are set to 1. If not, then clearly y is not a member of S. If all positions h_i(y) are set to 1, we assume that y is in S, although we are wrong with some probability, since the positions of an element may have been set by one or more other elements. We can choose appropriate parameters to reduce the false-positive probability to a desired error rate.
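The following is a minimal sketch of such a Bloom filter, not the authors' implementation. It assumes that the r hash functions are instantiated with HMAC-SHA256 under r secret keys and reduced modulo t; the class name `BloomFilter` and its method names are hypothetical, and positions are 0-based rather than in [1, t].

```python
import hashlib
import hmac


class BloomFilter:
    """A t-bit Bloom filter with r keyed hash functions (assumed: HMAC-SHA256)."""

    def __init__(self, t, keys):
        self.t = t                # number of bits
        self.keys = list(keys)    # r secret keys, one per hash function
        self.bits = [0] * t       # initialized to all 0s

    def _positions(self, element):
        # One position per key: h_i(element) reduced to the range [0, t).
        for key in self.keys:
            digest = hmac.new(key, element.encode("utf-8"), hashlib.sha256).digest()
            yield int.from_bytes(digest, "big") % self.t

    def add(self, element):
        for p in self._positions(element):
            self.bits[p] = 1

    def might_contain(self, element):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(element))
```

For instance, after `bf = BloomFilter(1024, [b"k1", b"k2", b"k3"])` and `bf.add("d||1")`, the call `bf.might_contain("d||1")` returns True, while an element that was never added usually, but not always, yields False.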

3 The basic searchable symmetric encryption scheme

We first present the basic searchable symmetric scheme using Bloom filters with specified character positions. This basic scheme is from [2]: each keyword in the index is associated with the subset of the file collection that contains this keyword, and one Bloom filter is built from all characters of one keyword. The scheme consists of the following four algorithms.

– KeyGen(s): Given the security parameter s, the data owner outputs an s-bit string sk uniformly at random and generates r random numbers k_1, ..., k_r. The private key of the searchable symmetric encryption is K = {sk, k_1, k_2, ..., k_r}.

– BuildIndex(K, F): From the document collection F = {f_1, f_2, ..., f_n} of n text files, the data owner extracts m distinct keywords {w_1, w_2, ..., w_m} and then generates a file collection F_{w_i} for each w_i such that the files in F_{w_i} contain w_i. For each keyword w_i, it constructs one Bloom filter BF_i as follows. Assume the keyword w_i contains l characters: w_i[1], w_i[2], ..., w_i[l]. It adds the r hash values of j||w_i[j] for j ∈ [1, l] to the Bloom filter BF_i with the r private keys, i.e., it sets the positions of the Bloom filter determined by the r hash values to 1. Let u be an upper bound on the length l. Pick (u − l) · r random values and set the respective bits of the Bloom filter to 1. This random padding prevents the number of 1s in the Bloom filter from revealing the length of the keyword. Each Bloom filter BF_i of the keyword w_i is followed by the encrypted index entry I′ = Enc_sk(F_{w_i}, w_i, rd), where sk is the symmetric private key and rd is a random value generated by a random function.

– Trapdoor(K, w): The data owner receives the search keyword request w from the data user and generates the trapdoor T_w with its secret key, including the r private keys k_1, k_2, ..., k_r, as follows. For a query keyword w that contains wildcards, the characters of w are written as w = {w[1], w[2], ..., w[l]}. The data owner first initializes a Bloom filter BF and then adds the hash values of i||w[i], for i ∈ [1, l], for each character except the wildcards, i.e., the positions of the Bloom filter determined by the r hash values are set to 1. The trapdoor for the keyword w is the Bloom filter T_w = BF.

– SearchIndex(T_w, I): Given the trapdoor T_w = BF for w and the index I, it searches the index I to find all matching Bloom filters, i.e., those that are 1 in every position where the trapdoor is 1. It then outputs the matching entries of the index. Finally, the entries I′ need to be decrypted to get the file identifiers. To improve efficiency, a binary search tree can be used to speed up the search. For more details, please refer to [2].
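As an illustration of the position-based matching in this basic scheme, here is a minimal sketch under several assumptions: the keyed hashes are instantiated with HMAC-SHA256, the Bloom filter is stored as an integer bitmask, the symbol `*` stands for exactly one character, and the random padding and the encrypted entry I′ are omitted. All function names are hypothetical.

```python
import hashlib
import hmac


def bloom_bits(elements, keys, t):
    """Return a t-bit Bloom filter, stored as an int bitmask, holding `elements`."""
    bits = 0
    for element in elements:
        for key in keys:
            digest = hmac.new(key, element.encode("utf-8"), hashlib.sha256).digest()
            bits |= 1 << (int.from_bytes(digest, "big") % t)
    return bits


def index_elements(word):
    # Elements j||w[j] for every character of an index keyword.
    return [f"{j}||{c}" for j, c in enumerate(word, start=1)]


def trapdoor_elements(query, wildcard="*"):
    # Same encoding, but wildcard positions are skipped.
    return [f"{j}||{c}" for j, c in enumerate(query, start=1) if c != wildcard]


def matches(trapdoor_bits, index_bits):
    # The index filter must be 1 wherever the trapdoor filter is 1.
    return (trapdoor_bits & index_bits) == trapdoor_bits
```

With keys `[b"k1", b"k2"]` and t = 256, the trapdoor built from `d*g` matches the index filter built from `dog`, because every trapdoor element is also an index element.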

Note that this scheme is a typical inverted index searchable symmetric encryption scheme; if we add a dynamic method such as [10] to this scheme, it will support dynamic operations. In this method, the file identifiers in F_{w_i} need to be associated with encrypted pointers. For more details, please refer to [10].

4 Searchable symmetric encryption scheme for wildcard search

In this section, we present a new wildcard search technique in which one wildcard * represents any number of characters, and we propose a searchable symmetric encryption scheme based on this technique. Previous schemes such as [2] only deal with the case where one wildcard * represents one character.

4.1 Wildcard search technique

First, we consider the simple case where there is only one wildcard in the querying keyword. The technique of building the Bloom filters for the trapdoor and the index is as follows. In the trapdoor generation, the keyword w with length l including one wildcard is written as w[1], w[2], ..., w[i], *, w[i + 2], ..., w[l]. From it we get the collection S_trapdoor = {w[1]||1, w[2]||2, ..., w[i]||i, w[i + 2]||(−(l − i − 1)), ..., w[l − 1]||(−2), w[l]||(−1)}. This means that the characters before the wildcard * are recorded in the positive order, and the characters after the wildcard * are recorded in the reverse order. In the search index building, the keyword w′ with length l′ is written as w′[1], w′[2], ..., w′[l′], and we get the collection S_index = {w′[1]||1, w′[2]||2, ..., w′[l′]||l′, w′[1]||(−l′), ..., w′[l′ − 1]||(−2), w′[l′]||(−1)}. This means that each character of the keyword is recorded both in the positive order and in the reverse order. We can see that if the keyword of the trapdoor matches the keyword of the index, then the collection S_trapdoor is a subset of the collection S_index. Finally, the hash values of each collection are added to the corresponding Bloom filter using k_1, ..., k_r. For example, if the keyword of the trapdoor is d*g, then in the trapdoor generation we add the hash values of d||1 and g||(−1) to its Bloom filter. In the search index building, take the keyword dog as an example: the hash values of the collection d||1, o||2, g||3, g||(−1), o||(−2), d||(−3) are added to the Bloom filter. Similarly, if the keyword of the index is drag, then the collection is d||1, r||2, a||3, g||4, g||(−1), a||(−2), r||(−3), d||(−4). Figure 2 shows the process of the above example. Both keywords dog and drag match the query d*g, so all the hash values of the trapdoor match positions that are set in the Bloom filters of dog and drag.
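The one-wildcard construction can be sketched at the level of plain sets, before any hashing into Bloom filters. This is only an illustration: elements c||p are represented as `(character, position)` tuples, and the function names are hypothetical.

```python
def index_set(word):
    """Record every character with its positive and its reverse position."""
    n = len(word)
    forward = {(c, i) for i, c in enumerate(word, start=1)}           # c||i
    backward = {(c, i - n - 1) for i, c in enumerate(word, start=1)}  # c||(-(n-i+1))
    return forward | backward


def trapdoor_set_one_wildcard(query, wildcard="*"):
    """Positive order before the wildcard, reverse order after it."""
    star = query.index(wildcard)
    prefix, suffix = query[:star], query[star + 1:]
    forward = {(c, i) for i, c in enumerate(prefix, start=1)}
    backward = {(c, i - len(suffix) - 1) for i, c in enumerate(suffix, start=1)}
    return forward | backward


td = trapdoor_set_one_wildcard("d*g")   # {("d", 1), ("g", -1)}
print(td <= index_set("dog"))           # subset test against "dog"
print(td <= index_set("drag"))          # subset test against "drag"
```

A set-level subset test corresponds to the bitwise test on Bloom filters, except that the Bloom filter version can additionally report false positives caused by hash collisions.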

Fig. 2 One wildcard in the trapdoor keyword

Next, we consider the case where there are more wildcards in the querying keyword. The above technique is not suitable for this case, because the characters between the first wildcard and the last wildcard do not have a well-defined position. For the example c*o*d, the above technique cannot assign an appropriate order to the character o. To solve this problem, we introduce an additional measure. In the trapdoor generation, we need to express the existence of the characters c, o and d, which we do by adding the hash values of c||0, o||0 and d||0 to the Bloom filter. Correspondingly, the search index also needs to reflect the existence of each character of the keyword in its Bloom filter. If the keyword is cloud, the hash values of the 15 elements c||1, l||2, o||3, u||4, d||5, d||(−1), u||(−2), o||(−3), l||(−4), c||(−5), c||0, l||0, o||0, u||0, d||0 are added to its Bloom filter. Figure 3 describes this example. In general, our technique is as follows. For the trapdoor keyword w = w[1], w[2], ..., w[i], *, w[i + 2], ..., w[j], *, w[j + 2], ..., w[l] with length l, we get the collection S_trapdoor = {w[1]||1, w[2]||2, ..., w[i]||i, w[j + 2]||(−(l − j − 1)), ..., w[l − 1]||(−2), w[l]||(−1), w[1]||0, w[2]||0, ..., w[i]||0, w[i + 2]||0, ..., w[j]||0, w[j + 2]||0, ..., w[l]||0}. This means that the characters before the first wildcard * are recorded in the positive order, the characters after the last wildcard * are recorded in the reverse order, and all characters except the wildcards are also recorded with position 0. For the search index, the keyword w′ with length l′ is written as w′[1], w′[2], ..., w′[l′], and we get the collection S_index = {w′[1]||1, w′[2]||2, ..., w′[l′]||l′, w′[1]||(−l′), ..., w′[l′ − 1]||(−2), w′[l′]||(−1), w′[1]||0, ..., w′[l′]||0}, i.e., each character of the keyword is recorded with its positive order, its reverse order and its existence. It is easy to see that if the keyword of the trapdoor matches the keyword of the index, the collection S_trapdoor is a subset of the collection S_index. Then, we add the hash values of each collection to the corresponding Bloom filter.

In conclusion, our wildcard search method is as follows:

– For the keyword in the trapdoor, the characters before the first wildcard * are recorded in the normal order, the characters after the last wildcard * are recorded in the reverse order, and for all other characters we record only their existence. Finally, the hash values of the elements of this collection are added to the Bloom filter.

– For the keyword in the search index, each character of the keyword is recorded with its normal order, its reverse order and its existence. Then, the hash values of these elements are added to the Bloom filter of the keyword.

Fig. 3 Any wildcard in the trapdoor keyword

Note that there is a small error probability in this case. For example, the Bloom filters of "c*uo*d" and "c*od*" also match the keyword cloud, although cloud is not a keyword these queries are meant to find. Such a false positive occurs with very small probability, since few querying keywords have two wildcards and contain the same characters between them. Even if it happens, the users can eliminate the wrong result from its semantic context. Later, we will show a method to overcome this limitation.
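The multi-wildcard sets and the false positive described above can be reproduced with a short set-level sketch. Elements c||p are again represented as `(character, position)` tuples with position 0 meaning existence; the function names are hypothetical, and the sketch assumes the query contains at least one wildcard.

```python
def index_set(word):
    """Normal order, reverse order and existence (position 0) of every character."""
    n = len(word)
    elements = set()
    for i, c in enumerate(word, start=1):
        elements.add((c, i))          # c||i
        elements.add((c, i - n - 1))  # c||(-(n-i+1))
        elements.add((c, 0))          # c||0
    return elements


def trapdoor_set(query, wildcard="*"):
    """Assumes at least one wildcard; str.index raises ValueError otherwise."""
    first, last = query.index(wildcard), query.rindex(wildcard)
    prefix, suffix = query[:first], query[last + 1:]
    elements = {(c, i) for i, c in enumerate(prefix, start=1)}
    elements |= {(c, i - len(suffix) - 1) for i, c in enumerate(suffix, start=1)}
    elements |= {(c, 0) for c in query if c != wildcard}
    return elements


cloud = index_set("cloud")
print(len(cloud))                       # the 15 elements listed for "cloud"
print(trapdoor_set("c*o*d") <= cloud)   # intended match
print(trapdoor_set("c*uo*d") <= cloud)  # also a subset: the false positive
```

The last test succeeds because only the existence of u and o is recorded, not their relative order, which is exactly the limitation noted in the text.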

4.2 Searchable symmetric encryption for wildcardsearch

With the technique described above, we can build the searchable symmetric encryption scheme. We summarize our searchable symmetric encryption scheme supporting wildcard search as follows.

– KeyGen(s): Given the security parameter s, the data owner outputs an s-bit string sk at random and generates r random numbers k_1, k_2, ..., k_r as the private inputs of the hash function. Then, the private key of the searchable symmetric encryption is K = {sk, k_1, k_2, ..., k_r}.

– BuildIndex(K, F): From the document collection F = {f_1, f_2, ..., f_n} of n text files, the data owner extracts m distinct keywords {w_1, w_2, ..., w_m}, and then for each w_i generates a file collection F_{w_i} such that the files in F_{w_i} contain w_i. For each keyword w_i, it constructs one Bloom filter BF_i as follows. Assume the keyword w_i contains l characters: w_i[1], w_i[2], ..., w_i[l]. It computes the set S_index = {w_i[1]||1, w_i[2]||2, ..., w_i[l]||l, w_i[1]||(−l), ..., w_i[l − 1]||(−2), w_i[l]||(−1), w_i[1]||0, ..., w_i[l]||0}. It adds the r hash values of each element in S_index to the Bloom filter BF_i with the r private keys k_1, ..., k_r, i.e., it sets the positions of the Bloom filter determined by the r hash values to 1. Let u be an upper bound on the length l. Pick (u − l) · r random values and set the respective bits of the Bloom filter to 1. Each Bloom filter BF_i for the keyword w_i is followed by the encrypted index entry I′ = Enc_sk(F_{w_i}, w_i, rd), where sk is the symmetric private key and rd is a random value generated by a random function.

– Trapdoor(K, w): The data owner receives the search keyword request w from the data user and generates the trapdoor T_w with its secret key, including the r private keys k_1, k_2, ..., k_r, as follows. For a querying keyword w that contains wildcards, the characters of w are written as w = w[1], w[2], ..., w[i], *, w[i + 2], ..., w[j], *, w[j + 2], ..., w[l]. After initializing a Bloom filter BF, the data owner computes the collection S_trapdoor = {w[1]||1, w[2]||2, ..., w[i]||i, w[j + 2]||(−(l − j − 1)), ..., w[l − 1]||(−2), w[l]||(−1), w[1]||0, w[2]||0, ..., w[i]||0, w[i + 2]||0, ..., w[j]||0, w[j + 2]||0, ..., w[l]||0}. That is, the characters before the first wildcard are recorded in the normal order, the characters after the last wildcard are recorded in the reverse order, and for the other characters only their existence is recorded. Then, it adds the r hash values of each element in the set S_trapdoor to the Bloom filter BF, i.e., it sets the positions of the Bloom filter determined by the r hash values to 1. The trapdoor for the keyword w is the Bloom filter T_w = BF.

– SearchIndex(T_w, I): Given the trapdoor T_w and the index I, it searches for the Bloom filters in the index in which all bits set to 1 in the trapdoor are also 1, and it outputs the matching entries of the index I. Finally, the entries I′ need to be decrypted to get the file identifiers. To improve efficiency, we can use a binary search tree to speed up the search.

4.3 Complete solution

Now, we show how to overcome the above limitation. Before introducing our method, we first recall the concept of n-grams. Let |s| be the length of s, s[i] be the i-th character of s, and s[i, j] be the substring from its i-th character to its j-th character. An n-gram is a contiguous sequence of n characters from a string s. Given a string s and a positive integer n, the n-gram set of s is the set of all g, where g is the n-gram of s starting at the i-th character, i.e., g = s[i, i + n − 1]. A string s has |s| − n + 1 overlapping n-grams. For example, if the string s = university, then its n-gram set for n = 3 is {uni, niv, ive, ver, ers, rsi, sit, ity}. We use G(s, n) to denote the set of all (n, g) pairs. For the above example, G(s, n) is the set {(3, uni), (3, niv), (3, ive), (3, ver), (3, ers), (3, rsi), (3, sit), (3, ity)}. Moreover, we use G(s) to denote the union of all G(s, n), for 1 ≤ n ≤ |s|.
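As a small illustration of these definitions, the following Python sketch enumerates n-grams and builds G(s, n) and G(s). The function names `ngrams`, `gram_pairs` and `all_gram_pairs` are hypothetical and chosen only for this example.

```python
def ngrams(s, n):
    """All contiguous substrings of length n, in order of their start position."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def gram_pairs(s, n):
    """G(s, n): the set of (n, g) pairs for a fixed n."""
    return {(n, g) for g in ngrams(s, n)}


def all_gram_pairs(s):
    """G(s): the union of G(s, n) over 1 <= n <= |s|."""
    pairs = set()
    for n in range(1, len(s) + 1):
        pairs |= gram_pairs(s, n)
    return pairs


trigrams = ngrams("university", 3)
```

For s = university and n = 3, `trigrams` holds the eight trigrams of the example, one for each start position from 1 to |s| − n + 1.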

Now, we show how to use n-grams to rebuild the searchable symmetric encryption scheme and overcome the limitation above.


– In trapdoor generation, if the keyword is w, then the characters before the first wildcard * are recorded in normal order and the characters after the last wildcard * are recorded in reverse order, as in the prior solution. But instead of hashing the existence of every character, it adds the hash values of every element of G(w) to its Bloom filter. If the keyword contains wildcards, it omits the wildcards and treats the keyword as several short sub-keywords. For example, if the keyword is c*ou*d, then it adds the sets G("c"), G("ou") and G("d") to its Bloom filter.

– For a keyword w′ in the search index, each character of w′ is recorded in both normal order and reverse order. Then, the hash values of every element of G(w′) are also added to its Bloom filter.

It is easy to see that if the trapdoor keyword w matches the keyword w′ in the index, then G(w) is a subset of G(w′). In this way, our method overcomes the limitation mentioned above.
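The containment of G(w) in G(w′) can be checked directly on plaintext strings. The sketch below is a simplified illustration with hypothetical function names: it covers only the n-gram part of the test and leaves out the hashing and the forward and reverse position elements, which the scheme checks in addition.

```python
WILDCARD = "*"


def all_grams(s):
    """G(s): every (n, g) pair with 1 <= n <= len(s)."""
    return {(n, s[i:i + n])
            for n in range(1, len(s) + 1)
            for i in range(len(s) - n + 1)}


def pattern_grams(pattern):
    """G(w) for a wildcard pattern: the union of G over its sub-keywords."""
    grams = set()
    for piece in pattern.split(WILDCARD):
        grams |= all_grams(piece)   # an empty piece contributes nothing
    return grams


def grams_match(pattern, keyword):
    """True if every gram of the pattern also occurs in the keyword."""
    return pattern_grams(pattern) <= all_grams(keyword)


ok = grams_match("c*ou*d", "cloud")
swapped = grams_match("c*uo*d", "cloud")
```

Here `ok` is true, while `swapped` is false: every character of c*uo*d occurs in cloud, but the gram (2, "uo") does not, so a sub-keyword whose characters appear in the wrong order is rejected.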

5 Security proof

Let A be an adversary and C be a challenger. We say that an adversary A (u, ε, v/r)-breaks a scheme if its advantage AdvA is at least ε after A takes at most u time and queries C for trapdoors v times. In this section, we analyze the security of our new schemes.

Theorem 1 If the number of pseudo-random functions is r, then our scheme is (u, ε, v/r)-IND-CKA secure if f is a (u, ε, v)-pseudo-random function.

Proof We prove this theorem using its contrapositive (similar to the proof given in [4]). Suppose that our scheme is not (t, ε, q/r)-IND-CKA secure (writing t and q for the time bound u and the query bound v of the theorem), that is, there is an algorithm A which (t, ε, q/r)-breaks our scheme. Then, we can construct an algorithm B which distinguishes whether f is a pseudo-random function or a random function. Given x ∈ {0, 1}^n, B can use an oracle O_f which outputs f(x) ∈ {0, 1}^s for the unknown function f. B evaluates f with a query to O_f whenever it runs any of the four index algorithms.

The algorithm B simulates the game for A as follows.

Setup. C creates a set S of q keywords and gives it to A. A chooses a collection S∗ of subsets of S and gives it to C. After receiving S∗, C runs KeyGen to generate a secret key K and computes the indexes for all subsets in S∗ with BuildIndex. Finally, C gives all indexes and the related subsets S∗ to A. We note that the correspondence between the indexes and the subsets in S∗ is unknown to A.

Query. A can query C for the trapdoor Tx of a word x. For each index I, A can execute SearchIndex on Tx and I to tell whether I matches x.

Challenge. A picks two non-empty subsets V0, V1 ∈ S∗ such that |V0 − V1| ≠ 0, |V1 − V0| ≠ 0 and |V0| = |V1|. Here, A must not have queried C for the trapdoor of any keyword in V0 △ V1, the symmetric difference of V0 and V1. A gives V0 and V1 to C. C chooses b from {0, 1} at random, computes BuildIndex(K, Vb) to get an index corresponding to Vb, and gives it back to A. After C gives the challenge (i.e., BuildIndex(K, Vb)) to A, A still cannot query C for the trapdoor of any keyword x ∈ V0 △ V1.

Response. A finally outputs b′. B outputs 0 if b = b′, that is, B guesses that f is a pseudo-random function. Otherwise, B outputs 1.

B takes at most t time because A takes at most t time. B sends at most q queries to O_f because there are only q/r keywords, A makes at most q/r queries, and B makes r queries to O_f for each single query of A.

Finally, based on the following lemmas, B has an advantage greater than ε in determining whether the unknown function f is a pseudo-random function or a random function, because we have the following inequality:

$$\left|\Pr\left[B^{f(\cdot,k)} = 0 \;\middle|\; k \xleftarrow{R} \{0,1\}^s\right] - \Pr\left[B^{g} = 0 \;\middle|\; g \xleftarrow{R} \{F : \{0,1\}^n \to \{0,1\}^s\}\right]\right| > \varepsilon$$

This contradicts the assumption that f is a pseudo-random function. Therefore, our scheme is (t, ε, q/r)-IND-CKA secure if f is a (t, ε, q)-pseudo-random function. □

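Lemma 1 $\left|\Pr\left[B^{f(\cdot,k)} = 0 \;\middle|\; k \xleftarrow{R} \{0,1\}^s\right] - \frac{1}{2}\right| \ge \varepsilon$ if f is a pseudo-random function.

Lemma 2 $\Pr\left[B^{g} = 0 \;\middle|\; g \xleftarrow{R} \{F : \{0,1\}^n \to \{0,1\}^s\}\right] = \frac{1}{2}$ if g is a random function.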

We sketch the proofs of these two lemmas, following [4].

Proof Lemma 1 is immediate, as the algorithm B completely simulates the challenger C in an IND-CKA game if f is a pseudo-random function.

For Lemma 2, we only need to consider the challenge subsets V0 and V1, as the other subsets of S do not leak any information about the challenge subsets. Without loss of generality, suppose V0 △ V1 contains two elements x and y, where x ∈ V0 and y ∈ V1, and the adversary A guesses b correctly with advantage δ. Given f(z), this means that A can determine whether z = x or z = y with advantage δ. That is to say, A can distinguish the output of the function f with advantage δ. But if f is a random function, A cannot distinguish its output, so we have δ = 0. Therefore, A can guess b with probability at most 1/2. Lemma 2 is proved.

Note that our scheme is not yet secure against adaptive chosen-keyword attacks (CKA-2). The security can be strengthened by utilizing the method proposed in [9,10]: it only needs to associate the file identifiers in Fw with encrypted pointers. The scheme is then secure against adaptive chosen-keyword attacks (CKA-2). For more details, please refer to [9,10]. □

6 Performance

In this section, we analyze the computation and storage complexity of our scheme. The size of the Bloom filter plays a main role in the complexity. Therefore, we first show the parameters of the Bloom filter, and then we compare the complexity of our scheme with prior schemes in detail.

6.1 Bloom filter parameters

First, we fix some notation. Let δ be the maximum number of hash values in each Bloom filter, t the upper bound of the hash function's output (i.e., the array size of the Bloom filter), r the number of hash functions, and p the false-positive rate. As presented in [4], we have the relation p ≈ (1 − e^(−rδ/t))^r. The optimal value r = (t ln 2)/δ minimizes the false-positive rate to (1/2)^r, since then e^(−rδ/t) = 1/2. From this, we can choose optimal Bloom filter parameters as follows:

1. Choose the false-positive rate p.
2. From p, the number r of hash functions is determined by r = −log2 p.
3. With r and δ, the array size t is given by t = δr/ln 2.

Now, we list suitable Bloom filter parameters for a specific setting. If p is 0.001, then r = −log2 p ≈ 10. That is to say, 10 hash functions such as SHA-2 are required for p = 0.001.
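The three steps above translate into a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, r and t are rounded up to integers (an assumption, since the text does not specify rounding), and the load δ = 15 in the example is an assumed value.

```python
import math


def bloom_parameters(p, delta):
    """Return (r, t) for false-positive rate p and at most delta hash values per filter."""
    r = math.ceil(-math.log2(p))            # step 2: number of hash functions
    t = math.ceil(delta * r / math.log(2))  # step 3: array size in bits (natural log)
    return r, t


def false_positive_rate(r, delta, t):
    """p ~ (1 - e^(-r*delta/t))^r, the relation quoted from [4]."""
    return (1 - math.exp(-r * delta / t)) ** r


r, t = bloom_parameters(0.001, 15)   # delta = 15 is an assumed load
p_check = false_positive_rate(r, 15, t)
```

For p = 0.001 this gives r = 10, and evaluating `false_positive_rate` on the resulting (r, t) returns a value close to the requested p, which is a quick sanity check on the chosen array size.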

For the wildcard case, let n be the length of the keywords. Then, we need to add about δ = 3n hash values to the Bloom filter. The average length of the keywords is about 5, so with r = 10 we have δ = 15 and t = δr/ln 2 ≈ 217, i.e., the Bloom filter array has about 217 bits. For the complete solution in Sect. 4.3, we need to add about

$$\binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n-1} + \binom{n}{n} = 2^n - 1 \approx 2^n$$

hash values for the grams instead of n hash values for the existence of each character. The overall number of hash values we need to add to the Bloom filter is then about 2^n + 2n. So if n = 5, then δ = 42 and the size of the Bloom filter is t = δr/ln 2 ≈ 606. As r need not be so big, we can choose the length of the Bloom filter to be 256 bits or 512 bits.

6.2 Complexity

Now, we give the computation and storage complexity of our scheme and compare them with previous schemes.

6.2.1 Computation complexity

First, we discuss the computation complexity. For the index-building algorithm of our scheme, one Bloom filter is built for each keyword. So, the computation complexity is O(m), where m is the number of distinct keywords in the whole file collection. For trapdoor generation, only one Bloom filter needs to be built, so the complexity of generating a trapdoor is O(1). For the search algorithm, the computation complexity is O(m) because it needs to search through the entire index. To reduce the search complexity, we can use a binary search tree to implement the search algorithm. We list the computation complexity in Table 1.

In this table, we use the column "Wildcard" to represent the number of characters that one wildcard can represent. From Table 1, we can see that the computational complexity of our scheme is the same as that of [2]. But our scheme can deal with the case where one wildcard * represents any number of characters, while [2] is only suitable for the case where one wildcard * represents one character. So, our scheme is more efficient and practical than [2]. In general, the number of documents n is much larger than the number of keywords m, so our scheme is better than [3] in computation complexity. Moreover, in [3] the hash values of many different keywords need to be added into one Bloom filter, so its computation complexity is larger.

6.2.2 Storage complexity

In this subsection, we consider the storage complexity. For the index, each keyword is stored as one Bloom filter with a size of 256 bits. So, the storage complexity is O(m). The trapdoor has only O(1) storage complexity. For the search result, the storage complexity is O(m). Table 2 lists the comparison between our scheme and previous schemes in terms of storage complexity. In this table, we also use the column "Wildcard" to represent the number of characters that one wildcard can represent.

Table 1 Computational complexity

| Scheme | Wildcard | Trapdoor | Index | Search |
|---|---|---|---|---|
| Bosch [3] | Any | O(1) | O(n) | O(n) |
| Suga [2] | One | O(1) | O(m) | O(m) |
| Our scheme | Any | O(1) | O(m) | O(m) |

Table 2 Storage complexity

| Scheme | Wildcard | Trapdoor | Index | Search |
|---|---|---|---|---|
| Bosch [3] | Any | O(1) | O(n) | O(n) |
| Suga [2] | One | O(1) | O(m) | O(m) |
| Our scheme | Any | O(1) | O(m) | O(m) |

Similar to the computation complexity, Table 2 shows that the storage complexity of our scheme is almost the same as that of [2], and that our scheme is better than [3]. Moreover, the size of the Bloom filter in our scheme is much smaller than in [3].

Overall, our scheme matches [2] and improves on [3] in both computation complexity and storage complexity, while supporting wildcards that represent any number of characters.

7 Conclusion

In this paper, we propose an efficient searchable symmetric encryption scheme that supports wildcard search in which one wildcard can represent any number of characters. By analyzing the computation and storage complexity, we show that our scheme is more efficient than previous schemes. We also show that the new scheme can be strengthened to be secure against adaptive chosen-keyword attackers. Moreover, our scheme can support dynamic operations, i.e., file addition and deletion. Our wildcard search technique is of independent interest.

Acknowledgments This work was supported by China Postdoctoral Science Foundation (No. 2013M532104), 973 program (No. 2013CB834205) and Natural Science Foundation of Zhejiang Province (No. LZ12F02005). Also, we thank the reviewers for their constructive comments.

References

1. Armbrust, M., Fox, A., Griffith, R., et al.: A view of cloud computing. Commun. ACM 53(4), 50–58 (2010)

2. Suga, S., Nishide, T., Sakurai, K.: Secure keyword search using Bloom filter with specified character positions. In: Proceedings of ProvSec 2012, LNCS 7496, pp. 235–252 (2012)

3. Bosch, C., Brinkman, R., Hartel, P., Jonker, W.: Conjunctive wildcard search over encrypted data. In: Proceedings of SDM 2011, pp. 114–117 (2011)

4. Goh, E.-J.: Secure indexes. IACR Cryptology ePrint Archive 2003, report 216 (2003). http://eprint.iacr.org/2003/216

5. Goldreich, O., Ostrovsky, R.: Software protection and simulation on oblivious RAMs. J. ACM 43(3), 431–473 (1996)

6. Song, D., Wagner, D., Perrig, A.: Practical techniques for searching on encrypted data. In: IEEE Symposium on Security and Privacy (SSP), pp. 44–55 (2000)

7. Chang, Y., Mitzenmacher, M.: Privacy preserving keyword searches on remote encrypted data. In: Proceedings of Applied Cryptography and Network Security (ACNS), pp. 442–455 (2005)

8. Watanabe, C., Arai, Y.: Privacy-preserving queries for a DAS model using encrypted Bloom filter. In: Proceedings of DASFAA 2009, LNCS 5463, pp. 491–495 (2009)

9. Curtmola, R., Garay, J., Kamara, S., Ostrovsky, R.: Searchable symmetric encryption: improved definitions and efficient constructions. In: Proceedings of ACM Conference on Computer and Communications Security (CCS), pp. 79–88 (2006)

10. Kamara, S., Papamanthou, C., Roeder, T.: Dynamic searchable symmetric encryption. In: Proceedings of ACM Conference on Computer and Communications Security (CCS), pp. 965–976 (2012)

11. Kamara, S., Papamanthou, C.: Parallel and dynamic searchable symmetric encryption. In: Proceedings of Financial Cryptography and Data Security (FC'13), LNCS 7859, pp. 258–274 (2013)

12. Stefanov, E., Papamanthou, C., Shi, E.: Practical dynamic searchable encryption with small leakage. In: NDSS (2014)

13. Naveed, M., Prabhakaran, M., Gunter, C.A.: Dynamic searchable encryption via blind storage. IACR Cryptology ePrint Archive 2014, report 219 (2014). https://eprint.iacr.org/2014/219

14. Cash, D., Jaeger, J., Jarecki, S., et al.: Dynamic searchable encryption in very large databases: data structures and implementation. In: NDSS (2014)

15. Sedghi, S., van Liesdonk, P., Nikova, S., Hartel, P., Jonker, W.: Searching keywords with wildcards on encrypted data. In: Proceedings of SCN 2010, LNCS 6280, pp. 138–153 (2010)

16. Wang, C., Ren, K., Yu, S.: Achieving usable and privacy-assured similarity search over outsourced cloud data. In: Proceedings of IEEE INFOCOM 2012, pp. 451–459 (2012)

17. Wang, B., Yu, S., Lou, W., Hou, Y.T.: Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud. In: IEEE INFOCOM (2014)
