design and analysis of password-based key derivation functions




3292 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 9, SEPTEMBER 2005

Design and Analysis of Password-Based Key Derivation Functions

Frances F. Yao and Yiqun Lisa Yin

Abstract: A password-based key derivation function (KDF), a function that derives cryptographic keys from a password, is necessary in many security applications. Like any password-based scheme, such KDFs are subject to key search attacks (often called dictionary attacks). Salt and iteration count are used in practice to significantly increase the workload of such attacks. These techniques have also been specified in widely adopted industry standards such as PKCS and IETF. Despite the importance and widespread usage, there has been no formal security analysis of existing constructions.

In this correspondence, we propose a general security framework for password-based KDFs and introduce two security definitions, each capturing a different attack scenario. We study the most commonly used construction $H^{(c)}(p \,\|\, s)$ and prove that the iteration count $c$, when fixed, does have the effect of stretching the password by $\log_2 c$ bits. We then analyze the two standardized KDFs in PKCS #5. We show that both are secure if the adversary cannot influence the parameters but are subject to attacks otherwise. Finally, we propose a new password-based KDF that is provably secure even when the adversary has full control of the parameters.

Index Terms: Cryptography, dictionary attack, exhaustive key search, key derivation function (KDF), password-based security.

I. INTRODUCTION

A. Background and Motivation

Cryptographic keys are essential in virtually all security applications. In practice, however, inputs to an application are typically raw key materials, such as passwords, that are not yet in a form to be used as keys. Therefore, a key derivation function (KDF), a function that derives cryptographic keys from raw key materials, is often a necessary component in any security application.

There are many usage scenarios for KDFs depending on the form of raw key materials. For example, the key materials can be a user password, a random seed value from some entropy source, or a long output from a cryptographic operation such as Diffie–Hellman key agreement. The second scenario is typically handled by a pseudorandom number generator (e.g., the FIPS PRNG in [1]), while the third scenario is handled by hashing the long output down to the required key length (e.g., the KDF1 in [2]). In both scenarios, there is usually enough entropy in the raw key materials.

In this correspondence, we focus our study on password-based KDFs. Unlike the other two scenarios mentioned above, passwords, in particular those chosen by a user, often are short or have low entropy. Therefore, special treatment is required in the key derivation process to defend against exhaustive key search attacks.

Manuscript received August 3, 2004; revised May 13, 2005. The material in this correspondence was presented at the Cryptographers' Track, RSA Conference 2005, San Francisco, CA, February 2005.

F. F. Yao is with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China.

Y. L. Yin was with Princeton Architecture Laboratory for Multimedia and Security, Princeton University, Princeton, NJ 08544 USA.

Communicated by T. Johansson, Associate Editor for Complexity and Cryptography.

Digital Object Identifier 10.1109/TIT.2005.853307

The basic approach for designing a password-based KDF is to derive the key from the password $p$ and a random known value $s$ (called salt), by applying a function $H$ (such as a hash, keyed hash, or block cipher) for a number of iterations $c$ (called iteration count). For example, the following is a typical construction (see footnote 1):

$$\mathrm{key} = H^{(c)}(p \,\|\, s).$$

Intuitively, the salt $s$ serves the purpose of creating a large set of possible keys corresponding to a given password, among which one key is selected according to the salt used in each execution of the KDF. The iteration count $c$ serves the purpose of increasing the cost of deriving each key, thereby significantly increasing the workload of key search attacks. These two techniques have been commonly used in practice and also specified in widely adopted industry standards, including PKCS [3] and IETF [4].

In a more general setting, we can view a password-based KDF as a method for stretching any short keys (not necessarily passwords) into long keys. An efficient key stretching method can be useful in strengthening security without any architectural or policy changes for complex systems (e.g., a credit-card system). In addition, password-based KDFs can be used in a straightforward way to define password-based encryption schemes and message authentication schemes.

Despite their importance and widespread usage, there has been no formal security analysis of existing password-based KDFs. Furthermore, there has been no general security framework for analyzing such KDFs. It is possible that the above popular KDF construction is insecure against some sophisticated key-search attacks even though the parameters are chosen large enough and the underlying hash function $H$ is sound (e.g., it can be considered as a random oracle).
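The typical construction $\mathrm{key} = H^{(c)}(p \,\|\, s)$ described above can be sketched in a few lines of Python. This is only an illustration: the function name `iterated_kdf` and the choice of SHA-256 for $H$ are assumptions made for the example, not something the correspondence prescribes.

```python
import hashlib


def iterated_kdf(password: bytes, salt: bytes, count: int) -> bytes:
    """Compute H^(c)(p || s), with SHA-256 standing in for H (an assumption)."""
    if count < 1:
        raise ValueError("iteration count must be at least 1")
    state = password + salt              # u = p || s
    for _ in range(count):               # apply H exactly `count` times
        state = hashlib.sha256(state).digest()
    return state


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the call shape.
    key = iterated_kdf(b"correct horse", b"\x00" * 8, 1000)
    print(key.hex())
```

Each derivation costs `count` hash calls, which is the per-guess cost a dictionary attacker also has to pay.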

B. Our Framework

In this correspondence, we propose a general security framework for studying password-based KDFs. Our framework aims at capturing various types of key-search attacks and allowing concrete analysis of the attacker's success probability in relation to the computational resources available for launching key-search attacks. We also model salt and iteration count the way they are used in practice so that their impact on the overall security of the scheme can be quantified.

Key search attacks on password-based KDFs can be quite sophisticated [5], but they generally target the construction of the KDF $F$ and treat the underlying function $H$ as a black-box transformation. This motivates us to model the underlying primitive $H$ as a random function, and hence its internal structure is ignored in the analysis. We define two levels of security for the KDF depending on the capability of the adversary $A$: in the weak model, $A$ can only observe the output of $F$, while in the strong model $A$ can query $F$ on inputs of its choice. In both models, $A$ can make queries to $H$, and the maximum number of such queries captures $A$'s available computational power to launch key search attacks. Roughly, the construction of $F$ is secure under each model if $A$ with the given capability cannot distinguish the output of $F$ for an unknown password $p$ from a random string.

C. Main Results

Using the proposed framework, we first study the security of the iterative construction. We prove that if an adversary $A$ makes at most $t$ queries to $H$, then the success probability that it can distinguish the derived key for an unknown $p$ from a random string of the same length $n$ satisfies

$$\frac{\lfloor t/c \rfloor}{|PW|} < \mathrm{Adv}(t) < \frac{\lfloor t/c \rfloor}{|PW|} + O\!\left(\frac{(c+t)^2}{2^n}\right).$$

Footnote 1: Throughout the correspondence, we use $H^{(c)}$ to denote that $H$ is applied $c$ times and use $\|$ to denote the concatenation of two strings.

0018-9448/$20.00 © 2005 IEEE

The second term is negligible in practical settings since the derived key length $n$ is at least 128 bits. In our analysis, $H$ is represented as an unknown random functional graph, whose properties are gradually revealed to the attacker via an explicit query graph. The attacker's advantage corresponds to a certain conditional probability based on the query graph, and its analysis is done through a nontrivial sequence of reductions to facilitate the estimation of the conditional probability.

The above result implies that, for a fixed iteration count, there is no shortcut in the key search other than computing $H$ iteratively for each password. In other words, the iteration count $c$ increases the workload of exhaustive key search by a factor of $c$. So the iterative construction effectively stretches a $k$-bit key to a $(k + \log_2 c)$-bit key.
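As a small worked illustration of the stretching claim, the following Python sketch evaluates $k + \log_2 c$ for a few iteration counts. The function name `effective_key_bits` and the sample value $k = 40$ are assumptions made for the example.

```python
import math


def effective_key_bits(password_bits: float, count: int) -> float:
    """Effective key length k + log2(c) of a k-bit password iterated c times."""
    return password_bits + math.log2(count)


if __name__ == "__main__":
    # k = 40 is a hypothetical password entropy used only for illustration.
    for c in (1, 1000, 2**16):
        print(c, effective_key_bits(40, c))
```

Under this reading, the recommended minimum of 1000 iterations adds just under 10 bits, and $c = 2^{16}$ adds exactly 16 bits.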

We then focus our attention on practical password-based KDFs and analyze the two KDFs in PKCS #5 [3], the de facto standard for password-based cryptography. Our analysis of the iterative construction implies that the two KDFs are weakly secure as long as the computational resource available to the attacker is much less than $c|PW|$ (although it can be much larger than $|PW|$). We also show that neither KDF is secure in the strong model and discuss how such a security weakness may be exploited to mount attacks in practical scenarios.

Finally, based on the insight gained from our earlier analysis, we propose a new password-based KDF that is provably secure in the strong model. The new proposal preserves the same efficiency as existing KDFs with enhanced security.

D. Related Work

Rigorous analysis of password-based key derivation schemes seems to have received relatively little attention compared to other types of cryptographic schemes. In [5], the term key-stretching is used for the conversion of low-entropy keys into longer keys by mechanisms such as iterated hash. A connection was made between the cost of computing $H^{(c)}$ and the cost of finding a collision for $H$. Roughly speaking, if $H^{(c)}$ could be computed on average with fewer than $c/2$ calls to $H$, then this would lead to a collision search for $H$ faster than the naive birthday attack. Although it would be hard to translate the result into a standard concrete security model, it is certainly of practical interest. Our results on $H^{(c)}$ are well quantified; moreover, the framework for KDFs is rigorously defined and the effects of salt and iteration are studied in a standard distinguish-from-random model with respect to specific query types.

KDFs in a general sense share some similarity in their design, such as the use of hash functions to process the raw key materials. For instance, the pseudorandom number generator (PRNG) defined in FIPS [1] is hash based and can derive long keys from a random seed. The exact construction, however, is quite different: in the password-based KDF, the hash is applied iteratively, while in the PRNG, the hash outputs are concatenated to produce the key. Therefore, security analysis of the PRNG type of key derivation [6] does not directly apply to password-based key derivation.

Another related subject is password-based authentication and key exchange protocols. The goal of such protocols is to authenticate two parties who share a common password. Existing protocols use various public-key techniques such as RSA or Diffie–Hellman in a way that the messages exchanged between the parties provide little or no help for an attacker to guess the password. A survey of well-analyzed protocols can be found in [7].

II. SECURITY FRAMEWORK AND DEFINITIONS

A. KDF Model

We denote a password-based KDF as $y = F(p, s, c)$, where $p$ is the password, $s$ is the salt, $c$ is the iteration count, and $y$ is the derived key of length $n$. Let the sets of passwords, salts, and iteration counts be $PW$, $S$, and $C$, respectively. For a fixed $p \in PW$, we can view $y = F_p(s, c)$ as a function with input $(s, c)$ and output $y$. We interchangeably write $F(p, s, c)$ or $F_p(s, c)$ depending on the context. For simplicity, we assume that the length of $p \,\|\, s \,\|\, c$ is at most $n$, although our analysis can be applied to the more general case.

We denote the underlying primitive that is used to construct $F$ as $H$. For example, $H$ can be a hash function such as SHA-1. Since key search attacks typically treat $H$ as a black-box transformation without exploiting its internal structure, we model $H$ as a random function from $\{0,1\}^n$ to $\{0,1\}^n$. That is, we focus our analysis on how $F$ is constructed based on $H$ rather than on the structure of $H$ itself.

B. Attack Model

Consider a typical usage scenario of a password-based KDF in which two users Alice and Bob share a password $p$. To encrypt a message, Alice sets $s, c$ and derives a key $y = F(p, s, c)$. She then uses $y$ to encrypt and obtain the ciphertext $z$. Alice sends $(z, s, c)$ in the clear to Bob. Bob derives the key $y$ by computing $y = F(p, s, c)$ and uses $y$ to decrypt $z$.

From the attacker's point of view, the salt $s$ and the count $c$ are both known. The attacker usually does not have control of $s$ or $c$, but in certain scenarios they can be chosen. The derived key $y$ may be hidden from the attacker, but it can become known for various reasons (e.g., it was leaked due to system security holes), in which case the attacker obtains a tuple $(y, s, c)$ corresponding to some unknown password $p$.

In our attack model, we assume that $s$ and $c$ can be either known or chosen, and the derived key $y$ is always known. An attacker $A$ on a password-based KDF is a polynomial-time algorithm that may use the following two types of oracle queries.

• Type 1: Query the underlying function $H$ on input $x$ and obtain $H(x)$. The number of such queries is usually determined by the adversary's available computational resource for performing an offline key search attack.

• Type 2: For an unknown password $p$, query the KDF $F$ on input $(s, c)$ and obtain the derived key $y = F_p(s, c)$. The number of such queries is very limited, since it is usually determined by the security design and policy of the system, not the attacker.
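The two query types can be pictured as a pair of budgeted oracles. The sketch below is an assumption-laden toy: the class name `QueryOracles`, the use of SHA-256 for $H$, and the plain iterated construction for $F$ are all chosen here for illustration only.

```python
import hashlib


class QueryOracles:
    """Toy Type 1 / Type 2 oracles with budgets t and m (illustrative only)."""

    def __init__(self, password: bytes, t: int, m: int):
        self._password = password    # the unknown password p, hidden from the attacker
        self.t_left = t              # remaining Type 1 queries
        self.m_left = m              # remaining Type 2 queries

    def type1(self, x: bytes) -> bytes:
        """Type 1: return H(x); SHA-256 stands in for H."""
        if self.t_left <= 0:
            raise RuntimeError("Type 1 budget exhausted")
        self.t_left -= 1
        return hashlib.sha256(x).digest()

    def type2(self, salt: bytes, count: int) -> bytes:
        """Type 2: return F_p(s, c); here F is the iterated construction."""
        if self.m_left <= 0:
            raise RuntimeError("Type 2 budget exhausted")
        self.m_left -= 1
        y = self._password + salt
        for _ in range(count):       # hashing done by the system is not charged to t
            y = hashlib.sha256(y).digest()
        return y
```

Keeping the two budgets separate mirrors the text: $t$ reflects the attacker's own computing power, while $m$ is limited by system policy.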

C. Security Definition

We introduce two security definitions (weakly secure and strongly secure) depending on the attacker's capability. In the weak model, we assume that the attacker $A$ can make only Type 1 queries, while in the strong model, we assume that $A$ can make both Type 1 and Type 2 queries. The goal of the attacker $A$ is to distinguish the derived key $y = F_p(s, c)$ for $p \xleftarrow{r} PW$ from a random string of the same length (see footnote 2).

Definition 1 (Weakly/Strongly Secure KDF): Let $y = F_p(s, c)$ be a password-based KDF. Let $b \in \{0, 1\}$. We consider the following experiment depending on $b$:

Footnote 2: If $X$ is a set, then $x \xleftarrow{r} X$ denotes selecting $x$ uniformly at random from $X$.


Experiment $E_{A,b}$
    $p_0 \xleftarrow{r} PW$    // password is generated at random
    $s_0 \leftarrow S$; $c_0 \leftarrow C$    // salt and count are fixed and known
    If $b = 0$, then $y_0 = F(p_0, s_0, c_0)$, else $y_0 \xleftarrow{r} \{0,1\}^n$
    $i \leftarrow 0$
    repeat
        $i \leftarrow i + 1$
        if weak model, $A$ chooses $x_i$ and is given $H(x_i)$
        if strong model, $A$ first decides which type of query to make
            If Type 1, $A$ chooses $x_i$ and is given $H(x_i)$
            If Type 2, $A$ chooses $(s_i, c_i) \neq (s_0, c_0)$ and is given $y_i = F(p_0, s_i, c_i)$
    until $A$ reaches the maximum number of queries
    $A$ outputs either 0 or 1

The success probability of $A$ is defined as

weak model: $\mathrm{Adv}_A(t) = \Pr[E_{A,0} = 1] - \Pr[E_{A,1} = 1]$

strong model: $\mathrm{Adv}_A(t, m) = \Pr[E_{A,0} = 1] - \Pr[E_{A,1} = 1]$

where $t$ and $m$ denote the maximum number of Type 1 and Type 2 queries, respectively. The maximum success probability achievable by any adversary $A$ in the weak/strong model is denoted by $\mathrm{Adv}(t)$ and $\mathrm{Adv}(t, m)$, respectively.
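To make the weak-model experiment concrete, the following toy Python sketch runs it against a plain dictionary adversary and measures the advantage empirically. Everything specific here is an assumption for illustration: the tiny password set, truncated SHA-256 as $H$ with $n = 128$, the iterated construction as $F$, and the names `dictionary_adversary` and `experiment`.

```python
import hashlib
import os

N_BYTES = 16  # n = 128 bits (toy choice)


def H(x: bytes) -> bytes:
    """Stand-in for the random function H: SHA-256 truncated to n bits."""
    return hashlib.sha256(x).digest()[:N_BYTES]


def F(p: bytes, s: bytes, c: int) -> bytes:
    """Iterated construction F(p, s, c) = H^(c)(p || s)."""
    y = p + s
    for _ in range(c):
        y = H(y)
    return y


def dictionary_adversary(y0, s0, c0, passwords, t):
    """Output 1 if one of the first floor(t / c0) passwords derives y0."""
    for p in passwords[: t // c0]:      # each candidate costs c0 Type 1 queries
        if F(p, s0, c0) == y0:
            return 1
    return 0


def experiment(b, p0, s0, c0, passwords, t):
    """E_{A,b}: real derived key if b == 0, a random n-bit string otherwise."""
    y0 = F(p0, s0, c0) if b == 0 else os.urandom(N_BYTES)
    return dictionary_adversary(y0, s0, c0, passwords, t)


passwords = [f"pw{i}".encode() for i in range(64)]   # hypothetical PW, |PW| = 64
s0, c0, t = b"salt", 10, 165
# Average over every possible p0, which equals the uniform choice of p0.
hits_real = sum(experiment(0, p, s0, c0, passwords, t) for p in passwords)
hits_rand = sum(experiment(1, p, s0, c0, passwords, t) for p in passwords)
advantage = (hits_real - hits_rand) / len(passwords)
print(advantage, (t // c0) / len(passwords))
```

In this toy setting the measured advantage coincides with $\lfloor t/c_0 \rfloor / |PW|$, up to the negligible chance that a random string matches a derived key.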

III. PASSWORD-BASED KDFS IN PRACTICE

Password-based KDFs are commonly used in practice. They are also specified in industry standards such as PKCS #5, PKCS #12, IETF, and OpenPGP. Here we describe the KDFs in PKCS #5, which is considered the de facto standard for password-based cryptography. KDFs in other standards mostly follow similar designs.

Two password-based KDFs are specified in PKCS #5 v2.0 [3]: PBKDF1 and PBKDF2. Some recommendations are given regarding the use of salt and iteration count. For example, $s$ should be at least 64 bits, and $c$ should be at least 1000.

The underlying function in PBKDF1 is a hash function $H(\cdot)$ such as MD2, MD5, or SHA-1. The derived key is defined as $y = H^{(c)}(p \,\|\, s)$. Most other standards or implementations use this construction.

PBKDF2 was intended to provide more security. The underlying function in PBKDF2 is a keyed hash function $H_k(\cdot)$, such as HMAC (the Keyed-Hash Message Authentication Code) [8]. The password $p$ is used as the key $k$ in each invocation of $H_k(\cdot)$. The derived key is defined as $y = U_1 \oplus U_2 \oplus \cdots \oplus U_c$, where $U_i = H_p^{(i)}(s)$ for $i = 1, \ldots, c$. The EXCLUSIVE-ORs add an extra layer of protection, but at the core of the construction is still the iterative application of $H_p$.
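The two constructions as described above can be sketched with Python's standard `hashlib` and `hmac` modules. The function names `pbkdf1_like` and `pbkdf2_like` are hypothetical, SHA-1 and HMAC-SHA-1 are chosen as assumptions, and the sketch follows the simplified single-block description in the text rather than the full standard.

```python
import hashlib
import hmac


def pbkdf1_like(p: bytes, s: bytes, c: int) -> bytes:
    """y = H^(c)(p || s) with H = SHA-1, following the PBKDF1 description."""
    y = p + s
    for _ in range(c):
        y = hashlib.sha1(y).digest()
    return y


def pbkdf2_like(p: bytes, s: bytes, c: int) -> bytes:
    """y = U_1 xor ... xor U_c with U_i = H_p^(i)(s) and H_p = HMAC-SHA-1."""
    u = hmac.new(p, s, hashlib.sha1).digest()        # U_1 = H_p(s)
    y = u
    for _ in range(c - 1):
        u = hmac.new(p, u, hashlib.sha1).digest()    # U_{i+1} = H_p(U_i)
        y = bytes(a ^ b for a, b in zip(y, u))       # running XOR of all U_i
    return y


if __name__ == "__main__":
    print(pbkdf1_like(b"password", b"salt", 1000).hex())
    print(pbkdf2_like(b"password", b"salt", 1000).hex())
```

One difference from the standardized PBKDF2 is worth noting: the standard appends a 4-byte block index to the salt before the first HMAC call. For a single 20-byte output block, `pbkdf2_like(p, s + b"\x00\x00\x00\x01", c)` therefore matches `hashlib.pbkdf2_hmac("sha1", p, s, c, 20)`.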

IV. EFFECTS OF ITERATION COUNT

In this section, we focus our analysis on the iteration count and quantify its effect on the security of a KDF. A KDF with an iterative structure is of the form

$$y = H^{(c)}(p, s) = H^{(c)}(u)$$

where $u$ is typically $p \,\|\, s$. For simplicity, we assume that the length of $p \,\|\, s$ is at most $n$. We will show that an adversary with access only to Type 1 queries and computational resources significantly less than $c|PW|$ has only a small probability of success in distinguishing the derived key from a random string, as stated in the following theorem.

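Theorem 1: In the weakly secure model for KDF, if the adversary makes at most $t$ Type 1 queries, then the maximum success probability $\mathrm{Adv}(t)$ satisfies

$$\frac{\lfloor t/c_0 \rfloor}{|PW|} < \mathrm{Adv}(t) < \frac{\lfloor t/c_0 \rfloor}{|PW|} + O\!\left(\frac{(t + c_0)^2}{2^n}\right).$$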

The proof of Theorem 1 proceeds in three steps. In the first step, we set up a random functional graph representation $G_H$ for $H$. In this representation, $H^{(c)}$ corresponds to an unknown random path of length $c$, while the knowledge gained by the adversary corresponds to an explicit subgraph $Q_H$ of $G_H$ with $t$ edges. The adversary's advantage is thus a conditional probability of guessing $H^{(c)}$ correctly based on $Q_H$. In the second step, we show that, with at most an $O((t + c_0)^2/2^n)$ effect on the conditional probability, the query graph $Q$ may be reduced to certain simple forms. These reductions allow us to prove in step 3 that, so long as $Q$ does not contain the full path from $u$ to $H^{(c)}(u)$, the randomness of $H^{(c)}(u)$ is not affected much by the conditioning imposed by $Q$.

A. Graph Representation for cth Iterate Function

Our analysis uses a graph to represent the querying process for $H$. Let $G_H$ be a directed graph on the vertex set $\{0,1\}^n$; a directed edge $(x, y)$ exists in $G_H$ if and only if $H(x) = y$. Hence, every vertex has out-degree 1 and $G_H$ contains $2^n$ edges. The adversary, by probing a sequence of $t$ edges "$H(x) = ?$", discovers a subgraph $Q_H$ of $G_H$ which is referred to as the query graph. Since the same query subgraph $Q_H$ can arise from different functions $H$, it is sometimes convenient to write $Q$ without referring to a specific $H$.

It may be helpful to review some mathematical background on the $c$th iterate of a random function. Although random functions have been studied extensively in the literature, the $c$th iterate function $H^{(c)}$ has received relatively little attention. For example, it is well known that the image of a random function has size $(1 - e^{-1}) 2^n$. What about the image of $H^{(c)}$? At what rate does the image size decrease with $c$? In a 1990 paper, Flajolet and Odlyzko [9] provided answers to these questions by deriving the following recurrence.

Image Size of $H^{(c)}$: The image size of $H^{(c)}$ is $(1 - \tau_c) 2^n$, where $\tau_c$ satisfies the recurrence $\tau_0 = 0$, $\tau_{c+1} = e^{-1 + \tau_c}$. Furthermore, asymptotically $1 - \tau_c \sim 2/c$.

In other words, the image size of $H^{(c)}$ decreases arithmetically with $c$ (not geometrically as one might guess at first). This illustrates that $H^{(c)}$ is significantly different from a random function as $c$ gets larger, and its analysis is a nontrivial matter. It is also interesting to note that, although the image size of $H^{(c)}$ goes down by a factor of $2/c$ compared with that of $H$, Theorem 1 shows that the attacker's workload must increase by a factor of $c$.
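The image-size recurrence above is easy to evaluate numerically. The sketch below writes the recurrence variable as `tau` (the symbol name is an assumption, since only the recurrence itself is given) and compares the image fraction $1 - \tau_c$ with the asymptotic estimate $2/c$.

```python
import math


def tau(c: int) -> float:
    """Iterate tau_0 = 0, tau_{k+1} = exp(-1 + tau_k) for c steps."""
    value = 0.0
    for _ in range(c):
        value = math.exp(-1.0 + value)
    return value


if __name__ == "__main__":
    for c in (1, 10, 100, 1000):
        image_fraction = 1.0 - tau(c)      # fraction of {0,1}^n in the image of H^(c)
        print(c, image_fraction, 2.0 / c)
```

For $c = 1$ the fraction is $1 - e^{-1}$, the familiar image size of a single random function, and for large $c$ the ratio between the two printed columns approaches 1.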

B. Reduction of Query Graphs

We focus on the problem of computing $H^{(c)}(u)$ for $u = p \,\|\, s_0$ via $H$ queries, where $p$ is a password in $PW$ and $s_0$ is fixed. We will analyze the probability of $H^{(c)}(u) = v$ for any vertex $v$, conditioned on a query graph $Q$ with $t$ edges. To make the probability model precise, let $E_{u,v}$ denote the event $H^{(c)}(u) = v$, and let $\mathrm{prob}_Q[E_{u,v}]$ be the probability for the event to be true among all random functions $H$ whose corresponding graph $G_H$ is consistent with $Q$ (i.e., $G_H$ contains $Q$ as a subgraph). We call $\mathrm{prob}_Q[E_{u,v}]$ the conditional probability of $E_{u,v}$ relative to $Q$. The overall success probability $\sigma$ for $E_{u,v}$ is the average value of $\mathrm{prob}_{Q_H}[E_{u,v}]$ over all possible $Q_H$:

$$\sigma = \sum_{H} \pi_H \cdot \mathrm{prob}_{Q_H}[E_{u,v}] \qquad (1)$$

where $\pi_H = 1/2^{n 2^n}$ for each random function $H$.

We will simplify the analysis of $\sigma$ by making certain reductions on $Q_H$. We remark that these reductions are nontrivial and are based on special properties of the function $H^{(c)}$ for a random $H$ (such as its cyclic structure and vertex symmetry).

Definition 2: A query graph $Q_H$ is said to be linear if $Q$ contains no cycles, and $Q$ contains no vertex with in-degree $> 1$.
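Definition 2 can be checked mechanically on a small query graph. In the sketch below, a query graph is represented as a dictionary mapping each queried $x$ to $H(x)$; this representation and the name `is_linear` are assumptions made for illustration.

```python
from typing import Dict


def is_linear(edges: Dict[int, int]) -> bool:
    """Return True if the query graph has no cycle and no vertex of in-degree > 1.

    `edges` maps each queried vertex x to H(x), so out-degree is at most 1.
    """
    targets = list(edges.values())
    if len(targets) != len(set(targets)):    # some vertex has in-degree > 1
        return False
    for start in edges:                      # walk forward from every queried vertex
        seen = {start}
        v = start
        while v in edges:
            v = edges[v]
            if v in seen:                    # revisited a vertex: a cycle exists
                return False
            seen.add(v)
    return True


if __name__ == "__main__":
    print(is_linear({0: 1, 1: 2}))   # a single path
    print(is_linear({0: 1, 2: 1}))   # two edges into vertex 1
    print(is_linear({0: 1, 1: 0}))   # a 2-cycle
```

A linear query graph is therefore just a collection of vertex-disjoint directed paths.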

Lemma 1: With probability at least $1 - O(t^2/2^n)$, a query graph $Q$ with $t$ edges is linear.

Proof: The probability of a collision occurring within $t$ steps is at most $O(t^2/2^n)$. The query graph $Q$ is linear if no such collisions occur. QED
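One way to see where the $O(t^2/2^n)$ collision bound in Lemma 1 comes from is a union bound over the $t$ queries. The derivation below is a sketch under two assumptions that are not spelled out in the text: the adversary never repeats a query, and $V_i$ denotes the set of vertices already present when the $i$th answer arrives (at most $2(i-1)$ endpoints of earlier edges plus $x_i$ itself). The graph can only stop being linear if some answer $H(x_i)$ lands in $V_i$.

```latex
\Pr[\,Q \text{ is not linear}\,]
  \;\le\; \sum_{i=1}^{t} \Pr\bigl[\,H(x_i) \in V_i\,\bigr]
  \;\le\; \sum_{i=1}^{t} \frac{2i-1}{2^n}
  \;=\; \frac{t^2}{2^n}
```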

In a further reduction step, we shall ignore all edges in a linear $Q$ that are not on the same path as either vertex $u$ or vertex $v$. Let $P_u$ and $P_v$ denote the paths in $Q$ that contain vertex $u$ and $v$, respectively, and let $Q'$ be the query graph that contains only paths $P_u$ and $P_v$.

Lemma 2: If $Q$ is linear with $t$ edges, then

$$\mathrm{prob}_Q[E_{u,v}] \le \mathrm{prob}_{Q'}[E_{u,v}] + O(t^2/2^n).$$

Proof: The proof is based on the reasoning that, starting from any isolated random vertex in $Q'$, a short path disjoint from $P_u$ and $P_v$ can be found with high probability in a random $G_H$. Therefore, the actual existence of such paths in $Q_H$ does not convey much information about $H$.

Suppose $Q \setminus Q'$ consists of $k$ disjoint paths $P_1, P_2, \ldots, P_k$, with source vertices $\{q_1, q_2, \ldots, q_k\}$. We can reconstruct $Q$ from $Q'$ probabilistically as follows. Let $R$ be a random function in the set of all functions from $\{0,1\}^n$ to $\{0,1\}^n$. Generate a path from $q_1$ by successively adding $|P_1|$ random edges $(q_1, R(q_1)), (R(q_1), R(R(q_1)))$, etc. Likewise generate a random path of length $|P_j|$ from vertex $q_j$ for each $j$. Let $Q_R$ be the resulting graph. Clearly, each $Q_R$ occurs with uniform probability $(1/2^n)^L$, where $L = \sum_{i=1}^{k} |P_i| < t$. The probability that $Q_R$ is linear, that is, that all $k$ paths generated are cycle free and vertex disjoint from each other as well as from $P_u$ and $P_v$, is at least $1 - \epsilon$, where $\epsilon = O(t^2/2^n)$. In other words, the resulting graph $Q_R$ is isomorphic to $Q = Q_H$ with probability at least $1 - \epsilon$. Thus, we can write

$$\mathrm{prob}_{Q'}[E_{u,v}] = \sum_{Q_R} \frac{1}{2^{nL}} \, \mathrm{prob}_{Q_R}[E_{u,v}] \;\ge\; \sum_{Q_R \text{ linear}} \frac{1}{2^{nL}} \, \mathrm{prob}_{Q_R}[E_{u,v}] \;\ge\; (1 - \epsilon) \, \mathrm{prob}_Q[E_{u,v}]$$

which implies $\mathrm{prob}_Q[E_{u,v}] \le \mathrm{prob}_{Q'}[E_{u,v}] + \epsilon$ as claimed. QED

Now it is sufficient to analyze $\mathrm{prob}_{Q'}[E_{u,v}]$ for a linear query graph $Q'$ which consists of at most two paths $P_u$ and $P_v$. We first define

$$T_u = (u, H(u), \ldots, H^{(c_0)}(u)).$$

Recall the experiment in the definition of a weakly secure KDF. Obviously, if the full path $T_u$ for $u = p_0 \,\|\, s_0$ is in the query graph $Q'$, then given $(y_0, s_0, c_0)$ the adversary can distinguish $y_0$ from a random string. In the following, we want to show that if the full path $T_u$ is not in $Q'$, then the success probability of the adversary is negligible.

Lemma 3: Let $u = p_0 \,\|\, s_0$. If any edge on the path $T_u$ is missing in the query graph $Q'$, then for any vertex $v$

$$\mathrm{prob}_{Q'}[E_{u,v}] \le O((c_0 + t)^2/2^n).$$

Proof: Consider the simplified query graph $Q'$ with only the paths $P_u$ and $P_v$. There are two cases depending on whether or not $P_u = P_v$.

Case 1: $P_u$ and $P_v$ are disjoint.

In this case, path $T_u$ must intersect with path $P_v$ at some vertex. Let the first such intersection occur at the $i$th vertex on $T_u$ and the $j$th vertex on $P_v$. There are at most $ct$ different combinations, each happening with probability $1/2^n$. The probability for $T_u$ to intersect $P_v$ is thus at most $O(ct/2^n)$.

Case 2: $P_u = P_v$.

In this case, there is a path from vertex $u$ to vertex $v$ with distance $< c$. In order for $[H^{(c)}(u) = v]$ to be true, it must be the case that in the graph $G_H$, the path $T_u$ includes a cycle and $v$ is on the cycle. This implies that a collision occurs, and the probability of such a collision is at most $O(c^2/2^n)$. QED

Proof of Theorem 1: By Lemma 1, the query graph $Q$ has a probability greater than $1 - O(t^2/2^n)$ of being linear. We can also eliminate the possibility that the vertex $p_0 \,\|\, s_0$ appears as a nonstarting point on a path (an $O(t^2/2^n)$-probability event). In such a linear $Q$ with $t$ edges, the probability that it contains the full path $T_{p_0 \| s_0}$ is at most $\lfloor t/c_0 \rfloor / |PW|$. If $Q$ does not contain the full path $T_{p_0 \| s_0}$, then the success probability of the attacker is at most $O((c_0 + t)^2/2^n)$ by Lemmas 2 and 3. Therefore, the overall success probability is at most

$$\frac{\lfloor t/c_0 \rfloor}{|PW|} + O\!\left(\frac{(t + c_0)^2}{2^n}\right)$$

as stated.

The lower bound can be achieved easily by computing full paths of length $c_0$ for $\lfloor t/c_0 \rfloor$ passwords in $PW$. QED

C. Discussions

To better understand the result, we consider a concrete scenario in which the computational resource of the adversary is more than $|PW|$ but significantly less than $c|PW|$. For example, let $|PW| = 2^{40}$, $n = 128$, $c = 2^{16}$, $t = 2^{44}$. Setting $c = 2^{16}$ adds little overhead at the user end for deriving a single key$^3$ while a straightforward dictionary attack now requires $c|PW| = 2^{56}$ queries to $H$. Using $t$ queries, the attacker can certainly correctly compute a fraction of $(t/c)/|PW| = 2^{-12}$ of the derived keys. Our result shows that this is the best the attacker can do, since the probability for correctly computing more than $2^{-12}$ of the derived keys is at most $O((c+t)^2/2^n) = O(2^{-40})$. In other words, there is no shortcut in launching a dictionary attack.

Theorem 1 proves tight bounds for the password space. When $t < c$, the lower bound on $\mathrm{Adv}(t)$ in the theorem becomes zero. In Lemma 4, we derive a nontrivial lower bound by describing a strategy of the attacker.
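The arithmetic of this concrete scenario can be checked with a short sketch using exact integer and rational arithmetic. The variable names are illustrative, and the constant hidden by the $O(\cdot)$ term is ignored.

```python
import math
from fractions import Fraction

pw_size = 2**40  # |PW|
n = 128          # output length of H in bits
c = 2**16        # iteration count
t = 2**44        # attacker's query budget

naive_cost = c * pw_size                 # full dictionary attack cost
fraction = Fraction(t // c, pw_size)     # floor(t/c) / |PW|
error_term = Fraction((c + t)**2, 2**n)  # (c + t)^2 / 2^n, constant omitted

print(math.log2(naive_cost))         # exponent of the dictionary attack cost
print(math.log2(float(fraction)))    # exponent of the achievable fraction
print(math.log2(float(error_term)))  # exponent of the additive error term
```

The error term is many orders of magnitude smaller than the achievable fraction, which is why the leading term dominates in this parameter range.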

Lemma 4: With $t < c$ queries, the attacker can achieve

$$\mathrm{Adv}_A(t) > \frac{t^2}{2^{n+1}} + \left(1 - \frac{t^2}{2^{n+1}}\right) \frac{d[c - t]}{2^n}$$

where $d[x]$ denotes the number of divisors of $x$.

Proof: The attacker simply computes $u, H(u), H^2(u), \ldots$ iteratively and hopes that the sequence becomes periodic, i.e., a repeated value occurs before $H^t(u)$ so that $H^{(c)}(u)$ is determined. The probability for this to happen is $t(t+1)/2^{n+1}$. In the case that the chain $(u, H(u), \ldots, H^t(u))$ is not periodic, the adversary will select a point on the path $v = H^{\tau}(u)$ where $\tau$ is chosen to maximize the number of divisors $d[c - \tau]$ for $0 \le \tau \le t$. The probability of success in this case is at least $d[c - \tau]/2^n$ as stated. QED
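The first part of this strategy, exploiting an early repetition in the chain, can be sketched on a toy function. The sketch assumes a 12-bit truncation of SHA-256 as a stand-in for $H$ so that a repetition is guaranteed within a few thousand steps; for a realistic $n$ the repetition occurs only with the small probability given in the proof. The names `toy_hash`, `shortcut`, and `direct` are hypothetical.

```python
import hashlib

N_BITS = 12  # toy output size (assumption) so that cycles appear quickly


def toy_hash(x: int) -> int:
    """Truncated SHA-256 acting on integers below 2**16."""
    digest = hashlib.sha256(x.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big") >> (16 - N_BITS)


def shortcut(u: int, c: int, t: int):
    """Walk at most t steps from u; if a value repeats, return H^(c)(u), else None."""
    position = {u: 0}  # value -> index of first occurrence in the chain
    chain = [u]
    x = u
    for step in range(1, t + 1):
        x = toy_hash(x)
        if x in position:
            tail = position[x]    # steps before the cycle is entered
            period = step - tail  # cycle length
            if c < tail:
                return chain[c]
            return chain[tail + (c - tail) % period]
        position[x] = step
        chain.append(x)
    return None


def direct(u: int, c: int) -> int:
    """Reference computation of H^(c)(u) using c evaluations."""
    x = u
    for _ in range(c):
        x = toy_hash(x)
    return x


c, t = 100_000, 5_000  # t < c, and t exceeds 2**N_BITS so a repeat must occur
print(shortcut(1, c, t) == direct(1, c))
```

Once the tail length and period are known, the value after $c$ steps is read off the stored chain without further evaluations of the function.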

We note that, somewhat surprisingly, the lower bound gets better with larger $c$, as the attacker may find a number $c - \tau$ in the range $[c - t, c]$ with a large number of divisors so that it is more likely to have $H^{(c)}(u) = H^{(\tau)}(u)$ through periodicity.
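The choice of the offset in Lemma 4 can be made concrete with a small sketch. It uses trial division, which is adequate only for small illustrative values; the names `num_divisors` and `best_offset` are hypothetical, and the symbol $\tau$ follows the reconstruction used above.

```python
def num_divisors(x: int) -> int:
    """Count the positive divisors of x by trial division (x >= 1)."""
    count = 0
    d = 1
    while d * d <= x:
        if x % d == 0:
            count += 1 if d * d == x else 2
        d += 1
    return count


def best_offset(c: int, t: int):
    """Return (tau, d[c - tau]) maximizing d[c - tau] over 0 <= tau <= t, with t < c."""
    tau = max(range(t + 1), key=lambda k: num_divisors(c - k))
    return tau, num_divisors(c - tau)


print(best_offset(100, 5))
```

For $c = 100$ and $t = 5$ the candidates are $95, \ldots, 100$, and $96 = 2^5 \cdot 3$ has the most divisors among them, so the sketch selects $\tau = 4$.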

$^3$ On a Pentium 4 running at 2.1 GHz, $2^{16}$ SHA-1 operations take less than 0.02 s according to the benchmarks for Wei Dai’s CRYPTO++ Library.


V. SECURITY ANALYSIS OF KDFS IN PKCS#5

In the preceding section, we analyzed the basic iterative construction $H^{(c)}$. The analysis implies that the two KDFs in PKCS#5 are weakly secure as long as the adversary’s computational resource is far less than $c|PW|$, even though it can be much larger than $|PW|$.

In what follows, we analyze the security of the two KDFs under the strongly secure model, that is, when the adversary is allowed a few Type 2 queries. We show that neither KDF is secure under this model, and we also explore how such security weaknesses can be exploited to launch attacks in practical settings.

A. PBKDF1

The attack on PBKDF1 is straightforward. For any salt $s$ and two iteration counts $c_0 < c_1$, let $y_i = F(p, s, c_i) = H^{(c_i)}(p \| s)$. Then, it is easy to see that $y_1 = H^{(c_1 - c_0)}(y_0)$. This relation allows an attacker to easily distinguish $y_0$ from a random function with one Type 2 query $(s, c_1)$ and $(c_1 - c_0)$ Type 1 queries.

Note that if the key $y_0 = H^{(c_0)}(p \| s)$ were ever compromised for some reason, then any key derived using the same salt $s$ and an iteration count larger than $c_0$ would also be compromised. This might happen in practice if the user (or the security administrator of the system) decides to increment the iteration count. It is contrary to the naive assumption that a larger iteration count would imply higher security.
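The relation between keys derived with different iteration counts can be checked directly. The following is a minimal sketch, assuming SHA-1 as the hash and using the full hash output as the derived key (no truncation); the names `iterate_hash` and `pbkdf1_full` are hypothetical.

```python
import hashlib


def iterate_hash(data: bytes, count: int) -> bytes:
    """Apply SHA-1 `count` times."""
    out = data
    for _ in range(count):
        out = hashlib.sha1(out).digest()
    return out


def pbkdf1_full(password: bytes, salt: bytes, count: int) -> bytes:
    """PBKDF1-style derivation H^(c)(p || s), keeping the full hash output."""
    return iterate_hash(password + salt, count)


p, s = b"correct horse", b"\x01\x02\x03\x04\x05\x06\x07\x08"
c0, c1 = 1000, 1500
y0 = pbkdf1_full(p, s, c0)
y1 = pbkdf1_full(p, s, c1)

# Anyone holding y0 reaches y1 with c1 - c0 hashes, without knowing p.
print(iterate_hash(y0, c1 - c0) == y1)
```

The check needs neither the password nor the salt, which is what makes a compromised $y_0$ sufficient to recover keys for larger iteration counts.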

B. PBKDF2

The derived keys in PBKDF2 also suffer from nonrandomness, although the relations among keys are slightly more complicated. Let $s$ be any salt value and let $c_1, c_2, c_3$ be three consecutive iteration counts. For $i = 1, 2, 3$, define $y_i = F(p, s, c_i) = U_1 \oplus \cdots \oplus U_{c_i}$. Then, we have $y_1 \oplus y_2 = U_{c_2}$ and $y_2 \oplus y_3 = U_{c_3}$. This yields the following relation among the three keys: $(y_2 \oplus y_3) = H_p(y_1 \oplus y_2)$, where $H_p(\cdot)$ is the underlying function HMAC.

This relationship among keys opens the door to dictionary attacks. The attacker simply computes the HMAC function $H_p(U_{c_2})$ for all possible passwords $p$, and the password that gives $H_p(U_{c_2}) = U_{c_3}$ is very likely to be the correct password used in the scheme. Once $p$ is known, it is easy to distinguish the derived key from a random string. We remark that the workload of the above attack is $|PW|$, no matter what $c$ is. This implies the iteration count does not add much (or any) protection in PBKDF2 against dictionary attacks.
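The attack can be reproduced with Python's standard `hashlib.pbkdf2_hmac` and `hmac` modules. This is a sketch under stated assumptions: HMAC-SHA-1 as the underlying function, a single 20-byte output block, and an attacker who has obtained the keys for three consecutive iteration counts. The helper names and the tiny candidate list are hypothetical.

```python
import hashlib
import hmac


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def pbkdf2_block(password: bytes, salt: bytes, count: int) -> bytes:
    """First PBKDF2 output block, i.e. U_1 xor ... xor U_count."""
    return hashlib.pbkdf2_hmac("sha1", password, salt, count, dklen=20)


def recover_password(y1, y2, y3, candidates):
    """Test each candidate p with one HMAC call: H_p(y1 xor y2) == y2 xor y3."""
    u_c2 = xor_bytes(y1, y2)
    u_c3 = xor_bytes(y2, y3)
    for cand in candidates:
        guess = hmac.new(cand, u_c2, hashlib.sha1).digest()
        if hmac.compare_digest(guess, u_c3):
            return cand
    return None


salt, secret, c = b"example-salt", b"tiger", 1000
# Keys for the consecutive iteration counts c, c + 1, c + 2.
y1, y2, y3 = (pbkdf2_block(secret, salt, c + i) for i in range(3))
found = recover_password(y1, y2, y3, [b"letmein", b"tiger", b"dragon"])
print(found)
```

Each candidate costs one HMAC evaluation regardless of $c$, matching the $|PW|$ workload noted above.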

VI. EFFECTS OF SALT

A salt serves the purpose of creating a large set of possible long keys corresponding to a password $p$. If the salt is $s$ bits long, then the number of possible long keys can be as large as $2^s$. Each time the KDF is executed with a salt, either selected by the user or generated at random, one of the $2^s$ long keys is selected.

One natural question is the following: Suppose that an adversary has computed the long keys corresponding to all the passwords $p \in PW$ for a salt $s_1$. That is, the adversary has a table of size $|PW|$ in which each entry contains the value $(p, H^{(c)}(p \| s_1))$ for some $p \in PW$. Does this table provide the adversary some shortcuts to derive long keys using a different salt $s_2 \ne s_1$?

The answer is certainly “no,” which is well known in practice. Using the graph-based approach, we can show that the set of paths corresponding to $s_1$ and the set of paths corresponding to $s_2$ are all disjoint with high probability, and hence the table for $s_1$ provides essentially no information for derived keys using $s_2$. The detailed analysis is similar to that for Theorem 1 and thus omitted here.
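A toy illustration of this point follows, assuming SHA-256 as a stand-in for $H$ and a three-word password space; the names `kdf` and `build_table` are hypothetical. A table precomputed for one salt resolves keys derived with that salt and fails for a key derived with a different salt.

```python
import hashlib


def kdf(password: bytes, salt: bytes, c: int) -> bytes:
    """Basic iterative construction H^(c)(p || s) with SHA-256."""
    out = password + salt
    for _ in range(c):
        out = hashlib.sha256(out).digest()
    return out


def build_table(passwords, salt, c):
    """Precompute derived key -> password for a single fixed salt."""
    return {kdf(p, salt, c): p for p in passwords}


passwords = [b"alpha", b"bravo", b"charlie"]
c = 50
table_s1 = build_table(passwords, b"salt-one", c)

key_s1 = kdf(b"bravo", b"salt-one", c)
key_s2 = kdf(b"bravo", b"salt-two", c)
print(table_s1.get(key_s1))  # lookup with the salt the table was built for
print(table_s1.get(key_s2))  # lookup for the same password under another salt
```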

VII. A NEW PROPOSAL FOR STRONGLY SECURE KDF

In this section, we present a new proposal for strongly secure KDF based on our study on the effects of iteration counts and salt, as well as the analysis on existing KDFs.

We first note that the graph-based analysis provides insights on the exact way that each parameter contributes to the overall security: The computation process for deriving a key corresponds to a path in the query graph. So, choosing a larger iteration count forces the attacker to traverse a longer path for deriving each key, while choosing a different salt value forces the attacker to traverse a different path in the computation. It is also easy to see why the KDFs in PKCS#5 are not strongly secure using the query graph. For example, in the case of PBKDF1, the Type 2 queries $(s, c_0)$ and $(s, c_1)$ correspond to two paths that overlap. This extra information allows the attacker to distinguish the derived key from a random string.

Based on the preceding discussion, we can see that a strongly secure KDF should be constructed in a way that the values of $y = F(p, s, c)$, for different $p$, $s$, and $c$, are nearly independent of each other. Certainly, there are various ways of achieving this goal. Here we propose a simple construction that maintains the same efficiency as the KDFs in PKCS#5. The idea is to include the iteration count explicitly as an input to the hash function $H$. More specifically, the new KDF is

$$y = F^{*}(p, s, c) = H^{(c)}(p \| s \| c). \qquad (2)$$

In what follows, we prove that the above KDF is strongly secure, that is, secure even when the adversary can choose $(s, c)$ and make Type 2 queries. We assume that there are lower and upper limits to the $c_i$ acceptable in Type 2 queries, that is, $c_{*} < c_i < c^{*}$. Indeed, without a lower limit, the adversary can always set $c = 1$ in the Type 2 query and then perform an offline key search attack with complexity $O(|PW|)$.
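The construction in (2) can be sketched as follows. SHA-256 as $H$ and a 4-byte big-endian encoding of $c$ are assumptions of this sketch; the paper does not fix an encoding for $p \| s \| c$, and a deployed scheme would also need an unambiguous encoding of $p$ and $s$. The function names are hypothetical. The sketch also contrasts the new KDF with the PBKDF1-style construction, whose keys for different iteration counts lie on the same path.

```python
import hashlib


def iterate(data: bytes, count: int) -> bytes:
    """Apply SHA-256 `count` times."""
    out = data
    for _ in range(count):
        out = hashlib.sha256(out).digest()
    return out


def pbkdf1_style(p: bytes, s: bytes, c: int) -> bytes:
    """Basic construction H^(c)(p || s)."""
    return iterate(p + s, c)


def new_kdf(p: bytes, s: bytes, c: int) -> bytes:
    """F*(p, s, c) = H^(c)(p || s || c); the 4-byte encoding of c is an assumption."""
    return iterate(p + s + c.to_bytes(4, "big"), c)


p, s = b"hunter2", b"salt"
c0, c1 = 1000, 1200

# Basic construction: the key for c1 is reachable from the key for c0.
old_related = iterate(pbkdf1_style(p, s, c0), c1 - c0) == pbkdf1_style(p, s, c1)
# New construction: different c values start different paths.
new_related = iterate(new_kdf(p, s, c0), c1 - c0) == new_kdf(p, s, c1)
print(old_related, new_related)
```

Because $c$ is part of the starting vertex, changing the iteration count changes the whole path instead of extending an existing one.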

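Theorem 2: In the strongly secure model for KDF, if the adversary makes at most $t$ Type 1 queries and at most $m$ Type 2 queries, then the maximum success probability $\mathrm{Adv}(t, m)$ satisfies

$$\frac{\max\big(\lfloor (t - c^{*})/c_{*} \rfloor,\ m\big)}{|PW|} < \mathrm{Adv}(t, m) < \frac{\lfloor t/c_{*} \rfloor + 2m}{|PW|} + O\!\left(\frac{(t + c^{*} m)^2}{2^n}\right).$$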

Proof: We first prove the upper bound to $\mathrm{Adv}(t, m)$. Note that the attacker’s algorithm is adaptive, so it can choose each query, either Type 1 or Type 2, based on the answers to earlier queries.

Consider the query graph $Q_1$ corresponding to the $t$ Type 1 queries. With probability at least $1 - O(t^2/2^n)$, $Q_1$ is linear. If the attacker uses only $Q_1$, its success probability is bounded by

$$\frac{\lfloor t/c_{*} \rfloor}{|PW|} + O\!\left(\frac{(c^{*} + t)^2}{2^n}\right)$$

as shown in Theorem 1, except that $c_0$ is replaced with the lower and upper limits for $c$ in the two places.

We next consider the query graph $Q_2$ corresponding to the $m$ Type 2 queries $(s_i, c_i)$. Each query is an (invisible) path of length $c_i$ with a known endpoint $y_i = F^{*}_p(s_i, c_i)$. Since the paths all start at different vertices, with probability at least $1 - O((c^{*} m)^2/2^n)$, $Q_2$ is linear.

We now analyze how much advantage the attacker can gain by using Type 2 queries to verify passwords corresponding to vertices in $Q_1$ which are not the starting point of a path. To maximize the success probability, Type 2 queries $\{(s_i, c_i) \mid 1 \le i \le m\}$ should be chosen such that the total number of vertices of the form $\{p \| s_i \| c_i,\ 1 \le i \le m\}$ is maximized in $Q_1$. If we denote the number by $m'$, then the success probability using Type 2 queries is at most $m'/|PW|$. Note that $m' = m + q$, where $q$ is the expected number of collisions in $s \| c$ among all the vertices $p \| s \| c$ in $Q_1$. Since the expected value of $q$ is $t|PW|/2^n \ll 1$, it can be shown that the probability that $m' = m + q \ge m + m = 2m$ is negligible.


Combining all the probabilities, we prove that $\mathrm{Adv}(t, m)$ is bounded by

$$\frac{\lfloor t/c_{*} \rfloor + 2m}{|PW|} + O\!\left(\frac{(t + c^{*} m)^2}{2^n}\right)$$

as stated.

For the lower bound, we describe two strategies from which the adversary can pick the one yielding better success probability depending on the parameters. In strategy A, the adversary computes separate paths of length $c_{*}$ for $\lfloor (t - c^{*})/c_{*} \rfloor$ passwords $p_i \| s \| c_{*}$ using $t - c^{*}$ Type 1 queries. He then makes a Type 2 query asking for $y = F^{*}_p(s, c_{*})$. With probability $\lfloor (t - c^{*})/c_{*} \rfloor / |PW|$, vertex $y$ coincides with the endpoint of one of the paths, thus revealing the password $p_0$. In such an event, the adversary then makes $c_0$ more Type 1 queries to compute $y'_0 = H^{(c_0)}(p_0 \| s_0 \| c_0)$ and answers 1 if $y'_0 = y_0$. All together the adversary used at most $t$ Type 1 and one Type 2 queries to achieve success probability of $\lfloor (t - c^{*})/c_{*} \rfloor / |PW|$.

In strategy B, the adversary constructs $Q_1$ to be a single path of length $t$ starting from an arbitrary $p \| s \| c$. With probability $1 - O(t^2/2^n)$, the path will be cycle free. Its first $t - c^{*}$ vertices $p_i \| s_i \| c_i$ have their full paths $T_{p_i \| s_i \| c_i}$ completely contained in $Q_1$. Assuming $m$ to be much smaller than $t - c^{*}$, the adversary can pick $m$ vertices $p_i \| s_i \| c_i$ along the path with distinct $p_i$ and make at most $m$ Type 2 queries with the corresponding $(s_i, c_i)$’s. With probability $m/|PW|$, it can identify the password. This completes the proof of Theorem 2. QED
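To get a feel for the two sides of Theorem 2, the leading terms can be evaluated for sample parameters. This is a sketch only: the function name `theorem2_bounds` and the parameter values are illustrative, `c_low` and `c_high` stand for the lower and upper limits on the iteration count, and the constant hidden in the $O(\cdot)$ term is not known from the statement, so that term is reported only up to a constant factor.

```python
from fractions import Fraction


def theorem2_bounds(t, m, c_low, c_high, pw_size, n):
    """Return (lower bound, leading upper term, error scale) as exact fractions."""
    lower = Fraction(max((t - c_high) // c_low, m), pw_size)
    upper_main = Fraction(t // c_low + 2 * m, pw_size)
    error_scale = Fraction((t + c_high * m) ** 2, 2 ** n)  # O(.) constant omitted
    return lower, upper_main, error_scale


lower, upper_main, error_scale = theorem2_bounds(
    t=2**44, m=2**10, c_low=2**16, c_high=2**20, pw_size=2**40, n=128
)
print(float(lower), float(upper_main), float(error_scale))
```

With these sample values the lower bound and the leading upper term differ only slightly, and the error scale is negligible in comparison.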

VIII. CONCLUSION

Password-based KDFs are necessary in many security applications. Despite their importance and widespread usage, rigorous analysis of such functions seems to have received relatively little attention in the literature compared to many other cryptographic schemes.

In this paper, we define a general security framework for password-based KDFs where salt and iteration count are included as parameters. Under this framework, we focus on the most commonly used construction $H^{(c)}(p \| s)$ and prove that the iteration count $c$, when fixed, does have an effect of stretching the password by $\log_2 c$ bits. Our analysis is done using a random functional graph representing $H$, conditioned upon a query graph representing information revealed to the attacker. It provides insights on the exact way that each parameter contributes to the overall security.

We then analyze two widely deployed KDFs defined in PKCS#5. We show that both are secure if the adversary cannot influence the parameters, but are subject to attacks otherwise. We also consider how such security weaknesses can be exploited in practice.

Finally, based on the insight gained from our earlier analysis, we propose a new password-based key derivation function that is provably secure even when the attacker has full control of the salt and iteration count. The new proposal achieves stronger security while preserving the same efficiency as existing KDFs. We expect that the new proposal will find its application in practical implementations.

ACKNOWLEDGMENT

The authors wish to thank the Associate Editor and the anonymous reviewers for their careful reading and useful comments.

REFERENCES

[1] Digital Signature Standard, 1994. FIPS PUB 186-2, National Institute of Standards and Technologies.

[2] Standard Specifications for Public-Key Cryptography, 2000. IEEE Std 1363-2000, IEEE Computer Society.

[3] PKCS#5 v2.0: Password-Based Cryptography Standard, 1999. RSA Laboratories.

[4] T. Dierks and C. Allen, “The TLS protocol version 1.0. IETF RFC 2246,” Internet Request for Comments, Jan. 1999.

[5] J. Kelsey, B. Schneier, C. Hall, and D. Wagner, “Secure applications of low-entropy keys,” in Proc. First Int. Workshop ISW’97. New York: Springer-Verlag, 1998.

[6] A. Hevia, A. Desai, and Y. L. Yin, “A practical-oriented treatment of pseudorandom number generators,” in Advances in Cryptology—EUROCRYPT’02. Berlin, Germany: Springer-Verlag, 2002.

[7] IEEE P1363.2: Standard Specifications for Password-Based Public-Key Cryptographic Techniques, Draft D15 (2004, May). [Online]. Available: http://grouper.ieee.org/groups/1363/passwdPK/draft.html

[8] M. Bellare, R. Canetti, and H. Krawczyk, “Keyed hash functions for message authentication,” in Advances in Cryptology—CRYPTO’96. Berlin, Germany: Springer-Verlag, 1996.

[9] P. Flajolet and A. M. Odlyzko, “Random mapping statistics,” in Advances in Cryptology—EUROCRYPT’89. Berlin, Germany: Springer-Verlag, 1990.

Toward a General Correlation Theorem

Kishan Chand Gupta and Palash Sarkar

Abstract—In 2001, Nyberg proved three important correlation theorems and applied them to several cryptanalytic contexts. We continue the work of Nyberg in a more theoretical direction. We consider a general functional form and obtain its Walsh transform. Two of Nyberg’s correlation theorems are seen to be special cases of our general functional form. S-box lookup, addition modulo 2 , and X-OR are three frequently occurring operations in the design of symmetric ciphers. We consider two methods of combining these operations and in each apply our main result to obtain the Walsh transform.

Index Terms—Boolean function, correlation, linear approximation, S-box, symmetric cipher, Walsh transform.

I. INTRODUCTION

Symmetric ciphers are a basic cryptographic primitive. In practice, symmetric ciphers are designed using nonlinear Boolean functions and S-boxes. One of the most effective methods of attacking symmetric ciphers is the technique of linear cryptanalysis [5]. The efficacy of this technique depends upon the ability to obtain good linear approximations of the constituent Boolean functions and S-boxes.

Linear approximations are studied using the technique of Walsh transform analysis. While it is usually easy to apply the Walsh transform to an individual constituent of a symmetric cipher, in general it is more difficult to apply the technique when a combination of primitives is used. This requires the development of a general methodology of Walsh transform applications.

Manuscript received December 5, 2004; revised June 5, 2005.

K. C. Gupta was with the Applied Statistics Unit, Indian Statistical Institute, Kolkata 700 108, India. He is now with the Centre for Applied Cryptographic Research, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]).

P. Sarkar is with the Applied Statistics Unit, Indian Statistical Institute, Calcutta 700 108, India (e-mail: [email protected]).

Communicated by E. Okamoto, Associate Editor for Complexity and Cryptography.

Digital Object Identifier 10.1109/TIT.2005.853326

0018-9448/$20.00 © 2005 IEEE