Theoretical Modeling and Analysis of Content Fingerprinting

Avinash L. Varna, Student Member, IEEE, and Min Wu, Senior Member, IEEE

Abstract

Multimedia identification via content fingerprints is used in many applications, such as content filtering on user-generated content websites and automatic multimedia identification and tagging. A compact "fingerprint" is computed for each multimedia signal that captures robust and unique properties of the perceptual content, which is used for identifying the multimedia. Several different multimedia fingerprinting schemes have been proposed in the literature and have been evaluated through experiments. To complement these experimental evaluations and provide guidelines for choosing system parameters and designing better schemes, this paper develops models for content fingerprinting and provides an analysis of the identification performance under these models. Firstly, fingerprinting schemes that generate independent and equally likely fingerprint bits are examined and bounds are obtained on the identification accuracy. Guidelines for choosing the fingerprint length to attain a desired accuracy are derived, and it is shown that identification with a false alarm requirement is similar to joint source-channel coding. A Markov Random Field based model for fingerprints with correlated components is proposed and a statistical physics inspired approach for computing the probability of detection is described. The analysis shows that the commonly used Hamming distance detection criterion is susceptible to correlations among fingerprint bits, whereas the optimal log-likelihood ratio decision rule yields a 5-20% improvement in the accuracy over a wide range of correlations. Simulation results demonstrate the validity of the theoretical predictions.

Index Terms

Content fingerprinting, content identification, error exponents, Markov Random Fields, Wang-Landau density of states estimation.

The authors are with the Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA (Email: [email protected], [email protected]).


I. INTRODUCTION

In recent years, user generated content (UGC) websites such as YouTube have grown in popularity and revolutionized multimedia consumption and distribution. Increasingly, the Internet is being seen as a medium for delivering multimedia content to consumers. These new distribution channels have also raised concerns about the posting of copyrighted content on UGC websites [1]. Several UGC websites are deploying content filtering schemes to identify and filter such copyrighted videos. These filtering schemes rely on an emerging technology called content fingerprinting to identify multimedia content uploaded to the UGC sites.

A fingerprint is a compact signature that represents robust and unique characteristics of the multimedia

and can be used to identify the document. Content fingerprints are used for identifying multimedia in a

variety of applications in multimedia management. Fingerprints are employed by services such as Shazam,

Midomi, VCAST, etc. to perform automatic music identification. Given a noisy recording of an audio

captured using a mobile phone, these services identify the original audio track and provide metadata

information, such as the album, where to buy the track, etc. Fingerprints have also been used to perform

automatic tagging of audio collections and create automatic playlists based on user preferences [2].

Watermarking, which is a proactive technique wherein a special watermark signal is embedded into the

host at the time of content creation, can also be used for content identification. This embedded signal can

later be extracted and used to identify the content and retrieve associated metadata [3]. Watermarking

techniques are suitable if the embedder has control over the content creation stage. This requirement

may be difficult to satisfy in many practical applications, including content filtering on UGC sites. In

particular, a large volume of existing multimedia content does not have embedded watermarks and cannot

be identified using this approach. Content fingerprints, on the other hand, do not require access to the

content at the time of creation and can be used to identify existing multimedia content that does not have

embedded information.

Content fingerprints are designed to be robust to minor content preserving operations while being able

to discriminate between different multimedia objects. At the same time, the fingerprints must be compact

to allow for efficient matching with databases containing millions of multimedia works with different

content. In this respect, content fingerprinting shares similarities with robust hashing [4], [5]. Traditionally,

robust hashing was studied in the context of authentication, where the main objective was to prevent an

adversary from forging an image that has the same hash as a given image or video. In contrast, while

collisions or false alarms are also a concern in content fingerprinting, the main threat model is an adversary

making minor modifications to a given multimedia document that would result in a significantly different

fingerprint and prevent identification. Another difference between fingerprinting and robust hashing is that

fingerprinting applications typically involve large databases with several millions of hours of video and


audio, whereas traditional applications of image hashing typically focus on authenticating a smaller set

of images. However, many hashing schemes with good robustness properties can be adapted for content

identification purposes and hence the terms "content fingerprinting" and "robust hashing" are often used

interchangeably in the literature.

Multimedia identification also shares some similarities with content-based multimedia retrieval [6], [7]

where multimedia objects are retrieved from a database based on their perceptual similarity to a query.

Such ideas have also been built into the MPEG-7 standard to facilitate similarity comparison, search

and retrieval of video [8]. However, the concept of perceptual similarity is not well-defined and cannot

be expressed in objective terms. In contrast, the multimedia identification problem is better defined, as

two objects are considered similar if they can both be obtained from the same underlying object through

content-preserving transformations. Moreover, in many practical applications involving fingerprinting,

there are stringent requirements on scalability and computational complexity that are less of a concern

in many retrieval applications.

Content fingerprinting has received a lot of interest in the research community and several different

approaches for fingerprinting have been proposed, some of which are reviewed in Section I-A. Most of

these works addressed the problem of designing fingerprinting schemes that are robust to different kinds

of processing. This paper focuses on developing a theoretical model and analyzing the performance of

fingerprinting schemes. Such a theoretical framework would complement existing experimental evalua-

tions and allow the prediction of how the identification accuracy scales with the size of the database.

Theoretical analysis can also allow us to determine fundamental limits on the performance and provide

guidelines for designing better fingerprinting schemes.

In this paper, we examine content identification under a hypothesis testing framework and examine the

influence of various system parameters on the identification performance. We then derive bounds on the

error probabilities of an identification scheme that generates fingerprints with independent and identically

distributed (i.i.d.) bits and provide guidelines for choosing the hash length to achieve a desired accuracy.

As many practical schemes generate fingerprints with correlated components, we propose a Markov

Random Field (MRF) model to capture local dependencies among the bits and use techniques inspired

by statistical physics to determine the influence of the correlation among fingerprint components on the

probability of detection.

A. Related Prior Work

Content fingerprinting has attracted a lot of research and several audio and video fingerprinting

techniques have been proposed in the literature. A robust fingerprinting technique for audio identification

based on the signs of the differences between the energy in different frequency bands of overlapping


frames was proposed in [9]. A similar approach for video, coupled with efficient indexing strategies was

proposed in [10]. Ranks of the block average luminance of sub-sampled frames were used as fingerprints

in [11], while signs of significant wavelet coefficients of spectrograms were used to construct fingerprints

in [12]. Moment invariants that capture appearance and motion were proposed as features for fingerprints

in [13].

In the robust hashing literature, hash generation by quantizing projections of images onto smooth

random patterns was proposed in [4], which is used as a building block in many fingerprint constructions

such as [13]. Hashes resilient to geometric transforms based on properties of Fourier transform coefficients

were proposed in [5]. Spatiotemporal video hashes based on 3-D transforms were proposed in [14].

Several other hashing schemes with different robustness properties have been proposed in the literature.

The reader is referred to [15] and the references therein for a more exhaustive review and comparison

of various fingerprinting and hashing techniques.

Regarding theoretical aspects of fingerprinting, qualitative guidelines for designing multimedia hash

functions were provided in [16], with a focus on bit assignment and the use of suitable error-correcting

codes to improve the robustness. Robust hashing was considered as a classification problem in [17]. As

a null-hypothesis and false alarms were not explicitly considered in the formulation of [17], the analysis

cannot be directly applied to the problem of content identification. In the related field of biometrics, the

capacity of biometrics-based identification was studied in [18]. Capacity was defined as the maximum rate $R$ such that $2^{LR}$ distinct biometrics could be identified with an asymptotic error probability of zero, as the length of the fingerprints $L \to \infty$. However, as noted in the paper, while designing practical

systems, we are more interested in determining the best performance obtainable using a given length

of the fingerprint, which is one of the contributions of this paper. We note that while our results are

presented in the context of multimedia content identification, the results are equally applicable in other

related areas such as the biometrics-based identification considered in [18].

B. Organization of the Paper

Section II provides a brief overview of the framework of hypothesis testing adopted in this paper. In

Section III we examine the impact of various system parameters on the identification performance of

binary i.i.d. fingerprints and bounds on the accuracy are derived in Section IV. An MRF based model

for binary fingerprints with correlated bits is proposed in Section V and the impact of the correlation

on the performance is examined. Section VI presents simulation results and Section VII summarizes the

findings of the paper.


Fig. 1. System Model: (a) Database Creation, (b) Detection Stage.

II. HYPOTHESIS TESTING FRAMEWORK

Hypothesis testing has been commonly used to model identification and classification problems [19]. We adopt a similar framework in this paper for analyzing content identification. For ease of presentation, we describe the framework using the example of a video identification application, but the analysis and results apply to other identification tasks as well.

The system model for a fingerprint-based video identification scheme is shown in Fig. 1. Suppose that the detector has a collection of $N$ videos $V_1, V_2, \ldots, V_N$ which would serve as a reference database for identifying query videos. For example, in a UGC website application, the videos $\{V_i\}$ may correspond to copyrighted videos that should not be uploaded to the website by users. In the initial creation stage, the fingerprint $X^{(i)}$ corresponding to video $V_i$ is computed and stored in the database as shown in Fig. 1(a).

Given a query video $Z$ that needs to be identified, the detector computes the fingerprint $Y$ of the uploaded video and compares $Y$ with the fingerprints $\{X^{(i)}\}_{i=1}^{N}$ stored in its database. In general, the query $Z$ may be some video $W$ that does not correspond to any video in the database, or a possibly distorted version of some video $V_i$ in the database. These distortions may be caused by incidental changes that occur during transmission and storage, such as compression and transcoding, or they may be intentional distortions introduced by an attacker to prevent the identification of the content.

We consider two different detection objectives based on the requirements of different applications. In some applications, such as a video sharing website implementing content filtering, it may be sufficient to determine if the content is subject to copyright protection or not. In this case, the detector is only interested in determining whether a given video is present in a database of copyrighted material or not. We refer to this scenario as the detection problem, which can be formulated as a binary hypothesis test:

$H_0$: $Z$ does not correspond to any video in $\{V_1, V_2, \ldots, V_N\}$,
$H_1$: $Z$ corresponds to some video in $\{V_1, V_2, \ldots, V_N\}$. (1)

Under this setting, the performance of a particular fingerprinting scheme with the associated decision rule $\delta_D(\cdot)$ can be evaluated using the probability of false alarm $P_f = \Pr(\delta_D = 1|H_0)$ and the probability of correct detection $P_d = \Pr(\delta_D = 1|H_1)$. In some situations, it may be more convenient to work with the probability of false negative $P_{fn} = 1 - P_d$ instead of $P_d$.

In some applications, such as automatic tagging of content, the detector is further interested in identifying the original video corresponding to a query video. We refer to this scenario as the identification problem. The identification problem can be modeled as a multiple hypothesis test, with each hypothesis corresponding to one original video and a null hypothesis corresponding to the case that the uploaded video is not present in the database:

$H_0$: $Z$ is not from the database $\{V_1, V_2, \ldots, V_N\}$,
$H_k$: $Z$ is a (possibly distorted) version of $V_k$, $k = 1, 2, \ldots, N$. (2)

In this scenario, the probability of correctly identifying a query video $P_c$, the probability of misclassifying a video $P_m$, and the probability of false alarm $P_f$ can be used to quantify the performance of a given fingerprinting scheme and the corresponding detector. In the remainder of this paper, we examine the performance of binary fingerprinting schemes under this hypothesis testing framework.

III. FINGERPRINTS WITH INDEPENDENT BITS

Binary strings are commonly employed in fingerprinting schemes such as [9], [10], since comparison of binary strings can be performed efficiently. From the designer's point of view, it is desirable for the fingerprint bits to be independent of each other, so that an attacker cannot alter a significant number of fingerprint bits at once by making minor changes to the content. Further, if the bits are equally likely to be 0 or 1, the overall entropy is maximized and each bit conveys the maximum amount of information. If the bits are not equally likely to be 0 or 1, they can be compressed into a shorter vector with equiprobable bits, in order to meet the compactness requirement of the fingerprint. Also, from a game-theoretic perspective, it has been shown that using equally likely bits is advantageous for the designer [20]. Binary strings with independent and identically distributed (i.i.d.) bits also arise in biometric identification [21]. Hence, in this and the next section, we focus our analysis on the performance of fingerprinting schemes with i.i.d. equally likely bits and assume that each fingerprint $X^{(i)}$ consists of $L$ bits that are distributed i.i.d. according to a Bernoulli(0.5) distribution. Binary fingerprints with correlated bits will be examined later in Section V.

Distortions introduced into the content translate into changes in the fingerprint of the content. By a suitable choice of features used for constructing the fingerprint and appropriate preprocessing and synchronization, such attacks can be modeled as additive noise $\mathbf{n}$ in the hash space [16]. Since the hash bits considered in this section are designed to be i.i.d., we model the effect of attacks on the multimedia content as altering each bit of the hash independently with probability $p < 0.5$, i.e., the components of $\mathbf{n}$ are i.i.d. Bernoulli($p$). The maximum possible value of $p$ is proportional to the maximum amount of distortion that may be introduced into the multimedia content and will be referred to as the distortion parameter in the rest of the paper.

A. Detection Problem

Under the assumptions outlined above, the detection problem, where the detector is only interested in identifying whether a given content is present in a database or not, becomes:

$H_0$: $Y \neq X^{(i)} + \mathbf{n}$ for $i = 1, 2, \ldots, N$,
$H_1$: $Y = X^{(i)} + \mathbf{n}$ for some $i \in \{1, 2, \ldots, N\}$, (3)

where $Y$, $X^{(i)}$, $i = 1, 2, \ldots, N$, and the noise $\mathbf{n}$ are all binary vectors of length $L$. Under hypothesis $H_0$, $Y$ can take any value with equal probability, since the fingerprint bits are i.i.d. with equal probability of being 0 or 1, so that $\Pr(Y = y|H_0) = \frac{1}{2^L}$ for all $y \in \{0,1\}^L$. The distribution of the fingerprint $Y$, given that it is a modified version of $X^{(i)}$, $\Pr(Y|X^{(i)})$, can be specified by considering their Hamming distance. Let $d_i = d(Y, X^{(i)})$ be the Hamming distance between the fingerprint of the query video and a given fingerprint $X^{(i)}$ in the database. Since the probability of a bit being altered due to the noise is $p$, the probability that exactly $d_i$ bits are altered is $\Pr(Y|X^{(i)}) = p^{d_i}(1-p)^{L-d_i}$.

The alternative hypothesis $H_1$ is thus a composite hypothesis, as the computed fingerprint $Y$ can have different distributions depending on which original fingerprint it corresponds to. The optimal decision rule for composite hypothesis testing is given as [19]:

Decide $H_1$ if $\dfrac{p(Y|H_1)}{p(Y|H_0)} > \tau''$, (4)

where the threshold $\tau''$ can be chosen to satisfy some optimality criterion. If the priors of the hypotheses and the associated costs are known, then $\tau''$ can be computed so as to minimize the expected Bayes risk. If the costs are known, but the priors are unknown, the threshold $\tau''$ can be chosen to minimize the maximum expected risk. In this paper, we use a Neyman-Pearson approach [19] to maximize the probability of detection $P_d$ subject to the constraint that the probability of false alarm $P_f \le \alpha$.

To simplify the analysis, we assume that all videos in the database are equally likely to correspond to a query. In situations where some popular videos may be queried more often than others, the analysis can be applied by appropriately modifying the prior probabilities. With this assumption, the likelihood ratio test in Eqn. (4) becomes:

$$\frac{\sum_{i=1}^{N} p(Y|X^{(i)})\, p(X^{(i)}|H_1)}{p(Y|H_0)} > \tau''.$$


Substituting $p(Y|H_0) = \frac{1}{2^L}$, $p(Y|X^{(i)}) = p^{d_i}(1-p)^{L-d_i}$, and $p(X^{(i)}|H_1) = \frac{1}{N}$, we get:

$$\sum_{i=1}^{N} \left( p^{\frac{d_i}{L}} (1-p)^{1-\frac{d_i}{L}} \right)^{L} > \tau', \qquad (5)$$

where the constants have been absorbed into the threshold $\tau'$. We note that the left hand side is a sum of exponentials, and for a reasonably large $L$, only the largest term would be relevant. Further, since $p^x(1-p)^{1-x}$ is a decreasing function of $x$ for $p < 0.5$, the largest term in the left hand side of Eqn. (5) would be the one with the smallest value of $d_i$. Thus, we arrive at the decision rule:

$$\delta_D = \begin{cases} 1 & \text{if } d_{\min} < \tau, \\ 1 \text{ with probability } q & \text{if } d_{\min} = \tau, \\ 0 & \text{otherwise}, \end{cases} \qquad (6)$$

where $d_{\min} = \min_{i=1,2,\ldots,N} d_i$. Here $\tau$ is an integer threshold expressed in terms of the Hamming distance, and $\tau$ and $q$ are chosen to achieve a desired probability of false alarm $\alpha$. Based on this decision rule, the query is detected as being present in the database ($\delta_D = 1$) if the minimum Hamming distance between the fingerprint of the query and the fingerprints in the database is less than a specified threshold $\tau$.

1) Computing $P_d$ and $P_f$: The probability of false alarm $P_f$ for a threshold $\tau$ is given by $P_f(\tau) = \Pr(d_{\min} < \tau|H_0) + q\Pr(d_{\min} = \tau|H_0)$. To compute the value of $P_f(\tau)$, consider the Hamming distance between $Y$ and $X^{(i)}$, which can be expressed as $d_i = d(Y, X^{(i)}) = \mathrm{wt}(Y \oplus X^{(i)})$, where $\mathrm{wt}(\cdot)$ denotes the Hamming weight of a binary vector and $\oplus$ denotes addition over the binary field (XOR). Under $H_0$, since each bit of $Y$ and $X^{(i)}$ is equally likely to be 0 or 1, each component of $Y \oplus X^{(i)}$ is also Bernoulli(0.5). The probability distribution of $d_i = \mathrm{wt}(Y \oplus X^{(i)})$ thus corresponds to the weight of a random binary vector with i.i.d. uniform entries, which is a binomial distribution with parameters $L$ and 0.5. Denote the probability mass function (p.m.f.) of a binomial random variable with parameters $L$ and 0.5 by $f_0(k) \triangleq \frac{1}{2^L}\binom{L}{k}$ and the tail probability by $F_0(k) \triangleq \sum_{j=k}^{L} f_0(j)$. Then $\Pr(d_i = k|H_0) = f_0(k)$ and $\Pr(d_i \ge k|H_0) = F_0(k)$.

As the fingerprints $X^{(i)}$, $i = 1, 2, \ldots, N$, are independent, we have $\Pr(d_{\min} \ge \tau|H_0) = \prod_{i=1}^{N} \Pr(d_i \ge \tau|H_0) = [F_0(\tau)]^N$. The probability of false alarm can now be written as

$$P_f(\tau) = (1 - [F_0(\tau)]^N) + q([F_0(\tau)]^N - [F_0(\tau+1)]^N) = 1 - (1-q)[F_0(\tau)]^N - q[F_0(\tau+1)]^N. \qquad (7)$$

To compute the probability of detection, denote the p.m.f. of a binomial random variable with parameters $L$ and $p$ by $f_1(k) \triangleq \binom{L}{k} p^k (1-p)^{L-k}$ and the tail probability by $F_1(k) \triangleq \sum_{j=k}^{L} f_1(j)$. The probability of detection is given as $P_d(\tau) = \Pr(d_{\min} < \tau|H_1) + q\Pr(d_{\min} = \tau|H_1)$. Suppose that $H_1$ is true and that the query video is actually a distorted version of video $V_s$. As the noise is assumed to change each fingerprint bit independently with probability $p$, $\Pr(d_s = k|H_1, s) = f_1(k)$ and $\Pr(d_s \ge k|H_1, s) = F_1(k)$. For $i \ne s$, since $X^{(i)}$ is independent of $Y$ and has i.i.d. equally likely bits, $Y \oplus X^{(i)}$ has i.i.d. Bernoulli(0.5) components. Thus the distance $d_i = \mathrm{wt}(Y \oplus X^{(i)})$, $i \ne s$, follows a binomial distribution with parameters $L$ and 0.5, which is the same as the distribution under $H_0$. Now consider

$$\Pr(d_{\min} \ge \tau|H_1, V_s) = \Pr(d_s \ge \tau|H_1, V_s) \prod_{i \ne s} \Pr(d_i \ge \tau|H_1, V_s) = F_1(\tau)[F_0(\tau)]^{N-1}.$$

The probability of detection can then be written as

$$P_d(\tau) = 1 - [F_1(\tau)][F_0(\tau)]^{N-1} + q\left( [F_1(\tau)][F_0(\tau)]^{N-1} - [F_1(\tau+1)][F_0(\tau+1)]^{N-1} \right) = 1 - (1-q)[F_1(\tau)][F_0(\tau)]^{N-1} - q[F_1(\tau+1)][F_0(\tau+1)]^{N-1}. \qquad (8)$$
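As an illustration, the following Python sketch evaluates Eqns. (7) and (8) numerically from the binomial tail probabilities $F_0$ and $F_1$ defined above; the parameter values in the example are arbitrary choices used only to show the tradeoff, not taken from the paper's plots.

```python
from math import comb

def tail(L, p, k):
    """F(k) = Pr(Binomial(L, p) >= k)."""
    return sum(comb(L, j) * p**j * (1 - p)**(L - j) for j in range(k, L + 1))

def detection_pf_pd(L, N, p, tau, q=1.0):
    """Probability of false alarm (Eqn. 7) and detection (Eqn. 8)
    for the minimum-distance decision rule of Eqn. (6)."""
    F0_t, F0_t1 = tail(L, 0.5, tau), tail(L, 0.5, tau + 1)
    F1_t, F1_t1 = tail(L, p, tau), tail(L, p, tau + 1)
    Pf = 1 - (1 - q) * F0_t**N - q * F0_t1**N
    Pd = 1 - (1 - q) * F1_t * F0_t**(N - 1) - q * F1_t1 * F0_t1**(N - 1)
    return Pf, Pd

# Illustrative (not the paper's) parameters: L = 128, N = 2**20, p = 0.25
for tau in (30, 40, 50):
    print(tau, detection_pf_pd(128, 2**20, 0.25, tau))
```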

2) Numerical Results: In Fig. 2, we show the receiver operating characteristics (ROC) computed using Eqns. (7) and (8) for various values of the parameters $L$, $N$, and $p$. Fig. 2(a) shows the ROC curves as the distortion parameter $p$ is increased from 0.2 to 0.3 for $N = 2^{30}$ fingerprints in the database, each of length 256 bits. We observe that as the distortion parameter $p$ increases, the probability $P_d$ of detecting a copyrighted video reduces for a given probability of false alarm $P_f$. As $p$ approaches 0.5, the probability of detection approaches the lower bound $P_d = P_f$. Fig. 2(b) examines the influence of the number of fingerprints in the database $N$ on the detector performance for a fixed fingerprint length $L = 256$ bits and distortion parameter $p = 0.3$. As $N$ increases, the probability of false alarm increases. As a result, for a given $P_d$, the $P_f$ is higher, or equivalently, for a fixed $P_f$, the probability of detection is lower. Fig. 2(c) shows that under a given distortion, the detector performance can be improved by using a longer fingerprint. As the fingerprint length is increased, $P_d$ increases for a given $P_f$.

B. Identification Problem

We now consider the identification problem for binary fingerprinting schemes, where the detector is interested in identifying the specific video that the query corresponds to. As discussed in Section II, this scenario can be modeled as a multiple hypothesis test:

$H_0$: $Y \neq X^{(i)} + \mathbf{n}$, for $i = 1, 2, \ldots, N$,
$H_i$: $Y = X^{(i)} + \mathbf{n}$, $i = 1, 2, \ldots, N$. (9)

Fig. 2. Receiver Operating Characteristics (ROC) for the binary hypothesis testing problem obtained from theoretical analysis: (a) $N = 2^{30}$, $L = 256$ bits; (b) $p = 0.3$, $L = 256$ bits; (c) $N = 2^{30}$, $p = 0.3$.

As before, we assume that the fingerprint bits are i.i.d. and equally likely to be 0 or 1, the noise

independently changes each bit with probability $p$, and that the prior probability of each hypothesis is the same. Under this model, the Maximum Likelihood (ML) decision rule can be derived as:

$$\delta_I = \begin{cases} i & \text{if } d_i \le \tau \text{ and } i = \arg\min_{j=1,2,\ldots,N} d_j, \\ 0 & \text{otherwise}, \end{cases} \qquad (10)$$

where $d_i = d(Y, X^{(i)})$. If fingerprints of several copyrighted videos have the same distance to the fingerprint of the query video $Y$, one of them is chosen randomly as the match.

We now compute the performance metrics for the ML detector $\delta_I$. The probability of false alarm $P_f$


is given by

$$P_f(\tau) = \Pr(\text{at least one of } d_1, d_2, \ldots, d_N \le \tau \,|\, H_0) = 1 - \Pr(\text{none of } d_1, d_2, \ldots, d_N \le \tau \,|\, H_0) = 1 - [F_0(\tau+1)]^N.$$

As the fingerprints $\{X^{(i)}\}$ are identically distributed and equally likely to be queried, and the distribution of the noise $\mathbf{n}$ under each of the hypotheses is the same, the overall probability of correct identification $P_c$ will be equal to the probability of correct identification under any given hypothesis, for example $H_1$. Under this hypothesis, $d_1$ has p.m.f. $f_1$ and $d_i$, $i \ne 1$, has p.m.f. $f_0$, so that:

$$P_c(\tau) = \Pr(\delta_I = 1|H_1) = \Pr\left(d_1 \le \tau \wedge d_1 < \min_{i>1} d_i \,\Big|\, H_1\right) + \Pr\left(\min_{i>1} d_i = d_1 \wedge d_1 \le \tau \wedge \delta_I = 1 \,\Big|\, H_1\right)$$
$$= \sum_{j=0}^{\tau} f_1(j) \left[ \{F_0(j+1)\}^{N-1} + \sum_{k=1}^{N-1} \frac{1}{k+1} \binom{N-1}{k} [f_0(j)]^k [F_0(j+1)]^{N-1-k} \right].$$

Similarly, the probability of misclassification can be computed as:

$$P_m(\tau) = \Pr(\delta_I \in \{2, 3, \ldots, N\}|H_1) = \Pr\left(\min_{i>1} d_i \le \tau \wedge \min_{i>1} d_i < d_1 \,\Big|\, H_1\right) + \Pr\left(\min_{i>1} d_i = d_1 \wedge d_1 \le \tau \wedge \delta_I > 1 \,\Big|\, H_1\right)$$
$$= \sum_{j=0}^{\tau} \left[ \sum_{k=1}^{N-1} \binom{N-1}{k} f_0(j)^k [F_0(j+1)]^{N-1-k} \left( F_1(j+1) + \frac{k}{k+1} f_1(j) \right) \right].$$
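The two expressions above can be transcribed directly into code; a minimal sketch, kept to a small database so the inner sum over $k$ stays cheap (the parameter values are illustrative only):

```python
from math import comb

def pmf(L, p, j):
    return comb(L, j) * p**j * (1 - p)**(L - j)

def tail(L, p, k):
    return sum(pmf(L, p, j) for j in range(k, L + 1))

def identification_pc_pm(L, N, p, tau):
    """Pc(tau) and Pm(tau) for the ML detector of Eqn. (10)."""
    Pc = Pm = 0.0
    for j in range(tau + 1):
        f0j, f1j = pmf(L, 0.5, j), pmf(L, p, j)
        F0j1, F1j1 = tail(L, 0.5, j + 1), tail(L, p, j + 1)
        Pc += f1j * F0j1 ** (N - 1)              # no other fingerprint as close
        for k in range(1, N):                    # k other fingerprints tied at distance j
            w = comb(N - 1, k) * f0j**k * F0j1 ** (N - 1 - k)
            Pc += f1j * w / (k + 1)              # tie broken in favor of the true video
            Pm += w * (F1j1 + k / (k + 1) * f1j)
    return Pc, Pm

print(identification_pc_pm(L=64, N=100, p=0.2, tau=20))
```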

Fig. 3 shows the influence of the various parameters on the identification accuracy of the ML detector in Eqn. (10). Fig. 3(a) shows the influence of the distortion parameter $p$. We observe that as $p$ increases, the probability of correct identification $P_c$ at a given false alarm probability $P_f$ reduces, and the probability of misclassification $P_m$ increases. The influence of the number of videos $N$ on the accuracy of identification is shown in Fig. 3(b). As the number of videos in the database increases, the probability of false alarm increases, or equivalently, at a given $P_f$, the value of $P_c$ is lower. Fig. 3(c) shows that the probability of correct identification under a given distortion $p$ and a given $P_f$ can be increased by increasing the hash length. Thus, given the number of videos $N$ and a desired probability of false alarm $P_f$, the content identification system can be made more robust by choosing a longer hash length $L$. These results are similar to those obtained for the detection problem in the previous section.


Fig. 3. ROC curves for the multiple hypothesis testing problem obtained from theoretical analysis: (a) $N = 2^{30}$, $L = 256$ bits; (b) $p = 0.25$, $L = 256$ bits; (c) $N = 2^{30}$, $p = 0.25$.

IV. ERROR EXPONENTS AND PERFORMANCE BOUNDS

In Section III, we have derived expressions for the probability of correct identification and false alarm for a given set of parameters and examined the tradeoff between identification accuracy, robustness, and the fingerprint length. In practice, we are often interested in choosing system parameters to ensure that the probability of error is below a certain threshold. While the expressions for $P_d$ and $P_f$ in Section III can be used to choose the parameters, the equations are non-linear and cannot be solved easily. Hence, in this section, we derive bounds on the achievable error probabilities using fingerprints of a given length and provide guidelines for choosing the fingerprint length required to achieve a desired detection accuracy. We provide an intuitive interpretation of these bounds and show that content identification with a false alarm requirement is analogous to the problem of joint source-channel coding.


A. Error Exponents

Consider the detection problem where the detector is only interested in deciding whether a query video is a modified version of some video in the database or not. As before, we examine the case of i.i.d. binary fingerprints with the corresponding decision rule given by Eqn. (6). As we are interested in deriving bounds, we assume, for simplicity, that $q = 1$ in the decision rule. The probability of false alarm is given by

$$P_f(\tau) = \Pr\left(\bigcup_{i=1}^{N} \{d(Y, X^{(i)}) < \tau\} \,\Big|\, H_0\right) \le \sum_{i=1}^{N} \Pr(d(Y, X^{(i)}) < \tau|H_0) = N \Pr(d(Y, X^{(1)}) < \tau|H_0), \qquad (11)$$

where we have used the union bound and the fact that the fingerprints $X^{(i)}$ are i.i.d. As discussed in the previous section, under $H_0$, $Y$ and $X^{(1)}$ are independent with each component being equally likely to be 0 or 1. Thus, the XOR of $Y$ and $X^{(1)}$ is uniformly distributed over all binary strings of length $L$. The Hamming distance $d(Y, X^{(1)}) = \mathrm{wt}(Y \oplus X^{(1)})$, and as a result, $\Pr(d(Y, X^{(1)}) < \tau|H_0) = \frac{1}{2^L} \sum_{x \in \{0,1\}^L} \mathbf{1}(\mathrm{wt}(x) < \tau) = \frac{1}{2^L} S_{L,\tau}$, where $\mathbf{1}(\cdot)$ is the indicator function and $S_{L,\tau}$ is the number of binary vectors within a sphere of radius $\tau$ in $\{0,1\}^L$. Let $\lambda = \frac{\tau}{L}$ be the normalized radius. The volume of the sphere $S_{L,L\lambda}$, for $\lambda \le \frac{1}{2}$, can be bounded as

$$S_{L,L\lambda} \le 2^{Lh(\lambda)},$$

where $h(p) = -p\log_2 p - (1-p)\log_2(1-p)$ is the entropy function [22]. By combining this result with Eqn. (11), the probability of false alarm can be bounded from above as

$$P_f(L\lambda) \le N 2^{-L} S_{L,\tau} \le N 2^{-L(1-h(\lambda))}, \qquad (12)$$

where $\tau = L\lambda$. The same result can be obtained by applying the Chernoff bound to upper bound $\Pr(d(Y, X^{(1)}) < L\lambda)$ for $\lambda < \frac{1}{2}$, with $d(Y, X^{(1)})$ being a binomial random variable with parameters $L$ and $\frac{1}{2}$ [23]. However, we prefer this approach as it provides an intuitive explanation of the bounds, which is discussed in Section IV-C.

We next consider the probability of a false negative (missed detection) $P_{fn} = 1 - P_d$. Suppose that $X^{(i)}$ is the fingerprint of a video $V_i$ in the database and that $Y$ is the fingerprint of a modified version of $V_i$. A false negative occurs if no fingerprint in the database is within a distance $\tau$ of the query fingerprint $Y$. The probability of a false negative can thus be bounded by the probability that the distance between $Y$ and the original fingerprint $X^{(i)}$ is larger than $\tau$:

$$P_{fn}(\tau) \le \Pr(d(Y, X^{(i)}) > \tau|H_1).$$

Since $Y$ is generated by flipping each bit of $X^{(i)}$ with a probability $p$, $d(Y, X^{(i)})$ is distributed according to a binomial random variable with parameters $L$ and $p$, so that $P_{fn} \le \Pr(\mathrm{Binomial}(L, p) > \tau)$. By the Chernoff bound [23], the tail probability of the binomial distribution can be bounded as

$$\Pr(\mathrm{Binomial}(L, p) \ge L\lambda) \le 2^{-LD(\lambda||p)},$$

where $D(\lambda||p)$ is the Kullback-Leibler distance between two Bernoulli distributions with parameters $\lambda$ and $p$, respectively. Thus, the probability of false negative when $\tau = L\lambda$ can be bounded as

$$P_{fn}(L\lambda) \le 2^{-LD(\lambda||p)}. \qquad (13)$$

Eqns. (12) and (13) show the tradeoff between the probability of false alarm $P_f$, the probability of missed detection $P_{fn}$, and the number of fingerprints $N$ in the database. For example, given $N$ videos, reducing the $P_f$ would require $1 - h(\lambda)$ to be as large as possible, or equivalently, $\lambda$ must be as small as possible. However, reducing $\lambda$ leads to an increase in the $P_{fn}$. To further examine this tradeoff, let us define the rate $R$ as $N = 2^{LR}$, the false alarm error exponent as $E_f = 1 - h(\lambda) - R$, and the false negative error exponent as $E_{fn} = D(\lambda||p)$, so that $P_f \le 2^{-LE_f}$ and $P_{fn} \le 2^{-LE_{fn}}$. From the properties of the Chernoff bound, we know that these bounds are asymptotically tight, i.e., $\lim_{L\to\infty} -\frac{1}{L}\log_2 P_{fn} = E_{fn}$ as defined above. In the Neyman-Pearson setting, given a certain number of videos $N$ and fingerprint length $L$, suppose we wish to ensure that $P_f \le \epsilon = 2^{-L\Delta}$ and minimize $P_{fn}$. This is equivalent to maximizing $E_{fn}$ for a fixed rate $R$ while ensuring that $E_f \ge \Delta$:

$$\max_{\lambda}\ E_{fn} = D(\lambda||p) \quad \text{subject to} \quad 1 - h(\lambda) - R \ge \Delta. \qquad (14)$$
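Eqn. (14) is a one-dimensional optimization and can be solved by a simple search over $\lambda$; a minimal sketch in Python (the grid resolution and the parameter values in the example are illustrative choices):

```python
from math import log2

def h(x):
    """Binary entropy function."""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def kl(lam, p):
    """D(lambda || p) between Bernoulli distributions."""
    return lam * log2(lam / p) + (1 - lam) * log2((1 - lam) / (1 - p))

def max_false_negative_exponent(R, delta, p, steps=10000):
    """Solve Eqn. (14): maximize D(lambda||p) subject to 1 - h(lambda) - R >= delta.
    Only lambda in (p, 0.5) yields a nontrivial false-negative exponent."""
    best = 0.0
    for i in range(1, steps):
        lam = 0.5 * i / steps
        if lam > p and 1 - h(lam) - R >= delta:
            best = max(best, kl(lam, p))
    return best

print(max_false_negative_exponent(R=0.01, delta=0.005, p=0.3))
```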

Fig. 4 shows the maximum achievable false negative error exponent $E_{fn}$ as a function of the false alarm error exponent $\Delta$, for a fixed rate $R$, when $p = 0.3$. From the figure, we observe that at a given rate $R$, $E_{fn}$ reduces as a function of $\Delta$, which implies that for a fixed number of fingerprints in the database, reducing the false alarms leads to an increase in the number of missed detections, and vice versa. From the figure, we also observe that for a fixed value of $\Delta$, $E_{fn}$ reduces as $N$ increases. This trend matches the results presented in Section III.

Fig. 4. Maximum achievable false negative error exponent $E_{fn}$ as a function of the false alarm error exponent $\Delta$, for rates $R = 0.001$, $0.01$, and $0.05$, with $p = 0.3$.

To ensure that $P_{fn} < 0.5$, the decision threshold $\tau = L\lambda$ should be greater than the mean of the binomial distribution, $Lp$. As the entropy function $h(\lambda)$ is monotonically increasing for $\lambda < 0.5$, this would in turn imply that the false alarm exponent $\Delta = 1 - h(\lambda) - R \le 1 - h(p) - R$. Hence, to ensure

that $P_f \le \epsilon = 2^{-L\Delta}$, we require that $R + \Delta \le 1 - h(p)$, or equivalently,

$$\frac{1}{L}\log_2\frac{N}{\epsilon} \le 1 - h(p). \qquad (15)$$

Thus, given a video database of size $N$, to ensure that the probability of false alarm $P_f \le \epsilon$ when the attack alters on average a fraction $p$ of the hash bits, the length of the fingerprints used for identification should be chosen large enough to satisfy Eqn. (15). The corresponding probability of false negative is then guaranteed to be less than $2^{-LE_{fn}}$, where $E_{fn}$ can be computed from Eqn. (14).
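Eqn. (15) directly yields a design guideline for the fingerprint length; a small helper under the i.i.d. model above (the numbers in the example are hypothetical):

```python
from math import ceil, log2

def h(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

def min_fingerprint_length(N, eps, p):
    """Smallest L satisfying Eqn. (15), (1/L) log2(N/eps) <= 1 - h(p),
    so that Pf <= eps when an attack flips a fraction p of the bits."""
    if h(p) >= 1.0:
        raise ValueError("p is too close to 0.5; no finite length suffices")
    return ceil(log2(N / eps) / (1 - h(p)))

# e.g. N = 2**20 reference videos, target Pf <= 1e-6, bit-flip rate p = 0.2
print(min_fingerprint_length(2**20, 1e-6, 0.2))
```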

B. Bounds on the Error Probabilities for the Identification Problem

Similar bounds may be derived on the various errors that may occur in the identification problem. As the expression for the false alarm in the identification problem is identical to that in the detection problem, the bound on the false alarm remains the same, i.e., $P_f \le 2^{-L(1-h(\lambda)-R)}$. Now consider the probability of correct classification $P_c$. Given that $H_i$ is true for some $i \ne 0$, the detector does not make a correct identification if $d(Y, X^{(i)}) > L\lambda$ or if $d(Y, X^{(j)}) < d(Y, X^{(i)})$ for some $j \ne i$. Thus the probability of not making a correct decision can be bounded as

$$1 - P_c \le \Pr(d(Y, X^{(i)}) > L\lambda|H_i) + \Pr(d(X^{(i)}, X^{(j)}) < 2L\lambda,\ i \ne j)$$
$$\le \Pr(d(Y, X^{(i)}) > L\lambda|H_i) + \sum_{j \ne i} \Pr(d(X^{(i)}, X^{(j)}) < 2L\lambda)$$
$$\le 2^{-LD(\lambda||p)} + 2^{-L(1-h(2\lambda)-R)},$$

where the second term corresponds to the bound on the probability that two uniformly chosen binary strings of length $L$ have a distance less than $2L\lambda$ and is obtained using the Chernoff bound, or by considering the volume of a binary sphere of radius $2L\lambda$. We also note that $P(\delta_I = 0|H_i, i \ne 0) + P_m + P_c = 1$, as these are the only possible decisions that a detector can make. Therefore, $P_m < 1 - P_c \le 2^{-LD(\lambda||p)} + 2^{-L(1-h(2\lambda)-R)}$.
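The bound above is easy to evaluate numerically; a short sketch with illustrative parameters (chosen so that both exponents are positive, not taken from the paper):

```python
from math import log2

def h(x):
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def kl(lam, p):
    return lam * log2(lam / p) + (1 - lam) * log2((1 - lam) / (1 - p))

def identification_error_bound(L, N, p, lam):
    """Upper bound on 1 - Pc: a missed-detection term plus a collision term."""
    R = log2(N) / L
    return 2 ** (-L * kl(lam, p)) + 2 ** (-L * (1 - h(2 * lam) - R))

# e.g. L = 1024 bits, N = 100 videos, p = 0.15, normalized threshold lambda = 0.2
print(identification_error_bound(1024, 100, 0.15, 0.2))
```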

C. A Joint Source-Channel Coding Perspective

In the previous subsections, we have examined the relation between the rate $R$, the false negative error exponent $E_{fn}$, and the false alarm error exponent $\Delta$. We now provide an intuitive explanation of the theoretical results obtained.

Consider the space of all binary strings of length $L$, represented by the dashed circle in Fig. 5. Let the $N$ binary fingerprints $X^{(i)}$, $i = 1, 2, \ldots, N$, present in the database be represented by the solid dots in the figure, and let the circles around the dots represent the detection regions for the respective fingerprints. Any query fingerprint that falls within such a sphere is identified as the fingerprint represented by the center of the sphere. The number of such spheres controls the rate $R$, and the volume of the spheres determines the probability of false alarm and missed detections.

To ensure a low probability of false negatives when the probability of a bit flipping is $p$, the detection region around each fingerprint should include all binary strings that are within a Hamming distance $Lp$ from the fingerprint. The volume of such a sphere of radius $Lp$ is $S_{L,Lp}$, which for large $L$ is approximately $S_{L,Lp} \approx 2^{Lh(p)}$. As we have assumed that in the null hypothesis, the fingerprints of the videos absent from the database are uniformly distributed over the entire space of binary strings of length $L$, the probability of false alarm is approximately

$$P_f = \frac{N \times S_{L,Lp}}{2^L} \quad \Rightarrow \quad \epsilon \approx \frac{N\, 2^{Lh(p)}}{2^L}, \qquad (16)$$

which upon rearrangement gives Eqn. (15). To achieve a higher rate, we would like to pack more such spheres into the binary space, but this would increase the probability of false alarms. Similarly, to reduce the probability of false negatives, the volume of the decoding region around each fingerprint has to be increased, which would also increase $P_f$ and reduce the number of spheres that can be packed into the binary space.

We see that the fingerprinting problem shares some analogies with source and channel coding. In channel coding, to achieve capacity, we are interested in packing as many spheres as possible into the binary space such that their overlap is minimum. In source coding with a fidelity criterion (rate-distortion theory), we are interested in covering the entire space with as few spheres of fixed size as possible. Here, to minimize the probability of false alarms, we would like to cover the space as sparsely as possible, but the conflicting objective of increasing the rate requires packing as many spheres as possible. Thus, fingerprinting can be thought of as being similar to joint source-channel coding.


Fig. 5. The space of all binary strings of length $L$, with a ball of radius $Lp$ as the detection region around each fingerprint in the database.

V. BINARY FINGERPRINTS WITH CORRELATED BITS

Our analysis so far has been focused on content identification using binary fingerprints with i.i.d. equally likely bits. This analysis provides useful guidelines for designing and characterizing the performance bounds of fingerprinting schemes. Many practical fingerprinting schemes, however, generate fingerprints with correlation among components. While it is possible to include an explicit decorrelation stage to remove such dependencies and obtain a shorter fingerprint with independent bits, it may be preferable in practice to use the correlated fingerprint bits directly and avoid the additional computation for decorrelation, in order to meet stringent computational requirements in large-scale practical deployment. To capture the correlations in the fingerprints and the noise introduced through distortion of the content, we propose a Markov Random Field based model and study the impact of the correlation on the identification performance. We then describe an approach inspired by statistical physics to compute the probability of errors.

A. Markov Random Fields

Markov Random Fields (MRFs) are a generalization of Markov chains in which time indices are replaced by space indices [24]. MRFs are undirected graphical models and represent conditional independence relations among random variables. In this section, we briefly review key concepts of MRFs that are relevant to modeling content fingerprints.

An MRF consists of an undirected graph $G = (\mathcal{V}, \mathcal{E})$ with a set of nodes $\mathcal{V}$ and a set of edges $\mathcal{E}$ between nodes. Each node $X \in \mathcal{V}$ represents a random variable, and we will use $X$ to denote the node and the random variable interchangeably. The vector $\mathbf{X}$ denotes all random variables represented by the MRF. Two nodes $X_i$ and $X_j$ are said to be neighbors if there is an edge between them, i.e., $(i,j) \in \mathcal{E}$. A set of nodes $C$ is called a maximal clique if every pair of nodes in $C$ are neighbors and no node in $\mathcal{V} \setminus C$ is a neighbor of every node in $C$. An energy function $E_C(\{x_C\})$ is associated with every maximal clique $C$ that maps

the values $\{x_C\}$ of the nodes in $C$ to a real number. The joint probability distribution of all the random variables represented by the MRF is then given as $p(\mathbf{X} = \mathbf{x}) = \frac{1}{Z}\exp\left(-\sum_{C} E_C(\{x_C\})\right)$, where $Z$ is a normalization constant called the partition function. The term in the exponent, $E(\mathbf{x}) = \sum_{C} E_C(\{x_C\})$, is sometimes referred to as the energy of the configuration $\mathbf{x}$.

Fig. 6. Markov Random Field model for (a) fingerprint components and (b) fingerprint and noise.

MRFs have been used in image processing [25] and computer vision [26] as they can represent local

correlations among random variables. In the next section, we develop a model for content fingerprints

using MRFs to capture local dependencies.

B. Model for a Block-based Fingerprinting Scheme

We model content fingerprints as a Markov Random Field where each fingerprint value is represented as a node in the MRF, and pairs of nodes that have dependencies are joined by edges. We illustrate our model using a representative fingerprinting scheme that partitions each video frame into blocks and extracts one bit from each block [15]. While we use a simple two-dimensional model for ease of illustration, the analysis can be extended to three-dimensional and more complex models.

Suppose that each video frame of size $PH \times QW$ is partitioned into $PQ$ blocks of size $H \times W$ each, and one bit of the fingerprint is extracted from each block. For example, the fingerprint bit could be obtained by thresholding the average luminance of a block. Due to underlying correlations among the blocks of the frame, these bits are likely to be correlated. We represent the bit extracted from each block as a node in a graph $G_0 = (\mathcal{V}_0, \mathcal{E}_0)$, with the node $X_{i,j}$ representing the bit from the $(i,j)$th block. Each node may take one of two values $\pm 1$, with bit $b$ represented as $(-1)^b$, and is connected to the four nearest neighbors, so that the overall graph satisfies 4-connectivity as shown in Fig. 6(a). For convenience, we use a vector $\mathbf{X}$ to represent the bits $X_{i,j}$, $1 \le i \le P$, $1 \le j \le Q$, which could be obtained by any form of reordering, such as raster scanning.

As described in Section V-A, the joint probability distribution of the fingerprint can be specified by defining an energy function for the model. We use the energy function that has been commonly used for modeling binary images [26]:

$$E_0(\mathbf{x}) = -h\sum_{i} x_i - \eta \sum_{(j,k)\in\mathcal{E}_0} x_j x_k. \qquad (17)$$

This corresponds to the 2-D Ising model that has been widely used in statistical physics to model ferromagnetism arising out of interactions between individual spins. Here $\eta$ controls the correlation between nodes that are connected, and $h$ determines the marginal distribution of the individual bits. A higher value of $\eta$ would increase the correlation among neighboring bits, and a large $h$ would bias the bits to be $+1$. The joint distribution can then be written as $p_0(\mathbf{x}) = \frac{1}{Z_0}\exp(-E_0(\mathbf{x}))$, with $Z_0$ being the normalization constant to ensure that the distribution sums to 1.
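To make the model concrete, the sketch below draws one grid of correlated $\pm 1$ fingerprint bits from the distribution $p_0(\mathbf{x}) \propto \exp(-E_0(\mathbf{x}))$ of Eqn. (17) using Gibbs sampling; the sampler itself is not part of the paper, and the grid size, number of sweeps, and parameter values are illustrative choices.

```python
import random
from math import exp

def gibbs_sample_fingerprint(P, Q, h, eta, sweeps=200, seed=0):
    """Sample a P x Q grid of +/-1 bits from the 2-D Ising model of Eqn. (17),
    with 4-nearest-neighbor edges and free (non-periodic) boundaries."""
    rng = random.Random(seed)
    x = [[rng.choice((-1, 1)) for _ in range(Q)] for _ in range(P)]
    for _ in range(sweeps):
        for i in range(P):
            for j in range(Q):
                # Sum of the values of the (up to four) neighboring nodes
                s = sum(x[i + di][j + dj]
                        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= i + di < P and 0 <= j + dj < Q)
                # Conditional probability that this bit equals +1 given its neighbors
                p_plus = 1.0 / (1.0 + exp(-2.0 * (h + eta * s)))
                x[i][j] = 1 if rng.random() < p_plus else -1
    return x

# Equally likely (h = 0) but positively correlated (eta = 0.3) fingerprint bits
print(gibbs_sample_fingerprint(P=8, Q=8, h=0.0, eta=0.3))
```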

The above model describes the fingerprint bits obtained from the original video frame. In many practical applications, fingerprints are extracted from possibly modified versions of the video and may be noisy. The noise components may be correlated and also dependent on the fingerprint bits. To accommodate such modifications, we propose a joint model for the noise bits and the fingerprint bits of the original unmodified video, which is shown in Fig. 6(b). The filled circles represent the noise bits and the open circles represent the fingerprint bits. The solid edges capture the dependencies among the fingerprint components, while the dotted edges represent the local correlations among the noise bits. The dashed edges capture the dependence between the noise bits and the fingerprint bits. The noise may be causally dependent on the fingerprint of the original video, but the fingerprint bits of the original video should not be influenced by the noise. However, the addition of these undirected edges makes the graph symmetric with respect to the fingerprint and noise bits and does not accurately reflect the causal dependence. Factor graphs can be used to represent this dependence, and will be addressed in our future work.

In this paper, we consider the case where the noise bits may be mutually dependent, but are independent of the fingerprint bits, implying that the dashed edges between the noise bits and the fingerprint bits are absent. In this case, the model for the noise bits $\{N_{i,j}\}$ reduces to a 2-D Ising model $G_1 = (\mathcal{V}_1, \mathcal{E}_1)$ similar to that for the fingerprints. The energy function for a configuration $\mathbf{n}$ can be defined as:

$$E_1(\mathbf{n}) = -\alpha\sum_{i} n_i - \gamma \sum_{(j,k)\in\mathcal{E}_1} n_j n_k, \qquad (18)$$

and the distribution is specified as $p_1(\mathbf{n}) = \frac{1}{Z_1}\exp(-E_1(\mathbf{n}))$. The parameters $\alpha$ and $\gamma$ control the marginal distribution and the pairwise correlation among the noise bits, respectively.

The above MRF can be used to model block-based binary video fingerprints computed on a frame-by-frame basis. For other fingerprinting schemes, different graphs can be used to capture the local dependencies among the fingerprint components.


C. Hypothesis Testing

As discussed in Section II, content identification using fingerprints can be modeled as a multiple hypothesis test. Under the MRF model, since it is difficult to compute the error probabilities for the $(N+1)$-ary hypothesis test, we approximate the multiple hypothesis test as a series of binary hypothesis tests, where the detector compares the query fingerprint sequentially with each fingerprint in the database. The first fingerprint that matches the query is output as the response to the query. While this decision rule is suboptimal compared to the Maximum Likelihood decision for the multiple hypothesis test, it is often a good approximation in practice, as the fingerprints in the database are expected to be well separated and the probability that a query fingerprint is close to two distinct fingerprints in the database is small. With this approximation, we first examine a simpler binary hypothesis test of comparing the query fingerprint to a given fingerprint in the database and then use the results for the binary hypothesis test to compute the error probabilities for the overall identification process.

Given a query video $Z$ and a reference video $V$ in its database, consider the problem where the detector has to decide whether $Z$ is derived from $V$ or whether the two videos are unrelated. To do so, the detector computes the fingerprints $\mathbf{y}$ and $\mathbf{x}$ from the videos $Z$ and $V$, respectively. The detector then performs a binary hypothesis test with the null hypothesis $H_0$ that the two fingerprints are independent and the alternate hypothesis $H_1$ that the fingerprint $\mathbf{y}$ is a noisy version of $\mathbf{x}$:

$H_0$: $(\mathbf{x}, \mathbf{y}) \sim p_0(\mathbf{x})\, p_0(\mathbf{y})$,
$H_1$: $(\mathbf{x}, \mathbf{y}) \sim p_0(\mathbf{x})\, p_1(\mathbf{n})$, (19)

where $p_0(\cdot)$ is the distribution of the fingerprints, $p_1(\cdot)$ is the distribution of the noise, and the noise is the element-wise product of the two fingerprints, $\mathbf{n} = \mathbf{x} \otimes \mathbf{y}$, with the fingerprint bits being represented using $\pm 1$.

We consider a Neyman-Pearson setting, where the detector seeks to maximize the probability of detection $P_d^{(b)}$ under the constraint that the probability of false alarm $P_f^{(b)} \le \epsilon$. The optimal decision rule is obtained by comparing the log likelihood ratio (LLR) to a threshold:

$$\mathrm{LLR}(\mathbf{x}, \mathbf{y}) = E_0(\mathbf{y}) - E_1(\mathbf{n}) \ \underset{H_0}{\overset{H_1}{\gtrless}}\ \tau, \qquad (20)$$

where the constants have been absorbed into the threshold $\tau$, which is chosen such that $P_f^{(b)} = \epsilon$. In cases where the LLR is discrete, it may be necessary to incorporate randomization when the LLR equals the threshold.

For example, for the block-based binary fingerprinting scheme model described in Section V-B, the LLR is given by:

$$\mathrm{LLR}(\mathbf{x}, \mathbf{y}) = -h\sum_{i} y_i - \eta\sum_{(j,k)\in\mathcal{E}_0} y_j y_k + \alpha\sum_{i} n_i + \gamma\sum_{(j,k)\in\mathcal{E}_1} n_j n_k.$$

If the fingerprint bits are i.i.d. and equally likely to be $\pm 1$, corresponding to $\eta = h = 0$, and the noise bits are independent ($\gamma = 0$), the optimum decision rule reduces to a comparison of the Hamming distance between $\mathbf{x}$ and $\mathbf{y}$ to a threshold, as derived in [27]. However, when the bits are not independent, a decision rule that compares the Hamming distance to a threshold is suboptimal. We would like to quantify the accuracy using the optimal decision rule and the performance loss when using the Hamming distance as opposed to using the optimal decision rule.
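The LLR statistic above is straightforward to transcribe into code; a minimal sketch for the 4-neighbor grid model (the small test vectors at the end are hypothetical):

```python
def llr(x, y, h, eta, alpha, gamma):
    """LLR(x, y) for the block-based model of Section V-B, with the 4-neighbor
    grid as the edge set and n = x * y (element-wise product of +/-1 bits)."""
    P, Q = len(y), len(y[0])
    n = [[x[i][j] * y[i][j] for j in range(Q)] for i in range(P)]

    def node_sum(z):
        return sum(sum(row) for row in z)

    def edge_sum(z):   # sum of z_j * z_k over horizontal and vertical edges
        s = sum(z[i][j] * z[i][j + 1] for i in range(P) for j in range(Q - 1))
        return s + sum(z[i][j] * z[i + 1][j] for i in range(P - 1) for j in range(Q))

    return (-h * node_sum(y) - eta * edge_sum(y)
            + alpha * node_sum(n) + gamma * edge_sum(n))

# With h = eta = gamma = 0 the statistic reduces to alpha * sum(n), which is a
# monotone function of the Hamming distance between x and y, as noted above.
x = [[1, -1], [1, 1]]
y = [[1, -1], [-1, 1]]
print(llr(x, y, h=0.0, eta=0.1, alpha=1.0, gamma=0.2))
```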

Define the probability of detection under this binary hypothesis test as $P_d^{(b)} = \Pr(\mathrm{LLR}(\mathbf{x}, \mathbf{y}) > \tau|H_1)$ and the probability of false alarm as $P_f^{(b)} = \Pr(\mathrm{LLR}(\mathbf{x}, \mathbf{y}) > \tau|H_0)$. A main challenge in accurately estimating these tail probabilities is that these events have small probability of occurrence and are rarely observed in a typical Markov Chain Monte Carlo simulation. We take a different approach inspired by statistical physics to first estimate the so-called density of states and then utilize this information to estimate these probabilities.

D. Computing $P_d^{(b)}$ and $P_f^{(b)}$

For ease of illustration, we again use the example of the binary fingerprint model described in Section V-B. Suppose we define $M(\mathbf{x}) = \sum_i x_i$ and $E_{corr}(\mathbf{x}) = -\sum_{(j,k)\in\mathcal{E}_0} x_j x_k$. The LLR in Eqn. (20) can then be written as $\mathrm{LLR}(\mathbf{x}, \mathbf{y}) = -hM(\mathbf{y}) + \eta E_{corr}(\mathbf{y}) + \alpha M(\mathbf{n}) - \gamma E_{corr}(\mathbf{n})$, since $\mathcal{E}_0 = \mathcal{E}_1$ in this model. Similarly, the energy for the fingerprint bits and the noise, $E_0(\mathbf{x})$ and $E_1(\mathbf{n})$, described in Eqns. (17) and (18), can be rewritten in terms of these functions. Thus, the tuple

$$S(\mathbf{x}, \mathbf{y}) = (M(\mathbf{x}), E_{corr}(\mathbf{x}), M(\mathbf{y}), E_{corr}(\mathbf{y}), M(\mathbf{n}), E_{corr}(\mathbf{n}))$$

captures all necessary information regarding the configuration $(\mathbf{x}, \mathbf{y})$. Define $g(s) = g(m_x, e_x, m_y, e_y, m_n, e_n)$ as the number of configurations $(\mathbf{x}, \mathbf{y})$ that have $M(\mathbf{x}) = m_x$, $E_{corr}(\mathbf{x}) = e_x$, $M(\mathbf{y}) = m_y$, $E_{corr}(\mathbf{y}) = e_y$, $M(\mathbf{n}) = m_n$, and $E_{corr}(\mathbf{n}) = e_n$. The function $g$ is referred to as the "density of states" in the physics literature; it depends only on the underlying graphical model and is independent of the parameters $(h, \eta, \alpha, \gamma)$ of the distributions.

The probability of detection $P_d^{(b)}$ can then be rewritten as:

$$P_d^{(b)}(\tau) = \sum_{(\mathbf{x}, \mathbf{y})} \mathbf{1}\{\mathrm{LLR}(\mathbf{x}, \mathbf{y}) > \tau\}\, p_0(\mathbf{x})\, p_1(\mathbf{n}) \qquad (21)$$
$$= \sum_{s} g(s)\, \mathbf{1}\{\mathrm{LLR}(s) > \tau\}\, p_0(s)\, p_1(s), \qquad (22)$$


where the summation in Eqn. (22) is over all possible values of $s = (m_x, e_x, m_y, e_y, m_n, e_n)$ and $p_0(s)p_1(s)$ is the probability under $H_1$ of any configuration $(\mathbf{x}, \mathbf{y})$ with $S(\mathbf{x}, \mathbf{y}) = s$. Similarly,

$$P_f^{(b)}(\tau) = \sum_{s} g(s)\, \mathbf{1}\{\mathrm{LLR}(s) > \tau\}\, p_0(s)\, p_0(s). \qquad (23)$$

As the LLR and the probabilities $p_1(\mathbf{n})$ and $p_0(\mathbf{x})$ depend only on $s$, knowledge of $g(s)$ allows us to compute $P_d^{(b)}$ and $P_f^{(b)}$. Moreover, the number of states is a polynomial function of the number of bits, and thus the summations in Eqns. (22) and (23) have manageable computational complexity. The problem of computing $P_d^{(b)}$ and $P_f^{(b)}$ has been converted into one of estimating the density of states $g(s)$. An algorithm to estimate the density of states was proposed by Wang and Landau in [28] and is summarized in the Appendix. The main idea is to construct a Markov chain that has $\frac{1}{g(s)}$ as its stationary distribution, ensuring that all states are visited approximately equally often. An advantage of this "Wang-Landau" algorithm is that states with low probability of occurrence are also visited as often as high probability states, enabling us to estimate their probabilities accurately. We first use this algorithm [28] to estimate the density of states $g(s)$ and then compute $P_d^{(b)}$ and $P_f^{(b)}$ using Eqns. (22) and (23).

E. Error Probabilities for Overall Matching

Given the values of $P_d^{(b)}$ and $P_f^{(b)}$ obtained using the above technique, we now compute the probability of correct identification for the overall matching process. Consider the probability of false alarm $P_f$. When the multiple hypothesis test is approximated by a series of binary hypothesis tests, there is a false alarm in the overall matching if there is a false alarm in any of the individual binary hypothesis tests. The false alarm probability is thus given by

$$P_f = 1 - (1 - P_f^{(b)})^N \approx N P_f^{(b)}.$$

Now suppose that the query video is actually a modified version of $V_i$. A misclassification can occur if a false alarm occurs in the binary hypothesis test comparing the fingerprint of the query video $Y$ to $X^{(j)}$ for any $j < i$. Thus, the probability of misclassification can be bounded as:

$$P_m \le 1 - (1 - P_f^{(b)})^{N-1} \approx (N-1) P_f^{(b)}.$$

An incorrect decision happens when either a misclassification occurs, or a missed detection occurs in the binary hypothesis test involving $X^{(i)}$, implying that

$$1 - P_c \le 1 - P_d^{(b)} + P_m \quad \Rightarrow \quad P_c \ge P_d^{(b)} - N P_f^{(b)}.$$


Fig. 7. Relative error $\varepsilon(g_I(E))$, as a function of the energy $E$, in the estimation of the density of states for a $4 \times 4$ Ising model with periodic boundary conditions.

Fig. 8. Typical correlation structure among the fingerprint bits: correlation coefficients between (a) the $(1,1)$th bit, (b) the $(2,1)$th bit, (c) the $(2,2)$th bit and the remaining bits. The '*' denotes the bit under consideration.

Thus, given a desired overall probability of correct identification and false alarm, suitable values of $P_f^{(b)}$ and $P_d^{(b)}$ can be computed and used to choose the appropriate threshold in the binary hypothesis test.
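This inversion is straightforward bookkeeping; a minimal sketch, with an illustrative function name and example numbers, is given below.

\begin{verbatim}
def per_test_requirements(target_pf, target_pc, num_references):
    """Translate overall (Pf, Pc) targets into per-binary-test requirements,
    using Pf = 1 - (1 - Pf_b)^N and Pc >= Pd_b - N * Pf_b from the bounds above."""
    pf_b = 1.0 - (1.0 - target_pf) ** (1.0 / num_references)  # exact inversion of the union expression
    pd_b = target_pc + num_references * pf_b                  # required per-test detection probability
    return pf_b, pd_b

# Illustrative numbers: with N = 10**6 reference fingerprints, an overall
# Pf <= 1e-3 and Pc >= 0.95 require roughly Pf_b ~ 1e-9 and Pd_b ~ 0.951 per test.
\end{verbatim}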

VI. NUMERICAL EVALUATION

We use the MRF model coupled with the technique for computing $P_d^{(b)}$ and $P_f^{(b)}$ described in the previous section to study the influence of correlation among the fingerprint components on the overall detection performance. We focus on binary fingerprinting schemes and provide numerical results for the model described in Section V-B. As most binary fingerprint schemes generate equally likely (but not independent) bits, we set the parameter $h = 0$ in our simulations. This also helps reduce the parameter space from the 6-D space $(m_x, e_x, m_y, e_y, m_n, e_n)$ to the 4-D space $(e_x, e_y, m_n, e_n)$, as the expressions for the LLR and the probability distributions will not involve $m_x$ and $m_y$.

A. Density of States Estimation

We evaluate the accuracy of the estimation algorithm using known exact results for the density of energy states $g_I(E)$ of the 2-D Ising model [29]. To enable comparison, periodic boundary conditions are imposed on the graph $G_0$: the nodes $X_{1,j}$ in the top row are connected to the corresponding nodes $X_{M,j}$ in the bottom row, and the nodes in the first column are similarly connected to the nodes in the last column, so that every node is 4-connected. 4-connectivity is similarly achieved for the noise nodes


$\{N_{i,j}\}$. We then use the Wang-Landau algorithm to estimate the density of states $g(s) = g(e_x, e_y, m_n, e_n)$ by performing a random walk in the 4-D parameter space [28], and use the obtained $g(s)$ to estimate the density of energy states $g_I(E)$ by summing over the other variables and normalizing:
\[
g_I(E) = \frac{1}{2^{PQ}} \sum_{(e_y, m_n, e_n)} g(E, e_y, m_n, e_n).
\]
In our simulations, we use the parameters suggested in [28], and the maximum number of iterations is capped at $10^{10}$.

We measure the accuracy of estimation by computing the relative error $\varepsilon(g_I(E))$ in the estimate of the density of states, defined as $\varepsilon(x) = |x - x_{est}|/x$. Fig. 7 shows the relative error in the estimation of the density of states for a 2-D Ising model of size $4 \times 4$ with periodic boundary conditions. We observe from the figure that the maximum relative error is approximately $0.37\%$ and the mean relative error is $0.1\%$. These results demonstrate that accurate estimates of the density of states can be obtained using the Wang-Landau algorithm. The estimation accuracy can be further improved by tuning the parameters of the algorithm if necessary.

B. Performance of Correlated Fingerprints

To examine the performance of correlated fingerprints, we use the model without periodic boundary conditions, as practical fingerprints are not expected to have such periodic relationships. The nodes at the corners are only connected to their two closest neighbors, the remaining nodes at the borders are connected to their three closest neighbors, and all the other nodes are 4-connected.
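For reference, the edge sets used here and in Section VI-A can be constructed mechanically; the sketch below builds the 4-connected grid graph, with a flag selecting the periodic construction. The function name is ours.

\begin{verbatim}
def grid_edges(rows, cols, periodic=False):
    """Edge set E0 of the rows x cols grid model. Without periodic boundaries,
    corner nodes have two neighbors, other border nodes three, and interior
    nodes four; with periodic boundaries every node is 4-connected."""
    def index(r, c):
        return r * cols + c
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.append((index(r, c), index(r, c + 1)))
            elif periodic:
                edges.append((index(r, c), index(r, 0)))   # wrap last column to first
            if r + 1 < rows:
                edges.append((index(r, c), index(r + 1, c)))
            elif periodic:
                edges.append((index(r, c), index(0, c)))   # wrap bottom row to top
    return edges
\end{verbatim}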

The correlation among the fingerprint components is estimated from $10^8$ MCMC iterations, retaining only 1 out of every 100 iterations to reduce the effect of correlations between successive iterates in the MCMC simulations. Fig. 8 shows the correlation among the fingerprint bits for a $4 \times 4$ model, obtained by setting $\eta = 0.3$, $\alpha = 0.3$, and $\gamma = 0.1$. Fig. 8(a) shows the correlation between the $(1,1)$th bit (top left corner) and every other bit, while Figs. 8(b) and (c) show the same for the $(2,1)$th bit and the $(2,2)$th bit, respectively. Due to symmetry, other bits in corresponding positions have similar correlations. We observe that the correlation coefficient between each bit and its nearest neighbors is $\rho_x \approx 0.3$, and the correlation decays with distance. This is the typical correlation behavior observed in our model and reflects the correlation expected in practice: bits extracted from adjacent blocks are expected to be more correlated than bits extracted from blocks farther apart. We observe a similar correlation among the noise bits $N_{i,j}$ as well, as the models for these bits are similar.
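The correlation estimates above come from standard single-site Gibbs sampling with thinning. A minimal sketch is given below, assuming an Ising-form distribution $p(\mathbf{x}) \propto \exp(h\sum_i x_i + \eta\sum_{(j,k)\in E_0} x_j x_k)$, which is consistent with the definition of $E_{corr}$ above; the default iteration counts are much smaller than the $10^8$ iterations with 1-in-100 thinning used in our experiments and are chosen only to keep the example fast.

\begin{verbatim}
import numpy as np

def gibbs_correlation(rows=4, cols=4, eta=0.3, h=0.0,
                      n_sweeps=20000, burn_in=1000, thin=100, seed=0):
    """Estimate the correlation matrix of the fingerprint bits by Gibbs sampling
    a +/-1 MRF on a non-periodic 4-connected grid, keeping 1 of every `thin`
    sweeps after burn-in to reduce autocorrelation between successive samples."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(rows, cols))
    samples = []
    for sweep in range(n_sweeps):
        for r in range(rows):
            for c in range(cols):
                # local field from the 4-neighborhood (non-periodic boundaries)
                field = h
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        field += eta * x[rr, cc]
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))  # P(x_rc = +1 | neighbors)
                x[r, c] = 1 if rng.random() < p_plus else -1
        if sweep >= burn_in and sweep % thin == 0:
            samples.append(x.flatten().copy())
    return np.corrcoef(np.array(samples), rowvar=False)
\end{verbatim}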

Using the estimated density of states, we compute the probabilities $P_d^{(b)}$ and $P_f^{(b)}$ as described in Section V-D and study the effect of different parameters on the detection performance. Although errors in the estimation of the density of states will also affect the accuracy of the estimates of $P_d^{(b)}$ and $P_f^{(b)}$,


Fig. 9. Influence of $p_n$ on the detection performance for $4 \times 4$ bits per frame, $\rho_x = 0.2$ and $\rho_n = 0.2$: ROC curves for the LLR and Hamming distance detectors at $p_n = 0.12$ and $p_n = 0.2$.

we have shown in Section VI-A that these errors are small, and the accuracy can be improved by obtaining a better estimate of the density of states.

First, we examine the effect of the noise on the detection accuracy. We characterize the noise by the probability $p_n$ of a noise bit being '$-1$' (the equivalent of a binary '1' bit) and by the correlation $\rho_n$ among the noise bits, both of which are estimated from the MCMC trials. Fig. 9 shows the ROC curves for a fingerprint of size $4 \times 4$ bits with correlation $\rho_x = 0.2$ under two different values of $p_n$ and fixed $\rho_n = 0.2$, for detection using the log-likelihood ratio (LLR) statistic and the Hamming distance statistic. We observe that for a given noise level, the LLR statistic gives a 5-10\% higher $P_d^{(b)}$ at a given $P_f^{(b)}$ compared to the Hamming distance detector. As expected, the performance of either detector is worse when there is a higher probability of the noise changing the fingerprint bits.

Fig. 10 shows the influence of the noise correlation on the detection performance. The figure indicates that for a fixed correlation among the fingerprint bits $\rho_x = 0.2$ and a fixed marginal probability of the noise bits $p_n = 0.3$, detection using the LLR statistic is not significantly affected by the noise correlation. This is because the LLR takes the correlation among the noise bits into account. On the other hand, using the Hamming distance leads to some degradation in performance as the correlation increases. This can be explained by the fact that as the noise correlation increases, noise vectors with large Hamming weights become more probable, leading to more missed detections.

Next, we examine the influence of the correlation among the fingerprint bits on the detection accuracy. Fig. 11 shows the ROC curves for content identification using fingerprints of size $4 \times 4$ for different correlations, with the noise parameters $p_n = \rho_n = 0.2$. We again observe that detection using the LLR statistic, which compensates for the correlation among the fingerprint bits, is not significantly affected by the correlation. For the Hamming distance statistic, there is an increase in false alarms at a given $P_d^{(b)}$ as


Fig. 10. ROC curves for different noise correlations $\rho_n$ at fixed $p_n = 0.3$ and $\rho_x = 0.2$ (LLR and Hamming distance detectors, $\rho_n = 0.1, 0.2, 0.3$).

Fig. 11. Influence of the correlation among the fingerprint bits on the detection performance ($p_n = \rho_n = 0.2$): ROC curves for the LLR and Hamming distance detectors at $\rho_x = 0.2, 0.3, 0.4$.

the correlation among the fingerprints increases, as similar configurations with smaller distances become

more probable.

C. Simulation Results Using an Image Database

We compare the performance predicted by the theoretical analysis with simulation results obtained using an image database. For our experiments, we use a database of 1000 images downloaded from the Flickr photo hosting service by searching for the tag "panda". For extracting the fingerprints, each image is divided into 16 blocks in a $4 \times 4$ grid and the average luminance within each block is computed. The average luminance of each block is then quantized to one bit according to whether it is larger or smaller than the grayscale value 128, giving a 16-bit fingerprint for each image. Histogram equalization is then performed on the luminance component of the image, and the fingerprint is recomputed to obtain the noisy version of the fingerprint. The hypothesis test described in Sec. V-C is then performed


Fig. 12. Comparison of theoretical and simulation results for a database consisting of 1000 images: ROC curves for the LLR and Hamming distance detectors, theory and simulation.

using the noisy fingerprints. Additionally, 1000 pairs of original fingerprints are randomly chosen and compared to each other to obtain an estimate of the false alarm probability. We also estimate the Ising model parameters for the fingerprints and the noise using the least squares method proposed in [30] and obtain the theoretical predictions for the ROC curves as described in Section V-D.
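A sketch of the fingerprint extraction used in this experiment is given below. It relies on the Pillow imaging library and mirrors the block averaging, thresholding, and histogram equalization just described; the file name in the usage example is a placeholder.

\begin{verbatim}
import numpy as np
from PIL import Image, ImageOps

def block_luminance_fingerprint(img, grid=4, threshold=128):
    """Partition the luminance image into a grid x grid block layout, average
    the luminance in each block, and threshold it to obtain one bit per block."""
    lum = np.asarray(img.convert("L"), dtype=float)
    h, w = lum.shape
    bits = np.zeros(grid * grid, dtype=np.uint8)
    for r in range(grid):
        for c in range(grid):
            block = lum[r * h // grid:(r + 1) * h // grid,
                        c * w // grid:(c + 1) * w // grid]
            bits[r * grid + c] = 1 if block.mean() > threshold else 0
    return bits

# Original fingerprint and its "noisy" version after histogram equalization:
# img = Image.open("panda_0001.jpg")   # placeholder file name
# x = block_luminance_fingerprint(img)
# y = block_luminance_fingerprint(ImageOps.equalize(img.convert("L")))
\end{verbatim}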

Fig. 12 compares the ROC curves obtained from theory and simulation for the LLR detector and the Hamming distance based detector. The figure shows that the simulation results agree very well with the theoretical predictions. In simulations with some other databases, we observed that when the MRF model does not accurately capture the distribution of the fingerprints, there is a discrepancy between the theoretical predictions and the simulation results. In our future work, we will develop better techniques to predict the performance of practical fingerprinting schemes.

VII. CONCLUSIONS

In this paper, we have analyzed content identification using fingerprints under a hypothesis testing framework. We first considered the case of fingerprinting schemes that generate i.i.d. equally likely bits and modeled distortions on the host content as altering each fingerprint bit independently with probability $p$. We derived expressions for the probability of correct identification under this model and studied the tradeoff among the number of fingerprints, the robustness, the identification performance, and the length of the fingerprints. To understand the fundamental limits on the identification capability, we next derived bounds on the achievable error probabilities and characterized the tradeoff between the detection probability and the number of fingerprints in terms of the error exponents. We then derived guidelines for choosing the fingerprint length to attain a desired performance objective and provided an interpretation of our results from a joint source-channel coding perspective.

To better predict the performance of practical fingerprinting schemes that have correlated components, we proposed a Markov Random Field model that captures local correlations among individual fingerprint


bits. Under this model, we examined fingerprint matching as a hypothesis testing problem and proposed a statistical physics inspired approach to compute the probability of detection and the probability of false alarm. Our analysis showed that Hamming distance based detection, which is commonly employed in many applications, is suboptimal in this setting and is susceptible to correlations among the fingerprint bits or the noise. The optimal log-likelihood ratio detector provides 5-20\% higher detection probability, and its detection accuracy remains relatively stable across different correlations among the fingerprint and noise components. Simulation results using an image database corroborate our theoretical results.

Establishing models to facilitate performance studies of practical fingerprints and developing an understanding of the limits of the identification accuracy achievable using fingerprints are the main contributions of this paper. Results from our modeling and analysis not only provide qualitative guidelines for content fingerprinting algorithm and system design, but, to the best of our knowledge, these studies for the first time allow a quantitative understanding of the impact of major design and operational parameters on the identification performance. The results have been presented mainly in the context of content fingerprinting, but they are also applicable to many other applications, such as biometrics based identification. For example, Vetro et al. recently showed that, by suitably transforming minutiae-based features extracted for human fingerprint matching, the biometric data can be transformed into i.i.d. Bernoulli(0.5) bits, and that the distortions caused while recapturing the fingerprint can be modeled by a binary symmetric channel [21]. The results derived in Section III would then be directly applicable to analyzing this biometric identification problem.

ACKNOWLEDGEMENT

The authors thank Mr. Suriyanarayanan Vaikuntanathan at the University of Maryland for suggesting the reference on the Wang-Landau algorithm for estimating the density of states.

REFERENCES

[1] Wall Street Journal, "YouTube removes 30,000 files amid Japanese copyright concerns." [Online]. Available: http://online.wsj.com/article/SB116133637777798831.html
[2] C.-W. Chen, R. Cook, M. Cremer, and P. DiMaria, "Content identification in consumer applications," in Proc. of IEEE Int. Conf. on Multimedia & Expo, Jul. 2009, pp. 1536–1539.
[3] M. Barni and F. Bartolini, "Data hiding for fighting piracy," IEEE Signal Process. Mag., vol. 21, no. 2, pp. 28–39, Mar. 2004.
[4] J. Fridrich and M. Goljan, "Robust hash functions for digital watermarking," in Int. Conf. on Information Technology: Coding and Computing, 2000, pp. 178–183.
[5] A. Swaminathan, Y. Mao, and M. Wu, "Robust and secure image hashing," IEEE Trans. on Information Forensics and Security, vol. 1, no. 2, pp. 215–230, Jun. 2006.
[6] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys, no. 2, pp. 1–60, 2008.


[7] S.-F. Chang, Q. Huang, T. Huang, A. Puri, and B. Shahraray, "Multimedia search and retrieval," in Multimedia Systems, Standards, and Networks, A. Puri and T. Chen, Eds. New York: Marcel Dekker, 2000.
[8] T. Sikora, "The MPEG-7 visual standard for content description - an overview," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 696–702, Jun. 2001.
[9] J. Haitsma, T. Kalker, and J. Oostveen, "Robust audio hashing for content identification," in Int. Workshop on Content-Based Multimedia Indexing, Brescia, Italy, Sept. 2001.
[10] J. Oostveen, T. Kalker, and J. Haitsma, "Feature extraction and a database strategy for video fingerprinting," in Proc. of the Int. Conf. on Recent Advances in Visual Information Systems, Lecture Notes in Computer Science, vol. 2314, 2002, pp. 117–128.
[11] R. Mohan, "Video sequence matching," in IEEE Conf. on Acoustics, Speech and Signal Processing, vol. 6, May 1998, pp. 3697–3700.
[12] S. Baluja and M. Covell, "Content fingerprinting using wavelets," in Proc. IET Conf. on Multimedia, London, England, Nov. 2006.
[13] R. Radhakrishnan and C. Bauer, "Video fingerprinting based on moment invariants capturing appearance and motion," in Proc. of IEEE Int. Conf. on Multimedia & Expo, 2009, pp. 1532–1535.
[14] B. Coskun, B. Sankur, and N. Memon, "Spatio-temporal transform based video hashing," IEEE Trans. on Multimedia, vol. 8, no. 6, pp. 1190–1208, Dec. 2006.
[15] J. Lu, "Video fingerprinting for copy identification: From research to industry applications," in Proc. SPIE/IS&T Media Forensics and Security, San Jose, CA, Jan. 2009.
[16] E. McCarthy, F. Balado, G. Silvestre, and N. Hurley, "A framework for soft hashing and its application to robust image hashing," in IEEE Int. Conf. on Image Processing, vol. 1, Oct. 2004, pp. 397–400.
[17] S. Voloshynovskiy, O. Koval, F. Beekhof, and T. Pun, "Robust perceptual hashing as classification problem: Decision-theoretic and practical considerations," in IEEE Workshop on Multimedia Signal Processing, Oct. 2007, pp. 345–348.
[18] F. Willems, T. Kalker, J. Goseling, and J.-P. Linnartz, "On the capacity of a biometrical identification system," in IEEE Int. Symp. on Information Theory, June 2003, p. 82.
[19] H. V. Poor, An Introduction to Signal Detection and Estimation, 2nd ed. Springer, 1994.
[20] A. L. Varna and M. Wu, "Theoretical modeling and analysis of content identification," in Proc. of IEEE Int. Conf. on Multimedia & Expo, Jul. 2009.
[21] A. Vetro, S. C. Draper, S. Rane, and J. S. Yedidia, "Securing biometric data," in Distributed Source Coding, P. L. Dragotti and M. Gastpar, Eds. Academic Press, Jan. 2009, ch. 11, pp. 293–323.
[22] R. M. Roth, Introduction to Coding Theory. Cambridge University Press, 2006.
[23] A. Shwartz and A. Weiss, Large Deviations for Performance Analysis. Chapman and Hall, 1995.
[24] R. Kinderman and J. L. Snell, Markov Random Fields and their Applications. American Mathematical Society, 1980.
[25] A. K. Jain, Fundamentals of Digital Image Processing. Prentice-Hall, 1989.
[26] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[27] A. L. Varna, A. Swaminathan, and M. Wu, "A decision-theoretic framework for analyzing binary hash-based content identification systems," in Proc. ACM Workshop on Digital Rights Management, Oct. 2008, pp. 67–76.
[28] F. Wang and D. P. Landau, "Efficient, multiple-range random walk algorithm to calculate the density of states," Physical Review Letters, vol. 86, no. 10, pp. 2050–2053, Mar. 2001.
[29] P. D. Beale, "Exact distribution of energies in the two-dimensional Ising model," Physical Review Letters, vol. 76, pp. 78–81, 1996.
[30] H. Derin and H. Elliott, "Modeling and segmentation of noisy and textured images using Gibbs random fields," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 9, no. 1, pp. 39–55, Jan. 1987.


APPENDIX
WANG-LANDAU SAMPLING

In statistical physics, the density of states $g(E)$ is an important quantity that enables the computation and characterization of various thermodynamic properties of a physical system. Given a physical system that may exist in different configurations or states, the density of states $g(E)$ is defined as the number of states that have a given energy $E$. An advantage of using the density of states to determine properties of a physical system is that $g(E)$ is independent of temperature and can be used to compute thermodynamic properties of the system at any temperature. Traditional MCMC methods do not allow accurate estimation of the density of states $g(E)$ directly. Wang and Landau [28] proposed a technique for estimating the density of states by performing a random walk in the energy space that results in a flat histogram of visited energies. We illustrate the algorithm using the example of a physical system with spins $\pm 1$.

Initialize the system randomly and start with an initial value for the density of states, e.g., $g(E) = 1$ for all $E$. At each iteration, a random spin is flipped with probability $p(E \to E') = \min\left(\frac{g(E)}{g(E')}, 1\right)$, where $E$ is the energy of the current state and $E'$ would be the energy of the resultant state if the spin were flipped. After this trial, if $E^*$ is the energy of the resultant state, the density of states is updated as $g(E^*) \leftarrow g(E^*) \times f_i$, where $f_i$ is an update factor. Initially, $f_0$ is chosen to be large enough that all energy levels are visited quickly, e.g., $f_0 = e$. A histogram of the number of times each energy level has been visited is also maintained. Once this histogram is "flat enough," $f_i$ is reduced and the histogram is reset. This process is continued until $f_i$ becomes small enough, e.g., $f_i < \exp(10^{-8})$. The $g(E)$ obtained after convergence is relative and is normalized to obtain an estimate of the density of states.
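One possible implementation of this procedure is sketched below. As is common in practice, it stores $\log g(E)$ rather than $g(E)$ to avoid overflow and halves $\log f$ when the histogram is flat (the usual $f \leftarrow \sqrt{f}$ update); the energy function, the single-spin-flip proposal, and the flatness criterion used here are illustrative choices left to the caller.

\begin{verbatim}
import math
import random

def wang_landau(energy_level, propose_flip, x0, n_levels,
                log_f_final=1e-8, flat_frac=0.8, sweep=10000):
    """Wang-Landau estimation of log g(E) over discrete energy levels 0..n_levels-1.

    energy_level(x) -> integer level index of configuration x
    propose_flip(x) -> (x_new, level_new) for a random single-spin flip
    Returns the relative log-density of states (normalize afterwards as needed)."""
    log_g = [0.0] * n_levels          # log g(E), initialized to log(1) = 0
    hist = [0] * n_levels             # visit histogram for the flatness check
    log_f = 1.0                       # initial modification factor f0 = e
    x, e = x0, energy_level(x0)
    while log_f > log_f_final:
        for _ in range(sweep):
            x_new, e_new = propose_flip(x)
            delta = log_g[e] - log_g[e_new]
            # accept with probability min(1, g(E)/g(E'))
            if delta >= 0 or random.random() < math.exp(delta):
                x, e = x_new, e_new
            log_g[e] += log_f         # update the density of the resultant state
            hist[e] += 1
        visited = [h for h in hist if h > 0]
        if visited and min(visited) > flat_frac * (sum(visited) / len(visited)):
            log_f /= 2.0              # corresponds to f <- sqrt(f)
            hist = [0] * n_levels     # reset the histogram
    return log_g
\end{verbatim}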

The above algorithm can also be used to estimate the density of states over multiple parameters, e.g., $g(M, E)$ as a function of the magnetization $M = \sum_i x_i$ and the energy $E$, by performing the random walk in the appropriate parameter space. For large systems, the parameter space can be divided into several regions, and independent random walks can be performed over each region for faster convergence. The overall density of states can then be reconstructed from these individual estimates by ensuring continuity at the boundaries [28].
