Genomic Privacy and Limits of Individual Detection in a Pool: Supplementary Material

Sriram Sankararaman1,∗, Guillaume Obozinski2,∗, Michael I. Jordan1,2 and Eran Halperin3

Affiliation:

1. Computer Science Division, University of California Berkeley, Berkeley, CA 94720, USA

2. Department of Statistics, University of California Berkeley, Berkeley, CA 94720, USA

3. International Computer Science Institute, 1947 Center St., Berkeley, CA 94704, USA

∗ These authors contributed equally to this work.


Nature Genetics: doi:10.1038/ng.436

Contents

Supplementary Figures
Supplementary Methods
Supplementary Note
    Empirical behavior of the LR test
        E1 Experiments on simulated and WTCCC data
            E1.1 Simulated data
            E1.2 WTCCC
        E2 Discrimination between pools
        E3 Genotyping errors decrease the power of the approximate LR test
        E4 Transferability of the LR-test properties across populations
        E5 Detecting relatives
        E6 Validity of Equation 1
    Theoretical Analysis of the LR test
        T1 Presentation of the log-likelihood ratio statistics
            T1.1 A model for independent SNPs in the haplotype case
            T1.2 The likelihood-ratio statistic
        T2 Overall summary of the analysis
            T2.1 Main results
            T2.2 Behavior of the LR test under variations on the setting
        T3 Guarantees provided by the Neyman-Pearson lemma
        T4 Analysis of the exact LR statistic for large $n$
            T4.1 Approximation to the LR statistic
            T4.2 The LR test when allele frequencies have to be estimated
            T4.3 Discrimination between two pools
        T5 The model for genotypes
            T5.1 Main analysis
        T6 Genotyping errors
        T7 Detecting relatives
        T8 Technical Results
            T8.1 Simultaneous central limit theorem in $m$ and $n$
            T8.2 Lindeberg-Feller CLT
        A Appendix
            A1 Proof for the approximation to the log
            A2 Concentration of binomial random variables
            A3 Bounds on moments of $1/\hat{p}$
            A4 LR moment bounds for large deviations of the binomial


Supplementary Figures

[Figure S1 plots: two ROC panels (left and right). x-axis: false positive rate (ticks from −3 to 0); y-axis: power (0 to 1). Legend: Homer et al., LR, LR theory.]

Figure S1: ROC curves comparing the power attained by the approximate LR-test and by the statistic used in Homer et al. [1] when applied to all 33138 independent, common SNPs in the WTCCC data, under two different settings. (Left) In the setting studied in this paper, individuals in the pool and the finite reference dataset, and the individual of interest, are all sampled independently from the same distribution under the null hypothesis. Under the alternative hypothesis, the tested individual is randomly sampled from the pool. (Right) The setting considered in [1] has the identical alternative hypothesis whereas, under the null hypothesis, the individual is randomly sampled from the finite reference dataset. Both the LR statistic and the statistic proposed in [1] are markedly more powerful in the second setting. When extrapolated to a false positive level of $10^{-6}$, the power (both theoretical and empirical) is less than 0.5 (0.30 for the approximate LR-test and 0.47 for the LR theory) in the first setting. In contrast, at the same false positive level, the approximate LR test and LR theory attain a power of 0.95 and 0.99 respectively, in the second setting.

[Figure S2 plots: two ROC panels (left and right). x-axis: false positive rate (ticks from −3 to 0); y-axis: power (0 to 1).]

Figure S2: ROC curves comparing the power attained by the approximate LR-test and by the statistic used in Homer et al. [1] when applied to a subset of 10000 independent, common SNPs from the WTCCC data, under two different settings (see the caption of Figure S1 for descriptions of the settings). For a false positive level of $10^{-3}$, the power of either test is at least doubled in the second setting.

[Figure S3 plot: ROC curves. x-axis: false positive rate (log base 10, ticks from −3 to 0); y-axis: power (0 to 1). Legend: LR (theory), LR (no error), LR (1% error), LR (10% error).]

Figure S3: ROC curves for the approximate LR-test under different genotyping error rates. As the genotyping error rate increases from 1% to 10%, the power of the LR-test (constructed with the assumption of no genotyping errors) decreases significantly. We used the set of all 33138 independent, common SNPs in the WTCCC dataset.

[Figure S4 plot: ROC curves. x-axis ticks from −3 to 0; y-axis: power (0 to 1). Legend: LR and LR theory for gamma = 1, 0.5, 0.25 and 0.125.]

Figure S4: ROC curve of the approximate LR-test on the task of detecting relatives. The power to detect relatives is considerably smaller than the power to detect the tested individual: it decreases significantly even when testing first-order relationships ($\gamma = \frac{1}{2}$ for siblings and parents). We used the set of all 33138 independent, common SNPs from the WTCCC data.

[Figure S5 plot: ROC curves. x-axis ticks from −3 to 0; y-axis: power (0 to 1). Legend: LR (independent SNPs), LR (theory), LR (all SNPs).]

Figure S5: ROC curve of the approximate LR test, constructed under a model of independent SNPs, applied on all 358,053 dependent SNPs from the WTCCC data. The power of the test decreases slightly when SNPs in linkage disequilibrium are included. We compare the power of the approximate LR test applied on the 358,053 SNPs from the WTCCC data (the set of all SNPs published in the WTCCC study with MAF > 0.05) to the test applied on the set of 33138 independent, common SNPs.

[Figure S6 plot: x-axis: $n$ (0 to 1500); y-axis: means and variances (−1 to 3). Legend: $n\mu_0$, $n\mu_1$, $n\sigma_0^2$, $n\sigma_1^2$.]

Figure S6: Rescaled means and variances of the LR-test. Means ($\mu_0$, $\mu_1$) and variances ($\sigma_0^2$, $\sigma_1^2$) of the exact LR statistic, under the null and the alternative, computed numerically from the frequencies of the 33138 independent, common SNPs in the WTCCC data. The means and variances have been rescaled by the pool size $n$ so that the rescaled values $n\mu_0$, $n\mu_1$, $n\sigma_0^2$ and $n\sigma_1^2$ approach −0.5, 0.5, 1 and 1, respectively, as $n$ becomes large.

[Figure S7 plot: x-axis: $n$ (0 to 1500); y-axis: $m/n$ (8 to 26).]

Figure S7: Validity of the theoretical analysis of the LR-test. Ratio $\frac{m}{n}$ of the smallest number $m$ of SNPs necessary to construct a test with false positive rate $\alpha = 0.01$ and power $\beta = 0.99$, as a function of the pool size $n$. This curve was computed using the 33138 independent, common SNPs from the WTCCC dataset. The asymptote of the curve (dotted line) corresponds to the value of $\frac{m}{n}$ obtained from Equation 1, which is a constant that depends on $\alpha$ and $\beta$. (The shape of the curve is invariant to a common scaling of $\alpha$ and $\beta$.) Note that the asymptotic behavior is attained for pool sizes $n$ as small as 100, implying that Equation 1 is valid for pools of size larger than 100.



Supplementary Methods


Model assumptions. In association studies, individuals in the pool are assumed to be chosen randomly from a pure population. For a pool of $n$ individuals we expose $m$ SNPs, for which the allele frequencies in the population and the pool are $p_1, \dots, p_m$ and $\hat{p}_1, \dots, \hat{p}_m$, respectively. In the models we consider, we assume that the SNPs are independent. This is motivated by the fact that, in practice, the SNPs that we choose to expose can be selected sufficiently far apart on the chromosome that they can be considered independent; moreover, this assumption makes our theoretical analysis tractable. We also assume that the SNP-allele frequencies are bounded away from zero and one; i.e., there exists $a > 0$ such that $a \le p_j \le 1 - a$ for all $j \in \{1, \dots, m\}$. This is a natural assumption because it is usually the case that only those SNPs whose minor alleles are sufficiently well represented in the population are considered in association studies; moreover, the exposed SNPs can be explicitly selected to have a prespecified minimal minor allele frequency.

Hypotheses. To construct a likelihood ratio test, we must first specify the models corresponding to the null and the alternative hypotheses respectively. Since the SNPs are assumed independent, we describe the model for a single SNP.

Null hypothesis: We assume that the pool is constituted of $n$ individuals drawn independently from a reference population, in which the SNP-allele frequency is $p$ (two alleles are drawn independently for each individual). We assume that the pool frequency $\hat{p}$ for that SNP is obtained by averaging the binary values of the alleles of all individuals, so that $2n\hat{p}$ is a binomial random variable, $\mathrm{Bin}(2n, p)$. The two alleles of the individual of interest, i.e., the individual whose genotype is being tested for presence in the pool, are drawn independently from a Bernoulli variable with parameter $p$, since, under the null, that individual is drawn independently of the pool from the same reference population.

Alternative hypothesis: We assume that the pool is constituted of the individual of interest, whose alleles are drawn from a Bernoulli variable with parameter $p$, merged with a pool of $n - 1$ individuals obtained as under the null. Thus $\hat{p}$ is the average of the $2n - 2$ alleles of the $n - 1$ individuals in the pool and the two alleles of the individual of interest. For moderately large $n$ the model can be approximated by a simpler model which consists in sampling a pool of size $n$, computing the allele frequency $\hat{p}$ in the pool, and drawing the two alleles of the individual of interest as Bernoulli with parameter $\hat{p}$.
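The two sampling models above can be sketched for a single SNP as follows. This is a minimal illustration, not the authors' simulation code; the function names and the use of Python's `random.Random` generator are assumptions made here.

```python
import random

def draw_genotype(p, rng):
    """Draw a genotype (0, 1 or 2 allele copies) as two independent Bernoulli(p) alleles."""
    return (rng.random() < p) + (rng.random() < p)

def simulate_snp(n, p, in_pool, rng):
    """Return (x, p_hat): the tested genotype and the pool allele frequency.

    in_pool=False follows the null: the individual is drawn independently
    of a pool of n individuals.
    in_pool=True follows the alternative: the individual is merged with
    n - 1 other individuals drawn as under the null.
    """
    x = draw_genotype(p, rng)
    n_others = n - 1 if in_pool else n
    allele_count = sum(draw_genotype(p, rng) for _ in range(n_others))
    if in_pool:
        allele_count += x  # the individual's two alleles contribute to the pool
    p_hat = allele_count / (2 * n)  # average over the 2n alleles of the pool
    return x, p_hat
```

For example, `simulate_snp(1000, 0.3, True, random.Random(0))` draws one SNP under the alternative for a pool of 1000 individuals.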

The LR-test. For an individual with genotype $(x_1, \dots, x_m) \in \{0, 1, 2\}^m$, the LR-test is based on the log likelihood ratio statistic:

$$L = \sum_{j=1}^{m} \sum_{k=0}^{2} 1_{\{x_j = k\}} \log \frac{\hat{\pi}_j^k}{\pi_j^k}, \tag{M1}$$

where $1_{\{x_j = k\}}$ is 1 if $x_j = k$ and 0 otherwise, and $\pi_j^k$ and $\hat{\pi}_j^k$ are the genotype frequencies in the population and in the pool, derived from $p_j$ and $\hat{p}_j$, respectively, under an assumption of Hardy-Weinberg equilibrium.
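As an illustration of Equation (M1), the statistic can be computed as in the sketch below. It assumes that each genotype counts the copies of the allele whose frequency is $p_j$, so that the Hardy-Weinberg genotype frequencies are $(1-p)^2$, $2p(1-p)$ and $p^2$; the function names are hypothetical.

```python
import math

def hwe_genotype_freqs(p):
    """Genotype frequencies for 0, 1 and 2 allele copies under Hardy-Weinberg equilibrium."""
    return ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)

def genotype_llr(genotypes, pop_freqs, pool_freqs):
    """Log likelihood ratio of Equation (M1).

    For each SNP, only the term with k equal to the observed genotype x_j
    survives the indicator, so the sum reduces to log(pi_hat[x_j] / pi[x_j]).
    Pool frequencies equal to 0 or 1 can make a ratio zero, in which case
    math.log raises ValueError.
    """
    total = 0.0
    for x, p, p_hat in zip(genotypes, pop_freqs, pool_freqs):
        pi = hwe_genotype_freqs(p)
        pi_hat = hwe_genotype_freqs(p_hat)
        total += math.log(pi_hat[x] / pi[x])
    return total
```

When the pool frequencies equal the population frequencies the statistic is zero; it is positive when the individual's genotypes are more likely under the pool frequencies than under the population frequencies.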

We note that the LR-test is an abstract test that cannot be constructed exactly in practice since it requires knowledge of the population allele frequencies $p_j$. In practice, these frequencies can only be estimated from an independent reference dataset drawn from the same population. We therefore differentiate the exact LR-test from the approximate LR-test, in which an estimate of the allele frequencies is substituted for $p_j$. An important property of the exact test is that the Neyman-Pearson lemma [2] guarantees that the power of any test, whether based on known population frequencies or not, cannot be better than that of the exact LR-test. By analytically characterizing the power of the exact LR-test, for large pools and common SNPs, we can bound the power $\beta$ of any test as a function of $m$, $n$, and $\alpha$.

Detection in a single pool vs. discrimination between pools. Our experiments demonstrate that there is a discrepancy between the power achieved by the LR-test and the power achieved by the test described in [1]. This stems from the fact that the two experiments were based on different hypotheses. In [1], the pool was assumed to be generated by random sampling of $n$ individuals from the distribution defined by $p_1, \dots, p_m$. The alternative hypothesis in our study and in [1] is that the pool contains the tested individual. Where the two studies differ is in the definition of the null hypothesis: in our case, under the null hypothesis the tested individual is randomly picked from the general population, while in [1], the individual is assumed to be randomly sampled from the finite reference dataset. This seemingly subtle difference between the two null hypotheses leads to quite different results because the finite reference dataset is small (currently < 5000 individuals).

Consider, for example, an extreme case in which the population consists of 10 individuals of which 5 are in the pool and the rest in the reference dataset; it is easy to detect any particular individual in this case. Detection is harder when the population consists of 1 million individuals of which 5 are in the pool. It becomes even harder if, out of these 1 million individuals, only a random set of 5 are available in the reference dataset, which corresponds to the situation occurring in practice.

In practice, we show that the power attained using the null hypothesis of [1] more than doubles at a false positive rate of $10^{-6}$ (see supplementary note).

Summary of the Analysis. For clarity, we give an overview of the analysis for a haploid individual; the case of genotypes is slightly more technical and is developed in the supplementary note. For haploids, the LR statistic is

$$L = \sum_{j=1}^{m} \left[ x_j \log \frac{\hat{p}_j}{p_j} + (1 - x_j) \log \frac{1 - \hat{p}_j}{1 - p_j} \right]. \tag{M2}$$
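A direct transcription of Equation (M2) might look like the following sketch, in which each $x_j$ is a single binary allele. The function name is an assumption made for illustration.

```python
import math

def haploid_llr(alleles, pop_freqs, pool_freqs):
    """Haploid log likelihood ratio of Equation (M2).

    alleles[j] is 0 or 1; pop_freqs[j] is p_j and pool_freqs[j] is the
    pool frequency p_hat_j. A pool frequency of exactly 0 or 1 can make
    a ratio zero, in which case math.log raises ValueError.
    """
    total = 0.0
    for x, p, p_hat in zip(alleles, pop_freqs, pool_freqs):
        if x == 1:
            total += math.log(p_hat / p)
        else:
            total += math.log((1 - p_hat) / (1 - p))
    return total
```

An allele that is over-represented in the pool relative to the population contributes a positive term when the individual carries it, and a negative term otherwise.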

The Neyman-Pearson lemma guarantees that no test can have larger power than the likelihood ratio test. Thus, characterizing the power of the LR test, as a function of the pool size $n$, the number of independent SNPs $m$ and a tolerable false positive rate $\alpha$, determines the largest power $\beta$ achievable for any test given $(m, n, \alpha)$; conversely, it also determines the maximal value of $m$ for which no $(\alpha, \beta)$-test can be obtained for a pool of size $n$.

The exact LR-test cannot be constructed in practice since it requires knowledge of the true SNP-allele frequencies. In practice, the test performed will be the approximate LR-test, where the allele frequencies in the population are estimated from a reference dataset. Nonetheless, we analyze the exact LR-test because it provides an upper bound on the power of any test, whether it uses the true frequencies or not. We discuss the case in which frequencies need to be estimated in the supplementary note.

If $n$ is larger than 100 and the minor allele frequency is greater than 0.05, the exact LR statistic can be shown to be very well approximated under both hypotheses by the simpler statistic

$$\sum_{j=1}^{m} \left[ \frac{1}{\sqrt{n}} \frac{x_j - p_j}{\sqrt{p_j (1 - p_j)}} Z_j - \frac{1}{2n} \frac{(x_j - p_j)^2}{p_j (1 - p_j)} Z_j^2 \right], \tag{M3}$$

where $Z_j$ are standard Gaussian variables. The statistic in Equation (M3) can be analyzed easily: provided $n$ is moderately large, each term in the sum has mean $\mu_0 = -\frac{1}{2n}$ under the null and $\mu_1 = +\frac{1}{2n}$ under the alternative, and variance $\sigma_0^2 = \sigma_1^2 = \frac{1}{n}$ in both cases. Moreover, for $m$ moderately large and MAF not too small (MAF > 0.05), the distribution of the exact LR statistic is itself approximately Gaussian and, for a Gaussian test, the relationship between sample size $m$, power $\beta$, and false positive rate $\alpha$ is $m\mu_0 + z_\alpha \sigma_0 \sqrt{m} = m\mu_1 - z_{1-\beta} \sigma_1 \sqrt{m}$. In our case this yields the fundamental relation

$$z_\alpha + z_{1-\beta} \approx \sqrt{\frac{m}{n}}. \tag{M4}$$

Note that this result is independent of the allele frequencies provided MAF > 0.05. As a consequence, for pools of size greater than 100, if $m \le (z_\alpha + z_{1-\beta})^2 n$, any test of level $\alpha$ is guaranteed to have power no larger than $\beta$. For small pools, $\mu_0$, $\mu_1$, $\sigma_0^2$ and $\sigma_1^2$ can be computed algorithmically and the power can still be computed exactly, even though no simple analytical expression is available (the power would depend in this case on all SNP frequencies).
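The bound can be evaluated numerically, as in the sketch below. It assumes that $z_\alpha$ denotes the upper $\alpha$ quantile of the standard Gaussian, i.e. $P(Z > z_\alpha) = \alpha$, which is the convention under which the Gaussian relation stated above holds; the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def upper_quantile(tail_prob):
    """Return z such that P(Z > z) = tail_prob for a standard Gaussian Z."""
    return _STD_NORMAL.inv_cdf(1.0 - tail_prob)

def max_power(m, n, alpha):
    """Largest power beta compatible with z_alpha + z_{1-beta} = sqrt(m / n)."""
    return _STD_NORMAL.cdf((m / n) ** 0.5 - upper_quantile(alpha))

def snps_needed(n, alpha, beta):
    """Value of m = (z_alpha + z_{1-beta})^2 * n for a pool of size n."""
    return (upper_quantile(alpha) + upper_quantile(1.0 - beta)) ** 2 * n
```

For instance, with $\alpha = 0.01$ and $\beta = 0.99$ the factor $(z_\alpha + z_{1-\beta})^2$ is about 21.6, so a pool of 1000 individuals corresponds to roughly 21,600 independent SNPs.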

A virtue of having reduced the analysis of the LR-test to the analysis of Equation (M3) is that it can be used to obtain insight into the behavior of the LR-test under various interesting alternative scenarios, which are discussed briefly at the end of the methods and in more detail in the supplementary note:

• In general, the frequencies $p_j$ need to be estimated. The power drops due to the estimation procedure and we characterize the drop in power for the approximate LR-test in the supplementary note. In particular, if the reference dataset used to estimate $p_j$ has the same size as the pool, the number of SNPs needed to reach the same power is doubled.

• The LR-test for the detection of an individual in a pool is very similar to the test for discrimination between two pools, though the test for discrimination between pools has higher power. We analyze this case with the same tools and show that if both pools have the same size, the necessary number of SNPs is halved. Combined with the drop in power due to estimating $p_j$ mentioned above, detection of an individual in a pool needs four times as many SNPs as discriminating between two pools.

• Genotyping errors only decrease the power of the optimal test. This is intuitively clear since genotyping errors would be expected to make it harder to match an individual's genotype to that in the pool. We show analytically (see supplementary note) that this is true under a very broad assumption that the errors are generated by the same mechanism under the null and the alternative hypothesis.

• Detecting a relative of the individual of interest in the pool is also done at the expense of a drop in power. We show analytically that for a given power, detecting a sibling instead of the individual of interest requires four times as many SNPs.

Experimental setup. We compared the power of the approximate LR-test and the power of the statistic used in [1] empirically. We also compared the empirical results to the theoretical prediction, using a variant of Equation (M4) with a correction for the finite size of the reference dataset (see supplementary note for the derivation of the correction). The only change arising from the correction is that the factor $\frac{1}{n}$ in Equation (M4) is replaced by $\frac{1}{n}\left(1 - \frac{n}{\tilde{n}}\right)$, where $\tilde{n}$ is the sum of the number of individuals in the pool and the reference dataset. We use this corrected version of Equation (M4) whenever we compare the approximate LR to our theoretical calculations.
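The corrected relation can be turned into a small power calculator, sketched below. It assumes the same upper-quantile convention for $z_\alpha$ as a standard one-sided Gaussian test; `n_total` stands for the combined number of individuals in the pool and the reference dataset, and the function name is hypothetical.

```python
from statistics import NormalDist

def corrected_power(m, n, n_total, alpha):
    """Theoretical power from Equation (M4) with the finite-reference correction.

    The factor 1/n is replaced by (1/n) * (1 - n / n_total), so that
    z_alpha + z_{1-beta} = sqrt((m / n) * (1 - n / n_total)).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)  # upper alpha quantile
    effective_ratio = (m / n) * (1.0 - n / n_total)
    return std_normal.cdf(effective_ratio ** 0.5 - z_alpha)
```

With the WTCCC figures quoted in this supplement ($m = 33138$, $n = 1000$ and 2937 individuals in total), `corrected_power(33138, 1000, 2937, 1e-6)` evaluates to about 0.47. When the reference dataset has the same size as the pool (`n_total = 2 * n`), the correction factor is 1/2, which is consistent with the doubling of the number of SNPs mentioned in the text.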

To evaluate the power of the approximate LR-test empirically, we created pools containing $n = 1000$ genotypes. Allele frequencies were computed for the pool. Individuals not part of the pool were used as a reference dataset to estimate the population allele frequencies. Under the null hypothesis (where the individual is not present in the pool), we pick an individual from the pool, remove the contribution of the individual to the allele frequencies for the pool and then compute the statistic for this individual; note that under the null, the individual is neither present in the pool nor in the sample of individuals outside the pool. Under the alternative hypothesis, we simply pick an individual from the pool and compute the statistic for this individual.
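The step of removing an individual's contribution from the pool frequencies can be sketched as follows. This assumes that the pool frequency at a SNP is the plain average over the $2n$ alleles of the pool, as in the model above; the function name is hypothetical.

```python
def remove_from_pool(pool_freq, genotype, n):
    """Pool allele frequency at one SNP after removing one individual.

    pool_freq is the allele frequency over the 2n alleles of the pool and
    genotype is the removed individual's allele count (0, 1 or 2).
    """
    if n < 2:
        raise ValueError("the pool must keep at least one individual")
    remaining_count = 2 * n * pool_freq - genotype
    return remaining_count / (2 * (n - 1))
```

For a pool of three individuals with genotypes 2, 1 and 0, the pool frequency is 0.5; removing the individual with genotype 2 leaves a frequency of 0.25.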

Experiments on simulated data. The ROC curves display the power of the approximate LR test, the power of the statistic used in [1] and the theoretical power for the approximate LR-test. We first computed the ROC curves on simulated data; we simulated independent SNPs for values of $n = 1000$ and $m = 1000, 10000$. The allele frequencies were picked independently from a Beta distribution fitted to allele frequencies in the range $[0.05, 0.95]$ found in the HapMap CEU population. The reference dataset consisted of 2000 individuals drawn from the same allele frequency distribution. The results (Figure 1) show the close agreement between the theoretical and empirical curves for the LR-test. Further, the LR-test is consistently more powerful than the statistic proposed by [1], particularly at low false positive levels. To test the statistical significance of this difference, we performed a Wilcoxon signed rank test on the AUC (area under the ROC curve) of 100 bootstrap replicates. We found that the AUC of the LR-test was significantly greater than that of the statistic in [1] for $m = 1000$ and $m = 10000$ (in both cases, the p-value was $3.9 \times 10^{-18}$). Importantly, for a group of size 1000, the power is low even with 10000 independent SNPs: less than 0.50 at a $10^{-3}$ false positive level.
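Given the statistics computed for individuals under the null and under the alternative, the empirical power at a fixed false positive level and the AUC can be obtained as in the sketch below. This is one common way of computing these quantities and is not necessarily the procedure used for the figures; the function names are hypothetical.

```python
def empirical_power(null_scores, alt_scores, alpha):
    """Fraction of alternative scores above a threshold set from the null scores.

    The threshold is chosen so that at most a fraction alpha of the null
    scores lies strictly above it (ignoring ties).
    """
    ranked = sorted(null_scores, reverse=True)
    n_allowed = int(alpha * len(ranked))  # null scores allowed above the threshold
    if n_allowed >= len(ranked):
        return 1.0
    threshold = ranked[n_allowed]
    return sum(s > threshold for s in alt_scores) / len(alt_scores)

def auc(null_scores, alt_scores):
    """Area under the ROC curve as the probability that an alternative
    score exceeds a null score, counting ties as one half."""
    wins = sum((a > b) + 0.5 * (a == b) for a in alt_scores for b in null_scores)
    return wins / (len(alt_scores) * len(null_scores))
```

Sweeping `alpha` over a grid of false positive levels traces out an empirical ROC curve of the kind shown in the figures.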

Experiments on the WTCCC data. We constructed pools of size 1000 from the WTCCC control dataset consisting of 2937 individuals. There were 3004 individuals from the 58C and the UKBS control groups. We retained 2937 individuals after removing individuals with more than 3% missing data, related individuals and individuals with non-European ancestry.

We set $a = 0.05$ and retained only the set of independent SNPs (we used a p-value on $r^2$ of $10^{-5}$). This gave us a set of 33,138 autosomal SNPs from the original set of 462,386 SNPs. The power is less than 0.95 for a false positive level of $10^{-3}$ even when all the independent SNPs are used (Figure 1). Computing the power at lower p-values using Equation (M4), we see that the power to detect an individual at a false positive level of $10^{-6}$ is only about 0.47 even when all the independent SNPs are used. If the entire set of 358,053 SNPs with MAF above 0.05 is used (this set includes the set of 33,138 independent SNPs), Figure S5 shows a small reduction in the power when the same approximate LR test, which assumes independence, is used. However, when the SNPs are no longer independent, there is a potential risk that linkage disequilibrium could be exploited to design a more powerful test.

Genotyping errors. Thus far we have assumed that there are no genotyping errors for either the individual or the pool. In practice, genotyping errors occur in 0.1%-1% of the SNPs so that even when the tested individual is actually present in the pool, the tested genotype might differ from the genotype present in the pool. Intuitively, genotyping errors should reduce the power of the best detection method available, since noise is introduced, and this can be proved theoretically (see supplementary note). Empirically, when we randomly add genotyping errors to the set of 33,138 SNPs, we observe that the power decreases with the rate of genotyping errors (see Figure S3).

Detecting relatives in a pool. The LR-test can be extended to test for the presence of a specific relative of the tested individual. This scenario is similar to identifying a genotype in the presence of errors. The modified test is parameterized by $\gamma$, the probability that the relative and the tested individual share an allele. Thus, $\gamma = 1$ reduces to the case where the test detects the individual (or an identical twin), $\gamma = \frac{1}{2}$ denotes a test that detects siblings or parents, and so on. We used the independent SNPs (33,138) in the WTCCC dataset and evaluated the power to detect individuals with different degrees of relatedness to the tested individual ($\gamma = 1, \frac{1}{2}, \frac{1}{4}, \frac{1}{8}$). We observe a sharp decrease in power when we go from $\gamma = 1$ to $\gamma = \frac{1}{2}$ even though all the SNPs were used. At a false positive rate of $10^{-3}$, the power decreases from around 0.95 for $\gamma = 1$ to 0.22 for $\gamma = 0.5$ to 0.03 for $\gamma = 0.25$.
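The dependence of the theoretical power on $\gamma$ can be explored with the sketch below. It rests on an assumption that goes beyond what is stated here: that the number of SNPs required scales as $1/\gamma^2$, which matches the factor of four given for siblings ($\gamma = \frac{1}{2}$) but is an extrapolation for other values of $\gamma$. It also assumes the finite-reference correction of Equation (M4) and the upper-quantile convention for $z_\alpha$; the function name is hypothetical.

```python
from statistics import NormalDist

def relative_power(m, n, n_total, alpha, gamma):
    """Theoretical power to detect a relative sharing alleles with probability gamma.

    Assumes (as an illustration only) that gamma scales the signal term
    sqrt((m / n) * (1 - n / n_total)) linearly, so that the number of SNPs
    needed grows as 1 / gamma**2.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)  # upper alpha quantile
    signal = gamma * ((m / n) * (1.0 - n / n_total)) ** 0.5
    return std_normal.cdf(signal - z_alpha)
```

Under these assumptions, with $m = 33138$, $n = 1000$, 2937 individuals in total and $\alpha = 10^{-3}$, the function gives values close to the powers reported above for $\gamma = 0.5$ and $\gamma = 0.25$.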

Transferability across populations. Our analysis provides a population-independent bound on power, i.e., the power computed from Equation (M4) does not depend on the allele frequencies and hence should be the same across different populations. In a further experiment, we evaluated this aspect of our analysis by repeating our experiments on the YRI population from the HapMap. Since the number of YRI individuals in the HapMap is relatively small, we simulated a dataset of 3000 individuals by sampling from the YRI allele frequencies at independent SNPs with MAF > 0.05. We computed power for a pool of size 1000 individuals for $m = 1000$, 10000 and 33,138 SNPs (the number obtained from the WTCCC data). The results shown in Figure 2 confirm our analysis. A caveat is that the number of independent SNPs with small MAF may differ across the populations. This would affect the total number of SNPs that can potentially be exposed for a given population.


Supplementary Note

Empirical behavior of the approximate LR test

E1 Experiments on simulated and WTCCC data

In these experiments, we compare empirically the power of the approximate LR test and the power of the statistic used in Homer et al. [1]. We also compare the empirical power of the approximate LR test to its theoretical power, calculated from a variant of Equation 1 with a correction for the finite size of the reference dataset, which is discussed in Section T4.2.2. The only change arising from the correction is that the factor $\frac{1}{n}$ in Equation 1 is replaced by $\frac{1}{n}\left(1 - \frac{n}{\tilde{n}}\right)$, where $\tilde{n}$ is the sum of the number of individuals in the pool and the reference dataset. We use this corrected version of Equation 1 whenever we compare the approximate LR to our theoretical calculations. As discussed in Section T4.2.2, Equation 1 corresponds to the case where the allele frequencies $p_1, \dots, p_m$ are known, and the power computed using this equation is higher than when these allele frequencies are estimated from a reference dataset.

To evaluate the power of the approximate LR test empirically, we created pools containing $n = 1000$ genotypes. Allele frequencies were computed for the pool. Individuals not part of the pool were treated as a reference dataset and used jointly with the individuals in the pool to estimate the population allele frequencies. Under the null hypothesis (where the individual is not present in the pool), we pick an individual from the pool, remove the contribution of the individual to the allele frequencies for the pool and then compute the statistic for this individual. (Note that under the null, the individual is neither present in the pool nor in the sample of individuals outside the pool.) Under the alternative hypothesis, we simply pick an individual from the pool and compute the statistic for this individual.

The ROC curves display the power of the approximate LR test, the power of the statistic used in [1] and the theoretical power for the approximate LR test (referred to as LR theory). We computed the ROC curves first on simulated data and then on data from the WTCCC.


E1.1 Simulated data

We simulated independent SNPs for values of $n = 1000$ and $m = 1000, 10000$. The allele frequencies were picked independently from a Beta distribution fitted to allele frequencies in the range $[0.05, 0.95]$ found in the HapMap CEU population. The reference dataset consisted of 2000 individuals drawn from the same allele frequency distribution. The results (Figure 1 in the paper) show the close agreement between the theoretical and empirical curves for the approximate LR statistic. Further, the approximate LR statistic is consistently more powerful than the statistic proposed by Homer et al., particularly at low false positive levels. To test the statistical significance of this difference, we performed a Wilcoxon signed rank test on the AUC (area under the ROC curve) of 100 bootstrap replicates. We found that the AUC of the approximate LR statistic was significantly greater than that of the statistic in [1] for $m = 1000$ and $m = 10000$ (in both cases, the p-value was $3.9 \times 10^{-18}$). Importantly, for a group of size 1000, the power is low even with 10000 independent SNPs: the power of the approximate LR statistic is less than 0.50 at a $10^{-3}$ false positive level.

E1.2 WTCCC

We constructed pools of size 1000 from the WTCCC control dataset consisting of 2937 individuals. There were 3004 individuals from the 58C and the UKBS control groups. We retained 2937 individuals after removing individuals with greater than 3% missing data, related individuals and individuals with non-European ancestry.

We discarded rare SNPs (SNPs with MAF < 0.05) and further retained only the set of independent SNPs (we used a p-value on $r^2$ of $10^{-5}$). This gave us a set of 33138 independent and common autosomal SNPs from the original set of 462386. The power of the approximate LR-test is less than 0.95 for a false positive level of $10^{-3}$ even when all the independent SNPs are used (Figure 1). Computing the power of the test at lower p-values using Equation 1 corrected for the plug-in estimation of the allele frequencies, we see that the power to detect an individual at a false positive level of $10^{-6}$ is only about 0.47 even when all the independent SNPs are used. If the entire set of 358,053 SNPs with MAF above 0.05 is used (this set includes the set of 33138 independent SNPs), Figure S5 shows a small reduction in the power. However, when the SNPs are no longer independent, there is a potential risk that linkage disequilibrium could be exploited to design a more powerful test.


E2 Discrimination between pools

The results presented in Sections E1.1 and E1.2 differ from those of [1]. This discrepancy can be attributed to differences in the null hypotheses tested. Under the null hypothesis considered by [1], the tested individual is one of the individuals of the finite reference dataset. This differs from the null hypothesis that we test in this paper, where the tested individual and all individuals of the reference dataset are independent draws from the reference population; the finite reference dataset does not contain the genotype of the tested individual.

Note that the alternative hypothesis is the same in our setup and that of [1]. In particular, under the alternative, the pool is formed by sampling $n - 1$ individuals at $m$ independent SNPs and including the tested individual, who is sampled independently from the same allele frequency distribution. Thus, the setting considered in [1] corresponds to discriminating whether an individual is in one of two pools.

We consider these different null hypotheses from a theoretical point of view in Sections T4.2.2 and T4.3 respectively.

This seemingly subtle difference between the two null hypotheses has important consequences. Figures S1 and S2 compare the two settings. When all the independent common SNPs in the WTCCC are used, for a false positive level of $10^{-6}$ the approximate LR test attains a power of 1 and 0.88 under the alternate setting of [1] and the original setting respectively. The ROC curves can be extrapolated to lower false positive levels using Equation 1 corrected for a finite reference dataset. We then see that for a false positive level of $10^{-6}$, the power of the approximate LR test is 0.96 and 0.31 under the two settings. Figure S2 shows the same trend when only 10000 independent common SNPs are used: for discrimination between two pools, the power more than doubles at a false positive level of $10^{-6}$. We have also theoretically analyzed the power in this alternate setting in Section T4.3 and we can show that when the size of the reference dataset is the same as the size of the pool, the number of SNPs needed drops by a factor of four in this setting.

E3 Genotyping errors decrease the power of the approximate LR test

We examine the effect of genotyping errors on the power of the approximate LR test. As a result of genotyping errors, even when an individual is present in the pool, the tested genotype might differ from the genotype in the pool. In our simplistic model of genotyping errors, the allele at each individual can flip independently with an error probability of $\epsilon$. We used the WTCCC data with $n = 1000$ and $a = 0.05$ and independent SNPs (with a p-value on $r^2$ smaller than $10^{-5}$). As we expect, the power of the approximate LR test decreases as we increase $\epsilon$ from 0 to 1% to 10% (Figure S3). We prove in Section T6 that genotyping errors do not increase the power of the optimal test under a very broad assumption that the errors are generated by the same mechanism under the null and the alternative hypothesis. We would therefore expect the detection power to also be lower than that reported in the earlier subsection.
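The flip model of genotyping errors can be sketched as follows, representing an individual by a list of binary alleles. The representation and the function name are assumptions made for illustration; a genotype in $\{0, 1, 2\}$ would be the sum of the two (possibly flipped) alleles at a SNP.

```python
import random

def add_genotyping_errors(alleles, error_rate, rng):
    """Flip each binary allele independently with probability error_rate."""
    return [1 - a if rng.random() < error_rate else a for a in alleles]
```

For example, `add_genotyping_errors(alleles, 0.01, random.Random(0))` corresponds to a 1% error rate, and an error rate of 0 leaves the alleles unchanged.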

E4 Transferability of the LR-test properties across populations

In this section we show empirically that the behavior of the approximate LR test does not depend on the population. Indeed, the power computed from Equation 1 does not depend on the allele frequencies and hence should be the same across different populations. To test if this is true in practice, we ran the approximate LR test on a set of YRI individuals from HapMap phase II data. Since the number of YRI individuals is small, we simulated a dataset of 3000 individuals by sampling from the YRI allele frequencies at independent SNPs with MAF > 0.05 that were typed by the Affymetrix 500K chip set. We computed the power for a pool of size 1000 individuals and $m = 1000$, 10000 and 33138 SNPs (the number obtained from the WTCCC data). Figure 2 shows the close correspondence between the empirical and theoretical ROC curves.

E5 Detecting relatives

We consider a modified version of the LR test formulated to detect the presence of a specific relative of an individual. The test is parameterized by a parameter $\gamma$ which denotes the probability that the relative and the tested individual share an allele. Thus, $\gamma = 1$ reduces to the case where the test detects the individual (or an identical twin), $\gamma = \frac{1}{2}$ denotes a test that detects siblings or parents, and so on. We used all the independent SNPs (33,138) in the WTCCC dataset and evaluated the power to detect individuals with different degrees of relatedness to the tested individual ($\gamma = 1, \frac{1}{2}, \frac{1}{4}, \frac{1}{8}$). We observe a sharp decrease in power when we go from $\gamma = 1$ to $\gamma = \frac{1}{2}$ even though all the SNPs were used. At a false positive rate of $10^{-3}$, the power decreases from around 0.95 for $\gamma = 1$ to 0.22 for $\gamma = 0.5$ to 0.03 for $\gamma = 0.25$.



E6 Validity of Equation 1

The equation $m \approx (z_\alpha + z_{1-\beta})^2\, n$ is guaranteed to be very accurate for large values of $n$ and for SNPs with MAF > 0.05. However, this approximation is also likely to be valid for a broader regime in practice. In this section we explore this issue.

Given that $x$ and $\hat{p}$ are binomial random variables, we can compute numerically the mean and variance of the log-likelihood ratio $L^g$ at a single SNP, under either the null or the alternative hypothesis, using the exact formula for binomial probabilities for a given value of $p$. Using the frequencies of SNPs in the WTCCC data we can then compute the overall mean and variance of the exact LR statistic $L^g$ (see Section T5). Since means and variances are of order $\frac{1}{n}$, we rescale them by $n$. Figure S6 shows that the means and variances converge to the predicted asymptotic value very quickly.

From the means and variances, we then compute for each pool size $n$ the number of SNPs necessary to achieve $\alpha = 0.01$ and $\beta = 0.99$ and represent the ratio $\frac{m}{n}$, which, according to Equation 1, should approach $(z_\alpha + z_{1-\beta})^2$ for large values of $n$. The obtained curve (see Figure S7) reaches its asymptote very quickly. For $n \ge 100$ the estimate of the ratio $\frac{m}{n}$ given by Equation 1 is off by less than 5% and for $n \ge 1000$ Equation 1 is off by less than 0.5%.
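The numerical computation described in this section can be sketched as follows for the simpler haploid statistic $L$ of Section T1 (an assumption made for brevity; the text uses the genotype statistic $L^g$). The function names `binom_pmf` and `exact_moments` are illustrative, not taken from the original analysis. The sketch enumerates all pool counts $k = n\hat{p}$ and both values of $x$, and returns the exact per-SNP mean and variance under each hypothesis, which can then be rescaled by $n$ and compared with the asymptotic values $\mp\frac{1}{2}$ and $1$.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability, computed in log space to avoid overflow."""
    log_w = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_w)

def exact_moments(n, p):
    """Exact mean and variance of the single-SNP log-likelihood ratio
    L = x*log(phat/p) + (1-x)*log((1-phat)/(1-p)) in the haploid case,
    with n*phat ~ Bin(n, p). Returns ((mean0, var0), (mean1, var1)).
    Under the null, moments are conditional on L being finite."""
    s0 = [0.0, 0.0, 0.0]  # total weight, sum of L, sum of L^2 under H0
    s1 = [0.0, 0.0, 0.0]  # same accumulators under H1
    for k in range(n + 1):
        phat = k / n
        w = binom_pmf(k, n, p)
        for x in (0, 1):
            num = phat if x else 1 - phat
            if num == 0.0:
                continue  # L = -inf here; this event has probability 0 under H1
            den = p if x else 1 - p
            L = math.log(num / den)
            w0 = w * den  # H0: x ~ Bernoulli(p), independent of the pool
            w1 = w * num  # H1: x ~ Bernoulli(phat)
            for s, wt in ((s0, w0), (s1, w1)):
                s[0] += wt
                s[1] += wt * L
                s[2] += wt * L * L
    out = []
    for s in (s0, s1):
        mean = s[1] / s[0]
        out.append((mean, s[2] / s[0] - mean * mean))
    return tuple(out)

# Rescaled moments for a hypothetical SNP with p = 0.3 in a pool of n = 1000.
(mean0, var0), (mean1, var1) = exact_moments(1000, 0.3)
print(1000 * mean0, 1000 * var0, 1000 * mean1, 1000 * var1)
```

Averaging these per-SNP moments over a set of allele frequencies gives the overall mean and variance used to obtain the ratio $\frac{m}{n}$.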



Theoretical Analysis of the LR test

T1 Presentation of the log-likelihood ratio statistics

For clarity, we first derive the log-likelihood ratio (LR) statistic for a haploid individual. There is little loss of generality here, since, as we show in Section T5, the analysis of its counterpart for the genotypes of diploid individuals leads to the same quantitative conclusions to first order.

T1.1 A model for independent SNPs in the haplotype case

T1.1.1 Notation

We use the following notation:

• $x_j$: the allele (0 or 1) at SNP $j$

• $p_j$: the frequency of the allele denoted 1 in the population at SNP $j$

• $\hat{p}_j$: the frequency of allele 1 in the pool

• $n$: the size of the pool

• $m$: the number of SNPs

We drop the index $j$ when the reference to the SNP is not necessary.

T1.1.2 Model assumptions

In the models we consider, we assume that the SNPs are independent. This is motivated by the fact that, in practice, the SNPs that we choose to expose can be selected sufficiently far apart that they can be considered independent; moreover, this assumption makes our theoretical analysis tractable. We also ignore genotyping errors for now. Finally, we make the assumption that the SNP allele frequencies are bounded away from zero and one. More precisely, we assume that there is $a > 0$ such that $a \le p_j \le 1 - a$. This is a natural assumption because it is usually the case that only those SNPs whose minor alleles are sufficiently well represented in the population are considered in association studies; moreover, the exposed SNPs can be explicitly selected to have a prespecified minimal minor allele frequency.

T1.1.3 Models for the null and the alternative

Since the SNPs are assumed independent, we describe the model only for a single SNP.

Null hypothesis. We assume that the pool is constituted of $n$ independent individuals drawn from a reference population, in which the SNP allele frequency is $p$. We assume that the pool frequency $\hat{p}$ for that SNP is obtained by averaging the binary values of SNPs of all individuals, so that $n\hat{p}$ is a binomial random variable $\mathrm{Bin}(n, p)$. (In the diploid case, each individual contributes two alleles and in that case $2n\hat{p}$ is binomial $\mathrm{Bin}(2n, p)$.) The SNP of the individual of interest is drawn independently from a Bernoulli variable with parameter $p$, since that individual comes independently of the pool from the same reference population.

Alternative hypothesis. We assume that the pool is constituted of the individual of interest, whose SNP is drawn from a Bernoulli variable with parameter $p$, which is merged with a pool of $n - 1$ individuals obtained as under the null. Thus $\hat{p}$ is the average of the $n - 1$ SNPs of the individuals in the pool and the SNP of the individual of interest. This model is actually exactly equivalent to a simpler model which consists in sampling a pool of size $n$, computing the allele frequency $\hat{p}$ in the pool, and drawing the SNP of the individual of interest as Bernoulli with parameter $\hat{p}$.

These assumptions yield the following forms for the probability distributions under the null and alternative hypotheses. Under the null hypothesis, the joint distribution of $x$ and $\hat{p}$ for a single SNP, whose allele frequency is $p$, is

\[
f_0(x, \hat{p}\,; p) = p^{x}(1-p)^{1-x} \binom{n}{n\hat{p}}\, p^{n\hat{p}} (1-p)^{n(1-\hat{p})}.
\]

Under the alternative, we have:

\[
f_1(x, \hat{p}\,; p) = \hat{p}^{\,x}(1-\hat{p})^{1-x} \binom{n}{n\hat{p}}\, p^{n\hat{p}} (1-p)^{n(1-\hat{p})}.
\]

Note that the only parameter of each model is $p$ and that in this formulation $\hat{p}$ is not a parameter.



T1.2 The likelihood-ratio statistic

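The likelihood ratio of the alternative to the null takes the following form:

\[
\frac{f_1(x, \hat{p}\,; p)}{f_0(x, \hat{p}\,; p)} = \frac{\hat{p}^{\,x}(1-\hat{p})^{1-x}}{p^{x}(1-p)^{1-x}}.
\]

Since we assume that the SNPs considered are independent, the overall log-likelihood ratio statistic can be written as:

\[
L = \sum_{j=1}^{m} \left[ x_j \log\frac{\hat{p}_j}{p_j} + (1 - x_j) \log\frac{1 - \hat{p}_j}{1 - p_j} \right]. \tag{T1}
\]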
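As a concrete reading of Eqn. (T1), the statistic can be evaluated directly from the tested haplotype, the pool frequencies and the population frequencies. The sketch below is illustrative only: the function name `log_lr` and the toy frequencies are assumptions, and the population frequencies are treated as known, which corresponds to the exact LR statistic.

```python
import math

def log_lr(x, pool_freq, pop_freq):
    """Log-likelihood ratio of Eqn. (T1): the sum over SNPs of
    x_j*log(phat_j/p_j) + (1 - x_j)*log((1 - phat_j)/(1 - p_j))."""
    total = 0.0
    for xj, phat, p in zip(x, pool_freq, pop_freq):
        num = phat if xj == 1 else 1.0 - phat
        den = p if xj == 1 else 1.0 - p
        if num == 0.0:
            # The individual carries an allele that nobody in the pool has.
            return -math.inf
        total += math.log(num / den)
    return total

# Toy example with three SNPs (hypothetical values).
x = [1, 0, 1]
pool = [0.6, 0.3, 0.5]
pop = [0.5, 0.3, 0.4]
print(log_lr(x, pool, pop))
```

Large values of the statistic favor the alternative (the individual is in the pool); the value $-\infty$ corresponds to the case where the individual cannot be in the pool.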

T2 Overall summary of the analysis

T2.1 Main results

The Neyman-Pearson lemma guarantees that no test can have larger power than the likelihood ratio test (see Section T3). Thus, characterizing the power of the LR test, as a function of the pool size $n$, the number of SNPs $m$ and a tolerable false positive rate $\alpha$, determines the largest power $\beta$ achievable for any test given $(m, n, \alpha)$; conversely, it also determines the maximal value $m$ so that no $(\alpha, \beta)$-test can be obtained for a pool of size $n$.

We should note that the LR test cannot be constructed in practice since it requires knowledge of the true SNP allele frequencies. The fact that we use the allele frequencies in our analysis does not mean that we make the assumption that we actually know them in practice, which would not be realistic, but rather that, even if the frequencies were known exactly (and a fortiori if they are not), the power of any test cannot be better than that of the LR test (for clarity we refer to this test as the exact LR-test). We study the case in which frequencies need to be estimated in Section T4.2.2.

If $n$ is larger than 100 and the minor allele frequency is greater than 0.05, the exact LR-test statistic can be shown to be very well approximated under both hypotheses by the simpler statistic

\[
\sum_{j=1}^{m} \left[ \frac{1}{\sqrt{n}}\, \frac{x_j - p_j}{\sqrt{p_j(1 - p_j)}}\, \tilde{Z}_j - \frac{1}{2n}\, \frac{(x_j - p_j)^2}{p_j(1 - p_j)}\, \tilde{Z}_j^2 \right], \tag{T2}
\]

where the $\tilde{Z}_j$ are Gaussian variables that approximate $Z_j = \sqrt{n}\, \frac{\hat{p}_j - p_j}{\sqrt{p_j(1 - p_j)}}$. The latter statistic can be analyzed easily: each term in the sum has mean $\mu_0 = -\frac{1}{2n}$ under the null and $\mu_1 = +\frac{1}{2n}$ under the alternative and variance $\sigma_0^2 = \sigma_1^2 = \frac{1}{n}$ in both cases. But, for $m$ moderately large, the distribution of the exact LR statistic is itself approximately Gaussian and, for a Gaussian test, the relationship between sample size $m$, power $\beta$, and false positive rate $\alpha$ is

\[
m\mu_0 + z_\alpha \sigma_0 \sqrt{m} = m\mu_1 - z_{1-\beta}\, \sigma_1 \sqrt{m}.
\]

In our case this yields the fundamental relation

\[
z_\alpha + z_{1-\beta} \approx \sqrt{\frac{m}{n}},
\]

where $z_\alpha$ and $z_{1-\beta}$ are respectively the quantiles of level $1 - \alpha$ and $\beta$ of the Gaussian distribution. Note that this result is independent of the allele frequencies provided MAF > 0.05.
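The fundamental relation can be inverted in either direction. The following sketch, which assumes the Gaussian approximation above and uses illustrative function names (`snps_needed`, `max_power`), computes the number of SNPs implied by a target $(\alpha, \beta)$ and, conversely, the power implied by a given $m$, $n$ and $\alpha$.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard Gaussian

def snps_needed(n, alpha, beta):
    """Number of SNPs m solving sqrt(m/n) = z_alpha + z_{1-beta}.
    Intended for the usual regime where z_alpha + z_{1-beta} > 0."""
    z_alpha = _STD.inv_cdf(1.0 - alpha)  # quantile of level 1 - alpha
    z_beta = _STD.inv_cdf(beta)          # quantile of level beta
    return (z_alpha + z_beta) ** 2 * n

def max_power(m, n, alpha):
    """Power beta solving the same relation for given m, n and alpha."""
    z_alpha = _STD.inv_cdf(1.0 - alpha)
    return _STD.cdf((m / n) ** 0.5 - z_alpha)

# For alpha = 0.01 and beta = 0.99, (z_alpha + z_{1-beta})^2 is about 21.6,
# so a pool of n = 1000 corresponds to roughly 21,600 SNPs.
m = snps_needed(n=1000, alpha=0.01, beta=0.99)
print(round(m), max_power(m, 1000, 0.01))
```

Because the relation involves only the ratio $\frac{m}{n}$, doubling the pool size doubles the number of SNPs associated with the same $(\alpha, \beta)$.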

As a consequence, for $n$ larger than 100, if $m \le (z_\alpha + z_{1-\beta})^2\, n$, then any test of level $\alpha$ is guaranteed to have power no larger than $\beta$. Empirically, we have found that this holds also for small $n$, but in that case $\mu_0$, $\mu_1$, $\sigma_0^2$ and $\sigma_1^2$ can be computed algorithmically and the power can still be computed exactly, even though no simple analytical expression is available (it depends a priori on the allele frequencies of all the SNPs).

T2.2 Behavior of the LR test under variations on the setting

The virtue of having reduced the analysis of the LR test to the analysis of Eqn. (T2) is that beyond the analysis of the original case, which leads to our main result, we can characterize the behavior of the LR test in various interesting situations.

• In general, the frequencies $p_j$ need to be estimated (we term the LR-test based on a statistic where $p_j$ is replaced by an estimate the approximate LR-test). The power actually drops due to the estimation procedure and we characterize the drop in power in Section T4.2.2. In particular, if the separate sample used to build estimates has the same size as the pool, the number of SNPs needed to reach the same power is doubled.

• The likelihood-ratio test for the detection of an individual in a pool is very similar to the test of discrimination between two pools, which has much higher power. We analyze that case with the same tools in Section T4.3. If both pools have the same size, the necessary number of SNPs is halved.

• The case of diploid individuals whose genotype is available is treated in Section T5. Interestingly, since the number of alleles doubles but they are drawn (for an individual) in pairs coming from the same frequencies, our main formula stays the same and we still have $z_\alpha + z_{1-\beta} \approx \sqrt{\frac{m}{n}}$.

• Genotyping errors only decrease the power of the optimal test. This is intuitively clear since genotyping errors would be expected to make it harder to match an individual's genotype to that in the pool. We show in Section T6 that this is true under a very broad assumption that the errors are generated by the same mechanism under the null and the alternative hypothesis.

• Detecting a relative of the individual of interest in the pool is also done at the expense of a drop in power. We consider this case in Section T7 and show that, for a given power, detecting a sibling instead of the individual of interest requires four times as many SNPs.

T3 Guarantees provided by the Neyman-Pearson lemma

How hard is it to detect an individual in a pool based on his or her SNPs? Can we quantify the difficulty in terms of the size of the pool and the number of SNPs available? The appropriate framework to answer such questions is the Neyman-Pearson framework for hypothesis testing ([2]). Indeed, besides the formalization of a methodology, this framework provides a measure of the performance of a test: its power. Note that this implicitly defines a measure of difficulty of a problem, since for hard problems it will be impossible to design tests with high power.

For a pair of distributions $p_1$ and $p_0$ associated to hypotheses $H_1$ and $H_0$ respectively, the Neyman-Pearson lemma characterizes the maximal power achievable by any statistical test based on an observation $x$ from either $p_1$ or $p_0$ and shows that the maximal power is actually achieved by the exact likelihood ratio test (LR test). By statistical test, we mean a decision that depends on any (possibly random) mathematical function of $x$, and therefore encompasses tests that have perfect knowledge of $p_0$ and $p_1$ as well as all practical tests that either have no access to $p_0$ and $p_1$ or have a limited ability to sample from them, or approximate them. The statement of the lemma can be summarized as follows:

Given two distributions $p_0$ and $p_1$ generating the data respectively under hypotheses $H_0$ and $H_1$, no test, practical or abstract, can achieve better power than that of the exact LR test.

Therefore, to characterize the maximal power achievable by tests to detect an individual in a pool, we can assume that the allele frequencies in the reference population are known to us, since, if they were not, the power of the best practical test we would be able to construct could never be larger.



T4 Analysis of the exact LR statistic for large n

T4.1 Approximation to the LR statistic

We first note that it is possible for the LR statistic to take the value $-\infty$ if $x_j = 0$ and $\hat{p}_j = 1$ or vice versa. This corresponds to a case where the individual can obviously not be in the pool. We consider this case in Section T4.1.2.

T4.1.1 Main regime

In essence, our analysis is based on the fact that for large values of $n$, which typically correspond to the sizes of pools in association studies, the likelihood ratio is very accurately approximated by the much simpler statistic:

\[
\sum_{j=1}^{m} \left[ \frac{(x_j - p_j)(\hat{p}_j - p_j)}{p_j(1 - p_j)} - \frac{1}{2}\, \frac{(x_j - p_j)^2 (\hat{p}_j - p_j)^2}{p_j^2 (1 - p_j)^2} \right]. \tag{T3}
\]

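For a single SNP, and for $n$ reasonably large, $\hat{p}$ is typically very close to $p$. In fact, the central limit theorem says that $\sqrt{n}(\hat{p} - p)$ approaches a Gaussian with mean zero and variance $p(1 - p)$. As a consequence, in the typical case the ratio $\frac{\hat{p}}{p} = 1 + \frac{\hat{p} - p}{p}$ is very close to 1 and the logarithm can be approximated by its second-order Taylor expansion: $\log(1 + x) \approx x - \frac{x^2}{2}$. Note that $\log(1 + x)$ can only be expanded as a Taylor series if $|x| < 1$, so we require that both $|\hat{p} - p| \le p\,(1 - \epsilon)$ and $|\hat{p} - p| \le (1 - p)(1 - \epsilon)$ so that both terms in the log-likelihood ratio can be expanded. In Section A1 we show that these inequalities hold with overwhelming probability, and that the remainder $R(\hat{p}, n)$ beyond the second-order term is negligible with respect to the first two terms. Technically, we have $R(\hat{p}, n) = O_p(n^{-3/2})$. If we denote the standardized binomial by $Z \doteq \sqrt{n}\, \frac{\hat{p} - p}{\sqrt{p(1 - p)}}$, Taylor expansions yield

\[
\log\left(\frac{\hat{p}}{p}\right) = \log\left(1 + \frac{\hat{p} - p}{p}\right) \approx \frac{\hat{p} - p}{p} - \frac{(\hat{p} - p)^2}{2p^2},
\]

\[
\log\left(\frac{1 - \hat{p}}{1 - p}\right) = \log\left(1 - \frac{\hat{p} - p}{1 - p}\right) \approx -\frac{\hat{p} - p}{1 - p} - \frac{(\hat{p} - p)^2}{2(1 - p)^2},
\]

so that

\[
\begin{aligned}
L &= x \log\left(\frac{\hat{p}}{p}\right) + (1 - x) \log\left(\frac{1 - \hat{p}}{1 - p}\right) \\
&\approx \frac{(x - p)(\hat{p} - p)}{p(1 - p)} - \frac{1}{2}\, \frac{(x - p)^2 (\hat{p} - p)^2}{p^2 (1 - p)^2} \\
&= \frac{1}{\sqrt{n}}\, \frac{x - p}{\sqrt{p(1 - p)}}\, Z - \frac{1}{2n}\, \frac{(x - p)^2}{p(1 - p)}\, Z^2.
\end{aligned}
\]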



The distribution of $L$ can be approximated by the distribution of

\[
\tilde{L} = \frac{1}{\sqrt{n}}\, \frac{x - p}{\sqrt{p(1 - p)}}\, \tilde{Z} - \frac{1}{2n}\, \frac{(x - p)^2}{p(1 - p)}\, \tilde{Z}^2, \tag{T4}
\]

where $\tilde{Z}$ is a Gaussian variable and $x$ is Bernoulli with parameter $p$ under the null and $p + \sqrt{\frac{p(1 - p)}{n}}\, \tilde{Z}$ under the alternative (see Section T8.1.3). Defining analogously, for each SNP $j$, $L_j$ based on the triplet $(x_j, \hat{p}_j, p_j)$ and $\tilde{L}_j$ based on the triplet $(x_j, \tilde{Z}_j, p_j)$, the likelihood ratio statistic is $L = \sum_{j=1}^{m} L_j$.

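Our main technical result, a simultaneous central limit theorem in $m$ and $n$, guarantees that, if $m \to \infty$, $n \to \infty$ and $\frac{m}{n} \to \tau$, then both $L$ and $\sum_j \tilde{L}_j$ converge to a Gaussian limit under both the null and the alternative. Using the notation $\rightsquigarrow$ to denote convergence in distribution and $A_n \leftrightsquigarrow B_n$ to denote that $A_n$ and $B_n$ converge in distribution to a common limit, our result can be formulated symbolically as

\[
\text{If } m, n \to \infty,\ \frac{m}{n} \to \tau < \infty:
\qquad
\begin{aligned}
\text{Under } H_0, \quad & L \leftrightsquigarrow \sum_{j=1}^{m} \tilde{L}_j \leftrightsquigarrow \mathcal{N}\left(-\frac{m}{2n}, \frac{m}{n}\right) \rightsquigarrow \mathcal{N}\left(-\frac{\tau}{2}, \tau\right), \\
\text{Under } H_1, \quad & L \leftrightsquigarrow \sum_{j=1}^{m} \tilde{L}_j \leftrightsquigarrow \mathcal{N}\left(+\frac{m}{2n}, \frac{m}{n}\right) \rightsquigarrow \mathcal{N}\left(+\frac{\tau}{2}, \tau\right).
\end{aligned}
\]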

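A more precise version of this statement is made in Section T8.1 and proved rigorously. However, assuming that Eqn. (T4) provides a good approximation, in the sense that $L$ and $\sum_{j=1}^{m} \tilde{L}_j$ converge in distribution to a common limit, we justify here the form, and the mean and variance, of the limiting distribution.

Under the null, $\frac{x - p}{\sqrt{p(1 - p)}}$ has mean 0 and variance 1 independently of $\tilde{Z}$ so that we have

\[
E_0[\tilde{L} \mid \tilde{Z}] = -\frac{1}{2n}\, \tilde{Z}^2, \qquad E_0[\tilde{L}^2 \mid \tilde{Z}] = \frac{1}{n}\, \tilde{Z}^2 + O_p(n^{-3/2}).
\]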

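Under the alternative, since $x$ is Bernoulli with parameter $p + \sqrt{\frac{p(1 - p)}{n}}\, \tilde{Z}$, we have

\[
E_1\left[ \frac{x - p}{\sqrt{p(1 - p)}} \,\middle|\, \tilde{Z} \right] = \frac{\tilde{Z}}{\sqrt{n}}, \qquad
E_1\left[ \frac{(x - p)^2}{p(1 - p)} \,\middle|\, \tilde{Z} \right] = 1 + \frac{(1 - 2p)\, \tilde{Z}}{\sqrt{n\, p(1 - p)}} = 1 + O_p(n^{-1/2}),
\]

so that

\[
E_1[\tilde{L} \mid \tilde{Z}] = \left( \frac{\tilde{Z}}{\sqrt{n}} \right)^2 - \frac{1}{2n}\, \tilde{Z}^2 + O_p(n^{-3/2}), \qquad E_1[\tilde{L}^2 \mid \tilde{Z}] = \frac{1}{n}\, \tilde{Z}^2 + O_p(n^{-3/2}).
\]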



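Finally, taking expectations with respect to $\tilde{Z}$ we get

\[
\begin{aligned}
E_0[\tilde{L}] &= -\frac{1}{2n}, & \mathrm{Var}_0(\tilde{L}) &= \frac{1}{n} + O\left(\frac{1}{n^{3/2}}\right), \\
E_1[\tilde{L}] &= +\frac{1}{2n} + O\left(\frac{1}{n^{3/2}}\right), & \mathrm{Var}_1(\tilde{L}) &= \frac{1}{n} + O\left(\frac{1}{n^{3/2}}\right).
\end{aligned}
\]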

Taking the sum of $m$ SNPs, the above means and variances are multiplied by $m$ and, since all SNPs are assumed independent, the Lindeberg-Feller version of the central limit theorem, which accommodates the fact that the $\tilde{L}_j$ are not identically distributed, shows that $\sum_j \tilde{L}_j$ converges under the null and the alternative to a Gaussian with the announced means and variances. Note that, as mentioned before, thanks to the first Gaussian approximation, formulated in Eqn. (T4), the above result is simultaneously asymptotic in $m$ and $n$.

While the number of SNPs considered will in practice always be large enough that a central limit theorem in $m$ holds, it is the case that for some smaller sizes of the pool $n$, the Gaussian assumptions per SNP would not be appropriate and the means and variances of the exact LR statistic will depart from the above values. We therefore prove also a Lindeberg-Feller central limit theorem for $n$ finite which applies to the quantity $\frac{1}{\sqrt{m}} L$ (conditionally on the fact that $L > -\infty$) as $m \to \infty$ and for $n$ fixed and which guarantees that the LR-test is well approximated by a Gaussian test with appropriate means and variances. This can be found in Section T8.2.

T4.1.2 “Perfect Accept”

The LR test also captures the notion that an individual cannot be in the pool if the individual has an allele that nobody in the pool has. This is only reasonable if we assume no genotyping errors, as we have. We call this a “perfect accept,” because in that case the null is accepted and the p-value is zero. In the presence of genotyping errors, a “perfect accept” would no longer be possible. Notice that a “perfect reject” cannot occur, and that only “perfect rejects” would be a privacy issue: for the “perfect accept,” knowing that an individual is certainly not in the pool is indeed not problematic. The corresponding mathematical formulation is that if for any SNP $(x_j, \hat{p}_j) \in \{(1, 0), (0, 1)\}$ then $L = -\infty$ and the null is accepted. In a model where genotyping errors are properly taken into account, a single SNP could obviously not trigger the decision. Under the null, we have:

\[
\pi_\infty \triangleq P_0(L = -\infty) = 1 - \prod_{j=1}^{m} \left( 1 - \left[ p_j (1 - p_j)^n + (1 - p_j)\, p_j^n \right] \right)
\]

so that

\[
\log(1 - \pi_\infty) = \sum_{j} \log\left( 1 - \left[ p_j (1 - p_j)^n + (1 - p_j)\, p_j^n \right] \right) \ge m \log\left( 1 - \left[ 0.5\,(1 - a)^n + 2^{-n} \right] \right),
\]

which shows that $m$ should be exponential in $n$ for a “perfect accept” to occur with non-negligible probability. In concrete terms, since the previous formula is decreasing in $m$ and increasing in $n$, if $n > 350$ and $m \le 10^6$ then $\pi_\infty \le 0.008$. Moreover, the previous analysis is very conservative, considering a worst-case scenario where all SNPs have extreme frequencies. In practice, SNPs potentially responsible for a “perfect accept” should be removed from the list of SNPs before performing a test, unless genotyping errors are modeled or accounted for. A very similar result holds for genotypes (cf. Section T5.1).
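The bound on $\pi_\infty$ is easy to evaluate numerically. The sketch below uses illustrative function names and assumes $a = 0.05$, the minor allele frequency threshold used elsewhere in this document; the text does not state which value of $a$ underlies the figure of 0.008. It computes both the bound and the exact product formula for a given list of allele frequencies.

```python
import math

def perfect_accept_bound(n, m, a):
    """Upper bound on pi_inf = P0(L = -inf), obtained from
    log(1 - pi_inf) >= m * log(1 - [0.5*(1 - a)**n + 2**(-n)])."""
    per_snp = 0.5 * (1.0 - a) ** n + 2.0 ** (-n)
    return -math.expm1(m * math.log1p(-per_snp))

def perfect_accept_exact(n, freqs):
    """Exact pi_inf = 1 - prod_j (1 - [p_j(1 - p_j)^n + (1 - p_j) p_j^n])."""
    log_keep = sum(
        math.log1p(-(p * (1 - p) ** n + (1 - p) * p ** n)) for p in freqs
    )
    return -math.expm1(log_keep)

# Assumed setting: a = 0.05, pool of n = 350, one million SNPs.
print(perfect_accept_bound(n=350, m=10**6, a=0.05))
```

The bound decays geometrically in $n$, so modest increases in pool size make a “perfect accept” under the null very unlikely even for large $m$.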

T4.2 The LR test when allele frequencies have to be estimated

T4.2.1 Misspecification

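If $x_j \sim \mathrm{Bin}(1, p_j)$ and $n\hat{p}_j \sim \mathrm{Bin}(n, p_j)$ but we mistake $p_j$ for $q_j$ (e.g. we use a reference dataset of individuals that do not belong to the population that individuals in the pool and the individual of interest are drawn from), a bias is introduced in the LR statistic since for a fixed SNP we then have:

\[
L = x \log\left(\frac{\hat{p}}{q}\right) + (1 - x) \log\left(\frac{1 - \hat{p}}{1 - q}\right),
\]

which can be rewritten as

\[
L = x \log\left(\frac{\hat{p}}{p}\right) + (1 - x) \log\left(\frac{1 - \hat{p}}{1 - p}\right) - \left[ x \log\left(\frac{q}{p}\right) + (1 - x) \log\left(\frac{1 - q}{1 - p}\right) \right],
\]

so that the bias, i.e. the expectation of the last term for $x \sim \mathrm{Bin}(1, p)$, is $\mathrm{KL}(p \,\|\, q)$. Of great interest is obviously the case where $q$ is actually an estimator of $p$, which we consider next.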
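To get a feel for the size of this bias, it can be compared with the per-SNP mean separation $\frac{1}{2n}$ of the exact statistic. The sketch below is an illustration under assumed values ($p = 0.30$, $q = 0.32$, $n = 1000$); the function names are hypothetical.

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence KL(p || q) between two Bernoulli laws."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def expected_shift(p, q):
    """Expectation, for x ~ Bernoulli(p), of the extra term
    -[x*log(q/p) + (1 - x)*log((1 - q)/(1 - p))] in the misspecified statistic."""
    return -(p * math.log(q / p) + (1 - p) * math.log((1 - q) / (1 - p)))

p, q, n = 0.30, 0.32, 1000  # hypothetical frequencies and pool size
print(kl_bernoulli(p, q), expected_shift(p, q), 1 / (2 * n))
```

In this assumed example a frequency error of 0.02 already produces a per-SNP bias larger than $\frac{1}{2n}$, which suggests why a reference dataset drawn from a different population can distort the test.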

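T4.2.2 The LR test with a plug-in estimator of allele frequencies (Approximate LR test)

If we have at our disposal a reference dataset of $n_0$ individuals outside the pool, we can build an estimator $\bar{p}$ of $p$: $\bar{p} = \frac{n}{n + n_0}\, \hat{p} + \frac{n_0}{n + n_0}\, \hat{p}_0$, with $\hat{p}_0$ the empirical frequency in the reference dataset. In that case the likelihood ratio is

\[
L = x \log\left(\frac{\hat{p}}{\bar{p}}\right) + (1 - x) \log\left(\frac{1 - \hat{p}}{1 - \bar{p}}\right). \tag{T5}
\]

If we denote by $\tilde{Z}$, $\bar{Z}$ and $\tilde{Z}_0$ respectively the Gaussian approximations to the standardized versions of $\hat{p}$, $\bar{p}$ and $\hat{p}_0$, then we have, with $\bar{n} = n + n_0$,

\[
\bar{Z} = \sqrt{\frac{n}{\bar{n}}}\, \tilde{Z} + \sqrt{\frac{n_0}{\bar{n}}}\, \tilde{Z}_0.
\]

We get:

\[
\tilde{L} = \frac{1}{\sqrt{n}}\, \frac{x - p}{\sqrt{p(1 - p)}} \left( \tilde{Z} - \sqrt{\frac{n}{\bar{n}}}\, \bar{Z} \right) - \frac{1}{2n} \left[ \frac{(1 - p)\, x}{p} + p\, \frac{1 - x}{1 - p} \right] \left( \tilde{Z}^2 - \frac{n}{\bar{n}}\, \bar{Z}^2 \right).
\]

Since $\tilde{Z}$ and $\bar{Z}$ are correlated, we rewrite the difference in terms of independent Gaussians:

\[
\tilde{Z} - \sqrt{\frac{n}{\bar{n}}}\, \bar{Z} = \frac{n_0}{\bar{n}}\, \tilde{Z} - \frac{\sqrt{n n_0}}{\bar{n}}\, \tilde{Z}_0.
\]

As a consequence we can write:

\[
E\left[ \left( \tilde{Z} - \sqrt{\frac{n}{\bar{n}}}\, \bar{Z} \right)^2 \right]
= E\left[ \left( \frac{n_0}{\bar{n}}\, \tilde{Z} - \frac{\sqrt{n n_0}}{\bar{n}}\, \tilde{Z}_0 \right)^2 \right]
= E\left[ \frac{n_0^2}{\bar{n}^2}\, \tilde{Z}^2 + \frac{n n_0}{\bar{n}^2}\, \tilde{Z}_0^2 \right]
= \frac{n_0}{\bar{n}} = n \left( \frac{1}{n} - \frac{1}{\bar{n}} \right).
\]

Assuming that $n_0 = \Omega(n)$, the means and variances under both hypotheses can be calculated, using similar computations as in Section T4.1:

\[
\begin{aligned}
E_0[\tilde{L}_j \mid p_j] &= -\frac{1}{2n}\, E\left[ \tilde{Z}^2 - \frac{n}{\bar{n}}\, \bar{Z}^2 \right], \\
E_1[\tilde{L}_j \mid p_j] &= \frac{1}{\sqrt{n}}\, E\left[ \frac{\tilde{Z}}{\sqrt{n}} \left( \tilde{Z} - \sqrt{\frac{n}{\bar{n}}}\, \bar{Z} \right) \right] - \frac{1}{2n}\, E\left[ \tilde{Z}^2 - \frac{n}{\bar{n}}\, \bar{Z}^2 \right] + O(n^{-3/2}), \\
\mathrm{Var}_0(\tilde{L}_j \mid p_j) &= \frac{1}{n}\, E\left[ \left( \tilde{Z} - \sqrt{n/\bar{n}}\, \bar{Z} \right)^2 \right] + O(n^{-3/2}), \\
\mathrm{Var}_1(\tilde{L}_j \mid p_j) &= \frac{1}{n}\, E\left[ \left( \tilde{Z} - \sqrt{n/\bar{n}}\, \bar{Z} \right)^2 \right] + O(n^{-3/2}),
\end{aligned}
\]

which finally yields

\[
\begin{aligned}
E_0[\tilde{L}_j \mid p_j] &= -\frac{1}{2} \left[ \frac{1}{n} - \frac{1}{\bar{n}} \right], & \mathrm{Var}_0(\tilde{L}_j \mid p_j) &= \frac{1}{n} - \frac{1}{\bar{n}} + O(n^{-3/2}), \\
E_1[\tilde{L}_j \mid p_j] &= +\frac{1}{2} \left[ \frac{1}{n} - \frac{1}{\bar{n}} \right] + O(n^{-3/2}), & \mathrm{Var}_1(\tilde{L}_j \mid p_j) &= \frac{1}{n} - \frac{1}{\bar{n}} + O(n^{-3/2}).
\end{aligned}
\]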

As a final remark note that the estimator $\bar{p}' = \frac{n}{n + n_0 + 1}\, \hat{p} + \frac{n_0}{n + n_0 + 1}\, \hat{p}_0 + \frac{1}{n + n_0 + 1}\, x$ would also be a reasonable estimator of $p$. In fact, it is the maximum likelihood estimator under $H_0$. The same analysis as the previous one can be carried out. Neglecting terms of order higher than $\frac{1}{n}$, both expectations under the null and the alternative are shifted by $-\frac{1}{\bar{n}}$; intuitively, this can be explained by the fact that including $x$ to estimate $p$ makes the null distribution simultaneously more likely under both the null and the alternate, and by the same amount. On the other hand, variances are unchanged. Since only the difference of the means matters to compute the power, this estimator has (to first order) the same power as the previous one.

Drop in power due to estimation

If $n_0 = \lambda n$, our previous result shows that all means and variances are multiplied by a factor $\frac{\lambda}{1 + \lambda}$. Thus for any practical use of the LR test, if the plug-in estimator $\bar{p}$ is used, the number of SNPs would have to be increased by a factor $\frac{1 + \lambda}{\lambda}$ in order to achieve the same power as in the case where the frequencies $p_j$ are assumed known. If $n = n_0$, this is a factor of two.
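Under the Gaussian approximation, the per-SNP means $\mp\frac{1}{2}\left(\frac{1}{n} - \frac{1}{\bar{n}}\right)$ and variance $\frac{1}{n} - \frac{1}{\bar{n}}$ translate directly into a power for the approximate LR test. The following is a minimal sketch with illustrative function names and hypothetical values of $m$, $n$, $n_0$ and $\alpha$; it is not the finite-reference correction of Equation 1 as implemented for the figures, whose details are not given here.

```python
from statistics import NormalDist

def approx_lr_power(m, n, n0, alpha):
    """Gaussian approximation to the power of the approximate LR test.
    Per-SNP means are -d/2 and +d/2 and the variance is d,
    with d = 1/n - 1/(n + n0)."""
    d = 1.0 / n - 1.0 / (n + n0)
    std = NormalDist()
    z_alpha = std.inv_cdf(1.0 - alpha)
    return std.cdf((m * d) ** 0.5 - z_alpha)

def snp_inflation(lam):
    """Factor (1 + lam)/lam by which m must grow when n0 = lam * n."""
    return (1.0 + lam) / lam

# Hypothetical setting: pool and reference dataset of 1000 individuals each.
print(approx_lr_power(m=20000, n=1000, n0=1000, alpha=1e-3))
print(snp_inflation(1.0))
```

As $n_0$ grows, $d$ approaches $\frac{1}{n}$ and the power approaches that of the exact LR test with known frequencies.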

T4.2.3 The LR test with another estimate of the allele frequencies

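The motivation for considering the case of an intuitively slightly worse estimate is that, for a certain choice of that estimate, the log-likelihood ratio is identical to the log-likelihood ratio used for a test with a different semantics, but much more powerful, as we discuss in the next section. The case we consider here is therefore essential for comparison. With the same notation as in T4.2.2, instead of estimating $p$ with the plug-in estimator $\bar{p}$, one can consider using $\hat{p}_0$ instead. This yields the log-likelihood ratio statistic:

\[
L = x \log\left(\frac{\hat{p}}{\hat{p}_0}\right) + (1 - x) \log\left(\frac{1 - \hat{p}}{1 - \hat{p}_0}\right), \tag{T6}
\]

with the asymptotic approximation

\[
\tilde{L} = \frac{x - p}{\sqrt{p(1 - p)}} \left( \frac{\tilde{Z}}{\sqrt{n}} - \frac{\tilde{Z}_0}{\sqrt{n_0}} \right) - \frac{1}{2} \left[ \frac{(1 - p)\, x}{p} + p\, \frac{1 - x}{1 - p} \right] \left( \frac{\tilde{Z}^2}{n} - \frac{\tilde{Z}_0^2}{n_0} \right), \tag{T7}
\]

for which we can compute equivalents of the means and variances under the null and the alternate hypothesis assuming $n_0 = \Omega(n)$:

\[
\begin{aligned}
E_0[\tilde{L}] &\sim -\frac{1}{2} \left[ \frac{1}{n} - \frac{1}{n_0} \right], & \mathrm{Var}_0(\tilde{L}) &\sim \frac{1}{n} + \frac{1}{n_0}, \\
E_1[\tilde{L}] &\sim +\frac{1}{2} \left[ \frac{1}{n} + \frac{1}{n_0} \right], & \mathrm{Var}_1(\tilde{L}) &\sim \frac{1}{n} + \frac{1}{n_0}.
\end{aligned}
\]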



This test is not really much worse than the previous one since, if $n_0 = \lambda n$, calculations show that the number of SNPs would also have to be increased by a factor $\frac{1 + \lambda}{\lambda}$ in order to achieve the same power as in the case where the frequencies $p_j$ are assumed known.

T4.3 Discrimination between two pools

The test described in the previous section, where a second pool is used to construct an overall estimate of the frequencies in the reference population, is similar from a methodological point of view to the setup of a test to discriminate between two finite pools. However, it should not be confused with it because the latter test does not answer the same question. By discrimination between two pools we mean the following. There are now two groups of individuals $G_1$ and $G_0$ of sizes $n_1 = n$ and $n_0$, and the question asked is whether an individual of interest $x$ is more likely to belong to group $G_1$ or to group $G_0$. The alternate hypothesis $H_1$ is, as before, that the individual belongs to $G_1$, but the null hypothesis $H_0$ is now that $x$ belongs to $G_0$ (whereas before it was that $x$ is drawn from the reference population that $G_0$ helps estimate). We denote by $\hat{p} = \hat{p}_1$ the allele frequency of the SNP in $G_1$ for consistency with previous notation, and by $\hat{p}_0$ the allele frequency¹ of the SNP in $G_0$. In this case, the models under the null and the alternate hypothesis are respectively:

\[
f_0(x, \hat{p}_0, \hat{p}\,; p) = \hat{p}_0^{\,x} (1 - \hat{p}_0)^{1 - x} \binom{n_0}{n_0 \hat{p}_0} \binom{n}{n \hat{p}}\, p^{\,n_0 \hat{p}_0 + n \hat{p}}\, (1 - p)^{n_0 (1 - \hat{p}_0) + n (1 - \hat{p})},
\]

\[
f_1(x, \hat{p}_0, \hat{p}\,; p) = \hat{p}^{\,x} (1 - \hat{p})^{1 - x} \binom{n_0}{n_0 \hat{p}_0} \binom{n}{n \hat{p}}\, p^{\,n_0 \hat{p}_0 + n \hat{p}}\, (1 - p)^{n_0 (1 - \hat{p}_0) + n (1 - \hat{p})}.
\]

Note that a simple treatment of the same test would reason conditionally on $\hat{p}_0$ and $\hat{p}_1$ and ignore that both of them are samples from the same reference population. Such an analysis would be correct, but would fail to capture the consequences of the fact that $\hat{p}_0$ and $\hat{p}_1$ are likely to be very close when $n_0$ and $n_1$ are large.

The log-likelihood ratio can then be written as:

\[
L = x \log\left(\frac{\hat{p}}{\hat{p}_0}\right) + (1 - x) \log\left(\frac{1 - \hat{p}}{1 - \hat{p}_0}\right),
\]

which is exactly the same log-likelihood ratio statistic as Eqn. (T6). Moreover the likelihood ratio test has exactly the same equivalent (see Eqn. (T7)). However, although the likelihood ratio is the same, the distributions assumed for the data under the null are not the same, and we get a different equivalent for the mean under the null hypothesis. Equivalents of the means and variances under the null and the alternate hypothesis, assuming $n_0 = \Omega(n)$, are therefore now:

\[
\begin{aligned}
E_0[\tilde{L}] &\sim -\frac{1}{2} \left[ \frac{1}{n} + \frac{1}{n_0} \right], & \mathrm{Var}_0(\tilde{L}) &\sim \frac{1}{n} + \frac{1}{n_0}, \\
E_1[\tilde{L}] &\sim +\frac{1}{2} \left[ \frac{1}{n} + \frac{1}{n_0} \right], & \mathrm{Var}_1(\tilde{L}) &\sim \frac{1}{n} + \frac{1}{n_0}.
\end{aligned}
\]

¹ Note that indices in this part refer to groups and not SNPs as previously.

In this case the means are much better separated, which gives this test more power than even the test considered in T4.1, which is only possible because it corresponds to different hypotheses about the data. In particular, for this test, the case where $x$ does not belong to the finite groups $G_1$ or $G_0$ is assumed impossible.

For instance, if $n_0 = \lambda n$, all means and variances increase by a factor $1 + \frac{1}{\lambda}$. The number of SNPs required by this test would decrease by a factor $1 + \frac{1}{\lambda}$ relative to the case where the frequencies $p_j$ are assumed known. If $n = n_0$, the number of SNPs required would be halved relative to the case where the frequencies $p_j$ are known and would decrease by a factor of four relative to the case where the frequencies are estimated as described in Section T4.2.2.
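The three settings analyzed so far differ only by a multiplicative factor on the number of SNPs. A minimal sketch, assuming the Gaussian approximation and using an illustrative function name and hypothetical values of $\alpha$ and $\beta$, puts them side by side:

```python
from statistics import NormalDist

def required_snps(n, n0, alpha, beta):
    """SNPs needed, under the Gaussian approximation, in three settings:
    known frequencies, plug-in estimate built with n0 reference individuals,
    and discrimination between two pools of sizes n and n0."""
    std = NormalDist()
    z2 = (std.inv_cdf(1.0 - alpha) + std.inv_cdf(beta)) ** 2
    lam = n0 / n
    known = z2 * n
    return {
        "known": known,                            # m = (z_a + z_{1-b})^2 n
        "plug_in": known * (1.0 + lam) / lam,      # inflated by (1 + lam)/lam
        "two_pools": known / (1.0 + 1.0 / lam),    # reduced by 1 + 1/lam
    }

# Hypothetical example with pools of equal size.
for setting, m in required_snps(1000, 1000, 1e-6, 0.9).items():
    print(setting, round(m))
```

With $n_0 = n$ the three values stand in the ratio 2 : 4 : 1 (known : plug-in : two pools), which is the factor of four between the plug-in setting and pool discrimination mentioned above.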

T5 The model for genotypes

T5.1 Main analysis

In a model for diploid individuals, since for each individual we have the genotype at each locus, we have to model unordered pairs of alleles $\{x, y\}$, with $x, y \in \{0, 1\}$. The marginal genotype frequency of a SNP in the pool of $n$ individuals is distributed as a rescaled binomial, $\mathrm{Bin}(2n, p)$. Under the null hypothesis, corresponding to the individual being outside of the pool and under Hardy-Weinberg equilibrium, the pair is drawn as two independent Bernoulli random variables with probability $p$ each. Denoting $\pi_k^{(0)} = P^{(0)}(x + y = k)$, we have

\[
\pi_0^{(0)} = (1 - p)^2, \qquad \pi_1^{(0)} = 2p\,(1 - p), \qquad \pi_2^{(0)} = p^2.
\]

Under the alternative, the individual is in the pool, and under Hardy-Weinberg equilibrium the distribution of its genotype is the result of two draws without replacement from the $2n$ alleles in the pool. In other words, $x + y$ follows a hypergeometric distribution with parameters $(2n, 2, 2n\hat{p})$. Denoting $(x)_+ \doteq \max(x, 0)$ and $\pi_k^{(1)} = P^{(1)}(x + y = k)$, we can simplify the hypergeometric formulas to get

\[
\begin{aligned}
\pi_0^{(1)} &= \frac{2n}{2n - 1}\, (1 - \hat{p}) \left( 1 - \hat{p} - \frac{1}{2n} \right)_+, \\
\pi_1^{(1)} &= \frac{2n}{2n - 1}\, 2\hat{p}\,(1 - \hat{p}), \\
\pi_2^{(1)} &= \frac{2n}{2n - 1}\, \hat{p} \left( \hat{p} - \frac{1}{2n} \right)_+.
\end{aligned}
\]

As $n$ becomes large, this is very close to the distribution of two independent draws from a Bernoulli of parameter $\hat{p}$, which would yield the distribution

\[
\pi_0^{(1')} = (1 - \hat{p})^2, \qquad \pi_1^{(1')} = 2\hat{p}\,(1 - \hat{p}), \qquad \pi_2^{(1')} = \hat{p}^2.
\]
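The simplified hypergeometric formulas can be checked against a direct count of draws without replacement. The sketch below uses illustrative function names; `k` stands for the number of copies $2n\hat{p}$ of allele 1 among the $2n$ alleles of the pool.

```python
from math import comb

def genotype_probs_in_pool(n, k):
    """P(x + y = j) for j = 0, 1, 2 under the alternative, using the
    closed forms above with phat = k / (2n)."""
    phat = k / (2 * n)
    c = 2 * n / (2 * n - 1)
    pi0 = c * (1 - phat) * max(1 - phat - 1 / (2 * n), 0.0)
    pi1 = c * 2 * phat * (1 - phat)
    pi2 = c * phat * max(phat - 1 / (2 * n), 0.0)
    return pi0, pi1, pi2

def hypergeometric_probs(n, k):
    """Same probabilities from counting: two alleles drawn without
    replacement from 2n alleles, k of which are allele 1."""
    total = comb(2 * n, 2)
    return tuple(comb(k, j) * comb(2 * n - k, 2 - j) / total for j in range(3))

# Hypothetical pool of n = 50 individuals carrying k = 30 copies of allele 1.
print(genotype_probs_in_pool(50, 30))
print(hypergeometric_probs(50, 30))
```

The positive-part operator only matters at the boundary counts, where a genotype would otherwise receive a tiny negative probability from rounding.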

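Using the shorthand $\bar{x} = 1 - x$, the likelihood ratio test is then:

\[
L^g = \bar{x}\bar{y} \log\left( \frac{\pi_0^{(1)}}{\pi_0^{(0)}} \right) + (x\bar{y} + \bar{x}y) \log\left( \frac{\pi_1^{(1)}}{\pi_1^{(0)}} \right) + xy \log\left( \frac{\pi_2^{(1)}}{\pi_2^{(0)}} \right).
\]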

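The expansion for this likelihood ratio $L^g$ is quite close to the one obtained for $L$ in the haplotype case, essentially because $\pi^{(1)} = \pi^{(1')} + O_p\!\left(\frac{1}{n}\right)$ so that the likelihood ratio behaves like

\[
\begin{aligned}
L^{g'} &= \bar{x}\bar{y} \log\left( \frac{\pi_0^{(1')}}{\pi_0^{(0)}} \right) + (x\bar{y} + \bar{x}y) \log\left( \frac{\pi_1^{(1')}}{\pi_1^{(0)}} \right) + xy \log\left( \frac{\pi_2^{(1')}}{\pi_2^{(0)}} \right) \\
&= xy \log\left( \frac{\hat{p}^2}{p^2} \right) + (x\bar{y} + \bar{x}y) \log\left( \frac{\hat{p}\,(1 - \hat{p})}{p\,(1 - p)} \right) + \bar{x}\bar{y} \log\left( \frac{(1 - \hat{p})^2}{(1 - p)^2} \right) \\
&= (x + y) \log\left( \frac{\hat{p}}{p} \right) + (\bar{x} + \bar{y}) \log\left( \frac{1 - \hat{p}}{1 - p} \right),
\end{aligned}
\]

which, conditionally on $p$ and $\hat{p}$, is the sum of two independent copies of the likelihood ratio $L$ for $2n\hat{p} \sim \mathrm{Bin}(2n, p)$ instead of $n\hat{p} \sim \mathrm{Bin}(n, p)$ that held in the haplotype case. Defining $\tilde{L}^g$ similarly to $\tilde{L}$, we get

\[
\tilde{L}^g = \frac{1}{\sqrt{2n}}\, \frac{(x - p) + (y - p)}{\sqrt{p(1 - p)}}\, \tilde{Z} - \frac{1}{4n}\, \frac{(x - p)^2 + (y - p)^2}{p\,(1 - p)}\, \tilde{Z}^2.
\]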

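Since the sample size $2n$ is doubled and $\tilde{L}^g$ would otherwise be defined as the sum of two copies of $\tilde{L}$ independent and identically distributed given $\tilde{Z}$, the expectations of $\tilde{L}^g$ are the same as those of $\tilde{L}$. But $(\tilde{L}^g)^2$ has also to first order the same mean as two copies of $\tilde{L}^2$, since $x$ and $y$ are conditionally independent given $\tilde{Z}$. We therefore get

\[
\begin{aligned}
E_0[\tilde{L}^g] &= -\frac{1}{2n}, & \mathrm{Var}_0(\tilde{L}^g) &= \frac{1}{n} + O\left(\frac{1}{n^{3/2}}\right), \\
E_1[\tilde{L}^g] &= +\frac{1}{2n} + O\left(\frac{1}{n^{3/2}}\right), & \mathrm{Var}_1(\tilde{L}^g) &= \frac{1}{n} + O\left(\frac{1}{n^{3/2}}\right).
\end{aligned}
\]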

Probability of a “perfect accept” for genotypes. In the case of genotypes, there are again some configurations of the triplets $(\hat{p}, x, y)$ that are not possible under the alternative (assuming no genotyping errors) and lead to a “perfect accept.” These configurations of $(\hat{p}, x, y)$, where $\{x, y\}$ is viewed as an unordered pair, are:

\[
\left( \tfrac{0}{2n}, 1, 1 \right),\ \left( \tfrac{0}{2n}, 1, 0 \right),\ \left( \tfrac{1}{2n}, 1, 1 \right),\ \left( \tfrac{2n - 1}{2n}, 0, 0 \right),\ \left( \tfrac{2n}{2n}, 0, 1 \right),\ \left( \tfrac{2n}{2n}, 0, 0 \right).
\]

The probability of a “perfect accept” on a given SNP is therefore

\[
p^{2n} (1 - p^2) + 2n\, p^{2n - 1} (1 - p)^3 + 2n\, (1 - p)^{2n - 1} p^3 + (1 - p)^{2n} \left( 1 - (1 - p)^2 \right).
\]

An upper bound on this probability is $(1 + 2n)\, 2^{-2n + 1} + (1 + n/4)(1 - a)^{2n - 1}$ and we therefore get

\[
1 - \pi_\infty \ge \left( 1 - \left[ (1 + 2n)\, 2^{-2n + 1} + (1 + n/4)(1 - a)^{2n - 1} \right] \right)^m,
\]

which in practice means that for $m \le 10^6$ and $n \ge 250$ we have $\pi_\infty \le 0.005$, with the same conclusions as in Section T4.1.2.

T6 Genotyping errors

In the case of genotyping errors, we assume that a corrupted version $y$ of the SNP $x$ of the individual of interest is observed and that similarly an independently altered version $\hat{p}_\epsilon$ of the pool frequency $\hat{p}$ is available. There is clearly a loss of information due to both errors, and intuitively this should decrease, or at least not increase, the power of the best possible test. In this section, we prove that this is essentially true. The only assumption that we have to make is that the effect of the genotyping error on $\hat{p}$ and $x$ does not depend on the hypothesis. More precisely, we assume that the conditional distribution of $(y, \hat{p}_\epsilon)$ given $(x, \hat{p})$ does not depend on the hypothesis.

Since, when there is genotyping error, $(y, \hat{p}_\epsilon)$ is observed instead of $(x, \hat{p})$, we can think of the latter as a latent variable and of the former as a random function of it.



In the rest of this section, we switch to the abstract setting that captures this notion and we show the following:

To test a hypothesis about the distribution of a random variable $h$, the likelihood ratio test based on a fixed random function of $h$ is never more powerful than the likelihood ratio test based on $h$ itself.

Proof: Let $h$ and $x$ be two variables, and consider the following simple hypotheses: under $H_0$, the distribution of $h$ is $p_0(h)$ and under $H_1$, the distribution of $h$ is $p_1(h)$. Moreover, the conditional distribution $p(x \mid h)$ of $x$ given $h$ is the same under either of the hypotheses. Obviously, the marginal distribution of $x$ depends on the hypothesis: under the null, it is $p_0(x) = \int p(x \mid h)\, p_0(h)\, dh$ and under the alternative $p_1(x) = \int p(x \mid h)\, p_1(h)\, dh$. We consider three settings.

• In the first setting, $h$ alone is observed, and the most powerful (MP) test is the LR test with LR: $\frac{p_1(h)}{p_0(h)}$.

• In the second setting, both $x$ and $h$ are observed, and the largest power is obtained by the LR test with LR: $\frac{p(x \mid h)\, p_1(h)}{p(x \mid h)\, p_0(h)} = \frac{p_1(h)}{p_0(h)}$.

• In the third setting only $x$ is available, so that the MP test is the LR test with LR: $\frac{p_1(x)}{p_0(x)}$.

As a preliminary remark, note that the expectation of a function under the null or the alternative does not depend on which of the three settings we are in. In each case, $E_0[f(x,h)] = \int f(x,h)\,p(x|h)\,p_0(h)\,dx\,dh$ is defined the same way. The only thing that changes is that, in the first setting, since $x$ is not available, the function $f$ considered would only be a function of $h$, and, symmetrically, in the third setting, $f$ would only be a function of $x$.

A test is entirely specified by its critical function, which is essentially the indicator of the rejection region. In the case of the LR test, the critical function is entirely defined by the likelihood ratio itself and the null distribution under which the expectation $E_0$ is taken. Given this fact and the above remark, it should be clear that the critical function in the first two settings is the same. We denote the critical functions of the LR tests at level $\alpha$ by $\phi_\alpha$ for the first and second LR tests and by $\bar{\phi}_\alpha$ for the third.

The exact definition of these functions does not matter to our argument, but for concreteness, these functions are defined such that

\[ \text{for } h \text{ s.t. } p_1(h) \neq r(\alpha)\,p_0(h), \quad \phi_\alpha(x,h) = \phi_\alpha(h) = 1\left\{ \frac{p_1(h)}{p_0(h)} > r(\alpha) \right\}, \]

\[ \text{for } x \text{ s.t. } p_1(x) \neq \bar{r}(\alpha)\,p_0(x), \quad \bar{\phi}_\alpha(x) = 1\left\{ \frac{p_1(x)}{p_0(x)} > \bar{r}(\alpha) \right\}. \]

Since $\phi_\alpha$ and $\bar{\phi}_\alpha$ correspond to tests of level $\alpha$, the functions $r$ and $\bar{r}$ (which are guaranteed to exist) are defined so that the false positive rate is fixed to $E_0\,\phi_\alpha(h) = E_0\,\bar{\phi}_\alpha(x) = \alpha$.

Since expectations are the same under all settings, it should first be noted that the test defined by $\phi_\alpha$ has the same power under the first two settings; in both cases it is the LR test, which is the most powerful test by the Neyman-Pearson lemma. Similarly, the test defined by $\bar{\phi}_\alpha$ can be used in both the second and the third setting, and for the same reason it has the same power in both settings. In particular, in the second setting it cannot have more power than the LR test based on $\phi_\alpha$, which proves the claim.

To address the question of when the power could be the same: according to the Neyman-Pearson lemma, a test can only be most powerful if its critical function is equal almost everywhere to that of the LR test, under the null and the alternative. This is possible for $\bar{\phi}_\alpha$, and it is the case for instance if $x$ is a sufficient statistic for the pair of distributions $p_0, p_1$.
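As a numerical illustration of the claim (not part of the original argument), consider a toy Gaussian instance: $h \sim N(0,1)$ under the null and $h \sim N(\delta,1)$ under the alternative, with the observed variable equal to $h$ plus independent $N(0,s^2)$ noise. The function name `lr_power` and the parameters `shift` and `noise_sd` are hypothetical; the sketch assumes $\delta > 0$, in which case both LR tests reduce to one-sided threshold tests.

```python
from statistics import NormalDist


def lr_power(shift: float, noise_sd: float, alpha: float) -> float:
    """Power of the level-alpha LR test for N(0,1) vs N(shift,1), shift > 0,
    when h is observed through additive N(0, noise_sd^2) noise."""
    std = NormalDist()
    # The observation is N(0, 1+s^2) under H0 and N(shift, 1+s^2) under H1,
    # so only the standardized shift matters for the one-sided LR test.
    effective_shift = shift / (1.0 + noise_sd ** 2) ** 0.5
    return 1.0 - std.cdf(std.inv_cdf(1.0 - alpha) - effective_shift)


# Test based on h itself versus test based on a noisy function of h.
power_h = lr_power(shift=2.0, noise_sd=0.0, alpha=0.05)
power_x = lr_power(shift=2.0, noise_sd=1.0, alpha=0.05)
print(power_h, power_x)
```

In this toy case the noisy observation can only match the power of the clean one when `noise_sd` is zero, in line with the sufficiency remark above.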

T7 Detecting relatives

In the analysis so far we have ignored the possibility that a relative of the individual of interest $X$ could be in the pool and that its presence could be detected based on the SNPs of $X$. Given a haplotype $x$, we study the LR test to detect the presence of a relative in the pool. As before we model the individuals in the pool and the reference dataset as independent draws from a pure population. Note that this excludes the pool and the reference dataset from containing related individuals.

This setting has the same null as we considered previously but a different alternative. In the alternative, at a SNP $j$, we sample the test individual $X^j \sim \mathrm{Ber}(p_j)$, a relative $\tilde{X}^j$ given $X^j$, and $n-1$ other individuals $X^j_i \sim \mathrm{Ber}(p_j)$, $i = 1, \ldots, n-1$. The allele frequency from the pool is

\[ \hat{P}_j = \frac{\sum_{i=1}^{n-1} X^j_i + \tilde{X}^j}{n}. \]

Let $\gamma$ be the probability that $\tilde{X}^j$ inherits $X^j$'s allele; for example, $\gamma = \frac{1}{2}$ for siblings. We can express conditional probabilities in terms of $\gamma$; e.g., $\Pr(\tilde{X}^j = 1 \mid X^j = 1) = \gamma + (1-\gamma)p_j$. This conditional probability is given by

\[ \begin{array}{c|cc} p(\tilde{x}_j \mid x_j) & \tilde{x}_j = 0 & \tilde{x}_j = 1 \\ \hline x_j = 0 & 1 - p_j(1-\gamma) & p_j(1-\gamma) \\ x_j = 1 & (1-p_j)(1-\gamma) & 1 - (1-p_j)(1-\gamma) \end{array} \]
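The conditional table can be sanity-checked with a few lines of code. The sketch below is illustrative only; the function name `relative_table` is hypothetical, and rows are indexed by the test individual's allele while columns are indexed by the relative's allele, which is our reading of the table layout.

```python
def relative_table(p: float, gamma: float) -> list[list[float]]:
    """2x2 table of Pr(relative allele = col | individual allele = row).

    p is the population allele frequency and gamma the probability that
    the relative inherits the individual's allele.
    """
    fresh = 1.0 - gamma  # probability the relative's allele is a fresh draw
    return [
        [1.0 - p * fresh, p * fresh],                  # individual allele 0
        [(1.0 - p) * fresh, 1.0 - (1.0 - p) * fresh],  # individual allele 1
    ]


table = relative_table(p=0.3, gamma=0.5)
print(table)
```

With `gamma = 1` the table is the identity (the relative copies the individual), and with `gamma = 0` both rows reduce to the population frequencies.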



The log-likelihood ratio gives us

\[ L = \sum_{j=1}^{m} x_j \log\frac{q_j}{p_j} + (1 - x_j)\log\frac{1 - q_j}{1 - p_j}, \]

where $q_j = \gamma\hat{p}_j + (1-\gamma)p_j$. Analyzing the above expression, the mean and variance of $L$ under the null and the alternative converge to the same values as before, up to a multiplicative factor of $\gamma^2$. In terms of the minimum number of SNPs, this translates to

\[ m \approx n\left(\frac{z_\alpha + z_{1-\beta}}{\gamma}\right)^2. \]

Thus, finding a sibling ($\gamma = \frac{1}{2}$), with the same bound on false positive and false negative rates, requires four times as many SNPs as before.
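The scaling in $\gamma$ is easy to tabulate. In the minimal sketch below, the function name `min_snps` is hypothetical and the two normal quantiles are passed in directly, since the quantile convention is fixed elsewhere in the text; the example values (the 0.95 and 0.90 standard normal quantiles) are assumptions chosen purely for illustration.

```python
from statistics import NormalDist


def min_snps(n: int, gamma: float, z_alpha: float, z_beta: float) -> float:
    """Approximate number of SNPs m = n * ((z_alpha + z_beta) / gamma)^2."""
    return n * ((z_alpha + z_beta) / gamma) ** 2


std = NormalDist()
z_a = std.inv_cdf(0.95)  # illustrative quantile, not taken from the paper
z_b = std.inv_cdf(0.90)  # illustrative quantile, not taken from the paper

m_individual = min_snps(1000, 1.0, z_a, z_b)  # gamma = 1: the individual itself
m_sibling = min_snps(1000, 0.5, z_a, z_b)     # gamma = 1/2: a sibling
print(m_individual, m_sibling, m_sibling / m_individual)
```

Whatever quantiles are used, the ratio between the two counts depends only on $1/\gamma^2$.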

T8 Technical Results

T8.1 Simultaneous central limit theorem in m and n

T8.1.1 Statement

Intuitively, as $n$ becomes large, taking the average of SNP values across the pool has the effect of gradually masking the information that is specific to the individuals in the pool; indeed $\hat{p} \xrightarrow{a.s.} p$, and, under the null as well as under the alternative, $\sqrt{n}\,L \rightsquigarrow N(0,1)$, which does not allow discriminating the hypotheses.

Discrimination becomes possible when $n$ and $m$ are both large simultaneously. A fundamental question concerns the relative size of $m$ and $n$ for which discrimination is possible. In this section, we show that $m$ has to be at least linear in $n$ and that in the linear case the power increases exponentially with the slope $\tau$.

Indeed, we show the following results. If $n \to \infty$ and $m \to \infty$, then:

if $\frac{m}{n} \to \tau < +\infty$, then

\[ L\,1_{\{L \neq -\infty\}} \rightsquigarrow \begin{cases} N\left(-\frac{\tau}{2}, \tau\right) & \text{under } H_0, \\ N\left(+\frac{\tau}{2}, \tau\right) & \text{under } H_1; \end{cases} \]

if $\frac{m}{n} \to \infty$, then

\[ \frac{n}{m}\,L\,1_{\{L \neq -\infty\}} \xrightarrow{P_0} -\frac{1}{2} \quad\text{and}\quad \frac{n}{m}\,L\,1_{\{L \neq -\infty\}} \xrightarrow{P_1} +\frac{1}{2}, \]

and in the latter case the convergence to the limit is $O_p\left(\frac{n}{m}\right)$.
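Taking the two limiting Gaussians in the linear regime at face value, a one-sided threshold test on $L$ at level $\alpha$ rejects when $L > -\frac{\tau}{2} + \sqrt{\tau}\,\Phi^{-1}(1-\alpha)$, which gives the limiting power $\Phi(\sqrt{\tau} - \Phi^{-1}(1-\alpha))$. The sketch below simply evaluates that expression; the function name `limiting_power` is hypothetical, and the formula is our own algebra from the stated limits rather than a result quoted from the text.

```python
from math import sqrt
from statistics import NormalDist


def limiting_power(tau: float, alpha: float) -> float:
    """Power of the level-alpha threshold test between N(-tau/2, tau)
    and N(+tau/2, tau), i.e. Phi(sqrt(tau) - Phi^{-1}(1 - alpha))."""
    std = NormalDist()
    return std.cdf(sqrt(tau) - std.inv_cdf(1.0 - alpha))


# Power as a function of the slope tau = lim m/n at a 5% false positive rate.
for tau in (0.0, 1.0, 4.0, 16.0):
    print(tau, limiting_power(tau, alpha=0.05))
```

At $\tau = 0$ the power equals the level $\alpha$, and it increases monotonically with $\tau$.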



T8.1.2 Proof

Denote

• $F_j = \{L_j \neq -\infty\}$, the event that the likelihood ratio for SNP $j$ is finite;

• $A_j = \{|\hat{p}_j - p_j| \le p_j(1-\epsilon);\ \hat{p}_j \notin \{0,1\}\}$, the event that the pool frequency $\hat{p}_j$ is loosely concentrated around $p_j$, which should occur with overwhelming probability for large $n$.

Consider the following decomposition:

\[ L\,1_{\{L \neq -\infty\}} = L_A + L_{F\setminus A}, \qquad L_A = \sum_{j=1}^{m} L_j\,1_{A_j}, \qquad L_{F\setminus A} = \sum_{j=1}^{m} L_j\,1_{F_j\setminus A_j}. \]

Eqn. (A18) of Section A4 shows that, provided $a \le p_j \le 1-a$ and $n > c\,\frac{2}{a(1-\epsilon)}$ for some $c > 1$, there exists a constant $C_a$ independent of $n$ such that $E[L_j^2 \mid F_j\setminus A_j] \le C_a$. On the other hand, Eqn. (A15) in Section A2 shows that there exists a constant $c_1$ such that $P(F_j\setminus A_j) \le 2\exp(-c_1 n)$. Thus

\[ E\left[L^2_{F\setminus A}\right] = \sum_{j=1}^{m} E[L_j^2 \mid F_j\setminus A_j]\,P(F_j\setminus A_j) \le 2m\,C_a \exp(-c_1 n) = o\left(\frac{n}{m}\right). \tag{T8} \]

This shows that $\frac{n}{m} L_{F\setminus A}$ converges to zero sufficiently quickly, whether $\frac{m}{n}$ converges to a finite limit or not, to be negligible with respect to the other term.

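We now consider $L_A$. We introduce the functions

\[ \ell'_p(x,z) = \frac{1}{\sqrt{n}}\,\frac{x-p}{\sqrt{p(1-p)}}\,z \quad\text{and}\quad \ell''_p(x,z) = -\frac{1}{2n}\left[\frac{(1-p)\,x}{p} + \frac{p\,(1-x)}{1-p}\right] z^2, \tag{T9} \]

where $z \in \mathbb{R}$, $p \in [0,1]$ and $x \in \{0,1\}$. Then,

\[ L_j = \ell'_{p_j}(x_j, Z_j)\,1_{A_j} + \ell''_{p_j}(x_j, Z_j)\,1_{A_j} + R(p_j, n), \tag{T10} \]

where $R(p_j, n) = O_p(n^{-3/2})$. However, the stronger statement $\sum_{j=1}^{m} R(p_j, n) = O_p\left(\frac{m}{n^{3/2}}\right)$ holds because of the independence of each term. We therefore have

\[ L_A = \underbrace{\sum_{j=1}^{m} \ell'_{p_j}(x_j, Z_j)\,1_{A_j}}_{\ell'} + \underbrace{\sum_{j=1}^{m} \ell''_{p_j}(x_j, Z_j)\,1_{A_j}}_{\ell''} + O_p\left(\frac{m}{n^{3/2}}\right). \tag{T11} \]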



We denote the conditional expectations given the pool frequency $\hat{p}_j$, or equivalently its standardized form $Z_j$, under both hypotheses indexed by $\kappa \in \{0,1\}$, by

\[ h'_{\kappa,p_j}(z) = E_\kappa[\ell'_{p_j}(x_j, Z_j) \mid Z_j = z] \quad\text{and}\quad h''_{\kappa,p_j}(z) = E_\kappa[\ell''_{p_j}(x_j, Z_j) \mid Z_j = z]. \]

Given that $1_{A_j} \xrightarrow{P} 1$ and that $Z_j \rightsquigarrow \tilde{Z}_j \overset{d}{=} N(0,1)$, by continuity of $h'_{\kappa,p_j}$ and $h''_{\kappa,p_j}$, and by Slutsky's lemma, we have that

\[ n\left(h'_{\kappa,p_j}(Z_j)\,1_{A_j} + h''_{\kappa,p_j}(Z_j)\,1_{A_j}\right) \rightsquigarrow n\left(h'_{\kappa,p_j}(\tilde{Z}_j) + h''_{\kappa,p_j}(\tilde{Z}_j)\right). \]

We should note that one cannot conclude immediately that $\sum_j h'_{\kappa,p_j}(Z_j)\,1_{A_j} \rightsquigarrow \sum_j h'_{\kappa,p_j}(\tilde{Z}_j)$ since $m \to \infty$; even statements about moments, such as $E[\sum_j h'_{\kappa,p_j}(Z_j)\,1_{A_j}] \sim \sum_j E[h'_{\kappa,p_j}(\tilde{Z}_j)]$, are not immediate. However, the conditional expectations $h'$ and $h''$ are actually second-degree polynomials in the i.i.d. variables $\tilde{Z}_j$, with coefficients bounded above because of the assumption $p_j > a$; if we consider a group of terms of equal degree, a Lindeberg CLT applies to it, given that the terms are independent and that their second moment is bounded (since the coefficients are), and this shows that the statement $E[\sum_j h'_{\kappa,p_j}(Z_j)\,1_{A_j}] \sim \sum_j E[h'_{\kappa,p_j}(\tilde{Z}_j)]$ is true. More generally, the equivalents of all moments of $\ell'$ and $\ell''$ can be obtained, and in particular we have

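\[ \begin{aligned} E_0[\ell'] &= 0, & E_0[\ell''] &= -\frac{m}{2n} + o\left(\frac{m}{n}\right), & E_0\left[(\ell' + \ell'')^2\right] &= \frac{m}{n} + o\left(\frac{m}{n}\right), \\ E_1[\ell'] &= \frac{m}{n} + o\left(\frac{m}{n}\right), & E_1[\ell''] &= -\frac{m}{2n} + o\left(\frac{m}{n}\right), & E_1\left[(\ell' + \ell'')^2\right] &= \frac{m}{n} + o\left(\frac{m}{n}\right). \end{aligned} \]

Denoting, under different rescalings of $L_A$, the means (resp. variances) under the null and the alternative by $\mu_0$ and $\mu_1$ (resp. $\sigma_0^2$ and $\sigma_1^2$), we have

\[ \begin{array}{c|cc} & L_A & \frac{n}{m} L_A \\ \hline \mu_0 & -\frac{m}{2n} & -\frac{1}{2} \\ \mu_1 & +\frac{m}{2n} & +\frac{1}{2} \\ \sigma_0^2 \sim \sigma_1^2 & \frac{m}{n} & \frac{n}{m} \end{array} \]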

From these equivalents of the moments, if $\frac{m}{n} \to \tau$, we can appeal to the Lindeberg-Feller central limit theorem. Indeed, in that case, we have that $E_0[L_A] \to -\frac{1}{2}\tau$ and $E_1[L_A] \to +\frac{1}{2}\tau$, and moreover each term $\ell'_{p_j}(x_j, Z_j) + \ell''_{p_j}(x_j, Z_j)$ is uniformly $O_p\left(\frac{1}{\sqrt{n}}\right)$ from the assumption that $p_j > a$, which shows that the condition $\lim_{m,n\to\infty} \sum_j E\left(L_j^2\,1_{A_j}\,1_{\{|L_j| > \varepsilon\}}\right) = 0$ is satisfied for all $\varepsilon > 0$.

The case $\frac{m}{n} \to \infty$ is simpler since in that case $\frac{n^2}{m^2}\,\mathrm{Var}_\kappa[L_A] \to 0$ for $\kappa = 0, 1$, which shows that $\frac{n}{m} L_A$ converges in probability under the null and the alternative to the limits $-\frac{1}{2}$ and $+\frac{1}{2}$ respectively.

T8.1.3 Asymptotic Gaussian approximations

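The simultaneous central limit theorem for $m$ and $n$ provides a justification for the substitution of Gaussian variables $\tilde{Z}_j$ for $Z_j$ in our analysis and for the fact that we ignored the indicators $1_{A_j}$.

To give a sense of why such a justification is needed, we make a couple of preliminary remarks. As argued earlier, $n\,(h'_{\kappa,p_j}(Z_j) + h''_{\kappa,p_j}(Z_j))\,1_{A_j}$ converges in distribution to a second-degree polynomial in the Gaussian variable $\tilde{Z}_j$. It should be noted, however, that $n\,(\ell'_{p_j}(x_j, Z_j) + \ell''_{p_j}(x_j, Z_j))\,1_{A_j}$ does not converge in distribution and that $\sqrt{n}\,(\ell'_{p_j}(x_j, Z_j) + \ell''_{p_j}(x_j, Z_j))\,1_{A_j}$ converges to the same limit under the null and the alternative.

For each $j$ we define a Bernoulli random variable $\tilde{x}_j$ as $\tilde{x}_j \overset{d}{=} \mathrm{Bin}(1, p_n(\tilde{Z}_j))$, where we define

\[ p_n(z) = \begin{cases} 0 & : \tilde{p}_n(z) < 0 \\ \tilde{p}_n(z) & : \tilde{p}_n(z) \in [0,1] \\ 1 & : \tilde{p}_n(z) > 1 \end{cases} \quad\text{with}\quad \tilde{p}_n(z) = p + \sqrt{\frac{p(1-p)}{n}}\,z. \]

Then if

\[ \tilde{L}_j = \ell'_{p_j}(\tilde{x}_j, \tilde{Z}_j) + \ell''_{p_j}(\tilde{x}_j, \tilde{Z}_j) = \frac{1}{\sqrt{n}}\,\frac{\tilde{x}_j - p_j}{\sqrt{p_j(1-p_j)}}\,\tilde{Z}_j - \frac{1}{2n}\left[\frac{(1-p_j)\,\tilde{x}_j}{p_j} + \frac{p_j\,(1-\tilde{x}_j)}{1-p_j}\right]\tilde{Z}_j^2, \]

we have

\[ L \underset{m,n\to\infty}{\rightsquigarrow} \sum_j \tilde{L}_j. \]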
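The Gaussian surrogate can be simulated directly to check that its mean is close to $-\frac{m}{2n}$ under the null and $+\frac{m}{2n}$ under the alternative. The sketch below is a Monte Carlo illustration under stated assumptions: a single common allele frequency `p` for all SNPs, the individual's allele drawn from the clipped frequency $p + \sqrt{p(1-p)/n}\,z$ under the alternative, and from `p` independently of the pool under the null (our reading of the null). The function name `simulate_sum` is hypothetical.

```python
import random
from math import sqrt


def simulate_sum(m: int, n: int, p: float, alternative: bool, rng: random.Random) -> float:
    """One draw of the sum over m SNPs of the Gaussian surrogate terms."""
    scale = sqrt(p * (1.0 - p))
    root_n = sqrt(n)
    total = 0.0
    for _ in range(m):
        z = rng.gauss(0.0, 1.0)  # Gaussian stand-in for the standardized pool frequency
        if alternative:
            # allele frequency shifted by the pool deviation, clipped to [0, 1]
            prob = min(1.0, max(0.0, p + scale * z / root_n))
        else:
            prob = p  # individual independent of the pool
        x = 1 if rng.random() < prob else 0
        first = (x - p) / scale * z / root_n
        second = -((1.0 - p) * x / p + p * (1 - x) / (1.0 - p)) * z * z / (2.0 * n)
        total += first + second
    return total


rng = random.Random(0)
m, n, p, reps = 100, 100, 0.3, 2000  # tau = m / n = 1
mean_h0 = sum(simulate_sum(m, n, p, False, rng) for _ in range(reps)) / reps
mean_h1 = sum(simulate_sum(m, n, p, True, rng) for _ in range(reps)) / reps
print(mean_h0, mean_h1)
```

With $\tau = 1$ the two empirical means should land near $-0.5$ and $+0.5$, up to Monte Carlo error.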

T8.2 Lindeberg-Feller CLT

In this section, we present a result which is complementary to the result of the previous section. While in the previous case $n$ was assumed to be large, we now present an analysis in which $n$ is not assumed to be large. The likelihood ratio is more difficult to analyze in this case because the simple approximation of (T2) is no longer valid. We show that the conditions of the Lindeberg-Feller central limit theorem nonetheless still hold, which means that a Gaussian approximation to the test statistic is valid. Since the means and variances can readily be computed, the power of the test can still be computed to give privacy guarantees, and accurate p-values can be computed for an actual test.

Conditionally on the event $\hat{p}_j \notin \{0,1\}$ for all $j$, we show that the Lindeberg-Feller CLT applies to $\frac{1}{\sqrt{m}} L$ in the sense that

\[ \text{under hypothesis } H_\kappa, \quad \frac{1}{\sqrt{m}}\,L \rightsquigarrow N(\mu_\kappa, \sigma_\kappa^2), \]

where $\mu_\kappa$ and $\sigma_\kappa^2$ are the mean and variance for hypothesis $H_\kappa$.

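In fact, showing that $E[L_j^2 \mid \hat{p}_j \notin \{0,1\}]$ is bounded from above by a constant independent of $n$ suffices. Using the concavity of the logarithm, we first upper bound the log-ratio

\[ \left|\log\left(\frac{\hat{p}}{p}\right)\right| \le \max\left(\frac{\hat{p} - p}{p}, \frac{p - \hat{p}}{\hat{p}}\right) \]

so that

\[ L^2 \overset{\Delta}{=} x \log\left(\frac{\hat{p}}{p}\right)^2 \le x\,\frac{(\hat{p} - p)^2}{p^2} + x\,\frac{(\hat{p} - p)^2}{\hat{p}^2}. \]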

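Since

\[ E_0[L^2 \mid \hat{p}] \le p\left(\frac{1}{p^2} + \frac{1}{\hat{p}^2}\right)(\hat{p} - p)^2 = \frac{(\hat{p} - p)^2}{p} + p - 2\,\frac{p^2}{\hat{p}} + \frac{p^3}{\hat{p}^2} \tag{T12} \]

we have, using the bounds on the moments of $\frac{1}{\hat{p}}$ obtained in Section A3, that

\[ E_0\left[L^2\,1_{\{\hat{p} \notin \{0,1\}\}}\right] \le \frac{1-p}{n} + p - 2\,\frac{p^2}{p} + 6\,\frac{p^3}{p^2} = 5p + \frac{1-p}{n} \]

so that by symmetry $E_0[L^2 \mid \hat{p} \notin \{0,1\}] \le (1 - \pi_\infty)\left(5 + \frac{1}{n}\right)$. For the alternative hypothesis, we have: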

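\[ E_1[L^2 \mid \hat{p}] \le \hat{p}\left(\frac{1}{p^2} + \frac{1}{\hat{p}^2}\right)(\hat{p} - p)^2 = \frac{\hat{p}^3}{p^2} - 2\,\frac{\hat{p}^2}{p} + \hat{p} + \frac{p^2}{\hat{p}} - 2p + \hat{p} \tag{T13} \]

\[ E_1\left[L^2\,1_{\{\hat{p} \notin \{0,1\}\}}\right] \le \frac{1}{p^2}\left(p^3 + \frac{p(1-p)}{n}\left[1 + \frac{1-2p}{n}\right]\right) + p + 2\,\frac{p^2}{p} + p \]

so that, using symmetry again, $E_1[L^2 \mid \hat{p} \notin \{0,1\}] \le (1 - \pi_\infty)\left(5 + \frac{4}{np(1-p)}\right)$.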

Note that both $E_0[L^2 \mid \hat{p} \notin \{0,1\}]$ and $E_1[L^2 \mid \hat{p} \notin \{0,1\}]$ are bounded independently of $n$ and that, if $p$ is bounded away from $\{0,1\}$, then both second moments are bounded independently of $p$.

This shows that, for $\kappa \in \{0,1\}$, $\frac{1}{m}\sum_{j=1}^{m} E_\kappa[L_j^2 \mid \hat{p} \notin \{0,1\}]$ converges to a finite value and that $\sum_{j=1}^{m} E_\kappa\left[\frac{1}{m} L_j^2\,1_{\{L_j^2 > \epsilon m\}} \mid \hat{p} \notin \{0,1\}\right]$ converges to zero for all $\epsilon > 0$, which are the conditions of the Lindeberg-Feller central limit theorem [3].
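For a single SNP, the conditional second moment under the null can be computed exactly by enumerating the binomial pool counts, which gives a direct check that it stays bounded for small as well as large $n$. The sketch below assumes the per-SNP statistic $L_j = x\log(\hat{p}/p) + (1-x)\log((1-\hat{p})/(1-p))$ with $x \sim \mathrm{Ber}(p)$ independent of $n\hat{p} \sim \mathrm{Bin}(n,p)$; the function name `second_moment_null` is hypothetical.

```python
from math import comb, log


def second_moment_null(n: int, p: float) -> float:
    """Exact E_0[L_j^2 | 0 < p_hat < 1] for one SNP with pool size n."""
    weighted = 0.0
    mass = 0.0
    for k in range(1, n):  # pool counts with p_hat not in {0, 1}
        w = comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        p_hat = k / n
        l_one = log(p_hat / p)                    # value of L_j when x = 1
        l_zero = log((1.0 - p_hat) / (1.0 - p))   # value of L_j when x = 0
        weighted += w * (p * l_one ** 2 + (1.0 - p) * l_zero ** 2)
        mass += w
    return weighted / mass


for n in (5, 50, 500):
    print(n, second_moment_null(n, p=0.3))
```

The printed values sit far below the constant $5 + \frac{1}{n}$ appearing in the bound, which is expected since the bound is not meant to be tight.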



A Appendix

A1 Proof for the approximation to the log

In this section, we prove that the log-likelihood ratio can be approximated by a Taylor expansion to the second order, in the sense that the remainder $R(p,n)$ in

\[ L = \frac{1}{\sqrt{n}}\,\frac{x-p}{\sqrt{p(1-p)}}\,Z - \frac{1}{2n}\,\frac{(x-p)^2}{p(1-p)}\,Z^2 + R(p,n), \]

where $Z \doteq \sqrt{n}\,\frac{\hat{p} - p}{\sqrt{p(1-p)}}$, satisfies $R(p,n) = O_p(n^{-3/2})$.

Since $\log(1+x)$ can be expanded as a series as long as $|x| < 1$, and since the expansion diverges on the edge of that domain, we can safely expand both logarithmic terms in the LR statistic on the following event:

\[ A = \{|\hat{p} - p| \le (1-\epsilon)\min(p, 1-p)\}. \tag{A14} \]

Eqn. (A15) of Section A2 shows that there exists $c_1 > 0$ such that $P(A^c) \le 2\exp(-c_1 n)$.

This means that $L\,1_{A^c} = O_p(\exp(-c_1 n))$ and that this term is therefore also $O_p(n^{-3/2})$. Furthermore, on $A$, the remainder in the Taylor expansion, which we denote $R'(p,n)$, is also $O_p(n^{-3/2})$. Indeed, consider each term of order $k \ge 3$ in the series, where $x$ is assumed to satisfy $|x| \le 1-\epsilon$. We have:

\[ \frac{|x|^k}{k} \le |x|^3\,\frac{(1-\epsilon)^{k-3}}{k}. \]

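Since, for the expansion of the log terms, $x$ is either

\[ \frac{\hat{p} - p}{p} = Z\sqrt{\frac{1-p}{np}} \quad\text{or}\quad \frac{\hat{p} - p}{1-p} = Z\sqrt{\frac{p}{n(1-p)}}, \]

we have in each case $|x| \le |Z|\,\frac{c_p}{\sqrt{n}}$, where we introduced $c_p = \max\left(\sqrt{\frac{p}{1-p}}, \sqrt{\frac{1-p}{p}}\right)$.

We then have

\[ R'(p,n) \le |Z|^3\,\frac{c_p^3}{n\sqrt{n}} \sum_{k=3}^{\infty} \frac{(1-\epsilon)^{k-3}}{k} = O_p\left(n^{-3/2}\right), \]

since $Z = O_p(1)$. Since $R(p,n) \le L\,1_{A^c} + R'(p,n)\,1_A$, we have shown the desired result.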
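The $n^{-3/2}$ rate of the remainder can be checked numerically by holding the standardized deviation $Z$ fixed and comparing the exact log term with its second-order expansion. In the sketch below the function name `remainder` is hypothetical, and fixing $Z = z$ is an illustrative device (in the text $Z$ is random and $O_p(1)$).

```python
from math import log1p, sqrt


def remainder(n: int, p: float, x: int, z: float) -> float:
    """Exact log-likelihood-ratio term minus its second-order expansion,
    for an individual allele x in {0, 1} and standardized deviation z."""
    if x == 1:
        u = z * sqrt((1.0 - p) / (n * p))    # (p_hat - p) / p
    else:
        u = -z * sqrt(p / (n * (1.0 - p)))   # -(p_hat - p) / (1 - p)
    exact = log1p(u)                         # log(p_hat/p) or log((1-p_hat)/(1-p))
    second_order = u - 0.5 * u * u           # the two leading terms of the expansion
    return exact - second_order


# Multiplying n by 4 should divide the remainder by roughly 4^(3/2) = 8.
ratio = remainder(1000, 0.3, 1, 1.0) / remainder(4000, 0.3, 1, 1.0)
print(ratio)
```

The ratio approaches 8 as $n$ grows, consistent with a leading remainder term of order $u^3/3$.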



A2 Concentration of binomial random variables

In this section, we show that the regime in which the log terms in the LR statistic can be expanded has very high probability. Specifically, we show that the complement of $A$ defined in Eqn. (A14) has vanishingly small probability. Since $A^c$ corresponds to a large deviation of a binomial random variable, either the Hoeffding or the Chernoff inequality can be used to obtain upper bounds. To get a good bound, we use the Chernoff inequality and a sharp lower bound on the KL divergence to control the concentration. The Chernoff inequality for the binomial is

\[ P(\hat{p} - p > \xi) \le \exp\left(-n\,KL(p + \xi \| p)\right) \]

with

\[ KL(p + \xi \| p) = p\left(1 + \frac{\xi}{p}\right)\log\left(1 + \frac{\xi}{p}\right) + (1-p)\left(1 - \frac{\xi}{1-p}\right)\log\left(1 - \frac{\xi}{1-p}\right). \]

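One can verify that the following inequality holds, considering lower bounds of the function $(1+x)\log(1+x)$:

\[ \forall x > -1, \quad \log(1+x) \ge \frac{x + \frac{1}{2}x^2 - \frac{1}{6}x^3}{1+x}. \]

Thus

\[ \begin{aligned} KL(p + \xi \| p) &\ge \xi + \frac{1}{2}\frac{\xi^2}{p} - \frac{1}{6}\frac{\xi^3}{p^2} - \xi + \frac{1}{2}\frac{\xi^2}{1-p} + \frac{1}{6}\frac{\xi^3}{(1-p)^2} \\ &\ge \frac{1}{2}\frac{\xi^2}{p(1-p)} - \frac{1}{6}\xi^3\left[\frac{1}{p^2} - \frac{1}{(1-p)^2}\right]. \end{aligned} \]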

We consider $-(1-\epsilon)\min(p, 1-p) \le \hat{p} - p \le \xi \doteq (1-\epsilon)\min(p, 1-p)$. Without loss of generality we assume that $p < 1-p$. We have:

\[ KL(p + (1-\epsilon)p \| p) \ge \frac{1}{2}\frac{(1-\epsilon)^2 p}{1-p} - \frac{1}{6}(1-\epsilon)^3 p\left[1 - \frac{p^2}{(1-p)^2}\right] \ge \frac{1}{3}(1-\epsilon)^2 p \]

so that in general we get

\[ KL(p + \xi \| p) \ge \frac{1}{3}(1-\epsilon)^2 \min(p, 1-p). \]

Since we also have the inequality

\[ P(\hat{p} - p < -\xi) \le \exp\left(-n\,KL(p - \xi \| p)\right) \]

and since by symmetry the same lower bound applies, we finally have:

\[ P\left(|\hat{p} - p| > (1-\epsilon)\min(p, 1-p)\right) \le 2\exp\left(-\frac{n}{3}(1-\epsilon)^2 \min(p, 1-p)\right). \tag{A15} \]
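The final concentration bound can be compared with the exact binomial tail probability for specific parameter values. The sketch below is a numerical check only; the function name `tail_and_bound` is hypothetical, and the tail is read as the probability that the pool frequency deviates from $p$ by more than $(1-\epsilon)\min(p,1-p)$.

```python
from math import comb, exp


def tail_and_bound(n: int, p: float, eps: float) -> tuple[float, float]:
    """Exact P(|p_hat - p| > (1 - eps) * min(p, 1 - p)) for n * p_hat ~ Bin(n, p),
    together with the exponential bound 2 * exp(-(n / 3) * (1 - eps)^2 * min(p, 1 - p))."""
    threshold = (1.0 - eps) * min(p, 1.0 - p)
    exact = sum(
        comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k / n - p) > threshold
    )
    bound = 2.0 * exp(-n / 3.0 * (1.0 - eps) ** 2 * min(p, 1.0 - p))
    return exact, bound


for n, p, eps in ((100, 0.3, 0.5), (50, 0.5, 0.2), (200, 0.1, 0.1)):
    print(n, p, eps, tail_and_bound(n, p, eps))
```

In these examples the exact tail is well below the bound, which reflects the slack introduced by the lower bound on the KL divergence.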



A3 Bounds on moments of $1/\hat{p}$

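We first lower and upper bound $E[\hat{p}^{-1} \mid \hat{p} \notin \{0,1\}]$ and $E[\hat{p}^{-2} \mid \hat{p} \notin \{0,1\}]$. Given that for $k > 0$ and $x > 0$, $x \mapsto x^{-k}$ is a convex function, we have, by Jensen's inequality, $E[\hat{p}^{-k} \mid \hat{p} \notin \{0,1\}] \ge p^{-k}$. We have, using that for $k \ge 1$, $\frac{1}{k} \le \frac{2}{k+1}$,

\[ \begin{aligned} E\left[\frac{1}{\hat{p}}\,1_{\{\hat{p} \notin \{0,1\}\}}\right] &= n\sum_{k=1}^{n-1}\binom{n}{k} p^k (1-p)^{n-k}\,\frac{1}{k} \\ &\le 2n\sum_{k=1}^{n-1}\binom{n}{k} p^k (1-p)^{n-k}\,\frac{1}{k+1} \\ &\le \frac{2n}{p(n+1)}\sum_{k=1}^{n-1}\binom{n+1}{k+1} p^{k+1} (1-p)^{n-k} \\ &\le \frac{2}{p}. \end{aligned} \]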

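Similarly, using $\frac{1}{k^2} \le \frac{6}{(k+1)(k+2)}$, we have

\[ \begin{aligned} E\left[\frac{1}{\hat{p}^2}\,1_{\{\hat{p} \notin \{0,1\}\}}\right] &\le n^2\sum_{k=1}^{n-1}\binom{n}{k} p^k (1-p)^{n-k}\,\frac{1}{k^2} \\ &\le 6n^2\sum_{k=1}^{n-1}\binom{n}{k} p^k (1-p)^{n-k}\,\frac{1}{(k+1)(k+2)} \\ &\le \frac{6n^2}{p^2(n+1)(n+2)}\sum_{k=1}^{n-1}\binom{n+2}{k+2} p^{k+2} (1-p)^{n-k} \\ &\le \frac{6}{p^2}. \end{aligned} \]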
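Both moment bounds can be verified exactly for given $n$ and $p$ by summing over the binomial counts $k = 1, \ldots, n-1$. The sketch below is only a numerical check; the function name `inverse_moments` is hypothetical.

```python
from math import comb


def inverse_moments(n: int, p: float) -> tuple[float, float]:
    """Exact E[(1/p_hat) 1{0 < p_hat < 1}] and E[(1/p_hat^2) 1{0 < p_hat < 1}]
    for n * p_hat ~ Bin(n, p)."""
    first = 0.0
    second = 0.0
    for k in range(1, n):
        w = comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        first += w * n / k          # 1 / p_hat = n / k
        second += w * (n / k) ** 2  # 1 / p_hat^2 = (n / k)^2
    return first, second


for n, p in ((5, 0.05), (20, 0.3), (200, 0.5), (10, 0.9)):
    m1, m2 = inverse_moments(n, p)
    print(n, p, m1, 2.0 / p, m2, 6.0 / p ** 2)
```

Each printed pair compares the exact truncated moment with its bound $2/p$ or $6/p^2$.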

A4 LR moment bounds for large deviations of the binomial

The purpose of this section is to show that if $F = \{\hat{p} \notin \{0,1\}\}$ is the event on which $L$ is finite or zero and $A = \{|\hat{p} - p| \le \min(p, 1-p)(1-\epsilon)\}$ is the event that $\hat{p}$ is moderately concentrated, then, conditionally on the event $F\setminus A$, the likelihood ratio is bounded in probability independently of $n$. This will guarantee that the few SNPs for which $\hat{p}_j$ is not concentrated contribute a negligible term to the overall log-likelihood ratio. We make use of the following lemma from [4]: Let

\[ f_n(p, x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad F_n(p, x) = \sum_{k=x}^{n} f_n(p, k). \]

Provided $p \le \frac{x}{n} \le 1$, we have

\[ f_n(p, x) \le F_n(p, x) \le \frac{x+1}{n+1}\,\frac{1-p}{\frac{x+1}{n+1} - p}\,f_n(p, x). \tag{A16} \]
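The two-sided tail bound (A16) can be checked numerically over all admissible $x$ for a given $n$ and $p$. In the sketch below the helper names `pmf`, `upper_tail` and `klar_upper` are hypothetical, and a small relative tolerance is used in the comparison because the bound is attained with equality at $x = n$.

```python
from math import ceil, comb


def pmf(n: int, p: float, x: int) -> float:
    """Binomial probability mass f_n(p, x)."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def upper_tail(n: int, p: float, x: int) -> float:
    """Upper tail F_n(p, x) = sum of f_n(p, k) for k = x, ..., n."""
    return sum(pmf(n, p, k) for k in range(x, n + 1))


def klar_upper(n: int, p: float, x: int) -> float:
    """Right-hand side of (A16), valid for p <= x / n <= 1."""
    r = (x + 1) / (n + 1)
    return r * (1.0 - p) / (r - p) * pmf(n, p, x)


n, p = 30, 0.2
checks = [
    (pmf(n, p, x), upper_tail(n, p, x), klar_upper(n, p, x))
    for x in range(ceil(n * p), n + 1)
]
print(all(lo <= mid <= hi * (1.0 + 1e-9) for lo, mid, hi in checks))
```

The upper bound follows from comparing the tail with a geometric series whose ratio is the ratio of consecutive binomial masses at $x$.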

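Without loss of generality we assume that $p \le 1-p$. We define the set $I(l, u) = \{l < \hat{p} < u;\ \hat{p} \notin \{0,1\}\}$. We have

\[ \begin{aligned} E\left[\frac{1}{\hat{p}}\,1_{I(0,x)}\right] &\le \frac{2n}{(n+1)p}\sum_{k=1}^{\lfloor nx \rfloor}\binom{n+1}{k+1} p^{k+1}(1-p)^{n-k} \\ &\le \frac{2n}{(n+1)p}\,F_{n+1}\left(1-p,\ n+1-(\lfloor nx \rfloor + 1)\right), \end{aligned} \]

\[ P(I(0,x)) = F_n(1-p,\ n - \lfloor nx \rfloor) - f_n(1-p, n) \ge f_n(1-p,\ n - \lfloor nx \rfloor)\,1_{\{\lfloor nx \rfloor > 0\}}. \]

Combining the last equations with Eqn. (A16), we get, provided $\frac{1}{n} \le x \le p - \frac{1}{n}$,

\[ \begin{aligned} E\left[\frac{1}{\hat{p}}\,\Big|\,I(0,x)\right] &\le \frac{2n}{p(n+1)}\,\frac{n - \lfloor nx \rfloor + 1}{n+2}\,\frac{p}{\frac{n - \lfloor nx \rfloor + 1}{n+2} - (1-p)}\,\frac{f_{n+1}(1-p,\ n - \lfloor nx \rfloor)}{f_n(1-p,\ n - \lfloor nx \rfloor)} \\ &\le 2\,\frac{n + 1 - \lfloor nx \rfloor}{n+2}\,\frac{1}{p - \frac{\lfloor nx \rfloor + 1}{n+2}}\,\frac{n+1}{\lfloor nx \rfloor + 1}\,p \\ &\le 2\,\frac{1}{x}\,\frac{p}{p - x - \frac{1}{n}}. \end{aligned} \]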

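Similarly, we have

\[ \begin{aligned} E\left[\frac{1}{\hat{p}^2}\,1_{I(0,x)}\right] &\le \frac{6n^2}{p^2(n+1)(n+2)}\sum_{k=1}^{\lfloor nx \rfloor}\binom{n+2}{k+2} p^{k+2}(1-p)^{n-k} \\ &\le \frac{6n^2}{p^2(n+1)(n+2)}\,F_{n+2}(1-p,\ n - \lfloor nx \rfloor), \end{aligned} \]

so that, provided $\frac{1}{n} \le x \le p - \frac{2}{n}$,

\[ \begin{aligned} E\left[\frac{1}{\hat{p}^2}\,\Big|\,I(0,x)\right] &\le \frac{6n^2}{p^2(n+1)(n+2)}\,\frac{n - \lfloor nx \rfloor + 1}{n+3}\,\frac{p}{\frac{n - \lfloor nx \rfloor + 1}{n+3} - (1-p)}\,\frac{f_{n+2}(1-p,\ n - \lfloor nx \rfloor)}{f_n(1-p,\ n - \lfloor nx \rfloor)} \\ &\le \frac{6n^2}{p^2(n+1)(n+2)}\,\frac{n + 1 - \lfloor nx \rfloor}{n+3}\,\frac{p}{p - \frac{\lfloor nx \rfloor + 2}{n+3}}\,\frac{(n+1)(n+2)}{(\lfloor nx \rfloor + 1)(\lfloor nx \rfloor + 2)}\,p^2 \\ &\le 6\,\frac{1}{x^2}\,\frac{p}{p - x - \frac{2}{n}}. \end{aligned} \]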



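If we denote by $E_q$ the expectation for $n\hat{p} \sim \mathrm{Bin}(n, q)$ (so that all the previously written expectations are denoted $E_p$), we have that

\[ E_p\left[\frac{1}{1-\hat{p}}\,\Big|\,I(x,1)\right] = E_{1-p}\left[\frac{1}{\hat{p}}\,\Big|\,I(0, 1-x)\right] \le 2\,\frac{1}{1-x}\,\frac{1-p}{x - p - \frac{1}{n}}, \]

\[ E_p\left[\frac{1}{(1-\hat{p})^2}\,\Big|\,I(x,1)\right] = E_{1-p}\left[\frac{1}{\hat{p}^2}\,\Big|\,I(0, 1-x)\right] \le 6\,\frac{1}{(1-x)^2}\,\frac{1-p}{x - p - \frac{2}{n}}. \]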

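We obviously have

\[ \begin{aligned} E[\hat{p}^{k} \mid I(0,x)] &\le x^{k}, & k &= 1, 2, \\ E[\hat{p}^{-k} \mid I(x,1)] &\le x^{-k}, & k &= 1, 2, \\ E[(1-\hat{p})^{-k} \mid I(0,x)] &\le (1-x)^{-k}, & k &= 1, 2, \end{aligned} \]

and from the above we have

\[ \begin{aligned} E[\hat{p}^{-1} \mid I(0, \epsilon p)] &\le \frac{2}{p\epsilon(1-\epsilon)}\,\frac{1}{1 - \frac{1}{np(1-\epsilon)}}, \\ E[\hat{p}^{-2} \mid I(0, \epsilon p)] &\le \frac{6}{p^2\epsilon^2(1-\epsilon)}\,\frac{1}{1 - \frac{2}{np(1-\epsilon)}}, \\ E[(1-\hat{p})^{-1} \mid I(p + (1-\epsilon)p, 1)] &\le \frac{2}{p\epsilon(1-\epsilon)}\,\frac{1-p}{p}\,\frac{1}{1 - \frac{1}{np(1-\epsilon)}}, \\ E[(1-\hat{p})^{-2} \mid I(p + (1-\epsilon)p, 1)] &\le \frac{6}{p^2\epsilon^2(1-\epsilon)}\,\frac{1-p}{p}\,\frac{1}{1 - \frac{2}{np(1-\epsilon)}}, \end{aligned} \]

where to obtain the last two inequalities we used the fact that $1 - (p + (1-\epsilon)p) \ge \epsilon p$, which is a consequence of $p \le \frac{1}{2}$.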

T 1.4.1 Actual bounds

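For $\kappa \in \{0,1\}$, we have

\[ E_\kappa[L \mid \hat{p}] \le \frac{1}{p^2} + E\left[\frac{1}{\hat{p}^2}\right] + \frac{1}{(1-p)^2} + E\left[\frac{1}{(1-\hat{p})^2}\right]. \tag{A17} \]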



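With inequality (A17), we get, still assuming $p < 1-p$,

\[ \begin{aligned} E\left[\frac{1}{\hat{p}^2}\,\Big|\,F\setminus A\right] &\le E\left[\frac{1}{\hat{p}^2}\,\Big|\,I(0, \epsilon p)\right] + E\left[\frac{1}{\hat{p}^2}\,\Big|\,I((2-\epsilon)p, 1)\right] \\ &\le \frac{6}{p^2\epsilon^2(1-\epsilon)}\,\frac{1}{1 - \frac{2}{np(1-\epsilon)}} + \frac{1}{(2-\epsilon)^2 p^2}, \end{aligned} \]

\[ \begin{aligned} E\left[\frac{1}{(1-\hat{p})^2}\,\Big|\,F\setminus A\right] &\le E\left[\frac{1}{(1-\hat{p})^2}\,\Big|\,I(0, \epsilon p)\right] + E\left[\frac{1}{(1-\hat{p})^2}\,\Big|\,I((2-\epsilon)p, 1)\right] \\ &\le \frac{1}{(1-\epsilon p)^2} + \frac{6}{p^2\epsilon^2(1-\epsilon)}\,\frac{1-p}{p}\,\frac{1}{1 - \frac{2}{np(1-\epsilon)}}, \end{aligned} \]

so that

\[ E_\kappa[L \mid F\setminus A] \le \frac{2}{(1-p)^2} + \frac{1}{p^2}\left(2 + \frac{6}{p\epsilon^2(1-\epsilon)}\,\frac{1}{1 - \frac{2}{np(1-\epsilon)}}\right). \]

Finally, removing the assumption $p < 1-p$ and with $a \le \min(p, 1-p)$, we get

\[ E_\kappa[L \mid F\setminus A] \le 8 + \frac{1}{a^2}\left(2 + \frac{6}{a\epsilon^2(1-\epsilon)}\,\frac{1}{1 - \frac{2}{an(1-\epsilon)}}\right). \tag{A18} \]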
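To get a feel for the size of the constant in (A18), the right-hand side can be evaluated for concrete values of $a$, $\epsilon$ and $n$. The sketch below is illustrative; the function name `a18_bound` is hypothetical, and the explicit check that $n > \frac{2}{a(1-\epsilon)}$ reflects the requirement that the last denominator be positive.

```python
def a18_bound(a: float, eps: float, n: int) -> float:
    """Right-hand side of (A18) for a <= min(p, 1 - p), 0 < eps < 1 and pool size n."""
    slack = 1.0 - 2.0 / (a * n * (1.0 - eps))
    if slack <= 0.0:
        raise ValueError("the bound requires n > 2 / (a * (1 - eps))")
    return 8.0 + (2.0 + 6.0 / (a * eps ** 2 * (1.0 - eps) * slack)) / a ** 2


# The bound decreases towards a finite limit as the pool size grows.
for n in (100, 1000, 100000):
    print(n, a18_bound(a=0.05, eps=0.5, n=n))
```

The constant is large for small $a$, but it does not grow with $n$, which is all the argument in Section T8.1.2 needs.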



Bibliography

1. Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson JV, Stephan DA, Nelson SF, and Craig DW. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet., 4(8):e1000167, 2008.

2. E. L. Lehmann. Testing Statistical Hypotheses. Springer Texts in Statistics, New York, NY, 2005.

3. R. Durrett. Probability: Theory and Examples. Duxbury Press, 2003.

4. B. Klar. Bounds on tail probabilities of discrete distributions. Probability in Engineering and the Informational Sciences, 14:161–171, 2000.
