a survey design for a sensitive binary variable correlated ... · title a survey design for a...
TRANSCRIPT
Title A survey design for a sensitive binary variable correlated withanother nonsensitive binary variable
Author(s) Yu JW Lu Y Tian G
Citation Journal of Probability and Statistics 2013 v 2013 article no827048
Issued Date 2013
URL httphdlhandlenet10722191507
Rights Creative Commons Attribution 30 Hong Kong License
Hindawi Publishing CorporationJournal of Probability and StatisticsVolume 2013 Article ID 827048 11 pageshttpdxdoiorg1011552013827048
Research ArticleA Survey Design for a Sensitive Binary Variable Correlated withAnother Nonsensitive Binary Variable
Jun-Wu Yu1 Yang Lu2 and Guo-Liang Tian3
1 School of Mathematics and Computational Science Hunan University of Science and Technology Xiangtan Hunan 411201 China2 School of Mathematics and Statistics Central China Normal University Wuhan Hubei 430079 China3Department of Statistics and Actuarial Science The University of Hong Kong Pokfulam Road Hong Kong
Correspondence should be addressed to Guo-Liang Tian gltianhkuhk
Received 30 August 2012 Accepted 19 May 2013
Academic Editor Zhidong Bai
Copyright copy 2013 Jun-Wu Yu et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited
Tian et al (2007) introduced a so-called hidden sensitivity model for evaluating the association of two sensitive questions withbinary outcomes However in practice we sometimes need to assess the association between one sensitive binary variable (egwhether or not a drug user the number of sex partner being ⩽1 or gt1 and so on) and one nonsensitive binary variable (eg good orpoor health status with or without cervical cancer and so on) To address this issue by sufficiently utilizing the information con-tained in the non-sensitive binary variable in this paper we propose a new survey scheme called combination questionnairedesignmodel which consists of a main questionnaire and a supplemental questionnaire The introduction of the supplementalquestionnaire which is indeed a design of direct questioning can effectively reduce the noncompliance behavior since morerespondents will not be faced with the sensitive question Likelihood-based inferences including maximum likelihood estimatesvia the expectation-maximization algorithm asymptotic confidence intervals and bootstrap confidence intervals of parameters ofinterest are derived A likelihood ratio test is provided to test the association between the two binary random variables Bayesianinferences are also discussed Simulation studies are performed and a cervical cancer data set in Atlanta is used to illustrate theproposed methods
1 Introduction
Warner [1] introduced a randomized response technique toobtain truthful answers to questions with sensitive attributesUsing the Warner design Kraemer [2] derived a bivariatecorrelation between an attribute with polytomous responsesand an attribute with normally distributed responses Fox andTracy [3] derived estimation of the Pearson product momentcorrelation coefficient between two sensitive questions byassuming that randomized response observations can betreated as individual-level scores that are contaminated byrandom measurement error Edgell et al [4] consideredthe correlation between two sensitive questions using theunrelated question design or the additive constants designChristofides [5] presented a randomized response techniquewith two randomization devices to estimate the proportionof individuals having two sensitive characteristics at thesame time Kim and Warde [6] considered a multinomial
randomized response model which can handle untruthfulresponses They also derived the Pearson product momentcorrelation estimator which may be used to quantify thelinear relationship between two variables when multinomialresponse data are observed according to a randomizedresponse procedure However all these randomized responseprocedures make use of one or two randomizing deviceswhich (i) entail extra costs in both efficiency and complexity(ii) increase the cognitive load of randomized responsetechniques and (iii) allow for new sources of error such asmisunderstanding the randomized response procedures orcheating on the procedures [7]
From the perspective of incomplete categorical datadesign Tian et al [8] proposed a nonrandomized responsemodel (called the hidden sensitivity model) for assessing theassociation of two sensitive questions with binary outcomesTo protect respondentsrsquo privacy and to avoid the use of anyrandomization device they utilized a non-sensitive question
2 Journal of Probability and Statistics
in the questionnaire to indirectly obtain respondentsrsquo answersto the two sensitive questions In the hidden sensitivitymodel they implicitly assumed that all respondents arewilling to follow the design instructions In other words thenoncompliance behavior will not occur However in practicewe sometimes need to assess the association between onesensitive binary variable (eg whether or not a drug user thenumber of sex partner being ⩽1 or gt1 and so on) and onenonsensitive binary variable (eg good or poor health statuswith or without cervical cancer and so on) To our know-ledge the survey design for addressing this issue and cor-responding statistical analysis methods are not availableAlthough we could directly adopt the hidden sensitivitymodel the information contained in the nonsensitive binaryvariable cannot be utilized in the design Intuitively suchinformation can be used to enhance the degree of privacy pro-tection so that more respondents will not be faced with thesensitive questionThemajor objective of this paper is to pro-pose a new survey design and to develop corresponding sta-tistical methods for analyzing sensitive data collected by thistechnique
The rest of the paper is organized as follows In Section 2without using any randomizing device we propose a surveyscheme called combination questionnaire designmodelwhich consists of a main questionnaire and a supplementalquestionnaire Likelihood-based inferences including maxi-mum likelihood estimates via an expectationndashmaximization(EM) algorithm asymptotic confidence intervals and boot-strap confidence intervals of parameters of interest arederived in Section 3 A likelihood ratio test is also provided totest the association between the two binary random variablesIn Section 4we discuss Bayesian inferenceswhen prior infor-mation on parameters is available In Section 5 two simu-lation studies are performed to compare the efficiency ofthe proposed combination questionnaire model with that ofthe existing hidden sensitivity model of Tianet al [8] (iethe main questionnaire only) A cervical cancer data set inAtlanta is used in Section 6 to illustrate the proposed meth-ods A discussion and an appendix on the mode of a groupDirichlet density and a sampling method from it are alsopresented
2 The Survey Design
Assume that 119883 is a sensitive binary random variable 119884 isa non-sensitive binary random variable and they are corre-lated Let 119883 = 1 denote the sensitive class (eg 119883 = 1 if arespondent is a drug user) and 119883 = 0 denote the non-sen-sitive class (eg 119883 = 0 if a respondent is not a drug user)Furthermore let both 119884 = 1 (eg 119884 = 1 if a respondentreceives at least some college training) and 119884 = 0 (eg119884 = 0 if a respondent graduates at most from some highschool) be non-sensitive classes Define 120579 = (120579
1 120579
4)⊤
where 1205791= Pr(119883 = 0 119884 = 0) 120579
2= Pr(119883 = 1 119884 = 0)
1205793= Pr(119883 = 1 119884 = 1) and 120579
4= Pr(119883 = 0 119884 = 1) then
120579 isin T4 where
T119899= (119909
1 119909
119899)⊤
119909119894⩾ 0 119894 = 1 119899
119899
sum
119894=1
119909119894= 1 (1)
Table 1 The main questionnaire for the combination questionnairemodel
Category 119882 = 1 119882 = 2 119882 = 3
I 119883 = 0 119884 = 0 Block 1 mdash Block 2 mdash Block 3 mdashII 119883 = 1 119884 = 0 Category II Please put a tick in Block 2III 119883 = 1 119884 = 1 Category III Please put a tick in Block 3IV 119883 = 0 119884 = 1 Block 4 mdashNote Only 119883 = 1 is a sensitive class while 119883 = 0 119884 = 0 and 119884 = 1are nonsensitive classes
Table 2 The supplemental questionnaire for the combinationquestionnaire model
Category
119884 = 0Please put a tick in Block 5 mdash if you belong to
119884 = 0
119884 = 1Please put a tick in Block 6 mdash if you belong to
119884 = 1
Note Both the 119884 = 0 and 119884 = 1 are nonsensitive classes
denotes the 119899-dimensional closed simplex in R119899 The objec-tive is to make inferences on 120579 120579
119909= Pr (119883 = 1) = 120579
2+ 1205793
120579119910= Pr(119884 = 1) = 120579
4+ 1205793and the odds ratio 120595 = 120579
11205793(12057921205794)
The survey scheme consists of a main questionnaire anda supplemental questionnaire To design the main question-naire which is to be assigned to group 1 with 119899 respondents(119899 is specified by the investigators) we first introduce a non-sensitive question (say 119876
119882) with three possible answers
Assume that 119882 is a non-sensitive variate with trichotomousoutcomes associated with the 119876
119882and 119882 is independent of
both 119883 and 119884 Define 119901119894= Pr(119882 = 119894) for 119894 = 1 2 3 For
example let119882 = 1 (2 3) denote that a respondent was bornin JanuaryndashApril (MayndashAugust SeptemberndashDecember) andthus we could assume that 119901
119894asymp 13 The main questionnaire
is shown in Table 1 under which each respondent is asked toanswer the non-sensitive question 119876
119882
On the one hand since Category I (ie 119883 = 0 119884 = 0)and Category IV (ie 119883 = 0 119884 = 1) are non-sensitive toeach respondent it is reasonable to assume that a respondentis willing to provide hisher truthful answer by putting a tickin Block 119894 (119894 = 1 4) according to hisher true statusOn the other hand Category II (ie 119883 = 1 119884 = 0) andCategory III (ie 119883 = 1 119884 = 1) are usually sensitive torespondents In this case if a respondent belongs to CategoryII (III) heshe is designed to put a tick in Block 2 (3)
The supplemental questionnaire is designed as shown inTable 2 under which 119898 respondents (119898 is also specified bythe investigators) in group 2 are asked to put a tick in Block5 or Block 6 depending on their true status that is 119884 = 0
or 119884 = 1 Since both the 119884 = 0 and 119884 = 1 are non-sensitive classes the supplemental questionnaire is in fact adesign of direct questioningTherefore we call this design thecombination questionnaire model
Table 3 shows the cell probabilities 1205791198944
119894=1 the observed
frequencies 1198991198944
119894=1and the unobservable frequencies 119885
1198943
119894=1
for the main questionnaire The observed frequency 1198992is the
sumof the frequency of respondents belonging to Block 2 and
Journal of Probability and Statistics 3
Table 3 Cell probabilities observed and unobservable counts forthe main questionnaire
Category 119882 = 1 119882 = 2 119882 = 3 TotalI 119883 = 0 119884 = 0 119901
11205791
11990121205791
11990131205791
1205791(1198851)
II 119883 = 1 119884 = 0 1205792(1198852)
III 119883 = 1 119884 = 1 1205793(1198853)
IV 119883 = 0 119884 = 1 1205794(1198994) 120579
4(1198994)
Total 11990111205791(1198991) 11990121205791+ 1205792(1198992) 11990131205791+ 1205793(1198993) 1(119899)
Note 119899 = sum4119894=11198991198941198851 = 119899minus(1198852 +1198853)minus1198994 where (1198852 1198853) are unobservable
Table 4Cell probabilities andobserved counts for the supplementalquestionnaire
Category Total119884 = 0 Block 5 mdash 120579
1+ 1205792(1198980)
119884 = 1 Block 6 mdash 1205793+ 1205794(1198981)
Total 1 (119898)
Note119898 = 1198980 + 1198981
the frequency of those belonging toCategory IIThe observedfrequency 119899
3is the sum of the frequency of respondents
belonging to Block 3 and the frequency of those belongingto Category III Note that 119899 = sum
4
119894=1119899119894= sum3
119894=1119885119894+ 1198994 we have
1198851= 119899minus(119885
2+1198853)minus1198994Thus only119885
2and119885
3are unobservable
Table 4 shows the cell probabilities and observed counts forthe supplemental questionnaire
Remark 1 The design of the main questionnaire is similar tothat of the hidden sensitivitymodel of Tian et al [8] while thedesign of the supplemental questionnaire is indeed a designof direct questioning since both the 119884 = 0 and 119884 = 1 arenon-sensitive classes Table 1 shows that Categories II and IIIare two sensitive subclassesTherefore putting a tick in Block2 or Block 3 implies that the respondent could be suspectedwith the sensitive attribute Let 120579
1= sdot sdot sdot = 120579
4asymp 025
(see Table 3) then around half of the say 2119899 respondentswill be suspected with the sensitive attribute if only themain questionnaire is employed However besides the mainquestionnaire (with 119899 respondents) if the supplemental ques-tionnaire (see Table 2) with 119898 = 119899 respondents is also usedthen only half of the 119899 respondents will be suspected withthe sensitive attribute In other words in the proposed com-bination questionnaire model the information of the non-sensitive binary variable 119884 can be used to enhance the degreeof privacy protection so that more respondents will not befacedwith the sensitive questionThis is whywe introduce thesupplemental questionnaire besides the main questionnaire
Remark 2 In practice to simplify the design itself we suggestthat both the sample size 119899 in the main questionnaire andthe sample size 119898 in the supplemental questionnaire shouldbe fixed in advance rather than the total sample size 119873 =
119899 + 119898 being fixed In this way survey data can be collectedin two independent groups resulting in a relatively simplerstatistical analysis In addition the interviewees are randomlyassigned to either the group 1 or group 2
3 Likelihood-Based Inferences
In this section maximum likelihood estimates (MLEs) of the120579 and the odds ratio 120595 are derived by using the EM algo-rithm In addition asymptotic confidence intervals and thebootstrap confidence intervals of an arbitrary function of 120579are also provided Finally a likelihood ratio test is presentedfor testing the association between the two binary randomvariables
31 MLEs via the EM Algorithm A total of 119873 = 119899 + 119898
respondents are classified into two groups by a randomizationapproach such that 119899 respondents answer the questionsin the main questionnaire and 119898 respondents answer thequestions in the supplemental questionnaire Let 119884obsM =
119899 1198991 119899
4 denote the observed counts collected in the
main questionnaire (see Table 3) where sum4
119894=1119899119894= 119899 The
likelihood function of 120579 based on 119884obs119872 is
119871 (120579 | 119884obs119872) = (119899
1198991 1198992 1198993 1198994
)
times (11990111205791)1198991
3
prod
119894=2
(1199011198941205791+ 120579119894)119899119894
1205791198994
4
(2)
Let 119884obs119878 = 1198981198980 1198981 denote the observed counts gathered
in the supplemental questionnaire (see Table 4) where 1198980+
1198981= 119898 The likelihood function of 120579 based on 119884obs119878 is
119871 (120579 | 119884obs119878) = (119898
1198980
) (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(3)
Let 119884obs = 119884obs119872 119884obs119878 Since 119884obs119872 and 119884obs119878 are inde-pendent the observed-data likelihood function of 120579 isin T
4is
119871CQ (120579 | 119884obs119878) prop 1205791198991
1
3
prod
119894=2
(1199011198941205791+ 120579119894)119899119894
1205791198994
4
times (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(4)
where the subscript ldquoCQrdquodenotes the ldquocombination question-nairerdquo model
Since the corresponding cell probabilities to the observedcounts 119899
2and 1198993in the group 1 are in the form of summation
(ie 1199011198941205791+ 120579119894 119894 = 2 3) we cannot obtain the explicit
expressions of the MLEs of 120579 from the score equations of(4) By treating the observed counts 119899
2and 119899
3as incomplete
data we use the EM algorithm [9] to find the MLE of 120579The counts 119885
1198943
119894=1in Table 3 can be viewed as missing data
Briefly119885111988521198853 and 119899
4represent the counts of the respond-
ents belonging to Categories I II III and IV respectivelyThus we denote the latent data by 119884mis = 119911
2 1199113 and the
complete data by 119884com = 119884obs 119884mis Note that all 119901i areknown Consequently the complete-data likelihood functionfor 120579 is
119871CQ (120579 | 119884COM) prop (
3
prod
119894=1
120579119911119894
119894)1205791198994
4times (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(5)
where 1199111= 119899 minus (119911
2+ 1199113) minus 1198994
4 Journal of Probability and Statistics
By treating 1205791198944
119894=1as random variables we note that
the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=1199112
119873(1 +
1198980
119899 minus 1199113minus 1198994
)
1205793=1199113
119873(1 +
1198981
1199113+ 1198994
)
1205794=1198994
119873(1 +
1198981
1199113+ 1198994
)
(6)
Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579
119894(1199011198941205791+ 120579119894) that is
119885119894| (119884obs 120579) sim Binomial(119899
119894
120579119894
1199011198941205791+ 120579119894
) 119894 = 2 3 (7)
Therefore the E-step of the EM algorithm computes thefollowing conditional expectations
119864 (119885119894| 119884obs 120579) =
119899119894120579119894
1199011198941205791+ 120579i
119894 = 2 3 (8)
and the M-step updates (6) by replacing 1199112and 119911
3with
previous conditional expectations
Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)
32 Asymptotic Confidence Intervals Let 120579minus4
= (1205791 1205792 1205793)⊤
The asymptotic variance-covariance matrix of theMLE minus4
isthen given by Iminus1obs(minus4) where
Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)
120597120579minus4120597120579⊤
minus4
(9)
denotes the observed information matrix and ℓCQ(120579 |
119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have
ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901
21205791+ 1205792)
+ 1198993log (119901
31205791+ 1205793)
+ 1198994log (1 minus 120579
1minus 1205792minus 1205793)
+ 1198980log (120579
1+ 1205792) + 1198981log (1 minus 120579
1minus 1205792)
(10)
It is easy to show that
120597ℓCQ (120579 | 119884obs)
1205971205791
=1198991
1205791
minus1198994
1205794
+
3
sum
119894=2
119899i119901i119901i1205791 + 120579i
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205792
= minus1198994
1205794
+1198992
11990121205791+ 1205792
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205793
= minus1198994
1205794
+1198993
11990131205791+ 1205793
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
1
=1198991
1205792
1
+1198994
1205792
4
+
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
2
=1198994
1205792
4
+1198992
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
3
=1198994
1205792
4
+1198993
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205792
=1198994
1205792
4
+11989921199012
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205793
=1198994
1205792
4
+11989931199013
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057921205971205793
=1198994
1205792
4
(11)
where
120601 =1198980
(1205791+ 1205792)2+
1198981
(1 minus 1205791minus 1205792)2 (12)
Hence the observed information matrix can be expressed as
Iobs (120579minus4) = diag(1198991
1205792
1
0 0) +1198994
1205792
4
times 131⊤3+ A (13)
Journal of Probability and Statistics 5
whereA =
((((
(
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+120601
11989921199012
(11990121205791+ 1205792)2+120601
11989931199013
(11990131205791+1205793)2
11989921199012
(11990121205791+ 1205792)2+120601
1198992
(11990121205791+ 1205792)2+120601 0
11989931199013
(11990131205791+ 1205793)2 0
1198993
(11990131205791+1205793)2
))))
)
(14)
Let se(120579119894) denote the standard error of 120579
119894for 119894 = 1 2 3
Note that se(120579119894) can be estimated by the square root of the 119894th
diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579
119894) by se(120579
119894)Thus a 95normal-based asymptotic con-
fidence interval for 120579i can be constructed as
[120579119894minus 196 times se (120579
119894) 120579119894+ 196 times se (120579
119894)] 119894 = 1 2 3 (15)
Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of
120579minus4 For example 120579
4= 1 minus sum
3
119894=1120579i and the odds ratio 120595 =
12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used
to approximate the standard error of 120599 = ℎ(minus4) and a 95
normal-based asymptotic confidence interval for 120599 is given by
[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)
where
se (120599) = (120597120599
120597120579minus4
)
⊤
Iminus1obs (120579minus4) (120597120599
120597120579minus4
)
10038161003816100381610038161003816100381610038161003816120579minus4= minus4
12
(17)
33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579
minus4) Based on the obtained MLE we inde-
pendently generate
(119899lowast
1 119899
lowast
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(18)
119898lowast
0sim Binomial (119898 120579
1+ 1205792) (19)
Having obtained 119884lowast
obs119872 = 119899 119899lowast
1 119899
lowast
4 and 119884
lowast
obs119878 =
119898119898lowast
0 119898lowast
1 where 119898
lowast
1= 119898 minus 119898
lowast
0 we can calculate the
bootstrap replication 120599lowast= ℎ(120579
lowast
minus4) based on 119884
lowast
obs = 119884lowast
obs119872
119884lowast
obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast
119892119866
119892=1 Consequently a (1 minus 120572)100 bootstrap
confidence interval for 120599 is given by
[120599119871 120599119880] (20)
where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles
of 120599lowast119892119866
119892=1 respectively
34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]
1198670 120595 = 1 against 119867
1 120595 = 1 (21)
The likelihood ratio statistic is defined by
Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)
where 119877denotes the restrictedMLE of 120579 under119867
0 denotes
the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)
To find the restricted MLE 119877 we also employ the 119864119872
algorithm Under1198670 12057911205793= 12057921205794 we have
1205791= (1 minus 120579
119909) (1 minus 120579
119910)
1205792= 120579119909(1 minus 120579
119910)
1205793= 120579119909120579119910
1205794= (1 minus 120579
119909) 120579119910
(23)
In otherwords under1198670we only have two free parameters 120579
119909
and 120579119910 Having obtained the restrictedMLEs 120579
119909119877and 120579119910119877
wecan compute the restricted MLE
119877= (1205791119877
1205794119877
)⊤ from
(23) by
1205791119877
= (1 minus 120579119909119877
) (1 minus 120579119910119877
)
1205792119877
= 120579119909119877
(1 minus 120579119910119877
)
1205793119877
= 120579119909119877
120579119910119877
1205794119877
= (1 minus 120579119909119877
) 120579119910119877
(24)
In what follows we consider the computation of therestricted MLEs 120579
119909119877and 120579
119910119877 Now the complete-data like-
lihood function (5) becomes
119871CQ (120579119909 120579119910| 119884com 1198670)
prop (1 minus 120579119909) (1 minus 120579
119910)1199111
120579119909(1 minus 120579
119910)1199112
times (120579119909120579119910)1199113
(1 minus 120579119909) 1205791199101198994
(1 minus 120579119910)1198980
1205791198981
119910
= 1205791199112+1199113
119909(1 minus 120579
119909)1199111+1198994
1205791199113+1198994+1198981
119910(1 minus 120579
119910)1199111+1199112+1198980
(25)
so that the restricted MLEs of 120579119909and 120579
119910based on the
complete-data are given by
120579119909=1199112+ 1199113
119899 120579
119910=1199113+ 1198994+ 1198981
119873 (26)
respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579
119894 are defined in (23) Finally under119867
0
Λ asymptotically follows chi-squared distribution with onedegree of freedom
6 Journal of Probability and Statistics
4 Bayesian Inferences
To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911
2 1199113 are the same as
those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886
1 119886
4)
is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is
119891 (120579 | 119884obs 119884mis) = GD422
(120579minus4
| a b) (27)
where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898
0 1198981)⊤
and 1199111= 119899 minus (119911
2+ 1199113) minus 1198994 The conditional predictive dis-
tribution is
119891 (119884mis | 119884obs 120579) =3
prod
119894=2
Binomial(119911119894
10038161003816100381610038161003816100381610038161003816
119899119894
120579119894
1199011198941205791+ 120579119894
) (28)
Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=
1198862minus 1 + 119911
2
119886+minus 4 + 119873
(1 +1198980
1198861+ 1198862minus 2 + 119899 minus 119911
3minus 1198994
)
1205793=
1198863minus 1 + 119911
3
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
1205794=
1198864minus 1 + 119899
4
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
(29)
where 119886+= sum4
119894=1119886119894 and the 119864-step is to replace 119911
1198943
119894=2by the
conditional expectations given by (8)In addition based on (27) and (28) the data augmen-
tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix
5 Simulation Studies
In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901
119894=Pr(119882 =
119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model
In the first simulated example let a total of119873 = 119899 + 119898 =
50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579
1198944
119894=1are
listed in the second column of Table 5 We first generate
(1198991 119899
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(30)
and1198980sim Binomial (119898 120579
1+1205792) so that119898
1= 119898minus119898
0The119864119872
algorithm (6) and (8) is used to calculate the MLEs of 1205791198944
119894=1
We repeated this experiment 1000 times The average MLEsof 1205791198944
119894=1are reported in the third column of Table 5 The
corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5
From Table 5 we can see that both the MLEs of 1205791198944
119894=1
in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579
1198944
119894=1in the combination questionnaire model
are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency
In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =
119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6
From Table 6 we can see that the MSEs of 1205791198944
119894=1in the
combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944
119894=1for the proposed combination questionnaire model
in the second simulated example are significantly improvedwhen compared with those in the first simulated example
6 Analyzing Cervical Cancer Data in Atlanta
Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer
For the purpose of illustration we presume that 119883 = 0
is non-sensitive although the number 0ndash3 of sex partners
Journal of Probability and Statistics 7
Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01013 00013 00113 00113 01010 00010 00064 000641205792
050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793
020 02008 00008 00037 00037 02001 00001 00029 000291205794
020 01994 minus00006 00026 00026 02014 00014 00016 000161205791
020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792
030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793
040 04038 00038 00040 00040 04025 00025 00037 000371205794
010 01006 00006 00017 00017 00991 minus00009 00009 000091205791
030 03021 00021 00107 00107 03026 00026 00081 000811205792
025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793
015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794
030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01003 00003 00030 00030 01013 00013 00064 000641205792
050 05015 00015 00029 00029 05019 00019 00041 000411205793
020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794
020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791
020 02007 00007 00041 00041 02049 00049 00057 000571205792
030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793
040 04007 00007 00020 00020 03992 minus00008 00037 000371205794
010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791
030 03032 00032 00054 00054 02965 minus00036 00079 000791205792
025 02488 minus00012 00039 00039 02501 00001 00038 000381205793
015 01481 minus00019 00021 00021 01531 00031 00032 000321205794
030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 7 Cervical cancer data fromWilliamson and Haber [17]
Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)
119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903
4 1205794)
119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903
3 1205793)
Missing 43 (11990312 1205791+ 1205792) 59 (119903
34 1205793+ 1205794)
Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate
is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)
if a respondent was born in JanuaryndashApril (MayndashAugust
SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent
of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899
1= 11990313 = 55 119899
2=
55 + 1199032= 219 119899
3= 55 + 119903
3= 276 119899
4= 1199034= 103 that is
119884obs119872 = 119899 1198991 119899
4 = 653 55 219 276 103 (31)
On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898
0= 11990312
= 43 and1198981= 11990334
= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59
Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878
8 Journal of Probability and Statistics
Table 8 MLEs and 95 confidence intervals of 120579 and 120595
Parameter MLE std 95 asymptotic 95 bootstrapCI CI
1205791
023793 002937 [018036 029551] [018006 029879]1205792
024916 002261 [020484 029348] [020375 029359]1205793
035196 002247 [030790 039601] [030731 039666]1205794
016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]
61 Likelihood-Based Inferences Using 120579(0) = 144 as the
initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs
minus4is
Iminus1obs (minus4)
= (
000086302 minus0000442447 minus0000384452
minus000044245 0000511253 minus0000010026
minus000038445 minus0000010026 0000505241
)
(32)
The estimated standard errors of 120579119894(119894 = 1 2 3) are square
roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579
4= 1minus120579
1minus1205792minus1205793
and = 12057911205793(12057921205794) are given by
se (1205794) = 1⊤
3Iminus1obs(minus4) 13
12
se () = 120572⊤Iminus1obs (minus4)120572
12
(33)
respectively where
120572 = (
1205793(1 minus 120579
2minus 1205793)
12057921205792
4
12057911205793(1205792minus 1205794)
(12057921205794)2
1205791(1 minus 120579
1minus 1205792)
12057921205792
4
)
⊤
(34)
These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8
Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8
To test the null hypothesis 1198670 120595 = 1 against 119867
1 120595 = 1
we need to obtain the restricted MLE 119877 Using
(120579(0)
119909 120579(0)
119910)⊤
= (05 05)⊤ (35)
as the initial values the EM algorithm in (26) converged to
120579119909119877
= 064807 120579119910119877
= 052948 (36)
Table 9 Posterior modes and estimates of parameters for thecervical cancer data
Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval
1205791
023793 024025 002934 [018574 030050]1205792
024916 024789 002251 [020365 029192]1205793
035196 035029 002246 [030621 039432]1205794
016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]
in 19 iterations From (24) the restricted MLEs of 120579 areobtained as
119877= (016559 030493 034314 018634)
⊤ (37)
The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867
0is rejected at the 005 level of significance Thus we
can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1
62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T
4) is adopted as the prior dis-
tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1
44 as the initial values
we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status
The posterior densities of the 1205791198944
119894=1and 120595 estimated by a
kernel density smoother are plotted in Figures 1 and 2
7 Discussion
In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable
We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Hindawi Publishing CorporationJournal of Probability and StatisticsVolume 2013 Article ID 827048 11 pageshttpdxdoiorg1011552013827048
Research ArticleA Survey Design for a Sensitive Binary Variable Correlated withAnother Nonsensitive Binary Variable
Jun-Wu Yu1 Yang Lu2 and Guo-Liang Tian3
1 School of Mathematics and Computational Science Hunan University of Science and Technology Xiangtan Hunan 411201 China2 School of Mathematics and Statistics Central China Normal University Wuhan Hubei 430079 China3Department of Statistics and Actuarial Science The University of Hong Kong Pokfulam Road Hong Kong
Correspondence should be addressed to Guo-Liang Tian gltianhkuhk
Received 30 August 2012 Accepted 19 May 2013
Academic Editor Zhidong Bai
Copyright copy 2013 Jun-Wu Yu et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited
Tian et al (2007) introduced a so-called hidden sensitivity model for evaluating the association of two sensitive questions withbinary outcomes However in practice we sometimes need to assess the association between one sensitive binary variable (egwhether or not a drug user the number of sex partner being ⩽1 or gt1 and so on) and one nonsensitive binary variable (eg good orpoor health status with or without cervical cancer and so on) To address this issue by sufficiently utilizing the information con-tained in the non-sensitive binary variable in this paper we propose a new survey scheme called combination questionnairedesignmodel which consists of a main questionnaire and a supplemental questionnaire The introduction of the supplementalquestionnaire which is indeed a design of direct questioning can effectively reduce the noncompliance behavior since morerespondents will not be faced with the sensitive question Likelihood-based inferences including maximum likelihood estimatesvia the expectation-maximization algorithm asymptotic confidence intervals and bootstrap confidence intervals of parameters ofinterest are derived A likelihood ratio test is provided to test the association between the two binary random variables Bayesianinferences are also discussed Simulation studies are performed and a cervical cancer data set in Atlanta is used to illustrate theproposed methods
1 Introduction
Warner [1] introduced a randomized response technique toobtain truthful answers to questions with sensitive attributesUsing the Warner design Kraemer [2] derived a bivariatecorrelation between an attribute with polytomous responsesand an attribute with normally distributed responses Fox andTracy [3] derived estimation of the Pearson product momentcorrelation coefficient between two sensitive questions byassuming that randomized response observations can betreated as individual-level scores that are contaminated byrandom measurement error Edgell et al [4] consideredthe correlation between two sensitive questions using theunrelated question design or the additive constants designChristofides [5] presented a randomized response techniquewith two randomization devices to estimate the proportionof individuals having two sensitive characteristics at thesame time Kim and Warde [6] considered a multinomial
randomized response model which can handle untruthfulresponses They also derived the Pearson product momentcorrelation estimator which may be used to quantify thelinear relationship between two variables when multinomialresponse data are observed according to a randomizedresponse procedure However all these randomized responseprocedures make use of one or two randomizing deviceswhich (i) entail extra costs in both efficiency and complexity(ii) increase the cognitive load of randomized responsetechniques and (iii) allow for new sources of error such asmisunderstanding the randomized response procedures orcheating on the procedures [7]
From the perspective of incomplete categorical datadesign Tian et al [8] proposed a nonrandomized responsemodel (called the hidden sensitivity model) for assessing theassociation of two sensitive questions with binary outcomesTo protect respondentsrsquo privacy and to avoid the use of anyrandomization device they utilized a non-sensitive question
2 Journal of Probability and Statistics
in the questionnaire to indirectly obtain respondentsrsquo answersto the two sensitive questions In the hidden sensitivitymodel they implicitly assumed that all respondents arewilling to follow the design instructions In other words thenoncompliance behavior will not occur However in practicewe sometimes need to assess the association between onesensitive binary variable (eg whether or not a drug user thenumber of sex partner being ⩽1 or gt1 and so on) and onenonsensitive binary variable (eg good or poor health statuswith or without cervical cancer and so on) To our know-ledge the survey design for addressing this issue and cor-responding statistical analysis methods are not availableAlthough we could directly adopt the hidden sensitivitymodel the information contained in the nonsensitive binaryvariable cannot be utilized in the design Intuitively suchinformation can be used to enhance the degree of privacy pro-tection so that more respondents will not be faced with thesensitive questionThemajor objective of this paper is to pro-pose a new survey design and to develop corresponding sta-tistical methods for analyzing sensitive data collected by thistechnique
The rest of the paper is organized as follows In Section 2without using any randomizing device we propose a surveyscheme called combination questionnaire designmodelwhich consists of a main questionnaire and a supplementalquestionnaire Likelihood-based inferences including maxi-mum likelihood estimates via an expectationndashmaximization(EM) algorithm asymptotic confidence intervals and boot-strap confidence intervals of parameters of interest arederived in Section 3 A likelihood ratio test is also provided totest the association between the two binary random variablesIn Section 4we discuss Bayesian inferenceswhen prior infor-mation on parameters is available In Section 5 two simu-lation studies are performed to compare the efficiency ofthe proposed combination questionnaire model with that ofthe existing hidden sensitivity model of Tianet al [8] (iethe main questionnaire only) A cervical cancer data set inAtlanta is used in Section 6 to illustrate the proposed meth-ods A discussion and an appendix on the mode of a groupDirichlet density and a sampling method from it are alsopresented
2 The Survey Design
Assume that 119883 is a sensitive binary random variable 119884 isa non-sensitive binary random variable and they are corre-lated Let 119883 = 1 denote the sensitive class (eg 119883 = 1 if arespondent is a drug user) and 119883 = 0 denote the non-sen-sitive class (eg 119883 = 0 if a respondent is not a drug user)Furthermore let both 119884 = 1 (eg 119884 = 1 if a respondentreceives at least some college training) and 119884 = 0 (eg119884 = 0 if a respondent graduates at most from some highschool) be non-sensitive classes Define 120579 = (120579
1 120579
4)⊤
where 1205791= Pr(119883 = 0 119884 = 0) 120579
2= Pr(119883 = 1 119884 = 0)
1205793= Pr(119883 = 1 119884 = 1) and 120579
4= Pr(119883 = 0 119884 = 1) then
120579 isin T4 where
T119899= (119909
1 119909
119899)⊤
119909119894⩾ 0 119894 = 1 119899
119899
sum
119894=1
119909119894= 1 (1)
Table 1 The main questionnaire for the combination questionnairemodel
Category 119882 = 1 119882 = 2 119882 = 3
I 119883 = 0 119884 = 0 Block 1 mdash Block 2 mdash Block 3 mdashII 119883 = 1 119884 = 0 Category II Please put a tick in Block 2III 119883 = 1 119884 = 1 Category III Please put a tick in Block 3IV 119883 = 0 119884 = 1 Block 4 mdashNote Only 119883 = 1 is a sensitive class while 119883 = 0 119884 = 0 and 119884 = 1are nonsensitive classes
Table 2 The supplemental questionnaire for the combinationquestionnaire model
Category
119884 = 0Please put a tick in Block 5 mdash if you belong to
119884 = 0
119884 = 1Please put a tick in Block 6 mdash if you belong to
119884 = 1
Note Both the 119884 = 0 and 119884 = 1 are nonsensitive classes
denotes the 119899-dimensional closed simplex in R119899 The objec-tive is to make inferences on 120579 120579
119909= Pr (119883 = 1) = 120579
2+ 1205793
120579119910= Pr(119884 = 1) = 120579
4+ 1205793and the odds ratio 120595 = 120579
11205793(12057921205794)
The survey scheme consists of a main questionnaire anda supplemental questionnaire To design the main question-naire which is to be assigned to group 1 with 119899 respondents(119899 is specified by the investigators) we first introduce a non-sensitive question (say 119876
119882) with three possible answers
Assume that 119882 is a non-sensitive variate with trichotomousoutcomes associated with the 119876
119882and 119882 is independent of
both 119883 and 119884 Define 119901119894= Pr(119882 = 119894) for 119894 = 1 2 3 For
example let119882 = 1 (2 3) denote that a respondent was bornin JanuaryndashApril (MayndashAugust SeptemberndashDecember) andthus we could assume that 119901
119894asymp 13 The main questionnaire
is shown in Table 1 under which each respondent is asked toanswer the non-sensitive question 119876
119882
On the one hand since Category I (ie 119883 = 0 119884 = 0)and Category IV (ie 119883 = 0 119884 = 1) are non-sensitive toeach respondent it is reasonable to assume that a respondentis willing to provide hisher truthful answer by putting a tickin Block 119894 (119894 = 1 4) according to hisher true statusOn the other hand Category II (ie 119883 = 1 119884 = 0) andCategory III (ie 119883 = 1 119884 = 1) are usually sensitive torespondents In this case if a respondent belongs to CategoryII (III) heshe is designed to put a tick in Block 2 (3)
The supplemental questionnaire is designed as shown inTable 2 under which 119898 respondents (119898 is also specified bythe investigators) in group 2 are asked to put a tick in Block5 or Block 6 depending on their true status that is 119884 = 0
or 119884 = 1 Since both the 119884 = 0 and 119884 = 1 are non-sensitive classes the supplemental questionnaire is in fact adesign of direct questioningTherefore we call this design thecombination questionnaire model
Table 3 shows the cell probabilities 1205791198944
119894=1 the observed
frequencies 1198991198944
119894=1and the unobservable frequencies 119885
1198943
119894=1
for the main questionnaire The observed frequency 1198992is the
sumof the frequency of respondents belonging to Block 2 and
Journal of Probability and Statistics 3
Table 3 Cell probabilities observed and unobservable counts forthe main questionnaire
Category 119882 = 1 119882 = 2 119882 = 3 TotalI 119883 = 0 119884 = 0 119901
11205791
11990121205791
11990131205791
1205791(1198851)
II 119883 = 1 119884 = 0 1205792(1198852)
III 119883 = 1 119884 = 1 1205793(1198853)
IV 119883 = 0 119884 = 1 1205794(1198994) 120579
4(1198994)
Total 11990111205791(1198991) 11990121205791+ 1205792(1198992) 11990131205791+ 1205793(1198993) 1(119899)
Note 119899 = sum4119894=11198991198941198851 = 119899minus(1198852 +1198853)minus1198994 where (1198852 1198853) are unobservable
Table 4Cell probabilities andobserved counts for the supplementalquestionnaire
Category Total119884 = 0 Block 5 mdash 120579
1+ 1205792(1198980)
119884 = 1 Block 6 mdash 1205793+ 1205794(1198981)
Total 1 (119898)
Note119898 = 1198980 + 1198981
the frequency of those belonging toCategory IIThe observedfrequency 119899
3is the sum of the frequency of respondents
belonging to Block 3 and the frequency of those belongingto Category III Note that 119899 = sum
4
119894=1119899119894= sum3
119894=1119885119894+ 1198994 we have
1198851= 119899minus(119885
2+1198853)minus1198994Thus only119885
2and119885
3are unobservable
Table 4 shows the cell probabilities and observed counts forthe supplemental questionnaire
Remark 1 The design of the main questionnaire is similar tothat of the hidden sensitivitymodel of Tian et al [8] while thedesign of the supplemental questionnaire is indeed a designof direct questioning since both the 119884 = 0 and 119884 = 1 arenon-sensitive classes Table 1 shows that Categories II and IIIare two sensitive subclassesTherefore putting a tick in Block2 or Block 3 implies that the respondent could be suspectedwith the sensitive attribute Let 120579
1= sdot sdot sdot = 120579
4asymp 025
(see Table 3) then around half of the say 2119899 respondentswill be suspected with the sensitive attribute if only themain questionnaire is employed However besides the mainquestionnaire (with 119899 respondents) if the supplemental ques-tionnaire (see Table 2) with 119898 = 119899 respondents is also usedthen only half of the 119899 respondents will be suspected withthe sensitive attribute In other words in the proposed com-bination questionnaire model the information of the non-sensitive binary variable 119884 can be used to enhance the degreeof privacy protection so that more respondents will not befacedwith the sensitive questionThis is whywe introduce thesupplemental questionnaire besides the main questionnaire
Remark 2 In practice to simplify the design itself we suggestthat both the sample size 119899 in the main questionnaire andthe sample size 119898 in the supplemental questionnaire shouldbe fixed in advance rather than the total sample size 119873 =
119899 + 119898 being fixed In this way survey data can be collectedin two independent groups resulting in a relatively simplerstatistical analysis In addition the interviewees are randomlyassigned to either the group 1 or group 2
3 Likelihood-Based Inferences
In this section maximum likelihood estimates (MLEs) of the120579 and the odds ratio 120595 are derived by using the EM algo-rithm In addition asymptotic confidence intervals and thebootstrap confidence intervals of an arbitrary function of 120579are also provided Finally a likelihood ratio test is presentedfor testing the association between the two binary randomvariables
31 MLEs via the EM Algorithm A total of 119873 = 119899 + 119898
respondents are classified into two groups by a randomizationapproach such that 119899 respondents answer the questionsin the main questionnaire and 119898 respondents answer thequestions in the supplemental questionnaire Let 119884obsM =
119899 1198991 119899
4 denote the observed counts collected in the
main questionnaire (see Table 3) where sum4
119894=1119899119894= 119899 The
likelihood function of 120579 based on 119884obs119872 is
119871 (120579 | 119884obs119872) = (119899
1198991 1198992 1198993 1198994
)
times (11990111205791)1198991
3
prod
119894=2
(1199011198941205791+ 120579119894)119899119894
1205791198994
4
(2)
Let 119884obs119878 = 1198981198980 1198981 denote the observed counts gathered
in the supplemental questionnaire (see Table 4) where 1198980+
1198981= 119898 The likelihood function of 120579 based on 119884obs119878 is
119871 (120579 | 119884obs119878) = (119898
1198980
) (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(3)
Let 119884obs = 119884obs119872 119884obs119878 Since 119884obs119872 and 119884obs119878 are inde-pendent the observed-data likelihood function of 120579 isin T
4is
119871CQ (120579 | 119884obs119878) prop 1205791198991
1
3
prod
119894=2
(1199011198941205791+ 120579119894)119899119894
1205791198994
4
times (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(4)
where the subscript ldquoCQrdquodenotes the ldquocombination question-nairerdquo model
Since the corresponding cell probabilities to the observedcounts 119899
2and 1198993in the group 1 are in the form of summation
(ie 1199011198941205791+ 120579119894 119894 = 2 3) we cannot obtain the explicit
expressions of the MLEs of 120579 from the score equations of(4) By treating the observed counts 119899
2and 119899
3as incomplete
data we use the EM algorithm [9] to find the MLE of 120579The counts 119885
1198943
119894=1in Table 3 can be viewed as missing data
Briefly119885111988521198853 and 119899
4represent the counts of the respond-
ents belonging to Categories I II III and IV respectivelyThus we denote the latent data by 119884mis = 119911
2 1199113 and the
complete data by 119884com = 119884obs 119884mis Note that all 119901i areknown Consequently the complete-data likelihood functionfor 120579 is
119871CQ (120579 | 119884COM) prop (
3
prod
119894=1
120579119911119894
119894)1205791198994
4times (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(5)
where 1199111= 119899 minus (119911
2+ 1199113) minus 1198994
4 Journal of Probability and Statistics
By treating 1205791198944
119894=1as random variables we note that
the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=1199112
119873(1 +
1198980
119899 minus 1199113minus 1198994
)
1205793=1199113
119873(1 +
1198981
1199113+ 1198994
)
1205794=1198994
119873(1 +
1198981
1199113+ 1198994
)
(6)
Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579
119894(1199011198941205791+ 120579119894) that is
119885119894| (119884obs 120579) sim Binomial(119899
119894
120579119894
1199011198941205791+ 120579119894
) 119894 = 2 3 (7)
Therefore the E-step of the EM algorithm computes thefollowing conditional expectations
119864 (119885119894| 119884obs 120579) =
119899119894120579119894
1199011198941205791+ 120579i
119894 = 2 3 (8)
and the M-step updates (6) by replacing 1199112and 119911
3with
previous conditional expectations
Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)
32 Asymptotic Confidence Intervals Let 120579minus4
= (1205791 1205792 1205793)⊤
The asymptotic variance-covariance matrix of theMLE minus4
isthen given by Iminus1obs(minus4) where
Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)
120597120579minus4120597120579⊤
minus4
(9)
denotes the observed information matrix and ℓCQ(120579 |
119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have
ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901
21205791+ 1205792)
+ 1198993log (119901
31205791+ 1205793)
+ 1198994log (1 minus 120579
1minus 1205792minus 1205793)
+ 1198980log (120579
1+ 1205792) + 1198981log (1 minus 120579
1minus 1205792)
(10)
It is easy to show that
120597ℓCQ (120579 | 119884obs)
1205971205791
=1198991
1205791
minus1198994
1205794
+
3
sum
119894=2
119899i119901i119901i1205791 + 120579i
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205792
= minus1198994
1205794
+1198992
11990121205791+ 1205792
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205793
= minus1198994
1205794
+1198993
11990131205791+ 1205793
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
1
=1198991
1205792
1
+1198994
1205792
4
+
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
2
=1198994
1205792
4
+1198992
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
3
=1198994
1205792
4
+1198993
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205792
=1198994
1205792
4
+11989921199012
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205793
=1198994
1205792
4
+11989931199013
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057921205971205793
=1198994
1205792
4
(11)
where
120601 =1198980
(1205791+ 1205792)2+
1198981
(1 minus 1205791minus 1205792)2 (12)
Hence the observed information matrix can be expressed as
Iobs (120579minus4) = diag(1198991
1205792
1
0 0) +1198994
1205792
4
times 131⊤3+ A (13)
Journal of Probability and Statistics 5
whereA =
((((
(
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+120601
11989921199012
(11990121205791+ 1205792)2+120601
11989931199013
(11990131205791+1205793)2
11989921199012
(11990121205791+ 1205792)2+120601
1198992
(11990121205791+ 1205792)2+120601 0
11989931199013
(11990131205791+ 1205793)2 0
1198993
(11990131205791+1205793)2
))))
)
(14)
Let se(120579119894) denote the standard error of 120579
119894for 119894 = 1 2 3
Note that se(120579119894) can be estimated by the square root of the 119894th
diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579
119894) by se(120579
119894)Thus a 95normal-based asymptotic con-
fidence interval for 120579i can be constructed as
[120579119894minus 196 times se (120579
119894) 120579119894+ 196 times se (120579
119894)] 119894 = 1 2 3 (15)
Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of
120579minus4 For example 120579
4= 1 minus sum
3
119894=1120579i and the odds ratio 120595 =
12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used
to approximate the standard error of 120599 = ℎ(minus4) and a 95
normal-based asymptotic confidence interval for 120599 is given by
[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)
where
se (120599) = (120597120599
120597120579minus4
)
⊤
Iminus1obs (120579minus4) (120597120599
120597120579minus4
)
10038161003816100381610038161003816100381610038161003816120579minus4= minus4
12
(17)
33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579
minus4) Based on the obtained MLE we inde-
pendently generate
(119899lowast
1 119899
lowast
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(18)
119898lowast
0sim Binomial (119898 120579
1+ 1205792) (19)
Having obtained 119884lowast
obs119872 = 119899 119899lowast
1 119899
lowast
4 and 119884
lowast
obs119878 =
119898119898lowast
0 119898lowast
1 where 119898
lowast
1= 119898 minus 119898
lowast
0 we can calculate the
bootstrap replication 120599lowast= ℎ(120579
lowast
minus4) based on 119884
lowast
obs = 119884lowast
obs119872
119884lowast
obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast
119892119866
119892=1 Consequently a (1 minus 120572)100 bootstrap
confidence interval for 120599 is given by
[120599119871 120599119880] (20)
where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles
of 120599lowast119892119866
119892=1 respectively
34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]
1198670 120595 = 1 against 119867
1 120595 = 1 (21)
The likelihood ratio statistic is defined by
Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)
where 119877denotes the restrictedMLE of 120579 under119867
0 denotes
the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)
To find the restricted MLE 119877 we also employ the 119864119872
algorithm Under1198670 12057911205793= 12057921205794 we have
1205791= (1 minus 120579
119909) (1 minus 120579
119910)
1205792= 120579119909(1 minus 120579
119910)
1205793= 120579119909120579119910
1205794= (1 minus 120579
119909) 120579119910
(23)
In otherwords under1198670we only have two free parameters 120579
119909
and 120579119910 Having obtained the restrictedMLEs 120579
119909119877and 120579119910119877
wecan compute the restricted MLE
119877= (1205791119877
1205794119877
)⊤ from
(23) by
1205791119877
= (1 minus 120579119909119877
) (1 minus 120579119910119877
)
1205792119877
= 120579119909119877
(1 minus 120579119910119877
)
1205793119877
= 120579119909119877
120579119910119877
1205794119877
= (1 minus 120579119909119877
) 120579119910119877
(24)
In what follows we consider the computation of therestricted MLEs 120579
119909119877and 120579
119910119877 Now the complete-data like-
lihood function (5) becomes
119871CQ (120579119909 120579119910| 119884com 1198670)
prop (1 minus 120579119909) (1 minus 120579
119910)1199111
120579119909(1 minus 120579
119910)1199112
times (120579119909120579119910)1199113
(1 minus 120579119909) 1205791199101198994
(1 minus 120579119910)1198980
1205791198981
119910
= 1205791199112+1199113
119909(1 minus 120579
119909)1199111+1198994
1205791199113+1198994+1198981
119910(1 minus 120579
119910)1199111+1199112+1198980
(25)
so that the restricted MLEs of 120579119909and 120579
119910based on the
complete-data are given by
120579119909=1199112+ 1199113
119899 120579
119910=1199113+ 1198994+ 1198981
119873 (26)
respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579
119894 are defined in (23) Finally under119867
0
Λ asymptotically follows chi-squared distribution with onedegree of freedom
6 Journal of Probability and Statistics
4 Bayesian Inferences
To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911
2 1199113 are the same as
those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886
1 119886
4)
is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is
119891 (120579 | 119884obs 119884mis) = GD422
(120579minus4
| a b) (27)
where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898
0 1198981)⊤
and 1199111= 119899 minus (119911
2+ 1199113) minus 1198994 The conditional predictive dis-
tribution is
119891 (119884mis | 119884obs 120579) =3
prod
119894=2
Binomial(119911119894
10038161003816100381610038161003816100381610038161003816
119899119894
120579119894
1199011198941205791+ 120579119894
) (28)
Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=
1198862minus 1 + 119911
2
119886+minus 4 + 119873
(1 +1198980
1198861+ 1198862minus 2 + 119899 minus 119911
3minus 1198994
)
1205793=
1198863minus 1 + 119911
3
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
1205794=
1198864minus 1 + 119899
4
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
(29)
where 119886+= sum4
119894=1119886119894 and the 119864-step is to replace 119911
1198943
119894=2by the
conditional expectations given by (8)In addition based on (27) and (28) the data augmen-
tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix
5 Simulation Studies
In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901
119894=Pr(119882 =
119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model
In the first simulated example let a total of119873 = 119899 + 119898 =
50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579
1198944
119894=1are
listed in the second column of Table 5 We first generate
(1198991 119899
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(30)
and1198980sim Binomial (119898 120579
1+1205792) so that119898
1= 119898minus119898
0The119864119872
algorithm (6) and (8) is used to calculate the MLEs of 1205791198944
119894=1
We repeated this experiment 1000 times The average MLEsof 1205791198944
119894=1are reported in the third column of Table 5 The
corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5
From Table 5 we can see that both the MLEs of 1205791198944
119894=1
in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579
1198944
119894=1in the combination questionnaire model
are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency
In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =
119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6
From Table 6 we can see that the MSEs of 1205791198944
119894=1in the
combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944
119894=1for the proposed combination questionnaire model
in the second simulated example are significantly improvedwhen compared with those in the first simulated example
6 Analyzing Cervical Cancer Data in Atlanta
Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer
For the purpose of illustration we presume that 119883 = 0
is non-sensitive although the number 0ndash3 of sex partners
Journal of Probability and Statistics 7
Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01013 00013 00113 00113 01010 00010 00064 000641205792
050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793
020 02008 00008 00037 00037 02001 00001 00029 000291205794
020 01994 minus00006 00026 00026 02014 00014 00016 000161205791
020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792
030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793
040 04038 00038 00040 00040 04025 00025 00037 000371205794
010 01006 00006 00017 00017 00991 minus00009 00009 000091205791
030 03021 00021 00107 00107 03026 00026 00081 000811205792
025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793
015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794
030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01003 00003 00030 00030 01013 00013 00064 000641205792
050 05015 00015 00029 00029 05019 00019 00041 000411205793
020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794
020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791
020 02007 00007 00041 00041 02049 00049 00057 000571205792
030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793
040 04007 00007 00020 00020 03992 minus00008 00037 000371205794
010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791
030 03032 00032 00054 00054 02965 minus00036 00079 000791205792
025 02488 minus00012 00039 00039 02501 00001 00038 000381205793
015 01481 minus00019 00021 00021 01531 00031 00032 000321205794
030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 7 Cervical cancer data fromWilliamson and Haber [17]
Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)
119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903
4 1205794)
119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903
3 1205793)
Missing 43 (11990312 1205791+ 1205792) 59 (119903
34 1205793+ 1205794)
Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate
is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)
if a respondent was born in JanuaryndashApril (MayndashAugust
SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent
of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899
1= 11990313 = 55 119899
2=
55 + 1199032= 219 119899
3= 55 + 119903
3= 276 119899
4= 1199034= 103 that is
119884obs119872 = 119899 1198991 119899
4 = 653 55 219 276 103 (31)
On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898
0= 11990312
= 43 and1198981= 11990334
= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59
Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878
8 Journal of Probability and Statistics
Table 8 MLEs and 95 confidence intervals of 120579 and 120595
Parameter MLE std 95 asymptotic 95 bootstrapCI CI
1205791
023793 002937 [018036 029551] [018006 029879]1205792
024916 002261 [020484 029348] [020375 029359]1205793
035196 002247 [030790 039601] [030731 039666]1205794
016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]
61 Likelihood-Based Inferences Using 120579(0) = 144 as the
initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs
minus4is
Iminus1obs (minus4)
= (
000086302 minus0000442447 minus0000384452
minus000044245 0000511253 minus0000010026
minus000038445 minus0000010026 0000505241
)
(32)
The estimated standard errors of 120579119894(119894 = 1 2 3) are square
roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579
4= 1minus120579
1minus1205792minus1205793
and = 12057911205793(12057921205794) are given by
se (1205794) = 1⊤
3Iminus1obs(minus4) 13
12
se () = 120572⊤Iminus1obs (minus4)120572
12
(33)
respectively where
120572 = (
1205793(1 minus 120579
2minus 1205793)
12057921205792
4
12057911205793(1205792minus 1205794)
(12057921205794)2
1205791(1 minus 120579
1minus 1205792)
12057921205792
4
)
⊤
(34)
These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8
Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8
To test the null hypothesis 1198670 120595 = 1 against 119867
1 120595 = 1
we need to obtain the restricted MLE 119877 Using
(120579(0)
119909 120579(0)
119910)⊤
= (05 05)⊤ (35)
as the initial values the EM algorithm in (26) converged to
120579119909119877
= 064807 120579119910119877
= 052948 (36)
Table 9 Posterior modes and estimates of parameters for thecervical cancer data
Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval
1205791
023793 024025 002934 [018574 030050]1205792
024916 024789 002251 [020365 029192]1205793
035196 035029 002246 [030621 039432]1205794
016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]
in 19 iterations From (24) the restricted MLEs of 120579 areobtained as
119877= (016559 030493 034314 018634)
⊤ (37)
The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867
0is rejected at the 005 level of significance Thus we
can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1
62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T
4) is adopted as the prior dis-
tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1
44 as the initial values
we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status
The posterior densities of the 1205791198944
119894=1and 120595 estimated by a
kernel density smoother are plotted in Figures 1 and 2
7 Discussion
In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable
We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
2 Journal of Probability and Statistics
in the questionnaire to indirectly obtain respondentsrsquo answersto the two sensitive questions In the hidden sensitivitymodel they implicitly assumed that all respondents arewilling to follow the design instructions In other words thenoncompliance behavior will not occur However in practicewe sometimes need to assess the association between onesensitive binary variable (eg whether or not a drug user thenumber of sex partner being ⩽1 or gt1 and so on) and onenonsensitive binary variable (eg good or poor health statuswith or without cervical cancer and so on) To our know-ledge the survey design for addressing this issue and cor-responding statistical analysis methods are not availableAlthough we could directly adopt the hidden sensitivitymodel the information contained in the nonsensitive binaryvariable cannot be utilized in the design Intuitively suchinformation can be used to enhance the degree of privacy pro-tection so that more respondents will not be faced with thesensitive questionThemajor objective of this paper is to pro-pose a new survey design and to develop corresponding sta-tistical methods for analyzing sensitive data collected by thistechnique
The rest of the paper is organized as follows In Section 2without using any randomizing device we propose a surveyscheme called combination questionnaire designmodelwhich consists of a main questionnaire and a supplementalquestionnaire Likelihood-based inferences including maxi-mum likelihood estimates via an expectationndashmaximization(EM) algorithm asymptotic confidence intervals and boot-strap confidence intervals of parameters of interest arederived in Section 3 A likelihood ratio test is also provided totest the association between the two binary random variablesIn Section 4we discuss Bayesian inferenceswhen prior infor-mation on parameters is available In Section 5 two simu-lation studies are performed to compare the efficiency ofthe proposed combination questionnaire model with that ofthe existing hidden sensitivity model of Tianet al [8] (iethe main questionnaire only) A cervical cancer data set inAtlanta is used in Section 6 to illustrate the proposed meth-ods A discussion and an appendix on the mode of a groupDirichlet density and a sampling method from it are alsopresented
2 The Survey Design
Assume that 119883 is a sensitive binary random variable 119884 isa non-sensitive binary random variable and they are corre-lated Let 119883 = 1 denote the sensitive class (eg 119883 = 1 if arespondent is a drug user) and 119883 = 0 denote the non-sen-sitive class (eg 119883 = 0 if a respondent is not a drug user)Furthermore let both 119884 = 1 (eg 119884 = 1 if a respondentreceives at least some college training) and 119884 = 0 (eg119884 = 0 if a respondent graduates at most from some highschool) be non-sensitive classes Define 120579 = (120579
1 120579
4)⊤
where 1205791= Pr(119883 = 0 119884 = 0) 120579
2= Pr(119883 = 1 119884 = 0)
1205793= Pr(119883 = 1 119884 = 1) and 120579
4= Pr(119883 = 0 119884 = 1) then
120579 isin T4 where
T119899= (119909
1 119909
119899)⊤
119909119894⩾ 0 119894 = 1 119899
119899
sum
119894=1
119909119894= 1 (1)
Table 1 The main questionnaire for the combination questionnairemodel
Category 119882 = 1 119882 = 2 119882 = 3
I 119883 = 0 119884 = 0 Block 1 mdash Block 2 mdash Block 3 mdashII 119883 = 1 119884 = 0 Category II Please put a tick in Block 2III 119883 = 1 119884 = 1 Category III Please put a tick in Block 3IV 119883 = 0 119884 = 1 Block 4 mdashNote Only 119883 = 1 is a sensitive class while 119883 = 0 119884 = 0 and 119884 = 1are nonsensitive classes
Table 2 The supplemental questionnaire for the combinationquestionnaire model
Category
119884 = 0Please put a tick in Block 5 mdash if you belong to
119884 = 0
119884 = 1Please put a tick in Block 6 mdash if you belong to
119884 = 1
Note Both the 119884 = 0 and 119884 = 1 are nonsensitive classes
denotes the 119899-dimensional closed simplex in R119899 The objec-tive is to make inferences on 120579 120579
119909= Pr (119883 = 1) = 120579
2+ 1205793
120579119910= Pr(119884 = 1) = 120579
4+ 1205793and the odds ratio 120595 = 120579
11205793(12057921205794)
The survey scheme consists of a main questionnaire anda supplemental questionnaire To design the main question-naire which is to be assigned to group 1 with 119899 respondents(119899 is specified by the investigators) we first introduce a non-sensitive question (say 119876
119882) with three possible answers
Assume that 119882 is a non-sensitive variate with trichotomousoutcomes associated with the 119876
119882and 119882 is independent of
both 119883 and 119884 Define 119901119894= Pr(119882 = 119894) for 119894 = 1 2 3 For
example let119882 = 1 (2 3) denote that a respondent was bornin JanuaryndashApril (MayndashAugust SeptemberndashDecember) andthus we could assume that 119901
119894asymp 13 The main questionnaire
is shown in Table 1 under which each respondent is asked toanswer the non-sensitive question 119876
119882
On the one hand since Category I (ie 119883 = 0 119884 = 0)and Category IV (ie 119883 = 0 119884 = 1) are non-sensitive toeach respondent it is reasonable to assume that a respondentis willing to provide hisher truthful answer by putting a tickin Block 119894 (119894 = 1 4) according to hisher true statusOn the other hand Category II (ie 119883 = 1 119884 = 0) andCategory III (ie 119883 = 1 119884 = 1) are usually sensitive torespondents In this case if a respondent belongs to CategoryII (III) heshe is designed to put a tick in Block 2 (3)
The supplemental questionnaire is designed as shown inTable 2 under which 119898 respondents (119898 is also specified bythe investigators) in group 2 are asked to put a tick in Block5 or Block 6 depending on their true status that is 119884 = 0
or 119884 = 1 Since both the 119884 = 0 and 119884 = 1 are non-sensitive classes the supplemental questionnaire is in fact adesign of direct questioningTherefore we call this design thecombination questionnaire model
Table 3 shows the cell probabilities 1205791198944
119894=1 the observed
frequencies 1198991198944
119894=1and the unobservable frequencies 119885
1198943
119894=1
for the main questionnaire The observed frequency 1198992is the
sumof the frequency of respondents belonging to Block 2 and
Journal of Probability and Statistics 3
Table 3 Cell probabilities observed and unobservable counts forthe main questionnaire
Category 119882 = 1 119882 = 2 119882 = 3 TotalI 119883 = 0 119884 = 0 119901
11205791
11990121205791
11990131205791
1205791(1198851)
II 119883 = 1 119884 = 0 1205792(1198852)
III 119883 = 1 119884 = 1 1205793(1198853)
IV 119883 = 0 119884 = 1 1205794(1198994) 120579
4(1198994)
Total 11990111205791(1198991) 11990121205791+ 1205792(1198992) 11990131205791+ 1205793(1198993) 1(119899)
Note 119899 = sum4119894=11198991198941198851 = 119899minus(1198852 +1198853)minus1198994 where (1198852 1198853) are unobservable
Table 4Cell probabilities andobserved counts for the supplementalquestionnaire
Category Total119884 = 0 Block 5 mdash 120579
1+ 1205792(1198980)
119884 = 1 Block 6 mdash 1205793+ 1205794(1198981)
Total 1 (119898)
Note119898 = 1198980 + 1198981
the frequency of those belonging toCategory IIThe observedfrequency 119899
3is the sum of the frequency of respondents
belonging to Block 3 and the frequency of those belongingto Category III Note that 119899 = sum
4
119894=1119899119894= sum3
119894=1119885119894+ 1198994 we have
1198851= 119899minus(119885
2+1198853)minus1198994Thus only119885
2and119885
3are unobservable
Table 4 shows the cell probabilities and observed counts forthe supplemental questionnaire
Remark 1 The design of the main questionnaire is similar tothat of the hidden sensitivitymodel of Tian et al [8] while thedesign of the supplemental questionnaire is indeed a designof direct questioning since both the 119884 = 0 and 119884 = 1 arenon-sensitive classes Table 1 shows that Categories II and IIIare two sensitive subclassesTherefore putting a tick in Block2 or Block 3 implies that the respondent could be suspectedwith the sensitive attribute Let 120579
1= sdot sdot sdot = 120579
4asymp 025
(see Table 3) then around half of the say 2119899 respondentswill be suspected with the sensitive attribute if only themain questionnaire is employed However besides the mainquestionnaire (with 119899 respondents) if the supplemental ques-tionnaire (see Table 2) with 119898 = 119899 respondents is also usedthen only half of the 119899 respondents will be suspected withthe sensitive attribute In other words in the proposed com-bination questionnaire model the information of the non-sensitive binary variable 119884 can be used to enhance the degreeof privacy protection so that more respondents will not befacedwith the sensitive questionThis is whywe introduce thesupplemental questionnaire besides the main questionnaire
Remark 2 In practice to simplify the design itself we suggestthat both the sample size 119899 in the main questionnaire andthe sample size 119898 in the supplemental questionnaire shouldbe fixed in advance rather than the total sample size 119873 =
119899 + 119898 being fixed In this way survey data can be collectedin two independent groups resulting in a relatively simplerstatistical analysis In addition the interviewees are randomlyassigned to either the group 1 or group 2
3 Likelihood-Based Inferences
In this section maximum likelihood estimates (MLEs) of the120579 and the odds ratio 120595 are derived by using the EM algo-rithm In addition asymptotic confidence intervals and thebootstrap confidence intervals of an arbitrary function of 120579are also provided Finally a likelihood ratio test is presentedfor testing the association between the two binary randomvariables
31 MLEs via the EM Algorithm A total of 119873 = 119899 + 119898
respondents are classified into two groups by a randomizationapproach such that 119899 respondents answer the questionsin the main questionnaire and 119898 respondents answer thequestions in the supplemental questionnaire Let 119884obsM =
119899 1198991 119899
4 denote the observed counts collected in the
main questionnaire (see Table 3) where sum4
119894=1119899119894= 119899 The
likelihood function of 120579 based on 119884obs119872 is
119871 (120579 | 119884obs119872) = (119899
1198991 1198992 1198993 1198994
)
times (11990111205791)1198991
3
prod
119894=2
(1199011198941205791+ 120579119894)119899119894
1205791198994
4
(2)
Let 119884obs119878 = 1198981198980 1198981 denote the observed counts gathered
in the supplemental questionnaire (see Table 4) where 1198980+
1198981= 119898 The likelihood function of 120579 based on 119884obs119878 is
119871 (120579 | 119884obs119878) = (119898
1198980
) (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(3)
Let 119884obs = 119884obs119872 119884obs119878 Since 119884obs119872 and 119884obs119878 are inde-pendent the observed-data likelihood function of 120579 isin T
4is
119871CQ (120579 | 119884obs119878) prop 1205791198991
1
3
prod
119894=2
(1199011198941205791+ 120579119894)119899119894
1205791198994
4
times (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(4)
where the subscript ldquoCQrdquodenotes the ldquocombination question-nairerdquo model
Since the corresponding cell probabilities to the observedcounts 119899
2and 1198993in the group 1 are in the form of summation
(ie 1199011198941205791+ 120579119894 119894 = 2 3) we cannot obtain the explicit
expressions of the MLEs of 120579 from the score equations of(4) By treating the observed counts 119899
2and 119899
3as incomplete
data we use the EM algorithm [9] to find the MLE of 120579The counts 119885
1198943
119894=1in Table 3 can be viewed as missing data
Briefly119885111988521198853 and 119899
4represent the counts of the respond-
ents belonging to Categories I II III and IV respectivelyThus we denote the latent data by 119884mis = 119911
2 1199113 and the
complete data by 119884com = 119884obs 119884mis Note that all 119901i areknown Consequently the complete-data likelihood functionfor 120579 is
119871CQ (120579 | 119884COM) prop (
3
prod
119894=1
120579119911119894
119894)1205791198994
4times (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(5)
where 1199111= 119899 minus (119911
2+ 1199113) minus 1198994
4 Journal of Probability and Statistics
By treating 1205791198944
119894=1as random variables we note that
the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=1199112
119873(1 +
1198980
119899 minus 1199113minus 1198994
)
1205793=1199113
119873(1 +
1198981
1199113+ 1198994
)
1205794=1198994
119873(1 +
1198981
1199113+ 1198994
)
(6)
Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579
119894(1199011198941205791+ 120579119894) that is
119885119894| (119884obs 120579) sim Binomial(119899
119894
120579119894
1199011198941205791+ 120579119894
) 119894 = 2 3 (7)
Therefore the E-step of the EM algorithm computes thefollowing conditional expectations
119864 (119885119894| 119884obs 120579) =
119899119894120579119894
1199011198941205791+ 120579i
119894 = 2 3 (8)
and the M-step updates (6) by replacing 1199112and 119911
3with
previous conditional expectations
Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)
32 Asymptotic Confidence Intervals Let 120579minus4
= (1205791 1205792 1205793)⊤
The asymptotic variance-covariance matrix of theMLE minus4
isthen given by Iminus1obs(minus4) where
Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)
120597120579minus4120597120579⊤
minus4
(9)
denotes the observed information matrix and ℓCQ(120579 |
119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have
ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901
21205791+ 1205792)
+ 1198993log (119901
31205791+ 1205793)
+ 1198994log (1 minus 120579
1minus 1205792minus 1205793)
+ 1198980log (120579
1+ 1205792) + 1198981log (1 minus 120579
1minus 1205792)
(10)
It is easy to show that
120597ℓCQ (120579 | 119884obs)
1205971205791
=1198991
1205791
minus1198994
1205794
+
3
sum
119894=2
119899i119901i119901i1205791 + 120579i
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205792
= minus1198994
1205794
+1198992
11990121205791+ 1205792
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205793
= minus1198994
1205794
+1198993
11990131205791+ 1205793
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
1
=1198991
1205792
1
+1198994
1205792
4
+
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
2
=1198994
1205792
4
+1198992
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
3
=1198994
1205792
4
+1198993
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205792
=1198994
1205792
4
+11989921199012
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205793
=1198994
1205792
4
+11989931199013
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057921205971205793
=1198994
1205792
4
(11)
where
120601 =1198980
(1205791+ 1205792)2+
1198981
(1 minus 1205791minus 1205792)2 (12)
Hence the observed information matrix can be expressed as
Iobs (120579minus4) = diag(1198991
1205792
1
0 0) +1198994
1205792
4
times 131⊤3+ A (13)
Journal of Probability and Statistics 5
whereA =
((((
(
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+120601
11989921199012
(11990121205791+ 1205792)2+120601
11989931199013
(11990131205791+1205793)2
11989921199012
(11990121205791+ 1205792)2+120601
1198992
(11990121205791+ 1205792)2+120601 0
11989931199013
(11990131205791+ 1205793)2 0
1198993
(11990131205791+1205793)2
))))
)
(14)
Let se(120579119894) denote the standard error of 120579
119894for 119894 = 1 2 3
Note that se(120579119894) can be estimated by the square root of the 119894th
diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579
119894) by se(120579
119894)Thus a 95normal-based asymptotic con-
fidence interval for 120579i can be constructed as
[120579119894minus 196 times se (120579
119894) 120579119894+ 196 times se (120579
119894)] 119894 = 1 2 3 (15)
Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of
120579minus4 For example 120579
4= 1 minus sum
3
119894=1120579i and the odds ratio 120595 =
12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used
to approximate the standard error of 120599 = ℎ(minus4) and a 95
normal-based asymptotic confidence interval for 120599 is given by
[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)
where
se (120599) = (120597120599
120597120579minus4
)
⊤
Iminus1obs (120579minus4) (120597120599
120597120579minus4
)
10038161003816100381610038161003816100381610038161003816120579minus4= minus4
12
(17)
33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579
minus4) Based on the obtained MLE we inde-
pendently generate
(119899lowast
1 119899
lowast
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(18)
119898lowast
0sim Binomial (119898 120579
1+ 1205792) (19)
Having obtained 119884lowast
obs119872 = 119899 119899lowast
1 119899
lowast
4 and 119884
lowast
obs119878 =
119898119898lowast
0 119898lowast
1 where 119898
lowast
1= 119898 minus 119898
lowast
0 we can calculate the
bootstrap replication 120599lowast= ℎ(120579
lowast
minus4) based on 119884
lowast
obs = 119884lowast
obs119872
119884lowast
obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast
119892119866
119892=1 Consequently a (1 minus 120572)100 bootstrap
confidence interval for 120599 is given by
[120599119871 120599119880] (20)
where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles
of 120599lowast119892119866
119892=1 respectively
34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]
1198670 120595 = 1 against 119867
1 120595 = 1 (21)
The likelihood ratio statistic is defined by
Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)
where 119877denotes the restrictedMLE of 120579 under119867
0 denotes
the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)
To find the restricted MLE 119877 we also employ the 119864119872
algorithm Under1198670 12057911205793= 12057921205794 we have
1205791= (1 minus 120579
119909) (1 minus 120579
119910)
1205792= 120579119909(1 minus 120579
119910)
1205793= 120579119909120579119910
1205794= (1 minus 120579
119909) 120579119910
(23)
In otherwords under1198670we only have two free parameters 120579
119909
and 120579119910 Having obtained the restrictedMLEs 120579
119909119877and 120579119910119877
wecan compute the restricted MLE
119877= (1205791119877
1205794119877
)⊤ from
(23) by
1205791119877
= (1 minus 120579119909119877
) (1 minus 120579119910119877
)
1205792119877
= 120579119909119877
(1 minus 120579119910119877
)
1205793119877
= 120579119909119877
120579119910119877
1205794119877
= (1 minus 120579119909119877
) 120579119910119877
(24)
In what follows we consider the computation of therestricted MLEs 120579
119909119877and 120579
119910119877 Now the complete-data like-
lihood function (5) becomes
119871CQ (120579119909 120579119910| 119884com 1198670)
prop (1 minus 120579119909) (1 minus 120579
119910)1199111
120579119909(1 minus 120579
119910)1199112
times (120579119909120579119910)1199113
(1 minus 120579119909) 1205791199101198994
(1 minus 120579119910)1198980
1205791198981
119910
= 1205791199112+1199113
119909(1 minus 120579
119909)1199111+1198994
1205791199113+1198994+1198981
119910(1 minus 120579
119910)1199111+1199112+1198980
(25)
so that the restricted MLEs of 120579119909and 120579
119910based on the
complete-data are given by
120579119909=1199112+ 1199113
119899 120579
119910=1199113+ 1198994+ 1198981
119873 (26)
respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579
119894 are defined in (23) Finally under119867
0
Λ asymptotically follows chi-squared distribution with onedegree of freedom
6 Journal of Probability and Statistics
4 Bayesian Inferences
To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911
2 1199113 are the same as
those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886
1 119886
4)
is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is
119891 (120579 | 119884obs 119884mis) = GD422
(120579minus4
| a b) (27)
where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898
0 1198981)⊤
and 1199111= 119899 minus (119911
2+ 1199113) minus 1198994 The conditional predictive dis-
tribution is
119891 (119884mis | 119884obs 120579) =3
prod
119894=2
Binomial(119911119894
10038161003816100381610038161003816100381610038161003816
119899119894
120579119894
1199011198941205791+ 120579119894
) (28)
Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=
1198862minus 1 + 119911
2
119886+minus 4 + 119873
(1 +1198980
1198861+ 1198862minus 2 + 119899 minus 119911
3minus 1198994
)
1205793=
1198863minus 1 + 119911
3
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
1205794=
1198864minus 1 + 119899
4
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
(29)
where 119886+= sum4
119894=1119886119894 and the 119864-step is to replace 119911
1198943
119894=2by the
conditional expectations given by (8)In addition based on (27) and (28) the data augmen-
tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix
5 Simulation Studies
In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901
119894=Pr(119882 =
119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model
In the first simulated example let a total of119873 = 119899 + 119898 =
50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579
1198944
119894=1are
listed in the second column of Table 5 We first generate
(1198991 119899
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(30)
and1198980sim Binomial (119898 120579
1+1205792) so that119898
1= 119898minus119898
0The119864119872
algorithm (6) and (8) is used to calculate the MLEs of 1205791198944
119894=1
We repeated this experiment 1000 times The average MLEsof 1205791198944
119894=1are reported in the third column of Table 5 The
corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5
From Table 5 we can see that both the MLEs of 1205791198944
119894=1
in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579
1198944
119894=1in the combination questionnaire model
are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency
In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =
119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6
From Table 6 we can see that the MSEs of 1205791198944
119894=1in the
combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944
119894=1for the proposed combination questionnaire model
in the second simulated example are significantly improvedwhen compared with those in the first simulated example
6 Analyzing Cervical Cancer Data in Atlanta
Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer
For the purpose of illustration we presume that 119883 = 0
is non-sensitive although the number 0ndash3 of sex partners
Journal of Probability and Statistics 7
Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01013 00013 00113 00113 01010 00010 00064 000641205792
050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793
020 02008 00008 00037 00037 02001 00001 00029 000291205794
020 01994 minus00006 00026 00026 02014 00014 00016 000161205791
020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792
030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793
040 04038 00038 00040 00040 04025 00025 00037 000371205794
010 01006 00006 00017 00017 00991 minus00009 00009 000091205791
030 03021 00021 00107 00107 03026 00026 00081 000811205792
025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793
015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794
030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01003 00003 00030 00030 01013 00013 00064 000641205792
050 05015 00015 00029 00029 05019 00019 00041 000411205793
020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794
020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791
020 02007 00007 00041 00041 02049 00049 00057 000571205792
030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793
040 04007 00007 00020 00020 03992 minus00008 00037 000371205794
010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791
030 03032 00032 00054 00054 02965 minus00036 00079 000791205792
025 02488 minus00012 00039 00039 02501 00001 00038 000381205793
015 01481 minus00019 00021 00021 01531 00031 00032 000321205794
030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 7 Cervical cancer data fromWilliamson and Haber [17]
Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)
119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903
4 1205794)
119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903
3 1205793)
Missing 43 (11990312 1205791+ 1205792) 59 (119903
34 1205793+ 1205794)
Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate
is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)
if a respondent was born in JanuaryndashApril (MayndashAugust
SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent
of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899
1= 11990313 = 55 119899
2=
55 + 1199032= 219 119899
3= 55 + 119903
3= 276 119899
4= 1199034= 103 that is
119884obs119872 = 119899 1198991 119899
4 = 653 55 219 276 103 (31)
On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898
0= 11990312
= 43 and1198981= 11990334
= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59
Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878
8 Journal of Probability and Statistics
Table 8 MLEs and 95 confidence intervals of 120579 and 120595
Parameter MLE std 95 asymptotic 95 bootstrapCI CI
1205791
023793 002937 [018036 029551] [018006 029879]1205792
024916 002261 [020484 029348] [020375 029359]1205793
035196 002247 [030790 039601] [030731 039666]1205794
016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]
61 Likelihood-Based Inferences Using 120579(0) = 144 as the
initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs
minus4is
Iminus1obs (minus4)
= (
000086302 minus0000442447 minus0000384452
minus000044245 0000511253 minus0000010026
minus000038445 minus0000010026 0000505241
)
(32)
The estimated standard errors of 120579119894(119894 = 1 2 3) are square
roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579
4= 1minus120579
1minus1205792minus1205793
and = 12057911205793(12057921205794) are given by
se (1205794) = 1⊤
3Iminus1obs(minus4) 13
12
se () = 120572⊤Iminus1obs (minus4)120572
12
(33)
respectively where
120572 = (
1205793(1 minus 120579
2minus 1205793)
12057921205792
4
12057911205793(1205792minus 1205794)
(12057921205794)2
1205791(1 minus 120579
1minus 1205792)
12057921205792
4
)
⊤
(34)
These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8
Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8
To test the null hypothesis 1198670 120595 = 1 against 119867
1 120595 = 1
we need to obtain the restricted MLE 119877 Using
(120579(0)
119909 120579(0)
119910)⊤
= (05 05)⊤ (35)
as the initial values the EM algorithm in (26) converged to
120579119909119877
= 064807 120579119910119877
= 052948 (36)
Table 9 Posterior modes and estimates of parameters for thecervical cancer data
Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval
1205791
023793 024025 002934 [018574 030050]1205792
024916 024789 002251 [020365 029192]1205793
035196 035029 002246 [030621 039432]1205794
016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]
in 19 iterations From (24) the restricted MLEs of 120579 areobtained as
119877= (016559 030493 034314 018634)
⊤ (37)
The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867
0is rejected at the 005 level of significance Thus we
can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1
62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T
4) is adopted as the prior dis-
tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1
44 as the initial values
we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status
The posterior densities of the 1205791198944
119894=1and 120595 estimated by a
kernel density smoother are plotted in Figures 1 and 2
7 Discussion
In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable
We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Journal of Probability and Statistics 3
Table 3 Cell probabilities observed and unobservable counts forthe main questionnaire
Category 119882 = 1 119882 = 2 119882 = 3 TotalI 119883 = 0 119884 = 0 119901
11205791
11990121205791
11990131205791
1205791(1198851)
II 119883 = 1 119884 = 0 1205792(1198852)
III 119883 = 1 119884 = 1 1205793(1198853)
IV 119883 = 0 119884 = 1 1205794(1198994) 120579
4(1198994)
Total 11990111205791(1198991) 11990121205791+ 1205792(1198992) 11990131205791+ 1205793(1198993) 1(119899)
Note 119899 = sum4119894=11198991198941198851 = 119899minus(1198852 +1198853)minus1198994 where (1198852 1198853) are unobservable
Table 4Cell probabilities andobserved counts for the supplementalquestionnaire
Category Total119884 = 0 Block 5 mdash 120579
1+ 1205792(1198980)
119884 = 1 Block 6 mdash 1205793+ 1205794(1198981)
Total 1 (119898)
Note119898 = 1198980 + 1198981
the frequency of those belonging toCategory IIThe observedfrequency 119899
3is the sum of the frequency of respondents
belonging to Block 3 and the frequency of those belongingto Category III Note that 119899 = sum
4
119894=1119899119894= sum3
119894=1119885119894+ 1198994 we have
1198851= 119899minus(119885
2+1198853)minus1198994Thus only119885
2and119885
3are unobservable
Table 4 shows the cell probabilities and observed counts forthe supplemental questionnaire
Remark 1 The design of the main questionnaire is similar tothat of the hidden sensitivitymodel of Tian et al [8] while thedesign of the supplemental questionnaire is indeed a designof direct questioning since both the 119884 = 0 and 119884 = 1 arenon-sensitive classes Table 1 shows that Categories II and IIIare two sensitive subclassesTherefore putting a tick in Block2 or Block 3 implies that the respondent could be suspectedwith the sensitive attribute Let 120579
1= sdot sdot sdot = 120579
4asymp 025
(see Table 3) then around half of the say 2119899 respondentswill be suspected with the sensitive attribute if only themain questionnaire is employed However besides the mainquestionnaire (with 119899 respondents) if the supplemental ques-tionnaire (see Table 2) with 119898 = 119899 respondents is also usedthen only half of the 119899 respondents will be suspected withthe sensitive attribute In other words in the proposed com-bination questionnaire model the information of the non-sensitive binary variable 119884 can be used to enhance the degreeof privacy protection so that more respondents will not befacedwith the sensitive questionThis is whywe introduce thesupplemental questionnaire besides the main questionnaire
Remark 2 In practice to simplify the design itself we suggestthat both the sample size 119899 in the main questionnaire andthe sample size 119898 in the supplemental questionnaire shouldbe fixed in advance rather than the total sample size 119873 =
119899 + 119898 being fixed In this way survey data can be collectedin two independent groups resulting in a relatively simplerstatistical analysis In addition the interviewees are randomlyassigned to either the group 1 or group 2
3 Likelihood-Based Inferences
In this section maximum likelihood estimates (MLEs) of the120579 and the odds ratio 120595 are derived by using the EM algo-rithm In addition asymptotic confidence intervals and thebootstrap confidence intervals of an arbitrary function of 120579are also provided Finally a likelihood ratio test is presentedfor testing the association between the two binary randomvariables
31 MLEs via the EM Algorithm A total of 119873 = 119899 + 119898
respondents are classified into two groups by a randomizationapproach such that 119899 respondents answer the questionsin the main questionnaire and 119898 respondents answer thequestions in the supplemental questionnaire Let 119884obsM =
119899 1198991 119899
4 denote the observed counts collected in the
main questionnaire (see Table 3) where sum4
119894=1119899119894= 119899 The
likelihood function of 120579 based on 119884obs119872 is
119871 (120579 | 119884obs119872) = (119899
1198991 1198992 1198993 1198994
)
times (11990111205791)1198991
3
prod
119894=2
(1199011198941205791+ 120579119894)119899119894
1205791198994
4
(2)
Let 119884obs119878 = 1198981198980 1198981 denote the observed counts gathered
in the supplemental questionnaire (see Table 4) where 1198980+
1198981= 119898 The likelihood function of 120579 based on 119884obs119878 is
119871 (120579 | 119884obs119878) = (119898
1198980
) (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(3)
Let 119884obs = 119884obs119872 119884obs119878 Since 119884obs119872 and 119884obs119878 are inde-pendent the observed-data likelihood function of 120579 isin T
4is
119871CQ (120579 | 119884obs119878) prop 1205791198991
1
3
prod
119894=2
(1199011198941205791+ 120579119894)119899119894
1205791198994
4
times (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(4)
where the subscript ldquoCQrdquodenotes the ldquocombination question-nairerdquo model
Since the corresponding cell probabilities to the observedcounts 119899
2and 1198993in the group 1 are in the form of summation
(ie 1199011198941205791+ 120579119894 119894 = 2 3) we cannot obtain the explicit
expressions of the MLEs of 120579 from the score equations of(4) By treating the observed counts 119899
2and 119899
3as incomplete
data we use the EM algorithm [9] to find the MLE of 120579The counts 119885
1198943
119894=1in Table 3 can be viewed as missing data
Briefly119885111988521198853 and 119899
4represent the counts of the respond-
ents belonging to Categories I II III and IV respectivelyThus we denote the latent data by 119884mis = 119911
2 1199113 and the
complete data by 119884com = 119884obs 119884mis Note that all 119901i areknown Consequently the complete-data likelihood functionfor 120579 is
119871CQ (120579 | 119884COM) prop (
3
prod
119894=1
120579119911119894
119894)1205791198994
4times (1205791+ 1205792)1198980
(1205793+ 1205794)1198981
(5)
where 1199111= 119899 minus (119911
2+ 1199113) minus 1198994
4 Journal of Probability and Statistics
By treating 1205791198944
119894=1as random variables we note that
the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=1199112
119873(1 +
1198980
119899 minus 1199113minus 1198994
)
1205793=1199113
119873(1 +
1198981
1199113+ 1198994
)
1205794=1198994
119873(1 +
1198981
1199113+ 1198994
)
(6)
Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579
119894(1199011198941205791+ 120579119894) that is
119885119894| (119884obs 120579) sim Binomial(119899
119894
120579119894
1199011198941205791+ 120579119894
) 119894 = 2 3 (7)
Therefore the E-step of the EM algorithm computes thefollowing conditional expectations
119864 (119885119894| 119884obs 120579) =
119899119894120579119894
1199011198941205791+ 120579i
119894 = 2 3 (8)
and the M-step updates (6) by replacing 1199112and 119911
3with
previous conditional expectations
Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)
32 Asymptotic Confidence Intervals Let 120579minus4
= (1205791 1205792 1205793)⊤
The asymptotic variance-covariance matrix of theMLE minus4
isthen given by Iminus1obs(minus4) where
Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)
120597120579minus4120597120579⊤
minus4
(9)
denotes the observed information matrix and ℓCQ(120579 |
119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have
ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901
21205791+ 1205792)
+ 1198993log (119901
31205791+ 1205793)
+ 1198994log (1 minus 120579
1minus 1205792minus 1205793)
+ 1198980log (120579
1+ 1205792) + 1198981log (1 minus 120579
1minus 1205792)
(10)
It is easy to show that
120597ℓCQ (120579 | 119884obs)
1205971205791
=1198991
1205791
minus1198994
1205794
+
3
sum
119894=2
119899i119901i119901i1205791 + 120579i
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205792
= minus1198994
1205794
+1198992
11990121205791+ 1205792
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205793
= minus1198994
1205794
+1198993
11990131205791+ 1205793
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
1
=1198991
1205792
1
+1198994
1205792
4
+
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
2
=1198994
1205792
4
+1198992
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
3
=1198994
1205792
4
+1198993
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205792
=1198994
1205792
4
+11989921199012
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205793
=1198994
1205792
4
+11989931199013
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057921205971205793
=1198994
1205792
4
(11)
where
120601 =1198980
(1205791+ 1205792)2+
1198981
(1 minus 1205791minus 1205792)2 (12)
Hence the observed information matrix can be expressed as
Iobs (120579minus4) = diag(1198991
1205792
1
0 0) +1198994
1205792
4
times 131⊤3+ A (13)
Journal of Probability and Statistics 5
whereA =
((((
(
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+120601
11989921199012
(11990121205791+ 1205792)2+120601
11989931199013
(11990131205791+1205793)2
11989921199012
(11990121205791+ 1205792)2+120601
1198992
(11990121205791+ 1205792)2+120601 0
11989931199013
(11990131205791+ 1205793)2 0
1198993
(11990131205791+1205793)2
))))
)
(14)
Let se(120579119894) denote the standard error of 120579
119894for 119894 = 1 2 3
Note that se(120579119894) can be estimated by the square root of the 119894th
diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579
119894) by se(120579
119894)Thus a 95normal-based asymptotic con-
fidence interval for 120579i can be constructed as
[120579119894minus 196 times se (120579
119894) 120579119894+ 196 times se (120579
119894)] 119894 = 1 2 3 (15)
Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of
120579minus4 For example 120579
4= 1 minus sum
3
119894=1120579i and the odds ratio 120595 =
12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used
to approximate the standard error of 120599 = ℎ(minus4) and a 95
normal-based asymptotic confidence interval for 120599 is given by
[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)
where
se (120599) = (120597120599
120597120579minus4
)
⊤
Iminus1obs (120579minus4) (120597120599
120597120579minus4
)
10038161003816100381610038161003816100381610038161003816120579minus4= minus4
12
(17)
33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579
minus4) Based on the obtained MLE we inde-
pendently generate
(119899lowast
1 119899
lowast
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(18)
119898lowast
0sim Binomial (119898 120579
1+ 1205792) (19)
Having obtained 119884lowast
obs119872 = 119899 119899lowast
1 119899
lowast
4 and 119884
lowast
obs119878 =
119898119898lowast
0 119898lowast
1 where 119898
lowast
1= 119898 minus 119898
lowast
0 we can calculate the
bootstrap replication 120599lowast= ℎ(120579
lowast
minus4) based on 119884
lowast
obs = 119884lowast
obs119872
119884lowast
obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast
119892119866
119892=1 Consequently a (1 minus 120572)100 bootstrap
confidence interval for 120599 is given by
[120599119871 120599119880] (20)
where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles
of 120599lowast119892119866
119892=1 respectively
34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]
1198670 120595 = 1 against 119867
1 120595 = 1 (21)
The likelihood ratio statistic is defined by
Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)
where 119877denotes the restrictedMLE of 120579 under119867
0 denotes
the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)
To find the restricted MLE 119877 we also employ the 119864119872
algorithm Under1198670 12057911205793= 12057921205794 we have
1205791= (1 minus 120579
119909) (1 minus 120579
119910)
1205792= 120579119909(1 minus 120579
119910)
1205793= 120579119909120579119910
1205794= (1 minus 120579
119909) 120579119910
(23)
In otherwords under1198670we only have two free parameters 120579
119909
and 120579119910 Having obtained the restrictedMLEs 120579
119909119877and 120579119910119877
wecan compute the restricted MLE
119877= (1205791119877
1205794119877
)⊤ from
(23) by
1205791119877
= (1 minus 120579119909119877
) (1 minus 120579119910119877
)
1205792119877
= 120579119909119877
(1 minus 120579119910119877
)
1205793119877
= 120579119909119877
120579119910119877
1205794119877
= (1 minus 120579119909119877
) 120579119910119877
(24)
In what follows we consider the computation of therestricted MLEs 120579
119909119877and 120579
119910119877 Now the complete-data like-
lihood function (5) becomes
119871CQ (120579119909 120579119910| 119884com 1198670)
prop (1 minus 120579119909) (1 minus 120579
119910)1199111
120579119909(1 minus 120579
119910)1199112
times (120579119909120579119910)1199113
(1 minus 120579119909) 1205791199101198994
(1 minus 120579119910)1198980
1205791198981
119910
= 1205791199112+1199113
119909(1 minus 120579
119909)1199111+1198994
1205791199113+1198994+1198981
119910(1 minus 120579
119910)1199111+1199112+1198980
(25)
so that the restricted MLEs of 120579119909and 120579
119910based on the
complete-data are given by
120579119909=1199112+ 1199113
119899 120579
119910=1199113+ 1198994+ 1198981
119873 (26)
respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579
119894 are defined in (23) Finally under119867
0
Λ asymptotically follows chi-squared distribution with onedegree of freedom
6 Journal of Probability and Statistics
4 Bayesian Inferences
To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911
2 1199113 are the same as
those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886
1 119886
4)
is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is
119891 (120579 | 119884obs 119884mis) = GD422
(120579minus4
| a b) (27)
where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898
0 1198981)⊤
and 1199111= 119899 minus (119911
2+ 1199113) minus 1198994 The conditional predictive dis-
tribution is
119891 (119884mis | 119884obs 120579) =3
prod
119894=2
Binomial(119911119894
10038161003816100381610038161003816100381610038161003816
119899119894
120579119894
1199011198941205791+ 120579119894
) (28)
Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=
1198862minus 1 + 119911
2
119886+minus 4 + 119873
(1 +1198980
1198861+ 1198862minus 2 + 119899 minus 119911
3minus 1198994
)
1205793=
1198863minus 1 + 119911
3
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
1205794=
1198864minus 1 + 119899
4
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
(29)
where 119886+= sum4
119894=1119886119894 and the 119864-step is to replace 119911
1198943
119894=2by the
conditional expectations given by (8)In addition based on (27) and (28) the data augmen-
tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix
5 Simulation Studies
In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901
119894=Pr(119882 =
119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model
In the first simulated example let a total of119873 = 119899 + 119898 =
50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579
1198944
119894=1are
listed in the second column of Table 5 We first generate
(1198991 119899
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(30)
and1198980sim Binomial (119898 120579
1+1205792) so that119898
1= 119898minus119898
0The119864119872
algorithm (6) and (8) is used to calculate the MLEs of 1205791198944
119894=1
We repeated this experiment 1000 times The average MLEsof 1205791198944
119894=1are reported in the third column of Table 5 The
corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5
From Table 5 we can see that both the MLEs of 1205791198944
119894=1
in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579
1198944
119894=1in the combination questionnaire model
are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency
In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =
119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6
From Table 6 we can see that the MSEs of 1205791198944
119894=1in the
combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944
119894=1for the proposed combination questionnaire model
in the second simulated example are significantly improvedwhen compared with those in the first simulated example
6 Analyzing Cervical Cancer Data in Atlanta
Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer
For the purpose of illustration we presume that 119883 = 0
is non-sensitive although the number 0ndash3 of sex partners
Journal of Probability and Statistics 7
Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01013 00013 00113 00113 01010 00010 00064 000641205792
050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793
020 02008 00008 00037 00037 02001 00001 00029 000291205794
020 01994 minus00006 00026 00026 02014 00014 00016 000161205791
020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792
030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793
040 04038 00038 00040 00040 04025 00025 00037 000371205794
010 01006 00006 00017 00017 00991 minus00009 00009 000091205791
030 03021 00021 00107 00107 03026 00026 00081 000811205792
025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793
015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794
030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01003 00003 00030 00030 01013 00013 00064 000641205792
050 05015 00015 00029 00029 05019 00019 00041 000411205793
020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794
020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791
020 02007 00007 00041 00041 02049 00049 00057 000571205792
030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793
040 04007 00007 00020 00020 03992 minus00008 00037 000371205794
010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791
030 03032 00032 00054 00054 02965 minus00036 00079 000791205792
025 02488 minus00012 00039 00039 02501 00001 00038 000381205793
015 01481 minus00019 00021 00021 01531 00031 00032 000321205794
030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 7 Cervical cancer data fromWilliamson and Haber [17]
Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)
119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903
4 1205794)
119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903
3 1205793)
Missing 43 (11990312 1205791+ 1205792) 59 (119903
34 1205793+ 1205794)
Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate
is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)
if a respondent was born in JanuaryndashApril (MayndashAugust
SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent
of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899
1= 11990313 = 55 119899
2=
55 + 1199032= 219 119899
3= 55 + 119903
3= 276 119899
4= 1199034= 103 that is
119884obs119872 = 119899 1198991 119899
4 = 653 55 219 276 103 (31)
On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898
0= 11990312
= 43 and1198981= 11990334
= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59
Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878
8 Journal of Probability and Statistics
Table 8 MLEs and 95 confidence intervals of 120579 and 120595
Parameter MLE std 95 asymptotic 95 bootstrapCI CI
1205791
023793 002937 [018036 029551] [018006 029879]1205792
024916 002261 [020484 029348] [020375 029359]1205793
035196 002247 [030790 039601] [030731 039666]1205794
016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]
61 Likelihood-Based Inferences Using 120579(0) = 144 as the
initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs
minus4is
Iminus1obs (minus4)
= (
000086302 minus0000442447 minus0000384452
minus000044245 0000511253 minus0000010026
minus000038445 minus0000010026 0000505241
)
(32)
The estimated standard errors of 120579119894(119894 = 1 2 3) are square
roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579
4= 1minus120579
1minus1205792minus1205793
and = 12057911205793(12057921205794) are given by
se (1205794) = 1⊤
3Iminus1obs(minus4) 13
12
se () = 120572⊤Iminus1obs (minus4)120572
12
(33)
respectively where
120572 = (
1205793(1 minus 120579
2minus 1205793)
12057921205792
4
12057911205793(1205792minus 1205794)
(12057921205794)2
1205791(1 minus 120579
1minus 1205792)
12057921205792
4
)
⊤
(34)
These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8
Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8
To test the null hypothesis 1198670 120595 = 1 against 119867
1 120595 = 1
we need to obtain the restricted MLE 119877 Using
(120579(0)
119909 120579(0)
119910)⊤
= (05 05)⊤ (35)
as the initial values the EM algorithm in (26) converged to
120579119909119877
= 064807 120579119910119877
= 052948 (36)
Table 9 Posterior modes and estimates of parameters for thecervical cancer data
Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval
1205791
023793 024025 002934 [018574 030050]1205792
024916 024789 002251 [020365 029192]1205793
035196 035029 002246 [030621 039432]1205794
016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]
in 19 iterations From (24) the restricted MLEs of 120579 areobtained as
119877= (016559 030493 034314 018634)
⊤ (37)
The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867
0is rejected at the 005 level of significance Thus we
can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1
62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T
4) is adopted as the prior dis-
tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1
44 as the initial values
we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status
The posterior densities of the 1205791198944
119894=1and 120595 estimated by a
kernel density smoother are plotted in Figures 1 and 2
7 Discussion
In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable
We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
4 Journal of Probability and Statistics
By treating 1205791198944
119894=1as random variables we note that
the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=1199112
119873(1 +
1198980
119899 minus 1199113minus 1198994
)
1205793=1199113
119873(1 +
1198981
1199113+ 1198994
)
1205794=1198994
119873(1 +
1198981
1199113+ 1198994
)
(6)
Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579
119894(1199011198941205791+ 120579119894) that is
119885119894| (119884obs 120579) sim Binomial(119899
119894
120579119894
1199011198941205791+ 120579119894
) 119894 = 2 3 (7)
Therefore the E-step of the EM algorithm computes thefollowing conditional expectations
119864 (119885119894| 119884obs 120579) =
119899119894120579119894
1199011198941205791+ 120579i
119894 = 2 3 (8)
and the M-step updates (6) by replacing 1199112and 119911
3with
previous conditional expectations
Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)
32 Asymptotic Confidence Intervals Let 120579minus4
= (1205791 1205792 1205793)⊤
The asymptotic variance-covariance matrix of theMLE minus4
isthen given by Iminus1obs(minus4) where
Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)
120597120579minus4120597120579⊤
minus4
(9)
denotes the observed information matrix and ℓCQ(120579 |
119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have
ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901
21205791+ 1205792)
+ 1198993log (119901
31205791+ 1205793)
+ 1198994log (1 minus 120579
1minus 1205792minus 1205793)
+ 1198980log (120579
1+ 1205792) + 1198981log (1 minus 120579
1minus 1205792)
(10)
It is easy to show that
120597ℓCQ (120579 | 119884obs)
1205971205791
=1198991
1205791
minus1198994
1205794
+
3
sum
119894=2
119899i119901i119901i1205791 + 120579i
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205792
= minus1198994
1205794
+1198992
11990121205791+ 1205792
+1198980
1205791+ 1205792
minus1198981
1 minus 1205791minus 1205792
120597ℓCQ (120579 | 119884obs)
1205971205793
= minus1198994
1205794
+1198993
11990131205791+ 1205793
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
1
=1198991
1205792
1
+1198994
1205792
4
+
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
2
=1198994
1205792
4
+1198992
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
1205971205792
3
=1198994
1205792
4
+1198993
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205792
=1198994
1205792
4
+11989921199012
(11990121205791+ 1205792)2+ 120601
minus1205972ℓCQ (120579 | 119884obs)
12059712057911205971205793
=1198994
1205792
4
+11989931199013
(11990131205791+ 1205793)2
minus1205972ℓCQ (120579 | 119884obs)
12059712057921205971205793
=1198994
1205792
4
(11)
where
120601 =1198980
(1205791+ 1205792)2+
1198981
(1 minus 1205791minus 1205792)2 (12)
Hence the observed information matrix can be expressed as
Iobs (120579minus4) = diag(1198991
1205792
1
0 0) +1198994
1205792
4
times 131⊤3+ A (13)
Journal of Probability and Statistics 5
whereA =
((((
(
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+120601
11989921199012
(11990121205791+ 1205792)2+120601
11989931199013
(11990131205791+1205793)2
11989921199012
(11990121205791+ 1205792)2+120601
1198992
(11990121205791+ 1205792)2+120601 0
11989931199013
(11990131205791+ 1205793)2 0
1198993
(11990131205791+1205793)2
))))
)
(14)
Let se(120579119894) denote the standard error of 120579
119894for 119894 = 1 2 3
Note that se(120579119894) can be estimated by the square root of the 119894th
diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579
119894) by se(120579
119894)Thus a 95normal-based asymptotic con-
fidence interval for 120579i can be constructed as
[120579119894minus 196 times se (120579
119894) 120579119894+ 196 times se (120579
119894)] 119894 = 1 2 3 (15)
Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of
120579minus4 For example 120579
4= 1 minus sum
3
119894=1120579i and the odds ratio 120595 =
12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used
to approximate the standard error of 120599 = ℎ(minus4) and a 95
normal-based asymptotic confidence interval for 120599 is given by
[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)
where
se (120599) = (120597120599
120597120579minus4
)
⊤
Iminus1obs (120579minus4) (120597120599
120597120579minus4
)
10038161003816100381610038161003816100381610038161003816120579minus4= minus4
12
(17)
33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579
minus4) Based on the obtained MLE we inde-
pendently generate
(119899lowast
1 119899
lowast
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(18)
119898lowast
0sim Binomial (119898 120579
1+ 1205792) (19)
Having obtained 119884lowast
obs119872 = 119899 119899lowast
1 119899
lowast
4 and 119884
lowast
obs119878 =
119898119898lowast
0 119898lowast
1 where 119898
lowast
1= 119898 minus 119898
lowast
0 we can calculate the
bootstrap replication 120599lowast= ℎ(120579
lowast
minus4) based on 119884
lowast
obs = 119884lowast
obs119872
119884lowast
obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast
119892119866
119892=1 Consequently a (1 minus 120572)100 bootstrap
confidence interval for 120599 is given by
[120599119871 120599119880] (20)
where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles
of 120599lowast119892119866
119892=1 respectively
34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]
1198670 120595 = 1 against 119867
1 120595 = 1 (21)
The likelihood ratio statistic is defined by
Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)
where 119877denotes the restrictedMLE of 120579 under119867
0 denotes
the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)
To find the restricted MLE 119877 we also employ the 119864119872
algorithm Under1198670 12057911205793= 12057921205794 we have
1205791= (1 minus 120579
119909) (1 minus 120579
119910)
1205792= 120579119909(1 minus 120579
119910)
1205793= 120579119909120579119910
1205794= (1 minus 120579
119909) 120579119910
(23)
In otherwords under1198670we only have two free parameters 120579
119909
and 120579119910 Having obtained the restrictedMLEs 120579
119909119877and 120579119910119877
wecan compute the restricted MLE
119877= (1205791119877
1205794119877
)⊤ from
(23) by
1205791119877
= (1 minus 120579119909119877
) (1 minus 120579119910119877
)
1205792119877
= 120579119909119877
(1 minus 120579119910119877
)
1205793119877
= 120579119909119877
120579119910119877
1205794119877
= (1 minus 120579119909119877
) 120579119910119877
(24)
In what follows we consider the computation of therestricted MLEs 120579
119909119877and 120579
119910119877 Now the complete-data like-
lihood function (5) becomes
119871CQ (120579119909 120579119910| 119884com 1198670)
prop (1 minus 120579119909) (1 minus 120579
119910)1199111
120579119909(1 minus 120579
119910)1199112
times (120579119909120579119910)1199113
(1 minus 120579119909) 1205791199101198994
(1 minus 120579119910)1198980
1205791198981
119910
= 1205791199112+1199113
119909(1 minus 120579
119909)1199111+1198994
1205791199113+1198994+1198981
119910(1 minus 120579
119910)1199111+1199112+1198980
(25)
so that the restricted MLEs of 120579119909and 120579
119910based on the
complete-data are given by
120579119909=1199112+ 1199113
119899 120579
119910=1199113+ 1198994+ 1198981
119873 (26)
respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579
119894 are defined in (23) Finally under119867
0
Λ asymptotically follows chi-squared distribution with onedegree of freedom
6 Journal of Probability and Statistics
4 Bayesian Inferences
To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911
2 1199113 are the same as
those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886
1 119886
4)
is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is
119891 (120579 | 119884obs 119884mis) = GD422
(120579minus4
| a b) (27)
where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898
0 1198981)⊤
and 1199111= 119899 minus (119911
2+ 1199113) minus 1198994 The conditional predictive dis-
tribution is
119891 (119884mis | 119884obs 120579) =3
prod
119894=2
Binomial(119911119894
10038161003816100381610038161003816100381610038161003816
119899119894
120579119894
1199011198941205791+ 120579119894
) (28)
Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=
1198862minus 1 + 119911
2
119886+minus 4 + 119873
(1 +1198980
1198861+ 1198862minus 2 + 119899 minus 119911
3minus 1198994
)
1205793=
1198863minus 1 + 119911
3
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
1205794=
1198864minus 1 + 119899
4
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
(29)
where 119886+= sum4
119894=1119886119894 and the 119864-step is to replace 119911
1198943
119894=2by the
conditional expectations given by (8)In addition based on (27) and (28) the data augmen-
tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix
5 Simulation Studies
In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901
119894=Pr(119882 =
119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model
In the first simulated example let a total of119873 = 119899 + 119898 =
50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579
1198944
119894=1are
listed in the second column of Table 5 We first generate
(1198991 119899
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(30)
and1198980sim Binomial (119898 120579
1+1205792) so that119898
1= 119898minus119898
0The119864119872
algorithm (6) and (8) is used to calculate the MLEs of 1205791198944
119894=1
We repeated this experiment 1000 times The average MLEsof 1205791198944
119894=1are reported in the third column of Table 5 The
corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5
From Table 5 we can see that both the MLEs of 1205791198944
119894=1
in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579
1198944
119894=1in the combination questionnaire model
are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency
In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =
119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6
From Table 6 we can see that the MSEs of 1205791198944
119894=1in the
combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944
119894=1for the proposed combination questionnaire model
in the second simulated example are significantly improvedwhen compared with those in the first simulated example
6 Analyzing Cervical Cancer Data in Atlanta
Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer
For the purpose of illustration we presume that 119883 = 0
is non-sensitive although the number 0ndash3 of sex partners
Journal of Probability and Statistics 7
Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01013 00013 00113 00113 01010 00010 00064 000641205792
050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793
020 02008 00008 00037 00037 02001 00001 00029 000291205794
020 01994 minus00006 00026 00026 02014 00014 00016 000161205791
020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792
030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793
040 04038 00038 00040 00040 04025 00025 00037 000371205794
010 01006 00006 00017 00017 00991 minus00009 00009 000091205791
030 03021 00021 00107 00107 03026 00026 00081 000811205792
025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793
015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794
030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01003 00003 00030 00030 01013 00013 00064 000641205792
050 05015 00015 00029 00029 05019 00019 00041 000411205793
020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794
020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791
020 02007 00007 00041 00041 02049 00049 00057 000571205792
030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793
040 04007 00007 00020 00020 03992 minus00008 00037 000371205794
010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791
030 03032 00032 00054 00054 02965 minus00036 00079 000791205792
025 02488 minus00012 00039 00039 02501 00001 00038 000381205793
015 01481 minus00019 00021 00021 01531 00031 00032 000321205794
030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 7 Cervical cancer data fromWilliamson and Haber [17]
Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)
119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903
4 1205794)
119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903
3 1205793)
Missing 43 (11990312 1205791+ 1205792) 59 (119903
34 1205793+ 1205794)
Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate
is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)
if a respondent was born in JanuaryndashApril (MayndashAugust
SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent
of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899
1= 11990313 = 55 119899
2=
55 + 1199032= 219 119899
3= 55 + 119903
3= 276 119899
4= 1199034= 103 that is
119884obs119872 = 119899 1198991 119899
4 = 653 55 219 276 103 (31)
On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898
0= 11990312
= 43 and1198981= 11990334
= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59
Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878
8 Journal of Probability and Statistics
Table 8 MLEs and 95 confidence intervals of 120579 and 120595
Parameter MLE std 95 asymptotic 95 bootstrapCI CI
1205791
023793 002937 [018036 029551] [018006 029879]1205792
024916 002261 [020484 029348] [020375 029359]1205793
035196 002247 [030790 039601] [030731 039666]1205794
016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]
61 Likelihood-Based Inferences Using 120579(0) = 144 as the
initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs
minus4is
Iminus1obs (minus4)
= (
000086302 minus0000442447 minus0000384452
minus000044245 0000511253 minus0000010026
minus000038445 minus0000010026 0000505241
)
(32)
The estimated standard errors of 120579119894(119894 = 1 2 3) are square
roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579
4= 1minus120579
1minus1205792minus1205793
and = 12057911205793(12057921205794) are given by
se (1205794) = 1⊤
3Iminus1obs(minus4) 13
12
se () = 120572⊤Iminus1obs (minus4)120572
12
(33)
respectively where
120572 = (
1205793(1 minus 120579
2minus 1205793)
12057921205792
4
12057911205793(1205792minus 1205794)
(12057921205794)2
1205791(1 minus 120579
1minus 1205792)
12057921205792
4
)
⊤
(34)
These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8
Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8
To test the null hypothesis 1198670 120595 = 1 against 119867
1 120595 = 1
we need to obtain the restricted MLE 119877 Using
(120579(0)
119909 120579(0)
119910)⊤
= (05 05)⊤ (35)
as the initial values the EM algorithm in (26) converged to
120579119909119877
= 064807 120579119910119877
= 052948 (36)
Table 9 Posterior modes and estimates of parameters for thecervical cancer data
Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval
1205791
023793 024025 002934 [018574 030050]1205792
024916 024789 002251 [020365 029192]1205793
035196 035029 002246 [030621 039432]1205794
016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]
in 19 iterations From (24) the restricted MLEs of 120579 areobtained as
119877= (016559 030493 034314 018634)
⊤ (37)
The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867
0is rejected at the 005 level of significance Thus we
can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1
62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T
4) is adopted as the prior dis-
tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1
44 as the initial values
we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status
The posterior densities of the 1205791198944
119894=1and 120595 estimated by a
kernel density smoother are plotted in Figures 1 and 2
7 Discussion
In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable
We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Journal of Probability and Statistics 5
whereA =
((((
(
3
sum
119894=2
1198991198941199012
119894
(1199011198941205791+ 120579119894)2+120601
11989921199012
(11990121205791+ 1205792)2+120601
11989931199013
(11990131205791+1205793)2
11989921199012
(11990121205791+ 1205792)2+120601
1198992
(11990121205791+ 1205792)2+120601 0
11989931199013
(11990131205791+ 1205793)2 0
1198993
(11990131205791+1205793)2
))))
)
(14)
Let se(120579119894) denote the standard error of 120579
119894for 119894 = 1 2 3
Note that se(120579119894) can be estimated by the square root of the 119894th
diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579
119894) by se(120579
119894)Thus a 95normal-based asymptotic con-
fidence interval for 120579i can be constructed as
[120579119894minus 196 times se (120579
119894) 120579119894+ 196 times se (120579
119894)] 119894 = 1 2 3 (15)
Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of
120579minus4 For example 120579
4= 1 minus sum
3
119894=1120579i and the odds ratio 120595 =
12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used
to approximate the standard error of 120599 = ℎ(minus4) and a 95
normal-based asymptotic confidence interval for 120599 is given by
[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)
where
se (120599) = (120597120599
120597120579minus4
)
⊤
Iminus1obs (120579minus4) (120597120599
120597120579minus4
)
10038161003816100381610038161003816100381610038161003816120579minus4= minus4
12
(17)
33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579
minus4) Based on the obtained MLE we inde-
pendently generate
(119899lowast
1 119899
lowast
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(18)
119898lowast
0sim Binomial (119898 120579
1+ 1205792) (19)
Having obtained 119884lowast
obs119872 = 119899 119899lowast
1 119899
lowast
4 and 119884
lowast
obs119878 =
119898119898lowast
0 119898lowast
1 where 119898
lowast
1= 119898 minus 119898
lowast
0 we can calculate the
bootstrap replication 120599lowast= ℎ(120579
lowast
minus4) based on 119884
lowast
obs = 119884lowast
obs119872
119884lowast
obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast
119892119866
119892=1 Consequently a (1 minus 120572)100 bootstrap
confidence interval for 120599 is given by
[120599119871 120599119880] (20)
where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles
of 120599lowast119892119866
119892=1 respectively
34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]
1198670 120595 = 1 against 119867
1 120595 = 1 (21)
The likelihood ratio statistic is defined by
Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)
where 119877denotes the restrictedMLE of 120579 under119867
0 denotes
the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)
To find the restricted MLE 119877 we also employ the 119864119872
algorithm Under1198670 12057911205793= 12057921205794 we have
1205791= (1 minus 120579
119909) (1 minus 120579
119910)
1205792= 120579119909(1 minus 120579
119910)
1205793= 120579119909120579119910
1205794= (1 minus 120579
119909) 120579119910
(23)
In otherwords under1198670we only have two free parameters 120579
119909
and 120579119910 Having obtained the restrictedMLEs 120579
119909119877and 120579119910119877
wecan compute the restricted MLE
119877= (1205791119877
1205794119877
)⊤ from
(23) by
1205791119877
= (1 minus 120579119909119877
) (1 minus 120579119910119877
)
1205792119877
= 120579119909119877
(1 minus 120579119910119877
)
1205793119877
= 120579119909119877
120579119910119877
1205794119877
= (1 minus 120579119909119877
) 120579119910119877
(24)
In what follows we consider the computation of therestricted MLEs 120579
119909119877and 120579
119910119877 Now the complete-data like-
lihood function (5) becomes
119871CQ (120579119909 120579119910| 119884com 1198670)
prop (1 minus 120579119909) (1 minus 120579
119910)1199111
120579119909(1 minus 120579
119910)1199112
times (120579119909120579119910)1199113
(1 minus 120579119909) 1205791199101198994
(1 minus 120579119910)1198980
1205791198981
119910
= 1205791199112+1199113
119909(1 minus 120579
119909)1199111+1198994
1205791199113+1198994+1198981
119910(1 minus 120579
119910)1199111+1199112+1198980
(25)
so that the restricted MLEs of 120579119909and 120579
119910based on the
complete-data are given by
120579119909=1199112+ 1199113
119899 120579
119910=1199113+ 1198994+ 1198981
119873 (26)
respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579
119894 are defined in (23) Finally under119867
0
Λ asymptotically follows chi-squared distribution with onedegree of freedom
6 Journal of Probability and Statistics
4 Bayesian Inferences
To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911
2 1199113 are the same as
those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886
1 119886
4)
is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is
119891 (120579 | 119884obs 119884mis) = GD422
(120579minus4
| a b) (27)
where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898
0 1198981)⊤
and 1199111= 119899 minus (119911
2+ 1199113) minus 1198994 The conditional predictive dis-
tribution is
119891 (119884mis | 119884obs 120579) =3
prod
119894=2
Binomial(119911119894
10038161003816100381610038161003816100381610038161003816
119899119894
120579119894
1199011198941205791+ 120579119894
) (28)
Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=
1198862minus 1 + 119911
2
119886+minus 4 + 119873
(1 +1198980
1198861+ 1198862minus 2 + 119899 minus 119911
3minus 1198994
)
1205793=
1198863minus 1 + 119911
3
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
1205794=
1198864minus 1 + 119899
4
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
(29)
where 119886+= sum4
119894=1119886119894 and the 119864-step is to replace 119911
1198943
119894=2by the
conditional expectations given by (8)In addition based on (27) and (28) the data augmen-
tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix
5 Simulation Studies
In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901
119894=Pr(119882 =
119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model
In the first simulated example let a total of119873 = 119899 + 119898 =
50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579
1198944
119894=1are
listed in the second column of Table 5 We first generate
(1198991 119899
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(30)
and1198980sim Binomial (119898 120579
1+1205792) so that119898
1= 119898minus119898
0The119864119872
algorithm (6) and (8) is used to calculate the MLEs of 1205791198944
119894=1
We repeated this experiment 1000 times The average MLEsof 1205791198944
119894=1are reported in the third column of Table 5 The
corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5
From Table 5 we can see that both the MLEs of 1205791198944
119894=1
in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579
1198944
119894=1in the combination questionnaire model
are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency
In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =
119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6
From Table 6 we can see that the MSEs of 1205791198944
119894=1in the
combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944
119894=1for the proposed combination questionnaire model
in the second simulated example are significantly improvedwhen compared with those in the first simulated example
6 Analyzing Cervical Cancer Data in Atlanta
Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer
For the purpose of illustration we presume that 119883 = 0
is non-sensitive although the number 0ndash3 of sex partners
Journal of Probability and Statistics 7
Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01013 00013 00113 00113 01010 00010 00064 000641205792
050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793
020 02008 00008 00037 00037 02001 00001 00029 000291205794
020 01994 minus00006 00026 00026 02014 00014 00016 000161205791
020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792
030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793
040 04038 00038 00040 00040 04025 00025 00037 000371205794
010 01006 00006 00017 00017 00991 minus00009 00009 000091205791
030 03021 00021 00107 00107 03026 00026 00081 000811205792
025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793
015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794
030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01003 00003 00030 00030 01013 00013 00064 000641205792
050 05015 00015 00029 00029 05019 00019 00041 000411205793
020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794
020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791
020 02007 00007 00041 00041 02049 00049 00057 000571205792
030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793
040 04007 00007 00020 00020 03992 minus00008 00037 000371205794
010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791
030 03032 00032 00054 00054 02965 minus00036 00079 000791205792
025 02488 minus00012 00039 00039 02501 00001 00038 000381205793
015 01481 minus00019 00021 00021 01531 00031 00032 000321205794
030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 7 Cervical cancer data fromWilliamson and Haber [17]
Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)
119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903
4 1205794)
119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903
3 1205793)
Missing 43 (11990312 1205791+ 1205792) 59 (119903
34 1205793+ 1205794)
Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate
is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)
if a respondent was born in JanuaryndashApril (MayndashAugust
SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent
of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899
1= 11990313 = 55 119899
2=
55 + 1199032= 219 119899
3= 55 + 119903
3= 276 119899
4= 1199034= 103 that is
119884obs119872 = 119899 1198991 119899
4 = 653 55 219 276 103 (31)
On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898
0= 11990312
= 43 and1198981= 11990334
= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59
Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878
8 Journal of Probability and Statistics
Table 8 MLEs and 95 confidence intervals of 120579 and 120595
Parameter MLE std 95 asymptotic 95 bootstrapCI CI
1205791
023793 002937 [018036 029551] [018006 029879]1205792
024916 002261 [020484 029348] [020375 029359]1205793
035196 002247 [030790 039601] [030731 039666]1205794
016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]
61 Likelihood-Based Inferences Using 120579(0) = 144 as the
initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs
minus4is
Iminus1obs (minus4)
= (
000086302 minus0000442447 minus0000384452
minus000044245 0000511253 minus0000010026
minus000038445 minus0000010026 0000505241
)
(32)
The estimated standard errors of 120579119894(119894 = 1 2 3) are square
roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579
4= 1minus120579
1minus1205792minus1205793
and = 12057911205793(12057921205794) are given by
se (1205794) = 1⊤
3Iminus1obs(minus4) 13
12
se () = 120572⊤Iminus1obs (minus4)120572
12
(33)
respectively where
120572 = (
1205793(1 minus 120579
2minus 1205793)
12057921205792
4
12057911205793(1205792minus 1205794)
(12057921205794)2
1205791(1 minus 120579
1minus 1205792)
12057921205792
4
)
⊤
(34)
These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8
Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8
To test the null hypothesis 1198670 120595 = 1 against 119867
1 120595 = 1
we need to obtain the restricted MLE 119877 Using
(120579(0)
119909 120579(0)
119910)⊤
= (05 05)⊤ (35)
as the initial values the EM algorithm in (26) converged to
120579119909119877
= 064807 120579119910119877
= 052948 (36)
Table 9 Posterior modes and estimates of parameters for thecervical cancer data
Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval
1205791
023793 024025 002934 [018574 030050]1205792
024916 024789 002251 [020365 029192]1205793
035196 035029 002246 [030621 039432]1205794
016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]
in 19 iterations From (24) the restricted MLEs of 120579 areobtained as
119877= (016559 030493 034314 018634)
⊤ (37)
The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867
0is rejected at the 005 level of significance Thus we
can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1
62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T
4) is adopted as the prior dis-
tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1
44 as the initial values
we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status
The posterior densities of the 1205791198944
119894=1and 120595 estimated by a
kernel density smoother are plotted in Figures 1 and 2
7 Discussion
In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable
We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
6 Journal of Probability and Statistics
4 Bayesian Inferences
To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911
2 1199113 are the same as
those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886
1 119886
4)
is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is
119891 (120579 | 119884obs 119884mis) = GD422
(120579minus4
| a b) (27)
where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898
0 1198981)⊤
and 1199111= 119899 minus (119911
2+ 1199113) minus 1198994 The conditional predictive dis-
tribution is
119891 (119884mis | 119884obs 120579) =3
prod
119894=2
Binomial(119911119894
10038161003816100381610038161003816100381610038161003816
119899119894
120579119894
1199011198941205791+ 120579119894
) (28)
Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode
1205791= 1 minus 120579
2minus 1205793minus 1205794
1205792=
1198862minus 1 + 119911
2
119886+minus 4 + 119873
(1 +1198980
1198861+ 1198862minus 2 + 119899 minus 119911
3minus 1198994
)
1205793=
1198863minus 1 + 119911
3
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
1205794=
1198864minus 1 + 119899
4
119886+minus 4 + 119873
(1 +1198981
1198863+ 1198864minus 2 + 119911
3+ 1198994
)
(29)
where 119886+= sum4
119894=1119886119894 and the 119864-step is to replace 119911
1198943
119894=2by the
conditional expectations given by (8)In addition based on (27) and (28) the data augmen-
tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix
5 Simulation Studies
In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901
119894=Pr(119882 =
119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model
In the first simulated example let a total of119873 = 119899 + 119898 =
50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579
1198944
119894=1are
listed in the second column of Table 5 We first generate
(1198991 119899
4)⊤
sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)
(30)
and1198980sim Binomial (119898 120579
1+1205792) so that119898
1= 119898minus119898
0The119864119872
algorithm (6) and (8) is used to calculate the MLEs of 1205791198944
119894=1
We repeated this experiment 1000 times The average MLEsof 1205791198944
119894=1are reported in the third column of Table 5 The
corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5
From Table 5 we can see that both the MLEs of 1205791198944
119894=1
in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579
1198944
119894=1in the combination questionnaire model
are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency
In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =
119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6
From Table 6 we can see that the MSEs of 1205791198944
119894=1in the
combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944
119894=1for the proposed combination questionnaire model
in the second simulated example are significantly improvedwhen compared with those in the first simulated example
6 Analyzing Cervical Cancer Data in Atlanta
Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer
For the purpose of illustration we presume that 119883 = 0
is non-sensitive although the number 0ndash3 of sex partners
Journal of Probability and Statistics 7
Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01013 00013 00113 00113 01010 00010 00064 000641205792
050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793
020 02008 00008 00037 00037 02001 00001 00029 000291205794
020 01994 minus00006 00026 00026 02014 00014 00016 000161205791
020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792
030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793
040 04038 00038 00040 00040 04025 00025 00037 000371205794
010 01006 00006 00017 00017 00991 minus00009 00009 000091205791
030 03021 00021 00107 00107 03026 00026 00081 000811205792
025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793
015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794
030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01003 00003 00030 00030 01013 00013 00064 000641205792
050 05015 00015 00029 00029 05019 00019 00041 000411205793
020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794
020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791
020 02007 00007 00041 00041 02049 00049 00057 000571205792
030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793
040 04007 00007 00020 00020 03992 minus00008 00037 000371205794
010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791
030 03032 00032 00054 00054 02965 minus00036 00079 000791205792
025 02488 minus00012 00039 00039 02501 00001 00038 000381205793
015 01481 minus00019 00021 00021 01531 00031 00032 000321205794
030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 7 Cervical cancer data fromWilliamson and Haber [17]
Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)
119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903
4 1205794)
119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903
3 1205793)
Missing 43 (11990312 1205791+ 1205792) 59 (119903
34 1205793+ 1205794)
Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate
is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)
if a respondent was born in JanuaryndashApril (MayndashAugust
SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent
of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899
1= 11990313 = 55 119899
2=
55 + 1199032= 219 119899
3= 55 + 119903
3= 276 119899
4= 1199034= 103 that is
119884obs119872 = 119899 1198991 119899
4 = 653 55 219 276 103 (31)
On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898
0= 11990312
= 43 and1198981= 11990334
= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59
Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878
8 Journal of Probability and Statistics
Table 8 MLEs and 95 confidence intervals of 120579 and 120595
Parameter MLE std 95 asymptotic 95 bootstrapCI CI
1205791
023793 002937 [018036 029551] [018006 029879]1205792
024916 002261 [020484 029348] [020375 029359]1205793
035196 002247 [030790 039601] [030731 039666]1205794
016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]
61 Likelihood-Based Inferences Using 120579(0) = 144 as the
initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs
minus4is
Iminus1obs (minus4)
= (
000086302 minus0000442447 minus0000384452
minus000044245 0000511253 minus0000010026
minus000038445 minus0000010026 0000505241
)
(32)
The estimated standard errors of 120579119894(119894 = 1 2 3) are square
roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579
4= 1minus120579
1minus1205792minus1205793
and = 12057911205793(12057921205794) are given by
se (1205794) = 1⊤
3Iminus1obs(minus4) 13
12
se () = 120572⊤Iminus1obs (minus4)120572
12
(33)
respectively where
120572 = (
1205793(1 minus 120579
2minus 1205793)
12057921205792
4
12057911205793(1205792minus 1205794)
(12057921205794)2
1205791(1 minus 120579
1minus 1205792)
12057921205792
4
)
⊤
(34)
These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8
Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8
To test the null hypothesis 1198670 120595 = 1 against 119867
1 120595 = 1
we need to obtain the restricted MLE 119877 Using
(120579(0)
119909 120579(0)
119910)⊤
= (05 05)⊤ (35)
as the initial values the EM algorithm in (26) converged to
120579119909119877
= 064807 120579119910119877
= 052948 (36)
Table 9 Posterior modes and estimates of parameters for thecervical cancer data
Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval
1205791
023793 024025 002934 [018574 030050]1205792
024916 024789 002251 [020365 029192]1205793
035196 035029 002246 [030621 039432]1205794
016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]
in 19 iterations From (24) the restricted MLEs of 120579 areobtained as
119877= (016559 030493 034314 018634)
⊤ (37)
The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867
0is rejected at the 005 level of significance Thus we
can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1
62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T
4) is adopted as the prior dis-
tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1
44 as the initial values
we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status
The posterior densities of the 1205791198944
119894=1and 120595 estimated by a
kernel density smoother are plotted in Figures 1 and 2
7 Discussion
In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable
We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Journal of Probability and Statistics 7
Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01013 00013 00113 00113 01010 00010 00064 000641205792
050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793
020 02008 00008 00037 00037 02001 00001 00029 000291205794
020 01994 minus00006 00026 00026 02014 00014 00016 000161205791
020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792
030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793
040 04038 00038 00040 00040 04025 00025 00037 000371205794
010 01006 00006 00017 00017 00991 minus00009 00009 000091205791
030 03021 00021 00107 00107 03026 00026 00081 000811205792
025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793
015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794
030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901
1 1199012 1199013) = (13 13 13)
Parameter True valueThe combination questionnaire model The hidden sensitivity model
(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE
1205791
010 01003 00003 00030 00030 01013 00013 00064 000641205792
050 05015 00015 00029 00029 05019 00019 00041 000411205793
020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794
020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791
020 02007 00007 00041 00041 02049 00049 00057 000571205792
030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793
040 04007 00007 00020 00020 03992 minus00008 00037 000371205794
010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791
030 03032 00032 00054 00054 02965 minus00036 00079 000791205792
025 02488 minus00012 00039 00039 02501 00001 00038 000381205793
015 01481 minus00019 00021 00021 01531 00031 00032 000321205794
030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]
2
+ Var(120579119894)
Table 7 Cervical cancer data fromWilliamson and Haber [17]
Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)
119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903
4 1205794)
119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903
3 1205793)
Missing 43 (11990312 1205791+ 1205792) 59 (119903
34 1205793+ 1205794)
Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate
is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)
if a respondent was born in JanuaryndashApril (MayndashAugust
SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent
of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899
1= 11990313 = 55 119899
2=
55 + 1199032= 219 119899
3= 55 + 119903
3= 276 119899
4= 1199034= 103 that is
119884obs119872 = 119899 1198991 119899
4 = 653 55 219 276 103 (31)
On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898
0= 11990312
= 43 and1198981= 11990334
= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59
Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878
8 Journal of Probability and Statistics
Table 8 MLEs and 95 confidence intervals of 120579 and 120595
Parameter MLE std 95 asymptotic 95 bootstrapCI CI
1205791
023793 002937 [018036 029551] [018006 029879]1205792
024916 002261 [020484 029348] [020375 029359]1205793
035196 002247 [030790 039601] [030731 039666]1205794
016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]
61 Likelihood-Based Inferences Using 120579(0) = 144 as the
initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs
minus4is
Iminus1obs (minus4)
= (
000086302 minus0000442447 minus0000384452
minus000044245 0000511253 minus0000010026
minus000038445 minus0000010026 0000505241
)
(32)
The estimated standard errors of 120579119894(119894 = 1 2 3) are square
roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579
4= 1minus120579
1minus1205792minus1205793
and = 12057911205793(12057921205794) are given by
se (1205794) = 1⊤
3Iminus1obs(minus4) 13
12
se () = 120572⊤Iminus1obs (minus4)120572
12
(33)
respectively where
120572 = (
1205793(1 minus 120579
2minus 1205793)
12057921205792
4
12057911205793(1205792minus 1205794)
(12057921205794)2
1205791(1 minus 120579
1minus 1205792)
12057921205792
4
)
⊤
(34)
These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8
Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8
To test the null hypothesis 1198670 120595 = 1 against 119867
1 120595 = 1
we need to obtain the restricted MLE 119877 Using
(120579(0)
119909 120579(0)
119910)⊤
= (05 05)⊤ (35)
as the initial values the EM algorithm in (26) converged to
120579119909119877
= 064807 120579119910119877
= 052948 (36)
Table 9 Posterior modes and estimates of parameters for thecervical cancer data
Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval
1205791
023793 024025 002934 [018574 030050]1205792
024916 024789 002251 [020365 029192]1205793
035196 035029 002246 [030621 039432]1205794
016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]
in 19 iterations From (24) the restricted MLEs of 120579 areobtained as
119877= (016559 030493 034314 018634)
⊤ (37)
The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867
0is rejected at the 005 level of significance Thus we
can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1
62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T
4) is adopted as the prior dis-
tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1
44 as the initial values
we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status
The posterior densities of the 1205791198944
119894=1and 120595 estimated by a
kernel density smoother are plotted in Figures 1 and 2
7 Discussion
In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable
We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
8 Journal of Probability and Statistics
Table 8 MLEs and 95 confidence intervals of 120579 and 120595
Parameter MLE std 95 asymptotic 95 bootstrapCI CI
1205791
023793 002937 [018036 029551] [018006 029879]1205792
024916 002261 [020484 029348] [020375 029359]1205793
035196 002247 [030790 039601] [030731 039666]1205794
016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]
61 Likelihood-Based Inferences Using 120579(0) = 144 as the
initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs
minus4is
Iminus1obs (minus4)
= (
000086302 minus0000442447 minus0000384452
minus000044245 0000511253 minus0000010026
minus000038445 minus0000010026 0000505241
)
(32)
The estimated standard errors of 120579119894(119894 = 1 2 3) are square
roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579
4= 1minus120579
1minus1205792minus1205793
and = 12057911205793(12057921205794) are given by
se (1205794) = 1⊤
3Iminus1obs(minus4) 13
12
se () = 120572⊤Iminus1obs (minus4)120572
12
(33)
respectively where
120572 = (
1205793(1 minus 120579
2minus 1205793)
12057921205792
4
12057911205793(1205792minus 1205794)
(12057921205794)2
1205791(1 minus 120579
1minus 1205792)
12057921205792
4
)
⊤
(34)
These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8
Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8
To test the null hypothesis 1198670 120595 = 1 against 119867
1 120595 = 1
we need to obtain the restricted MLE 119877 Using
(120579(0)
119909 120579(0)
119910)⊤
= (05 05)⊤ (35)
as the initial values the EM algorithm in (26) converged to
120579119909119877
= 064807 120579119910119877
= 052948 (36)
Table 9 Posterior modes and estimates of parameters for thecervical cancer data
Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval
1205791
023793 024025 002934 [018574 030050]1205792
024916 024789 002251 [020365 029192]1205793
035196 035029 002246 [030621 039432]1205794
016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]
in 19 iterations From (24) the restricted MLEs of 120579 areobtained as
119877= (016559 030493 034314 018634)
⊤ (37)
The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867
0is rejected at the 005 level of significance Thus we
can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1
62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T
4) is adopted as the prior dis-
tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1
44 as the initial values
we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status
The posterior densities of the 1205791198944
119894=1and 120595 estimated by a
kernel density smoother are plotted in Figures 1 and 2
7 Discussion
In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable
We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Journal of Probability and Statistics 9
0 02 04 06 08 1
0
2
4
6
8
10
14
12
The p
oste
rior d
ensit
y
1205791
(a)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205792
(b)
0 02 04 06 08 1
0
5
10
15
The p
oste
rior d
ensit
y
1205793
(c)
0 02 04 06 08 1
0
5
10
15
20
25
The p
oste
rior d
ensit
y
1205794
(d)
Figure 1 The posterior densities of the 1205791198944
119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of
120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The
posterior density of 1205792 (c) The posterior density of 120579
3 (d) The posterior density of 120579
4
1 2 3 4 5
0
02
04
06
08
The p
oste
rior d
ensit
y
120595
Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
10 Journal of Probability and Statistics
applied to situationwhere two categories 119883 = 0 and 119883 = 1
are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches
Appendix
The Mode of a Group Dirichlet Density anda Sampling Method from a GDD
Let
V119899minus1
= (1199091 119909
119899minus1)⊤
119909119894⩾ 0
119894 = 1 119899 minus 1
119899minus1
sum
119894=1
119909119894⩽ 1
(A1)
denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X
1 X
119899)⊤
isin T119899is said to follow a
group Dirichlet distribution (GDD) with two partitions if thedensity of x
minus119899= (1198831 119883
119899minus1)⊤isin V119899minus1
is
GD1198992119904
(119909minus119899
| a b)
= 119888minus1
GD (
119899
prod
119894=1
119909119886119894minus1
119894)(
119904
sum
119894=1
119909i)
1198871
(
119899
sum
119894=119904+1
119909i)
1198872
(A2)
where 119909minus119899
= (1199091 119909
119899minus1)⊤ a = (119886
1 119886
119899)⊤ is a positive
parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter
vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by
119888GD = 119861 (1198861 119886
119904) 119861 (119886119904+1
119886119899)
times 119861(
119904
sum
119894=1
119886119894+ 1198871
119899
sum
119894=119904+1
119886119894+ 1198872)
(A3)
We write x sim GD1198992119904
(a b) on T119899or xminus119899
sim GD1198992119904
(a b) onV119899minus1
to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density
(A2) is given by [11 22]
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198871
sum119904
119895=1(119886119895minus 1)
119894 = 1 119904
(A4)
119909119894=
119886119894minus 1
sum119899
119895=1(119886119895minus 1) + 119887
1+ 1198872
1 +1198872
sum119899
119895=119904+1(119886119895minus 1)
119894 = 119904 + 1 119899
(A5)
The following procedure can be used to generate iidsamples from a GDD [11] Let
(1) y(1) sim Dirichlet (1198861 119886
119904) on T
119904
(2) y(2) sim Dirichlet (119886119904+1
119886119899) on T
119899minus119904
(3) 119877 sim Beta(sum119904119894=1
119886119894+ 1198871 sum119899
119894=119904+1119886119894+ 1198872) and
(4) y(1) y(2) and 119877 are mutually independent
Define
x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)
Then
x = (x(1)x(2)) sim GD
1198992119904(a b) (A7)
on T119899 where a = (119886
1 119886
119899)⊤ and b = (119887
1 1198872)⊤
Acknowledgments
The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region
References
[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965
[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980
[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984
[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986
[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005
[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005
[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Journal of Probability and Statistics 11
[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007
[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977
[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003
[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008
[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984
[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987
[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996
[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993
[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002
[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994
[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008
[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009
[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013
[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press
[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of