
Nonparametric estimation of the number of zeros in truncated count distributions

Célestin C. KOKONENDJI

University of Franche-Comté, France
Laboratoire de Mathématiques de Besançon - UMR 6623 CNRS-UFC

Email : [email protected]

Seminar of “IRP on Statistical Advances for Complex Data”

CRM, Bellaterra : 2015.11.12

Joint work with Pere Puig, UAB

1 Célestin C. KOKONENDJI & Pere PUIG Nonparametric estimation of the number of zeros in truncated count dist.


Acknowledgements :

• Centre de Recerca Matemàtica (CRM) : Intensive Research Program (IRP) on Statistical Advances for Complex Data

• Universitat Autònoma de Barcelona (UAB) : Departament de Matemàtiques & Servei d'Estadística Aplicada

• Pere PUIG :
  - Invitation to the IRP on Statistical Advances for Complex Data ("Multivariate over-, equi- and underdispersion", in progress)
  - DoReMi Workshop & Seminari del DEIO (UPC) with Marta Pérez-Casany (also for the Barcelona and Sitges visits)
  - Many excursions (e.g. Costa Brava), "Castanyada", etc.

→ Moltes Gràcies (Many thanks)


Outline :

Title : Nonparametric estimation of the number of zeros in truncated count distributions

1 Introduction

2 Count distributions with log-convex pgf

3 Fascination with lower bounds of p0

4 Estimating the non-observed number of zeros

5 Applications


Introduction :

• In many practical situations the researcher is not able to observe the entire distribution of counts in an experiment.

• In particular, the zeros are often not observed, leading to so-called (zero-)truncated count data.

• For instance : capture-recapture models, used in Biology and Ecology. This methodology is commonly used to estimate the size of an animal population.

• In many cases, estimating the unobserved number of zeros is an important issue, as in the following examples :


Cholera data set of McKendrick [i]

Probably the oldest example of estimating the number of zeros is that of McKendrick (1926), who analyzed the number of individuals with cholera in 223 households in a village in India :

No. of infections              0      1   2   3  4
No. of households (frequency)  (168)  32  16  6  1

• McKendrick argued that a household could have no cases of cholera either because its members had not been exposed, or because they had been exposed but not infected.

• McKendrick wanted to estimate the number of individuals who were exposed but did not develop the symptoms.

To do this, he ignored the 168 households with zero cases and developed an estimator of the number of zeros from the other observations, based on the zero-truncated Poisson distribution.

[i] McKendrick, A. (1926). Applications of mathematics to medical problems. Proc. Edinb. Math. Soc. 44, 98-130.
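The zero-truncated Poisson idea can be sketched in a few lines. The version below fits the Poisson parameter by maximum likelihood to the positive counts and then extrapolates the expected number of zeros; this is an illustrative assumption and not necessarily the computation McKendrick carried out. The function name `ztp_zero_estimate` is hypothetical.

```python
import math

def ztp_zero_estimate(freqs):
    """Estimate the number of unobserved zeros under a zero-truncated Poisson.

    freqs maps each positive count k to its observed frequency f_k.
    """
    n = sum(freqs.values())                     # number of observed units
    s = sum(k * f for k, f in freqs.items())    # total count
    xbar = s / n                                # mean of the positive counts
    if xbar <= 1.0:
        raise ValueError("need at least one count above 1")

    # The ML estimate solves lam / (1 - exp(-lam)) = xbar. The left-hand
    # side increases from 1, and the root lies in (0, xbar): bisection.
    lo, hi = 0.0, xbar
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        if lam / -math.expm1(-lam) < xbar:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    # Expected zeros per observed unit: P(X = 0) / P(X > 0) = 1 / (exp(lam) - 1).
    return lam, n / math.expm1(lam)

cholera = {1: 32, 2: 16, 3: 6, 4: 1}
lam_hat, f0_hat = ztp_zero_estimate(cholera)
print(f"lambda = {lam_hat:.3f}, estimated zeros = {f0_hat:.2f}")
```

Only the 55 households with at least one case enter the fit; the 168 observed zero households are deliberately ignored, as in the text.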


Number of words known but unused by Shakespeare [ii]

• Another interesting example arises from the following question :

How many words did Shakespeare know ?

The information to be taken into account is that Shakespeare wrote 31534 different words, of which 14376 were used exactly once, 4343 exactly twice, 2292 exactly three times, and so forth.

Here is a reduced version of the full table reported in Efron and Thisted (1976) :

Occurrences               0  1      2     3     4     5     ...
No. of words (frequency)  ?  14376  4343  2292  1463  1043  ...

In this problem, the frequency of zeros to be estimated represents the number of words that Shakespeare knew but did not use in any of his known works.

[ii] Efron, B., Thisted, R. (1976). Estimating the number of unseen species : How many words did Shakespeare know ? Biometrika 63, 435-447.


Number of grizzly bear females in Yellowstone [iii]

Most practical examples of estimating the number of zeros arise from the capture-recapture sampling scheme.

Keating et al. (2002) studied the annual numbers of females with cubs-of-the-year in the Yellowstone grizzly bear population, from 1986 to 2001. The table below shows the number of unique females with cubs-of-the-year that were seen exactly j times during 1998 :

Sightings                 0  1   2   3  4  5  6  7
No. of bears (frequency)  ?  11  13  5  1  1  0  2

Each sighting is considered a "capture", so that 11 females were captured exactly once, 13 were captured twice, and so forth. In this case, the number of bears observed is 33.

The frequency of zeros $f_0$ represents the number of bears not observed, so the total number of grizzly bear females would be $33 + f_0$.

[iii] Keating, K., Schwartz, C., Haroldson, M., Moody, D. (2002). Estimating numbers of females with cubs-of-the-year in the Yellowstone grizzly bear population. Ursus 13, 161-174.


Count distributions with log-convex pgf

• Very wide class ⊃ count Compound and Mixed Poisson

• Examples with differences ("Desigual")

• Overdispersion (relative to Poisson)

• Zero-inflation (relative to Poisson)

Siméon Denis Poisson (1781-1840)


Discrete Compound Poisson distributions

A r.v. $X$ follows a discrete Compound-Poisson (dCP) distribution if

$$X = \sum_{i=1}^{N} Y_i, \quad \text{with pgf} \quad \Phi_X(t) := E\,t^X = \sum_{k=0}^{\infty} t^k P(X = k) = \exp\{-\lambda[1 - \Psi(t)]\},$$

where $N \sim \text{Poisson}(\lambda)$ and $Y_1, Y_2, \ldots$ are iid count r.v.'s, independent of $N$, with pgf $\Psi(\cdot)$. The dCP distributions constitute a huge family of count distributions, according to :

Feller's characterization :
The dCP distributions are the only discrete distributions that are infinitely divisible.

See, e.g., Johnson et al. (2005) and Steutel and van Harn (2004) for properties, formulae and algorithms to calculate the probabilities.

Examples of dCP distributions :
Hermite, negative binomial, strict arcsine, Poisson-Tweedie, Hinde-Demétrio [a]

[a] Kokonendji, C.C., Dossou-Gbété, S., Demétrio, C.G.B. (2004). Some discrete exponential dispersion models : Poisson-Tweedie and Hinde-Demétrio classes. SORT 28, 201-214.
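The closed form of the dCP pgf follows by conditioning on $N$. The short derivation below is the standard argument, written out here for completeness rather than taken from the slides :

```latex
\begin{align*}
\Phi_X(t) &= E\big[E\big(t^{Y_1+\cdots+Y_N}\mid N\big)\big]
           = \sum_{n=0}^{\infty} e^{-\lambda}\frac{\lambda^n}{n!}\,\Psi(t)^n \\
          &= e^{-\lambda}\,e^{\lambda\Psi(t)}
           = \exp\{-\lambda[1-\Psi(t)]\}.
\end{align*}
```

In particular $\log\Phi_X(t) = -\lambda[1-\Psi(t)]$, and $\Psi$ is convex on $[0,1]$ as a power series with non-negative coefficients, so $\log\Phi_X$ is convex there.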


Mixed Poisson distributions [iv]

A r.v. $X$ follows a Mixed-Poisson (MP) distribution on $\mathbb{N} := \{0, 1, \ldots\}$ if

$$p_k := P(X = k) = \int_0^\infty e^{-\lambda}\frac{\lambda^k}{k!}\,dF(\lambda), \quad \text{with} \quad \Phi_X(t) = \int_0^\infty e^{-\lambda(1-t)}\,dF(\lambda),$$

where $F$ is a distribution function on $[0,\infty)$.

Examples of mixing distributions F (with the resulting MP distribution in parentheses) :
Poisson (Neyman A), gamma (negative binomial), inverse-Gaussian (Sichel or PIG), Tweedie ⊃ positive stable (Poisson-Tweedie), F with finite support.

Remark : all Poisson-Tweedie (PTw) ⊆ (MP ∩ dCP) ; PTw ∩ HD = {NB}.

• MP with F of finite support ⊄ dCP.

• dCP ⊃ (Hermite ∪ strict arcsine ∪ HD\NB) ⊄ MP.

[iv] Grandell, J. (1997). Mixed Poisson Processes. Chapman & Hall, London.
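As a worked instance of the mixture formula, take $F$ to be a gamma distribution with shape $\phi$ and mean $\mu$ (a parametrization chosen here for illustration). A standard calculation then gives the negative binomial probabilities :

```latex
\begin{align*}
p_k &= \int_0^\infty e^{-\lambda}\frac{\lambda^k}{k!}\,
       \frac{(\phi/\mu)^{\phi}}{\Gamma(\phi)}\,\lambda^{\phi-1}e^{-\phi\lambda/\mu}\,d\lambda
     = \frac{(\phi/\mu)^{\phi}}{k!\,\Gamma(\phi)}
       \int_0^\infty \lambda^{k+\phi-1}e^{-\lambda(\phi+\mu)/\mu}\,d\lambda \\
    &= \frac{(\phi/\mu)^{\phi}}{k!\,\Gamma(\phi)}\,
       \Gamma(\phi+k)\Big(\frac{\mu}{\phi+\mu}\Big)^{k+\phi}
     = \Big(\frac{\phi}{\phi+\mu}\Big)^{\phi}\Big(\frac{\mu}{\phi+\mu}\Big)^{k}
       \frac{\Gamma(\phi+k)}{k!\,\Gamma(\phi)} .
\end{align*}
```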


Class of log-convex pgf :

Proposition (0)
Let $X$ be a discrete r.v., Compound- or Mixed-Poisson distributed, with pgf $\Phi_X(\cdot)$. Then $\log \Phi_X(\cdot)$ is a convex function on $[0,1]$.

Proof : Easy for dCP. As for MP, we need $\Phi''\Phi \ge (\Phi')^2$. Let $dG_t(\lambda) = e^{-\lambda(1-t)}dF(\lambda)$ ; then

$$\int_0^\infty \lambda^2\,dG_t(\lambda)\int_0^\infty dG_t(\lambda) \ge \left(\int_0^\infty \lambda\,dG_t(\lambda)\right)^2 \quad \text{(Cauchy-Schwarz)}. \qquad \square$$

Properties
Log-convexity ⇒ Overdispersion ($\mathrm{Var}X \ge EX$) and Zero-inflation ($p_0 \ge e^{-EX}$).

The class of count distributions with log-convex pgf is wider than (dCP ∪ MP).

Example & "Desigual"
$\Phi_X(t) = 1/5 + t/5 + t^2/5 + t^3/20 + 7t^4/20$ is a log-convex function on $[0,1]$,

but $X$ is not in (dCP ∪ MP), by direct calculations.
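The example can be checked numerically. The sketch below evaluates $\Phi''\Phi - (\Phi')^2$ on a grid of $[0,1]$ and compares the mean, variance and $p_0$; it is a grid check under illustrative choices (grid size, helper names), not a proof of log-convexity.

```python
import math

# Probabilities p_0..p_4 read off the pgf in the example.
p = [1/5, 1/5, 1/5, 1/20, 7/20]

def pgf(t, order=0):
    """Value at t of the order-th derivative of the pgf (order = 0, 1 or 2)."""
    total = 0.0
    for k, pk in enumerate(p):
        if k >= order:
            # math.perm(k, order) = k (k-1) ... (k-order+1)
            total += pk * math.perm(k, order) * t ** (k - order)
    return total

def log_convexity_gap(t):
    """Phi''(t) Phi(t) - Phi'(t)^2, which has the sign of (log Phi)''(t)."""
    return pgf(t, 2) * pgf(t) - pgf(t, 1) ** 2

grid = [i / 1000 for i in range(1001)]
min_gap = min(log_convexity_gap(t) for t in grid)

mean = pgf(1.0, 1)                          # E X = Phi'(1)
variance = pgf(1.0, 2) + mean - mean ** 2   # Var X = Phi''(1) + Phi'(1) - Phi'(1)^2

print(f"smallest gap on the grid : {min_gap:.4f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
print(f"p0 = {p[0]:.4f}, exp(-mean) = {math.exp(-mean):.4f}")
```

A strictly positive gap on the whole grid is consistent with log-convexity, and the last two lines illustrate overdispersion and zero-inflation for this distribution.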


Fascination with lower bounds of p0

from "Desigual" to $0 = 1 + e^{i\pi}$


Some lower bounds of p0 : Part I (dCP ∪ MP)

Proposition (I)
Let $X$ be a discrete r.v., Compound- or Mixed-Poisson distributed. Then

$$\binom{k+r}{r}\,p_{k+r}\,p_0 \ge p_k\,p_r, \quad \forall k, r \ge 1, \qquad (1)$$

where $p_k = P(X = k)$, $k \in \{0, 1, 2, \ldots\}$.

Set of lower bounds of $p_0$ : (1) implies

$$p_0 \ge \frac{p_k\,p_r}{\binom{k+r}{r}\,p_{k+r}}, \quad \forall k, r \ge 1. \qquad (2)$$

Remark :
(i) Equality holds in (1) or (2) iff $X$ is Poisson distributed.
(ii) Taking $k = r = 1$ gives the well-known Chao's (1987) lower bound (Böhning, 2010)

$$p_0 \ge \frac{p_1^2}{2p_2}. \qquad (3)$$
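The "if" part of Remark (i) is a one-line check. With $p_k = e^{-\lambda}\lambda^k/k!$ for a Poisson($\lambda$) variable, both sides of (1) coincide :

```latex
\begin{align*}
\binom{k+r}{r}\,p_{k+r}\,p_0
 &= \frac{(k+r)!}{k!\,r!}\cdot e^{-\lambda}\frac{\lambda^{k+r}}{(k+r)!}\cdot e^{-\lambda}
  = e^{-\lambda}\frac{\lambda^{k}}{k!}\cdot e^{-\lambda}\frac{\lambda^{r}}{r!}
  = p_k\,p_r .
\end{align*}
```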


Some lower bounds of p0 : Part II (Log-convexity)

In general, a distribution with log-convex pgf need not satisfy the inequalities (1) or (2) ; cf. the preceding Example & "Desigual", for which $3p_3p_0 < p_1p_2$.

However, log-convexity yields other $p_0$-inequalities, involving the population mean and, again, Chao's lower bound :

Proposition (II)
Let $X$ be a discrete r.v. with a log-convex pgf $\Phi_X(\cdot)$ on $[0,1]$, such that $E(X) = \mu$. Then,

i. $p_0 \ge \exp(-\mu)$ : (Poisson) zero-inflation
ii. $p_0 \ge p_1/\mu \iff \mu \ge p_1/p_0$ : → Turing's estimator (Good, 1953)
iii. $p_0 \ge p_1^2/(2p_2)$ : Chao's lower bound.

Note :
- Equalities in (i)-(iii) hold for the Poisson distribution.
- The inequalities (i)-(iii) are well known for one or both of the Compound- and Mixed-Poisson families.
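One way to see where (i)-(iii) come from is sketched below, assuming $p_0 > 0$, $p_2 > 0$ and a finite mean ; this is not necessarily the authors' proof. Write $g(t) = \log\Phi_X(t)$, so that $g(0) = \log p_0$, $g(1) = 0$, $g'(0) = p_1/p_0$ and $g'(1) = \mu$. Convexity of $g$ on $[0,1]$ means that $g$ lies above its tangent at $t = 1$, that $g'$ is non-decreasing, and that $g'' \ge 0$ :

```latex
\begin{align*}
\text{(i)}\quad & g(0) \ge g(1) + g'(1)\,(0-1) = -\mu
   && \Longrightarrow\ p_0 \ge e^{-\mu}, \\
\text{(ii)}\quad & g'(0) \le g'(1)
   && \Longrightarrow\ p_1/p_0 \le \mu, \\
\text{(iii)}\quad & g''(0) = \frac{\Phi_X''(0)\,\Phi_X(0) - \Phi_X'(0)^2}{\Phi_X(0)^2}
     = \frac{2p_2p_0 - p_1^2}{p_0^2} \ge 0
   && \Longrightarrow\ p_0 \ge \frac{p_1^2}{2p_2}.
\end{align*}
```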


An improved inequality : Part III

Let $X$ be a r.v., Compound- or Mixed-Poisson distributed, with $E(X) = \mu$. Because all the inequalities in (2) and in Prop. (II) are satisfied, a sharper lower bound of $p_0$ is obtained by taking the maximum of all of them. Concretely,

$$p_M := \max_{r,k}\frac{p_k\,p_r}{\binom{k+r}{r}\,p_{k+r}} \;\Longrightarrow\; p_0 \ge \max\{p_M,\ \exp(-\mu),\ p_1/\mu\}. \qquad (4)$$

Lemma (Lanumteang & Böhning (2011), in the proof of their Th. 1)
Let $X$ be a discrete r.v., Mixed-Poisson distributed. Then

$$\frac{p_1}{p_0} \le \frac{2p_2}{p_1} \le \frac{3p_3}{p_2} \le \ldots \le \frac{k\,p_k}{p_{k-1}} \le \ldots$$

Proposition (III)
Under Mixed-Poisson :

$$p_M := \max_{r,k}\frac{p_k\,p_r}{\binom{k+r}{r}\,p_{k+r}} = \frac{p_1^2}{2p_2}.$$


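Example 1 : Negative binomial = (HD ∩ PTw) ⊂ MP

$$p_k = \left(\frac{\phi}{\phi+\mu}\right)^{\phi}\left(\frac{\mu}{\phi+\mu}\right)^{k}\frac{\Gamma(\phi+k)}{k!\,\Gamma(\phi)}, \quad k = 0, 1, 2, \ldots$$

with mean $\mu$ and shape parameter $\phi > 0$. Direct calculations show that

$$\frac{p_k\,p_r}{\binom{k+r}{r}\,p_{k+r}} = \left(\frac{\phi}{\phi+\mu}\right)^{\phi}\frac{\Gamma(\phi+k)\,\Gamma(\phi+r)}{\Gamma(\phi+k+r)\,\Gamma(\phi)}, \quad k, r = 1, 2, \ldots$$

and its maximum is attained for $k = r = 1$, i.e., at Chao's lower bound (3). This agrees with Prop. (III) because NB is a Mixed Poisson. Consequently,

$$p_M = \left(\frac{\phi}{\phi+\mu}\right)^{\phi}\frac{\phi}{\phi+1}.$$

Because $p_M \ge \left(\frac{\phi}{\phi+\mu}\right)^{\phi+1} = p_1/\mu$ for $\mu \ge 1$, direct calculations show that the inequality (4) becomes

$$p_0 \ge \max\left\{\left(\frac{\phi}{\phi+\mu}\right)^{\phi}\frac{\phi}{\phi+1},\ \exp(-\mu)\right\}. \qquad (5)$$

The maximum on the right-hand side of (5) is attained at $\exp(-\mu)$ for $0 < \mu \le \mu^*$, and at $p_M$ for $\mu \ge \mu^*$, where $\mu^*$ is the solution of the equation $\exp(-\mu) = p_M$. ♣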
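The claims of this example can be checked numerically. The sketch below uses illustrative parameter values ($\mu = 2$, $\phi = 1.5$) and searches only pairs with $k, r \le 10$; the helper names `nb_pmf` and `bound` are hypothetical.

```python
import math

def nb_pmf(k, mu, phi):
    """Negative binomial probability p_k with mean mu and shape phi."""
    log_p = (phi * math.log(phi / (phi + mu))
             + k * math.log(mu / (phi + mu))
             + math.lgamma(phi + k) - math.lgamma(k + 1) - math.lgamma(phi))
    return math.exp(log_p)

def bound(k, r, mu, phi):
    """Lower bound of p_0 given by the pair (k, r) in inequality (2)."""
    return (nb_pmf(k, mu, phi) * nb_pmf(r, mu, phi)
            / (math.comb(k + r, r) * nb_pmf(k + r, mu, phi)))

mu, phi = 2.0, 1.5   # illustrative values, not from the slides
pairs = [(k, r) for k in range(1, 11) for r in range(k, 11)]
best = max(pairs, key=lambda kr: bound(kr[0], kr[1], mu, phi))

p_m = bound(1, 1, mu, phi)                                   # Chao's lower bound
closed_form = (phi / (phi + mu)) ** phi * phi / (phi + 1)    # formula for p_M
p0 = nb_pmf(0, mu, phi)

print("maximizing pair :", best)
print(f"p_M = {p_m:.6f}, closed form = {closed_form:.6f}")
print(f"p0 = {p0:.6f} >= max(p_M, exp(-mu)) = {max(p_m, math.exp(-mu)):.6f}")
```

Working on the log scale with `math.lgamma` avoids overflow of the gamma function for larger $k$.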


Example 2 : Hermite of 3rd order [v] ⊂ (dCP \ MP)

Consider a count r.v. $X$, Compound-Poisson distributed, where the compounding distribution takes a finite range of values, 0, 1, 2 and 3. This leads to a third-order Hermite distribution, which can be represented as :

$$X = X_1 + 2X_2 + 3X_3, \quad \text{with independent } X_i \sim \text{Poisson}(\lambda_i).$$

Its probabilities, $p_k = P(X = k)$, can be calculated using the recursive relation

$$p_k = (p_{k-1}\lambda_1 + 2p_{k-2}\lambda_2 + 3p_{k-3}\lambda_3)/k,$$

where $p_0 = \exp(-\lambda_1 - \lambda_2 - \lambda_3)$ and $p_{-1} = p_{-2} = 0$. This is a dCP \ MP distribution, and consequently the value of $p_M$ in (4) is not always Chao's lower bound.

Indeed, taking $\lambda_2 = 0.5$ and $\lambda_3 = 1$, numerical calculations show that :
- for $\lambda_1 = 1.5$ the maximum is at Chao's lower bound, i.e. $p_M = p_1^2/(2p_2)$,
- for $\lambda_1 = 2$ the maximum is $p_M = p_1p_2/(3p_3)$, and
- for $\lambda_1 = 3$ the maximum is $p_M = p_2^2/(6p_4)$. ♣

[v] Puig, P., Barquinero, J.F. (2011). An application of compound Poisson modelling to biological dosimetry. Proc. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 467 (2127), 897-910.
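The recursive relation translates directly into code. The sketch below computes the probabilities for one parameter setting and checks that they sum to one and reproduce the mean $\lambda_1 + 2\lambda_2 + 3\lambda_3$; the function name `hermite3_pmf` and the truncation point `kmax = 60` are illustrative choices.

```python
import math

def hermite3_pmf(lam1, lam2, lam3, kmax):
    """Probabilities p_0..p_kmax of X = X1 + 2*X2 + 3*X3 via the recursion."""
    lam = (lam1, lam2, lam3)
    p = [math.exp(-sum(lam))]                    # p_0
    for k in range(1, kmax + 1):
        total = 0.0
        for j in (1, 2, 3):
            if k - j >= 0:                       # p_{-1} = p_{-2} = 0
                total += j * lam[j - 1] * p[k - j]
        p.append(total / k)
    return p

probs = hermite3_pmf(1.5, 0.5, 1.0, kmax=60)
total_mass = sum(probs)
mean = sum(k * pk for k, pk in enumerate(probs))

print(f"total mass = {total_mass:.10f}")
print(f"mean = {mean:.6f}")
```

The resulting list can be plugged into the right-hand side of (2) to explore which pair $(k, r)$ gives the largest bound for given parameters.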


Estimating the non-observed number of zeros

from Chao to $0 = 1 + e^{i\pi}$

"The Imitation Game"

Alan M. Turing (1912-1954) and ...

How can these inequalities be applied to the estimation of the number of zeros ?


Improved Chao estimate

Consider a count r.v. $X$, with probabilities $p_k$, $k = 0, 1, 2, \ldots$, where only the zero-truncated r.v. $X \mid X > 0$ (the positive values) is observed.

Let $x = (x_1, x_2, \ldots, x_n)$ be a sample of size $n$ of $X \mid X > 0$, and let $f_k$ denote the number (frequency) of $x_i$ equal to $k$, $k = 1, 2, \ldots, m$ ($m$ is the largest count observed in the sample). Clearly,

$$f_1 + f_2 + \cdots + f_m = n.$$

Let $f_0$ denote the number of non-observed zeros, to be estimated. The size of the complete sample (counting the zeros) would be

$$N = f_0 + n$$

(which represents the total number of individuals in the capture-recapture experiment). Taking into account that $p_i \approx f_i/N$, the inequalities (2) lead to the following lower bound estimates of $f_0$,

$$\hat f_{0\,r,k} = \frac{f_k\,f_r}{\binom{k+r}{r}\,f_{k+r}}, \quad 1 \le k, r, \ k + r \le m. \qquad (6)$$

The well-known Chao's (1984, 1987) estimator of $f_0$ is obtained for $r = k = 1$.
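The family of estimates (6) is straightforward to compute from a frequency table. The sketch below does so for McKendrick's cholera frequencies; the function name is hypothetical, and skipping pairs with $f_{k+r} = 0$ is an assumption, since the slides do not say how such pairs are treated.

```python
from math import comb

def improved_chao_estimates(freqs):
    """Return {(k, r): estimate (6)} for all 1 <= k <= r with k + r <= m.

    freqs maps each positive count k to its frequency f_k.
    Pairs whose denominator f_{k+r} is zero are skipped (assumption).
    """
    m = max(freqs)                                  # largest observed count
    out = {}
    for k in range(1, m):
        for r in range(k, m - k + 1):
            f_sum = freqs.get(k + r, 0)
            if f_sum > 0:
                out[(k, r)] = (freqs.get(k, 0) * freqs.get(r, 0)
                               / (comb(k + r, r) * f_sum))
    return out

cholera = {1: 32, 2: 16, 3: 6, 4: 1}
estimates = improved_chao_estimates(cholera)
best_pair = max(estimates, key=estimates.get)

print("Chao's estimator (k = r = 1) :", estimates[(1, 1)])
print("largest estimate :", best_pair, estimates[best_pair])
```

Because (6) is symmetric in $k$ and $r$, only pairs with $k \le r$ need to be enumerated.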


Turing estimate

The inequalities (i) and (ii) in Proposition (II) also yield lower bound estimates of $f_0$. The population mean $\mu$ in (i)-(ii) can be replaced by

$$\hat\mu := \frac{s}{n + f_0}, \quad \text{where } s = 0 + \sum_{i=1}^{n} x_i.$$

Then, inequality (ii) in Proposition (II) leads to

$$\frac{f_0}{n + f_0} \ge \frac{f_1/(n + f_0)}{s/(n + f_0)},$$

and isolating $f_0$ we obtain Turing's estimator of $f_0$,

$$\hat f_{0T} = \frac{n f_1}{s - f_1}. \qquad (7)$$

Note : The so-called Good-Turing estimator [vi] of the population size,

$$\hat N_T = \hat f_{0T} + n = n/(1 - f_1/s),$$

underestimates it for the (very wide) family of distributions with log-convex pgf, by Prop. (II).

[vi] See Good (1953), Chao & Lin (2012), Chiu et al. (2014), for capture-recapture problems.
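The step of isolating $f_0$ can be written out as follows, assuming $s > f_1$ (at least one observed count is larger than 1) :

```latex
\begin{align*}
\frac{f_0}{n+f_0} \ge \frac{f_1}{s}
 &\iff s\,f_0 \ge f_1\,(n + f_0)
  \iff (s - f_1)\,f_0 \ge n\,f_1
  \iff f_0 \ge \frac{n f_1}{s - f_1}, \\
\hat N_T &= \frac{n f_1}{s - f_1} + n = \frac{n\,s}{s - f_1} = \frac{n}{1 - f_1/s}.
\end{align*}
```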


ZI-estimate

Replacing again $\mu$ by $\hat\mu := s/(n + f_0)$ in inequality (i) of Prop. (II), we obtain

$$\frac{f_0}{n + f_0} \ge \exp\left(-\frac{s}{n + f_0}\right) \iff \frac{x}{1 + x} \ge \exp\left(-\frac{\bar x^*}{1 + x}\right),$$

where $x = f_0/n$ and $\bar x^* = s/n$. From here, we define the zi-estimator of $f_0$,

$$\hat f_{0Z} = n\hat x, \qquad (8)$$

where $\hat x$ is the unique solution of the equation

$$-\log\left(\frac{x}{1 + x}\right)(1 + x) = \bar x^*. \qquad (9)$$

Note : This estimator is well defined because the left-hand side of (9) is a decreasing function of $x$, tending to infinity as $x \to 0$ and to 1 as $x$ grows, and $\bar x^* > 1$.

Set of (under)estimators of $f_0$ :

$\hat f_{0\,r,k}$, $\hat f_{0T}$, $\hat f_{0Z}$.
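Equation (9) has no closed-form solution, but the monotonicity noted above makes bisection a natural choice. The sketch below is one possible implementation; the function name, the bracketing strategy and the iteration count are illustrative assumptions.

```python
import math

def zi_estimate(freqs):
    """zi-estimator (8): n times the root of equation (9), found by bisection.

    freqs maps each positive count k to its frequency f_k.
    """
    n = sum(freqs.values())
    s = sum(k * f for k, f in freqs.items())
    xbar = s / n                       # mean of the positive counts
    if xbar <= 1.0:
        raise ValueError("equation (9) needs a sample mean above 1")

    def lhs(x):
        # Left-hand side of (9): -log(x / (1 + x)) * (1 + x), decreasing in x.
        return (1.0 + x) * math.log((1.0 + x) / x)

    lo, hi = 1e-300, 1.0
    while lhs(hi) > xbar:              # enlarge the bracket until lhs(hi) <= xbar
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if lhs(mid) > xbar:
            lo = mid                   # root lies to the right of mid
        else:
            hi = mid
    return n * 0.5 * (lo + hi)

cholera = {1: 32, 2: 16, 3: 6, 4: 1}
f0_z = zi_estimate(cholera)
print(f"zi-estimate of f0 : {f0_z:.2f}")
```

Any bracketing root finder would do equally well here; bisection is used only because it needs nothing beyond the monotonicity stated in the Note.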


Final results of estimation

Because $\hat f_{0\,r,k}$, $\hat f_{0T}$ and $\hat f_{0Z}$ underestimate $f_0$, we propose the estimator obtained by taking the maximum of all these estimators. Let

$$\hat f_{0M} = \max_{r,k}\frac{f_k\,f_r}{\binom{k+r}{r}\,f_{k+r}}, \quad 1 \le k, r, \ k + r \le m.$$

Compound- or Mixed-Poisson :

$$\hat f_0 = \max\{\hat f_{0M},\ \hat f_{0Z},\ \hat f_{0T}\}. \qquad (10)$$

If $\hat f_{0C} = f_1^2/(2f_2)$ is Chao's estimator ($r = k = 1$), it is suitable to consider

Log-convex pgf :

$$\hat f_0^* = \max\{\hat f_{0C},\ \hat f_{0Z},\ \hat f_{0T}\}. \qquad (11)$$

Remark : The variance of $\hat f_0$ or $\hat f_0^*$ is complicated to derive. We therefore suggest using a bootstrap method to estimate the variance and the associated confidence interval for any given sample.
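A percentile bootstrap of this kind can be sketched generically. The slides do not specify the resampling scheme, so the version below assumes the simplest one: resample the $n$ observed positive counts with replacement and recompute the estimator each time. For brevity it is demonstrated with Chao's estimator only; any function mapping a frequency table to an estimate could be passed instead. All names are hypothetical.

```python
import random
import statistics
from collections import Counter

def chao_estimate(freqs):
    """Chao's estimator f1^2 / (2 f2); None when f2 = 0."""
    f1, f2 = freqs.get(1, 0), freqs.get(2, 0)
    return f1 * f1 / (2 * f2) if f2 > 0 else None

def bootstrap(freqs, estimator, n_boot=5000, seed=0):
    """Bootstrap mean, SD and 95% percentile interval of an estimator of f0."""
    rng = random.Random(seed)
    sample = [k for k, f in freqs.items() for _ in range(f)]   # the n counts
    values = []
    for _ in range(n_boot):
        resample = rng.choices(sample, k=len(sample))
        est = estimator(Counter(resample))
        if est is not None:           # drop resamples with an undefined estimate
            values.append(est)
    values.sort()
    lower = values[int(0.025 * len(values))]
    upper = values[int(0.975 * len(values)) - 1]
    return statistics.mean(values), statistics.stdev(values), (lower, upper)

cholera = {1: 32, 2: 16, 3: 6, 4: 1}
mean_c, sd_c, ci_c = bootstrap(cholera, chao_estimate)
print(f"mean = {mean_c:.2f}, SD = {sd_c:.2f}, "
      f"95% CI = [{ci_c[0]:.2f}, {ci_c[1]:.2f}]")
```

Fixing the seed makes the run reproducible; results will still differ from any published table because the resampling scheme and random numbers are not the same.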


Three Examples of Application

Coming back to :

1 Cholera data set of McKendrick

2 Number of words known but unused by Shakespeare

3 Number of grizzly bear females in Yellowstone


Cholera data set of McKendrick (1926)

Cholera in 223 households in a village in India :

No. of infections              0      1   2   3  4
No. of households (frequency)  (168)  32  16  6  1

Result : $\hat f_{0M} = 48$, $\hat f_{0T} = 32.59 \approx 33$, $\hat f_{0Z} = 33.46 \approx 33$ and $\hat f_{0C} = 32$.

$\hat f_0 = 48$ and $\hat f_0^* = 33$.

Here : $\hat f_{0M} = (f_1 f_3)/(4 f_4) = 48$.

Variability using 5000 bootstrap samples (CI by the quantile method) :

Estimator      Mean   SD     95% CI
$\hat f_0$     53.69  25.71  [26.05, 107.68]
$\hat f_0^*$   39.70  14.33  [22.38, 72.82]

Note that prior knowledge about the distributional pattern is important because the width of the confidence interval is in general greater for $\hat f_0^*$.


Number of words known but unused by Shakespeare

A reduced version of the full table reported in Efron and Thisted (1976) :

Occurrences               0  1      2     3     4     5     ...
No. of words (frequency)  ?  14376  4343  2292  1463  1043  ...

Result (using the full table) : $\hat f_{0M} = 23793.389 \approx 23793$, $\hat f_{0Z} = 2446.992 \approx 2447$, $\hat f_{0T} = 54.86 \approx 55$ and $\hat f_{0C} = 23793.389 \approx 23793$.

$\hat f_0 = \hat f_0^* = 23793 \equiv \hat f_{0C}$, Chao's estimator.

The simulation of 1000 bootstrap samples produces :

Estimator      Mean      SD      95% CI
$\hat f_0$     23808.52  506.32  [22832.90, 24835.26]
$\hat f_0^*$   23791.21  520.26  [22676.48, 24726.93]


Number of grizzly bear females in Yellowstone

Estimation of the population of female grizzly bears (Keating et al., 2002):

Sightings                   0    1    2    3    4    5    6    7
No. of bears (frequency)    ?    11   13   5    1    1    0    2

Result: f̂0M = 28.17 ≈ 28, f̂0Z = 5.67 ≈ 6, f̂0T = 5.48 ≈ 5 and f̂0C = 4.65 ≈ 5.

f̂0 = 28 and f̂0* = 6.

Adding these to the 33 bears observed, the estimated population sizes are

N̂ = 61 and N̂* = 39.
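The values 28.17, 4.65 and N̂ = 61 can be reproduced from the table under the assumption that f̂0M is the largest ratio f_r f_k / (C(k+r, r) f_{k+r}) over pairs with f_{k+r} > 0 and that f̂0C is the pair (1, 1). This reading is inferred from the numbers and is a sketch, not the authors’ code.

```python
from math import comb

# Observed sighting frequencies (the zero class is unobserved).
bears = {1: 11, 2: 13, 3: 5, 4: 1, 5: 1, 6: 0, 7: 2}
n_observed = sum(bears.values())  # 33 bears seen at least once

# All pairwise ratios f_r f_k / (C(k+r, r) f_{k+r}) with a positive denominator.
bounds = {
    (r, k): bears[r] * bears[k] / (comb(k + r, r) * bears[r + k])
    for r in bears
    for k in bears
    if r <= k and bears.get(r + k, 0) > 0
}

pair, f0_max = max(bounds.items(), key=lambda item: item[1])
f0_chao = bounds[(1, 1)]

print(pair, round(f0_max, 2), round(f0_chao, 2))  # (2, 2) 28.17 4.65
print(n_observed + round(f0_max))                 # 61
```

Here the maximum comes from the pair (2, 2), that is f2²/(6 f4) = 169/6, rather than from a pair involving f1.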

The simulation of 5000 bootstrap samples produces:

Estimator    Mean     SD       95% CI
f̂0           21.35    13.68    [5.83, 60.17]
f̂0*          7.30     3.36     [3.05, 16.00]


Introduction

Count distributions with log-convex pgf

Fascination to lower bounds of p0

Estimating the non-observed number of zeros

Applications
• Cholera data set of McKendrick
• Number of words known but unused by Shakespeare
• Number of grizzly bear females in Yellowstone

“Jo mai perdo.

O bé guanyo, o n’aprenc.”

“I never lose. I either win or I learn.”


Supplementary References

a. Böhning, D. (2010). Some general comparative points on Chao’s and Zelterman’s estimators of the population size. Scand. J. Statist. 37, 221-236.

b. Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scand. J. Statist. 11, 265-270.

c. Chao, A. (1987). Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43, 783-791.

d. Chao, A., Lin, C.-W. (2012). Nonparametric lower bounds for species richness and shared species richness under sampling without replacement. Biometrics 68, 912-921.

e. Chiu, C.-H., Wang, Y.-T., Walther, B.A., Chao, A. (2014). An improved nonparametric lower bound of species richness via a modified Good-Turing frequency formula. Biometrics 70, 671-682.

f. Good, I.J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40, 237-264.

g. Johnson, N.L., Kemp, A.W., Kotz, S. (2005). Univariate Discrete Distributions (3rd ed.). Wiley, New Jersey.

h. Kemp, A.W., Kemp, C.D. (1966). An alternative derivation of the Hermite distribution. Biometrika 53, 627-628.

i. Lanumteang, K., Böhning, D. (2011). An extension of Chao’s estimator of population size based on the first three capture frequency counts. Comput. Statist. Data Anal. 55, 2302-2311.

j. Steutel, F.W., van Harn, K. (2004). Infinite Divisibility of Probability Distributions on the Real Line (1st ed.). Dekker, New York.


Thanks - Gràcies - Merci - Singuila

http://lmb.univ-fcomte.fr/celestin-c-kokonendji


Proof of Proposition (I)

• Steutel and van Harn (2004, Chap. II, p. 51) for the Compound-Poisson distributions.

• For the Mixed-Poisson distributions, note that the inequalities (1) are equivalent to

$$
\int_0^\infty e^{-\lambda}\lambda^{k}\,dF(\lambda)\int_0^\infty e^{-\lambda}\lambda^{r}\,dF(\lambda)
\;\le\;
\int_0^\infty e^{-\lambda}\lambda^{r+k}\,dF(\lambda)\int_0^\infty e^{-\lambda}\,dF(\lambda). \tag{12}
$$

Defining the probability measure over the positive reals,

$$
dG(\lambda) = \frac{e^{-\lambda}\,dF(\lambda)}{\int_0^\infty e^{-\lambda}\,dF(\lambda)},
$$

the inequality (12) can be written as $E(Y^{r})\,E(Y^{k}) \le E(Y^{k+r})$, where $Y$ is a positive r.v. with distribution $G$.

It is well known that for any positive r.v. $Y$, $E(Y^{s})^{1/s} \le E(Y^{z})^{1/z}$ for all $0 < s \le z$ (moment monotonicity). Without loss of generality we can assume that $r \le k$. Then,

$$
E(Y^{k+r}) \;\ge\; E(Y^{k})^{(k+r)/k} \;=\; E(Y^{k})\,E(Y^{k})^{r/k} \;\ge\; E(Y^{k})\,E(Y^{r}),
$$

and the proof is complete. □
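The mixed-Poisson case can be checked numerically on one example. The sketch below assumes that inequalities (1) read p_r p_k ≤ C(k+r, r) p_{k+r} p_0, which is what (12) gives after dividing each integral by the corresponding factorial, and it uses a negative binomial distribution (a Poisson mixed over a Gamma law) with arbitrarily chosen parameters.

```python
from math import comb, exp, lgamma, log

def nb_pmf(j, shape, rate):
    """Negative binomial pmf, i.e. Poisson(lambda) with lambda ~ Gamma(shape, rate)."""
    log_p = (lgamma(shape + j) - lgamma(shape) - lgamma(j + 1)
             + shape * log(rate / (rate + 1)) - j * log(rate + 1))
    return exp(log_p)

def pairwise_bound(r, k, shape, rate):
    """Ratio p_r p_k / (C(k+r, r) p_{k+r}), expected to lie below p_0."""
    numerator = nb_pmf(r, shape, rate) * nb_pmf(k, shape, rate)
    return numerator / (comb(k + r, r) * nb_pmf(k + r, shape, rate))

shape, rate = 0.7, 0.5  # illustrative parameter values
p0 = nb_pmf(0, shape, rate)
worst = max(pairwise_bound(r, k, shape, rate)
            for r in range(1, 8) for k in range(1, 8))
print(worst <= p0)  # True
```

For this family the ratio equals p_0 Γ(a+r)Γ(a+k) / (Γ(a)Γ(a+k+r)) with a the shape parameter, so every pairwise bound is strictly below p_0.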


Proof of Proposition (II)

(i) Due to the convexity, the tangent line to $\log \Phi_X(t)$ at $t = t_0$ always lies below $\log \Phi_X(t)$, that is,

$$
\log \Phi_X(t) \;\ge\; \frac{\Phi_X'(t_0)}{\Phi_X(t_0)}\,(t - t_0) + \log \Phi_X(t_0).
$$

In particular, for $t_0 = 1$, taking into account that $\Phi_X'(1) = \mu$ and $\Phi_X(1) = 1$, we obtain $\log \Phi_X(t) \ge \mu(t - 1)$, and for $t = 0$ it leads to $\log p_0 \ge -\mu$.

(ii) Note that the first derivative of $\log \Phi_X(t)$ is an increasing function for $t \in [0,1]$. In particular, the second inequality is deduced from

$$
\frac{\Phi_X'(0)}{\Phi_X(0)} \;\le\; \frac{\Phi_X'(1)}{\Phi_X(1)}.
$$

(iii) The third inequality is a direct consequence of the pgf log-convexity at $t = 0$. Because $\log \Phi_X(t)$ is a convex function, calculating the second derivative we obtain that $\Phi_X''(t)\,\Phi_X(t) - (\Phi_X'(t))^2 \ge 0$. Evaluating this expression at $t = 0$, the third inequality directly holds. □

Note: Evaluating at $t = 1$ the expression $\Phi_X''(t)\,\Phi_X(t) - (\Phi_X'(t))^2 \ge 0$, we directly obtain that any count r.v. having a log-convex pgf is overdispersed.
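For reference, the evaluations used in this proof can be written out explicitly. The statement of the proposition is not reproduced on this slide, so the three bounds below are what the proof steps yield, using only the standard pgf facts $\Phi_X(0) = p_0$, $\Phi_X'(0) = p_1$, $\Phi_X''(0) = 2p_2$ and $\Phi_X''(1) = E[X(X-1)] = \sigma^2 + \mu^2 - \mu$:

$$
\begin{aligned}
\text{(i)}\quad & \log p_0 \ge -\mu &&\Longrightarrow\quad p_0 \ge e^{-\mu},\\
\text{(ii)}\quad & \frac{p_1}{p_0} \le \mu &&\Longrightarrow\quad p_0 \ge \frac{p_1}{\mu},\\
\text{(iii)}\quad & 2p_2\,p_0 - p_1^2 \ge 0 &&\Longrightarrow\quad p_0 \ge \frac{p_1^2}{2p_2},\\
\text{Note}\quad & (\sigma^2 + \mu^2 - \mu)\cdot 1 - \mu^2 \ge 0 &&\Longrightarrow\quad \sigma^2 \ge \mu .
\end{aligned}
$$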


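Proof of Proposition (III)

Because the Lemma establishes the set of inequalities

$$
p_0 \;\ge\; \frac{p_1^{2}}{2p_2} \;\ge\; \frac{p_2\,p_1}{3p_3} \;\ge\; \dots \;\ge\; \frac{p_k\,p_1}{(k+1)\,p_{k+1}} \;\ge\; \dots,
$$

we need only to prove that

$$
\frac{p_1}{(k+1)\,p_{k+1}} \;\ge\; \frac{p_r}{\binom{k+r}{r}\,p_{k+r}}, \qquad r = 2, 3, \dots
$$

This inequality is equivalent to

$$
\int_0^\infty e^{-\lambda}\lambda^{k+1}\,dF(\lambda)\int_0^\infty e^{-\lambda}\lambda^{r}\,dF(\lambda)
\;\le\;
\int_0^\infty e^{-\lambda}\lambda^{r+k}\,dF(\lambda)\int_0^\infty \lambda e^{-\lambda}\,dF(\lambda).
$$

Similarly to the proof of Proposition (I), defining the probability measure

$$
dG(\lambda) = \frac{\lambda e^{-\lambda}\,dF(\lambda)}{\int_0^\infty \lambda e^{-\lambda}\,dF(\lambda)},
$$

the inequality can be expressed as $E(Y^{r-1})\,E(Y^{k}) \le E(Y^{k+r-1})$, where $Y$ is a r.v. with distribution $G$. Using again the moment monotonicity, the proof is completed. □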
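The equivalence step in this proof can be made explicit. Writing $m_j = \int_0^\infty e^{-\lambda}\lambda^{j}\,dF(\lambda)$ (a shorthand introduced here, not used on the slides), the mixed-Poisson probabilities are $p_j = m_j / j!$, and cross-multiplying the inequality to be proved gives

$$
(k+1)\,p_{k+1}\,p_r \;\le\; \binom{k+r}{r}\,p_{k+r}\,p_1
\;\Longleftrightarrow\;
\frac{m_{k+1}\,m_r}{k!\,r!} \;\le\; \frac{m_{k+r}\,m_1}{k!\,r!}
\;\Longleftrightarrow\;
m_{k+1}\,m_r \;\le\; m_{k+r}\,m_1 .
$$

Dividing both sides by $m_1^{2}$ and noting that $E_G(Y^{s}) = m_{s+1}/m_1$ under $dG(\lambda) \propto \lambda e^{-\lambda}\,dF(\lambda)$ yields $E(Y^{k})\,E(Y^{r-1}) \le E(Y^{k+r-1})$.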
