
ON THE ADVERSE EFFECT OF INCREASING THE NUMBER OF BINARY SYMPTOMS IN MEDICAL DIAGNOSIS USING THE KERNEL METHOD

E. Girelli Bruni

Department of Statistics

University College London

London, England

SUMMARY

The modern tendency in the diagnostic process is to use as many technologies as

possible to investigate the maximum number of biological functions. This tendency

finds its justification in the opinion by which a greater quantity of information

must also correspond with a greater comprehension and analysis of the state of

health of the patient (The Lancet, 1976; Lindley, 1977).

In this work we try to oppose such a relationship, showing that a greater request

for exams can correspond to a poorer statistical identification of the patient's

state of health. The statistical method analysed for the diagnostic allocation

is the kernel method suggested by Aitchison and Aitken (1976).

P.S.

This research has been supported by the National Research Council of Italy by

contract no. 203.10.11.

B. Barber et al. (eds.), Medical Informatics Berlin 1979 © Online Conferences Ltd., Uxbridge, England 1979


INTRODUCTION

The development of the present paper finds its origins in the attempt to solve a

statistical problem which at the present moment has not been completely and

formally solved.

The problem in question is expressed as the belief that the larger the considered

number of variables, the greater the expected percentage of correct allocations

of the new evidence. This belief would lead us to suppose, for practical purposes

such as medical diagnosis, that we should support the use of more and more complex

multi-dimensional statistical methods, for example global analysis (Gremy and Goldberg,

1977). The above-mentioned belief finds support in a paper by Lindley (1977)

opposing the results obtained by Hughes (1968). These results were also discussed

by Chandrasekaran (1971) and by Chandrasekaran and Jain (1975).

It is in studying such a problem that the author has considered the kernel method,

suggested by Aitchison and Aitken (1976) and subsequently developed by Aitchison et al. (1977). Even

though this paper does not provide a general answer for the above problem it reveals

an unsatisfactory feature of the kernel method when the number of variables is

increased.

In this paper we refer to the problem of medical diagnosis with $J$ diseases, indexed by $j = 1, 2, \ldots, J$. The symbol $S$ denotes the total amount of information provided by the data bank for the $J$ diseases, so that $S = (S_1, S_2, \ldots, S_J)$, where $S_j$ is the information in the data bank for disease $j$.

For each patient in the data bank we suppose that $I$ variables (symptoms) have been observed, indexed by $i = 1, 2, \ldots, I$, and we also assume that each symptom has only two possible facets, $k_i$ and $\bar{k}_i$ (usually $k_i = 1$ and $\bar{k}_i = 0$), which are the presence and the absence of symptom $i$, respectively.

The solution of the diagnostic problem in Bayesian terms is to define

$$p(j \mid \mathbf{k}, S, A) \propto \pi(j \mid A)\, p(\mathbf{k} \mid j, S),$$

the posterior probability that a new patient, not included in the data bank, has disease $j$, conditioned on the vector of facets of the observed symptoms $\mathbf{k} = (k_1, k_2, \ldots, k_I)$ and on the information given by the data bank, where $\pi(j \mid A)$ is the prior probability for disease $j$ and where $A$ represents all the doctor's relevant knowledge, including the information in $S$.


It is important to note that the probability of $\mathbf{k}$ might differ if computed on $S$ or on $S_j$. The probability $p(\mathbf{k} \mid j, S)$ indicates that the relevant information for $\mathbf{k}$ when disease $j$ occurs is contained not only in $S_j$ but also in the other partitions of the data bank, which refer to different diseases.

In the present paper, however, we assume that the vector $\mathbf{k}$ is differently and exclusively determined by each disease, so that the previous expression can be rewritten

$$p(j \mid \mathbf{k}, S, A) \propto \pi(j \mid A)\, p(\mathbf{k} \mid j, S_j),$$

and we will concentrate only on the problem of defining the likelihoods $p(\mathbf{k} \mid j, S_j)$.

THE KERNEL METHOD

Using the kernel method suggested by Aitchison and Aitken (1976) and Aitchison et al. (1977), the probability $p(\mathbf{k} \mid j, S_j)$ is defined as follows

$$p(\mathbf{k} \mid j, S_j, M) = \frac{1}{F_j} \sum_{t=1}^{F_j} M_t(\mathbf{k} \mid S_j, \lambda_j) \tag{1}$$

where $F_j$ is the number of patients with disease $j$, where $M$ on the left-hand side of (1) is a reminder of the function $M_t(\cdot)$, this being the adopted kernel model indexed for the $t$th patient (though having the same form for any disease), and where $\lambda_j$ ($j = 1, \ldots, J$) are the parameters of the kernel. In particular, the kernel model for binary data is

$$M_t(\mathbf{k} \mid S_j, \lambda_j) = \lambda_j^{\,I - d(\mathbf{k}, \mathbf{s}_{jt})} (1 - \lambda_j)^{d(\mathbf{k}, \mathbf{s}_{jt})} \tag{2}$$

in which

$$d(\mathbf{k}, \mathbf{s}_{jt}) = \sum_{i=1}^{I} (k_i - s_{jti})^2 \tag{3}$$

is a measure of distance between the two multi-dimensional points $\mathbf{k}$ and $\mathbf{s}_{jt} = (s_{jt1}, s_{jt2}, \ldots, s_{jtI})$, where $s_{jti}$ is the facet of symptom $i$ of patient $t$ with disease $j$ in the data bank, so that $\mathbf{s}_{jt}$ is the facet vector of patient $t$ with disease $j$ in the data bank.
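As an illustration of expressions (1) to (3), the following minimal Python sketch evaluates the kernel likelihood for one disease class. It assumes the binary kernel $\lambda^{I-d}(1-\lambda)^{d}$, with $d$ the number of disagreeing facets. The names `distance` and `kernel_likelihood` and the small data bank are hypothetical and are not taken from the cited papers.

```python
from typing import Sequence

Facets = Sequence[int]  # one 0/1 entry per symptom


def distance(k: Facets, s: Facets) -> int:
    """Number of symptoms on which two facet vectors disagree, as in (3)."""
    if len(k) != len(s):
        raise ValueError("facet vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(k, s))


def kernel_likelihood(k: Facets, patients: Sequence[Facets], lam: float) -> float:
    """Average of the binary kernel over the patients of one disease, as in (1)."""
    n_symptoms = len(k)
    total = 0.0
    for s in patients:
        d = distance(k, s)
        total += lam ** (n_symptoms - d) * (1.0 - lam) ** d  # assumed kernel (2)
    return total / len(patients)


# Hypothetical two-symptom data bank for a single disease.
bank = [(1, 0), (1, 0), (0, 0)]
print(kernel_likelihood((1, 0), bank, 0.5))  # lambda = 1/2: uniform over the 4 vectors
print(kernel_likelihood((1, 0), bank, 1.0))  # lambda = 1: relative frequency of (1, 0)
```

With $\lambda = 1/2$ every facet vector receives the same probability whatever the data, while with $\lambda = 1$ only exact matches in the data bank contribute.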

In (2) $\lambda_j$ is the smoothing parameter, between 0 and 1. For $\lambda_j = \tfrac{1}{2}$ a uniform distribution is obtained whatever the data, and for $\lambda_j = 1$ the method estimates the density simply by the corresponding relative frequencies. Since the parameters $\lambda_j$ have to be estimated, the jack-knife likelihood method (leaving one out) suggested by Habbema et al. (1974) could provide the estimates $\hat{\lambda}_j$ of the $\lambda_j$.
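The following Python sketch shows one possible reading of the leaving-one-out likelihood. Each patient of a disease class is scored by the kernel estimate built from the remaining patients of that class, and $\lambda$ is chosen on a grid so as to maximise the product of these scores. The kernel form $\lambda^{I-d}(1-\lambda)^{d}$, the grid search and all function names are assumptions made for illustration. They are not details taken from Habbema et al. (1974).

```python
import math
from typing import List, Sequence, Tuple

Facets = Tuple[int, ...]


def distance(a: Facets, b: Facets) -> int:
    """Number of disagreeing facets between two patients."""
    return sum(x != y for x, y in zip(a, b))


def loo_log_likelihood(patients: Sequence[Facets], lam: float) -> float:
    """Sum over patients of the log kernel density estimated without that patient."""
    rows: List[Facets] = list(patients)
    n_symptoms = len(rows[0])
    total = 0.0
    for t, s in enumerate(rows):
        others = rows[:t] + rows[t + 1:]
        density = sum(
            lam ** (n_symptoms - distance(s, o)) * (1.0 - lam) ** distance(s, o)
            for o in others
        ) / len(others)
        if density == 0.0:
            return -math.inf  # a left-out patient has zero estimated probability
        total += math.log(density)
    return total


def estimate_lambda(patients: Sequence[Facets], grid: Sequence[float]) -> float:
    """Grid value of lambda with the largest leaving-one-out log likelihood."""
    return max(grid, key=lambda lam: loo_log_likelihood(patients, lam))


grid = [0.50 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
homogeneous = [(1, 0, 1), (1, 0, 1), (1, 0, 1)]  # hypothetical identical patients
scattered = [(0, 0), (1, 1)]                     # hypothetical opposite patients
print(estimate_lambda(homogeneous, grid))
print(estimate_lambda(scattered, grid))
```

In this sketch a homogeneous class pushes the estimate towards the upper end of the grid, while a class whose patients disagree on every symptom pulls it down to $\lambda = 1/2$.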

A CRITICAL EXAMPLE

The above kernel method gives good results if the patterns of the 0, 1 facets of

the I symptoms are very similar among patients with the same disease, and they

differ significantly from patients with a different disease. In fact, if the data

bank is not particularly good in the above sense, i.e. not very homogeneous, the

kernel method will give poor results, as shown in the following example. At this

point it is also important to note that if the data bank is not very homogeneous in

terms of symptoms - this peculiarity being often considered by any mathematical

model attempting to handle real data - this is not necessarily a bad thing. In fact

it is quite common to have situations in which a non-homogeneous data bank for

various classes of disease could be a good data bank in the medical sense if used

directly by a doctor.

As an example, let us consider two disease classes, a data bank with three patients

in each class and 22 recorded binary symptoms. This situation can be realized as

shown in Table 1. In Table 1 we can see that disease 1 could be defined by a high

probability that the first three symptoms are present while the second three are

absent (the 0 and 1 standing for absence and presence of the symptom respectively),

whereas disease 2 is instead recognisable by the high probability of absence of the

first three symptoms and of presence of the second three. In both diseases the

facets of the symptoms from the seventh to the twenty-second were chosen at random

so that they would be meaningless in discriminating between the two diseases.

We are aware that knowledge of the data bank structure is

important extra information which cannot be used by the kernel method, and it is

made available to the reader only for the sake of critically judging the kernel method.


Facets of the Symptoms in the Data Bank

                                        Symptoms
Disease  Patient   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
   1        1      1  1  1  0  0  0  0  1  1  1  1  0  0  1  0  0  0  1  0  0  1  1
   1        2      1  1  1  0  0  0  1  1  1  0  1  0  0  0  0  1  0  0  1  0  1  1
   1        3      1  1  0  1  0  0  1  1  1  1  1  1  0  0  1  0  1  1  1  0  0  1
   2        1      0  0  0  1  1  1  0  1  1  0  0  0  0  0  1  1  0  1  1  1  0  0
   2        2      0  0  0  1  1  1  0  1  1  1  1  1  0  1  0  1  1  0  1  0  1  1
   2        3      0  0  1  0  1  1  0  0  1  0  0  0  1  1  0  1  0  0  0  1  1  0

Table 1

Let us now allocate a new patient to one of the two diseases, after we have observed his vector of facets

$\mathbf{k}$ = (1 1 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 1 1 0)

In order to do so we have to compute (1) and, if we consider the estimates of lambda to be $\hat{\lambda}_1 = \hat{\lambda}_2 = \hat{\lambda} = 0.80$, the ratio of expression (1) for the two diseases is

$$\frac{p(\mathbf{k} \mid j = 1, S_1, M)}{p(\mathbf{k} \mid j = 2, S_2, M)} = 0.00781 \tag{4}$$

showing that the kernel method provides a higher probability for disease 2 than for disease 1, contrary to expectation.

What has happened in the above example is the annulment of the relevant information

on the problem of class allocation in one of the two diseases by the presence of

the sixteen irrelevant pieces of information. The new patient - through the

analysis of his first six symptoms - should have been allocated to disease 1, but

because the remaining sixteen facets match those of the 3rd patient of the data

bank with disease 2, the likelihood has expressed a higher 'conformity' (Pompilj,

1968) for disease 2 rather than disease 1.

The same conclusion would be obtained for any value of $\lambda$ within the interval (0.5, 1) and for any combination of the sixteen random facets of the new patient that is appreciably closer to one of the three vectors of random facets in the data bank for disease 2 than to those for disease 1.
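The example can be checked numerically with the short Python sketch below. The facet rows are transcribed from Table 1. The new patient's vector is assumed to be the one described in the text: the first six facets typical of disease 1, followed by the sixteen random facets of the 3rd patient with disease 2. The kernel form $\lambda^{I-d}(1-\lambda)^{d}$ and the function names are likewise assumptions of this sketch.

```python
from typing import List

# Rows of Table 1, one string of 22 facets per patient.
DISEASE_1 = [
    "1110000111100100010011",
    "1110001110100001001011",
    "1101001111110010111001",
]
DISEASE_2 = [
    "0001110110000011011100",
    "0001110111110101101011",
    "0010110010001101000110",
]
# Assumed new patient: disease-1 pattern on symptoms 1-6, then the random
# facets of the 3rd patient of disease 2 on symptoms 7-22.
NEW_PATIENT = "1110000010001101000110"


def distance(a: str, b: str) -> int:
    """Number of symptoms on which two facet strings disagree."""
    return sum(x != y for x, y in zip(a, b))


def likelihood(k: str, patients: List[str], lam: float) -> float:
    """Kernel likelihood (1) with the assumed binary kernel."""
    n = len(k)
    return sum(
        lam ** (n - distance(k, s)) * (1.0 - lam) ** distance(k, s)
        for s in patients
    ) / len(patients)


def likelihood_ratio(lam: float) -> float:
    """Ratio of the likelihoods for disease 1 and disease 2."""
    return likelihood(NEW_PATIENT, DISEASE_1, lam) / likelihood(NEW_PATIENT, DISEASE_2, lam)


print([distance(NEW_PATIENT, s) for s in DISEASE_1])
print([distance(NEW_PATIENT, s) for s in DISEASE_2])
print(round(likelihood_ratio(0.80), 5))
for lam in (0.55, 0.65, 0.75, 0.85, 0.95):
    print(lam, likelihood_ratio(lam) < 1.0)
```

Under these assumptions the ratio at $\lambda = 0.80$ agrees with the value reported in (4), and it stays below 1 for each of the listed values of $\lambda$.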


MATHEMATICAL CONSIDERATIONS

In this section the mathematical structure of the kernel method is considered. We

shall prove that the phenomenon shown in the previous section represents a general

feature of the kernel method.

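Let us still consider two diseases. If, for simplicity, $\hat{\lambda}_1 = \hat{\lambda}_2 = \lambda$ and $F_1 = F_2 = T$, and we also write $d(\mathbf{k}, \mathbf{s}_{jt}) = d_{jt}$, (1) becomes

$$p(\mathbf{k} \mid j, S_j, M) = \frac{1}{T} \sum_{t=1}^{T} \lambda^{I - d_{jt}} (1 - \lambda)^{d_{jt}} \tag{5}$$

so that if the vector $\mathbf{k}$ is thought to come from disease 1 and we compute (5) for $j = 1$ and $j = 2$, we obtain the following ratio of likelihoods for the two diseases

$$R(\mathbf{k} \mid H) = \frac{p(\mathbf{k} \mid j = 1, S_1, M)}{p(\mathbf{k} \mid j = 2, S_2, M)} \tag{6}$$

where $H$ represents $(j = 1, j = 2, S_1, S_2, M)$. If we now consider the simplest case where $T = 1$, using (5) and (6) we obtain

$$R(\mathbf{k} \mid H) = \frac{\lambda^{I - d_{11}} (1 - \lambda)^{d_{11}}}{\lambda^{I - d_{21}} (1 - \lambda)^{d_{21}}} = \left(\frac{\lambda}{1 - \lambda}\right)^{d_{21} - d_{11}} = \phi^{x} \tag{7}$$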

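We can see that the power of $\phi = \lambda/(1 - \lambda)$, namely $x = d_{21} - d_{11}$, has a symmetrical pdf, obtained from a sum of binomial density functions, when both $d_{21}$ and $d_{11}$ have a binomial pdf of the type $B(I, \tfrac{1}{2})$.

If $c = 0$, that is, if there are only $I$ random symptoms (with probability $\tfrac{1}{2}$ of being either present or absent), the variable $x$ varies between $-I$ and $+I$ with probability

$$
\begin{aligned}
p(x) &= p(d_{21} = x,\ d_{11} = 0) + p(d_{21} = x + 1,\ d_{11} = 1) + \cdots + p(d_{21} = I,\ d_{11} = I - x) \\
&= p(-x) = p(d_{21} = 0,\ d_{11} = x) + p(d_{21} = 1,\ d_{11} = x + 1) + \cdots + p(d_{21} = I - x,\ d_{11} = I) \\
&= \sum_{i=0}^{I - |x|} p(d_{21} = |x| + i,\ d_{11} = i) \\
&= \sum_{i=0}^{I - |x|} p(d_{21} = |x| + i)\, p(d_{11} = i) \\
&= \sum_{i=0}^{I - |x|} \binom{I}{|x| + i} \left(\tfrac{1}{2}\right)^{I} \binom{I}{i} \left(\tfrac{1}{2}\right)^{I} \\
&= \left(\tfrac{1}{4}\right)^{I} \sum_{i=0}^{I - |x|} \binom{I}{|x| + i} \binom{I}{i}
\end{aligned}
\tag{8}
$$

from which we can obtain $E\{x\} = 0$. Note also that $p(x) = p(R)$, where $R = R(\mathbf{k} \mid H)$.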

For example, if $I = 1$ and the symptom has probability $\tfrac{1}{2}$ of being either present or absent in both diseases, there would only be the four combinations of the data bank shown in Table 2, in which $x$, $p(x)$ and $R$ are also computed for $\lambda = 0.80$. In Figures 1 and 2 we have plotted $p(x)$ against $x$ and $p(R)$ against $R$, where $\bar{R} = E\{R\}$.

Combination of data bank       1         2         3         4
Disease j                    1    2    1    2    1    2    1    2
d_j1                         0    0    0    1    1    0    1    1
x                              0         1        -1         0
p(x)                         0.25      0.25      0.25      0.25
R                              1         4       0.25        1

Table 2

[Figure 1: p(x) plotted against x, for x = -1, 0, +1.]

[Figure 2: p(R) plotted against R, with the mean R̄ = 1.562 marked on the R axis.]
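The distribution of $x$ and the mean of $R$ for the case of purely random symptoms can be computed directly. The Python sketch below assumes the closed form of expression (8), with $d_{21}$ and $d_{11}$ independent $B(I, \tfrac{1}{2})$ variables and $T = 1$. The function names `pmf_x` and `expected_ratio` are hypothetical.

```python
from math import comb
from typing import Dict


def pmf_x(n_symptoms: int) -> Dict[int, float]:
    """p(x) for x = d21 - d11, both distances being Binomial(I, 1/2), as in (8)."""
    I = n_symptoms
    return {
        x: sum(comb(I, abs(x) + i) * comb(I, i) for i in range(I - abs(x) + 1)) / 4 ** I
        for x in range(-I, I + 1)
    }


def expected_ratio(n_symptoms: int, lam: float) -> float:
    """E{R} with R = phi^x and phi = lambda / (1 - lambda)."""
    phi = lam / (1.0 - lam)
    return sum(p * phi ** x for x, p in pmf_x(n_symptoms).items())


print(pmf_x(1))                 # distribution of x for a single random symptom
print(expected_ratio(1, 0.80))  # mean of R for I = 1
```

For $I = 1$ the value $x = 0$ arises from two of the four combinations in Table 2, so its probability is the sum of their two entries.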

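In case the new patient has disease $j = 2$, and there are $c$ symptoms that have probability 1 of being present if $j = 1$ and of being absent if $j = 2$, and the remaining $I - c$ symptoms are random for both diseases, expression (7) can be written

$$R = \phi^{x} \tag{9}$$

where

$$x = d_{21} - d_{11} = \rho_{21} - \rho_{11} - c,$$

$\rho_{j1}$ being the distance computed on the $I - c$ random symptoms only. If we write $y = \rho_{21} - \rho_{11}$, we have that $y$ varies between $-(I - c)$ and $(I - c)$ and has a symmetrical pdf given by

$$p(y) = \left(\tfrac{1}{4}\right)^{I - c} \sum_{i=0}^{I - c - |y|} \binom{I - c}{|y| + i} \binom{I - c}{i} \tag{10}$$

such that $E\{y\} = 0$.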

From the above results we then obtain

$$p(R) = p(\phi^{x}) = p(x) = p(y - c) = p(y) \tag{11}$$

and

$$E\{x\} = E\{y - c\} = E\{y\} - c = -c \tag{12}$$

$x$ also having a symmetrical pdf between $-I$ and $I - 2c$.

Because the pdf of $x$ is symmetrical around $-c$ we would also have

$$p(R < \phi^{-c}) = p(R > \phi^{-c}) \tag{13}$$

and, moreover, for any fixed value of $c$,

$$\lim_{I \to \infty} p(\phi^{-c} \le R \le \phi^{0}) = 0 \tag{14}$$

so that

$$\lim_{I \to \infty} p(R < \phi^{0}) = \lim_{I \to \infty} p(R > \phi^{0}) \tag{15}$$

even though, at the same time, $E\{R\}$ will go to $\infty$, as empirically shown in Table 3, and as can be proved theoretically.

Because for $R > \phi^{0}$ or $R < \phi^{0}$ we obtain higher 'conformity' for disease 1 or 2, respectively (independently of the degree of 'conformity' for either disease), expression (15) tells us that there is an equal probability of 'conforming' better to disease 1 or to disease 2 as $I$ increases to infinity.

In the following table, the probabilities $p(R > \phi^{0})$, $p(R < \phi^{0})$, $p(\phi^{-1} \le R \le \phi^{0})$, the ratio $p(R < \phi^{0})/p(R > \phi^{0})$ and $E\{R\}$ have been computed for $c = 1$, $I = 2, 3, 10$ respectively, and for $\lambda = 0.80$.


                             I = 2     I = 3     I = 10    ...    I = ∞
p(R > φ^0)                   0.0       0.0625    0.2403    ...    0.50
p(R < φ^0)                   0.75      0.6875    0.5927    ...    0.50
p(φ^-1 ≤ R ≤ φ^0)            0.75      0.6250    0.3520    ...    0.0
p(R < φ^0) / p(R > φ^0)      ∞         11        2.467     ...    1
E{R}                         0.391     0.610     13.877    ...    ∞

Table 3
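The quantities in Table 3 can be recomputed with the Python sketch below. It assumes $T = 1$, a new patient with disease 2, $c$ perfectly discriminating symptoms and $I - c$ random symptoms, so that $x = y - c$ with $y$ the difference of two independent $B(I - c, \tfrac{1}{2})$ variables. The function names are hypothetical.

```python
from math import comb
from typing import Dict, Tuple


def pmf_y(n_random: int) -> Dict[int, float]:
    """p(y) for the difference of two independent Binomial(n, 1/2) distances."""
    n = n_random
    return {
        y: sum(comb(n, abs(y) + i) * comb(n, i) for i in range(n - abs(y) + 1)) / 4 ** n
        for y in range(-n, n + 1)
    }


def table3_column(n_symptoms: int, c: int = 1, lam: float = 0.80) -> Tuple[float, float, float, float]:
    """Return p(R > 1), p(R < 1), p(phi^-c <= R <= 1) and E{R} for R = phi^x."""
    phi = lam / (1.0 - lam)
    pmf_x = {y - c: p for y, p in pmf_y(n_symptoms - c).items()}  # x = y - c
    p_greater = sum(p for x, p in pmf_x.items() if x > 0)
    p_less = sum(p for x, p in pmf_x.items() if x < 0)
    p_band = sum(p for x, p in pmf_x.items() if -c <= x <= 0)
    mean_r = sum(p * phi ** x for x, p in pmf_x.items())
    return p_greater, p_less, p_band, mean_r


for n in (2, 3, 10, 100):
    print(n, table3_column(n))
```

Comparing events through the exponent $x$ rather than through $R$ itself avoids floating-point comparisons at the boundary $R = \phi^{0}$. The last row printed, for a larger $I$, is only an illustration of the trend towards equal probabilities while $E\{R\}$ keeps growing.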

It could be easily proved that for any value of T, expression (9) would become

(16 ) R = -c [(1 + 4»2]I-C[ L T 4> l 44> 1 + t.:1, (T-1)

where $Y_t = \rho_{2u} - \rho_{2t}$ ($t = 1, \ldots, u-1, u+1, \ldots, T$), and where $u$ refers to a specific patient. It is then possible to notice that for $I \to \infty$, $Y_t \to 0$ and expression (16) will go to $\infty$.

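It also seems worth ending this section by noticing that $E\{R\}$ is a poor parameter for representing the problem of this paper, since it grows without bound while the probabilities of 'conforming' better to either disease tend to equality.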

CONCLUSIONS

The results of this paper are conditional on the assumption that there is only a

finite number of symptoms enabling the discrimination of two or more states of

health, and that any increase in the number of symptoms is due only to the increase

of the random symptoms.

Even though we have analysed only the kernel method, this paper tries to support the idea that, although statistical multi-dimensional approaches are increasingly regarded as important for a better understanding of nature in its complexity, this does not imply that it is always worthwhile to increase the number of dimensions in order to solve a diagnostic problem. The main reasons for this are that:

(i) doctors are not easily capable of understanding states of health with a great number of simultaneous relationships, and

(ii) quite often it is more important to know which are the meaningful symptoms for defining the state of health of the patients, rather than to increase the number of symptoms to be considered without having a complete understanding of their use.


REFERENCES

Gremy F. and Goldberg M. (1977), "Decision Making Method in Medicine", in Informatics and Medicine - An Advanced Course, edited by P. L. Reichertz and G. Goos, Springer-Verlag, Berlin, Heidelberg.

Lindley D. V. (1977), "The concept of coherence in inference", meeting on 'I fondamenti dell'inferenza statistica', 20-30 April 1977. Published by the Dipartimento Statistico, Università degli Studi di Firenze (1978), pp. 178-207.

Hughes G. F. (1968), "On the mean accuracy of statistical pattern recognizers", IEEE Trans. Information Theory, 14, pp. 55-63.

Chandrasekaran B. (1971), "Independence of measurements and the mean recognition accuracy", IEEE Trans. Information Theory, 17, pp. 452-456.

Chandrasekaran B. and Jain A. K. (1975), "Independence, measurement complexity and classification performance", IEEE Trans. Systems Man Cybernet., 5, pp. 240-244.

Aitchison J. and Aitken C. G. (1976), "Multivariate binary discrimination by the kernel method", Biometrika, 63, pp. 413-420.

Aitchison J., Habbema J. D. F. and Kay J. W. (1977), "A critical comparison of two methods of statistical discrimination", Applied Statistics, 26, pp.

Habbema J. D. F., Hermans J. and Van den Broek K. (1974), "A stepwise discriminant analysis program using density estimation", Compstat 1974, edited by G. Bruckman, Vienna: Physica Verlag.

Pompilj G. (1968), "Teoria della conformita", Teorie dei Campioni, Roma.

The Lancet (1976), "Admission Multiphasic Screening", Lancet, 2, p. 7997.

ACKNOWLEDGEMENT

I am grateful to A. F. M. Smith and D. V. Lindley for their helpful comments.
