
Learning with TETRAD IV to build a Bayesian network for the study of depression in elderly people

Pilar Fuster-Parra
Departament Matemàtiques i Informàtica
Universitat de les Illes Balears, Spain
[email protected]

Carmen Moret-Tatay
Cátedra Energesis de Tecnología Interdisciplinar
Universidad Católica de Valencia, Spain
[email protected]

Esperanza Navarro-Pardo
Departament Psicologia Evolutiva i de la Educació
Universitat de València, Spain
[email protected]

Pedro Fernández-de-Córdoba Castellá
Institut Universitari de Matemàtica Pura i Aplicada
Universitat Politècnica de València, Spain
[email protected]

Abstract

The paper describes the process of learning with TETRAD IV from a complete database composed of twenty-two variables (characteristics) and sixty-six observations (cases), obtained from different validated psychological tests applied to a small sample of a specific population: the elderly. Both the structure and the parameters have been learned with TETRAD IV. Then, a Bayesian network has been implemented in Elvira.

1 Introduction

Discovering causal relationships from observational data is an important problem in empirical science (Druzdzel and Glymour, 1995; Claasen and Heskes, 2010). The purpose of the present article is to describe the process of learning with TETRAD IV to develop a Bayesian network (BN) in the domain of depression in elderly people from a database with complete observations. The BN can then be used to evaluate and investigate depression (Butz et al., 2009). BNs can be considered to lie at the intersection of artificial intelligence, statistics and probability (Pearl, 2000) and constitute a representation formalism strongly related

to knowledge discovery and data mining (Heckerman, 1997). BNs are a kind of probabilistic graphical model (PGM) (Larrañaga and Moral, 2011), which combines graph theory (to help in the representation and resolution of complex problems) and probability theory (as a way of representing uncertainty). A PGM is defined by a graph where nodes represent random variables and arcs represent dependencies among such variables. The graphical structure captures the compositional structure of the causal relations and general aspects of all probability distributions that factorize according to that structure (Glymour, 2003). A PGM is called a BN when the graph connecting its variables is a directed acyclic graph (DAG). A BN may be


causal or not, depending on whether the arcs in the DAG represent direct causal influences. If the arcs represent causal influences we talk of causal BNs (Spirtes et al., 1993). The assumption that arcs represent causality is very useful in the elicitation process (Buntine, 1996).

In Section 2 we analyze the problems associated with depression in the elderly. In Section 3 we describe the process of learning the structure from data using TETRAD IV. In Section 4 we present the process of learning the parameters. In Section 5 we present the use of an approximate updater. In Section 6 we show that the depression variable correctly classifies 80.30% of the cases under study. In Section 7 we present a BN to study depression in the elderly, implemented in Elvira. Finally, some conclusions and further remarks are given in Section 8.

2 Depression in the elderly

Depressive symptomatology is especially relevant to psychology and mental health: it is the fourth leading cause of disability in Spain and is expected to become the second within the next five years (Spanish Ministry of Health, 2007). In the elderly, four out of ten doctor's appointments are associated with this pathology. It is not yet possible to establish the cause of depression. Therefore, we considered it interesting to explore which variables may influence depressive symptoms and which are influenced by them.

Although the global manual of mental illness created by the APA (American Psychiatric Association, 2002) does not set specific margins for the elderly, the pathoplasty differs, usually appearing in an atypical and unspecific way, accompanied by other pathologies that are often characteristic of this age (Jiménez et al., 2006). Thus, we followed the DSM-IV-TR criteria, which are valid for all age groups. We also included other variables that, according to the literature, may be related to depressive symptoms in the elderly (Lang et al., 2010), such as age, gender, marital status, presence of cognitive impairment, level of physical dependence, coping, and psychological well-being. At the same time, we tried to examine the relationships between them.

3 Learning the structure from data

The problem of discovering the causal structure grows with the number of variables (Sucar and Martínez-Arroyo, 1998; Cheng et al., 2002): for each pair of variables X and Y there are four possible kinds of connection: X → Y, X ← Y, X ↔ Y, or no connection at all. The number of distinct possible causal arrangements of n variables is therefore 4 raised to the number of pairs of variables (Glymour, 2003). With twenty-two variables there would be 4^(21+20+···+2+1) = 4^231 ≈ 1.190853 × 10^139 possible arrangements of connections between variables.
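This figure is easy to check; the following minimal Python sketch (our own illustration, not part of the original paper) reproduces the count:

    # Each unordered pair of variables admits four kinds of connection:
    # X -> Y, X <- Y, X <-> Y, or no edge (Glymour, 2003).
    n = 22
    pairs = n * (n - 1) // 2             # 21 + 20 + ... + 1 = 231
    arrangements = 4 ** pairs            # exact integer arithmetic in Python
    print(pairs)                         # 231
    print(f"{float(arrangements):.6e}")  # 1.190853e+139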

Figure 1: Pattern obtained when the GES algorithm performs the search on the data.

Basically, there are two approaches to structure learning (Jensen and Nielsen, 2007): i) search-and-score structure learning, and ii) constraint-based structure learning. Search-and-score algorithms assign a number (score) to each BN structure and look for the structure with the highest score. Constraint-based algorithms perform a set of conditional independence tests on the data (Margaritis, 2003); based on these tests an undirected graph is generated, and using additional independence tests the network is converted into a BN. PC is the best-known example of a constraint-based learning algorithm; for discrete or categorical variables, PC uses chi-square tests of independence or conditional independence (Spirtes et al., 1993).
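Both families are also available in open-source tools; the sketch below is our own illustration using the Python library pgmpy (not the paper's TETRAD IV runs), assuming the 66 × 22 categorical data set is loaded from a hypothetical CSV file:

    import pandas as pd
    from pgmpy.estimators import HillClimbSearch, BicScore, PC

    # Hypothetical file holding the 66 cases x 22 categorical variables.
    df = pd.read_csv("depression_elderly.csv")

    # Search-and-score: greedy hill climbing over structures, BIC-scored.
    hc = HillClimbSearch(df)
    scored_dag = hc.estimate(scoring_method=BicScore(df))

    # Constraint-based: PC with chi-square (conditional) independence
    # tests, as used for categorical variables (Spirtes et al., 1993).
    pc = PC(df)
    constraint_dag = pc.estimate(ci_test="chi_square", significance_level=0.05)

    print(sorted(scored_dag.edges()))
    print(sorted(constraint_dag.edges()))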

In order to obtain the DAG we use TETRAD IV (Scheines et al., 1998).


Figure 2: Pattern obtained when the GES algorithm performs the search on the data with prior knowledge.

Table 1: Search algorithm results on the data.

Search      df    χ²        p-value
PC          224   402.3027  0
PCPattern   214   297.9925  0.0001
PCD         224   402.3027  0
CPC         214   283.286   0.0011
JPC         226   396.2216  0
JCPC        226   396.2216  0
GES         200   195.3489  0.5796

The software is available as freeware in the TETRAD IV suite of algorithms at www.phil.cmu.edu/projects/tetrad. Several search algorithms integrated in TETRAD IV were tested on the data (PC, PCPattern, PCD, CPC, JPC, JCPC and GES), and we found GES (greedy equivalence search) (Chickering, 2002) to be the best behaved with respect to our data (see Table 1). GES was the only algorithm that produced a completely connected graph, so we chose it.

The search process gives a pattern in which most of the arc directions were defined; that is, we obtained a partially directed graph G. The only connections left without any direction among variables were maritalstatus – sex and brcsme – caerep – visionproblems. It is also possible to obtain a DAG from this pattern; the connections whose direction was left undefined are marked with yellow arrows in Figure 1.

Table 2: GES algorithm applied to the data (model 1), to the data with prior knowledge (model 2), and with prior knowledge obtained from PC, PCD, CPC, JPC and JCPC (model 3, model 4, model 5, model 6 and model 7 respectively).

Model  Search  df    χ²        p-value
1      GES     200   195.3489  0.5796
2      GES     201   201.0426  0.4859
3      GES     199   192.754   0.6114
4      GES     199   192.754   0.6114
5      GES     200   198.1167  0.5244
6      GES     199   194.5962  0.5749
7      GES     199   194.5962  0.5749

Table 3: Comparisons among GES models.

Model comparison    χ²      df   p-value
model 2 versus 1    5.6937  1    0.9829740
model 2 versus 3    8.2886  2    0.9841455
model 1 versus 3    2.5949  1    0.8927918
model 2 versus 4    8.2886  2    0.9841455
model 1 versus 4    2.5949  1    0.8927918
model 2 versus 6    6.4464  2    0.9601726
model 1 versus 6    0.7527  1    0.6143773
model 5 versus 6    3.5205  1    0.9393858
model 2 versus 5    2.5949  1    0.8927918
model 5 versus 4    5.3627  1    0.9794281
model 5 versus 3    5.3627  1    0.9794281

The resulting graph G′ extends graph G because i) G and G′ have the same adjacencies and ii) if A → B is in G then A → B is in G′. Moreover, G′ is a consistent DAG extension of G because i) G′ extends G, ii) G′ is a directed acyclic graph, and iii) pattern(G) = pattern(G′) (Meek, 2003). The model obtained in this way (only from the original data) we call model 1. Then we performed another search with the GES algorithm, taking the data together with prior knowledge (Heckerman et al., 1995), which was included as tiers to order the variables maritalstatus, sex, brcsme, caerep and visionproblems. The order introduced was the same as in the first search, that is, the yellow directions marked in Figure 1.


Another DAG was obtained, which we call model 2 (see Figure 2). The only difference between the two models is that in model 2 the arc connecting caeafn with worklevel (which was present in model 1) has been removed; as a consequence, the degrees of freedom increased by one unit (Glymour et al., 1986). We also considered a possible model 3, obtained by applying GES to the data with prior knowledge (the connections obtained by the PC algorithm, encoded as tiers). We proceeded in the same way with PCD, CPC, JPC and JCPC, obtaining model 4, model 5, model 6 and model 7 respectively. A summary of the results can be seen in Table 2. In order to compare the models in Table 2 we follow (Bollen, 1989) and compute the differences of the χ² statistics and the corresponding p-values, obtaining Table 3.

The model comparisons make clear that model 2 is the model of choice. Model 2 was compared to an unconstrained model (model 1) and to several constrained models (models 3, 4 and 5). Model 7 has not been considered in the comparison because we obtained the same results as for model 6. Among the small number of models considered, model 2 has strong statistical support; however, many other models could be constructed.
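Each row of Table 3 is a chi-square difference between two nested models from Table 2; the following minimal Python sketch (ours, using scipy) reproduces the model 2 versus model 1 row:

    from scipy.stats import chi2

    # Goodness-of-fit results from Table 2.
    chi2_1, df_1 = 195.3489, 200   # model 1
    chi2_2, df_2 = 201.0426, 201   # model 2

    delta_chi2 = chi2_2 - chi2_1   # 5.6937
    delta_df = df_2 - df_1         # 1

    # Table 3 tabulates the CDF of the difference; the usual p-value of
    # the difference test is the complementary upper tail.
    print(chi2.cdf(delta_chi2, delta_df))  # ~0.98297, as in Table 3
    print(chi2.sf(delta_chi2, delta_df))   # ~0.01703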

4 Learning the parameters

In a BN the conditional probability distributions are called the parameters. A prior probability, in a particular model, is the probability of some event before updating the probability of that event, within the framework of that model, using new information. A posterior probability is the probability of an event after its prior probability has been updated, within the framework of some model, based on new information. The parameters have been obtained using the estimator box of TETRAD IV. A Dirichlet estimation has been performed, which estimates a Bayes instantiated model using a Dirichlet distribution for each category. The probability of each value of a variable (conditional on the values of the variable's parents) is estimated by adding, for each parent configuration, the number of cases in which the variable takes that value to a prior pseudocount (1 by default; see Table 4), and then dividing by the total number of cases in the prior and in the data with that configuration of the parent variables (see Table 5).

Table 4: Dirichlet parameters (pseudocounts) for variable a = age conditional on combinations of its parent values (with the total count per row shown), where s and b denote variables selfratedhealth and barthelindex respectively, and d. = dependent, i. = independent.

s     b    a ≤ 76.5   a > 76.5   t.c.
bad   d.   1.0000     1.0000     2.0000
bad   i.   1.0000     1.0000     2.0000
good  d.   1.0000     1.0000     2.0000
good  i.   1.0000     1.0000     2.0000

Table 5: Expected values of the prior probabilities for variable a = age conditional on combinations of its parent values (with the total pseudocount per row shown), where s and b denote variables selfratedhealth and barthelindex respectively.

s     b    a ≤ 76.5   a > 76.5   t.c.
bad   d.   0.5000     0.5000     2.0000
bad   i.   0.5000     0.5000     2.0000
good  d.   0.5000     0.5000     2.0000
good  i.   0.5000     0.5000     2.0000

Using a Dirichlet distribution with parameters α1, α2, . . . , αk, the prior probabilities are (Neapolitan, 2004):

    p(x1) = α1/m,  p(x2) = α2/m,  . . . ,  p(xk) = αk/m        (1)

where m = α1 + . . . + αk. After seeing x1 occur s1 times, x2 occur s2 times, . . . , and xk occur sk times in n = s1 + . . . + sk trials, the posterior probabilities are as follows:

    P(x1 | s1 . . . sk) = (α1 + s1)/(m + n)
    P(x2 | s1 . . . sk) = (α2 + s2)/(m + n)                    (2)
    . . .
    P(xk | s1 . . . sk) = (αk + sk)/(m + n)
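To make the estimation concrete, the following minimal Python sketch (our own illustration, not TETRAD IV code) applies equation (2); the observed counts (0, 3) are hypothetical values chosen to be consistent with the first row of Table 6:

    def dirichlet_posterior(alphas, counts):
        """Expected posterior probabilities for one parent configuration,
        following equation (2): (alpha_i + s_i) / (m + n)."""
        m, n = sum(alphas), sum(counts)   # total pseudocount, total cases
        return [(a + s) / (m + n) for a, s in zip(alphas, counts)]

    # Variable age given selfratedhealth = bad, barthelindex = dependent:
    # pseudocounts (1, 1) as in Table 4; hypothetical observed counts (0, 3)
    # match the first row of Table 6.
    print(dirichlet_posterior([1, 1], [0, 3]))   # [0.2, 0.8]; m + n = 5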


Table 6: Expected values of the posterior probabilities for variable a = age conditional on combinations of its parent values, where s and b denote variables selfratedhealth and barthelindex respectively.

s     b    a ≤ 76.5   a > 76.5   t.c.
bad   d.   0.2000     0.8000     5.0000
bad   i.   0.6667     0.3333     12.0000
good  d.   0.2500     0.7500     4.0000
good  i.   0.3774     0.6226     53.0000

The numbers α1, α2, . . . , αk are usually obtained from our experience, as if we had already seen the first outcome occur α1 times, the second outcome occur α2 times, . . . , and the last outcome occur αk times. These αi are the pseudocounts of TETRAD IV. Several possibilities were considered: i) α1 = α2 = . . . = αk = 1, in which case we have no knowledge at all concerning the relative frequencies; ii) α1 = α2 = . . . = αk < 1, when it is believed that the relative frequency of the i-th value is around αi/N; iii) α1 = α2 = . . . = αk > 1, when someone wants to impose his or her relative frequencies on the system. However, we obtained better classification results using a pseudocount of 1 (case i) than in the other cases. Another possibility was to express prior indifference with a prior equivalent sample size (Neapolitan, 2004): the variables in our problem have two or three values, so we assigned a pseudocount of 1.5 to each value of the two-valued variables and a pseudocount of 1 to the three-valued ones. The probabilities were slightly different, but again, for example, the depression variable classified worse.

From the pseudocounts, TETRAD IV determines the conditional probability of a category (see Table 6). This estimation is done by taking the pseudocount of a category and dividing it by the total count for its row.

5 Updater

An approximate updater has been implemented. The updater of TETRAD IV allows us to update the values of the remaining variables given a set of predetermined values (evidence) in the model.

Figure 3: Updater with some evidence, worklevel = medium, brcsme = no.

There is also the possibility of using an exact updater, but the approximate one is very fast. In Figures 3 and 4 some evidence has been given to the updater: sex = woman, age > 76.5, maritalstatus = widow, worklevel = medium, deafprob = yes, caefsp = presence, caeafn = presence, caerlg = presence; the difference between the two figures is that brcsme = no in Figure 3 and brcsme = yes in Figure 4. With the updater we obtained a clear variation in the depression variable: the probability of depression = yes increases to 0.75 when brcsme is set to no (see Figure 3), and decreases to 0.125 when brcsme is set to yes. The blue lines, and the values listed across from them, indicate the probability that each variable takes on the given value in the input instantiated model. The red lines indicate the probability that the variable depression takes on the given value, given the evidence that we added to the updater.

Figure 4: Updater with some evidence, worklevel = medium, brcsme = yes.
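The same kind of updating can be reproduced with any Bayesian network engine. Below is our own toy sketch in Python with pgmpy, using a hypothetical two-node fragment whose made-up probabilities merely echo the values above; it is not the learned model:

    from pgmpy.models import BayesianNetwork
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.inference import VariableElimination

    # Hypothetical fragment: brcsme -> depression, states 0 = no, 1 = yes.
    model = BayesianNetwork([("brcsme", "depression")])
    model.add_cpds(
        TabularCPD("brcsme", 2, [[0.5], [0.5]]),
        TabularCPD("depression", 2,
                   [[0.25, 0.875],   # P(depression = no  | brcsme = no, yes)
                    [0.75, 0.125]],  # P(depression = yes | brcsme = no, yes)
                   evidence=["brcsme"], evidence_card=[2]),
    )

    engine = VariableElimination(model)
    # Analogue of the updater: P(depression | brcsme = no).
    print(engine.query(["depression"], evidence={"brcsme": 0}))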

6 Validation

TETRAD IV takes as input a categorical data set and a Bayes instantiated model and, for a given variable, estimates that variable's value in each case. Figure 5 shows that 80.30% of the cases are correctly classified with respect to the depression variable. Figure 6 shows the Receiver Operating Characteristic (ROC) curve, also known as the relative operating characteristic, a graphical plot of the sensitivity, or true positive rate, versus the false positive rate. The ROC curve falls within the 1 × 1 quadrant, and the area under it is used as an indicator of predictive goodness. The area under the curve (AUC) is defined as the probability of correctly classifying a randomly selected pair of cases (one positive and one negative).

In Figure 5 we can see the contingency table or confusion matrix, where TN = 47 (true negatives) is the number of correct predictions that an instance is negative, FP = 2 (false positives) is the number of incorrect predictions that an instance is positive, FN = 11 (false negatives) is the number of incorrect predictions that an instance is negative, and TP = 6 (true positives) is the number of correct predictions that an instance is positive. Thus, TNR = 47/(47+2) (true negative rate, the proportion of negative cases that were classified correctly), TPR = 6/(11+6) (true positive rate, the proportion of positive cases that were correctly identified), FPR = 2/(47+2) (false positive rate, the proportion of negative cases that were incorrectly classified as positive), and FNR = 11/(11+6) (false negative rate, the proportion of positive cases that were incorrectly classified as negative). The precision P, the proportion of predicted positive cases that were correct, is P = 6/(2+6) = 0.75. The accuracy for the depression variable is 80.30% (ACC = (TP+TN)/(TP+FN+FP+TN) = (47+6)/66 = 0.8030). From these figures we can see how the BN provides a computationally efficient prediction system for the study of depression.
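These rates follow directly from the four cells of the confusion matrix; the short Python sketch below (ours) recomputes them from the counts reported in Figure 5:

    # Confusion-matrix cells for the depression variable (Figure 5).
    TP, TN, FP, FN = 6, 47, 2, 11

    TNR = TN / (TN + FP)      # specificity, 47/49 ~ 0.959
    TPR = TP / (TP + FN)      # sensitivity, 6/17 ~ 0.353
    FPR = FP / (FP + TN)      # 2/49 ~ 0.041
    FNR = FN / (FN + TP)      # 11/17 ~ 0.647
    P = TP / (TP + FP)        # precision, 6/8 = 0.75
    ACC = (TP + TN) / (TP + TN + FP + FN)   # 53/66 ~ 0.8030

    print(f"accuracy = {ACC:.4%}")          # accuracy = 80.3030%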

Figure 5: Percentage correctly classified for depression = no.

7 Bayesian network for depression

TETRAD IV has been developed to be able to build a Bayesian network (Spirtes et al., 1993), since the network can then be used as an expert system built from a database.


Figure 6: ROC plot for depression = no.

Figure 7: Bayesian network with 22 nodes implemented in Elvira.

However, TETRAD IV is not a Bayesian network program: it does not provide a graphical interface showing the graphical and the quantitative structure at the same time, and it cannot work with missing data, as Bayesian network tools usually do.

From the parameters and structure obtained by TETRAD IV, a Bayesian network has been implemented using Elvira, a tool for building and evaluating probabilistic graphical models; the software can be obtained for free at http://www.ia.uned.es/~elvira (see Figure 7).

The independencies encoded in the graph are translated to the probabilistic model. The joint probability distribution of the BN of Figure 7 requires the specification of 22 conditional probability tables, one for each variable conditioned on its parent set.

The joint probability distribution factorizes as a product of conditional distributions, with the dependence/independence structure given by the DAG:

    P(x1, . . . , x22) = ∏_{i=1}^{22} P(xi | Pa(xi))        (3)
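For illustration, equation (3) can be evaluated mechanically once the conditional probability tables are available; the Python sketch below (ours, with a hypothetical two-node fragment and made-up numbers, not Elvira's API) shows the idea:

    from math import prod

    def joint_probability(assignment, parents, cpts):
        """assignment: dict variable -> value; parents: dict variable ->
        tuple of its parents (the DAG); cpts: dict variable ->
        {(value, parent values): probability}."""
        return prod(
            cpts[v][(assignment[v], tuple(assignment[p] for p in parents[v]))]
            for v in assignment
        )

    # Hypothetical two-node fragment with made-up numbers.
    parents = {"brcsme": (), "depression": ("brcsme",)}
    cpts = {
        "brcsme": {("no", ()): 0.5, ("yes", ()): 0.5},
        "depression": {("yes", ("no",)): 0.75, ("no", ("no",)): 0.25,
                       ("yes", ("yes",)): 0.125, ("no", ("yes",)): 0.875},
    }
    print(joint_probability({"brcsme": "no", "depression": "yes"},
                            parents, cpts))   # 0.5 * 0.75 = 0.375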

Based on the magnitude and the sign of the influence (Lacave et al., 2011), the networks implemented in Elvira offer the possibility of automatic coloring of links, as can be seen in Figure 8. Most of the connections are red, or positive, links; undefined connections are shown in purple and negative ones in blue. The coloring and the width of the links help to detect wrong influences. There are only two exceptions in the network for depression: wais3pe → cognitiveimpairment and caeeea → depression, for obvious reasons.

Figure 8: Bayesian network with the automatic coloring of links feature of Elvira.

8 Concluding remarks

The structure and the parameters have been obtained from complete data using the software TETRAD IV. ROC curves and the confusion matrix have been taken into account. From the results obtained, a Bayesian network has been implemented using the software Elvira.

Further work is oriented towards the study and evaluation of depression in the elderly using the implemented BN, which can be used to calculate new probabilities when new information (evidence) is introduced, and therefore some interesting conclusions could be reached.


Acknowledgments

This work was financially supported by the UV-INV-AE11-39878 project and the MICINN TIN2009-12359 project ArtBioCom.

References

American Psychiatric Association (APA) (2002). Manual diagnóstico y estadístico de los trastornos mentales. Texto revisado (DSM-IV-TR). Barcelona: Masson.

Bollen, K.A. (1989). Structural equations with latent variables. Wiley Series in Probability and Mathematical Statistics. Wiley.

Buntine, W. (1996). A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8(2):195–210.

Butz, C.J., Hua, S., Chen, J. and Yao, H. (2009). A simple graphical approach for understanding probabilistic inference in Bayesian networks. Information Sciences, 179, 699–716.

Claasen, T. and Heskes, T. (2010). Learning causal network structure from multiple (in)dependence models. Proceedings of the 5th European Workshop on Probabilistic Graphical Models, 81–89, Helsinki, Finland.

Cheng, J., Greiner, R., Kelly, J., Bell, D. and Liu, W. (2002). Learning Bayesian networks from data: an information-theory based approach. Artificial Intelligence, 137, 43–90.

Chickering, D. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(3), 507–554.

Druzdzel, M.J. and Glymour, C. (1995). What do college ranking data tell us about student retention: causal discovery in action. Intelligent Information Systems IV. Proceedings of the workshop held in Poland.

Glymour, C. (2003). The mind's arrows: Bayes nets and graphical causal models in psychology. MIT Press.

Glymour, C., Scheines, R., Spirtes, P. and Kelly, K. (1986). Discovering causal structure. Technical Report CMU-PHIL-1.

Heckerman, D. (1997). Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1, 79–119.

Heckerman, D., Geiger, D. and Chickering, D.M. (1995). Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20, 197–243.

Jensen, F.V. and Nielsen, T.D. (2007). Bayesian networks and decision graphs. Information Science and Statistics. Springer.

Jiménez, A.M., Gálvez Sánchez, N. and Esteban Saiz, R. (2006). Depresión y ansiedad. In Sociedad Española de Geriatría y Gerontología (SEGG), Tratado de Geriatría para Residentes, pp. 235–239.

Lacave, C., Luque, M. and Díez, F.J. (2011). Explanation of Bayesian networks and influence diagrams in Elvira. Journal of LaTeX Class Files, 1(11), 1511–1528.

Lang, G., Resch, K., Hofer, K., Braddick, F. and Gabilondo, A. (2010). La salud mental y el bienestar de las personas mayores. Hacerlo posible. Madrid: IMSERSO.

Larrañaga, P. and Moral, S. (2011). Probabilistic graphical models in artificial intelligence. Applied Soft Computing, 1511–1528.

Margaritis, D. (2003). Learning Bayesian network model structure from data. PhD thesis, CMU-CS-03-153, Carnegie Mellon University.

Meek, C. (2003). Causal inference and causal explanation with background knowledge. pp. 403–410.

Spanish Ministry of Health (Ministerio de Sanidad) (2007). Estrategia en salud mental del sistema nacional de salud, 2006. España: Ministerio de Sanidad y Consumo.

Neapolitan, R.E. (2004). Learning Bayesian networks. Prentice Hall.

Pearl, J. (2000). Causality: models, reasoning and inference. Cambridge University Press.

Sucar, L.E. and Martínez-Arroyo, M. (1998). Interactive structural learning of Bayesian networks. Expert Systems with Applications, 15, 325–332.

Scheines, R., Spirtes, P., Glymour, C., Meek, C. and Richardson, T. (1998). The TETRAD project: constraint based aids to causal model specification. Multivariate Behavioral Research, 33(1), 65–117.

Spirtes, P., Glymour, C. and Scheines, R. (1993). Causation, prediction and search. Springer-Verlag.