

UM’03
9th International Conference on User Modeling
http://www2.sis.pitt.edu/~um2003/
22 – 26 June 2003, Johnstown, Pennsylvania, USA

Papers for the UM’03 Workshop

Machine Learning, Information Retrieval and User Modeling
http://www.cs.rutgers.edu/mlirum/mlirum-2003/
Johnstown, PA, 22/23 June 2003

Edited by
Sofus A. Macskassy
Ross Wilkinson
Ayse Goker
Mathias Bauer


Machine Learning, Information Retrieval and User Modeling

Organizing Committee

Sofus A. Macskassy, Leonard N. Stern School of Business, NYU
Ross Wilkinson, CSIRO
Ayse Goker, Robert Gordon University
Mathias Bauer, DFKI


Organizing Committee

Sofus A. Macskassy, Stern School of Business, NYU ([email protected])
Ross Wilkinson, CSIRO ([email protected])
Ayse Goker, Robert Gordon University ([email protected])
Mathias Bauer, DFKI ([email protected])

Program Committee

Nick Belkin, Rutgers University ([email protected])
Daqing He, University of Maryland ([email protected])
Martin Müller, Universität Osnabrück ([email protected])
Ralph Schäfer, DFKI ([email protected])
Ingrid Zukerman, Monash University ([email protected])


Preface

User model acquisition is a difficult problem. In Machine Learning, the information available to a user modeling system is usually limited, and it is hard to infer assumptions about the user that are strong enough to justify non-trivial conclusions. Classical acquisition methods such as user interviews, application-specific heuristics, and stereotypical inferences are often inflexible and unsatisfying. In Information Retrieval, user models have been limited to lists of terms relevant to an information need; the list is usually very short for ad hoc querying and longer for information filtering tasks.

Information systems that could benefit from having a user model should be able to adapt to individual users, to learn about their preferences and attitudes during the interaction (that is, to construct a user profile), and to memorize them for later use. Moreover, these user profiles could serve as a starting point for the creation of user communities based on shared interests or goals. Further, the system should be able to update its model if a user's interests change.

Machine Learning (ML) is concerned with the formation of models from observations; hence, learning algorithms seem to be promising candidates for user model acquisition systems. Information Retrieval (IR) is concerned with the study of systems for representing, organising, retrieving and delivering information based on content. User modeling is the glue: the better we model users, the better we can satisfy their information needs.

Our main goal is to build further bridges between three communities: User Modeling, Machine Learning, and Information Retrieval. We also aim to provide a forum in which researchers who are not necessarily familiar with the diverse aspects of UM/ML/IR can get acquainted with the possibilities of collaboration between the communities. We welcome your contributions to addressing these issues.

Sofus A. Macskassy
Ross Wilkinson
Ayse Goker
Mathias Bauer


The primary questions we would like to address are:

1) How can we apply Machine Learning and Information Retrieval techniques to acquire and continuously adapt user models?

   a) What role can and should the user play in reviewing and refining their own model?
   b) What are the issues in modeling the user vs. modeling the intermediary for IR?
   c) How can intelligent agents be used when in charge of managing the interaction with an information system?
   d) How can we evaluate user-adaptive IR systems? Should evaluation be based on retrieval effectiveness, or on user experience, reaction, and satisfaction?
   e) Where and how does the user fit into the picture? What kind of user feedback is helpful or needed, and how can the user query and use the learned model?
   f) How can ML be used for building user communities based on common interests and background? How do you apply IR techniques to these?
   g) In the case of the description of a concrete application: Why did you choose these particular techniques? How did they affect the success of your application? What general conclusions can you draw from your experiences?

2) SIG issues:

   a) What has been done since the last SIG meeting?
   b) How can SIG facilities be made more useful?
   c) What are the possibilities for cooperation between SIG members?
   d) What activities should the SIG engage in?
   e) How can we get more people involved?
   f) What are the issues/problems that drive current research?
   g) What are the ways we can combine these three fields such that changes in any one field do not break the integrated system? Are there any standards or good practices for integration that can be identified to address this issue at this stage?

Additional themes and topics we would like to explore as they relate to the questions above:

• Moving user models beyond queries in IR
• Matching algorithms when user models are more sophisticated
• Exploring information delivery models when user models are more sophisticated
• Acquisition of user models appropriate to an information environment
• ML solutions to support the navigation of Web sites
• ML solutions for intelligent information retrieval, especially in large repositories
• ML for extraction and management of user profiles
• ML for building user communities based on common interests and background
• User interaction in intelligent IR
• Evaluation of user-adaptive IR systems
• Intelligent user interfaces in IR
• Personalization of Web sites
• Personalization for Web users


WORKING AGENDA
(20–25 minutes allotted for each presentation, followed by 5–10 minutes of discussion)

7:30 – 9:00    Registration

9:00 – 9:15    Welcome and Introduction to Workshop (Sofus A. Macskassy)
               Summary of past UM-ML-IR/ML4UM workshops; relation to the current status of the field and to this workshop

9:15 – 10:30   Session: Machine Learning for User Model Acquisition (Chair: Sofus A. Macskassy)
               Discriminant Analysis as a Machine Learning Method for Revision of User Stereotypes of Information Retrieval Systems (Xiangmin Zhang)
               ML4UM for Bayesian Network User Model Acquisition (Frank Wittig)
               30 minutes of open related discussion

10:30 – 11:00  Coffee break

11:00 – 12:30  Session: Retrieval and Users (Chairs: Ingrid Zukerman / Daqing He)
               Visualization of a user model in educational document retrieval (Swantje Willms)
               A Reinforcement Strategy for (Formal) Concept and Keyword Weight Learning for Adaptive Information Retrieval (Rohan K. Rajapakse and Mike Denham)
               30 minutes of open related discussion

12:30 – 1:45   Lunch

1:45 – 3:15    Session: Users, Navigation, and Environments (Chair: Cecile Paris)
               A Connectionist Model of Spatial Knowledge Acquisition in a Virtual Environment (Corina Sas, Ronan Reilly and Gregory O'Hare)
               Statistical machine learning for tracking hypermedia user behavior (Sylvain Bidel, Laurent Lemoine, Frederic Piat, Thierry Artires, Patrick Gallinari)
               30 minutes of open related discussion

3:15 – 3:45    Coffee break

3:45 – 5:15    Session: SIG Discussions & Wrap-up Discussion (Chairs: Sofus Macskassy and Ingrid Zukerman)
               Open SIG discussions on issues related to the workshop and on issues not covered by the presentations; summary of the present workshop; action points

5:30 – 6:45    Joint workshop panel

7:15 – 8:30    Dinner


Table of Contents

Discriminant Analysis as a Machine Learning Method for Revision of User Stereotypes of Information Retrieval Systems ......... 1
    Xiangmin Zhang

ML4UM for Bayesian Network User Model Acquisition ......... 11
    Frank Wittig

Visualization of a user model in educational document retrieval ......... 21
    Swantje Willms

A Reinforcement Strategy for (Formal) Concept and Keyword Weight Learning for Adaptive Information Retrieval ......... 29
    Rohan K. Rajapakse and Mike Denham

A Connectionist Model of Spatial Knowledge Acquisition in a Virtual Environment ......... 40
    Corina Sas, Ronan Reilly and Gregory O'Hare

Statistical machine learning for tracking hypermedia user behavior ......... 48
    Sylvain Bidel, Laurent Lemoine, Frederic Piat, Thierry Artires, Patrick Gallinari


Discriminant Analysis as a Machine Learning Method for Revision of User Stereotypes of Information Retrieval Systems

Xiangmin Zhang

School of Communication, Information and Library Studies, Rutgers University, 4 Huntington Street, New Brunswick, NJ 08901

[email protected]

Abstract. This paper proposes to use the discriminant analysis technique as a machine learning method to adjust stereotype memberships, based on the in-depth, task-related knowledge about users contained in user models. The paper reports an empirical study of user stereotypes for information retrieval (IR) systems. The participants were first assigned to stereotypes based on their self-reported characteristics. Their memberships in the stereotypes were then tested and predicted using discriminant analysis, based on their IR knowledge. The pre-assigned and predicted memberships of each stereotype were compared. The study demonstrates that the discriminant analysis technique can detect conflicts between individual users' knowledge and the assumption, held by stereotypes, that all members of a stereotype share common knowledge. The technique can be used to revise or reclassify a person's stereotype membership based on the person's knowledge. Implications and future directions of the study are discussed.

1. Introduction

The stereotype technique has been widely used in user modeling systems (e.g., Rich, 1989; Chin, 1989; Brajnik, Guida, & Tasso, 1990; Kobsa, Muller, & Nill, 1994; Fernandez-Manjon, Fernandez-Valmayor, & Fernandez-Chamizo, 1998; Paterno & Mancini, 2000). One advantage of using the stereotype technique is that knowledge about a particular user can be inferred from the related stereotype(s) as much as possible, without explicitly going through the knowledge elicitation process with each individual user. Another advantage is that information about user groups/stereotypes can be maintained with low redundancy (Rich, 1989; Fink & Kobsa, 2000).

Nevertheless, using stereotypes is not without problems. The major problems with stereotype-based user modeling concern the tasks of correctly classifying the user and of keeping the knowledge in the user model consistent (solving conflicts between assumptions in stereotypes and between the stereotypes themselves) (Bellika, Hartvigsen, & Widding, 1998). These problems have been pointed out by many researchers, e.g., Bellika, Hartvigsen, & Widding (1998), Brajnik, Guida and Tasso (1990), and Shapira, Shoval, & Hanani (1997).


The problems are largely due to the way a stereotype is formed. Most stereotypes are formed merely on the basis of users' external characteristics and subjective human judgment, usually by a number of users/experts (Shapira, Shoval, & Hanani, 1997). To improve the accuracy of stereotypes, various ways of constructing user classes have been proposed, including fuzzy set theory (Mitchell, Woodbury, & Norcio, 1994) and a user questionnaire combined with cluster analysis of various user data (Shapira, Shoval, & Hanani, 1997).

The current paper proposes to use the discriminant analysis technique as a machine learning method to adjust stereotype memberships based on in-depth user knowledge, after the stereotypes have been implemented. The paper demonstrates the use of discriminant analysis in an empirical study of the user stereotypes of information retrieval (IR) systems. The study used the technique to detect conflicts between individual users' knowledge about IR systems and their stereotype assignments, which were based on explicit user characteristics. The participants were then re-assigned (predicted) by the discriminant analysis into the appropriate stereotype(s).

In the remainder of this paper, the discriminant analysis technique is first briefly introduced. The stereotypes and the participants of the empirical study are then described, followed by a discussion of the user knowledge that served as the input for the discriminant analysis. The results of the data analyses are presented in the next section. Finally, the paper concludes with a discussion of the results and implications for the use of the technique in user modeling.

2. Discriminant Analysis Technique

Discriminant analysis extracts from group members' data a group classification criterion (SAS Institute Inc., 1988), or classification rule (Huberty, 1994), for classifying each observation in the group. Based on this classification criterion, a posterior probability of group membership is calculated for each member. This posterior probability is the probability of an individual's membership in a group given the individual's data. The individual is predicted to be in the group for which the person's posterior probability is maximal (compared to the posterior probabilities for the other groups).

If an individual's calculated posterior probability of belonging to a pre-determined group is not the maximal one, the individual's initial assignment is judged to be a "misclassification" into that group. An error rate (the probability of misclassification) is then estimated for the whole group.

The reader is referred to Huberty (1994) for a detailed technical description of the technique. The technique can be used as a machine learning method for determining stereotype memberships: it re-assigns stereotype members based on the new knowledge the system has obtained about individual users. It is particularly suitable for maintaining implemented stereotypes because it does not break up the initially defined stereotypes.
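Purely as an illustration of this procedure (the study itself used SAS, as described later), a minimal sketch with scikit-learn could look as follows; the factor_scores and pre_assigned arrays are hypothetical placeholders for the study's data.

```python
# Hypothetical sketch of the membership-revision step (the study used SAS,
# not scikit-learn; the data here is random placeholder material).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
factor_scores = rng.normal(size=(64, 9))      # 64 participants x 9 factor scores
pre_assigned = rng.integers(0, 4, size=64)    # pre-assigned stereotype labels

# Equal prior probabilities, i.e., no assumption about population sizes.
lda = LinearDiscriminantAnalysis(priors=np.full(4, 0.25))
lda.fit(factor_scores, pre_assigned)

posterior = lda.predict_proba(factor_scores)  # posterior probability per group
predicted = posterior.argmax(axis=1)          # assign to the maximum-posterior group

misclassified = predicted != pre_assigned     # initial assignment judged wrong
print(f"Estimated error rate: {misclassified.mean():.2f}")
```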


3. Stereotypes and Participants in the Study

The stereotypes evaluated in this study are described in Figure 1. They are organized in a hierarchical structure along four dimensions: educational and professional status, first (or native) language, academic discipline, and level of computer experience. A user belongs to different stereotypes when categorized on different dimensions. These stereotypes are commonly used in IR user studies. For example, academic discipline has been found to have an impact on IR performance: engineering and science students performed better than those in the social sciences and humanities, particularly on complex search tasks (e.g., Borgman, 1989; Zhang, 2002). If a user were a social sciences student and the IR system were able to identify the person's academic discipline, the system could provide tailored help in constructing a powerful search query, and could present results to the user with an appropriate level of technical detail.

Fig. 1. User Stereotypes of IR Systems. (The figure shows a hierarchy rooted at the generic user, branching along four dimensions: educational and professional status (library/information professional, graduate students, undergraduates, high school students); first/native language (native English, non-English); academic area (science/engineering users, social sciences/humanities users); and computer experience (high-, medium-, and low-level users).)

Sixty-four people participated in the study. They were recruited from four populations: professional librarians and information specialists, graduate students (both doctoral and master's), undergraduate students, and high school students. The participants were pre-assigned to the corresponding stereotypes based on their self-reported explicit characteristics. The distribution of participants over the different stereotypes is summarized in Table 1.

In Table 1, the first column lists the numbers of participants by educational and professional status. The second column lists the numbers of participants according to their first language: native English and non-native English (Non-Engl.). The third column presents the distribution of the participants over two academic disciplines: Science/Engineering (abbreviated Sci./Engi.) and Social Sciences/Humanities (abbreviated Soc./Hum.). It should be noted that the participants in these categories


were only university students (both graduate and undergraduate). Librarians and high school students were not included. The last column, column 4, presents the distribution of participants in terms of computing experience.

Table 1. Distribution of Experiment Participants

Educational &          First Language        Academic Discipline     Computer Experience
Professional Status    English   Non-Engl.   Sci.&Engi.  Soc.&Hum.   High   Med.   Low
Librarian              4         4           N/A1        N/A         8      0      0
Graduate               10        8           9           9           9      9      0
Undergraduate          4         10          6           8           7      4      3
High School            7         17          N/A2        N/A         2      9      13
Total                  25        39          15          17          26     22     16

1 "N/A": not applicable to librarians. They were recruited as a group particularly trained for information retrieval tasks and were not asked to state their undergraduate or graduate disciplines other than library and information science.
2 The high school students did not yet have a formal academic orientation.

4. User Knowledge about IR Systems

When considering user modeling, the first question to ask is "what is being modeled" (Sparck Jones, 1989), i.e., what types of user properties should be modeled. Bushey, Manuney, & Deelman (1999) suggest that users be categorized based upon the characteristics/behavior that are important to the design of the related system.

In this study, users' knowledge about IR systems was modeled and analyzed to see if members of a stereotype would have the same knowledge level. This knowledge is considered as an important factor affecting a user’s search performance (Allen, 1996). An assumption is made that a member of a stereotype has the knowledge that other members within this group commonly have. Table 2. Concepts and Attributes Used in Final Data Analysis

Concepts Attributes 1. browsing 2. classification

1. form/process

3. data structure 4. document content

2. targeted/untargeted

5. feedback 6. information need 7. interface 8. query 9. search

3. specific to IR systems /applicable to all information systems (ISs)

A participant's IR knowledge is represented by the person's ratings of 9 concepts about IR systems against three attributes (Zhang & Chignell, 2001). (The initial ratings were made against 8 attributes; significant differences, however, were found on only these 3.) These concepts and attributes are listed in Table 2, and the sample worksheet that the participants used to rate the concepts against the attributes is illustrated in Fig. 2.

Fig. 2. Sample Concept Rating Worksheet

Concept (1)
Attribute(1) (left pole)   1  2  3  4  5  X   Attribute(1) (right pole)
Attribute(2) (left pole)   1  2  3  4  5  X   Attribute(2) (right pole)
...
Attribute(n) (left pole)   1  2  3  4  5  X   Attribute(n) (right pole)

On the worksheet, all attributes were transformed into five-point scales, with "1" at the left pole and "5" at the right pole of each attribute. In case some participants had difficulty understanding an attribute or a concept, or thought an attribute was not applicable to a concept, a "not applicable" option, represented by an "X", was added to the scales.

To reveal unexpected dimensions (or factors) among the original variables and to reduce the number of variables (Mulaik, 1972), the original ratings were summarized using factor analysis. Using the principal components approach with varimax rotation, the original 27 variables (9 concepts x 3 attributes) were transformed into principal factors. The first nine factors were selected because their eigenvalues were greater than 1, a common criterion in factor analysis. These 9 factors accounted for 68% of the total variance in the original ratings.

Each of the 9 factors represented certain original variables (ratings). Factor loadings and interpretations of the factors are summarized in Table 3.

On each factor, a participant had a factor score. A high factor score means the concepts were rated on the high value end of the attribute scale in the factor. A low score means the concepts were rated on the low value end of the scale.
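As an illustration only (the original analysis was done in SAS), the extraction-and-rotation pipeline could be sketched with the factor_analyzer package as follows; the ratings matrix is a hypothetical placeholder, and "X" ratings would first need to be handled as missing data.

```python
# Hypothetical sketch of the factor analysis step (original study used SAS).
import numpy as np
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(64, 27)).astype(float)  # 9 concepts x 3 attributes

# First pass without rotation, just to inspect the eigenvalues.
fa = FactorAnalyzer(n_factors=27, rotation=None, method="principal")
fa.fit(ratings)
eigenvalues, _ = fa.get_eigenvalues()
n_keep = int((eigenvalues > 1).sum())        # keep factors with eigenvalue > 1

# Second pass: retained factors with varimax rotation.
fa = FactorAnalyzer(n_factors=n_keep, rotation="varimax", method="principal")
fa.fit(ratings)
factor_scores = fa.transform(ratings)        # one score per participant per factor
loadings = fa.loadings_                      # report only loadings > .50, as in Table 3
```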

Table 3. Rotated Factor Structure for Concept Ratings

Factor and Variable1                                              Factor Loadings2
Factor 1: Purposefulness of Querying
  Information need: targeted/untargeted                           0.84
  Query: targeted/untargeted                                      0.82
  Search: targeted/untargeted                                     0.69
  Document content: targeted/untargeted                           0.57
Factor 2: Applicability of Data Organization
  Data structure: specific to IR systems/applicable to all ISs    0.77
  Document content: specific to IR systems/applicable to all ISs  0.72
  Feedback: specific to IR systems/applicable to all ISs          0.65
  Interface: specific to IR systems/applicable to all ISs         0.65
  Classification: specific to IR systems/applicable to all ISs    0.50
Factor 3: Function of Querying
  Information need: form/process                                  0.73
  Query: form/process                                             0.70
  Search: form/process                                            0.60
Factor 4: Applicability of Querying
  Query: specific to IR systems/applicable to all ISs             0.82
  Information need: specific to IR systems/applicable to all ISs  0.66
  Search: specific to IR systems/applicable to all ISs            0.53
Factor 5: Applicability of Browsing
  Browsing: specific to IR systems/applicable to all ISs          0.85
  Feedback: form/process                                          0.52
Factor 6: Function of Data Structure
  Data structure: form/process                                    0.74
  Interface: targeted/untargeted                                  0.72
Factor 7: Purposefulness of Browsing
  Browsing: targeted/untargeted                                   0.86
Factor 8: Function of Document Content
  Document content: form/process                                  0.84
Factor 9: Purposefulness of Data Structure
  Data structure: targeted/untargeted                             0.79

1 Variables (ratings) within each factor are presented in descending order of their factor loadings. Each variable consists of a concept and an attribute.
2 Loadings are sorted by factor; only loadings greater than .50 are shown.

5. Results of Discriminant Analysis on Factor Scores

Using the four user characteristics as grouping variables and the 9 factors as predictor variables, discriminant analysis was used to examine whether an individual's predicted stereotype membership, based on the factor scores, was consistent with the person's pre-assigned membership, and, if not, into which stereotype the person should be assigned based on the individual's factor scores.

For this study, equal prior probabilities were used; that is, no assumptions were made about the sizes of the various populations covered in the study.

The analysis was performed using SAS for Windows, version 8.2. The discriminant analysis generated two results for each pre-defined stereotype: the misclassification (error) rate for the whole stereotype, and the predicted (corrected) members of the stereotype. The latter are summarized for all stereotypes, by the 4 grouping variables, in Tables 4a to 4d, respectively. The error rates are illustrated in Figures 3a through 3d.

In Tables 4a to 4d, the results are presented in two dimensions for each stereotype. A row shows the classification results for a pre-defined stereotype, and a column displays the results for a predicted stereotype. For a pre-defined stereotype, the results show into which stereotypes the original members of the stereotype should be classified based on the factor scores, including the current stereotype itself. For a predicted stereotype, the results show from which original pre-defined stereotypes the members of the current predicted stereotype come. Both numbers and percentages of participants are displayed; the second percentage in each cell is the membership percentage predicted by the discriminant analysis.


Table 4a. Classification Results for User Stereotypes on the Educational & Professional Status Variable

Predetermined       Predicted User Stereotypes
User Stereotypes    Librarian         Graduate          Undergraduate     High School       Total1
Librarian           8 (100%), (89%)3  0 (0%), (0%)      0 (0%), (0%)      0 (0%), (0%)      8 (100%)
Graduate            0 (0%), (0%)      11 (61%), (65%)   4 (22%), (24%)    3 (17%), (14%)    18 (100%)
Undergraduate       0 (0%), (0%)      2 (14%), (12%)    6 (43%), (35%)    6 (43%), (29%)    14 (100%)
High School         1 (4%), (11%)     4 (17%), (24%)    7 (29%), (41%)    12 (50%), (57%)   24 (100%)
Total2              9 (100%)          17 (100%)         17 (100%)         21 (100%)         64 (100%)

Table 4b. Classification Results for User Stereotypes on the Native Language Variable

Predetermined         Predicted User Stereotypes
User Stereotypes      Native English     Non-native English   Total1
Native English        25 (100%), (50%)3  0 (0%), (0%)         25 (100%)
Non-native English    25 (64%), (50%)    14 (36%), (100%)     39 (100%)
Total2                50 (100%)          14 (100%)            64 (100%)

Table 4c. Classification Results for User Stereotypes on the Academic Background Variable

Predetermined       Predicted User Stereotypes
User Stereotypes    No Major          Sci./Engi.        Soc.Sci./Hum.     Prof.             Total1
No Major            13 (54%), (65%)3  3 (13%), (18%)    7 (29%), (39%)    1 (4%), (11%)     24 (100%)
Sci./Engi.          3 (20%), (15%)    11 (73%), (65%)   1 (7%), (5%)      0 (0%), (0%)      15 (100%)
Soc.Sci./Hum.       4 (24%), (20%)    3 (18%), (18%)    10 (59%), (56%)   0 (0%), (0%)      17 (100%)
Professional        0 (0%), (0%)      0 (0%), (0%)      0 (0%), (0%)      8 (100%), (89%)   8 (100%)
Total2              20 (100%)         17 (100%)         18 (100%)         9 (100%)          64 (100%)

Table 4d. Classification Results for User Stereotypes on the Computer Experience Variable

Predetermined       Predicted User Stereotypes
User Stereotypes    High              Medium            Low               Total1
High                12 (46%), (71%)3  7 (27%), (41%)    7 (27%), (23%)    26 (100%)
Medium              3 (14%), (18%)    9 (41%), (53%)    10 (45%), (33%)   22 (100%)
Low                 2 (13%), (12%)    1 (6%), (6%)      13 (81%), (43%)   16 (100%)
Total2              17 (100%)         17 (100%)         30 (100%)         64 (100%)

(Notes for Tables 4a to 4d: 1 Number of subjects who were pre-assigned to the group. 2 Number of subjects who were predicted to be in the group by the discriminant analysis. 3 The second percentage in each cell is the percentage of the predicted membership in the stereotype.)


For example, Table 4a describes the results for the stereotypes on the educational & professional status variable. Four stereotypes were defined along this dimension: librarians/information professionals, graduate students, undergraduate students, and high school students. For the librarians/information professionals stereotype, displayed in the first row and first column, the number of pre-determined members is 8, with the total number and percentage appearing in the last cell of the row. None of these 8 members was re-assigned to other types by the discriminant analysis: in the cells under the other 3 types, the numbers and percentages are all zero. However, these 8 original members account for only 89% of the membership predicted by the data analysis, as indicated by the (89%) in the first cell. Going down the column, the other 11% (one member) of the predicted group membership came from the original high school type. This pre-defined high school student was judged to be "mistakenly" classified and should be classified in the librarians/professionals group, so the total for the predicted librarians/information professionals type is 9, as indicated by the total at the bottom of the same column. Tables 4b to 4d display the results for the stereotypes on first/native language, academic background, and computer experience, respectively.
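For illustration, cross-tabulations like Tables 4a to 4d and the per-stereotype error rates plotted in Figures 3a to 3d can be derived from the pre-assigned and predicted labels; the arrays below are hypothetical placeholders rather than the study's data.

```python
# Hypothetical sketch: building a Table-4a-style cross-tabulation and the
# per-stereotype error rates of Figures 3a-3d from classification results.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
pre_assigned = rng.integers(0, 4, size=64)   # placeholder pre-assigned labels
predicted = rng.integers(0, 4, size=64)      # placeholder predicted labels
names = ["Librarian", "Graduate", "Undergraduate", "High School"]

cm = confusion_matrix(pre_assigned, predicted, labels=range(4))
row_pct = cm / cm.sum(axis=1, keepdims=True)   # share of each pre-defined group
col_pct = cm / cm.sum(axis=0, keepdims=True)   # share of each predicted group

error_rates = 1.0 - np.diag(row_pct)           # members re-assigned elsewhere
for name, err in zip(names, error_rates):
    print(f"{name}: error rate {err:.2f}")
```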

Fig. 3a. Error rates for user stereotypes on the educational & professional status variable.
Fig. 3b. Error rates for user stereotypes on the first language variable.
Fig. 3c. Error rates for user stereotypes on the academic background variable.
Fig. 3d. Error rates for user stereotypes on the computer experience variable.
(Bar charts; the error rates shown are discussed in the text.)

Figures 3a to 3d complement Tables 4a to 4d. They show the error rates for stereotypes in each category. For example, Figure 3a depicts the error rates for the stereotypes along the educational and professional status dimension.

As shown in the figure, the error rate for librarians/information professionals is zero, since none of the 8 pre-determined members of this type was re-assigned to other types. The graduate students type had an error rate of 0.39, because 39% of its pre-defined members were re-assigned to other groups: 22% (n=4) were re-assigned to the undergraduate students group and 17% (n=3) were re-assigned as


high school students. The undergraduate students group had the highest error rate, 0.5714: of the 14 pre-assigned undergraduate participants, only 6 (43%) were consistent with their pre-assigned "undergraduate" status. The other 8 (57%) were misclassified: 2 (14%) were predicted as graduate students and another 6 (43%) as high school students. The high school student stereotype had an error rate of 0.5: of its 24 pre-assigned members, 12 (50%) were predicted as "high school" participants by the analysis. The other 12 were misclassified: one (4%) was predicted as a "librarian", 4 (17%) as "graduate" participants, and 7 (29%) as "undergraduate" participants.

6. Discussion and Conclusions

The results of the empirical study show that some stereotypes, such as librarians, are reliable and accurate, while others do not represent their members well with respect to task-related knowledge. Using such inaccurate stereotypes will cause performance problems in a user modeling system. The results also show that the medium type along a dimension (such as undergraduates, between graduate and high school students, or the medium level of computer experience, between high and low) suffers the highest error rates. This probably implies that it is not a good idea to have medium types, because there are no very clear features/attributes to separate the medium type from the types at the two ends of the dimension. A good practice would perhaps be to construct just a few very distinctive stereotypes, such as librarians and students, rather than many for a given system.

The empirical study described in this paper demonstrated that the discriminant analysis technique can detect conflicts between individuals' knowledge and their stereotype assignments based on explicit user characteristics, and can correct inaccurate assignments that do not match an individual's knowledge. As the system learns from its users and accumulates sufficient task-related user data (whether performance data or the users' task-related knowledge), the technique can be used as a machine learning method to automatically adjust the memberships of stereotypes and thus resolve the conflicts between individuals' knowledge and the knowledge held by stereotypes. One advantage of using discriminant analysis is that the pre-defined stereotypes themselves do not need to be changed; only the memberships of the stereotypes are re-assigned.

One limitation of this research is that system performance data for the different user stereotypes, both before and after the revision of stereotype memberships, was not available. Future research is needed to test the effectiveness of the method in real IR tasks with real users, to see whether system and user performance improve after stereotype memberships are adjusted. It is currently not clear how seriously errors in the membership assignments of stereotypes affect IR system performance. The assignment of participants to the various stereotypes could also be done manually by human experts based on the participants' concept ratings, so that the results could be compared with those from the discriminant analysis.


References

Allen, B.L. (1996). Information Tasks: Toward a User-Centered Approach to Information Systems. New York: Academic Press.

Bellika, J.G., Hartvigsen, G., & Widding, R.A. (1998). The virtual library secretary: A user model-based software agent. Personal Technologies, 2:162-187.

Borgman, C. L. (1989). All users of information retrieval systems are not created equal: An exploration into individual Differences. Information Processing & Management, 25(3):237-251

Brajnik, G., Guida, G., & Tasso, C. (1990). User modeling in expert man-machine interfaces: A case study in intelligent information retrieval. IEEE Transactions on Systems, Man and Cybernetics. 20(1):166-185.

Bushey, R., Manuney, J.M., & Deelman T. (1999). The development of behavior-based user models for a computer system, In: Proceedings of the Seventh International Conference on User Modeling, Banff, June 20-24, 1999; SpringerWien New York, pp.109-118.

Chin, D.N. (1989). KNOME: Modeling what the user knows in UC. In A. Kobsa, W. Wahlster (Eds.): User Models in Dialog Systems, Springer-Verlag, Berlin. pp.35-51.

Fernandez-Manjon, B., Fernandez-Valmayor, A., & Fernandez-Chamizo, C. (1998). Pragmatic user model implementation in an intelligent help system. British Journal of Educational Technology, 29(2): 113-123.

Fink, J. & Kobsa, A. (2000). A review and analysis of commercial user modeling servers for personalization on the World Wide Web. User Modeling and User-Adapted Interaction, 10:209-249.

Harvey, C. F., Smith, P. & Lund P. (1998). Providing a networked future for interpersonal information retrieval: InfoVine and user modeling. Interacting with Computers, 10: 195-212.

Huberty, C. (1994). Applied Discriminant Analysis. John Wiley & Sons, New York. Kobsa, A., Muller, D. & Nill, A. (1994). KN-AHS: An adaptive hypertext client of the user

modeling system BGP-MS. In: Proceedings of the Fourth International Conference on User Modeling. Hyannis (MA), pp.99-105.

Mitchell, K., Woodbury, M.A. & Norcio, A.F. (1994). Individualizing user interfaces: Application of the grade of membership (GoM) model for development of fuzzy user classes. Information Science, 1:9-29.

Mulaik, S.A. (1972). The Foundations of factor analysis. McGraw-Hill Inc. p.174.

Paterno, F. & Mancini, C. (2000). Effective levels of adaptation to different types of users in interactive museum systems. Journal of the American Society for Information Science. 51(1):5-13.

Rich, E. (1989). Stereotypes and user modeling. In A. Kobsa, W. Wahlster (Eds.): User Models in Dialog Systems, Springer-Verlag, Berlin. pp.35-51.

SAS/STAT User's Guide, Release 6.03 Ed., SAS Institute Inc., Cary, NC, USA, 1988. Shapira, B., Shoval, P. & Hanani, U., (1997). Stereotypes in information filtering systems.

Information Processing and Management, 33(3):273-287. Sparck Jones, K. (1989). Realism about user modeling. In: A. Kobsa & W. Wahlster (Eds.):

User Models in Dialog Systems, Springer-Verlag, Berlin Zhang, X. & Chignell, M. (2001). Assessment of the effects of user characteristics on mental

models of information retrieval systems. Journal of the American Society for Information Science & Technology, 52(6): 445-459

Zhang, X. (2002). Collaborative relevance judgment: A group consensus method for evaluating user search performance. Journal of the American Society for Information Science and Technology, 53(3):220-235.


ML4UM for Bayesian Network User Model Acquisition

Frank Wittig

Department of Computer Science, Saarland University, P.O. Box 15 11 50, D-66041 Saarbrücken, Germany

[email protected]

Abstract. This paper addresses the primary workshop question of how to apply machine learning techniques to acquire and continuously adapt user models, for the particular representation of user models as Bayesian networks. On the basis of an integrative framework for learning Bayesian networks for user modeling and user-adaptive systems, respectively, we discuss some of the methods we developed along these lines, in the light of the questions that are to be discussed during the workshop. As this paper is intended to give an overview of our research, it omits some technical details that can be found in related publications.

1 General Issues in Machine Learning for User Modeling

We begin by quoting this workshop’s “Call for Papers”:

“Machine learning is concerned with the formation of models from observations. Hence, learning algorithms seem to be promising candidates for user model acquisition systems.”

As this statement indicates in a subtle manner, a successful application of machine learning (ML) techniques in the user modeling (UM1) context (ML4UM) is indeed a non-trivial task that usually involves much more than a straightforward application of an ML algorithm to collected user data. The following list summarizes some typical issues that are critical to a successful application of ML techniques for the acquisition and adaptation of UMs (see [1] for a detailed discussion of a subset of this list):

– Limited training data
– Individual differences between users
– (Temporally) changing aspects of the domains
– Complexity / efficiency of the learning algorithms
– Interpretability of the learned UMs
– Characteristics of typical UM data, i.e., implicit feedback, missing data
– Exploitation of a priori available knowledge

1 We will use the abbreviation 'UM' for both 'user modeling' and 'user model'; it will become clear from the context which one is meant.


One of the main workshop questions is "how the user fits into the picture" and "how he can query/use the UM". A prerequisite for enabling the user to query and use the UM in a reasonable manner is that the learned UM has to be understandable to him or her (or at least transformable into an understandable representation). Such interpretable UMs enable, or make it easier for, the system to generate justifications and/or explanations of its decisions. It is well known that the provision of such information usually leads to better acceptance of the system by its users.

Thus, ensuring the interpretability of the results of the ML process is a key issue for explanation components and for user queries to the UM. For a wide range of formalisms for representing UMs, such as rule-based representations, decision trees, fuzzy methods, etc., this can be achieved quite straightforwardly, but there also exist methods that do not yield interpretable UMs in this sense in their basic form, e.g., artificial neural nets and k-nearest-neighbor methods. Bayesian networks (BNs) lie somewhere between those extremes of the spectrum of (un)interpretable UM representations, as we will discuss in more detail later.

In addition to proposing potential solutions to the primary question of how to continuously adapt a learned BN UM, in this paper we focus on results of our research efforts to ensure the interpretability of the learned BNs. Although we do not explicitly address IR issues, many of the results are relevant for IR because BNs are also being applied in (user-adaptive) IR systems (e.g., [2]).

2 Learning Bayesian Networks for User Modeling

Bayesian networks have become increasingly popular as one of the inference techniques of choice for user-adaptive systems. Table 1 lists some recent research on UM with BNs, in a wide range of application scenarios, that applies ML techniques to some extent. Note that these systems (except our READY system) use off-the-shelf learning methods that were not developed with the particular UM context in mind. The main goal of our research is to adapt existing BN learning algorithms for application in user-adaptive systems and/or to develop new ones that are especially well suited to this particular context, with the critical issues listed in the previous section in mind.

System                           Domain              Batch Learning     Adaptation
Albrecht et al. [3]              MUD games           CPTs               –
Billsus and Pazzani [2]          personalized news   CPTs               CPTs
Lau and Horvitz [4]              WWW search          CPTs               –
Horvitz et al. [5]               office alarms       CPTs & structure   –
Nicholson et al. [6]             ITS                 CPTs & structure   –
Bohnenberger et al. [7]: READY   dialog              CPTs & structure   CPTs & structure

Table 1. User-adaptive systems that use ML for BNs

A BN consists of two components. The first component is a directed acyclic graph (DAG), the structure of the BN, that represents the conditional independencies that hold in the domain to be modeled. Nodes represent random variables, and directed links between nodes are commonly interpreted as causal influences between these variables. The second component of a BN is a vector of conditional probability tables (CPTs) that quantify the (uncertain) relationships between nodes and their parents. A node's CPT consists of conditional probabilities for each state of the node, conditioned on its parents' state configuration. A BN represents a joint probability distribution over the states of its variables. On the basis of this representation, inference algorithms can be applied to derive conclusions about arbitrary sets of variables, conditioned on the available evidence for other variables observed in the domain.
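As a minimal sketch of these two components (not the authors' READY models), the following builds a three-node network in the spirit of Fig. 2 with the pgmpy library; all structure and probability values are invented for illustration.

```python
# Illustrative three-node BN: a DAG plus CPTs, then inference from evidence.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Structure: time pressure influences cognitive load, which influences
# an observable speech symptom (all names and numbers are invented).
model = BayesianNetwork([("TimePressure", "CognitiveLoad"),
                         ("CognitiveLoad", "SilentPauses")])

model.add_cpds(
    TabularCPD("TimePressure", 2, [[0.5], [0.5]]),
    TabularCPD("CognitiveLoad", 2,
               [[0.8, 0.3],    # P(load = low  | pressure = low/high)
                [0.2, 0.7]],   # P(load = high | pressure = low/high)
               evidence=["TimePressure"], evidence_card=[2]),
    TabularCPD("SilentPauses", 2,
               [[0.7, 0.2],
                [0.3, 0.8]],
               evidence=["CognitiveLoad"], evidence_card=[2]))

# Conclusions about a variable conditioned on observed evidence.
infer = VariableElimination(model)
print(infer.query(["CognitiveLoad"], evidence={"SilentPauses": 1}))
```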

There are several properties of BNs that make them well suited for application in the user modeling context. First, it is common practice to interpret the networks' links in a causal manner, a fact that contributes both to a potentially simplified construction process and to a more interpretable user model from the user's point of view. Second, BNs are able to handle uncertainty in the domain under consideration with regard to arbitrary subsets of variables, e.g., a user's goals, interests, etc. Several formalisms that are strongly related to BNs have been successfully applied in a variety of UM scenarios, e.g., influence diagrams, object-oriented BNs and dynamic BNs.

Fig. 1. Learning Bayesian networks for user modeling. (Cylinders represent repositories of data and BNs, respectively; ellipses stand for algorithmic procedures; boxes symbolize larger conceptual entities; the flow of information between the components is depicted by directed arcs. The offline part learns interpretable BNs from domain knowledge, preprocessed experimental data, and preprocessed usage data via CPT learning and structure selection; the resulting interpretable BNs and parameters for adaptation feed the online part, where CPT adaptation and structure adaptation take place within the user-adaptive system.)

As a basis for our research on learning BNs for UM, we used the integrated approach shown in Figure 1. This generic framework is flexible with regard to several "dimensions" that will be discussed below. Typically, a particular user-adaptive system does not make use of all aspects of this integrated approach; e.g., many systems omit the structural learning/adaptation part.

Offline learning and online adaptation (see the grey boxes in Figure 1). During the offline phase, general UMs are learned on the basis of data from previous system users or data acquired in user studies. These models are in turn used as a starting point for the interaction with a particular new user. The initial general model is then adapted to the individual current user and can be saved after the interaction for future use, when this particular user next interacts with the system. In that situation, the individual model can be retrieved from the model base, so there is no need to start again from the general model, probably yielding a better adaptation right from the beginning. Note that the offline learning procedure may also yield parametric information on how to adapt to individual users. The general idea behind this approach is that different parts of the learned UM need different kinds of adaptation, i.e., some parts need faster adaptation to an individual user than others. Details on this, and a comparison of alternative methods of adaptation to individual users, are discussed in a following section. The division of labor is sketched below.
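A toy sketch of this offline/online workflow, in which every function is a hypothetical stand-in for the actual learning and adaptation procedures:

```python
# Toy sketch of the offline/online workflow; all names and numbers are
# hypothetical stand-ins, not the paper's actual algorithms.
model_base = {}                                  # saved individual models

def learn_offline(sample_observations):
    # Stand-in for offline learning of a general UM from other users' data.
    return {"estimate": sum(sample_observations) / len(sample_observations)}

def adapt_online(model, observation, rate=0.2):
    # Stand-in for online adaptation toward the individual user's evidence.
    model["estimate"] += rate * (observation - model["estimate"])
    return model

general_model = learn_offline([0.4, 0.5, 0.6])   # data from previous users
user_id = "user-42"
model = model_base.get(user_id) or dict(general_model)  # reuse if stored
for obs in [0.9, 0.8, 1.0]:                      # interaction data, current user
    model = adapt_online(model, obs)
model_base[user_id] = model                      # retrieved next session
print(model)
```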

Experimental data and usage data. Two further dimensions concern the kind of data that is available. In principle, we distinguish between (a) experimental data and (b) usage data (see the upper part of Figure 1). Experimental data is collected in controlled environments, just as in psychological experiments. Usage data is collected during real interaction between users and the system. Obviously, these two types differ characteristically: usage data often includes more missing data, and rare situations are underrepresented in such data sets, while experimental data mostly does not represent "reality". Often, a combination of both types occurs. Because of our offline/online approach, we can handle this problem, for example, by learning a general model on the basis of experimental data and then adapting it using the usage data of the individual user.

Learning the BNs' conditional probabilities and structures. Since BNs consist of two components, the learning and adaptation tasks are also two-dimensional: (a) learning the conditional probabilities (CPTs) and (b) learning the BNs' structures. For both partial tasks, a number of standard algorithms exist (see [8] for an overview). In the UM context, we often have to deal with sparse data, but, on the other hand, in most cases we have additional domain knowledge available. This is reflected in our approach by introducing such knowledge into the learning procedures to improve the results, especially when the data is indeed sparse (see the upper left part of Figure 1). For learning the conditional probabilities, we developed a new method that we briefly describe in the following section. When learning structures, background knowledge can be incorporated by manually specifying "starting" structures that reflect the basic assumptions one makes in the domain. This bipartite character of the learning task is reflected in the new techniques we developed for the adaptation of initially learned BN UMs.

Degree of interpretability. As already discussed, an important point made by many researchers in the UM community is the interpretability and transparency of the models. The issue of interpretability is therefore an integral part of all aspects of our approach. We try to ensure, or at least improve, the interpretability of the final learned models, e.g., by respecting whatever background information is a priori available and can be introduced into and exploited within the learning process.


3 Improving the Degree of Interpretability

An important issue related to the interpretability of BNs is the presence of so-called hidden variables in the learning task. Variables are called hidden if no values are available for them in the training data, e.g., due to missing sensors or because their values cannot be observed directly at all. Consider, for example, a user-adaptive system (like our READY prototype) that aims to adapt to the level of cognitive load its user is currently experiencing, e.g., by interacting with him or her in a less error-prone way (for example, by providing very short and clear instructions) in order to make it less likely that the user will make a mistake using the system. An entity like "cognitive load", or, more generally, "stress", cannot be measured directly. The only option is to draw inferences based on observable values for symptoms like heart rate, pupil dilation and so on. For such hidden variables, sophisticated learning methods have to be applied. In the case of BNs, the well-known EM procedure (see, e.g., [8]) as well as the gradient-based APN method [9] are commonly used to learn the CPTs of hidden variables.

Fig. 2. Example BN UM with hidden variables. (The numbers in parentheses represent the numbers of possible discrete values. Nodes: SECONDARY TASK? (2), TIME PRESSURE? (2), ACTUAL WM LOAD (3), RELATIVE SPEED OF SPEECH GENERATION (3), QUALITY SYMPTOM? (2), NUMBER OF SYLLABLES (4), SILENT PAUSES (4), ARTICULATION RATE (4), and individual parameter nodes for disfluencies, number of syllables, silent pauses, and articulation rate (4 each). '+' and '-' signs on the links mark qualitative monotonic influences.)

Figure 2 shows an example BN UM that includes two hidden variables, "Actual Cognitive Load" and "Relative Speed of Speech Generation". We used this particular BN to model the consequences of time pressure, in addition to the need to perform an additional task, on the speech of the experimental subjects of an empirical study that we performed (see [10] for a detailed description). The underlying assumption used to construct the BN's structure is that a person is able to reduce their actual cognitive load by slowing down their speed of speech generation. A wide range of speech symptoms has been observed that can be used to make inferences about the values of the two hidden variables. As this short description indicates, such a structure is interpretable in that it represents the underlying (theoretical) assumption. Besides an interpretation of the structure itself by the experienced user, an explanation component for BNs (see, e.g., [11]) can generate verbal statements to justify its decisions, like "An increase in time pressure leads with high probability to an increase in the actual cognitive load, which in turn leads to a smaller number of syllables produced by the experimental subjects." A structure without these hidden variables may yield something less elaborate, like "A high time pressure yields a small number of syllables.", which may leave the user asking why this should be so.

The problem with learning BNs with hidden variables is that the learning procedures typically yield CPTs for these variables that do not reflect the intended relationships. For example, an application of the EM algorithm with the structure of Figure 2 learned a CPT for "Actual Cognitive Load" reflecting that an increase in time pressure would result in a decrease in actual cognitive load, a relationship between these two variables that does not conform to common knowledge. Since a similar observation is made with regard to the children of the hidden variable, the overall behavior of the BN (e.g., the influence of the time pressure variable on the speech symptoms) remains as expected. Such effects are largely due to the high dimensionality of the search space. Typically, the learning algorithm becomes stuck in one of the large number of local optima that model the underlying joint probability distribution numerically reasonably well but fail to represent the intended semantics of the hidden variables.

Fig. 3. Improving the BNs' Interpretability. (Explanations in text. The left-hand panels plot, for ten learned networks labeled A through J, the average negative log-likelihood per case and the constraint violation score; the right-hand panels plot the same measures for one learned network, G, over 100 learning iterations.)

In order to address this problem, we developed a method for introducing available qualitative background knowledge into the learning process, which we call "learning with qualitative constraints". A detailed description of the method is presented in [12]. The basic idea is to extend the standard scoring function for learning the CPTs θ of a BN, the log-likelihood of the data log P(D | θ), with a penalty term violation(θ, C) that measures the amount of violated qualitative constraints and thus the degree of the BN's interpretability. These qualitative constraints C are specified before learning takes place and represent known qualitative (monotonic) relations between two variables, i.e., statements such as "An increase of time pressure leads to a higher actual cognitive load" (cf. the '+' and '-' signs in Figure 2). In this way, (intermediate) solutions that violate the qualitative knowledge, and are therefore less interpretable, are scored worse than more interpretable ones. Thus, learning with qualitative constraints contributes to finding interpretable local optima. Such learning results are significantly better suited for use within explanation components, besides the side effect that it becomes easier to locate potential shortcomings in the UM.
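A minimal sketch of such a penalized score, assuming binary variables and a simple monotonicity measure; the exact formulation and the penalty weighting are given in [12], so everything below is an invented approximation.

```python
# Invented approximation of a qualitatively constrained score:
#   score = log P(D | theta) - weight * violation(theta, C)
import numpy as np

def monotonicity_violation(cpt, increasing=True):
    """How much P(child = high | parent) fails to be monotone in the parent,
    e.g. the constraint 'more time pressure -> higher cognitive load'."""
    p_high = cpt[1, :]                   # P(child = high | each parent state)
    diffs = np.diff(p_high)              # should be >= 0 for a '+' relation
    bad = -diffs if increasing else diffs
    return float(np.clip(bad, 0.0, None).sum())

def constrained_score(log_likelihood, cpts, constraints, weight=10.0):
    violation = sum(monotonicity_violation(cpts[child], inc)
                    for child, inc in constraints)
    return log_likelihood - weight * violation

# A CPT in which load *decreases* with pressure violates the '+' constraint
# and is scored worse than an equally likely but monotone alternative.
cpt_load = np.array([[0.3, 0.6],     # P(load = low  | pressure = low/high)
                     [0.7, 0.4]])    # P(load = high | pressure = low/high)
print(constrained_score(-9.5, {"CognitiveLoad": cpt_load},
                        [("CognitiveLoad", True)]))
```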

Figure 3 shows the results of a 32-fold cross-validation of this method with empirical data from the 32 subjects of our experimental study. Note that the results presented in [12] were based on synthetic data; the new analysis presented in this paper shows that the method is able to cope with real-world data and that the results of the earlier artificial analyses also hold for this more challenging case. On the left-hand side, we present the results for ten different randomly initialized starting BNs for the learning task. The upper part shows the average negative log-likelihood values of the final learned BNs. We observe an improvement in the quality of the learning results from the introduction of qualitative constraints (black bars vs. gray ones; note that a lower value represents a better result). This shows that introducing available prior knowledge to avoid some of the "uninterpretable" local optima helps to learn better models. More interesting with regard to the focus of the workshop are the results in the lower part, which show the values of the penalty term that measures the degree of interpretability. We see that our method indeed yields (more) interpretable BNs, with penalty values near 0. The right-hand part of Figure 3 shows the time course of learning for one of the ten learning tasks. The two upper curves represent the results when scoring against the test data (within the cross-validation methodology), whereas the lower ones are the results when scoring against the data used for learning. We see that our method limits overfitting, i.e., it ensures that as learning continues, the quality of the intermediate results does not degrade. The lower graph shows that the penalties are eliminated early in the learning procedure.

Overall, our method makes two contributions: (a) a reduction of overfitting and (b) an improvement in the degree of interpretability of the learned BNs, which has been the main motivation for the development of learning with qualitative constraints.

4 Alternative CPT Adaptation Methods

In this section, we discuss the workshop's primary question of how to learn and continuously adapt UMs for the particular representation as a BN. We present and discuss a comparative study of alternative methods. A more elaborate discussion of the algorithms and their evaluation is presented in [13]. Here, we focus on those aspects that are relevant to the present workshop.

When learning UMs, two common alternatives exist: (a) learning general UMs on the basis of observations acquired from a sample of users and (b) learning individual UMs on the basis of data from a particular user. Each of these approaches has its own typical benefits. A natural strategy is to combine the advantages of general and individual UMs: learn a general UM which can be applied to a new user, then adapt the model to each user during the interaction. We discuss two such adaptive approaches: (i) the parameterized UM, which makes use of individual parameter variables to model individual differences between users within a general UM (see the bottom-line variables of Figure 2), and (ii) the differential adaptive BN UM, which we introduced in [13]. Essentially, the parameterized UM is a dynamic BN in which the individual parameters represent variables whose "real" values do not change over time, e.g., in the example BN of Figure 2, the user's average articulation rate. The differential adaptive UM essentially learns a general UM together with adaptation rates for the different parts of the model, based on the variances of the characteristics of the different users. These local adaptation rates are used within the standard Bayesian adaptation procedure for conditional probabilities [8] to revise the different parts of the UM at different speeds in the light of new interaction data from a new user. The underlying idea is that those parts where large differences between individual users exist are adapted faster to the individual user than those on which most users agree.
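A minimal sketch of this idea, assuming the standard Bayesian updating of a CPT entry with an equivalent sample size; the mapping from adaptation rate to equivalent sample size is our own illustrative choice, not the authors' implementation:

    # Sketch of the differential adaptive idea: each CPT entry is revised by
    # Bayesian updating, but with a local adaptation rate derived from how much
    # users differ on that part of the model. High variance across users means
    # a small equivalent sample size and hence fast adaptation.

    def adapt_cpt_entry(p_general, observed_freq, n_obs, adaptation_rate):
        """Move a conditional probability toward the individual's observed
        frequency; adaptation_rate in (0, 1] scales the effective speed."""
        ess = (1.0 - adaptation_rate) / adaptation_rate * 10.0  # assumed scaling
        return (ess * p_general + n_obs * observed_freq) / (ess + n_obs)

    # A part of the model where users differ a lot adapts quickly...
    print(adapt_cpt_entry(0.5, 0.9, n_obs=5, adaptation_rate=0.9))   # ~0.83
    # ...while a part on which most users agree stays near the general UM.
    print(adapt_cpt_entry(0.5, 0.9, n_obs=5, adaptation_rate=0.1))   # ~0.52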

Figure 4 shows prototypical results of a comparison of the four alternatives, using our experimental data, for the prediction of a particular variable of the learned BN UM when individual differences between users are indeed present. The evaluation consisted of a 32-fold cross-validation using a BN structure similar to Figure 2 but without hidden variables, i.e., the independent variables were connected directly to the speech symptom variables. After a value has been predicted, the corresponding observations are used as evidence to adapt the UM according to the proposed method. The graph presents the averaged results.

The results for the general model (i.e., offline learning of a BN UM without any online adaptation to the individual user) are shown by the thick solid curve. First, note that the only reason why this curve is not a straight horizontal line is that there is considerable random fluctuation in the quadratic loss variable. Looking at the general and the individual UM (the latter starting from scratch, using only the adaptation mechanism to incrementally acquire the UM), we see that there are important individual differences between the users: the individual UM catches up to the general UM and significantly outperforms it after it has processed enough interaction data. Moreover, the differential adaptive model significantly outperforms the parameterized model as well as the others.

[Fig. 4. Comparison of adaptation methods: average quadratic loss (0.50-0.65) vs. number of observations (8-80) for the individual, general, parameterized and adaptive models.]

Our experience with a number of similar analyses with different datasets in different domains can be summarized as follows: although the parameterized and differential adaptive models perform best overall, the general and individual models each show competitive performance under certain conditions. Consequently, one of these models may turn out to be best if these conditions are met and practical considerations speak for the model in question, e.g., no need for an online adaptation procedure or an offline learning phase.

5 Structural Learning and Adaptation

The preceding sections have dealt with the learning and/or adaptation of the BNs' CPTs. In the following, we discuss the structural learning part.

Structural learning algorithms are seldom applied in the UM context. There are several reasons for this: (a) the structure of BNs can often be specified adequately by experts on the basis of a causal interpretation of the links, (b) the complexity of structure learning algorithms makes an application in many scenarios difficult, and, a more practical reason, (c) few BN software packages currently include such methods (although this situation seems to be changing). Thus, the question arises whether it is worthwhile to apply structural learning of BNs in UM.

Since the structure encodes the causal relationships between the variables, it represents an important contribution to the BN's overall interpretability. To exploit available causal background knowledge for the learning task, it is possible to specify (a) a starting structure for the search procedure that conforms with the prior knowledge and is expected to be modified only in minor parts, and (b) so-called structural constraints that limit the search space. Such structural constraints may be the presence or absence of a particular link, or the fact that a variable needs to remain parentless during learning. The latter applies, for example, to the independent variables (of a user study).

[Fig. 5. Structural vs. CPT learning: average negative log-likelihood (6.7-7.0) vs. number of iterations for structural learning and CPT learning only.]

We performed an initial study using our empirical dataset to compare learning with and without the structural part (see Figure 5). Again, a 32-fold cross-validation was conducted, using the SEM method of [14] for structural learning. The EM algorithm was used to learn the CPTs for the fixed structure of Figure 2, which was also chosen as the starting point for the structure search with SEM. The results show that in our scenario it is indeed worthwhile to apply structural learning to increase the quality of the learned BNs.

[Fig. 6. Recognition results: average P(experimental condition) (0.5-0.75) vs. number of observations (0-20) for structural learning and CPT learning only.]

In a second analysis, we evaluated whether this better quality with regard to representing the empirical data also holds when the structurally learned BNs are used in one of our application scenarios: the recognition of time pressure and of the presence/absence of a secondary task on the basis of observed speech symptoms. To this end, the learned BNs were used as time slices of dynamic BNs to infer the values of these two variables on the basis of several utterances. We followed the same evaluation procedure as described in [10], where we presented the analogous results for the CPT learning case with a fixed BN structure. Figure 6 shows that the recognition accuracy also slightly improves with the modeling quality (represented by the likelihood measure) of the structurally learned time slice.

6 Conclusion With Regard to Workshop Questions

We addressed the primary workshop question for a particular UM representation by presenting an overview of an integrative conceptualization for learning BNs for user-adaptive systems. Some parts of this framework have been discussed in more detail by reporting results that we achieved in our research efforts along these lines. The overall goal of this research is the adaptation of existing methods and/or the development of new ones that are specialized to the particular issues that play an important role when applying ML techniques in the UM context.

Within this framework, we emphasized the need for learning interpretable UMs, which in our opinion is strongly related to one of the questions listed in the workshop's CFP, as a prerequisite for being able to provide functionality for inspecting/querying the UM.

References

1. Webb, G., Pazzani, M.J., Billsus, D.: Machine learning for user modeling. User Modeling and User-Adapted Interaction 11 (2001) 19–29

2. Billsus, D., Pazzani, M.J.: A hybrid user model for news story classification. In Kay, J., ed.: UM99, User Modeling: Proceedings of the Seventh International Conference. Springer, Wien (1999) 99–108

3. Albrecht, D.W., Zukerman, I., Nicholson, A.E.: Bayesian models for keyhole plan recognition in an adventure game. User Modeling and User-Adapted Interaction 8 (1998) 5–47

4. Lau, T., Horvitz, E.: Patterns of search: Analyzing and modeling Web query refinement. In Kay, J., ed.: User Modeling: Proceedings of the Seventh International Conference, UM99, Vienna, New York, Springer Wien New York (1999) 119–128

5. Horvitz, E., Jacobs, A., Hovel, D.: Attention-sensitive alerting. In Laskey, K.B., Prade, H., eds.: Uncertainty in Artificial Intelligence: Proceedings of the Fifteenth Conference. Morgan Kaufmann, San Francisco (1999) 305–313

6. Nicholson, A., Boneh, T., Wilkin, T., Stacey, K., Sonenberg, L., Steinle, V.: A case study in knowledge discovery and elicitation in an intelligent tutoring application. In Breese, J., Koller, D., eds.: Uncertainty in Artificial Intelligence: Proceedings of the Seventeenth Conference, San Francisco, Morgan Kaufmann (2001) 386–394

7. Bohnenberger, T., Brandherm, B., Großmann-Hutter, B., Heckmann, D., Wittig, F.: Empirically grounded decision-theoretic adaptation to situation-dependent resource limitations. Künstliche Intelligenz 16 (2002) 10–16

8. Heckerman, D.: A tutorial on learning with Bayesian networks. In Jordan, M.I., ed.: Learning in Graphical Models. MIT Press, Cambridge, MA (1998)

9. Binder, J., Koller, D., Russell, S., Kanazawa, K.: Adaptive probabilistic networks with hidden variables. Machine Learning 29 (1997) 213–244

10. Müller, C., Großmann-Hutter, B., Jameson, A., Rummer, R., Wittig, F.: Recognizing time pressure and cognitive load on the basis of speech: An experimental study. In Vassileva, J., Gmytrasiewicz, P., Bauer, M., eds.: UM2001, User Modeling: Proceedings of the Eighth International Conference, Berlin, Springer (2001)

11. Druzdzel, M.J.: Qualitative verbal explanations in Bayesian belief networks. Artificial Intelligence and Simulation of Behaviour Quarterly 94 (1996) 43–54

12. Wittig, F., Jameson, A.: Exploiting qualitative background knowledge in Bayesian network learning algorithms. In Boutilier, C., Goldszmidt, M., eds.: Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference, San Francisco, Morgan Kaufmann (2000) 644–652

13. Jameson, A., Wittig, F.: Leveraging data about users in general in the learning of individual user models. In Nebel, B., ed.: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, San Francisco, CA, Morgan Kaufmann (2001) 1185–1192

14. Friedman, N.: Learning belief networks in the presence of missing values and hidden variables. In: Proceedings of the Fourteenth International Conference on Machine Learning (1997)


Visualization of a user model in educational document retrieval

Swantje Willms

Department of Information Science and Telecommunications,

University of Pittsburgh, Pittsburgh, PA, USA

[email protected]

Abstract. We propose to visualize the user model in relation to the documents that are relevant to the user's task. Each component of the user profile is represented by a point of interest in a reference-point based visualization. We employ this visualization in a learning environment, where its purpose is to help the student identify relevant documents for study.

1 Introduction

We propose to combine information retrieval (IR) and user modeling by visualizing the user model in relation to the documents that are relevant to the user's task. We address the workshop question: How can the user use the learned user model? The user is aided in using the model by giving her visual access to the relationship between her interests and knowledge (the components of the user model) and the retrieved information (documents).

In our research, we focus on the user model in a learning environment. Students have access to valuable resources outside of the lecture notes and required readings. A reason that these are often not accessed is that it is cumbersome for the student to assess their immediate relevance. When the user model is represented in a vector-based representation, it can be directly visualized in a reference-point based visualization. In a spatial visualization like WebVIBE, each user profile component can be represented as a separate reference point (e.g., interest, knowledge) among other reference points (e.g., lecture, query).

The visualization can help the user identify documents that they want to explore; in particular, its purpose is to help the student identify relevant documents for study.

2 Related Work

In most information retrieval systems, the query or information need is the only aspect of the user that is represented. No real user models are employed.

A need to employ user modeling techniques for visualization systems has been recognized [4]. A couple of experimental tools for visualizing user models have been proposed [5, 11]. They focus on visualizing the user model itself, whereas my work focuses on the relationship between the user model and the documents. Spatial visualization has been researched for IR tasks, but reference points have mostly been query terms and similar items. Even though the user profile is stated as an obvious example of a reference point by Korfhage [6, p.163], there hasn't been much research on this aspect.

We want to include the user model in the visualization: the goal is to help the students identify documents that they want to study, taking into account the lecture topic on the one hand and what is known about the student, i.e., the user profile, on the other.

In the context of the learning environment we are exploring, a map-based visualization called KnowledgeSea [1] has been used. KnowledgeSea visualizes all the concepts associated with a class and the underlying whole collection in a learning environment. It displays a two-dimensional map of the educational documents, with a set of keywords in each cell that describe the documents in that cell (Figure 1). Documents that are semantically related are close to each other on the map. If they are in the same cell, they are considered very similar.

3 Representation and Visualization

The vector model provides a consistent representation of all components as vectors, i.e., of the documents, the user model components, and any other points of interest.

3.1 Vector Representation

In the vector model of information retrieval [7], indexing terms are regarded as the coordinates of a multidimensional information space. Each document in the vector model is described by a vector of term values.

A reference point or point of interest (POI) is any definable point in the document space; it is a point or concept against which a document can be judged. A reference point is defined by a set of weighted terms, represented by a vector of term values.

The meaning of the value or weight differs slightly depending on the kind of item that is represented by the vector, i.e., document or POI (see section 4), but a higher weight always represents something like higher importance or relevance.

The user model in our research is represented as vectors of user profile components. Section 4.1 describes how the user profile components are acquired. The purpose of visualizing the user profile components as points of interest is to help the student identify relevant documents for study. In the past, relevant documents for the student have been shown as a list, but a multi-dimensional display may be easier to understand. It is more flexible because it can show how the potential study documents relate to the POIs.


[Fig. 1. KnowledgeSea map for the "Introduction to Programming" class]


3.2 Spatial Reference-Point Visualization

In a spatial visualization, each document is displayed as a point in the document space, and the placement of documents is based on attraction according to similarity to the reference points. Overall trends about the POI-document distribution in a set of documents can emerge from a spatial display. Spatial visualizations explicitly visualize varying degrees of similarity.

The VIBE (Visual Information Browsing Environment) [8] spatial reference point visualization interface focuses on presenting the document as a single entity within a document collection. VIBE places the points of interest at the vertices of its display. Individual documents are represented as icons [9]. VIBE is based on the ratio of similarity measures for a document with respect to multiple reference points. The user places the icons that represent the reference points on the screen, where a reference point can be any attribute to which a numeric association strength can be assigned. A weighted centroid model (spring model) is then used to automatically place the document icons according to the strength of their relationships to the reference points. Therefore, the position doesn't indicate the absolute significance of a document, but rather the relative significance of the reference points for the document. The reference points can be arbitrarily located and moved on the two-dimensional space, which helps disambiguate the locations of documents. Showing the documents with respect to multiple reference points allows the users to perceive the relationships of each document with respect to all of those reference points.
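As a sketch of this weighted centroid placement (the normalization below is our assumption of the usual formulation; the VIBE papers [8, 9] give the details), a document's position can be computed as the similarity-weighted average of the POI positions:

    import numpy as np

    # Minimal sketch of the weighted centroid ("spring") placement: a document
    # icon is drawn at the average of the POI positions, weighted by the
    # document's similarity to each POI. Normalizing the weights makes the
    # position reflect the ratio of similarities, not their absolute magnitude.

    def place_document(poi_positions, similarities):
        """poi_positions: (k, 2) array of POI screen coordinates;
        similarities: length-k array of document-POI similarity scores."""
        weights = np.asarray(similarities, dtype=float)
        weights /= weights.sum()            # ratios, not absolute values
        return weights @ np.asarray(poi_positions, dtype=float)

    # Three POIs (e.g. interest, knowledge, lecture topic) at triangle vertices:
    pois = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
    # A document twice as similar to the third POI is pulled toward it:
    print(place_document(pois, [0.2, 0.2, 0.4]))   # -> [0.5 0.5]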

In WebVIBE [12], the Java version of VIBE [8], the POIs are represented by magnets as a metaphor for attraction (Figure 2); as in the original VIBE, they define a coordinate system on the display to present a virtual document space. Each POI in WebVIBE is based on a vector of keywords representing a user profile component or other POI (see section 4). The structure of the presentation is user-defined in WebVIBE as well, because the users can change the display interactively by selecting and placing the POIs.

4 Components and Points of Interest

The POIs for the display will be generated behind the scenes based on the criteria described below and used for a WebVIBE display. Each POI consists of a term vector. The following sections describe how the different POIs are derived and represented.

4.1 User Model

The user model will consist of components including the student's interest and the student's knowledge, each treated as a separate POI in the visualization.

The students express their interest by highlighting the part of a document that is relevant to them or by rating the document for relevance. In a process similar to that used for WebMate [3], the highlighted parts (or the positively rated document) are preprocessed by deleting stop words and stemming, and possibly by extracting titles to give them more weight. The frequency vector for the document is extracted. Then it is either added to a set of vectors that represent the student's interests or combined with the one most similar to it.

[Fig. 2. A sample WebVIBE display]
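A minimal sketch of this interest-acquisition step (the cosine test and the merge threshold are illustrative assumptions, and stemming is omitted for brevity):

    import re
    from collections import Counter

    STOP_WORDS = {'the', 'a', 'of', 'to', 'and', 'in', 'is'}  # tiny stand-in list

    def frequency_vector(text):
        """Preprocess highlighted text: drop stop words, count term frequencies."""
        terms = [t for t in re.findall(r'[a-z]+', text.lower())
                 if t not in STOP_WORDS]
        return Counter(terms)

    def cosine(u, v):
        dot = sum(u[t] * v[t] for t in u)
        norm = lambda w: sum(x * x for x in w.values()) ** 0.5
        return dot / (norm(u) * norm(v)) if u and v else 0.0

    def update_interests(interests, highlighted_text, threshold=0.4):
        """Add the new vector to the interest set, or merge it with the most
        similar existing interest vector if one is close enough."""
        vec = frequency_vector(highlighted_text)
        best = max(interests, key=lambda iv: cosine(iv, vec), default=None)
        if best is not None and cosine(best, vec) >= threshold:
            best.update(vec)          # merge: add term frequencies
        else:
            interests.append(vec)     # a new, distinct interest
        return interests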

Student knowledge is represented by weighting keywords (from unknown to known) based on test scores. Students will take a quiz after a lecture topic has been covered in class (this evaluation could also be done on a voluntary basis). Each multiple-choice question in a quiz is associated with some terms or keywords (extracted from a paragraph of lecture material) that correspond to the concept(s) addressed by the question. If the student answers the question correctly, it is assumed that she knows the concept represented by the keywords. Answering several questions with the same keyword correctly increases the weight for that keyword. This increased weight represents increased knowledge (up to a maximum).
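A sketch of how the knowledge POI might accumulate; the cap and the increment are assumed values, since the text only states that repeated correct answers increase a keyword's weight up to a maximum:

    # Assumed illustration of the knowledge POI: each quiz question maps to the
    # keywords of the concept it tests; a correct answer raises their weights,
    # capped at a maximum that represents "fully known".

    MAX_WEIGHT, STEP = 1.0, 0.25   # illustrative values

    def update_knowledge(knowledge, question_keywords, answered_correctly):
        """knowledge: dict keyword -> weight in [0, MAX_WEIGHT]."""
        if not answered_correctly:
            return knowledge
        for kw in question_keywords:
            knowledge[kw] = min(MAX_WEIGHT, knowledge.get(kw, 0.0) + STEP)
        return knowledge

    knowledge = {}
    update_knowledge(knowledge, ['loop', 'iteration'], True)
    update_knowledge(knowledge, ['loop', 'condition'], True)
    print(knowledge)  # {'loop': 0.5, 'iteration': 0.25, 'condition': 0.25}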

Future work may include an explicit representation of student needs based on test scores. In addition, group models of the class interest or of the interests of groups of students could be incorporated as POIs.

4.2 Other POIs

Apart from the user model, another important POI, the lecture topic, is represented by keywords extracted from the corresponding lecture notes document (a set of slides). A non-homogeneous lecture will be split into two or three topics if necessary.


In future work, other POIs, such as those representing a traditional query, may be included as an option for the display.

4.3 Similarity

The similarity between the POIs and the documents is computed using a similarity measure such as the cosine measure. The cosine measure has been widely used for its simplicity and effectiveness [10]. The position of each document within the WebVIBE display is based on the ratio of the similarity measures for a document with respect to the POIs. A weighted centroid model places the documents according to the strength of their relationship to the POIs.
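For reference, the cosine measure between a document vector d and a POI vector p is the standard

    \mathrm{sim}(d, p) = \frac{\sum_t d_t \, p_t}{\sqrt{\sum_t d_t^2}\,\sqrt{\sum_t p_t^2}}

so a document's attraction to a POI depends on the angle between the term vectors rather than on document length.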

5 Research Design Overview

Our system shows the relevant documents in a multidimensional, similarity-based display that shows relationships to the user model. The visualization displays the relationship of the study documents to the student's knowledge, interest, and the lecture topic. We will explore two hypotheses. The first hypothesis is that students may be able to find relevant documents faster when they have a visualization of their interest and knowledge available and use it to identify relevant study materials. The study will determine whether there is an improved success rate in finding relevant documents within the allotted time frame vs. a control condition using presentation in lists, i.e., without visualization and user model. The second hypothesis is that students are more likely to make use of external material when they can judge its relevance with the help of the visualization. We will conduct two kinds of experiments involving real users, the students enrolled in a class, to explore these hypotheses: 1. long-term observation and 2. controlled sessions. In addition, the students' satisfaction with the visualization will be assessed through a questionnaire.

5.1 Resources

Some of the needed material already exists for classes such as "Introduction to Programming". The existing lecture notes in the form of slides build the internal document collection. In addition, the students have access to a couple of online tutorials as external resources. These can be referenced by page, the pages for one tutorial making up one collection.

5.2 Visualization

Different "sections" of a class will have different conditions, e.g., one with WebVIBE visualization and one without (control condition). WebVIBE will be employed in the context of the KnowledgeSea visualization [1] described earlier in this paper. KnowledgeSea provides an overview of the keywords associated with the educational documents. On the two-dimensional map of the educational resources that builds the core of KnowledgeSea (Figure 1), each cell displays a set of keywords associated with the documents located in that cell. Usually, the user would be linked to a list of documents from the cell. Instead, the users will be linked to a WebVIBE display (Figure 2) so that they can see how these documents relate to each other with respect to the POIs.

The WebVIBE display the student is presented with will have at least the three POIs described above: interest, knowledge, and lecture topic. The student will want to learn about concepts that are far away from her knowledge and close to her interest or the topic. The centroid of the lecture can be seen as a target because it represents what the student is supposed to know. Therefore, the student may want to explore those documents that relate to both their interest and the lecture topic, and also look at documents that integrate these with her previous knowledge, but not at documents that are related to knowledge only.

5.3 Experiments

Over the course of the semester, we will keep track of the documents that the students access. Using these data, we can find out if students are more likely to access external material if they access the documents through the visualization system.

In addition, we will schedule a session of specified length in which the student is asked to identify study material relevant to a specific topic. In this controlled session, we can assess whether students find relevant documents faster with the visualization. At the beginning of this focused session, the student will state her goal in natural language; a list of keywords associated with this goal could be used as a starting point for the student's interest. In this manner, the student's perceived goal will be adjusted (like the model of the user's objective in METIOREW [2]).

6 Conclusions and Future Work

In this paper, we present a way of visualizing the user model in the context of educational document retrieval. The visualization represents each user profile component as a reference point. The information is delivered to the users so they can assess the relationship between the documents and the user model. Currently considered POIs are an individual user's interest, knowledge, and the lecture topic.

Additional conditions that one could incorporate in a future study include the following: One could extend the user model by (1)(a) the class interest as a group model extracted from the interests of individual students or (b) several separate group models based on "categories" of students such as good, average, and poor students; (2) an explicit representation of student needs based on test scores weighted inversely to the representation for knowledge (i.e., low scores get high weights) and on topics that have been covered in class but which the student has never accessed. This POI could prove especially useful for review sessions.

In addition, different ways of acquiring and updating this user model, and their influence on the visualization, can be explored, as long as the model can be represented as term vectors. The visualization could provide additional or alternative POIs, such as the class/group interest and needs just mentioned, or a POI based on a traditional keyword query.

One could add additional document collections such as a glossary. The WebVIBE visualization could also be employed in a stand-alone fashion, without the context of the map-based KnowledgeSea visualization.

Acknowledgments

I would like to thank Michael Lewis and Peter Brusilovsky of the Department of Information Science and Telecommunications at the University of Pittsburgh for their support with this project.

References

1. Brusilovsky, P., Rizzo, R.: Map-Based Horizontal Navigation in Educational Hypertext. Journal of Digital Information 3(1) (2002)

2. Bueno, D., Conejo, R., David, A.A.: METIOREW: An objective oriented content based and collaborative recommending system. Third Workshop on Adaptive Hypertext and Hypermedia (AH2001) (2001) 310–314

3. Chen, L., Sycara, K.: WebMate: A personal agent for browsing and searching. 2nd International Conference on Autonomous Agents (Agents'98) (1998) 132–139

4. Grawemeyer, B.: User Adaptive Information Visualization. 5th Human Centred Technology Postgraduate Workshop, University of Sussex, School of Cognitive and Computing Sciences (HCT-2001) (2001)

5. Kliger, J.: Model Planes and Totem Poles: Methods for Visualizing User Models. Master of Science thesis, MIT Media Lab (1995)

6. Korfhage, R.R.: Information Storage and Retrieval. New York: Wiley (1997)

7. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Communications of the ACM 18 (1975) 613–620

8. Olsen, K.A., Williams, J.G., Sochats, K.M., Hirtle, S.C.: Ideation through visualization: the VIBE system. Multimedia Review 3(3) (1992) 48–59

9. Olsen, K.A., Korfhage, R.R., Sochats, K.M., Spring, M.B., Williams, J.G.: Visualization of a document collection: The VIBE system. Information Processing and Management 29 (1993) 69–81

10. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill (1983)

11. Uther, J., Kay, J.: VIUM: A Web-Based Visualisation of Large User Models. Ninth International Conference on User Modeling (UM'03) (2003)

12. Homepage for WebVIBE. URL: http://www2.sis.pitt.edu/~webvibe/WebVibe/


A Reinforcement Learning Strategy for (formal) Concept and Keyword Weight Learning for Adaptive Information Retrieval

Rajapakse, R.K. and Denham, M.

Centre for Neural and Adaptive Systems, University of Plymouth, UK {rohan,mike}@soc.plym.ac.uk

Abstract. This paper reports our experimental investigation into the use of a reinforcement learning strategy to learn weights of (formal) concepts and keywords to support Information Retrieval. This work is part of our main research objective of using a more elegant construct, the concept, rather than simple keywords as the basic element of representation and matching. The framework used for achieving this is based on the theory of Formal Concept Analysis (FCA) and lattice theory. Features or concepts (formulated according to FCA) of each document (and query) are represented in a separate concept lattice and are weighted separately with respect to the document. The document retrieval process is viewed as a continuous conversation between queries and documents, during which documents are allowed to learn a consistent set of significant concepts to help their retrieval. The learning strategy is based on relevance feedback information; it strengthens the similarity of relevant documents and weakens that of non-relevant documents. Test results obtained on the Cranfield collection show a significant increase in average precision as the system gains more experience.

1 Introduction

The human brain is unquestionably the best IR machine in terms of effectiveness. Unfortunately, the study of human brain function is still in its infancy, and exactly how the brain works is not yet clear. However, the superiority of the human brain in IR tasks seems to stem from three major properties:

1. its ability to read and understand the concepts, ideas or meanings central to a document,
2. its ability to reason out the usefulness of documents to information needs (queries) based on the understanding of the concepts (ideas or meanings) gained by reading the contents of documents and queries, and
3. its learning capability, which makes us adaptive to the environment, allowing us to gain knowledge through learning or interaction with the environment.

Understanding concepts, ideas or meaning in an IRS is typically achieved via a document representation scheme. The basic element fundamental to the representation of textual material inside an IRS has been the "keyword". As far as the human brain is concerned, it is unrealistic to treat a keyword as the sole representative of a concept. Though the details of how exactly the brain formulates ideas (or concepts) and how they are structured for efficient storage and reasoning are not clear, its remarkable accuracy and robustness in dealing with the imprecision and vagueness of the IR problem suggest that they are possibly much more complex than simple keywords. Therefore, in this work we assume that the formalism and structure of an idea or concept in the brain is complex enough to require a more elaborate entity to represent it within an IRS, and that concepts in the brain are kept in a more complex structure that retains the underlying interconnections between concepts. Furthermore, this interconnected structure is assumed to be the key factor that allows the brain to disambiguate the meaning of an individual concept with respect to the collective meaning of the context in which the concept is used, and that supports its reasoning mechanism.

The framework provided by FCA [3,10] is the closest formalism we found that matches this line of thinking. FCA formulates concepts in terms of objects and their properties or attributes and provides a way of combining and organising the individual concepts (of a given context) into a hierarchically ordered conceptual structure (i.e., a concept lattice). A concept lattice represents and conveys a broader picture of the knowledge that the combination of the individual concepts of the context possesses. The adaptivity or learning aspect of the human brain, with respect to IR, works in the following way. A document that was found not useful for a query in the past is unlikely to be tried again for the same or similar information need(s) in the near future (by the same individual). This is because the brain remembers at least the main concepts of a recently seen document, if not all of it, and it knows that the document does not help the information need at hand. What this essentially means is that the experience gained in early search sessions helps later sessions. However, the brain does not retain all its memories forever. Memory fades away (forgets) over time. This forgetting feature, though it looks undesirable, is an extremely useful property for adaptation and also prevents information explosion in the brain. We used a reinforcement learning strategy to mimic these properties in our IR model. Based on the relevance feedback information given by the user for the retrieved documents, the significances of the (formal) concepts and keywords that contributed to the retrieval of documents that the user finds useful are made stronger (increased), and the significances of the concepts and keywords of the rest of the documents (the ones that the user finds not useful or is not interested in) are made weaker (decreased, as they have contributed towards a false hit). In the following, we first describe briefly how concept matching between a query and a document lattice is performed. The reader is referred to [17,18] for details of document representation in concept lattices, the implementation of concept lattices in BAM structures, and further details of concept matching. Our reinforcement learning strategy is presented next, followed by the experimental results of the system obtained on the Cranfield collection and the conclusions.

2 Use of Concept Lattices in IR in the Past Compared to our Approach

The use of concept lattices in IR has in the past focused mainly on developing browsing mechanisms for domain-specific IR [1,4-9,12,14,15]. Typically, a single large concept lattice is created based on the keywords present in documents. Objects in concepts are the document identities (identification numbers) and attributes are the keywords in them. This formalism is not much different from keyword-based document categorization approaches, except for the hierarchical organisation of groups of documents in concept lattices. The user is provided with a starting node and allowed to navigate the lattice by expanding nodes and traversing between nodes. The starting node can either be the root node or any other node chosen on the basis of an initial keyword-based search. We have identified a number of disadvantages in the past approaches as found in the papers [1,4-9,11-16]: (1) the formulation of concepts with objects being document identities and attributes being the keywords present in those documents is rather unrealistic in terms of the way the human brain formulates, perceives and communicates concepts; (2) the use of a single large concept lattice to represent the entire document collection is computationally very expensive, and as a result such systems are limited to smaller document collections; (3) most of the past models are limited to browsing only; (4) creation of the lattice needs complex lattice-building algorithms, and traversing the lattice is expensive, with a cost that increases in proportion to the size of the lattice; (5) once created, the lattice is fixed, and no learning facility is provided in any of the past FCA-based approaches. Our approach differs from those approaches in at least the following four ways:

1. concepts are formulated according to how the human brain might do it, i.e., by extracting subjects, topics or objects mentioned in the text as objects of the formal concepts and their properties or attributes as attributes of the formal concepts. A set of ad-hoc rules (see [17,18]) was developed to extract such features from natural language text based on the syntactic structures of the sentences.

2. each document (query) is represented by a single (separate) lattice. The advantage of this is twofold: firstly, it allows us to operate on smaller lattices rather than on a single large lattice, and secondly, it allows maintaining different weights for the same concept in different document representations.

3. concept lattices are encoded in BAM structures [13]. Learning a BAM with a concept lattice [2,17,18] is much more efficient than using complex lattice-building algorithms. Also, updating the lattice representations of documents with additional concepts is much easier with BAMs [18], and no node-traversing overhead is involved.

4. finally, we have employed a reinforcement learning strategy based on relevance feedback information to interactively learn document representations by (1) learning the significances of concepts (formal concepts and keywords) and (2) updating the lattices with additional (query) concepts that help retrieve documents.


3 Concept Matching Between a Query and a Document

A node in a concept lattice represents a formal concept of the type described above. The process of reasoning about the usefulness of a given document to an information need (query) is based on the concepts common to the query and the document, similar to the way common features/terms are used in conventional IR, i.e., based on how similar the nodes in the query concept lattice are to the nodes in the document concept lattice (node matching). The following example illustrates node matching between a query and a document. Consider the context of planets given in Table 1. This context can be regarded as a representation of a document that talks about the solar system. The concept lattice of this context is given in Fig. 1 (right). Consider an information need containing the object "Mars" and a "moon" as a property of it. Assume that the concept lattice of this query contains a node representing the formal concept {Ma} {my} (see Fig. 1, left). Note that we disregard the rest of the nodes in the query lattice in this illustration. Document nodes that match this query node are also shown in Fig. 1.

    Planet    Size     Distance from Sun   Moon
    Mercury   small    near                no
    Venus     small    near                no
    Earth     small    near                yes
    Mars      small    near                yes
    Jupiter   large    far                 yes
    Saturn    large    far                 yes
    Uranus    medium   far                 yes
    Pluto     small    far                 yes
    Neptune   medium   far                 yes

Table 1. A Context of Planets (abbreviations used in the lattice: ss/sm/sl = small/medium/large size, dn/df = near/far distance, my/mn = moon yes/no)

As can be seen in Fig. 1, comparing query nodes with document nodes is not as simple and straightforward as comparing simple terms or keywords. Here, we need to maintain the consistency of our treatment of certain terms as objects and certain others as attributes. In addition, we wish to take into account the superiority/generality of concepts within the concept hierarchy, in order to match the more specific concepts and also to avoid duplications. However, problems of natural language such as synonymy, polysemy, and other problems related to the variability of vocabulary, which cause mismatches between concepts, together with the size variability (number of objects and attributes) between query and document concepts, make a 100% match between a query node and a document node impossible. Instead, a mechanism for partial matching is required.

[Fig. 1. Node Matching between two Concept Lattices: the query lattice (left) containing the node {Ma} {my}, and the concept lattice of the planet context (right), whose nodes include {Me,V,E,Ma,J,S,U,N,P} {}, {E,Ma,J,S,U,N,P} {my}, {Me,V,E,Ma} {ss,dn}, {E,Ma,P} {ss,my}, {Me,V,E,Ma,P} {ss}, {E,Ma} {ss,dn,my}, {Me,V} {ss,dn,mn}, {J,S} {sl,df,my}, {U,N} {sm,df,my}, {J,S,U,N,P} {df,my} and {P} {ss,df,my}; the document nodes matching the query node are marked.]


[Fig. 2. Concept Weighting: the concept {E, Ma} {ss, dn, my} with a separate weight for each object-attribute pair: W(E,ss), W(E,dn), W(E,my), W(Ma,ss), W(Ma,dn), W(Ma,my).]

3.1 Partial Matching

We define a partial match between two concepts as a concept m consisting, in its extent, of the objects common to the query and document extents and, in its intent, of the attributes common to the query and document intents. Formally: let q = <A,B> and d = <C,D> be two formal concepts; then the partial match between the two concepts is given by the concept m = <A∩C, B∩D>, where A, B, C, D are sets of terms, of which the terms in A and C are interpreted as objects and the terms in B and D as attributes, according to the FCA formalism.
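A direct transcription of this definition using Python sets (illustrative, not the authors' code):

    # Partial match m = <A ∩ C, B ∩ D> between a query concept q = <A, B>
    # and a document concept d = <C, D>.

    def partial_match(q, d):
        (A, B), (C, D) = q, d
        return (A & C, B & D)

    # Query concept {Ma} {my} against document concept {E, Ma} {ss, dn, my}:
    q = ({'Ma'}, {'my'})
    d = ({'E', 'Ma'}, {'ss', 'dn', 'my'})
    print(partial_match(q, d))   # ({'Ma'}, {'my'}): a unit-concept match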

3.2 Concept Weighting

Partial matching of concepts needs a mechanism for determining the significance of a partial match. Assigning significance weights to full concepts does not help here. Therefore, weights were assigned to single object-attribute pairs (unit-concepts), as shown in Fig. 2.

The object-attribute pairs (unit concepts) in m determine how similar the two concepts q and d are. If the query concept is identical to the document concept (ideal case) a complete node match occurs, i.e. m=q=d=<A,B>=<C,D>.

4 Keyword Matching

It is not uncommon for two concepts to share a common object or attribute but not both (hence no unit-concept match occurs). This can happen for various reasons, such as term/element mismatches, or because the two nodes actually represent two distinct concepts and the same term/element happens to be used by chance. Whatever the reason, we do not want to ignore the possible contributions that such common features (keywords) between two nodes may make in retrieving a relevant document. Such single object or attribute matches are considered "keyword" matches, and significance weights are maintained for individual keywords as well. The following diagram illustrates the difference between concept and keyword matches between concepts (nodes).

[Diagram: the query concept {Ma} {my} matched against the document concept {E, Ma} {ss, dn, my} yields a unit-concept match, whereas matched against the document concept {Me, V, E, Ma, P} {ss} it yields only the keyword match "Ma".]

Note that the keyword match of "Ma" will eventually be pruned out (in this case), as unit-concept matches involving "Ma" do take place in this example.


5 Learning Strategy

Our learning strategy works by accepting user feedback in the form of yes or no (i.e., accepting a document as relevant or rejecting it as irrelevant) and improving the document representation accordingly.

Traditionally, relevance feedback has been used to reformulate the query with additional information to support IR. Even though it has shown as much as a 20% improvement in recall and precision, one of the drawbacks of this approach is that it does not support learning. The important user decisions (user feedback) obtained are used only within one query session, for searching one information need. The results gained by relevance feedback in one query session are usually not available to subsequent query sessions, because the IR system does not retain them or their implications. A separate learning mechanism is required to make such systems adaptive.

Instead, in our model user feedback is used to update the document representations, and the modifications made to the documents are retained. We expect the document representations to converge to a well-representative set of concepts (for each document) over a period of time. Such a set of concepts will indeed become more personalized to the vocabulary and writing style of the end user, as (i) it is the concepts of the user-formulated queries that are amended into relevant document representations, and (ii) it is the user's relevance assessments that are used for reinforcing the significance weights of unit-concepts and keywords in documents.

Our reinforcement learning process works as follows: if the user says a particular (retrieved) document is relevant to a given query, all unit-concepts of the query that are not present in the document are added to the document representation with an initial weight value. If a particular unit-concept of the query is already present in the document, we consider it an important unit-concept (because it has made some contribution to the document's retrieval in the first place), and therefore its weight is increased by a small amount (∆w) as described in section 5.1 below. The concept addition may result in unnecessary unit-concepts getting into the document's representation, but we expect such unnecessary concepts to be penalized by our learning strategy and to end up with low weights in the long run.

Conversely, if the user says a particular (retrieved) document is not relevant to the query, then the weights of the matching units (unit-concepts and keywords) that are common to the query-document pair (i.e., those that contributed to the document's retrieval) are decreased, indicating that those units, though present in both the query and the document, are not very important in deciding the relevancy of the document to the query.

5.1 Significance Weights and Step Size of Weight Changes

Both keyword and unit-concept weights are initialized with the value 2.5 and are allowed to learn over the user interactions. The range of a weight was selected arbitrarily and constrained to positive values in [0.1, 5.0]. The minimum value of the weight range was set to 0.1 instead of zero to avoid complete ignorance of the existence of a feature.

The step size is proportional to the current value of the weight. The idea is to make the change depend on how far the weight is from the boundary in the direction of change (i.e., the top boundary if rewarding, the bottom boundary if penalizing; see Fig. 3 below). This makes learning faster when the difference between the current weight and the boundary towards which the modification is made is larger, and slower otherwise. The weight modification formula is:

Wnew = Wold ± η∆W

where ∆W = Wmax − Wold for a positive reinforcement (the weight is increased), ∆W = Wold − Wmin for a negative reinforcement (the weight is decreased), and η is the learning rate.

[Fig. 3. Weight Modification Steps: the step is the distance from the current weight Wold to the boundary in the direction of change, ∆W = Wmax − Wold when increasing and ∆W = Wold − Wmin when decreasing, with Wmax = 5.0 and Wmin = 0.1.]


5.2 Learning Rates for Rewarding and Penalizing

The learning rate is a constant that determines the proportion of the weight difference to take into account for the actual step size of the modification. The nature of IR is such that only a few of the many documents retrieved are judged useful by the user. Therefore, only those few documents that the user judges useful for his information need are rewarded. All the other documents in the retrieved set are regarded as false hits and are therefore penalized. As a result, on average, the weights of concepts/keywords tend to be negatively reinforced more often than they are positively reinforced. This imbalance of negative and positive reinforcements may lead all weights to end up at the minimum allowed weight value (0.1), if not dealt with appropriately. A way around this problem is to use different learning rates for positive and negative reinforcements. Deciding precise values for the positive (η) and negative (β) learning rates is difficult, as it depends on a number of factors including the number of queries, the composition of the queries, and the user judgments. Based on the results of a few preliminary experiments on the Cranfield collection, they were set to η = 0.04 and β = η/3 ≈ 0.0133.
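Putting sections 5.1 and 5.2 together, the asymmetric update can be transcribed directly (only the function name is ours):

    W_MAX, W_MIN = 5.0, 0.1   # weight range from section 5.1
    ETA = 0.04                # positive learning rate
    BETA = ETA / 3            # smaller negative rate, offsetting the many penalties

    def reinforce(w, positive):
        """Move a unit-concept/keyword weight toward the boundary in the
        direction of change; the step is proportional to the distance left."""
        if positive:
            return w + ETA * (W_MAX - w)    # reward: approach W_MAX
        return w - BETA * (w - W_MIN)       # penalty: approach W_MIN

    w = 2.5                                 # initial weight
    print(reinforce(w, positive=True))      # 2.6
    print(reinforce(w, positive=False))     # 2.468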

Fig. 5 illustrates the learning strategy described above with an example. In it, the query consists of only two unit-concepts, and the documents retrieved for this query include two relevant documents (Doc35 and Doc50) and two non-relevant documents (Doc20 and Doc100). Matching units, weight updating, and concept addition (according to our learning mechanism) are shown in the diagram for the documents Doc35 and Doc20.

5.4 Informative Factors of Comparison Units

The weight reinforcement strategy described above treats all units (unit-concepts and keywords) as equals, i.e., the weight of each unit-concept/keyword is reinforced by an amount decided by the learning rate and the current value of the weight, regardless of its informativeness. Since not every concept/keyword is equally informative, we introduced four levels of informativeness based on the number of terms a comparison unit possesses. The weights of concepts/keywords are re-weighted using four pre-decided weighting factors (we call them "informative factors") at the time of computing the similarity measure (RSV). Note that they are not used (directly) in weight modifications. The four levels we considered are listed below in increasing order of informativeness, with the experimentally chosen (not optimized) informative factors given in brackets.

a. Single-term keywords (1.0)
b. Key phrases (keywords with more than one term) (1.6)
c. Unit-concepts with single-term components (both object and attribute) (2.0)
d. Unit-concepts with multi-word components (at least one component comprises more than one term) (3.0)

[Fig. 5. Reinforcement Learning Strategy: the query's unit-concepts are matched against the retrieved documents Doc35, Doc20, Doc100 and Doc50; the weights of the matching units of the relevant document Doc35 are increased by ∆w and missing query units are added, while the weights of the matching units of the non-relevant document Doc20 are decreased by ∆w.]


6 Retrieval Process and Similarity (RSV) Computation

The retrieval process begins when a user issues a query (a natural language expression). This query expression is pre-processed for concept extraction, and a concept lattice of the extracted concepts is then set up. The concept lattice of each document in the collection (one at a time) is also set up, and the nodes of the query concept lattice are compared with the nodes of the document concept lattice for partial matching.

6.1 Candidate Node/Concept Pairs for Comparison

Not all query concepts match all document concepts, and therefore attempting to perform such a matching is not worthwhile. Instead, we extract "candidate" concept pairs to match between the query and the document based on the presence of common unit-concepts and keywords between them. The candidate concept extraction process works mainly by looking for the most specific concept in the document lattice for each query object (i.e., using object concepts). Attribute concepts (i.e., the most generic concept containing a given attribute) are also used in cases where a related object concept is not available in the document. During this process, we make sure to extract the most specific concepts wherever possible and not to extract the same concept pair more than once. We also avoid extracting document (query) concepts that are more general (in the general-specific hierarchy of the concept lattice) than any of the already extracted document (query) concepts matched with the same query (document) concept. In addition, if an object or attribute of the query appears as both an object and an attribute in a document representation, we check whether there is any order relation (in the concept hierarchy) between them, in order to avoid matching two related document concepts with the same query concept. Only the most specific concept is considered for matching in such cases.

However, there are some cases where the same query object appears both as an object and as an attribute in a document representation but represents two different ideas/concepts (i.e., they are not related in the concept hierarchy). In this case, the attribute concept given by the document lattice for the query object is also taken into account as a candidate to be matched with the object concept obtained from the query lattice, in addition to the object concept obtained from the document lattice.

6.2 Similarity Measure (RSV)

The candidate concept/node pairs (decided as described above) are extracted from the corresponding lattices and compared for partial concept matching (i.e., for unit-concept matching) and for keyword matching (in the absence of a unit-concept match for a given common keyword). Matching unit-concept pairs and keywords are then pruned to remove duplicates. The sum of the significance weights of the remaining unit-concepts and keywords (after pruning) is taken as the similarity measure (RSV) between the query and the document.
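A compact sketch of the RSV as described, reusing informative_factor from the sketch in section 5.4 (the data layout is an assumption):

    # RSV = sum of the significance weights of the pruned matching units,
    # re-weighted by their informative factors (section 5.4).

    def rsv(matches):
        """matches: list of (unit, weight) pairs after duplicate pruning,
        where unit is ('keyword', term) or ('unit_concept', obj, attr)."""
        return sum(weight * informative_factor(unit) for unit, weight in matches)

    matches = [
        (('unit_concept', 'Ma', 'my'), 2.5),   # unit-concept match
        (('keyword', 'ss'), 2.5),              # keyword-only match
    ]
    print(rsv(matches))   # 2.5 * 2.0 + 2.5 * 1.0 = 7.5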

7 The Evaluation/Test Strategy

Given the unavailability of an appropriate evaluation methodology for evaluating the dynamic properties of interactive IR systems, we were compelled to use our own test strategy (which we call the incremental Learning-Testing Strategy) for testing the performance dynamics of the system as it learns and gains experience. This was achieved by splitting the set of queries into two sets (training and testing) and then training the system on a (cumulative) subset of the training queries at a time (i.e., at each training session). Splitting the query set into training and testing sets so that they both equally represent the query space (in terms of desired properties such as degree of overlap in relevance assessments and degree of similarity) is a difficult task. This was done based on the degree of overlap in relevance assessments, as it is the most important factor that helps interactive learning in our model. Degree of overlap was measured in terms of the number of documents assessed as relevant to each query. Out of the 225 queries available in the Cranfield collection, 65 queries were used for testing and 160 for training. More queries were allocated to the training set simply because we needed more queries to create more training-testing (sub)sessions. This gives a fairly representative set of queries for testing in terms of cross-relations, but does not guarantee that the queries are equally distributed in terms of their similarity and their expressiveness in natural language (length).

A training set for each training phase was created by adding 40 randomly selected queries from the full training set (of 160 queries) to the training set used at the previous training session. No query is picked more than once. So, the numbers of queries trained at the four training sessions were 40, 80, 120 and 160. Each query was iterated 20 times at each training session. The order of presentation of queries to the system was made random. At the end of each training session, the system was tested with the 65 test queries, and the similarity measures were recorded for each query-document pair.
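As a sketch, the incremental Learning-Testing protocol looks as follows (train_on and test_on stand in for the system's actual training and testing procedures):

    import random

    # Incremental Learning-Testing Strategy from section 7: cumulative
    # training sets of 40/80/120/160 queries, 20 iterations per session,
    # testing on the 65 held-out queries after each session.

    def incremental_protocol(training_queries, test_queries, train_on, test_on):
        assert len(training_queries) == 160 and len(test_queries) == 65
        remaining = list(training_queries)
        cumulative, results = [], []
        for _ in range(4):                        # four training sessions
            batch = random.sample(remaining, 40)  # no query picked twice
            remaining = [q for q in remaining if q not in batch]
            cumulative += batch
            for _ in range(20):                   # each query iterated 20 times
                random.shuffle(cumulative)        # random presentation order
                for q in cumulative:
                    train_on(q)
            results.append(test_on(test_queries))
        return results                            # scores after 40/80/120/160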

8 Results

Fig. 6 shows that the performance of the system (i.e., non-interpolated average precision) increases considerably over training. Note that the shape of the curve varies depending on the amount of learning that takes place at each training session and on how much that learning helps retrieve relevant documents for the test queries, but the starting and ending points remain the same. This is because the selection of training queries for the sub-training sessions from the full training set (160) is done randomly.

Fig. 7 shows the P-R curves of the test results obtained after each training session. It provides further evidence of the performance gains shown above by the system over training. The performance gain shown by the system is a combined result of the use of unit-concepts and keywords and the use of our reinforcement learning strategy. The contribution of the individual components (of matching and learning) towards this result is further analysed and compared below in Figs. 8, 9 and 10.

Fig. 8 shows that concept matching alone, without any learning (the flat curve), is not of much use. Problems with concept extraction and with mismatches caused by vocabulary differences and word ambiguity in natural language are the main causes of the poor performance of concept matching (only). These problems are severe in our case compared to simple keyword matching, because (i) concept extraction from source documents is more complex and difficult, as it needs the identification of two terms or phrases, one as an object and the other as a property possessed by the object, and (ii) the mismatch problem is doubled in our case because a concept match needs both the object and attribute constituents of a query concept (unit-concept) to match with an equivalent in a document.

Allowing the system to learn concept weights, and adding query concepts to relevant documents, have each alone shown some improvement in performance over training, but not a sufficient one (Fig. 7). Concept learning does not (and is not expected to) solve the two main causes of poor performance stated above. Though concept addition helps documents learn (through user interactions) the different ways users might refer to them (experience), and thereby alleviates both the word-mismatch problem and poor concept extraction, it has not shown a significant improvement. This is mainly because these results were based on testing unseen queries, and the lack of sufficient overlaps in the collection (i.e. use of the same unit-concepts to represent similar documents) hampered evaluation of the main property of our learning strategy: retrieval of a document by a query as a result of the document having been reinforced (updated) by another query. The interesting point, however, is that both show increasing trends in performance. As a result of these positive improvements of each component, their combination has shown a significantly better improvement.

Fig. 6. Average Precision Over Training (non-interpolated average precision vs. number of training queries: 0, 40, 80, 120, 160)

Fig. 7. P-R Curves at Different Levels of Training (precision vs. recall level after training on 40, 80, 120 and 160 queries: Prec40, Prec80, Prec120, Prec160)

Interestingly, the performance of the system increased when only the keyword-learning component was used, despite the fact that the keyword matches at each testing session were the same in this case: it is the same set of queries tested on the same collection, and no concepts (and hence no keywords) are added to the documents during training. This is solely a result of keyword weight learning. Our learning strategy seems to have assigned higher weights to the keywords that were significant, at least in terms of their document-distinguishing power.

Finally, the performance curve of the system in its full capacity shows that taking keyword matching into account helps improve performance. The reasons for this improvement are that (1) keywords help the initial picking up of documents for reinforcing, and (2) keyword matches (which take place in the absence of unit-concept matches) help increase the similarity scores (RSVs) of documents and thus help rank documents with more features in common with the query above those with fewer such features. The second point is valid only if more keyword matches occur with relevant documents than with non-relevant documents, a well-known observation first made by B. Croft. Though no experiments were targeted at examining the validity of this feature, the improvement shown by the system with keyword matching evidences its validity. Had more keyword matches occurred with non-relevant documents, those non-relevant documents would have been pushed up in the ranked list and, as a result, the performance of the system would have degraded.

Fig. 9 and Fig. 10 compare the performance of the system with its different learning components when only concept matching and only keyword matching, respectively, are considered.

They both show that combining all three learning components gives better performance for both concept matching (only) and keyword matching (only). This result further confirms that the better performance shown by the system in its full capacity (Fig. 8) is a combined result of all learning components acting on both concept matching and keyword matching, not a result of a subset of them.

Note, however, that not all the keyword matches that take place when testing for "keyword matching only" remain keyword matches when testing for "both keyword and concept matching", as the keywords that participate in concept matches are considered duplicates and pruned out. Therefore, the test results for "both keyword and concept matching" are not simply the sum of the two cases of "concept matching only" and "keyword matching only".

Fig. 9. Concept Matching Only (average precision vs. number of training queries, comparing concept addition and concept learning, concept learning only, concept addition only, and no learning at all)

Fig. 10. Keyword Matching Only (average precision vs. number of training queries, comparing concept addition and keyword learning, concept addition only, keyword learning only, and no learning at all)

Fig. 8. Contribution of Learning on Performance (average precision vs. number of training queries for concept matching under different learning settings, and for both concept and keyword matching with full RL)


9 Conclusions

We have shown, firstly, a way of using more elaborate and true concepts to create more meaningful representations of textual material and to perform explicit concept matching; secondly, a radically different approach to using concept lattices in IR, and its feasibility; thirdly, the importance of an interactive learning strategy and the effectiveness of retaining the learnt knowledge for future use; and finally, the advantage of a hybrid approach that combines concept matching and keyword matching with concept addition and weight learning for developing an IR system. A main characteristic of our system is that it becomes more and more tuned to its environment, or strictly speaking to its inputs, as it learns. Consistency of training examples, i.e. the consistency of users in their use of vocabulary when formulating queries and in making relevance assessments, is essential for our system to converge, or to become better tuned to its inputs. Consistency is maximized when only one user uses the system; essentially, this makes the system customized to its only user, and thus more personalized. On the other hand, learning in multi-user environments helps the system learn more exhaustive and better-generalized representations. In particular, it helps the system learn the different possible ways of formulating queries (i.e. different ways of referring to the same document) by different end users with different vocabularies. However, consistency among the users in making relevance assessments remains essential for convergence. In an environment with more inconsistent users, the system dynamics, and therefore retrieval performance, may vary rapidly over time; as a result, a given user may not be guaranteed the same relevant documents for the same query issued at a later attempt. According to these observations, the system is likely to perform best in more personalized (single-user) environments and in multi-user environments with similar or consistent users. Indeed, it has the potential to outperform conventional keyword-based systems in such environments.

Finally, we conclude this research with the following comment. This work was a first step towards making use of more elegant concepts, as similar as possible to the formulation of concepts in the human brain, and towards allowing end users to (implicitly) decide the significance of concepts in documents through an interactive learning strategy. The difficulty of automatic concept extraction from text and the lack of sufficient background information in documents for building more complete concept hierarchies are the major drawbacks that prevented our investigating the full potential of concept matching within the framework of FCA. Despite these drawbacks, however, the performance results obtained are impressive and encouraging. In particular, this work shows a way of performing true concept matching and the feasibility of using FCA in a different and more advantageous way than existing FCA-based approaches. We are optimistic about the potential of the FCA framework to deliver better performance, provided better, more meaningful document/query representations are created through the incorporation of background knowledge from external knowledge sources and the extraction of more meaningful concepts from text using future advances in NLP technology. Such work is left for the future.

References

1. Becker, P., Hereth, J., Stumme, G.: ToscanaJ - An Open Source Tool for Qualitative Data Analysis. In: Workshop Proceedings of the FCA KDD Workshop of the 15th European Conference on Artificial Intelligence (ECAI'02), July 21-26, 2002, Lyon, France (2002)

2. Bělohlávek, R.: Representation of Concept Lattices by Bidirectional Associative Memories. Neural Computation, Vol. 12, No. 10 (2000) 2279-2290

3. Burmeister, P.: Formal Concept Analysis with ConImp: Introduction to the Basic Features. Technical Report, Technische Hochschule Darmstadt, Germany (1998)

4. Carpineto, C., Romano, G.: A Lattice Conceptual Clustering System and Its Application for Browsing Retrieval. Machine Learning, 24(2) (1996) 95-122

5. Carpineto, C., Romano, G.: Order-Theoretical Ranking. JASIS, Vol. 51, No. 7 (2000) 587-601

6. Cole, R., Eklund, P.W.: Scalability of Formal Concept Analysis. Computational Intelligence, Vol. 2, No. 5 (1993)

7. Cole, R.J., Eklund, P.K.: Application of Formal Concept Analysis to Information Retrieval using a Hierarchically Structured Thesaurus. International Conference on Conceptual Graphs, ICCS'96, Sydney (1996) 1-12

8. Cole, R., Eklund, P., Stumme, G.: CEM - A Program for Visualization and Discovery in Email. In: D.A. Zighed, J. Komorowski, J. Zytkow (Eds.): Proceedings of PKDD 2000, LNAI 1910 (2000)


9. Cole, R., Eklund, P., Stumme, G.: Document Retrieval for Email Search and Discovery using Formal Concept Analysis. Journal of Applied Artificial Intelligence, Vol. 17, No. 3 (2003)

10. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer-Verlag, Berlin Heidelberg, ISBN 3-540-627771-5 (1999)

11. Godin, R., Missaoui, R., April, A.: Experimental Comparison of Navigation in a Galois Lattice with Conventional Information Retrieval Methods. International Journal of Man-Machine Studies, 38 (1993) 747-767

12. Kim, M., Compton, P.: A Web-based Browsing Mechanism Based on Conceptual Structures. 9th International Conference on Conceptual Structures (ICCS 2001), Guy W. Mineau (Ed.), California, USA, CEUR-WS, Stanford University (2001) 47-60

13. Kosko, B.: Bidirectional Associative Memory. IEEE Transactions on Systems, Man and Cybernetics, 18(1) (1988) 49-60

14. Lindig, C.: Concept-Based Component Retrieval. Working Notes of the IJCAI-95 Workshop: Formal Approaches to the Reuse of Plans, Proofs (1995)

15. Priss, U.: A Graphical Interface for Document Retrieval Based on Formal Concept Analysis. In: Proceedings of the 8th Midwest Artificial Intelligence and Cognitive Science Conference, AAAI Technical Report CF-97-01 (1997) 66-70

16. Priss, U.: Lattice-based Information Retrieval. Knowledge Organization, Vol. 27, No. 3 (2000) 132-142

17. Rajapakse, R.K., Denham, M.: A Concept-based Adaptive Information Retrieval Model using FCA-BAM Combination for Concept Representation. In: Proceedings of the 24th BCS-IRSG European Colloquium on IR Research (ECIR'02), March 25-27, 2002, Glasgow, UK (2002)

18. Rajapakse, R.K., Denham, M.: Information Retrieval Model Using Concept Lattices for Content Representation. In: Workshop Proceedings of the FCA KDD Workshop of the 15th European Conference on Artificial Intelligence (ECAI'02), July 21-26, 2002, Lyon, France (2002)


A Connectionist Model of Spatial Knowledge Acquisition in a Virtual Environment

Corina Sas1, Ronan Reilly2 and Gregory O’Hare1

1 Computer Science Department, University College Dublin, Belfield, Dublin 4, +353 1 716 {2922, 2472}
{corina.sas, gregory.ohare}@ucd.ie

2 Computer Science Department, National University of Ireland, Maynooth, Co. Kildare, +353 1 708 3846
[email protected]

Abstract. This paper proposes the use of neural networks as a tool for studying navigation within virtual worlds. Results indicate that the network learned to predict the next step of a given trajectory, also acquiring basic spatial knowledge in terms of landmarks and the configuration of the spatial layout. In addition, the network built a spatial representation of the virtual world, rather like a cognitive map. The benefits of this approach and the possibility of extending the methodology to the study of navigation in Human Computer Interaction applications are discussed.

1 Introduction

This study is part of ongoing research whose purpose is to identify the procedural and strategic rules governing user navigational behavior within virtual space. Environmental psychology provides an empirical basis for a better understanding of how humans perceive and understand space. The work described here confirms the idea that acquiring an internal representation of the environment is a very complex process involving primarily landmark identification and understanding of the spatial layout configuration. These two basic procedures are well known as route-based knowledge and survey knowledge [8]. Without underestimating the role of traditional methods, we propose the use of neural networks as an alternative tool for studying navigation within virtual worlds.

Neural networks have proven particularly suited to finding patterns in large amounts of complicated and imprecise data, and to detecting trends that are too complex to be noticed by humans [2]. While neural networks have been fruitfully exploited by artificial intelligence researchers, their adoption within HCI has been limited; they have been applied primarily to pattern recognition [18]. Finlay identified four areas of HCI that involve pattern recognition problems, such as task analysis and task evaluation, natural interaction methods such as gesture, speech and handwriting, and adaptive interfaces [5]. Neural networks provide a very powerful toolbox for modeling complex high-dimensional non-linear processes [13]. ANNs have many advantages over traditional representational models, in particular distributed representations, parallel processing, robustness to noise or degradation, and biological plausibility [10]. We consider that at least part of these strengths can be harnessed to model users' behavior in terms of spatial knowledge acquisition. A simple recurrent network [4] has feedback which embodies short-term memory; this makes it suitable for application to symbolic tasks of a sequential nature.

2 Navigation within Virtual Environments

Virtual Environments (VEs) have become a rich and fertile arena for investigating spatial knowledge. Within a VE, the user's set of actions is restricted, consisting mainly of navigation, locomotion, object selection, manipulation, modification and query [6]. As Sayers (2000) observed, navigation has been found to be central to the usability of interfaces to VEs on desktop systems [21]. VEs offer a context for training and exploration, enabling the replacement of training and exploration within the physical world. This proves particularly attractive when experiencing the real world is expensive, dangerous or hard to achieve [3]. Evidence of significant similarities in the acquisition of spatial knowledge from real and virtual environments has been identified [11]. A further advantage is their tractability [1], which enables accurate spatio-temporal recording of users' trajectories within the virtual space. Attempts to understand spatial behavior in both real and artificial worlds have primarily been concerned with highlighting the symbolic representation of spatial knowledge.

2.1 Symbolic Cognitive Models of Navigation

The study of navigation in the area of HCI has developed mostly in the field of cognitive modeling, benefiting from inputs provided by both environmental psychology and geography [14]. Modeling of spatial knowledge has constituted a central research theme for the last four decades. Golledge presented different models of declarative knowledge acquisition, together with their relevant applications in the area of spatial cognition [7]. Kuipers developed several computational models of navigation, underlying the procedural knowledge embedded in spatial representations [12]. The basic idea resides in the individual's set of interactions with the environment, which facilitates a structured storage of perceptual experiences. These memorized experiences enable users to build a more generalized structure for exhibiting emergent spatial behavior not performed before [7].

2.2 Connectionist Models of Navigation

Previous studies have shown that recurrent neural networks can predict both circular and figure-eight trajectories [4, 9, 17, 19, 22]. However, because the figure-eight trajectory crosses itself, training was more difficult for this type of trajectory. In our case, the trajectories covered by users are more complex than a circle or figure eight, even though some of them resemble a circular shape.


3 Methodology

Research in the area of navigation within VEs has generally focused on large-scale virtual worlds [3]. In this study we utilized ECHOES1 [16] as an experimental test-bed. It is a virtual reality system offering a small-scale world that is dense, static and of consistent structure. Adopting a physical-world metaphor, the ECHOES environment comprises a virtual multi-story building, each of whose levels contains several rooms: a conference room (Fig. 1), a library (Fig. 2), a lobby, etc.

Fig. 1. Virtual Conference Room

Fig. 2. Virtual Library

The present study captures the spatial behavior of users exploring an unfamiliar VE. Users can navigate from level to level using a virtual elevator. The rooms are furnished, and associated with each room is a set of user functions.

A sample of 30 postgraduates in the Computer Science Department was asked to perform two tasks within the virtual world, namely exploration and searching. In order to gain familiarity with the environment and learn movement control, the subjects were asked to look for a particular object within the virtual building for about 20 minutes. This exploratory task provided the primary data for the neural network approach. Subjects were then asked to find a particular room in the virtual building, namely the library. We considered the time and trajectory length involved in this searching task as performance indicators. Based on these, we assessed the quality of spatial knowledge acquisition and the efficiency of the exploratory strategy.

A comprehensive set of data consisting of users' positions was recorded throughout the experiment. Each movement greater than half a virtual meter, and each turn greater than 30°, was recorded.

We present a connectionist simulation to test whether a network can build a cognitive map as an internal representation of environmental information [8], in terms of both landmarks and the configuration of the spatial layout. The basic idea is that an input vector consisting of the current Cartesian coordinates together with information about the nearest landmark is sufficient to induce the network's internal abstractions for predicting the next position. To test our hypothesis, an Elman-style simple recurrent neural network was used to learn the trajectory and predict the next step.

1 ECHOES (European Project Number MM1006) is partially funded by the Information Technologies, Telematics Application and Leonardo da Vinci programmes in the framework of the Educational Multimedia Task Force.


The implementation was carried out using the Stuttgart Neural Network Simulator (SNNS). The network architecture [4] is presented in Figure 3 and consists of 6 input nodes, 12 hidden nodes, 12 context nodes and 6 output nodes.

The network input consists of a sequence of users' trajectories. At each time step t, an input vector is presented consisting of the user's position, orientation angle, distance to the nearest landmark (the distance to the nearest point of the landmark) and its associated position (the coordinates of the center of the landmark). For this simulation we considered only the trajectories performed on the ground floor of the virtual building; Figure 4 presents an overhead image of this level. After each trajectory was entered, the network was reset by entering an input which zeros out the output [15]. The output pattern represents the input vector at time t+1. All input values were normalized.
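For illustration, here is a minimal numpy sketch of such an Elman-style network (the paper's implementation used SNNS); the sigmoid activations, the plain squared-error backpropagation, and resetting by zeroing the context layer are our simplifying assumptions.

import numpy as np

rng = np.random.default_rng(0)
W_ih = rng.uniform(-0.5, 0.5, (12, 6))    # input (6)    -> hidden (12)
W_ch = rng.uniform(-0.5, 0.5, (12, 12))   # context (12) -> hidden (12)
W_ho = rng.uniform(-0.5, 0.5, (6, 12))    # hidden (12)  -> output (6)
context = np.zeros(12)
lr = 0.001                                # learning rate from the paper

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, target):
    """One forward pass and backprop update (momentum = 0); x and target
    are 6-vectors: x, y, rotation, distance, landmark x, landmark y."""
    global context, W_ih, W_ch, W_ho
    h = sigmoid(W_ih @ x + W_ch @ context)
    y = sigmoid(W_ho @ h)
    d_out = (y - target) * y * (1 - y)        # squared-error gradient
    d_hid = (W_ho.T @ d_out) * h * (1 - h)
    W_ho -= lr * np.outer(d_out, h)
    W_ih -= lr * np.outer(d_hid, x)
    W_ch -= lr * np.outer(d_hid, context)
    context = h                               # copy hidden -> context
    return y

def reset():
    """Reset between trajectories (approximated here by zeroing context)."""
    global context
    context = np.zeros(12)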

Fig. 3. Network Architecture (input layer carrying position, heading, distance and landmark position; a context layer copied from the hidden layer; hidden layer; output layer predicting the input at t+1)

Fig. 4. Ground Floor

Using the backpropagation learning procedure [4], the network was taught to predict, for each current position, the next position in time. Table 1 presents an example of input/output vectors, reflecting a rotation (the third element in the vector changes), followed by a translation that moves the user into the proximity of a new landmark (the last two elements in the vector change).

Vector      X      Y      Rotation  Distance  Landmark (X)  Landmark (Y)
Input t     0.031  0.258  0.097     0.179      0.000         0.078
Output t    0.031  0.258  0.018     0.179      0.000         0.078
Input t+1   0.031  0.258  0.018     0.176      0.000         0.078
Output t+1  0.025  0.336  0.018     0.195     -0.219         0.531

Table 1. Input / Output Vectors Example

At this stage of our work, we expanded the notion of landmark to any feature added to the spatial layout. Therefore, apart from pieces of furniture, we also considered choice points such as doors and the lift entrance. Identifying which of these features prove to be salient and able to capture attention (being thus authentic landmarks) is a task to be solved by the network. We randomly divided the entire data set into five parts, using three of them for training, one for validation and one for testing. The network was trained for 1000 epochs, with 24 trajectories composed of 4668 input vectors. It was tested with 12 trajectories consisting of 1573 input vectors. The average trajectory length was 160 vectors. The learning rate was 0.001, the initial weights were set within a range of 0.5, and the momentum was 0.

4 Results

In Table 2 we present the results of testing the network, obtained by computing the Euclidean distance between the output vectors predicted by the network and the expected output vectors.

Input description                                Percent Correct
User’s next position – X coordinate              97.13 %
User’s next position – Y coordinate              92.30 %
User’s next orientation (heading)                86.90 %
Distance to the next nearest landmark            99.87 %
Next nearest landmark position – X coordinate    90.27 %
Next nearest landmark position – Y coordinate    86.77 %

Table 2. Prediction accuracy of each input element based on Euclidean distance

As can be seen, the network generalizes well for all the input elements. However, for a prediction to be correct, all the input elements must simultaneously be within specified limits (e.g. 1 virtual meter for Cartesian coordinates, 30 degrees for rotation and 2.5 virtual meters for distance estimation). With respect to this composite criterion of accuracy, the network still performs adequately, the success rate being 67.57%.
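A small sketch of this composite criterion follows; it assumes predictions have been de-normalized back to virtual meters and degrees, and that the landmark coordinates share the 1-virtual-meter tolerance (the text does not state the landmark tolerance explicitly).

import numpy as np

def composite_correct(pred, actual, pos_tol=1.0, rot_tol=30.0, dist_tol=2.5):
    """True only if all six elements are simultaneously within tolerance."""
    pred, actual = np.asarray(pred), np.asarray(actual)
    return (np.all(np.abs(pred[[0, 1]] - actual[[0, 1]]) <= pos_tol)
            and abs(pred[2] - actual[2]) <= rot_tol
            and abs(pred[3] - actual[3]) <= dist_tol
            and np.all(np.abs(pred[[4, 5]] - actual[[4, 5]]) <= pos_tol))

def success_rate(preds, actuals):
    return np.mean([composite_correct(p, a) for p, a in zip(preds, actuals)])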

5 Discussion

The preliminary results obtained by training the recurrent neural network proved promising, indicating that the network not only learns to predict the next step of a given trajectory, but also learns the spatial layout in terms of landmark configuration.

In order to highlight the internal representation of the network, we performed a series of analyses. Firstly, we were concerned with whether the network could learn any boundaries. Since all the predicted vectors were within the limits delineating the ground-floor layout, it seems that the network did indeed learn the floor boundaries. Another issue we were interested in was whether the network built any cognitive map of the virtual space.

Figure 5 presents the "cognitive map" derived from the network's representation of landmark positions. It shows the layout of the virtual space (ground floor) in terms of the centers of the landmarks, i.e. their (x, y) coordinates, as the neural network predicted them. The predicted coordinates of the landmarks form clusters around the coordinates of the landmarks within the VE. The centroid for each cluster is plotted as well. The "cognitive map" (the internal representation of space achieved by the network) conserves the topology well, while its metric is slightly less accurate. The same properties characterize the cognitive maps built by humans; in other words, the network built a map of the space in a way similar to humans.

Fig. 5. Centers of the landmarks predicted by the neural network (x and y coordinates of the landmark centers: predicted values and centroid values)

For a better understanding of the network's ability to discriminate between landmark features, we performed an analysis of the network's predictions with regard to the attention it paid to each landmark. More precisely, we counted how many times a landmark was visited, or in other words how many times a given landmark was the nearest to the user. We took this measurement as an indicator of landmark saliency.

The most important landmarks are the desk with a computer, the sofa in the center of the larger room, the door to the meeting room, the large elliptical table, the lift, and the door between the meeting room and the library.

The saliency of a landmark is related to the landmark's location in the room (e.g. centrality), its size, and its unique features (e.g. the shelves in the library are all alike and thus indistinguishable). These findings suggest that users paid particular attention to connectivity/decision points such as doors and the lift.

To assess the network's accuracy in predicting the landmarks, Fig. 6 presents the actual centers of the landmarks together with the centroids of the landmark centers as the neural network anticipated them.

A visual examination suggests that the network's predictions of landmark locations, in terms of (x, y) coordinates, approximate the landmarks' locations within the VE. We employed a quantifiable measure [15]: the mean cosine of the angle between the vectors representing the landmark positions predicted by the network (namely the centroids of the clusters) and the vectors representing the landmark positions in the virtual world, (x, y). The average cosine was 0.9996, where 1 would represent perfect performance.
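A minimal sketch of this mean-cosine measure, assuming the centroids and landmark centers are given as arrays of (x, y) vectors:

import numpy as np

def mean_cosine(centroids, landmarks):
    """Mean cosine of the angle between predicted centroids and actual
    landmark centers, both given as arrays of (x, y) vectors."""
    c = np.asarray(centroids, float)
    m = np.asarray(landmarks, float)
    cos = np.sum(c * m, axis=1) / (np.linalg.norm(c, axis=1)
                                   * np.linalg.norm(m, axis=1))
    return cos.mean()                  # 1.0 would be perfect performance

print(mean_cosine([[0.21, -0.39]], [[0.20, -0.40]]))   # close to 1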


Fig. 6. Original landmarks: centers and the centroids for their predictions (x and y coordinates of the landmark centers)

6 Conclusions

This simulation was carried out with the purpose of showing that some abstract aspects related to spatial cognition are learnable. The basic idea is that an input vector consisting of the current Cartesian coordinates together with information about the nearest landmark, such as the distance to it and the coordinates of its center, is sufficient to induce the internal abstractions needed to predict the next position. Moreover, the network is able to understand the spatial configuration of the VE.

The network correctly predicted the next position together with its nearest landmark at a rate of 67.57%. It was also able to learn the boundaries of the spatial layout, and even to build a cognitive-like map. The spatial representation of the virtual world preserves both topology and metric. The network was also able to assign saliency to landmarks, related to their location (e.g. centrality in the room frame), their size and their distinctiveness.

A future direction will be to analyze the representations in the hidden layer in order to extract rules or procedural knowledge underlying navigational behavior. Using neural networks as a tool for studying navigation can be beneficial for user modeling in the area of spatial knowledge acquisition. By permitting a comparative analysis between efficient and inefficient navigational strategies, this methodology could suggest how VEs might be better designed. Based on these results, further work will focus on assisting new users to improve their spatial abilities when exploring a new VE. After a period of navigation, users could be classified into a cluster according to their navigational patterns [20]. By predicting a user's subsequent trajectory, pertinent advice could be provided to reduce its offset from a desirable "good" trajectory; this guidance would thereafter improve user exploration. Alternatively, a real-time dynamic reconstruction of the VE could assist users in their tasks.


References

1. Amant, R.S. and Riedl, M.O. A practical perception substrate for cognitive modeling in HCI. International Journal of Human Computer Studies, 55, 1, 2001, 15-39.

2. Christos, S. What is a neural network? Surprise 96 On-Line Journal, 1, 1996.

3. Darken, R.P. Wayfinding in Large-Scale Virtual Worlds, in Proceedings of SIGCHI '95, ACM Press, 45-46.

4. Elman, J.L. Finding structure in time, Cognitive Science, 14, 1990, 179-211.

5. Finlay, J. and Beale, R. Pattern recognition and classification in dynamic and static user modelling, in Beale, R. and Finlay, J. (eds.) Neural Networks and Pattern Recognition in Human Computer Interaction. Ellis Horwood, 1992.

6. Gabbard, J. and Hix, D. Taxonomy of Usability Characteristics in Virtual Environments, Final Report to the Office of Naval Research, 1997.

7. Golledge, G.R. Environmental cognition, in Stokols, D. and Altman, I. (eds.) Handbook of Environmental Psychology, vol. 1. John Wiley & Sons, Inc., New York, 1987, 131-175.

8. Golledge, G.R. Human cognitive maps and wayfinding, in Golledge, G.R. (ed.) Wayfinding behaviour. Johns Hopkins University Press, Baltimore, 1-45.

9. Hagner, D.G., Watta, P.B. and Hassoun, M.H. Comparison of Recurrent Neural Networks for Trajectory Generation, in Recurrent Neural Networks: Design and Applications, L. Medsker and L.C. Jain (eds.), CRC Press, 1999.

10. Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice-Hall, 1994.

11. Jacobs, W.J., Thomas, K.G.F., Laurence, H.E., and Nadel, L. Place learning in virtual space II: Topological relations as one dimension of stimulus control. Learning and Motivation, 29, 1998, 288-308.

12. Kuipers, B. Modeling spatial knowledge. Cognitive Science, 2, 1978, 129-153.

13. Lint, H. van, Hoogendoorn, S.P., van Zuylen, H.J. Freeway Travel Time Prediction with State-Space Neural Networks, Preprint 02-2797 of the 81st Annual Meeting of the Transportation Research Board, Washington D.C., 2002.

14. Lynch, K. Image of the city. The MIT Press, Cambridge MA, 1960.

15. Morris, W.C., Cottrell, G.W., Elman, J. A connectionist simulation of the empirical acquisition of grammatical relations, in Wermter, S. and Sun, R. (eds.) Hybrid Neural Symbolic Integration. Springer Verlag, 2000.

16. O'Hare, G.M.P., Sewell, K., Murphy, A.J. and Delahunty, T. ECHOES: An Immersive Training Experience. Proceedings of Adaptive Hypermedia and Adaptive Web-based Systems, 2000, 179-188.

17. Pearlmutter, B. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2), 1989, 263-269.

18. Preece, J., Rogers, Y., Sharp, H., Benyon, D., Holland, S., and Carey, T. Human-Computer Interaction. Wokingham, UK: Addison-Wesley, 1994.

19. Psarrout, A., Gong, S., Buxton, H. Modeling spatio-temporal trajectories and face signatures on partially recurrent neural networks, in Proc. IEEE International Conference on Neural Networks, Perth, Australia, Nov 1995, 2226-2232.

20. Sas, C., O'Hare, G.M.P., Reilly, R. A Connectionist Approach to Modeling Navigation: Trajectory Self Organization and Prediction. Proceedings of 7th ERCIM Workshop, User Interfaces for All, Carbonell, N. and Stephanidis, C. (eds.), 2002, 111-116.

21. Sayers, H.M., Wilson, S., Myles, W., McNeill, M.D.J. Usable Interfaces for Virtual Environment Applications on Non-Immersive Systems, in Proceedings Eurographics UK, 2000, 143-150.

22. Sundareshan, M.K., Wong, Y.C. and Condarcure, T. Training algorithms for recurrent neural nets that eliminate the need for computation of error gradients with application to trajectory production problem, in Recurrent Neural Networks: Design and Applications, L. Medsker and L. Jain (eds.), CRC Press, 1999.


Statistical machine learning for tracking hypermedia user behavior

S. Bidel, L. Lemoine, F. Piat, T. Artières, P. Gallinari

LIP6, Université Paris 6, 8 rue du capitaine Scott, 75015 Paris, France

{Sylvain.Bidel, Laurent.Lemoine, Frederic.Piat, Thierry.Artieres, Patrick.Gallinari}@lip6.fr

Abstract. We consider the classification and tracking of user navigation patterns in closed-world hypermedia. We use a number of statistical machine learning models and compare them on different instances of the classification/tracking problem, using a home-made access-log database. We conclude on the potential and limitations of these methods for user behavior identification and tracking.

1 Introduction

The development and increasing complexity of hypermedia systems accessible to a large variety of users have created a need for tools that help the user meet his information need. For such applications, where navigation plays an important role, it is generally useful to characterize the user's behavior dynamically. Online positioning of individuals within a taxonomy of users is important information for defining help strategies. Examples of such characterizations are the classification of user behavior among predefined categories, the discovery of behavior changes, and the tracking of user behavior. Such an analysis is a preliminary step in the development of adaptive help systems, and it is usually performed by analyzing online user traces.

We consider here how machine learning techniques could be used to dynamically characterize the behavior of a user navigating a closed-world, rich-information-content hypermedia. In this context, we focus on the automatic discovery of user navigation behavior from the low-level information provided by the temporal sequences of navigation actions (visited document, click, scroll, etc.). The goal is to follow the user during his navigation and to track his navigation behavior. We first propose a number of features which allow the sequence of user actions to be characterized with respect to the user typology, and which are adapted to the rich content of the hypermedia. To track the user behavior, we investigate a number of statistical machine learning models for dealing with temporal data: a neural network and Markovian models (Markov Chains and Hidden Markov Models); we also introduce Multi-Stream Markov Models, which allow different feature sequences occurring at different time scales, and asynchronously, to be processed simultaneously. For comparing and analyzing the models we propose two series of experiments: we first evaluate the models on the classification of single-behavior data sequences, and we then consider the more difficult, and more interesting, problem of detecting changes in behavior (called tracking). For both tasks we take two approaches to the learning problem: supervised and unsupervised learning. They correspond to two different strategies and needs for developing user models. Supervised learning might be adequate for a fixed number of user categories with well-defined user behaviors, whereas unsupervised learning makes it easier to incorporate new categories and to develop adaptive systems able to incorporate new or evolving populations. It may be questionable to use supervised learning in our context, but results gained with such a learning scheme at least provide insights into the learnability of user navigation behavior. In order to perform supervised learning in a controlled setting, we chose to build a home-made database where user sessions may be labeled with behaviors. The database consists of sessions from 26 users who were enrolled for a total of about 16 hours of navigation of a multimedia encyclopedia. Besides comparing different models on the classification and tracking tasks, we discuss the potential and limitations of automatic tools for analyzing user action sequences. Note that sequence models, namely Markovian models [10] and Dynamic Bayesian Networks [8], have mainly been used for predicting user actions or for inferring goals in environments which are described using existing domain-specific knowledge. The paper is organized as follows. We first introduce in §2 a navigation pattern typology. Then, we describe in §3 the database used in our experiments. In §4, we describe our behavior models and present the supervised and unsupervised strategies. Finally, we give experimental results in §5.

2 High level navigation strategies

Although most researchers distinguish broad user navigation strategies (e.g. browsing and searching), there is no general agreement on a typology. Defining a clear classification between different behaviors is also made difficult by the fact that strategies are not mutually exclusive and users frequently go back and forth between them; e.g. browsing may be used to achieve searching. For categorizing behaviors, we use a popular taxonomy by Canter [4] that represents a good trade-off, covering the basic navigation strategies while keeping the number of behaviors small: too many behaviors would make them much harder to interpret, and it might not be straightforward to define an adequate corresponding help strategy. It distinguishes four high-level behaviors:

Scanning: seeking an overview of a theme (i.e. a subpart of the hypermedia) by requesting an important proportion of its pages but without spending much time on them.

Exploring: reading thoroughly the pages viewed.

Searching: seeking a particular document or information.

Wandering: navigating in an unstructured fashion without particular goal or strategy.


3 Database and Feature Extraction

For this work, we used « The XXth century encyclopedia », initially a cultural CD-ROM1 reconfigured as an Internet site. This is a typical "cultural" hypermedia system: it contains about 2000 articles (i.e. pages with text, pictures, videos, etc.), a full-text search engine, and tables of contents where the user can navigate a 2-level theme hierarchy. Each theme is associated with a set of keywords. Each article is associated with a theme, navigation links towards other articles, and reading times corresponding to the durations required to fully read each of its paragraphs. All of the above have been set by the designers of the site.

In order to evaluate our methods, we generated homogeneous user data sessions in a controlled fashion, by asking 26 users to fill out questionnaires by navigating through the encyclopedia. The questions for each session were chosen in order to induce a given navigation behavior according to the typology in §2. For instance, one question asks the user to extract some important dates from a particular theme. This prompts the user to view several pages of this theme without having to read them thoroughly, which corresponds to the "Scanning" behavior. For the "Exploring" behavior, the user was asked to fully read a few articles. For the "Searching" and "Wandering" behaviors, the users were asked to retrieve a particular picture from the whole encyclopedia (for Searching) and to pick any one they liked (for Wandering). The sequence of user actions (traces) recorded by the navigator and associated with each question is called a homogeneous user data session. Each session is then labeled with the corresponding high-level behavior. These labels are used for the evaluation and for training supervised classifiers.

104 data sessions were thus gathered, 26 for each of the 4 behaviors. Navigation data are sequences of dated events (page access, click, scroll, query on the search engine, etc.) which are collected all along the user session. These traces are then processed to compute sequences of feature vectors, or frames. A frame was computed about every minute; overall, this yields over 900 frames. To characterize user behavior, we investigated various features defined intuitively by observing the navigation habits of several people. After elimination of redundant (highly correlated) features, we are left with 9 features that take advantage of the richness of the information associated with articles (reading time, etc.). These 9 features are divided into three subsets according to the type of information they carry. The "reading" subset reflects the extent and the quality of the reading behavior and contains 4 features: using reference reading times for each paragraph, we compute reading rates for the first quarter and for the rest of the document (this applies when a document is accessed via scrolling); the time spent on the page(s) and the activity (number of click/scroll events) complete this set of features. The "resources" subset informs the system about the type of resources used. They may be articles (the real content of the hypermedia, the leaf pages in the tree of the theme hierarchy), tables of contents (of 3 levels, containing links to access the themes, sub-themes or articles), or the search engine page. We use as features the percentage of time spent on these three kinds of resources.

1 Distributed by Montparnasse Multimédia company.


The "navigation" subset characterizes the navigation focus: it indicates whether the user is focused on one theme or spread over several. We define two navigation features based on an inter-theme similarity, which is the cosine between the two vectors representing the themes' keywords; this is a classical similarity measure used in the Information Retrieval field. The first feature is the average distance between the themes of successive pages accessed during the frame duration. The second feature does not take into account the visiting order of the pages but measures the global variability of the themes visited, similar to a 'weighted standard deviation'. For this, we first determine the main (focus) theme as the one whose average distance to all visited articles is minimal. We then compute the average distance of the visited themes to this focus, weighted by the time spent on each theme.
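A minimal sketch of these two navigation features, assuming a precomputed theme-similarity matrix and taking distance as 1 minus the cosine similarity (the text does not spell out this conversion):

import numpy as np

def navigation_features(sim, themes, times):
    """sim[i, j]: cosine similarity between theme keyword vectors;
    themes: theme index of each page visited during the frame, in order;
    times: time spent on each of those pages."""
    dist = 1.0 - np.asarray(sim, float)
    # Feature 1: average distance between themes of successive pages.
    succ = (np.mean([dist[a, b] for a, b in zip(themes, themes[1:])])
            if len(themes) > 1 else 0.0)
    # Feature 2: find the focus theme (minimal average distance to all
    # visited themes), then the time-weighted average distance to it.
    focus = min(set(themes),
                key=lambda f: np.mean([dist[f, t] for t in themes]))
    w = np.asarray(times, float)
    spread = np.sum(w * np.array([dist[focus, t] for t in themes])) / w.sum()
    return succ, spread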

4 Behavior Models

After the feature extraction step, the navigation information in a user session is represented as a sequence of frames (a frame is a vector of 9 features); each frame corresponds to timely information about the user actions. Let o_1^T = (o_1, ..., o_T) denote a sequence of T frames, o_t being the t-th frame in the sequence. To identify different user behaviors, we trained models of frame sequences. Such a model, B, allows computing the sequence likelihood P(o_1^T / B). We investigated various statistical machine learning models: Multi-Layer Perceptrons (MLPs), Markov Chains (MCs), Hidden Markov Models (HMMs) and Multi-Stream variations of Markovian models.

We used MLPs since they are known to be efficient for discrimination tasks. An MLP is trained (using the Back Propagation algorithm) to discriminate between frames of different behaviors. It takes a frame as input and outputs a vector of behavior scores; the maximal score corresponds to the recognized behavior. When trained for discrimination, an MLP is known to approximate the posterior probabilities P(B / o_t). One can then use this MLP to classify sequences of frames since, using Bayes' theorem:

argmax_B P(o_1^T / B) . P(B) = argmax_B Π_{t=1..T} P(B / o_t)    (1)
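In practice, (1) amounts to summing log-posteriors over the frames of a session; here is a minimal sketch, where `mlp_posteriors` is a hypothetical hook returning the MLP's output vector for one frame and the behavior priors are assumed uniform:

import numpy as np

def classify_session(frames, mlp_posteriors, behaviors):
    """Pick the behavior maximizing the product of per-frame posteriors
    P(B / o_t), computed as a sum of logs for numerical stability."""
    log_scores = np.zeros(len(behaviors))
    for o_t in frames:
        log_scores += np.log(mlp_posteriors(o_t) + 1e-12)  # avoid log(0)
    return behaviors[int(np.argmax(log_scores))]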

We also used Markovian models since they have shown strong abilities in various signal and sequence modeling and classification tasks. We use one Markovian model per behavior (either MC or HMM), with an ergodic topology (any transition allowed) and, in the case of HMMs, diagonal-covariance Gaussian densities. The learning and recognition algorithms are classical ones and are not detailed here.

The underlying hypothesis of Markovian models is that the modeled process is locally stationary, and a transition in the Markov model corresponds to a skip from one of its stationary states to another. A consequence is that all features in the frames are assumed to obey a synchronous process. This assumption does not hold for the features used here, which do not change synchronously. Hence, we propose to use a variant of Markovian models called Multi-Stream Markovian models [9]. We briefly present below the principle of multi-stream HMMs (MS-HMMs); the case of MS-MCs is similar.


MS-HMMs allow combining multiple partially synchronous information streams or modalities [6, 7]. In our study, an MS-HMM is a combination of three stream-HMMs, each operating on a different information stream corresponding respectively to the Reading, Resources and Navigation frame sequences (see §3). The three streams are asynchronous, i.e. transitions in the three stream-HMMs may occur at different times, except in some particular states named recombination states. We have chosen the entering and leaving states of each stream model as recombination states, i.e. each behavior model is fully asynchronous. This means that, given an entering time and a leaving time in a behavior model, one can compute very simply the probability of the corresponding sub-sequence of frames. For example, the probability of a sub-sequence of frames from time b to time e is computed with:

P(o_b^e / B) = P(rd_b^e / B_rd) . P(rs_b^e / B_rs) . P(n_b^e / B_n)    (2)

where B is a behavior MS-HMM model composed of three HMM models B_rd, B_rs and B_n, working respectively on the sub-sequences rd_b^e, rs_b^e and n_b^e of reading, resources and navigation frames.

Recombination states will be useful for segmentation tasks as will be seen in §4.2.

4.1 Supervised and unsupervised learning

For our experiments, we investigated supervised and unsupervised learning. Both were performed using the assumption that a homogeneous user data session in the database (as defined in §3) corresponds to a user who does not change his behavior all along the session. In supervised learning, due to the lack of information, behavior priors are assumed uniform. In unsupervised learning, these priors are learned along with the behavior models.

For supervised learning, we used the labeling of our database into the four elementary behaviors described in §2. For Markovian models, we learned four models (one for each behavior); each behavior model is trained to maximize the likelihood of the associated training sessions. For the MLP, one MLP is trained to discriminate between the frames of the four behaviors, using a Mean Squared Error criterion.

We also investigated unsupervised learning for Markovian models (we did not perform unsupervised experiments with MLPs since this model is not well adapted to the task). To do this, we consider a mixture of N probabilistic behavior models. The probability of a sequence is given by the following mixture of sequence models:

P(o_1^T) = Σ_{i=1..N} P(B_i) . P(o_1^T / B_i)    (3)

where {B_i}_{i=1..N} are the N behavior models, P(B_i) is the prior probability of the i-th behavior model B_i, and P(o_1^T / B_i) is the likelihood of o_1^T computed by B_i. Learning consists in maximizing the likelihood of all training sessions given this mixture model. Since for unsupervised learning we do not know which behavior a training session belongs to, we use an EM procedure where the missing data are the posterior probabilities of behavior P(B_i / o_1^T).


This algorithm performs a clustering of user sequences. Here is the sketch of our clustering algorithm, which is close to the one in [3]:

0. Initialize the parameters of all behavior models {B_i}_{i=1..N} and of the priors.

1. Iterate until convergence:

   i. Estimate the missing data using the current models:

      P(B_i / o_1^T) = P(o_1^T / B_i) . P(B_i) / Σ_{j=1..N} P(o_1^T / B_j) . P(B_j)    (4)

   ii. Re-estimate the behavior models with all training sessions. A session o_1^T participates in the re-estimation of model B_i with a weight corresponding to P(B_i / o_1^T).

   iii. Re-estimate the behavior model priors:

      P(B_i) = (1 / #Training sessions) . Σ_{o_1^T in Training Data} P(B_i / o_1^T)    (5)

4.2 Behavior categorization and tracking

In a first step, we compared the different methods on the classification of homogeneous user data sessions (§3). For each model (MLP, MCs, MS-MCs, HMMs or MS-HMMs), the sessions are classified according to the model maximizing the sequence likelihood ((1) was used for the MLP).

Usually, sessions are not homogeneous and exhibit multiple successive behaviors; the goal is then to track the user behavior online. In this case, we make use of global session models built from the elementary homogeneous models. For Markov models, a global Markov model is built by connecting the leaving state of each behavior model to the entering state of each behavior model. Then, considering an unknown session, a dynamic programming algorithm finds the optimal state path for the session, from which we derive the most likely sequence of typical behaviors. This corresponds to the segmentation step for standard Markovian models (MCs and HMMs). For MLPs, a similar scheme may be used. Let us explain below how it works for MS-HMMs (it is similar for MS-MCs). To segment a session into elementary behaviors using MS-HMMs, one builds three large HMMs λ_rd, λ_rs and λ_n by concatenating, as above, all HMMs corresponding respectively to reading features, resources features and navigation features. The global MS-HMM model, denoted λ, is built from these three asynchronous models by imposing synchronization points at each leaving state, i.e. the three paths in each stream are forced to leave their behavior model at the same time. The likelihood of a session is given by:

P(o_1^T / λ) = Σ_{S_rd, S_rs, S_n} P(rd_1^T / S_rd, λ_rd) . P(rs_1^T / S_rs, λ_rs) . P(n_1^T / S_n, λ_n) . P(S_rd, S_rs, S_n / λ)    (6)


where S_rd, S_rs and S_n are the paths in λ_rd, λ_rs and λ_n. The synchronization consists in setting P(S_rd, S_rs, S_n / λ) to 0 if the constraint is not verified. Otherwise, P(S_rd, S_rs, S_n / λ) is set equal to P(S_rd / λ_rd) . P(S_rs / λ_rs) . P(S_n / λ_n).
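As a rough illustration of the tracking step, the sketch below runs dynamic programming over per-frame behavior log-likelihoods; it simplifies the global model down to a single switching penalty between behaviors, rather than the full concatenated (MS-)HMM described above.

import numpy as np

def track(frame_loglik, switch_penalty=2.0):
    """frame_loglik[t, i]: log-likelihood of frame t under behavior i.
    Returns the most likely behavior label for each frame."""
    T, N = frame_loglik.shape
    delta = frame_loglik[0].copy()             # best score ending in i
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        stay = delta                           # remain in the same behavior
        move = delta.max() - switch_penalty    # switch from the best one
        back[t] = np.where(stay >= move, np.arange(N), delta.argmax())
        delta = np.maximum(stay, move) + frame_loglik[t]
    path = [int(delta.argmax())]               # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]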

5 Experiments

We now describe the two series of experiments. In the first series, we categorize homogeneous user sessions. This does not usually correspond to a realistic scenario, but it allows a preliminary evaluation on a simplified task. In the second series of experiments, we want to track the user behavior and detect behavior changes; this amounts to segmenting user sessions into reference behaviors, which is a more realistic situation. For this second series of experiments, we concatenated all the homogeneous user sessions in the database using a random ordering, producing long sessions where the user behavior changes. All evaluations were performed using 26-fold cross-validation: each experiment consists in training the system using all user data but one user's, and testing on the remaining user's data. It must be noticed that, even in a closed and controlled environment like the one we are dealing with, user behavior classification is difficult and has intrinsic limitations. Even with a clear goal in mind, a user goes back and forth between strategies during a session, which makes an accurate classification of sessions difficult. The elementary behaviors we use are only rough abstract representations of the potential user behavior.

Since we built the database using a predefined scenario, we know the label of each elementary session. It is thus possible to perform supervised learning for both classification and segmentation, and we did so with all the models described in §4. Although this strategy could make sense for user behavior classification in some controlled environments, it is more realistic to consider the problem as an unsupervised learning problem where sessions are unlabeled and the goal is to identify typical user behaviors from scratch. We thus performed unsupervised learning experiments with Markovian models only, since the MLP is not well adapted to unsupervised learning. The interpretation of the discovered behaviors is complex, and the evaluation of unsupervised methods is an open problem. We therefore report below the performance of the unsupervised methods with regard to the known (i.e. supervised) labels of the elementary sessions. Although this is not fully satisfying, it provides interesting hints for measuring the ability of these methods to detect user behaviors. Note that the performance obtained using supervised methods provides an upper bound on the performance that could be obtained for session classification and segmentation.

5.1 Session categorization

Here, whole sessions have to be classified according to an underlying behavior. For evaluating the models, we used two criteria: the standard correct classification (CC) percentage, and a weighted accuracy (WA) criterion where confusions between classes have different weights. The idea behind WA is that confusions between behaviors do not all have the same importance, since user help actions for some classes may be very similar. In our WA, confusions between Scanning and Exploring and between Searching and Wandering are weighted by a factor of ½, while all other confusions are assigned a weight of 1; these weights were fixed by hand. For supervised learning with Markovian models, we trained 4 models, one for each typical behavior. A standard HMM (MC) model working on whole frames has 7 states. A multi-stream HMM (MC) model consists of 3 HMMs (MCs), one per feature subset, with 3 states each. The number of states in the models was fixed using cross-validation. We use one MLP trained in discrimination mode. For behavior clustering (unsupervised learning), we first determined an "optimal" number of clusters using the F-statistic, a cluster homogeneity measure, and found an optimal number of 6 clusters. We then learned a mixture of 6 models. Training sessions were then clustered according to the model with the greatest likelihood. After training, each cluster was labeled with one of the 4 classes according to the majority of the labels it contains. The CC and WA criteria may then be computed.
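One plausible reading of this WA criterion, computed from a confusion matrix (the exact normalization is not spelled out in the text, so treating WA as 1 minus the weighted error rate is an assumption):

import numpy as np

LABELS = ["Scanning", "Exploring", "Searching", "Wandering"]
COST = np.ones((4, 4)) - np.eye(4)     # weight 1 for any confusion
COST[0, 1] = COST[1, 0] = 0.5          # Scanning <-> Exploring count half
COST[2, 3] = COST[3, 2] = 0.5          # Searching <-> Wandering count half

def weighted_accuracy(confusion):
    """confusion[i, j]: number of class-i sessions classified as class j."""
    confusion = np.asarray(confusion, float)
    return 1.0 - (confusion * COST).sum() / confusion.sum()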

Table 1 sums up our results. Both CC and WA are reasonably high for most supervised models: elementary behaviors can be recognized rather accurately from low-level navigation data. Behavior may be recognized with up to 65% accuracy using only one frame (1 minute), and with up to 79% for whole sessions (about 5 minutes on average). As may be seen, HMMs and the MLP perform similarly, outperforming the MC models.

Table 1. Correct classification percentage (CC) and weighted accuracy (WA) for the behavior classification task, for supervised and unsupervised systems.

Training mode   Criterion     HMM   MS-HMM   MC   MS-MC   MLP
Supervised      Session CC     79       76   57      63    74
Supervised      Session WA     85       84   67      72    83
Unsupervised    Session CC     69       65   61      61     -
Unsupervised    Session WA     78       76   70      70     -

Although CC and WA are noticeably lower for unsupervised training, a reasonable proportion of the sessions is again correctly classified. This shows that unsupervised classification on user traces captures valuable information about user behavior; it also shows the difficulty of the task. Going further in the evaluation of the unsupervised systems would require a manual analysis of the clusters, which is beyond the scope of this paper. For supervised learning, all of HMM, MS-HMM and MLP perform similarly; the MS-HMM is unable to benefit here from its better ability to model the sequence data, and the same conclusion holds for unsupervised learning (MLPs being left out).

5.2 User behavior tracking

Here, the system has to detect behavior changes in a long session and to recognize these behaviors. A segmentation system receives as input a sequence of frames and outputs a sequence of labels, one for each frame. In our controlled experimental setting, this computed sequence has to be close to the actual label sequence. Different
measures have been proposed for comparing discrete sequences. Here we use the edit distance between the computed and desired label sequences [1]. This classical measure counts the insertions, deletions and substitutions needed to transform one string into the other. The correct recognition percentage is then 100% minus the substitution and deletion percentages. Note that this does not take into account the duration of each detected behavior. We made this choice considering that it is not important to detect the exact time at which the user changes his exploration strategy, but rather to detect the change of strategy within a reasonable delay; the edit distance reflects this idea to a certain extent. For supervised learning, models are first trained on elementary sessions, as for classification. For unsupervised learning, the class of each homogeneous session is supposed unknown. Models are then used to segment a large session in which elementary sessions have been concatenated, and the computed sequence is compared to the desired sequence via the edit distance. Table 2 shows the experimental results.

Table 2. Edit-distance rates between correct and predicted behavior sequences, with substitution cost = 1 and deletion cost = insertion cost = 2, for supervised and unsupervised systems.

Training mode   System   % Correct   % Subst.   % Del   % Ins
Supervised      HMM             78         14       9      12
Supervised      MS-HMM          75         16      10      10
Supervised      MC              49         38      13      14
Supervised      MS-MC           61         29      10      17
Supervised      MLP             73         16      11      13
Unsupervised    HMM             35         55      10      14
Unsupervised    MS-HMM          39         50      11      13
Unsupervised    MC            37.5         51      12      12
Unsupervised    MS-MC           39         44      14    12.5
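For reference, here is a minimal sketch (our own implementation; function names and the toy label sequences are ours) of the evaluation measure used in Table 2: an edit distance with substitution cost 1 and insertion/deletion cost 2, from which the correct recognition percentage is 100% minus the substitution and deletion percentages.

    def align_counts(ref, hyp, sub_cost=1, indel_cost=2):
        # Minimum-cost alignment of two label sequences (dynamic programming).
        n, m = len(ref), len(hyp)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            D[i][0] = i * indel_cost
        for j in range(1, m + 1):
            D[0][j] = j * indel_cost
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = D[i-1][j-1] + (0 if ref[i-1] == hyp[j-1] else sub_cost)
                D[i][j] = min(diag, D[i-1][j] + indel_cost, D[i][j-1] + indel_cost)
        # Backtrack along one optimal path to count each operation type.
        subs = dels = ins = 0
        i, j = n, m
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and
                    D[i][j] == D[i-1][j-1] + (0 if ref[i-1] == hyp[j-1] else sub_cost)):
                if ref[i-1] != hyp[j-1]:
                    subs += 1
                i, j = i - 1, j - 1
            elif i > 0 and D[i][j] == D[i-1][j] + indel_cost:
                dels += 1
                i -= 1
            else:
                ins += 1
                j -= 1
        return subs, dels, ins

    # Toy example: one behavior label per frame ('S'canning, 'E'xploring, 'W'andering).
    ref = list("SSSEEEWWW")   # desired label sequence
    hyp = list("SSEEEEWWW")   # sequence computed by the segmentation system
    s, d, _ = align_counts(ref, hyp)
    print("correct %:", 100 * (1 - (s + d) / len(ref)))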

Again, the performance of the supervised models is satisfying and shows only a small drop compared to the simpler classification task. MLP, HMMs and MS-HMMs still score higher than MCs and MS-MCs. This is an encouraging result, since it shows the feasibility of behavior tracking. On the other hand, the performance of unsupervised systems drops 30% below the supervised upper bound for all models: the lower classification ability carries over to segmentation. It looks like tracking is not possible in an unsupervised setting. Note, however, that this evaluation of unsupervised systems for segmentation is even more questionable than for categorization, since there is no clear frontier between the different categories. An analysis of the segments inside each cluster should be performed in order to assess the relevance of these models. Globally, these controlled experiments show that classical machine learning tools like HMMs and MLPs, operating on adequate navigation features, allow extracting significant information from online user data. These models behave similarly, and the more sophisticated models did not bring any improvement. All models can operate online on the sequences of user actions. Both classification and tracking perform at a reasonable level in a supervised setting, although the performance shown here could be an upper bound for this type of system. For unsupervised learning, which probably corresponds to the more interesting scenario for analyzing user actions, things are more
complicated: for example, nothing can be definitely concluded from the results of the tracking experiments. Further investigation and interpretation of the data segmentation are needed to go further. It seems, however, that in this setting additional information such as user interaction is needed in order to confirm or invalidate the model decisions. In both cases, generative models like HMMs allow incorporating new user behaviors, which is an advantage over discriminative methods like MLPs.

6 Conclusion

We proposed a series of new features and investigated various statistical machine learning models for the categorization and tracking of user navigation behaviors in rich hypermedia systems. Experiments were performed on a real hypermedia system using a controlled navigation database. Results show that session classification and tracking perform well in a supervised setting, but that performance drops for unsupervised tracking. It is not clear yet whether this is an intrinsic limit of the unsupervised approach to tracking or a side effect of the evaluation criteria. In all cases it seems that additional information from the user must be taken into account if we want reliable tracking and classification.

Acknowledgment: This project is part of the RNTL project Gicsweb funded by the French Ministry of Industry.

Bibliography

1. Atallah, M.J. (ed.), Algorithms and Theory of Computation Handbook, CRC Press LLC, 1999.

2. Brusilovsky, P., Adaptive Hypermedia, User Modeling and User-Adapted Interaction, 2001.

3. Cadez, I., Gaffney, S., Smyth, P., A general probabilistic framework for clustering individuals and objects, In Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining, 2000.

4. Canter, D., Rivers, R., Storrs, G., Characterizing User Navigation through Complex Data Structures, Behaviour and Information Technology, vol. 4, 1985.

5. Catledge, L., Pitkow, J., Characterizing Browsing Strategies in the World Wide Web, Computer Networks and ISDN Systems, vol. 27, no. 6, 1995.

6. Dupont, S., Luettin, J., Using the Multi-Stream Approach for Continuous Audio-Visual Speech Recognition: Experiments on the M2VTS Database, Int. Conf. on Spoken Language Processing, 1998.

7. Gauthier, N., Artières, T., Dorizzi, B., Gallinari, P., Strategies for combining on-line and off-line information in an on-line handwriting recognition system, Int. Conf. on Document Analysis and Recognition, 2001.

8. Horvitz, E., Breese, J., Heckerman, D., Hovel, D., Rommelse, K., The Lumière Project: Bayesian user modeling for inferring the goals and needs of software users, UAI 98.

9. Varga, A., Moore, R., Hidden Markov Model decomposition of speech and noise, International Conference on Acoustics, Speech and Signal Processing, 1990.

10. Zukerman, I., Albrecht, D., Nicholson, A., Predicting users' requests on the WWW, UM 99.