
Word Weighting based on User’s Browsing History

Yutaka Matsuo
National Institute of Advanced Industrial Science and Technology (JPN)

Presenter: Junichiro Mori
University of Tokyo (JPN)

Outline of the talk

• Introduction
  – Context-based word weighting

• Proposed measure

• System architecture

• Evaluation

• Conclusion

Introduction

• Many information support systems with NLP use tf·idf to measure the weight of words.
  – Tf·idf is based on statistics of word occurrence in a target document and a corpus.
  – It is effective in many practical systems, including summarization systems and retrieval systems.

• However, a word that is important to one user is sometimes not important to others.

Introduction

Example
• “Suzuki hitting streak ends at 23 games”
  – Ichiro Suzuki is a Japanese MLB player, MVP in 2001.
  – Those who are greatly interested in MLB would regard “hitting streak ends” as important,
  – while a user who has no interest in MLB would note words such as “game” or “Seattle Mariners” as informative, because those words indicate that the subject of the article is baseball.

• If a user is not familiar with the topic, he/she may think general words related to the topic are important.

• On the other hand, if a user is familiar with the topic, he/she may think more detailed words are important.

Our main hypothesis

Introduction

Goal of this research

• This research addresses context-based word weighting, focusing on statistical features of word co-occurrence.

• In order to measure the weight of words more correctly, contextual information about a user (which we call “familiar words”) is used.

Introduction

Outline of the talk

• Introduction
  – Context-based word weighting
• Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
• System architecture
• Evaluation
• Conclusion

IRM
• A new measure, IRM, is based on a word-weighting algorithm applied to a single document.

– [Matsuo 03]: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information, FLAIRS 2003

We take a paper as an example: COMPUTING MACHINERY AND INTELLIGENCE

A. M. TURING

1. The Imitation Game

I PROPOSE to consider the question, 'Can machines think?' This should begin with definitions of the meaning of the terms 'machine' and 'think'. The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words 'machine' and 'think' are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, 'Can machines think?' is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words. The new form of the problem can be described in terms of a game which we call the 'imitation game'. It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either

Previous work [Matsuo03]

Distribution of frequent terms

[Figure: probability of each frequent term a–j (probability vs. term)]

Previous work [Matsuo03]

Next, count co-occurrences

…… The new form of the problem can be described in terms of a game which we call the ‘imitation game’. ……

(stemming, stop-word elimination, phrase extraction)

“new” and “form” co-occur once.
“new” and “problem” co-occur once.
…
“call” and “imitation game” co-occur once.

Previous work [Matsuo03]
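This counting step can be sketched in a few lines of Python. This is a minimal sketch only: it assumes plain whitespace tokenization and a tiny hand-written stop-word list, and it omits the stemming and phrase-extraction steps used in [Matsuo 03].

from collections import defaultdict
from itertools import combinations

# Toy stop-word list; the real pipeline also applies stemming and phrase extraction.
STOP_WORDS = {"the", "of", "a", "in", "which", "we", "can", "be", "is"}

def cooccurrence_counts(sentences):
    """Count how often two terms appear together in the same sentence."""
    counts = defaultdict(int)
    for sentence in sentences:
        # One set of terms per sentence, so each pair is counted once per sentence.
        terms = sorted({w.lower().strip(".,'`") for w in sentence.split()} - STOP_WORDS)
        for w1, w2 in combinations(terms, 2):
            counts[(w1, w2)] += 1
    return counts

example = ["The new form of the problem can be described in terms of a game "
           "which we call the imitation game."]
for pair, n in cooccurrence_counts(example).items():
    print(pair, n)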

Co-occurrence matrix

[Figure: rows of the co-occurrence matrix for “kind” and “make” against the frequent terms]

Previous work [Matsuo03]

Co-occurrences of “kind” – frequent terms, and “make” – frequent terms
• A general term such as “kind” or “make” is used relatively impartially with each frequent term, …but

Previous work [Matsuo03]

Co-occurrence matrix

[Figure: rows of the co-occurrence matrix for “imitation” and “digital computer” against the frequent terms]

Previous work [Matsuo03]

Co-occurrences of “imitation” – frequent terms, and “digital computer” – frequent terms
…while a term such as “imitation” or “digital computer” shows co-occurrence especially with particular terms.

Previous work [Matsuo03]

Biases of co-occurrence

• A general term such as “kind” or “make” is used relatively impartially with each frequent term,

while a term such as “imitation” or “digital computer” shows co-occurrence especially with particular terms.

• Therefore, the degree of bias of co-occurrence can be used as a surrogate for term importance.

Previous work [Matsuo03]

χ²-measure

• We use the χ²-test, which is very common for evaluating biases between expected and observed frequencies.

$$\chi^2(w) = \sum_{g \in G} \frac{\bigl(\mathrm{freq}(w, g) - p_g f(w)\bigr)^2}{p_g f(w)}$$

(observed co-occurrence: freq(w, g); expected co-occurrence: p_g f(w))

G: frequent terms
freq(w, g): frequency of co-occurrence of term w and term g
p_g: unconditional probability (the expected probability) of g
f(w): the total number of co-occurrences of term w and the frequent terms G

Previous work [Matsuo03]
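As a minimal sketch of this computation in Python, assuming co-occurrence counts are kept in a dictionary keyed by term pairs (as in the counting sketch earlier); the names freq, G, and p are illustrative:

def chi_square(w, freq, G, p):
    """Bias of co-occurrence of term w with the frequent terms G.

    freq[(w, g)] : co-occurrence count of term w with frequent term g
    p[g]         : unconditional (expected) probability of g
    """
    f_w = sum(freq.get((w, g), 0) for g in G)   # f(w): total co-occurrences of w with G
    if f_w == 0:
        return 0.0                              # w never co-occurs with a frequent term
    return sum((freq.get((w, g), 0) - p[g] * f_w) ** 2 / (p[g] * f_w) for g in G)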

• Large bias of co-occurrence means importance of a word.

Sort by χ²-value:

Rank  χ²-value  Label                Freq
  1    593.7    digital computer       31
  2    179.3    imitation game         16
  3    163.1    future                  4
  4    161.3    question               44
  5    152.8    internal                3
  6    143.5    answer                 39
  7    142.8    input signal            3
  8    137.7    moment                  2
  9    130.7    play                    8
 10    123.0    output                 15
  …      …      …                       …
551      1.0    slowness                2
552      1.0    unemotional channel     2
553      0.8    Mr.                     2
554      0.8    sympathetic             2
555      0.7    leg                     2
556      0.7    chess                   2
557      0.6    Pickwick                2
558      0.6    scan                    2
559      0.3    worse                   2
560      0.1    eye                     2

We can get important words based on co-occurrence information in a document.

Previous work [Matsuo03]

Outline of the talk

• Introduction
  – Context-based word weighting
• Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
• System architecture
• Evaluation
• Conclusion

Personalize the calculation of word importance

• The previous method is useful for extracting reader-independent important words from a document.

• However, the importance of words depends not only on the document itself but also on the reader.

IRM, proposed measure

If we change the columns to pick up…

      a    b    c    d    e    f    g    h    i    j   total    k
a     ―   30   26   19   18   12   12   17   22    9    165    9
b    30    ―    5   50    6   11    1    3    2    3    111   13
c    26    5    ―    4   23    7    0    2    0    0     67    0
d    19   50    4    ―    3    7    1    1    0    4     89    4
e    18    6   23    3    ―    7    1    2    1    0     61    2
f    12   11    7    7    7    ―    2    4    0    0     50    0
g    12    1    0    1    1    2    ―    5    1    0     23    5
h    17    3    2    1    2    4    5    ―    0    0     34    0
i    22    2    0    0    1    0    1    0    ―    7     33    7
j     9    3    0    4    0    0    0    0    7    ―     23    6
…     …    …    …    …    …    …    …    …    …    …      …    …
u     6    5    5    3    3   18    2    2    1    0     45    0
v    13   40    4   35    3    6    1    0    0    2    104    2
w    11    2    2    1    1    0    1    4    0    0     22    4
x    17    3    2    1    2    4    5    0    0    0     34    0

a: machine, b: computer, c: question, d: digital, e: answer, f: game, g: argument, h: make, i: state, j: number
u: imitation, v: digital computer, w: kind, x: make
(k: an additional column highlighted on the slide; its legend entry is not given)

IRM, proposed measure

If we change the columns to pick up…

Frequent words                  Frequent terms + “logic”        Frequent terms + “God”
196.9  imitation game           196.6  imitation game           196.2  imitation game
 88.9  play                      88.5  play                     113.8  animal
 62.4  digital computer          84.4  logic system              88.2  play
 60.1  card                      62.2  digital computer          62.0  digital computer
 57.1  future                    60.0  card                      59.9  card
 50.4  logic                     57.0  future                    56.9  future
 45.1  identification            44.9  identification            49.8  identification
 44.4  universality              44.2  proposition               44.7  woman
 42.7  state                     43.9  limitation                40.8  book

Words relevant to the selected words have high χ² values, because they co-occur often.

IRM, proposed measure

Familiarity instead of frequency

• We focus on words “familiar” to the user, instead of words “frequent” in the document.

• [Definition] Familiar words are the words that a user has frequently seen in the past.

IRM, proposed measure

Interest Relevance Measure (IRM)

$$\mathrm{IRM}(w, k) = \sum_{h \in H_k} \frac{\bigl(\mathrm{freq}(w, h) - p_h f(w)\bigr)^2}{p_h f(w)}$$

where H_k is the set of familiar words for user k

(compare with the χ²-measure, which sums over the frequent terms G:)

$$\chi^2(w) = \sum_{g \in G} \frac{\bigl(\mathrm{freq}(w, g) - p_g f(w)\bigr)^2}{p_g f(w)}$$

IRM, proposed measure
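In code, IRM keeps the same form as the χ²-measure but sums over the user's familiar words H_k instead of the frequent terms G. A minimal sketch, reusing the illustrative freq and p structures from the χ² sketch above:

def irm(w, user_k, freq, familiar_words, p):
    """IRM(w, k): co-occurrence bias of term w with user k's familiar words H_k."""
    H_k = familiar_words[user_k]                 # set of words user k has seen often
    f_w = sum(freq.get((w, h), 0) for h in H_k)  # total co-occurrences of w with H_k
    if f_w == 0:
        return 0.0
    return sum((freq.get((w, h), 0) - p[h] * f_w) ** 2 / (p[h] * f_w) for h in H_k)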

IRM

• If the value of IRM is large, word w is relevant to the user’s familiar words.
  – The word is relevant to the user’s interests, so it is a keyword for the user.

• Conversely, if the value of IRM is small, word w is not specifically relevant to any of the familiar words.

IRM, proposed measure

Outline of the talk

• Introduction
  – Context-based word weighting
• Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
• System architecture
• Evaluation
• Conclusion

Browsing support system

• It is difficult to evaluate IRM objectively because the weight of words depends on a user’s familiar words, and therefore varies among users.

• Therefore, we evaluate IRM by constructing a Web browsing support system.
  – Web pages accessed by a user are monitored by a proxy server.
  – The count of each word is stored in a database.

System architecture of the browsing support system

[Diagram: Browser ↔ Proxy Server ↔ Internet, with a Keyword Extraction Module and a Frequency Server behind the proxy]

Proxy Server
- Pass through non-text data
- Send the body part of the HTML to the keyword extraction module
- Receive the result and send it to the browser

Keyword Extraction Module
- Morphological analysis
- Count word frequency
- Query past word frequency (query frequencies in the browsing history)
- Compute IRM of words
- Select keywords
- Modify the HTML

Frequency Server
- Keep word counts for each user
- Increment word counts
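A simplified sketch of this flow on the proxy side in Python. The FrequencyServer class, the tag stripping, the regex tokenization, and the familiarity threshold are illustrative assumptions rather than the actual implementation (the real system uses morphological analysis and a separate frequency server), and irm_score stands for a callable that computes IRM as in the sketch above.

import re
from collections import Counter, defaultdict

FAMILIARITY_THRESHOLD = 50   # illustrative cutoff; the real system's value is not given

class FrequencyServer:
    """Toy stand-in for the frequency server: keeps per-user word counts."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def increment(self, user_id, words):
        self.counts[user_id].update(words)            # increment the word counts

    def familiar_words(self, user_id):
        return {w for w, c in self.counts[user_id].items()
                if c >= FAMILIARITY_THRESHOLD}        # query frequencies in the history

def process_page(html, user_id, server, irm_score, top_n=10):
    """Keyword-extraction module: tokenize, update history, score words, mark keywords."""
    text = re.sub(r"<[^>]+>", " ", html)              # crude stand-in for extracting the HTML body
    words = re.findall(r"[a-z]+", text.lower())       # crude stand-in for morphological analysis
    server.increment(user_id, words)
    familiar = server.familiar_words(user_id)
    scores = {w: irm_score(w, words, familiar) for w in set(words)}
    keywords = sorted(scores, key=scores.get, reverse=True)[:top_n]
    for w in keywords:                                # modify the HTML: color high-IRM words red
        html = re.sub(rf"\b{re.escape(w)}\b",
                      f'<font color="red">{w}</font>', html, flags=re.IGNORECASE)
    return html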

Sample screenshot

Outline of the talk

• Introduction
  – Context-based word weighting
• Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
• System architecture
• Evaluation
• Conclusion

Evaluation

• For evaluation, ten people tried this system for more than one hour.

• Three methods are implemented for comparison.
  – (I) word frequency
  – (II) tf·idf
  – (III) IRM

Evaluation – Result (1)

• After using each system (blind), we ask the following questions on a 5-point Likert scale from 1 (not at all) to 5 (very much).
  – Q1: Does this system help you browse the Web?
    • (I) 2.8  (II) 3.2  (III) 3.2
  – Q2: Are the red-colored words (= high-IRM words) interesting to you?
    • (I) 3.2  (II) 4.0  (III) 4.1
  – Q3: Are the interesting words colored red?
    • (I) 2.9  (II) 3.3  (III) 3.8
  – Q4: Are the blue-colored words (= familiar words) interesting to you?
    • (I) 2.7  (II) 2.5  (III) 2.0
  – Q5: Are the interesting words colored blue?
    • (I) 2.7  (II) 2.5  (III) 2.4

(I) word frequency  (II) tf·idf  (III) IRM

Evaluation – Result (2)

• After evaluating all three systems, we ask the following two questions.
  – Q6: Which one helps your browsing the most?
    • (I) 1 person  (II) 3  (III) 6
  – Q7: Which one detects your interests the most?
    • (I) 0 people  (II) 2  (III) 8

• Overall, IRM detects the words of the user’s interests the best.

(I) word frequency  (II) tf·idf  (III) IRM

Outline of the talk

• Introduction
  – Context-based word weighting
• Proposed measure
  – Previous work
  – IRM (Interest Relevance Measure)
• System architecture
• Evaluation
• Conclusion

Conclusion

• We developed a context-based word weighting measure (IRM) based on the relevance (i.e., the co-occurrence) to a user’s familiar words.
  – If a user is not familiar with the topic, he/she may think general words related to the topic are important.
  – On the other hand, if a user is familiar with the topic, he/she may think more detailed words are important.

• We implemented IRM in a browsing support system and showed its effectiveness.