intelligent information integrationtakeda/lecture-kss/iii2007e-part2.pdf · zmain task: how to...

19
Hideaki Takeda / National Institute of Informatics NII Intelligent information integration Part II: Information Gathering Hideaki Takeda National Institute of Informatics [email protected] http://www-kasm.nii.ac.jp/~takeda/ Hideaki Takeda / National Institute of Informatics NII Intelligent Information Integration Information Gathering Information Integration Information Distribution

Upload: others

Post on 18-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Intelligent information integrationPart II: Information Gathering

Hideaki Takeda

National Institute of Informatics

[email protected]

http://www-kasm.nii.ac.jp/~takeda/

Hideaki Takeda / National Institute of Informatics NII

Intelligent Information Integration

Information Gathering

Information Integration

Information Distribution

Page 2: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Information Gathering

Model: Relationship between agents and information sources

Main task: how to provide access from the agent to information sources

IS

IS

IS

Hideaki Takeda / National Institute of Informatics NII

Methods for Information Gathering

Information Retrieval

Explicit specification of users needs

Currently available information sources

Information Filtering

Dynamic information sources

Not available at the current moment, but available in future

Browsing

Implicit specification of users needs

Page 3: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Browsing

Characteristics as Information Gathering

Pros:

Users initiative

Applicable even with vague purpose

Cons: No warranty to reach the goal

Human habit: up to down, side trip

Human limitation: Sequential access

Problems

How to support users with keeping users initiative

How to obtain users preference

Hideaki Takeda / National Institute of Informatics NII

Browsing

Problems

How to support users with keeping users initiative

How to obtain users preference

How to obtain users preference

Web Watcher

Letizia

Syskill & Webert

InforSpinder

Page 4: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Web Watcher

Hideaki Takeda / National Institute of Informatics NII

Web Watcher

Use of machine learning

Learn users preference from browsing process

Functions

Recommendation of links which the system infers useful to the user among links in browsing pages

Recommendation of links which the systems infers useful to the user among all links

Page 5: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Web Watcher

Learning target

LinkUtility: Page×Goal×User×Link→[0,1]

UserChoice: Page×Goal×Link→[0,1]

Page: Keyword vector of 200 words extracted from pages

link: Keyword vector of 200 words from links and 100 words from texts surrounding links

Goal: Keyword vector of 30 words

Hideaki Takeda / National Institute of Informatics NII

Web Watcher

Learning Method

TFIDF(term frequency times inverse document frequency) [Salton] and other two methods

Results

2 or 3 times better than random recommendation

Page 6: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

TF/IDF

To measure importance of document d within document set D by word t

W(d,t,D) = TF(d,t)・IDF(t,D)

TF(d,t): frequency of word t in document d

IDF(t,D) = log(|D|/freq(D,t))

freq(D, t): number of document in D in which t appears

Intuitively

More frequent the specific word appears, more important

But, more documents have the specific words, less important

Hideaki Takeda / National Institute of Informatics NII

Use of TF/IDF in WebWatcher

TF/IDF is calculated for the goal page

Then each vector for “userchoice” is calculated with TF/IDF values for words as weight

Welcome from the dean

0.0 Baseball0.7 welcome0.0 graphics0.1 the0.0 hockey…0.2 from0.0 space0.7 dean

TFIDF weights

=1.7

Page 7: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Reinforcement Learning

A model that an agent can learn its behavior through trial-and-error in a dynamic environment

Learning Agent

Environment

Action state Reward

Hideaki Takeda / National Institute of Informatics NII

Basic elements in reinforcement learning

Environment

The environment is (partially) observable by sensors

A set of states described by some parameters

Reinforcement (Reward) function

Reinforcement or reward is feedback from the environment to the agent

The goal of reinforcement learning is to find a function to maximize rewards that it will receive in the future

Examples for reward: delayed rewards, minimal time …

Page 8: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

value function

policy

Decide next action in each state

Mapping from state to action

Optimal policy: mapping from any state to the state where maximal reward is given

state value

Sum of rewards from the initial state to the state

Value function

Mapping from a state to a state value

It depends on policy

Hideaki Takeda / National Institute of Informatics NII

The given environment

Gain

10

-1 -1

-1

-1

-1 -1

-1

-1-1-1

-1

Page 9: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Optimal state value

Gain

10

5

7

6

898

5

45

6

7

-1 -1

-1

-1

-1 -1

-1

-1-1-1

-1

Hideaki Takeda / National Institute of Informatics NII

Q-Learning

R(s): reward at state s

Q(s, a): evaluation function for action a at state s

Delayed rewards

)]','([)'(),(

)(),(

'1

10

asQsRasQ

SRaSQ

nactiona

n

iti

it

∈+

++

=

+=

=∑maxλ

γ

1 2 3 4 Tγ γ γ γ

Page 10: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Q-learning in WebWatcher

Page: state

Hyperlink: action

Reward for page: how much it is interesting for user

It is measured by sum of words weighted by TFIDF

Hideaki Takeda / National Institute of Informatics NII

Letizia

Browser which learns in browsing

Functions

Recommend pages starting from the page which the user is currently browsing

Methods

Collect pages starting from the current page in breadth-first way

Stop by thresholds or if the current page is changed

TFIDF

Users profile

Page 11: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Letizia

Hideaki Takeda / National Institute of Informatics NII

Syskill & Webert

Browser which learn by Supervised Learning

User specify hot, cold, or lukewarm

System recommends either hot, cold, worth to look, or worth not to look

Page 12: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Selection of keywords

Calculation for keyword selection

where

( ) ( ) ( ) ( ) ( ) ( )[ ]E W S I S P W present I S P W absent I Sw present w absent, = − = + == =

( ) ( ) ( )( )I S p S p Sc c= −∑ log2

Hideaki Takeda / National Institute of Informatics NII

Selection of keywords

hot:30 cold:70

pr:20h&p:10 c&p:10

h&a:20 h&c:60ab:80

total:100

I I I I

I

I I I I

I

I I I I

I

E

h c

h p c p

p

h a c a

p

= = = === = = =

=

= = = =

=

= − ⋅ + ⋅ =

0 3 0 7

0 5 0 5

0 25 0 75

052 0 36

088

05 05

1

05 0 31

081

088 0 2 1 08 081 0 03

. .

| . | .

| . | .

. .

.

. .

. .

.

. ( . . . ) .

hot:30 cold:70

pr:20h&p:15 c&p:5

h&a:20 h&c:60ab:80

total:100

I I I I

I

I I I I

I

I I I I

I

E

h c

h p c p

p

h a c a

p

= = = === = = =

=

= = = =

=

= − ⋅ + ⋅ =

0 3 0 7

0 75 0 25

0 25 0 75

0 52 0 36

088

0 31 0 5

081

0 5 0 31

081

0 88 0 2 081 0 8 0 81 0 07

. .

| . | .

| . | .

. .

.

. .

.

. .

.

. ( . . . . ) .

Page 13: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

InfoSpider[Menczer]

Adaptive information agent by Artificial Life

Agent follows links, then obtain energy dependent on page content and sometimes propagates

Identify pages related to the questions

Hideaki Takeda / National Institute of Informatics NII

InfoSpinder

Page 14: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Information Filtering / Recommendation system

Content-based filtering

Estimate users preference by comparing keywords in pages and users profiles

Social filtering / Collaborative filtering

Estimate users preference by collecting and analyzing preferences of many users

Hideaki Takeda / National Institute of Informatics NII

Problem definition

How to estimate missing information from the given matrix?

Each vector represents preference of each person

Some values are missing because she has not experience them

Estimate these values

Solution: Use similarity between users

Article Person A

1 1 4 2 22 5 2 4 43 34 2 5 55 4 1 16 ? 2 5

Person B Person C Person D

Page 15: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Algorithm for Social filtering(correlation coefficient)

11144

12)1(2

8.044111144

)2(12)1()1(212

=++⋅+−⋅−

=

−=++++++

−⋅+⋅−+−⋅+⋅−=

AC

AB

r

r

∑∑

==

=

−−

−−=

===

'

1

2''

'

1

2

'

1''

'

''

)()(

))((

)1',1()',(

n

lklk

n

lkkl

n

lklkkkl

kk

kkkk

xxxx

xxxxr

mkmkkkCov

r KKσσ

Article Person A

1 1 4 2 22 5 2 4 43 34 2 5 55 4 1 16 ? 2 5

Person B Person C Person D

=

=

−−=

−=

'

1''

'

1

2

))(()',(

)(

n

lklkkkl

n

lkklk

xxxxkkCov

xxσStandard deviation

Calculate relation between user k and k’ by correlation coefficient (相関係数)

Covariance

Hideaki Takeda / National Institute of Informatics NII

Algorithm for Cooperative filtering(correlation coefficient)

∑∑

−+=

kkkk

kkkkklk

kkl r

rxxxx

''

'''' )(

'

56.418.0

12)8.0()1(3' 6 =

+⋅+−⋅−

+=Ax

1 1 4 2 22 5 2 4 43 34 2 5 55 4 1 16 ? 2 5

Article Person A Person B Person C Person D

Page 16: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

GroupLensCollaborative filtering system for NetNews

Hideaki Takeda / National Institute of Informatics NII

Collaborative Filtering: Pros and Cons

Pros

Robust for content change

No need for content analysis

Applicable for non-text data

Few users actions

Just only evaluate items

Cons

“Cold start” problem

Massive evaluation data is need before reliable recommendation

No evaluation, no recommendation

Items without evaluation are never recommended

Page 17: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

MovieLens

Hideaki Takeda / National Institute of Informatics NII

Content-based filtering: pros and cons

Pros

Precise recommendation is possible

For users

For providers

Cons

For providers: Difficulty to design profiles

For users: Difficulty for keeping users profiles

Input

Update

Not adaptive for new contents

Page 18: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Lifestyle Finder

Recommendation for buying, visiting, etc

Estimate users preference by categorizing users to some clusters

Fewer feedback are needed for learning preference

Categorization of users with large-scale demographic data

Demographic data tell us a good example of users categorization

Ex., A survey categorizes Americans into 62 clusters by using U.S. census, magazine subscriptions, catalog purchases, …

Create larger clusters by merging the given clusters

Ex., “Prefer TV news” ⊃ “Prefer the specific TV news program”

Hideaki Takeda / National Institute of Informatics NII

Lifestyle Finder

Page 19: Intelligent information integrationtakeda/lecture-kss/III2007e-part2.pdf · zMain task: how to provide access from the agent to information sources IS IS IS Hideaki Takeda / National

Hideaki Takeda / National Institute of Informatics NII

Lifestyle Finder