project management:


Upload: carlotta-feddis

Post on 31-Dec-2015


Project Management:

• The project is due on Friday in week 13.
• You have to demo the system to me (a lab will be booked).
• There will be a test in week 10.
• Coursework consists of: (1) assignment 1, (2) the test in week 10, and (3) the project. The project carries a heavy weight.
• The purpose of the test: (1) it is a small part of the coursework, and (2) it is training for the final examination.

Extended Boolean Model:

• Disadvantages of the “Boolean Model”:
• No term weight is used.
• Counterexample: query q = kx AND ky. A document containing just one term, e.g., kx, is considered as irrelevant as a document containing none of these terms.
• The size of the output might be too large or too small.

Extended Boolean Model:

• The Extended Boolean model was introduced in 1983 by Salton, Fox, and Wu [703].
• The idea is to make use of term weights, as in the vector space model.
• Strategy: combine the Boolean query with the vector space model.
• Why not just use the Vector Space Model?
• Advantage: it is easy for the user to provide a query.

Extended Boolean Model:

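• Each document is represented by a vector (similar to the vector space model).
• Remember the formula for the term weights:

$$w_{x,j} = f_{x,j} \times \frac{idf_x}{\max_i idf_i}$$

where $f_{x,j}$ is the normalized frequency of term $k_x$ in document $d_j$ and $idf_x$ is the inverse document frequency of $k_x$.

• The query is expressed as a Boolean formula.
• How to rank the documents?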

Fig. Extended Boolean logic considering the space composed of two terms kx and ky only. [Figure: two panels, "kx and ky" and "kx or ky". Each shows documents dj and dj+1 as points in the unit square with corners (0, 0), (1, 0), (0, 1), and (1, 1); the horizontal axis is kx and the vertical axis is ky.]

Extended Boolean Model:

• For the query q = kx OR ky, (0, 0) is the point we try to avoid. Thus, we can use

$$sim(q_{or}, d) = \sqrt{\frac{x^2 + y^2}{2}}$$

to rank the documents, where x and y are the weights of kx and ky in d.
• The bigger the better.

Extended Boolean Model:

• For the query q = kx AND ky, (1, 1) is the most desirable point.
• We use

$$sim(q_{and}, d) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$$

to rank the documents.
• The bigger the better.
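The two-term formulas can be tried out numerically. The following is a minimal Python sketch, not part of the original slides; the function names `sim_or` and `sim_and` and the sample weights are assumptions made for illustration. It shows that a document containing only kx gets a partial score for the AND query instead of being treated like a document with neither term.

```python
import math

def sim_or(x: float, y: float) -> float:
    """Normalized distance from (0, 0), the point to avoid for kx OR ky."""
    return math.sqrt((x ** 2 + y ** 2) / 2)

def sim_and(x: float, y: float) -> float:
    """One minus the normalized distance from (1, 1), the ideal point for kx AND ky."""
    return 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

# Hypothetical documents: only kx present, neither term, both terms.
for x, y in [(1.0, 0.0), (0.0, 0.0), (1.0, 1.0)]:
    print((x, y), round(sim_or(x, y), 4), round(sim_and(x, y), 4))
```

Under the strict Boolean model the first two documents would both be rejected by the AND query; here the first one scores strictly higher than the second.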

Extend the idea to m terms

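• $q_{or} = k_1 \vee^p k_2 \vee^p \dots \vee^p k_m$
• $q_{and} = k_1 \wedge^p k_2 \wedge^p \dots \wedge^p k_m$

$$sim(q_{or}, d_j) = \left( \frac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p}$$

$$sim(q_{and}, d_j) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \dots + (1-x_m)^p}{m} \right)^{1/p}$$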
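As an illustration of the m-term p-norm formulas, here is a short Python sketch. The names `sim_or_p` and `sim_and_p` and the example weights are hypothetical; the code assumes the weights are already normalized to [0, 1] and that p is a finite number with p >= 1.

```python
def sim_or_p(weights, p):
    """p-norm OR: generalized mean distance from the all-zeros point."""
    m = len(weights)
    return (sum(x ** p for x in weights) / m) ** (1 / p)

def sim_and_p(weights, p):
    """p-norm AND: one minus the generalized mean distance from the all-ones point."""
    m = len(weights)
    return 1 - (sum((1 - x) ** p for x in weights) / m) ** (1 / p)

weights = [0.5, 0.9, 0.2]  # hypothetical weights x1, x2, x3 of one document
print(sim_or_p(weights, 2), sim_and_p(weights, 2))
```

With m = 2 and p = 2 these reduce to the Euclidean two-term formulas for kx OR ky and kx AND ky.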

Properties:

• The p-norm as defined above enjoys a couple of interesting properties. First, when p = 1 it can be verified that

$$sim(q_{or}, d_j) = sim(q_{and}, d_j) = \frac{x_1 + \dots + x_m}{m}$$

• Second, when p = ∞ it can be verified that
• $sim(q_{or}, d_j) = \max(x_i)$
• $sim(q_{and}, d_j) = \min(x_i)$
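Both properties can be checked numerically. The sketch below is only an illustration: the helper names and the weights are made up, and a large finite p (200) stands in for p = ∞, so the second check is approximate rather than exact.

```python
def p_norm_or(xs, p):
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def p_norm_and(xs, p):
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

xs = [0.8, 0.1, 0.3]  # hypothetical term weights
mean = sum(xs) / len(xs)

# p = 1: OR and AND both collapse to the arithmetic mean of the weights.
print(p_norm_or(xs, 1), p_norm_and(xs, 1), mean)

# Large p: OR approaches max(xs) and AND approaches min(xs).
print(p_norm_or(xs, 200), max(xs))
print(p_norm_and(xs, 200), min(xs))
```

So p = 1 behaves like a vector-space-style average that ignores the Boolean operators, while large p moves toward the max/min behaviour of fuzzy logic.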

Example:

• For instance, consider the query $q = (k_1 \wedge^p k_2) \vee^p k_3$. The similarity sim(q, dj) between a document dj and this query is then computed as

$$sim(q, d_j) = \left( \frac{\left( 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p}{2} \right)^{1/p} \right)^p + x_3^p}{2} \right)^{1/p}$$

• Any Boolean query can be expressed as a numerical formula.
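Nested queries like this one can be evaluated recursively, applying the p-norm formulas from the inside out. The Python sketch below is one possible way to do it; the tuple-based query encoding and the name `ext_bool_sim` are assumptions made for illustration. The treatment of "not" as one minus the operand's score is also an assumption, since the slides do not spell out negation.

```python
def ext_bool_sim(query, x, p=2.0):
    """Score a document with weight vector x against a nested query.

    query is either an int (0-based term index) or a tuple
    ("and" | "or" | "not", subquery, ...).
    """
    if isinstance(query, int):
        return x[query]
    op, *args = query
    scores = [ext_bool_sim(a, x, p) for a in args]
    if op == "not":
        return 1 - scores[0]  # assumed rule for negation
    m = len(scores)
    if op == "or":
        return (sum(s ** p for s in scores) / m) ** (1 / p)
    if op == "and":
        return 1 - (sum((1 - s) ** p for s in scores) / m) ** (1 / p)
    raise ValueError(f"unknown operator: {op}")

# q = (k1 and k2) or k3, with hypothetical weights for one document.
q = ("or", ("and", 0, 1), 2)
x = [0.6, 0.4, 0.7]
print(ext_bool_sim(q, x, p=2.0))
```

The result agrees with plugging the same weights into the closed-form expression for this query.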

Exercise:

1. Give the numerical formula for the extended Boolean model of the query q = (k1 or k2 or k3) and (not k4 or k5). (Assume that there are 5 terms in total.)

2. Assume that the document is represented by the vector (0.8, 0.1, 0.0, 0.0, 1.0). What is sim(q, d) for the extended Boolean model?

Also try some more exercises with other Boolean formulas.

Fuzzy Set Theory

• Definition: A fuzzy subset A of a universe of discourse U is characterized by a membership function

$$\mu_A : U \rightarrow [0, 1]$$

which associates with each element u of U a number $\mu_A(u)$ in the interval [0, 1].

• Classical set theory: A = {a, b, c}. A subset of A: {a, c}.

• In classical set theory, an element is either in a set or not in a set, i.e., $\mu_A(u)$ is either 0 or 1.

Set Theory

• Let U be the set of all elements (the universe).

• There are three basic operations:

• A ∪ B = {elements in A or in B}.

• A ∩ B = {elements in both A and B}.

• Not A = U - A.

• Definition: Let U be the universe of discourse, A and B be two fuzzy subsets of U, and $\bar{A}$ be the complement of A relative to U. Also, let u be an element of U. Then,

$$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$$

$$\mu_{A \cup B}(u) = \max\{\mu_A(u), \mu_B(u)\}$$

$$\mu_{A \cap B}(u) = \min\{\mu_A(u), \mu_B(u)\}$$
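The three fuzzy set operations translate directly into code. This is a small Python sketch under the assumption that a fuzzy subset is stored as a dictionary from elements of U to membership degrees, and that both sets list the same elements; the function names and the sample values are hypothetical.

```python
def fuzzy_complement(mu_a):
    """Membership in the complement: 1 - mu_A(u)."""
    return {u: 1 - m for u, m in mu_a.items()}

def fuzzy_union(mu_a, mu_b):
    """Membership in A union B: max of the two memberships."""
    return {u: max(mu_a[u], mu_b[u]) for u in mu_a}

def fuzzy_intersection(mu_a, mu_b):
    """Membership in A intersect B: min of the two memberships."""
    return {u: min(mu_a[u], mu_b[u]) for u in mu_a}

# Hypothetical fuzzy subsets of U = {"u1", "u2"}.
A = {"u1": 0.2, "u2": 0.7}
B = {"u1": 0.5, "u2": 0.4}
print(fuzzy_union(A, B))
print(fuzzy_intersection(A, B))
print(fuzzy_complement(A))
```

When every membership degree is 0 or 1, these definitions reduce to the classical union, intersection, and complement.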

Fuzzy Information Retrieval

We first set up the term-term correlation matrix. For terms $k_i$ and $k_l$,

$$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$$

where $n_i$ is the number of documents containing $k_i$, $n_l$ is the number of documents containing $k_l$, and $n_{i,l}$ is the number of documents containing both $k_i$ and $k_l$. Note that $c_{i,i} = 1$.
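The correlation matrix can be computed by simple counting. The following Python sketch assumes documents are given as binary term-incidence vectors; the function name, the sample collection, and the choice to return 0 when a term occurs in no document (where the formula would divide by zero) are all assumptions for illustration.

```python
def correlation_matrix(docs):
    """docs: list of binary vectors, one per document, one entry per term."""
    t = len(docs[0])
    n = [sum(d[i] for d in docs) for i in range(t)]  # n_i for each term
    c = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for l in range(t):
            n_il = sum(1 for d in docs if d[i] and d[l])  # docs with both terms
            denom = n[i] + n[l] - n_il
            c[i][l] = n_il / denom if denom else 0.0  # assumed fallback
    return c

# Hypothetical collection: 4 documents over 3 terms.
docs = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
for row in correlation_matrix(docs):
    print(row)
```

The matrix is symmetric, and every term that occurs in at least one document has a 1 on the diagonal.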

Fuzzy Information Retrieval

We define a fuzzy set for each term $k_i$. In the fuzzy set for $k_i$, a document $d_j$ has a degree of membership $\mu_{i,j}$ computed as

$$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$

Example: $c_{1,2} = 0.1$, $c_{1,3} = 0.21$.

d1 = (0, 1, 1, 0): $\mu_{1,1} = 1 - 0.9 \times 0.79 = 0.289$.

d2 = (1, 0, 0, 0): $\mu_{1,2} = 1 - 0 = 1$ (since $c_{1,1} = 1$).

What about d3 = (1, 0, 1, 0)?
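The membership formula is a product over the terms that occur in the document. Here is a minimal Python sketch that reproduces the two worked values for d1 and d2; the function name `membership` is made up, and the correlation of term 1 with term 4 is not given on the slide, so it is set to 0.0 here purely as a placeholder.

```python
from math import prod

def membership(c_row, doc):
    """Degree of membership of doc in the fuzzy set of term k_i.

    c_row: correlations c_{i,l} of k_i with every term k_l.
    doc:   binary term-incidence vector of the document.
    """
    return 1 - prod(1 - c_row[l] for l, present in enumerate(doc) if present)

# Row for term k1: c_{1,1} = 1, c_{1,2} = 0.1, c_{1,3} = 0.21, c_{1,4} assumed 0.0.
c_row_k1 = [1.0, 0.1, 0.21, 0.0]
print(membership(c_row_k1, (0, 1, 1, 0)))  # d1
print(membership(c_row_k1, (1, 0, 0, 0)))  # d2
```

A document that contains k1 itself always gets membership 1, because the factor for $c_{1,1} = 1$ makes the whole product zero.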

Fuzzy Information Retrieval

Whenever the document $d_j$ contains a term that is strongly related to $k_i$, the document $d_j$ belongs to the fuzzy set of term $k_i$, i.e., $\mu_{i,j}$ is very close to 1.

Example: $c_{1,2} = 0.9$, d1 = (0, 1, 0, 0).

$\mu_{1,1} = 1 - (1 - 0.9) = 0.9$

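Query:
• The query is a Boolean formula, e.g.,

• q = ka and (kb or not kc), i.e., $q = k_a \wedge (k_b \vee \neg k_c)$.

• In disjunctive normal form, q = (1, 1, 1) or (1, 1, 0) or (1, 0, 0).

• Suppose q is

$$q_{dnf} = cc_1 \vee cc_2 \vee \dots \vee cc_p$$

Figure 1. Fuzzy document sets for the query $[q = k_a \wedge (k_b \vee \neg k_c)]$. Each $cc_i$, $i \in \{1, 2, 3\}$, is a conjunctive component. $D_q$ is the query fuzzy set. [Figure: three overlapping document sets $D_a$, $D_b$, and $D_c$, with the regions $cc_1$, $cc_2$, and $cc_3$ marked and $D_q = cc_1 \vee cc_2 \vee cc_3$.]

$$\begin{aligned}
\mu_{q,j} &= \mu_{cc_1 \vee cc_2 \vee cc_3,\, j} \\
&= 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i,j}) \\
&= 1 - (1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}) \times (1 - \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})) \times (1 - \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j}))
\end{aligned}$$

where $\mu_{i,j}$, $i \in \{a, b, c\}$, is the membership of $d_j$ in the fuzzy set associated with $k_i$, and $\mu_{q,j}$ is the membership of document j for query q.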
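The query membership can be computed from the disjunctive normal form one conjunctive component at a time. The Python sketch below is an illustration under stated assumptions: each conjunctive component is encoded as a tuple of 0/1 flags over the query terms (a, b, c), the function names are hypothetical, and the membership values are invented.

```python
from math import prod

def cc_membership(cc, mu):
    """Membership of a document in one conjunctive component.

    cc: tuple of 0/1 flags, 1 = term required, 0 = term negated.
    mu: memberships of the document in the fuzzy sets of the query terms.
    """
    return prod(m if bit else 1 - m for bit, m in zip(cc, mu))

def query_membership(dnf, mu):
    """Combine the components: one minus the product of the complements."""
    return 1 - prod(1 - cc_membership(cc, mu) for cc in dnf)

# q = ka and (kb or not kc) in disjunctive normal form.
dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
mu = (0.9, 0.5, 0.2)  # hypothetical memberships for terms a, b, c
print(query_membership(dnf, mu))
```

Products are used inside a component and complemented products across components, which matches the expanded form of the equation for this three-component query.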

Some changes in the last slide:

1. Instead of max{ }, we use + (the algebraic sum, computed as one minus the product of the complements).

2. Instead of min{ }, we use × (the algebraic product).

Exercise: Suppose there are 3 documents and 4 terms.

d1 = (1, 0, 1, 0), d2 = (1, 1, 0, 0), and d3 = (0, 1, 1, 0).

(1) Compute the term-term correlation matrix $c_{i,l}$.

(2) Compute $\mu_{i,j}$ (the membership of document j in the fuzzy set of term i).

(3) If the query is q = (1, 0, 0, 0) or (1, 1, 0, 0), compute $\mu_{q,k}$ for each document dk.