Slide 1: EE3J2 Data Mining, Lecture 7: Topic Spotting & Query Expansion. Martin...
Post on 20-Dec-2015
EE3J2 Data MiningSlide 2
Objectives
To introduce Topic Spotting
– Salience and Usefulness
– Example: The AT&T “How May I Help You?” system
Text retrieval revisited
– Query expansion
– Synonyms & hyponyms
– WordNet
EE3J2 Data MiningSlide 3
Topic Spotting
Type of dedicated IR system
– Always looking for the same type of documents
– Documents about a particular topic
– Corpus from which data is retrieved is dynamic
Examples
– Detect all weather forecasts in BBC Radio 4 broadcasts
– Find all documents written by Charlotte Bronte
– …
EE3J2 Data MiningSlide 4
Topic Spotting Queries
Examples rather than explicit queries
‘Find more like this’
May have substantial quantity of example data
But why is Topic Spotting different to IR?
Can exploit large amount of query data
Calculate better measures of the usefulness of a term than simple IDF
EE3J2 Data MiningSlide 5
IDF Revisited
Recall the definition of IDF
$$IDF(t) = \log\left(\frac{ND}{ND_t}\right)$$
IDF is concerned only with whether or not a term is contained in a document
But there is also information in the number of times the term occurs in documents
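As a minimal sketch of the IDF definition, assuming ND is the total number of documents, ND_t is the number of documents containing t, and the logarithm is natural (the function name `idf` and the toy corpus are illustrative, not from the lecture):

```python
import math

def idf(term, documents):
    """IDF(t) = log(ND / ND_t); only presence in a document counts."""
    nd = len(documents)
    nd_t = sum(1 for doc in documents if term in doc)
    if nd_t == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(nd / nd_t)

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["rain", "rain", "rain", "wind"],
    ["rain", "sun"],
    ["sun", "cloud"],
    ["wind", "cloud"],
]

# "rain" occurs three times in the first document, but IDF only sees
# that it is present in 2 of the 4 documents.
print(idf("rain", docs))  # log(4 / 2)
print(idf("sun", docs))   # same value: also present in 2 of 4 documents
```

Both terms receive the same IDF even though their occurrence counts differ, which is the information IDF discards.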
EE3J2 Data MiningSlide 6
Salience
Suppose we have a set of ‘training’ documents
– Some ‘on topic’
– Some ‘off topic’
Then, for a term t, we can estimate the probabilities:
– P(t | Topic)
– P(t | Not Topic)
– P(t)
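One way to estimate these three probabilities is by relative term frequency. The sketch below assumes that estimator, and the function name and toy documents are hypothetical:

```python
from collections import Counter

def term_probabilities(term, on_topic, off_topic):
    """Return (P(t|Topic), P(t|Not Topic), P(t)) as relative term frequencies."""
    on_counts = Counter(tok for doc in on_topic for tok in doc)
    off_counts = Counter(tok for doc in off_topic for tok in doc)
    n_on = sum(on_counts.values())    # total terms in on-topic documents
    n_off = sum(off_counts.values())  # total terms in off-topic documents
    p_on = on_counts[term] / n_on
    p_off = off_counts[term] / n_off
    p_all = (on_counts[term] + off_counts[term]) / (n_on + n_off)
    return p_on, p_off, p_all

# Hypothetical training data: 6 on-topic tokens, 4 off-topic tokens.
on_topic = [["rain", "wind", "rain"], ["rain", "sun", "cloud"]]
off_topic = [["goal", "match", "rain", "team"]]

print(term_probabilities("rain", on_topic, off_topic))  # (0.5, 0.25, 0.4)
```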
EE3J2 Data MiningSlide 7
‘Salience’ and ‘Usefulness’
Given a term t and a topic T, define the salience of t (relative to T) by:
$$S(t) = P(T \mid t)\,\log\frac{P(T \mid t)}{P(T)}$$
Similarly, the usefulness of t (relative to T) is given by:
$$U(t) = P(t \mid T)\,\log\frac{P(t \mid T)}{P(t)}$$
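The two definitions, S(t) = P(T|t) log(P(T|t)/P(T)) and U(t) = P(t|T) log(P(t|T)/P(t)), can be written as small functions. This is a sketch assuming natural logarithms, and the probability values are made up for illustration:

```python
import math

def salience(p_T_given_t, p_T):
    """S(t) = P(T|t) * log(P(T|t) / P(T))."""
    return p_T_given_t * math.log(p_T_given_t / p_T)

def usefulness(p_t_given_T, p_t):
    """U(t) = P(t|T) * log(P(t|T) / P(t))."""
    return p_t_given_T * math.log(p_t_given_T / p_t)

# Hypothetical probabilities for one term and one topic.
p_T = 0.1           # P(T)
p_t = 0.02          # P(t)
p_t_given_T = 0.08  # P(t|T)

# Bayes' rule gives P(T|t) = P(t|T) P(T) / P(t), here about 0.4.
p_T_given_t = p_t_given_T * p_T / p_t

print(salience(p_T_given_t, p_T))
print(usefulness(p_t_given_T, p_t))
```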
EE3J2 Data MiningSlide 9
Salience and Usefulness
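$$
\begin{aligned}
S(t) &= P(T \mid t)\,\log\frac{P(T \mid t)}{P(T)} \\
     &= \frac{p(t \mid T)\,P(T)}{p(t)}\,\log\frac{p(t \mid T)\,P(T)}{p(t)\,P(T)} \\
     &= \frac{P(T)}{p(t)}\,p(t \mid T)\,\log\frac{p(t \mid T)}{p(t)} \\
     &= \frac{P(T)}{p(t)}\,U(t)
\end{aligned}
$$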
EE3J2 Data MiningSlide 10
Salience and Usefulness
$$S(t) = \frac{P(T)}{p(t)}\,U(t)$$
Now, T is the topic, so P(T) is fixed. Therefore
$$S(t) \propto \frac{1}{p(t)}\,U(t)$$
EE3J2 Data MiningSlide 11
Salience and Usefulness
$$S(t) \propto \frac{1}{p(t)}\,U(t)$$
So the main difference between Salience and Usefulness is that, to have high usefulness, a term must occur frequently
EE3J2 Data MiningSlide 12
Example
A term w occurs:
– t1 times in documents about topic T
– t2 times in documents which are not about topic T
Total number of terms:
– in documents about topic T is N1
– in documents not about topic T is N2
Then: P(w|T) = t1/N1, P(w) = (t1+t2)/(N1+N2)
– So
$$U(w) = \frac{t_1}{N_1}\,\log\left(\frac{t_1/N_1}{(t_1+t_2)/(N_1+N_2)}\right)$$
EE3J2 Data MiningSlide 13
Example
A term w occurs:
– 150 times in documents about topic T
– 230 times in documents which are not about topic T
Total number of terms:
– in documents about topic T is 12,500
– in documents not about topic T is 23,100
So (taking logs to base 10):
– p(w|T) = 0.012, p(w) = 0.0107, log(p(w|T)/p(w)) = 0.051
– U(w) = p(w|T) × log(p(w|T)/p(w)) = 0.012 × 0.051 ≈ 0.00061
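The arithmetic can be checked with a short script. It assumes base-10 logarithms, which reproduce the log ratio of 0.051, and applies the definition U(w) = p(w|T) log(p(w|T)/p(w)). The variable names are illustrative:

```python
import math

t1, t2 = 150, 230        # occurrences of w on topic / off topic
n1, n2 = 12_500, 23_100  # total terms on topic / off topic

p_w_given_T = t1 / n1            # P(w|T)
p_w = (t1 + t2) / (n1 + n2)      # P(w)
log_ratio = math.log10(p_w_given_T / p_w)
u_w = p_w_given_T * log_ratio    # U(w) by the definition above

print(round(p_w_given_T, 4), round(p_w, 4), round(log_ratio, 3), round(u_w, 5))
```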
EE3J2 Data MiningSlide 14
Example (continued)
P(T) = N_T / N, where N_T and N are the total number of documents about topic T and the total number of documents, respectively
So, if, say, 1 document in 100 is about the topic, then P(T) = 0.01
Then S(w) = (P(T)/p(w))*U(w)
= (0.01/0.0107)*U(w)
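A sketch of this last step, using the counts from the example and again assuming base-10 logarithms. The cross-check computes S(w) directly from its definition, with P(T|w) obtained from Bayes' rule:

```python
import math

p_T = 0.01                              # 1 document in 100 is about the topic
p_w_given_T = 150 / 12_500              # P(w|T)
p_w = (150 + 230) / (12_500 + 23_100)   # P(w)

u_w = p_w_given_T * math.log10(p_w_given_T / p_w)
s_w = (p_T / p_w) * u_w                 # S(w) = (P(T) / p(w)) * U(w)

# Cross-check: S(w) = P(T|w) log(P(T|w) / P(T)), with P(T|w) from Bayes' rule.
p_T_given_w = p_w_given_T * p_T / p_w
s_direct = p_T_given_w * math.log10(p_T_given_w / p_T)

print(s_w, s_direct)
```

The two values agree up to floating-point rounding, which is the identity S(t) = (P(T)/p(t)) U(t) evaluated on these numbers.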
EE3J2 Data MiningSlide 16
Example
The AT&T “How May I Help You?” system
Task: to understand what AT&T customers say to operators
Look HMIHY? up on the web
EE3J2 Data MiningSlide 17
AT&T How May I Help You?
[Figure: block diagram with the components Speech Recognition, Language Processing, Salient word list, and Service 1, Service 2, Service 3, … Service 15]
EE3J2 Data MiningSlide 18
AT&T How May I Help You?
HMIHY? treats telephone network services as topics or documents, to be detected or retrieved
Example salient words:

Word         Salience    Word          Salience
Difference   4.04        Dialled       1.29
Cost         3.39        Area          1.28
Rate         3.37        Time          1.23
Much         3.24        Person        1.23
Emergency    2.23        Charge        1.22
Misdialed    1.43        Home          1.13
Wrong        1.37        Information   1.11
code         1.36        credit        1.11

Allen Gorin, “Processing of semantic information in fluent spoken language”, Proc. ICSLP 1996
EE3J2 Data MiningSlide 19
HMIHY Demonstrations.
See http://www.research.att.com/~algor/hmihy/samples.html
EE3J2 Data MiningSlide 20
Query Processing
Remember how we previously processed a query:
Example:
– “I need information on distance running”
Stop word removal
– information, distance, running
Stemming
– information, distance, run
But what about:
– “The London marathon will take place…”
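The two processing steps can be sketched as follows. The stop list and the suffix stripper are hypothetical toys chosen to reproduce the example (a real system would use a fuller stop list and a proper stemmer such as Porter's):

```python
import re

# Hypothetical stop list, just large enough for the example.
STOP_WORDS = {"i", "need", "on", "the", "a", "an", "of", "to"}

def toy_stem(word):
    """Very crude suffix stripper, for illustration only."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo consonant doubling: "runn" -> "run".
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    return word

def process_query(query):
    tokens = re.findall(r"[a-z]+", query.lower())          # tokenise
    content = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [toy_stem(t) for t in content]                  # stemming

print(process_query("I need information on distance running"))
# ['information', 'distance', 'run']
print(process_query("The London marathon will take place"))
# ['london', 'marathon', 'will', 'take', 'place']
```

The two processed term lists share no terms, so a purely term-matching retrieval system would not connect the marathon document to the query.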
EE3J2 Data MiningSlide 21
Synonyms, hyponyms, hypernyms & antonyms
We know there is a relationship between
– run, distance, and
– marathon
We know that a ‘marathon’ is a ‘long distance run’
Words with the same meaning are synonyms
If a query q contains a word w1 and w2 is a synonym of w1, then w2 should be added to q
This is an example of query expansion
EE3J2 Data MiningSlide 22
Thesaurus
A thesaurus is a ‘dictionary’ of synonyms and semantically related words and phrases
E.g. Roget’s Thesaurus
Example:
physician
syn: croaker, doc, doctor, MD, medical, mediciner, medico
rel: medic, general practitioner, surgeon
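A thesaurus entry like the one above can drive synonym-based query expansion. The sketch below stores the "physician" entry in a dictionary. The data layout and the function name are assumptions made for illustration:

```python
# The "physician" entry, stored as a plain dictionary (hypothetical layout).
THESAURUS = {
    "physician": {
        "syn": ["croaker", "doc", "doctor", "MD", "medical", "mediciner", "medico"],
        "rel": ["medic", "general practitioner", "surgeon"],
    },
}

def expand_with_synonyms(query_terms, thesaurus=THESAURUS):
    """Append every synonym of every query term, skipping duplicates."""
    expanded = list(query_terms)
    for term in query_terms:
        for syn in thesaurus.get(term, {}).get("syn", []):
            if syn not in expanded:
                expanded.append(syn)
    return expanded

print(expand_with_synonyms(["find", "physician"]))
# ['find', 'physician', 'croaker', 'doc', 'doctor', 'MD', 'medical', 'mediciner', 'medico']
```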
EE3J2 Data MiningSlide 23
Hyponyms
Not only synonyms are useful for query expansion
Query q = “Tell me about England”
Document d = “A visit to London should be on everyone’s itinerary”
‘London’ is a hyponym of ‘England’
Hyponym ~ subordinate ~ subset
If a query q contains a word w1 and w2 is a hyponym of w1, then w2 should be added to q
EE3J2 Data MiningSlide 24
Hypernyms
Hypernyms are also useful for query expansion
Query q = “Tell me about England”
Document d = “Places to visit in the British Isles”
‘British Isles’ is a hypernym of ‘England’
Hypernym ~ generalisation ~ superset
If a query q contains a word w1 and w2 is a hypernym of w1, then w2 should be added to q
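The hyponym and hypernym rules can be sketched together over a tiny hand-built hierarchy. The mapping below is a hypothetical fragment built from the England examples, not real WordNet data:

```python
# Hypothetical fragment of a hierarchy: word -> its hypernym (parent).
HYPERNYM_OF = {
    "London": "England",
    "England": "British Isles",
}

def hypernyms(word):
    """Immediate hypernym (more general term), if any."""
    parent = HYPERNYM_OF.get(word)
    return [parent] if parent is not None else []

def hyponyms(word):
    """Immediate hyponyms (more specific terms)."""
    return [child for child, parent in HYPERNYM_OF.items() if parent == word]

def expand(query_terms):
    """Add hyponyms and hypernyms of each query term, skipping duplicates."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in hyponyms(term) + hypernyms(term):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand(["England"]))  # ['England', 'London', 'British Isles']
```

With the expanded query, both the London document and the British Isles document share a term with the query.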
EE3J2 Data MiningSlide 25
Antonyms
An antonym is a word which is opposite in meaning to another (e.g. bad and good)
The occurrence of an antonym can also be relevant
EE3J2 Data MiningSlide 26
WordNet
Online lexical database for the English Language
http://www.cogsci.princeton.edu/~wn

Category     Forms    Meanings (syn sets)
Nouns        57,000   48,800
Adjectives   19,500   10,000
Verbs        21,000   8,400
See Belew, chapter 6
EE3J2 Data MiningSlide 27
WordNet
WordNet is organised as a set of hierarchical trees
For nouns, there are 25 trees
Children of a node correspond to hyponyms
Words become more specific as you move deeper into the tree
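As an illustration of this tree structure, the sketch below walks a small hand-made tree in which each node's children are its hyponyms and reports the depth of every word. The tree contents are hypothetical and are not taken from WordNet itself:

```python
# Hypothetical fragment of one noun tree: node -> list of children (hyponyms).
TREE = {
    "animal": ["dog", "cat"],
    "dog": ["terrier", "spaniel"],
    "cat": [],
    "terrier": [],
    "spaniel": [],
}

def depths(root, tree):
    """Map every node reachable from root to its depth (root has depth 0)."""
    result = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            result[child] = result[node] + 1
            stack.append(child)
    return result

print(depths("animal", TREE))
```

Deeper nodes such as "terrier" (depth 2) are more specific than their ancestors "dog" (depth 1) and "animal" (depth 0).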
EE3J2 Data MiningSlide 28
Noun Categories
act, action, activity      natural object
animal, fauna              natural phenomenon
artefact                   person, human being
attribute, property        plant, flora
body, corpus               possession
cognition, knowledge       process
communication              quantity, amount
event, happening           relation
feeling, emotion           shape
food                       state, condition
group, collection          substance
location, place            time
motive
EE3J2 Data MiningSlide 29
Query-document scoring
A query q is expanded to include hyponyms and synonyms
Recall that for a document d
$$w_{td} = f_{td} \cdot IDF(t)$$
$$Sim(q,d) = \frac{\sum_{t \in q \cap d} w_{td}\, w_{tq}}{\|q\|\,\|d\|}$$
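A minimal sketch of this scoring, assuming the weight of a term is its frequency times its IDF and that the similarity is normalised by the Euclidean lengths of the query and document weight vectors. The IDF values and function names are hypothetical:

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Weight w_t = f_t * IDF(t) for every term with a known IDF."""
    counts = Counter(tokens)
    return {t: f * idf[t] for t, f in counts.items() if t in idf}

def sim(query_tokens, doc_tokens, idf):
    """Sum of weight products over shared terms, divided by the vector lengths."""
    q = tfidf_vector(query_tokens, idf)
    d = tfidf_vector(doc_tokens, idf)
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical IDF values for a handful of terms.
idf = {"distance": 1.0, "run": 2.0, "marathon": 3.0, "london": 1.5}

print(sim(["distance", "run"], ["distance", "run", "run"], idf))
print(sim(["distance", "run"], ["london", "marathon"], idf))  # 0.0, no shared terms
```

The second score is zero because the sum runs only over terms that occur in both the query and the document, which is the gap that query expansion is meant to close.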
EE3J2 Data MiningSlide 30
Query expansion
Suppose:
– t is the original term in the query,
– t’ is a synonym or hyponym of t which occurs in d
Then we could define:
$$w_{t'd} = \lambda_{tt'} \cdot f_{t'd} \cdot IDF(t), \qquad 0 \le \lambda_{tt'} \le 1$$
Where λ_{tt'} is a weighting depending on how ‘far’ t and t’ are apart according to WordNet
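A sketch of this weighting, writing the weighting factor as λ (a symbol assumed here) with a value between 0 and 1. The lecture does not specify how λ depends on WordNet distance, so the decay 1/(1 + steps) below is purely a hypothetical choice:

```python
def lam_from_distance(steps):
    """Hypothetical weighting: 1 for the term itself, decaying with distance."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return 1.0 / (1.0 + steps)

def expansion_weight(f_td, idf_t, lam):
    """Weight of expansion term t' in d: lambda * f_{t'd} * IDF(t)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    return lam * f_td * idf_t

# t' occurs 3 times in d, the original term t has IDF 2.0 (made-up numbers).
print(expansion_weight(3, 2.0, lam_from_distance(0)))  # 6.0, zero steps apart
print(expansion_weight(3, 2.0, lam_from_distance(1)))  # 3.0, one step apart
```

Any monotonically decreasing function of distance with values in [0, 1] would fit the same constraint.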