a comparison of document, sentence, and term event spaces
Post on 17-Jan-2016
A Comparison of Document, Sentence, and Term Event Spaces
Catherine Blake
School of Information and Library Science
University of North Carolina at Chapel Hill
North Carolina, NC 27599-3360
cablake@email.unc.edu
Classic Information Retrieval
[Diagram: a Document and a Query (derived from an Information Need) are each given a Representation, and the two Representations are Matched]
• Matching
– Exact match = Boolean Model
– Weighted match = Vector Model
Term Weighting
• Goal: favor discriminating terms
• Commonly used: TF x IDF
• IDF(t_i) = log2(N) - log2(n_i) + 1
– N = total number of documents in the corpus
– t_i = a term (typically a stemmed word)
– n_i = number of documents that contain at least one occurrence of the term t_i
Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21.
Salton, G. & Buckley, C. (1988) Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-23.
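The IDF formula on this slide can be sketched in a few lines of Python. This is only an illustration: the toy documents and the helper names `idf` and `document_frequencies` are assumptions, not part of the original study.

```python
import math

def idf(num_docs, doc_freq):
    """IDF(t_i) = log2(N) - log2(n_i) + 1, as given on the slide."""
    return math.log2(num_docs) - math.log2(doc_freq) + 1

def document_frequencies(docs):
    """Count, for each term, the documents containing it at least once."""
    counts = {}
    for doc in docs:
        for term in set(doc):  # set(): one count per document
            counts[term] = counts.get(term, 0) + 1
    return counts

# Hypothetical toy corpus: each document is a list of (stemmed) terms.
docs = [
    ["the", "synthesis", "of", "the", "compound"],
    ["the", "chemist", "measured", "the", "yield"],
    ["the", "yield", "of", "the", "synthesis"],
    ["the", "electrode", "potential"],
]

df = document_frequencies(docs)
idf_values = {term: idf(len(docs), n) for term, n in df.items()}
```

A term that occurs in every document ("the") receives the minimum weight of 1, and the weight grows by 1 each time the document frequency halves.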
Practical Motivations
• Systems moving toward sub-document retrieval
– Document Summarization: why not use Inverse Sentence Frequency (ISF)?
– Question Answering: why not use Inverse Term Frequency (ITF)?
• Calculating IDF is problematic
– How many documents are needed to have stable IDF estimates?
• Corpora have changed since initial experiments
– # documents
– Vocabulary size
– # terms per document
Theoretical Motivations
• TF x IDF combines two different event spaces
– TF: number of terms
– IDF: number of documents
– Are the limits of these spaces really the same?
• Foundational theories use the term space
– Zipf’s Law (Zipf, 1949)
– Shannon’s Theory (Shannon, 1948)
Goal : Compare and Contrast
1. Raw term comparison
2. Zipf Law comparison
3. Direct IDF, ISF, and ITF comparison
4. Abstract versus full-text comparison
5. IDF Sensitivity
Corpora
• Full text scientific articles in chemistry
• Initial corpus:
– 103,262 articles
– Published in 27 journals over the last 4 years
– Two journals excluded due to formatting inconsistencies
• These experiments:
– 100,830 articles
– 16,538,655 sentences
– 526,025,066 total unstemmed terms
– 2,001,730 distinct unstemmed terms
– 1,391,763 distinct stemmed terms (Porter algorithm)
Journal   # Docs   % Corpus   Avg Length   Terms (millions)   % Terms
ACHRE4       548      0.5        4923            2.7              1
ANCHAM      4012      4.0        4860           19.5              4
BICHAW      8799      8.7        6674           58.7             11
BIPRET      1067      1.1        4552            4.9              1
BOMAF6      1068      1.1        4847            5.2              1
CGDEFU       566      0.5        3741            2.1             <1
CMATEX      3598      3.6        4807           17.3              3
ESTHAG      4120      4.1        5248           21.6              4
IECRED      3975      3.9        5329           21.2              4
INOCAJ      5422      5.4        6292           34.1              6
JACSAT     14400     14.3        4349           62.6             12
JAFCAU      5884      5.8        4185           24.6              5
JCCHFF       500      0.5        5526            2.8              1
JCISD8      1092      1.1        4931            5.4              1
JMCMAR      3202      3.2        8809           28.2              5
JNPRDF      2291      2.2        4144            9.5              2
JOCEAH      7307      7.2        6605           48.3              9
JPCAFH      7654      7.6        6181           47.3              9
JPCBFK      9990      9.9        5750           57.4             11
JPROBS       268      0.3        4917            1.3             <1
MAMOBX      6887      6.8        5283           36.4              7
MPOHBP        58      0.1        4868            0.3             <1
NALEFD      1272      1.3        2609            3.3              1
OPRDFK       858      0.8        3616            3.1              1
ORLEF7      5992      5.9        1477            8.8              2
Example IDF, ISF, ITF
                  Document                  Sentence                  Term
Term          Abstract Non-Abs  All    Abstract Non-Abs  All    Abstract Non-Abs  All
the              1.0     1.0    1.0       1.3     1.4    1.4       4.6     9.4    5.2
chemist         11.1     6.0    5.7      13.6    12.8   12.6      22.8    17.6   17.6
synthesis       14.3    11.2   10.8      17.1    18.0   17.6      26.4    22.6   22.5
electrochem     17.5    15.3   15.0      20.3    22.6   22.4      29.6    27.0   27.5

IDF(t_i) = log2(N) - log2(n_i) + 1
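The three weights in the table can be sketched side by side. The assumption here, not stated explicitly on the slide, is that ISF and ITF reuse the IDF formula with N and n_i counted in sentences and in term occurrences respectively. The toy corpus and the function names are hypothetical.

```python
import math
from collections import Counter

def inverse_frequency(total, count):
    """log2(total) - log2(count) + 1, the same form as the IDF formula."""
    return math.log2(total) - math.log2(count) + 1

def event_space_weights(corpus):
    """Return {term: (IDF, ISF, ITF)}.

    corpus: list of documents; each document is a list of sentences;
    each sentence is a list of terms.
    Assumption: ISF counts sentences and ITF counts term occurrences.
    """
    doc_count, sent_count, term_count = Counter(), Counter(), Counter()
    n_docs = n_sents = n_terms = 0
    for doc in corpus:
        n_docs += 1
        doc_terms = set()
        for sentence in doc:
            n_sents += 1
            n_terms += len(sentence)
            term_count.update(sentence)       # every occurrence counts
            sent_count.update(set(sentence))  # once per sentence
            doc_terms.update(sentence)
        doc_count.update(doc_terms)           # once per document
    return {
        term: (
            inverse_frequency(n_docs, doc_count[term]),
            inverse_frequency(n_sents, sent_count[term]),
            inverse_frequency(n_terms, term_count[term]),
        )
        for term in term_count
    }

# Hypothetical toy corpus: 2 documents, 4 sentences, 16 term occurrences.
corpus = [
    [["the", "synthesis", "of", "the"], ["the", "yield", "was", "high"]],
    [["the", "chemist", "left"], ["a", "yield", "of", "the", "salt"]],
]
weights = event_space_weights(corpus)
```

Because the totals grow from documents to sentences to term occurrences, the same rare term scores higher in each successive space, which matches the ordering of the columns in the table.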
1) Raw term comparison
• Document vs Sentence Frequency (log scales)
[Figure: log-log plot; x-axis: Document Frequency (Log scale), 1.0E+00 to 1.0E+06; y-axis: Average Sentence Frequency (Log scale), 1.0E+0 to 1.0E+8]
1) Raw term comparison
• Document vs Term Frequency (log scales)
[Figure: log-log plot; x-axis: Document Frequency (Log scale), 1.00E+00 to 1.00E+06; y-axis: Average Term Frequency (Log scale), 1.0E+0 to 1.0E+8; shown with an image labeled "Luhn"]
Image Source: Van Rijsbergen, 1979
1) Raw term comparison
• Sentence vs Term Frequency (log scales)
[Figure: log-log plot; x-axis: Sentence Frequency (Log scale), 1.0E+00 to 1.0E+07; y-axis: Average Term Frequency (Log scale), 1.0E+0 to 1.0E+8]
2) Zipf Law comparison
• Zipf’s Law: the frequency of terms in a corpus conforms to a power law distribution K/j^θ, where θ is close to 1 (Zipf, 1949)
• Term distributions followed a power law
• θ differed between the event spaces
– Average θ in document space = -1.65
– Average θ in sentence space = -1.73
– Average θ in term spaces = -1.73
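One common way to estimate a power-law exponent is a least-squares line fit in log-log space, sketched below. The slides do not say how θ was fitted, so the method, the function name `loglog_slope`, and the synthetic data are assumptions. The negative values reported above are consistent with reading θ as the fitted slope, though that reading is also an assumption.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * log10(x).

    For y = K / x**theta the fitted slope is -theta and the
    intercept is log10(K).
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data that follows K / j with K = 1000 (theta = 1 exactly).
ranks = list(range(1, 101))
freqs = [1000.0 / j for j in ranks]
slope, intercept = loglog_slope(ranks, freqs)
```

On this ideal data the slope comes out at -1 and the intercept at 3 (log10 of 1000). Slopes steeper than -1, like those reported on the slide, mean the counts fall off faster than the classic Zipf curve.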
2) Example Document Distribution
[Figure: journal MAMOBX; log-log plot against Word Occurrences (log), 1.E+0 to 1.E+4; series: Actual, theta=1]
2) θ Comparison of all journals
[Figure: scatter plot; x-axis: Document Slope, -1.80 to -1.50; y-axis: Sentence or Term Slope, -1.85 to -1.55; series: Sentence, Term; labeled point: JACSAT]
3) Direct IDF vs ISF comparison
[Figure: ISF (Avg, Min, Max) plotted against IDF values 1 to 18; linear fit y = 1.0662x + 5.5724, R^2 = 0.9974]
3) Direct IDF vs ITF comparison
[Figure: ITF (Avg, Min, Max) plotted against IDF values 1 to 18; linear fit y = 1.0721x + 10.452, R^2 = 0.9972]
3) Direct ISF vs ITF comparison
[Figure: ITF (Avg, Min, Max) plotted against ISF values 1 to 25; linear fit y = 1.0144x + 4.6937, R^2 = 0.9996]
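The Avg/Min/Max series in these comparisons suggest that terms were grouped by their value in one space and summarized in the other. A minimal sketch of that grouping follows. Rounding to the nearest integer is an assumption, since the slides do not say how the bins were formed, and the sample pairs are hypothetical apart from the two taken from the example table.

```python
import math
from collections import defaultdict

def summarize_by_bucket(pairs):
    """Group (x, y) pairs by x rounded to the nearest integer.

    Returns {bucket: (average y, minimum y, maximum y)}.
    Assumption: bins are formed by rounding half up.
    """
    buckets = defaultdict(list)
    for x, y in pairs:
        buckets[math.floor(x + 0.5)].append(y)
    return {
        b: (sum(ys) / len(ys), min(ys), max(ys))
        for b, ys in sorted(buckets.items())
    }

# (IDF, ISF) pairs. (1.0, 1.4) and (5.7, 12.6) are the "All" values for
# "the" and "chemist" in the example table; the rest are made up.
pairs = [(1.0, 1.4), (1.2, 2.0), (5.7, 12.6), (6.1, 12.0), (6.4, 13.4)]
summary = summarize_by_bucket(pairs)
```

A straight line fitted to the per-bucket averages would then give an equation of the form reported in the figures.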
4) Abstract versus full-text
[Figure: x-axis: Global IDF, 1 to 18; y-axis: Average abstract/Non-abstract IDF, 0 to 18; series: Abstract, Non-Abstract]
5) IDF Sensitivity
[Figure: x-axis: IDF of Total Corpus, 1 to 18; y-axis: Average IDF of Stemmed Terms, 0 to 18; one series per sample size: 10, 20, 30, 40, 50, 60, 70, 80, 90 % of Total Corpus]
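The sensitivity experiment can be sketched as: draw a sample of the documents, recompute IDF on the sample, and compare with the full-corpus values. The sketch below uses a simple random sample rather than the stratified sample mentioned in the conclusions, and the synthetic corpus, function names, and comparison measure are all assumptions.

```python
import math
import random

def idf_table(docs):
    """IDF(t_i) = log2(N) - log2(n_i) + 1 for every term in docs."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log2(n) - math.log2(c) + 1 for t, c in df.items()}

def sampled_idf(docs, fraction, seed=0):
    """IDF computed on a simple random sample of the documents."""
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    k = max(1, round(len(docs) * fraction))
    return idf_table(rng.sample(docs, k))

def mean_abs_difference(global_idf, sample_idf):
    """Average |global - sample| over terms present in the sample."""
    shared = [t for t in sample_idf if t in global_idf]
    return sum(abs(global_idf[t] - sample_idf[t]) for t in shared) / len(shared)

# Synthetic corpus: "the" in every document, "synthesis" in every 4th,
# "chemist" in every 10th.
docs = [
    ["the"]
    + (["synthesis"] if i % 4 == 0 else [])
    + (["chemist"] if i % 10 == 0 else [])
    for i in range(200)
]

global_idf = idf_table(docs)
sample_20 = sampled_idf(docs, 0.20)
difference = mean_abs_difference(global_idf, sample_20)
```

Repeating this for each sample size and averaging the sampled IDF within each global IDF value would give curves like those in the figure. Rare terms can be missing from a small sample entirely, which is why the comparison only covers terms the sample contains.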
5) IDF Sensitivity
[Figure: x-axis: Global IDF, 1 to 18; y-axis: Average Local IDF, 0 to 15; one series per journal: ACHRE4, ANCHAM, BICHAW, BIPRET, BOMAF6, CGDEFU, CMATEX, ESTHAG, IECRED, INOCAJ, JACSAT, JAFCAU, JCCHFF, JCISD8, JMCMAR, JNPRDF, JOCEAH, JPCAFH, JPCBFK, JPROBS, MAMOBX, MPOHBP, NALEFD, OPRDFK, ORLEF7]
Conclusions
• Raw document frequencies differ from sentence and term frequencies
– around the areas of important terms
– difficult to perform a linear transformation from the document to a sub-document space
• Raw term frequencies correlate well with the sentence frequencies
• IDF, ISF and ITF are highly correlated
Conclusions
• IDF values are surprisingly stable
– with respect to random samples at 10% of the total corpus
– average IDF values based on only a 20% random stratified sample correlated almost perfectly to IDF
• Journal-based IDF samples did not correlate well to the global IDF
• Language used in abstracts is systematically different from the language used in the body of a full-text scientific document