Processing of large document collections
Part 5
In this part:
indexing
querying
index construction
Indexing
An index is a mechanism for locating a given term in a text
with an index in a book, it is possible to find information without browsing the pages
in large document collections (gigabytes), page-by-page search would be impossible
Indexing
It is assumed that:
a document collection consists of a set of separate documents
each document is described by a set of representative terms
the index must be capable of identifying all documents that contain combinations of specified terms
a document is the unit of text that is returned in response to queries
Indexing
What is a document? E.g. emails
sender, recipient, subject, message body
one email, one field, a set of emails?
Indexing
Granularity of the index = the resolution to which term locations are recorded within each document
e.g. 1 email = 1 document, but the index could be capable of ascertaining a more exact location of each term within the document
e.g. which documents contain the terms 'tax' and 'avoidance' in the same sentence?
Indexing
If the granularity of the index is taken to be one word, the index will record the exact location of every word in the collection
the original text can then be recovered from the index
but the index takes more space than the original text
Indexing
Choice of representative terms
if each word that appears in the documents is included verbatim as a term in the index, the number of terms is huge
usually some transformations are applied: case folding, stemming (baseword reduction), removal of stopwords
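These transformations can be sketched as follows; the stopword list and the suffix-stripping rules below are illustrative placeholders, not a real stemming algorithm:

```python
# A minimal sketch of term normalization: case folding, stopword removal,
# and crude suffix stripping. The stopword set and suffix rules are
# illustrative only, not a real stemmer such as Porter's.

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def normalize(word):
    """Case-fold a word and strip a few common English suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[:-len(suffix)]
    return w

def extract_terms(text):
    """Return the representative terms of a document."""
    words = [w.strip(".,!?\"'()") for w in text.split()]
    return [normalize(w) for w in words
            if w and w.lower() not in STOPWORDS]

print(extract_terms("The compression and retrieval of large texts"))
# -> ['compression', 'retrieval', 'large', 'text']
```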
Inverted file indexing
An inverted file contains, for each term in the lexicon, an inverted list that stores a list of pointers to all occurrences of that term in the main text
each pointer is the number of a document in which that term appears
a lexicon: a list of all terms that appear in the document collection; it supports mapping from terms to their corresponding inverted lists
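As a minimal sketch (assuming whitespace tokenization and no term normalization), an inverted file can be built as a mapping from each term to its ascending list of document numbers:

```python
# A minimal sketch of inverted file construction: for each term, store the
# ascending list of document numbers in which it occurs. Documents are
# numbered from 1, as in the lecture's examples.

def build_inverted_file(documents):
    """documents: list of strings; returns {term: [doc numbers]}."""
    index = {}
    for doc_num, text in enumerate(documents, start=1):
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_num)
    # each list ends up in ascending document order because the
    # documents are processed in order
    return index

docs = ["text compression", "text retrieval", "image compression"]
index = build_inverted_file(docs)
print(index["compression"])  # -> [1, 3]
print(index["text"])         # -> [1, 2]
```

The keys of the dictionary play the role of the lexicon; each value is the corresponding inverted list.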
Inverted file indexing
A query involving a single term is answered by scanning its inverted list and retrieving every document that it cites
for conjunctive Boolean queries of the form 'term AND term AND ... AND term', the intersection of the terms' inverted lists is formed
for disjunction (OR): union of lists
for negation (NOT): complement
Inverted file indexing
The inverted lists are usually stored in order of increasing document number
various merging operations can then be performed in time that is linear in the size of the lists
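For example, the intersection of two such ascending lists can be formed in a single linear merge pass:

```python
def intersect(list1, list2):
    """Linear-time intersection of two ascending lists of document numbers."""
    result, i, j = [], 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            result.append(list1[i])
            i += 1
            j += 1
        elif list1[i] < list2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([3, 5, 20, 21, 23], [5, 21, 40]))  # -> [5, 21]
```

Union and complement admit analogous single-pass merges over the sorted lists.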
Inverted file indexing: granularity
A coarse-grained index might identify only a block of text, where each block stores several documents
a moderate-grain index will store locations in terms of document numbers
a fine-grained index will return a sentence or word number
Inverted file indexing: granularity
Coarse indexes require less storage, but during retrieval, more of the plain text must be scanned to find terms
multiterm queries are more likely to give rise to false matches, where each of the desired terms appears somewhere in the block, but not all within the same document
Inverted file indexing: granularity
Word-level indexing enables queries involving adjacency and proximity to be answered quickly, because the desired relationship can be checked before the text is retrieved
adding precise locational information expands the index: there are more pointers in the index, and each pointer requires more bits of storage
Inverted file indexing: granularity
Unless a significant fraction of the queries are expected to be proximity-based, the usual granularity is individual documents
phrase-based queries can then be handled by the slightly slower method of a postretrieval scan
Inverted file compression
Uncompressed inverted files can consume considerable space: 50-100% of the space of the text itself
the size of an inverted file can be reduced considerably by compressing it
key for compression: each inverted list can, without any loss of generality, be stored as an ascending sequence of integers
Inverted file compression
Suppose that some term appears in 8 documents of a collection; the term is described in the inverted file by a list: <8; 3, 5, 20, 21, 23, 76, 77, 78>
the address of which is contained in the lexicon
more generally, the list for a term t stores the number of documents f_t in which the term appears, followed by a list of f_t document numbers
Inverted file compression
The list of document numbers within each inverted list is in ascending order, and all processing is sequential from the beginning of the list -> the list can be stored as an initial position followed by a list of d-gaps
the list for the term above: <8; 3, 2, 15, 1, 2, 53, 1, 1>
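The d-gap transformation and its inverse are simple prefix-sum operations; a sketch using the list from the slide:

```python
def to_dgaps(doc_numbers):
    """Convert an ascending list of document numbers to d-gaps."""
    gaps, prev = [], 0
    for d in doc_numbers:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_dgaps(gaps):
    """Recover document numbers from d-gaps by prefix summation."""
    docs, total = [], 0
    for g in gaps:
        total += g
        docs.append(total)
    return docs

docs = [3, 5, 20, 21, 23, 76, 77, 78]      # the list from the slide
print(to_dgaps(docs))                       # -> [3, 2, 15, 1, 2, 53, 1, 1]
print(from_dgaps(to_dgaps(docs)) == docs)   # -> True
```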
Inverted file compression
The two forms are equivalent, but it is not obvious that any saving has been achieved: the largest d-gap in the second representation is still potentially the same as the largest document number in the first
if there are N documents in the collection and a flat binary encoding is used to represent the gap sizes, both methods require log N bits per stored pointer
Inverted file compression
Considering each inverted list as a list of d-gaps, the sum of which is bounded by N, allows an improved representation -> it is possible to code inverted lists using on average substantially fewer than log N bits per pointer
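One well-known code of this kind is the Elias gamma code, a global method that spends few bits on small gaps; a sketch (bit strings are represented as Python strings for clarity, not as packed bits):

```python
def gamma_encode(n):
    """Elias gamma code: unary length prefix followed by the binary value.
    Small d-gaps get short codes, so skewed gap distributions compress well."""
    assert n >= 1
    binary = bin(n)[2:]                 # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Decode a single gamma-coded integer from a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

print(gamma_encode(1))        # -> '1'
print(gamma_encode(5))        # -> '00101'
print(gamma_decode("00101"))  # -> 5
```

A gap of size g costs about 2·log g + 1 bits instead of a flat log N bits.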
Inverted file compression
Many specific models have been proposed
global methods: every inverted list is compressed using the same common model
local methods: adjusted according to some parameter, usually frequency; they tend to outperform global ones, but are more complex to implement
Querying
How to use an index to locate information in the text it describes?
Boolean queries
A Boolean query comprises a list of terms that are combined using the connectives AND, OR, and NOT
the answers to the query are those documents that satisfy the condition
Boolean queries
e.g. 'text' AND 'compression' AND 'retrieval'
all three words must occur somewhere in every answer (in no particular order)
"the compression and retrieval of large amounts of text is an interesting problem"
"this text describes the fractional distillation scavenging technique for retrieving argon from compressed air"
Boolean queries
A problem with all retrieval systems: non-relevant answers are returned and must be filtered out manually
broad query -> high recall; narrow query -> high precision
Boolean queries
Small variations in a query can generate very different results
data AND compression AND retrieval
text AND compression AND retrieval
the user should be able to pose complex queries like (text OR data OR image) AND (compression OR compaction OR decompression) AND (archiving OR retrieval OR storage)
Ranked queries
Non-professional users might prefer simply giving a list of words that are of interest and letting the retrieval system supply the documents that seem most relevant, rather than seeking exact Boolean answers
text, data, image, compression, compaction, archiving, storage, retrieval...
Ranked queries
It would be useless to convert a list of words to a Boolean query
connect with AND -> too few documents
connect with OR -> too many documents
solution: a ranked query
a heuristic is applied to measure the similarity of each document to the query
the r most closely matching documents are returned
Ranking strategies
Simple techniques count the number of query terms that appear somewhere in the document
a document that contains 5 query terms is ranked higher than a document that contains 3 query terms
more advanced techniques: e.g. the cosine measure, which takes into account the lengths of the documents etc.
Accessing the lexicon
The lexicon for an inverted file index stores
the terms that can be used to search the collection
the information needed to allow queries to be processed: the address in the inverted file (of the corresponding list of document numbers) and the number of documents containing the term
Access structures
A simple structure: an array of records, each comprising a string along with two integer fields
if the lexicon is sorted, a word can be located by a binary search of the strings
consumes a lot of space
e.g. a collection of a million words (~5 GB), stored as 20-byte strings, with a 4-byte inverted file address and a 4-byte frequency value -> 28 MB
Access structures
The space for the strings is reduced if they are all concatenated into one long contiguous string
an array of 4-byte character pointers is used for access
each term costs its exact number of characters + 4 for the pointer
it is not necessary to store string lengths: the next pointer indicates the end of the string
in the collection of a million terms, the memory requirement is reduced by 8 MB -> 20 MB
Access structures
The memory required can be further reduced by eliminating many of the string pointers
e.g. only 1 word in 4 is indexed, and each stored word is prefixed by a 1-byte length field
the length field allows the start of the next string to be identified and the block of strings traversed
Access structures
In each group, 12 bytes of pointers are saved at the cost of including 4 bytes of length information
for a million-word lexicon: a saving of 2 MB -> 18 MB
Access structures
Blocking makes the search process more complex: to look up a term
the array of string pointers is binary-searched to locate the correct block of words
the block is scanned in a linear fashion to find the term
the term's ordinal term number is inferred from the combination of the block number and the position within the block
the frequency value and inverted file address are accessed using the ordinal term number
Access structures
Consecutive words in a sorted list are likely to share a common prefix
front coding: 2 integers are stored with each word
one to indicate how many prefix characters are the same as the previous word
the other to record how many suffix characters remain when the prefix is removed
the integers are followed by the suffix characters
Access structures
Front coding yields a net saving of about 40 percent of the space required for string storage in a typical lexicon for the English language
problem with complete front coding: binary search is no longer possible
solution: partial 3-in-4 front coding
Access structures
Partial 3-in-4 front coding: every 4th word (the one indexed by the block pointer) is stored without front coding, so that binary search can proceed
on a large lexicon, this is expected to save about 4 bytes on each of three words, at the cost of 2 extra bytes of prefix-length information
a net gain of 10 bytes per 4-word block
for a million-word lexicon: -> 15.5 MB
Disk-based lexicon storage
The amount of primary memory required by the lexicon can be reduced by putting the lexicon on disk
just enough information is retained in primary memory to identify the disk block corresponding to each term
Disk-based lexicon storage
To locate the information corresponding to a given term
the in-memory index is searched to determine a block number
the block is read into a buffer
the search is continued within the block
a B-tree etc. can be used
Disk-based lexicon storage
This approach is simple and requires a minimal amount of primary memory
a disk-based lexicon is many times slower to access than a memory-based one: one disk access per lookup is required
the extra time is tolerable when just a few terms are being looked up (as in normal query processing, less than 50 terms)
not suitable for the index construction process
Boolean query processing
Processing a query:
the lexicon is searched for each term in the query
each inverted list is retrieved and decoded
the lists are merged, taking the intersection, union, or complement, as appropriate
finally, the documents are retrieved and displayed
Conjunctive queries
text AND compression AND retrieval
a conjunctive query of r terms is processed as follows:
each term is stemmed and located in the lexicon
if the lexicon is on disk, one disk access per term is required
the terms are sorted by increasing frequency
Conjunctive queries
The inverted list for the least frequent term is read into memory
this list forms the initial set of candidates (documents that have not yet been eliminated and might be answers to the query)
all remaining inverted lists are processed against this set of candidates, in increasing order of term frequency
Conjunctive queries
In a conjunctive query, a candidate cannot be an answer unless it appears in all inverted lists -> the size of the set of candidates is non-increasing
to process a term, each document in the set of candidates is checked and removed if it does not appear in the term's inverted list
the remaining candidates are the answers
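The candidate-set algorithm above can be sketched as follows, assuming an in-memory index mapping terms to ascending document lists (stemming and disk access are omitted, and Python set intersection stands in for the per-candidate membership checks):

```python
# A minimal sketch of conjunctive query processing: inverted lists are
# processed in increasing order of term frequency, so the candidate set
# can only shrink.

def conjunctive_query(index, terms):
    # sort terms so the least frequent list initializes the candidates
    terms = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not terms or terms[0] not in index:
        return []
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index.get(t, []))
        if not candidates:      # no further processing needed once empty
            break
    return sorted(candidates)

index = {
    "text": [1, 2, 4, 5],
    "compression": [1, 3, 4],
    "retrieval": [4, 5],
}
print(conjunctive_query(index, ["text", "compression", "retrieval"]))  # -> [4]
```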
Term processing order
Reasons to select the least frequent term to initialize the set of candidates (and to process the rest in increasing frequency order):
to minimize the amount of temporary memory space required during query processing
the number of candidates may be quickly reduced, even to zero, after which no processing is required
Processing ranked queries
How to assign a similarity measure to each document that indicates how closely it matches a query?
Coordinate matching
Count the number of query terms that appear in each document
the more terms that appear, the more likely it is that the document is relevant
a hybrid between a conjunctive AND query and a disjunctive OR query
a document that contains any of the terms is a potential answer, but preference is given to documents that contain all or most of them
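A minimal sketch of coordinate matching over an inverted file index (the index contents are illustrative):

```python
# A minimal sketch of coordinate matching: rank documents by how many
# distinct query terms they contain, returning the r best.

def coordinate_match(index, query_terms, r=3):
    scores = {}
    for t in set(query_terms):
        for doc in index.get(t, []):
            scores[doc] = scores.get(doc, 0) + 1
    # best score first; ties broken by document number
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:r]

index = {
    "text": [1, 2],
    "data": [2, 3],
    "compression": [1, 2, 3],
}
print(coordinate_match(index, ["text", "data", "compression"]))
# -> [(2, 3), (1, 2), (3, 2)]
```

Document 2 contains all three terms, so it ranks first; documents 1 and 3 each contain two.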
Inner product similarity
Coordinate matching can be formalized as an inner product of a query vector with a set of document vectors
the similarity measure of query Q with document D_d is expressed as M(Q, D_d) = Q · D_d
the inner product of two n-vectors X and Y:
X · Y = Σ_{i=1}^{n} x_i y_i
Drawbacks
Takes no account of term frequency: documents with many occurrences of a term should be favored
takes no account of term scarcity: rare terms should have more weight?
long documents with many terms are automatically favored: they are likely to contain more of any given list of query terms
Solutions
Term frequency
the binary "present" - "not present" judgment can be replaced with an integer indicating how many times the term appears in the document
f_{d,t}: within-document frequency
more generally: a term t in document d can be assigned a document-term weight w_{d,t} and a query-term weight w_{q,t}
Solutions
The similarity measure is the inner product of these two weight vectors
it is normal to assign w_{q,t} = 0 if t does not appear in Q, so the measure can be stated as
M(Q, D_d) = Σ_{t∈Q} w_{q,t} · w_{d,t}
Inverse document frequency
If only the term frequency is taken into account, and a query contains common words, a document with enough appearances of a common term is always ranked first, irrespective of other words
-> terms can be weighted according to their inverse document frequency
Weighting
Many possibilities exist to combine term frequency and inverse document frequency
principles:
a term that appears in many documents should not be regarded as being more important than a term that appears in a few
a document with many occurrences of a term should not be regarded as being less important than a document that has just a few
Weighting
For instance, TF×IDF for w_{d,t} and IDF for w_{q,t}
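One common TF×IDF formulation (among many variants) can be sketched as follows; the exact formula here is illustrative, not the only one used in practice:

```python
# A minimal sketch of TFxIDF weighting, under one common formulation:
# w_{d,t} = f_{d,t} * log(N / f_t), where f_{d,t} is the within-document
# frequency, N the number of documents in the collection, and f_t the
# number of documents containing term t.
import math

def tfidf(f_dt, f_t, N):
    if f_dt == 0 or f_t == 0:
        return 0.0
    return f_dt * math.log(N / f_t)

# a term occurring 3 times in a document, in 10 of 1000 documents
print(round(tfidf(3, 10, 1000), 3))
```

The log(N / f_t) factor shrinks toward zero as a term becomes common, satisfying the first principle above; the f_{d,t} factor satisfies the second.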
Similarity of vectors
Long documents should not be favored over short documents
the similarity of the direction indicated by the two vectors is measured
similarity is defined as the cosine of the angle θ between the document and query vectors
cos θ = 1 when θ = 0; cos θ = 0 when the vectors are orthogonal
Similarity of vectors
The cosine of the angle between two vectors can be calculated as
cos θ = (X · Y) / (|X| |Y|) = Σ_{i=1}^{n} x_i y_i / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))
|X| is the length of vector X: the normalization factor
Similarity of vectors
Cosine rule for ranking:
cosine(Q, D_d) = (1 / (W_q W_d)) Σ_{t=1}^{n} w_{q,t} · w_{d,t}
where
W_d = √(Σ_{t=1}^{n} w_{d,t}²) and W_q = √(Σ_{t=1}^{n} w_{q,t}²)
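The cosine rule can be sketched directly from the formula, with the query and each document represented as sparse weight vectors:

```python
# A minimal sketch of cosine similarity: the inner product of two sparse
# weight vectors, divided by the normalization factors W_q and W_d.
import math

def cosine(query_weights, doc_weights):
    """query_weights, doc_weights: {term: weight} dictionaries."""
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    wq = math.sqrt(sum(w * w for w in query_weights.values()))
    wd = math.sqrt(sum(w * w for w in doc_weights.values()))
    if wq == 0 or wd == 0:
        return 0.0
    return dot / (wq * wd)

q = {"text": 1.0, "compression": 1.0}
d_long = {"text": 3.0, "compression": 3.0, "data": 3.0}
d_short = {"text": 1.0, "compression": 1.0}
print(cosine(q, d_short))                       # close to 1.0
print(cosine(q, d_long) < cosine(q, d_short))   # True: length is normalized away
```

The longer document scores lower despite containing more term occurrences, because the normalization measures direction rather than magnitude.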
Index construction
Each document of the collection contains some index terms, and each index term appears in some of the documents
this relationship can be expressed with a frequency matrix
each column corresponds to one word
each row corresponds to one document
the number stored at any row and column is the frequency, in that document, of the word indicated by that column
Index construction
Each document of the collection is summarized in one row of the frequency matrix
to create an index, the matrix must be transposed, forming a new version in which the rows are the term numbers
from this form, an inverted file index is easy to construct
Index construction
Trivial algorithm:
build in memory a transposed frequency matrix, reading the text in document order, one column of the matrix at a time
write the matrix to disk row by row, in term order
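A toy-scale sketch of the trivial algorithm, feasible only because the collection here is tiny:

```python
# A minimal sketch of the trivial inversion algorithm: build the full
# transposed frequency matrix in memory, then emit it row by row in term
# order. The matrix for a large collection would be far too big for this.

docs = ["text compression", "text retrieval", "compression"]
terms = sorted({w for d in docs for w in d.split()})

# transposed frequency matrix: one row per term, one column per document
matrix = [[d.split().count(t) for d in docs] for t in terms]

# emit row by row: each row yields a term's inverted list with frequencies
for t, row in zip(terms, matrix):
    postings = [(doc_num + 1, f) for doc_num, f in enumerate(row) if f > 0]
    print(t, postings)
```

Each row of the transposed matrix is exactly one inverted list, which is why the inverted file is easy to construct from this form.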
Index construction
In reality, inversion is much more difficult
the problem is the size of the frequency matrix
for instance, for a collection that has 535346 distinct terms and 741856 documents, the size of the matrix can be 1.4 TB
we could use a machine with a large virtual memory -> it would take 2 months
Index construction
More economical methods for constructing and inverting a frequency matrix exist
an index for the large collection mentioned above could be created in less than 2 hours (1998) on a personal computer, consuming just 30 MB of main memory and less than 20 MB of temporary disk space over the space required by the final inverted file
Final words
We have discussed:
character sets
preprocessing of text
feature selection
text categorization
text summarization
text compression
indexing, querying
Final words
What else there is...
structured documents (XML, ...)
metadata (semantic Web, ontologies, ...)
linguistic resources (WordNet, thesauri, ...)
document management systems (archiving, ...)
document analysis (scanning of documents)
digital libraries
text mining, question answering, ...
Administrative...
Exam on Tuesday 4.12.
"large" essays (~2-3 pages each)
"data comprehension" (e.g. recall/precision)
use full sentences!
Exercise points: 28 or more original points -> 30 pts; otherwise original points + 2
Remember the course feedback survey (Kurssikysely)!