Processing of large document collections
Part 5
In this part:
indexing
querying
index construction
Indexing
An index is a mechanism for locating a given term in a text
with an index in a book, it is possible to find information without browsing the pages
in large document collections (gigabytes), page-by-page search would be impossible
Indexing
It is assumed that:
a document collection consists of a set of separate documents
each document is described by a set of representative terms
the index must be capable of identifying all documents that contain combinations of specified terms
a document is the unit of text that is returned in response to queries
Indexing
What is a document? E.g. emails
sender, recipient, subject, message body
one email, one field, a set of emails?
Indexing
Granularity of the index = the resolution to which term locations are recorded within each document
e.g. 1 email = 1 document, but the index could be capable of ascertaining a more exact location of each term within the document
e.g. which documents contain the terms 'tax' and 'avoidance' in the same sentence?
Indexing
If the granularity of the index is taken to be one word, the index will record the exact location of every word in the collection
the original text can then be recovered from the index
but the index takes more space than the original text
Indexing
Choice of representative terms
if each word that appears in the documents is included verbatim as a term in the index, the number of terms is huge
usually some transformations are applied: case folding, stemming (baseword reduction), removal of stopwords
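These transformations can be sketched as follows; the stopword list and the suffix-stripping rules below are illustrative placeholders, not a real stemming algorithm:

```python
# A minimal sketch of term normalization: case folding, stopword removal,
# and crude suffix stripping. The stopword set and suffix rules are
# illustrative only, not a real stemmer such as Porter's.

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def normalize(word):
    """Case-fold a word and strip a few common English suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[:-len(suffix)]
    return w

def extract_terms(text):
    """Return the representative terms of a document."""
    words = [w.strip(".,!?\"'()") for w in text.split()]
    return [normalize(w) for w in words
            if w and w.lower() not in STOPWORDS]

print(extract_terms("The compression and retrieval of large texts"))
# -> ['compression', 'retrieval', 'large', 'text']
```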
Inverted file indexing
An inverted file contains, for each term in the lexicon, an inverted list that stores a list of pointers to all occurrences of that term in the main text
each pointer is the number of a document in which that term appears
a lexicon: a list of all terms that appear in the document collection; it supports mapping from terms to their corresponding inverted lists
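As a minimal sketch (assuming whitespace tokenization and no term normalization), an inverted file can be built as a mapping from each term to its ascending list of document numbers:

```python
# A minimal sketch of inverted file construction: for each term, store the
# ascending list of document numbers in which it occurs. Documents are
# numbered from 1, as in the lecture's examples.

def build_inverted_file(documents):
    """documents: list of strings; returns {term: [doc numbers]}."""
    index = {}
    for doc_num, text in enumerate(documents, start=1):
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_num)
    # each list ends up in ascending document order because the
    # documents are processed in order
    return index

docs = ["text compression", "text retrieval", "image compression"]
index = build_inverted_file(docs)
print(index["compression"])  # -> [1, 3]
print(index["text"])         # -> [1, 2]
```

The keys of the dictionary play the role of the lexicon; each value is the corresponding inverted list.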
Inverted file indexing
A query involving a single term is answered by scanning its inverted list and retrieving every document that it cites
for conjunctive Boolean queries of the form 'term AND term AND ... AND term', the intersection of the terms' inverted lists is formed
for disjunction (OR): union of lists
for negation (NOT): complement
Inverted file indexing
The inverted lists are usually stored in order of increasing document number
various merging operations can then be performed in time that is linear in the size of the lists
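For example, the intersection of two such ascending lists can be formed in a single linear merge pass:

```python
def intersect(list1, list2):
    """Linear-time intersection of two ascending lists of document numbers."""
    result, i, j = [], 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            result.append(list1[i])
            i += 1
            j += 1
        elif list1[i] < list2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([3, 5, 20, 21, 23], [5, 21, 40]))  # -> [5, 21]
```

Union and complement admit analogous single-pass merges over the sorted lists.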
Inverted file indexing: granularity
A coarse-grained index might identify only a block of text, where each block stores several documents
a moderate-grain index will store locations in terms of document numbers
a fine-grained index will return a sentence or word number
Inverted file indexing: granularity
Coarse indexes require less storage, but during retrieval, more of the plain text must be scanned to find terms
multiterm queries are more likely to give rise to false matches, where each of the desired terms appears somewhere in the block, but not all within the same document
Inverted file indexing: granularity
Word-level indexing enables queries involving adjacency and proximity to be answered quickly, because the desired relationship can be checked before the text is retrieved
adding precise locational information expands the index: there are more pointers in the index, and each pointer requires more bits of storage
Inverted file indexing: granularity
Unless a significant fraction of the queries are expected to be proximity-based, the usual granularity is individual documents
phrase-based queries can then be handled by the slightly slower method of a postretrieval scan
Inverted file compression
Uncompressed inverted files can consume considerable space: 50-100% of the space of the text itself
the size of an inverted file can be reduced considerably by compressing it
key for compression: each inverted list can, without any loss of generality, be stored as an ascending sequence of integers
Inverted file compression
Suppose that some term appears in 8 documents of a collection; the term is described in the inverted file by a list: <8; 3, 5, 20, 21, 23, 76, 77, 78>
the address of which is contained in the lexicon
more generally, the list for a term t stores the number of documents f_t in which the term appears, followed by a list of f_t document numbers
Inverted file compression
The list of document numbers within each inverted list is in ascending order, and all processing is sequential from the beginning of the list -> the list can be stored as an initial position followed by a list of d-gaps
the list for the term above: <8; 3, 2, 15, 1, 2, 53, 1, 1>
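The d-gap transformation and its inverse are simple prefix-sum operations; a sketch using the list from the slide:

```python
def to_dgaps(doc_numbers):
    """Convert an ascending list of document numbers to d-gaps."""
    gaps, prev = [], 0
    for d in doc_numbers:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_dgaps(gaps):
    """Recover document numbers from d-gaps by prefix summation."""
    docs, total = [], 0
    for g in gaps:
        total += g
        docs.append(total)
    return docs

docs = [3, 5, 20, 21, 23, 76, 77, 78]      # the list from the slide
print(to_dgaps(docs))                       # -> [3, 2, 15, 1, 2, 53, 1, 1]
print(from_dgaps(to_dgaps(docs)) == docs)   # -> True
```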
Inverted file compression
The two forms are equivalent, but it is not obvious that any saving has been achieved: the largest d-gap in the second representation is still potentially the same as the largest document number in the first
if there are N documents in the collection and a flat binary encoding is used to represent the gap sizes, both methods require log N bits per stored pointer
Inverted file compression
Considering each inverted list as a list of d-gaps, the sum of which is bounded by N, allows an improved representation -> it is possible to code inverted lists using on average substantially fewer than log N bits per pointer
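One well-known code of this kind is the Elias gamma code, a global method that spends few bits on small gaps; a sketch (bit strings are represented as Python strings for clarity, not as packed bits):

```python
def gamma_encode(n):
    """Elias gamma code: unary length prefix followed by the binary value.
    Small d-gaps get short codes, so skewed gap distributions compress well."""
    assert n >= 1
    binary = bin(n)[2:]                 # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Decode a single gamma-coded integer from a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

print(gamma_encode(1))        # -> '1'
print(gamma_encode(5))        # -> '00101'
print(gamma_decode("00101"))  # -> 5
```

A gap of size g costs about 2·log g + 1 bits instead of a flat log N bits.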
Inverted file compression
Many specific models have been proposed
global methods: every inverted list is compressed using the same common model
local methods: adjusted according to some parameter, usually frequency; they tend to outperform global ones, but are more complex to implement
Querying
How to use an index to locate information in the text it describes?
Boolean queries
A Boolean query comprises a list of terms that are combined using the connectives AND, OR, and NOT
the answers to the query are those documents that satisfy the condition
Boolean queries
e.g. 'text' AND 'compression' AND 'retrieval'
all three words must occur somewhere in every answer (in no particular order)
"the compression and retrieval of large amounts of text is an interesting problem"
"this text describes the fractional distillation scavenging technique for retrieving argon from compressed air"
Boolean queries
A problem with all retrieval systems: non-relevant answers are returned and must be filtered out manually
broad query -> high recall; narrow query -> high precision
Boolean queries
Small variations in a query can generate very different results
data AND compression AND retrieval
text AND compression AND retrieval
the user should be able to pose complex queries like (text OR data OR image) AND (compression OR compaction OR decompression) AND (archiving OR retrieval OR storage)
Ranked queries
Non-professional users might prefer simply giving a list of words that are of interest and letting the retrieval system supply the documents that seem most relevant, rather than seeking exact Boolean answers
text, data, image, compression, compaction, archiving, storage, retrieval...
Ranked queries
It would be useless to convert a list of words to a Boolean query
connect with AND -> too few documents
connect with OR -> too many documents
solution: a ranked query
a heuristic is applied to measure the similarity of each document to the query
the r most closely matching documents are returned
Ranking strategies
Simple techniques count the number of query terms that appear somewhere in the document
a document that contains 5 query terms is ranked higher than a document that contains 3 query terms
more advanced techniques: e.g. the cosine measure, which takes into account the lengths of the documents etc.
Accessing the lexicon
The lexicon for an inverted file index stores
the terms that can be used to search the collection
the information needed to allow queries to be processed: the address in the inverted file (of the corresponding list of document numbers) and the number of documents containing the term
Access structures
A simple structure: an array of records, each comprising a string along with two integer fields
if the lexicon is sorted, a word can be located by a binary search of the strings
consumes a lot of space
e.g. a collection of a million words (~5 GB), stored as 20-byte strings, with a 4-byte inverted file address and a 4-byte frequency value -> 28 MB
Access structures
The space for the strings is reduced if they are all concatenated into one long contiguous string
an array of 4-byte character pointers is used for access
each term costs its exact number of characters + 4 for the pointer
it is not necessary to store string lengths: the next pointer indicates the end of the string
in the collection of a million terms, the memory requirement is reduced by 8 MB -> 20 MB
Access structures
The memory required can be further reduced by eliminating many of the string pointers
e.g. only 1 word in 4 is indexed, and each stored word is prefixed by a 1-byte length field
the length field allows the start of the next string to be identified and the block of strings traversed
Access structures
In each group, 12 bytes of pointers are saved at the cost of including 4 bytes of length information
for a million-word lexicon: a saving of 2 MB -> 18 MB
Access structures
Blocking makes the search process more complex: to look up a term
the array of string pointers is binary-searched to locate the correct block of words
the block is scanned in a linear fashion to find the term
the term's ordinal term number is inferred from the combination of the block number and the position within the block
the frequency value and inverted file address are accessed using the ordinal term number
Access structures
Consecutive words in a sorted list are likely to share a common prefix
front coding: 2 integers are stored with each word
one to indicate how many prefix characters are the same as the previous word
the other to record how many suffix characters remain when the prefix is removed
the integers are followed by the suffix characters
Access structures
Front coding yields a net saving of about 40 percent of the space required for string storage in a typical lexicon for the English language
problem with complete front coding: binary search is no longer possible
solution: partial 3-in-4 front coding
Access structures
Partial 3-in-4 front coding: every 4th word (the one indexed by the block pointer) is stored without front coding, so that binary search can proceed
on a large lexicon, this is expected to save about 4 bytes on each of three words, at the cost of 2 extra bytes of prefix-length information
a net gain of 10 bytes per 4-word block
for a million-word lexicon: -> 15.5 MB
Disk-based lexicon storage
The amount of primary memory required by the lexicon can be reduced by putting the lexicon on disk
just enough information is retained in primary memory to identify the disk block corresponding to each term
Disk-based lexicon storage
To locate the information corresponding to a given term
the in-memory index is searched to determine a block number
the block is read into a buffer
the search is continued within the block
a B-tree etc. can be used
Disk-based lexicon storage
This approach is simple and requires a minimal amount of primary memory
a disk-based lexicon is many times slower to access than a memory-based one: one disk access per lookup is required
the extra time is tolerable when just a few terms are being looked up (as in normal query processing, less than 50 terms)
not suitable for the index construction process
Boolean query processing
Processing a query:
the lexicon is searched for each term in the query
each inverted list is retrieved and decoded
the lists are merged, taking the intersection, union, or complement, as appropriate
finally, the documents are retrieved and displayed
Conjunctive queries
text AND compression AND retrieval
a conjunctive query of r terms is processed as follows:
each term is stemmed and located in the lexicon
if the lexicon is on disk, one disk access per term is required
the terms are sorted by increasing frequency
Conjunctive queries
The inverted list for the least frequent term is read into memory
this list forms the initial set of candidates (documents that have not yet been eliminated and might be answers to the query)
all remaining inverted lists are processed against this set of candidates, in increasing order of term frequency
Conjunctive queries
In a conjunctive query, a candidate cannot be an answer unless it appears in all inverted lists -> the size of the set of candidates is non-increasing
to process a term, each document in the set of candidates is checked and removed if it does not appear in the term's inverted list
the remaining candidates are the answers
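The candidate-set algorithm above can be sketched as follows, assuming an in-memory index mapping terms to ascending document lists (stemming and disk access are omitted, and Python set intersection stands in for the per-candidate membership checks):

```python
# A minimal sketch of conjunctive query processing: inverted lists are
# processed in increasing order of term frequency, so the candidate set
# can only shrink.

def conjunctive_query(index, terms):
    # sort terms so the least frequent list initializes the candidates
    terms = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not terms or terms[0] not in index:
        return []
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index.get(t, []))
        if not candidates:      # no further processing needed once empty
            break
    return sorted(candidates)

index = {
    "text": [1, 2, 4, 5],
    "compression": [1, 3, 4],
    "retrieval": [4, 5],
}
print(conjunctive_query(index, ["text", "compression", "retrieval"]))  # -> [4]
```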
Term processing order
Reasons to select the least frequent term to initialize the set of candidates (and to process the rest in increasing frequency order):
to minimize the amount of temporary memory space required during query processing
the number of candidates may be quickly reduced, even to zero, after which no processing is required
Processing ranked queries
How to assign a similarity measure to each document that indicates how closely it matches a query?
Coordinate matching
Count the number of query terms that appear in each document
the more terms that appear, the more likely it is that the document is relevant
a hybrid between a conjunctive AND query and a disjunctive OR query
a document that contains any of the terms is a potential answer, but preference is given to documents that contain all or most of them
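A minimal sketch of coordinate matching over an inverted file index (the index contents are illustrative):

```python
# A minimal sketch of coordinate matching: rank documents by how many
# distinct query terms they contain, returning the r best.

def coordinate_match(index, query_terms, r=3):
    scores = {}
    for t in set(query_terms):
        for doc in index.get(t, []):
            scores[doc] = scores.get(doc, 0) + 1
    # best score first; ties broken by document number
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:r]

index = {
    "text": [1, 2],
    "data": [2, 3],
    "compression": [1, 2, 3],
}
print(coordinate_match(index, ["text", "data", "compression"]))
# -> [(2, 3), (1, 2), (3, 2)]
```

Document 2 contains all three terms, so it ranks first; documents 1 and 3 each contain two.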
Inner product similarity
Coordinate matching can be formalized as an inner product of a query vector with a set of document vectors
the similarity measure of query Q with document D_d is expressed as M(Q, D_d) = Q · D_d
the inner product of two n-vectors X and Y:
X · Y = Σ_{i=1}^{n} x_i y_i
Drawbacks
Takes no account of term frequency: documents with many occurrences of a term should be favored
takes no account of term scarcity: rare terms should have more weight?
long documents with many terms are automatically favored: they are likely to contain more of any given list of query terms
Solutions
Term frequency
the binary "present" - "not present" judgment can be replaced with an integer indicating how many times the term appears in the document
f_{d,t}: within-document frequency
more generally: a term t in document d can be assigned a document-term weight w_{d,t} and a query-term weight w_{q,t}
Solutions
The similarity measure is the inner product of these two weight vectors
it is normal to assign w_{q,t} = 0 if t does not appear in Q, so the measure can be stated as
M(Q, D_d) = Σ_{t∈Q} w_{q,t} · w_{d,t}
Inverse document frequency
If only the term frequency is taken into account, and a query contains common words, a document with enough appearances of a common term is always ranked first, irrespective of other words
-> terms can be weighted according to their inverse document frequency
Weighting
Many possibilities exist to combine term frequency and inverse document frequency
principles:
a term that appears in many documents should not be regarded as being more important than a term that appears in a few
a document with many occurrences of a term should not be regarded as being less important than a document that has just a few
Weighting
For instance, TF×IDF for w_{d,t} and IDF for w_{q,t}
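One common TF×IDF formulation (among many variants) can be sketched as follows; the exact formula here is illustrative, not the only one used in practice:

```python
# A minimal sketch of TFxIDF weighting, under one common formulation:
# w_{d,t} = f_{d,t} * log(N / f_t), where f_{d,t} is the within-document
# frequency, N the number of documents in the collection, and f_t the
# number of documents containing term t.
import math

def tfidf(f_dt, f_t, N):
    if f_dt == 0 or f_t == 0:
        return 0.0
    return f_dt * math.log(N / f_t)

# a term occurring 3 times in a document, in 10 of 1000 documents
print(round(tfidf(3, 10, 1000), 3))
```

The log(N / f_t) factor shrinks toward zero as a term becomes common, satisfying the first principle above; the f_{d,t} factor satisfies the second.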
Similarity of vectors
Long documents should not be favored over short documents
the similarity of the direction indicated by the two vectors is measured
similarity is defined as the cosine of the angle θ between the document and query vectors
cos θ = 1 when θ = 0; cos θ = 0 when the vectors are orthogonal
Similarity of vectors
The cosine of the angle between two vectors can be calculated as
cos θ = (X · Y) / (|X| |Y|) = Σ_{i=1}^{n} x_i y_i / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))
|X| is the length of vector X: the normalization factor
Similarity of vectors
Cosine rule for ranking:
cosine(Q, D_d) = (1 / (W_q W_d)) Σ_{t=1}^{n} w_{q,t} · w_{d,t}
where
W_d = √(Σ_{t=1}^{n} w_{d,t}²) and W_q = √(Σ_{t=1}^{n} w_{q,t}²)
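The cosine rule can be sketched directly from the formula, with the query and each document represented as sparse weight vectors:

```python
# A minimal sketch of cosine similarity: the inner product of two sparse
# weight vectors, divided by the normalization factors W_q and W_d.
import math

def cosine(query_weights, doc_weights):
    """query_weights, doc_weights: {term: weight} dictionaries."""
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    wq = math.sqrt(sum(w * w for w in query_weights.values()))
    wd = math.sqrt(sum(w * w for w in doc_weights.values()))
    if wq == 0 or wd == 0:
        return 0.0
    return dot / (wq * wd)

q = {"text": 1.0, "compression": 1.0}
d_long = {"text": 3.0, "compression": 3.0, "data": 3.0}
d_short = {"text": 1.0, "compression": 1.0}
print(cosine(q, d_short))                       # close to 1.0
print(cosine(q, d_long) < cosine(q, d_short))   # True: length is normalized away
```

The longer document scores lower despite containing more term occurrences, because the normalization measures direction rather than magnitude.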
Index construction
Each document of the collection contains some index terms, and each index term appears in some of the documents
this relationship can be expressed with a frequency matrix
each column corresponds to one word
each row corresponds to one document
the number stored at any row and column is the frequency, in that document, of the word indicated by that column
Index construction
Each document of the collection is summarized in one row of the frequency matrix
to create an index, the matrix must be transposed, forming a new version in which the rows are the term numbers
from this form, an inverted file index is easy to construct
Index construction
Trivial algorithm:
build in memory a transposed frequency matrix, reading the text in document order, one column of the matrix at a time
write the matrix to disk row by row, in term order
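A toy-scale sketch of the trivial algorithm, feasible only because the collection here is tiny:

```python
# A minimal sketch of the trivial inversion algorithm: build the full
# transposed frequency matrix in memory, then emit it row by row in term
# order. The matrix for a large collection would be far too big for this.

docs = ["text compression", "text retrieval", "compression"]
terms = sorted({w for d in docs for w in d.split()})

# transposed frequency matrix: one row per term, one column per document
matrix = [[d.split().count(t) for d in docs] for t in terms]

# emit row by row: each row yields a term's inverted list with frequencies
for t, row in zip(terms, matrix):
    postings = [(doc_num + 1, f) for doc_num, f in enumerate(row) if f > 0]
    print(t, postings)
```

Each row of the transposed matrix is exactly one inverted list, which is why the inverted file is easy to construct from this form.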
Index construction
In reality, inversion is much more difficult
the problem is the size of the frequency matrix
for instance, for a collection that has 535346 distinct terms and 741856 documents, the size of the matrix can be 1.4 TB
we could use a machine with a large virtual memory -> it would take 2 months
Index construction
More economical methods for constructing and inverting a frequency matrix exist
an index for the large collection mentioned above could be created in less than 2 hours (1998) on a personal computer, consuming just 30 MB of main memory and less than 20 MB of temporary disk space over the space required by the final inverted file
Final words
We have discussed:
character sets
preprocessing of text
feature selection
text categorization
text summarization
text compression
indexing, querying
Final words
What else there is...
structured documents (XML, ...)
metadata (semantic Web, ontologies, ...)
linguistic resources (WordNet, thesauri, ...)
document management systems (archiving, ...)
document analysis (scanning of documents)
digital libraries
text mining, question answering, ...
Administrative...
Exam on Tuesday 4.12.
"large" essays (~2-3 pages each)
"data comprehension" (e.g. recall/precision)
use full sentences!
Exercise points: 28 or more original points -> 30 pts; otherwise original points + 2
Remember the course feedback survey (Kurssikysely)!