File organization

Page 1: File organization

File organization

Page 2: File organization

File Organizations (Indexes)
• Choices for accessing data during query evaluation
• Scan the entire collection
  – Typical in early (batch) retrieval systems
  – Computational and I/O costs are O(characters in collection)
  – Practical for only "small" text collections
  – Large memory systems make scanning feasible
• Use indexes for direct access
  – Evaluation time O(query term occurrences in collection)
  – Practical for "large" collections
  – Many opportunities for optimization
• Hybrids: use a small index, then scan a subset of the collection

Page 3: File organization

Indexes
• What should the index contain?
• Database systems index primary and secondary keys
  – This is the hybrid approach
  – Index provides fast access to a subset of database records
  – Scan the subset to find the solution set
• IR problem: cannot predict the keys that people will use in queries
  – Every word in a document is a potential search term
• IR solution: index by all keys (words) — full-text indexes

Page 4: File organization

Indexes
• The index is accessed by the atoms of a query language; the atoms are called "features," "keys," or "terms"
• Most common feature types:
  – Words in text, punctuation
  – Manually assigned terms (controlled and uncontrolled vocabulary)
  – Document structure (sentence and paragraph boundaries)
  – Inter- or intra-document links (e.g., citations)
• Composed features
  – Feature sequences (phrases, names, dates, monetary amounts)
  – Feature sets (e.g., synonym classes)
• Indexing and retrieval models drive the choices
  – Must be able to construct all components of those models
• Indexing choices (there is no "right" answer)
  – Tokenization, case folding (e.g., New vs. new, Apple vs. apple), stopwords (e.g., the, a, its), morphology (e.g., computer, computers, computing, computed)
• Index granularity has a large impact on speed and effectiveness

Page 5: File organization

Index Contents
• The contents depend upon the retrieval model
• Feature presence/absence
  – Boolean
  – Statistical (tf, df, ctf, doclen, maxtf)
  – Often about 10% the size of the raw data, compressed
• Positional
  – Feature location within the document
  – Granularities include word, sentence, paragraph, etc.
  – Coarse granularities are less precise, but take less space
  – Word-level granularity is about 20-30% the size of the raw data, compressed

Page 6: File organization

Indexes: Implementation
• Common implementations of indexes
  – Bitmaps
  – Signature files
  – Inverted files
• Common index components
  – Dictionary (lexicon)
  – Postings
    • document ids
    • word positions (with document ids alone, no positional data is indexed)

Page 7: File organization

Indexes: Bitmaps
• Bag-of-words index only
• For each term, allocate a vector with one bit per document
• If the feature is present in document n, set the nth bit to 1, otherwise 0
• Boolean operations are very fast
• Space efficient for common terms (why?)
• Space inefficient for rare terms (why?)
• Good compression with run-length encoding (why?)
• Not widely used
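As a concrete illustration, here is a minimal bitmap-index sketch in Python (the toy documents and names are invented for this example); one integer per term serves as a bit vector, so Boolean AND/OR become single bitwise operations:

```python
# A minimal bitmap-index sketch: one Python int per term is a bit vector
# with one bit per document.

docs = {
    0: "the cat sat",
    1: "the dog sat",
    2: "the cat ran",
}

bitmaps = {}  # term -> int used as a bit vector
for doc_id, text in docs.items():
    for term in text.split():
        bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)

def matches(bitmap, n_docs):
    """Decode a bit vector back into document ids."""
    return [d for d in range(n_docs) if bitmap & (1 << d)]

# Boolean AND/OR are single bitwise operations over whole vectors.
cat_and_sat = bitmaps["cat"] & bitmaps["sat"]  # docs containing both terms
cat_or_ran = bitmaps["cat"] | bitmaps["ran"]   # docs containing either term
print(matches(cat_and_sat, len(docs)))  # [0]
print(matches(cat_or_ran, len(docs)))   # [0, 2]
```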

Page 8: File organization

Indexes: Signature Files
• Bag-of-words only
• For each term, allocate a fixed-size s-bit vector (signature)
• Define hash functions:
  – Multiple functions: word → 1..s [selects which bits to set]
• Each term has an s-bit signature
  – may not be unique!
• OR the term signatures to form the document signature
• Long documents are a problem (why?)
  – Usually segment them into smaller pieces/blocks

Page 9: File organization

Signature File Example

Page 10: File organization

Signature File Example

Page 11: File organization

Indexes: Signature Files
• At query time:
  – Look up the signature for the query (how?)
  – If all corresponding 1-bits are "on" in the document signature, the document probably contains that term
• Vary s to control P(false positive)
  – Note the space tradeoff
• Optimal s changes as the collection grows (why?)
• Widely studied but not widely used
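A minimal sketch of how such signatures could be built and probed, assuming the hash functions are derived from a standard hash (MD5 here) rather than any particular scheme from the slides; note the test can return false positives but never false negatives:

```python
import hashlib

S = 64  # signature width in bits (assumed for this example)
K = 3   # hash functions per term (assumed for this example)

def term_bits(term):
    """Map a term to K bit positions in 0..S-1, packed into an int."""
    bits = 0
    for i in range(K):
        h = hashlib.md5(f"{i}:{term}".encode()).digest()
        bits |= 1 << (int.from_bytes(h[:4], "big") % S)
    return bits

def doc_signature(text):
    """OR together the signatures of all terms in the document."""
    sig = 0
    for term in text.split():
        sig |= term_bits(term)
    return sig

sig = doc_signature("jack ate the porridge in the pot")
query = term_bits("porridge")
# All query bits on => the document *probably* contains the term.
print(query & sig == query)                                # True
print(term_bits("unicorn") & sig == term_bits("unicorn"))  # likely False
```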

Page 12: File organization

Indexes: Inverted Lists
• Inverted lists are currently the most common indexing technique
• Source file: the collection, organized by document
• Inverted file: the collection, organized by term
  – one record per term, listing the locations where the term occurs
• During evaluation, traverse the lists for each query term (see the sketch below)
  – OR: the union of component lists
  – AND: an intersection of component lists
  – Proximity: an intersection of component lists
  – SUM: the union of component lists; each entry has a score
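A minimal sketch of AND and OR over sorted postings lists (toy document ids, not taken from any real index):

```python
# AND/OR evaluation over sorted postings lists.

porridge = [1, 2, 4]  # document ids containing "porridge"
pot = [2, 5]          # document ids containing "pot"

def intersect(a, b):
    """AND: linear merge of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: merge keeping every document id once."""
    return sorted(set(a) | set(b))

print(intersect(porridge, pot))  # [2]
print(union(porridge, pot))      # [1, 2, 4, 5]
```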

Page 13: File organization

Inverted Files

Page 14: File organization

Inverted Files

Page 15: File organization

Word-Level Inverted File

Page 16: File organization

How big is the index?

• For an n-word collection:
• Lexicon
  – Heaps' Law: V = O(n^β), 0.4 < β < 0.6
  – TREC-2: 1 GB text, 5 MB lexicon
• Postings
  – at most one per occurrence of a word in the text: O(n)

Page 17: File organization

Inverted Search Algorithm

1. Find query elements (terms) in the lexicon
2. Retrieve the postings for each lexicon entry
3. Manipulate the postings according to the retrieval model

Page 18: File organization

Word-Level Inverted File

Queries:
1. porridge & pot (BOOL)
2. "porridge pot" (BOOL)
3. porridge pot (VSM)

[Figure: lexicon and postings entries, with the answer for each query]
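A toy evaluation of the first two queries, using invented word positions but document ids and frequencies consistent with the answers on Pages 34-35:

```python
# Toy positional index consistent with the answer slides:
# d2 is the only document containing both terms, and they are never adjacent.
index = {
    "porridge": {"d1": [2, 5], "d2": [1]},  # term -> doc -> word positions
    "pot":      {"d2": [4], "d5": [3]},
}

def boolean_and(t1, t2):
    return sorted(index[t1].keys() & index[t2].keys())

def phrase(t1, t2):
    """t1 immediately followed by t2 within the same document."""
    hits = []
    for d in boolean_and(t1, t2):
        pos = set(index[t1][d])
        if any(p - 1 in pos for p in index[t2][d]):
            hits.append(d)
    return hits

print(boolean_and("porridge", "pot"))  # ['d2']
print(phrase("porridge", "pot"))       # [] (the phrase never occurs)
```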

Page 19: File organization

Lexicon Data Structures
According to Heaps' Law, the size of the lexicon may be very large
• Hash table
  – O(1) lookup, with constant h() and collision handling
  – Supports exact-match lookup
  – May be complex to expand
• B-Tree
  – On-disk storage with fast retrieval and good caching behavior
  – Supports exact-match and range-based lookup
  – O(log n) lookups to find a list
  – Usually easy to expand
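A minimal sketch of the two lookup styles, using a Python dict for the hash table and a sorted array with binary search as an in-memory stand-in for a B-tree's range capability (all terms and offsets invented):

```python
import bisect

# Hash table: exact-match lookup only.
lexicon_hash = {"pease": 103, "porridge": 101, "pot": 102}  # term -> postings offset
print(lexicon_hash["pot"])  # 102

# Sorted array + binary search: exact match *and* range lookup
# (e.g., prefix queries), like a B-tree.
terms = sorted(lexicon_hash)

def prefix_range(prefix):
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

print(prefix_range("p"))   # ['pease', 'porridge', 'pot']
print(prefix_range("po"))  # ['porridge', 'pot']
```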

Page 20: File organization

In-memory Inversion Algorithm
1. Create an empty lexicon
2. For each document d in the collection,
   2.1. Read the document, parse it into terms
   2.2. For each indexing term t,
        2.2.1. f(d,t) = frequency of t in d
        2.2.2. If t is not in the lexicon, insert it
        2.2.3. Append <d, f(d,t)> to the postings list for t
3. Output each postings list into the inverted file (see the sketch below)
   3.1. For each term, start a new file entry
   3.2. Append each <d, f(d,t)> to the entry
   3.3. Compress the entry
   3.4. Write the entry out to file
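A minimal in-memory sketch of the algorithm above (toy documents; step 3's compression is omitted):

```python
from collections import Counter

collection = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
}

lexicon = {}  # term -> postings list of <d, f(d,t)> pairs

for d, text in collection.items():                   # step 2
    freqs = Counter(text.split())                    # 2.1-2.2.1: parse, count f(d,t)
    for t, f_dt in freqs.items():
        lexicon.setdefault(t, []).append((d, f_dt))  # 2.2.2-2.2.3

for t in sorted(lexicon):                            # step 3: write entries out
    print(t, lexicon[t])
# pease [(1, 2), (2, 1)]
# porridge [(1, 2), (2, 1)]
# ...
```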

Page 21: File organization

Complexity of In-memory Inversion
• Time: O(n) for n-byte text
• Space
  – Lexicon: space for unique words + offsets
  – Postings: 10 bytes per entry
    • document number: 4 bytes
    • frequency count: 2 bytes (allows 65,536 max occurrences)
    • "next" pointer: 4 bytes
• Is this affordable?
  – For a 5 GB collection, at 10 bytes/entry, with 400M entries, we need 4 GB of main memory

Page 22: File organization

Idea 1: Partition the text

• Invert a chunk of the text at a time
• Then merge the sub-indexes into one complete index (a multiway merge; see the sketch below)

[Figure: each chunk is inverted separately, then merged into the main inverted file]
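A minimal sketch of the merge step, assuming each sub-index has already been written as a sorted run of (term, document, frequency) records:

```python
import heapq

# Each sub-index is a sorted list of (term, doc_id, tf) records (toy data).
chunk1 = [("porridge", 1, 2), ("pot", 2, 1)]
chunk2 = [("porridge", 5, 1), ("pot", 5, 1)]

merged = {}  # term -> postings list for the main inverted file
# heapq.merge performs the multiway merge over the already-sorted runs.
for term, doc, tf in heapq.merge(chunk1, chunk2):
    merged.setdefault(term, []).append((doc, tf))

print(merged)
# {'porridge': [(1, 2), (5, 1)], 'pot': [(2, 1), (5, 1)]}
```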

Page 23: File organization

Idea 2: Sort-based Inversion

Invert in two passes:
1. Output records <t, d, f(d,t)> to a temporary file
2. Sort the records using external merge sort:
   2.1. read a chunk of the temp file
   2.2. sort it using Quicksort
   2.3. write it back into the same place
   2.4. then merge-sort the chunks in place
3. Read the sorted file, and write the inverted file
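A minimal sketch of the same idea, sorting in memory for brevity where the slides would use an external merge sort over the temp file:

```python
from itertools import groupby

collection = {1: "pease porridge hot", 2: "pease pot"}

records = []  # pass 1: emit <t, d, f(d,t)> triples
for d, text in collection.items():
    terms = text.split()
    for t in set(terms):
        records.append((t, d, terms.count(t)))

records.sort()  # pass 2: stand-in for the external merge sort

# pass 3: a single scan of the sorted run yields each postings list
for t, group in groupby(records, key=lambda r: r[0]):
    print(t, [(d, f) for _, d, f in group])
# hot [(1, 1)]
# pease [(1, 1), (2, 1)]
# ...
```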

Page 24: File organization

Access Optimizations
• Skip lists (see the sketch below):
  – A table of contents to the inverted list
  – Embedded pointers that jump ahead n documents (why is this useful?)
• Separating presence information from location information
  – Many operators need only presence information
  – Location information takes substantial space (I/O)
  – If split,
    • reduced I/O for presence operators
    • increased I/O for location operators (or a larger index)
  – Common in CD-ROM implementations
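A minimal sketch of intersection with skip pointers on one list (toy data; a skip distance of √(list length) is a common heuristic, not something prescribed by the slides):

```python
import math

def intersect_with_skips(a, b):
    skip = max(1, int(math.sqrt(len(a))))  # skip distance for list a
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            if i + skip < len(a) and a[i + skip] <= b[j]:
                i += skip  # jump ahead without touching the skipped entries
            else:
                i += 1
        else:
            j += 1
    return out

print(intersect_with_skips([2, 4, 8, 16, 32, 64], [8, 64, 128]))  # [8, 64]
```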

Page 25: File organization

Inverted file compression
• Inverted lists are usually compressed
• Inverted files with word locations are about the size of the raw data
• The distribution of the numbers is skewed
  – Most numbers are small (e.g., word locations, term frequency)
• The distribution can easily be made more skewed
  – Delta encoding: 5, 8, 10, 17 → 5, 3, 2, 7
• Simple compression techniques are often the best choice
  – Simple algorithms are nearly as effective as complex algorithms
  – Simple algorithms are much faster than complex algorithms
  – Goal: time saved by reduced I/O > time required to uncompress

[Figure: compression example using Huffman encoding]

Page 26: File organization

Inverted file compression
• The longest lists, which take up the most space, hold the most frequent (probable) words
• Compressing the longest lists would save the most space
• The longest lists should compress easily because they contain the least information (why?)
• Algorithms:
  – Delta encoding
  – Variable-length encoding
  – Unary codes
  – Gamma codes
  – Delta codes
  – Variable-byte codes
  – Golomb codes

Page 27: File organization

Inverted List Indexes: Compression

Delta Encoding ("Storing Gaps")
• Reduces the range of the numbers
  – keep document numbers d in ascending order
  – 3, 5, 20, 21, 23, 76, 77, 78
  – becomes: 3, 2, 15, 1, 2, 53, 1, 1
• Produces a more skewed distribution
• Increases the probability of smaller numbers
• Stemming also increases the probability of smaller numbers (why?)

Page 28: File organization

Variable-Byte Code
• Binary, but uses the minimum number of bytes
• 7 bits store the value, 1 bit flags whether another byte follows
  – 0 ≤ x < 128: 1 byte
  – 128 ≤ x < 16384: 2 bytes
  – 16384 ≤ x < 2097152: 3 bytes
• Integral byte sizes for easy coding
• Very effective for medium-sized numbers
• A little wasteful for very small numbers
• Often the best trade-off between smallest index size and fastest query time (see the sketch below)
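A minimal sketch combining delta encoding with one common variable-byte layout (the continuation-bit convention varies between systems; this is just one choice), using the gap sequence from the earlier slide:

```python
def vbyte_encode(n):
    """Low 7 bits per byte hold value; high bit marks the final byte."""
    out = bytearray()
    while True:
        out.insert(0, n & 0x7F)
        if n < 128:
            break
        n >>= 7
    out[-1] |= 0x80  # flag the last byte of this number
    return bytes(out)

def vbyte_decode(data):
    nums, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:  # final byte of this number
            nums.append(n)
            n = 0
    return nums

doc_ids = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = b"".join(vbyte_encode(g) for g in gaps)  # 8 small gaps -> 8 bytes
decoded_gaps = vbyte_decode(encoded)

# Undo the delta encoding with a running sum.
ids, total = [], 0
for g in decoded_gaps:
    total += g
    ids.append(total)
print(ids)  # [3, 5, 20, 21, 23, 76, 77, 78]
```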

Page 29: File organization

Index/IR toolkits
• The Lemur Toolkit for Language Modeling and Information Retrieval
  http://www.lemurproject.org/
• Lucene/Nutch: Java-based indexing and search technology; Nutch provides web search application software built on top of it
  http://lucene.apache.org/nutch/

Page 30: File organization

Summary

• Common implementations of indexes
  – Bitmaps, signature files, inverted files
• Common index components
  – Lexicon & postings
• Inverted search algorithm
• Inversion algorithms
• Inverted file access optimizations
  – compression

Page 31: File organization

The End

Page 32: File organization

Vector Space Model
• Document d and query q are represented as two m-dimensional vectors in the vector space, with each dimension weighted by tf·idf; their similarity is measured by the cosine of the angle between the two vectors (using the raw tf and idf formulas):

\[
\mathrm{Cos}(Q,D)=\frac{\sum_{t\in Q} W_{q,t}\,W_{d,t}}
{\sqrt{\sum_{t\in Q} W_{q,t}^{2}}\;\sqrt{\sum_{t\in D} W_{d,t}^{2}}}
\]

where

\[
W_{d,t}=\bigl(1+\log f_{d,t}\bigr)\log\frac{N}{df_t},\qquad
W_{q,t}=\bigl(1+\log f_{q,t}\bigr)\log\frac{N}{df_t}
\]
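A minimal sketch of this cosine computation (N and df values are taken from the worked example on Page 35; everything else is illustrative):

```python
import math

N = 6
df = {"porridge": 2, "pot": 2}

def weight(f_t, term):
    """W = (1 + log f) * log(N/df), zero when the term is absent."""
    return (1 + math.log(f_t)) * math.log(N / df[term]) if f_t > 0 else 0.0

def cosine(q_tf, d_tf):
    wq = {t: weight(f, t) for t, f in q_tf.items()}
    wd = {t: weight(f, t) for t, f in d_tf.items()}
    dot = sum(wq[t] * wd.get(t, 0.0) for t in wq)
    norm = math.sqrt(sum(w * w for w in wq.values())) * \
           math.sqrt(sum(w * w for w in wd.values()))
    return dot / norm if norm else 0.0

# Identical query and document vectors give similarity ≈ 1.0.
print(cosine({"porridge": 1, "pot": 1}, {"porridge": 1, "pot": 1}))
```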

Page 33: File organization

Vocabulary Growth (Heaps' Law)

• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
  – Vocabulary has no upper bound due to proper names, typos, etc.
  – New words occur less frequently as the vocabulary grows
• If V is the size of the vocabulary and n is the length of the corpus in words:
  – V = K n^β (0 < β < 1)
• Typical constants:
  – K ≈ 10-100
  – β ≈ 0.4-0.6 (approx. the square root of n)
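A quick worked computation with mid-range constants (K = 50 and β = 0.5 are assumed values within the typical ranges above):

```python
# Heaps' Law: V = K * n**beta, with assumed mid-range constants.
K, beta = 50, 0.5

for n in (10**6, 10**8, 10**9):  # corpus length in words
    V = K * n**beta
    print(f"n = {n:>13,}  ->  V ≈ {V:,.0f} unique words")
# n =     1,000,000  ->  V ≈ 50,000 unique words
# n =   100,000,000  ->  V ≈ 500,000 unique words
# n = 1,000,000,000  ->  V ≈ 1,581,139 unique words
```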

Page 34: File organization

Query Answers

• 1. porridge & pot (BOOL)
  – d2
• 2. "porridge pot" (BOOL)
  – null
• 3. porridge pot (VSM)
  – d2 > d1 > d5
  – next page

Page 35: File organization

3. porridge pot (VSM)

N = 6; f(d1) denotes the frequency of a term within document d1.

Term       df   f(d1)   f(d2)   f(d5)
porridge    2     2       1       0
pot         2     0       1       1

Since df is the same for both query terms, the idf factor is identical across the document vectors and is omitted below; the weights reduce to the raw term frequencies.

Document lengths: |W(d1)| = √10, |W(d2)| = √5, |W(d5)| = √6

Sim(q, d1) = 2/√10 ≈ 0.63
Sim(q, d2) = 2/√5 ≈ 0.89
Sim(q, d5) = 1/√6 ≈ 0.41

Ranking: d2 > d1 > d5

back