design a data structure

Design a Data Structure

Suppose you wanted to build a web search engine, a la Alta Vista (so you can search for “banana slugs” or “zyzzyvas”).
Index, say, 100,000,000 documents of 1000 words each: 100 billion word occurrences.
Average word length: 8.
Speed is determined largely by disk accesses.
May want Boolean searches (e.g. “banana” and “slug”).
Order results by relevance (title, keywords, repetitions…).
What data structure and algorithms would you use?
What will the space requirements of your data structure be?
What will the time requirements be?

Search Engine Ideas

Binary search tree
    With a node for each word occurrence, memory needed: 100 billion nodes, 20-30 bytes each?
    Insert, delete, find O(log n) – would that be OK?
Or one node for all occurrences of a word, with a linked list of pointers to documents?
    Perhaps 10 million nodes, each with a 10,000-element list?
    Keep nodes (but not lists) in RAM.
    Each element of list has URL, title, excerpt – 8K bytes?
How about a list of documents with excerpts?
    1. Banana Slugs, http://…, “Banana slugs are yellow, 8" long…”
    8K per document would be 800 GB for the whole list.
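The estimates on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope aid: the constant and function names are made up for illustration, "8K" is taken as 8,000 bytes so that the total matches the 800 GB on the slide, and the 10 million distinct words is the slide's own "perhaps" figure.

```cpp
#include <cstdint>

// Figures from the slides; names are illustrative only.
constexpr std::uint64_t kDocuments     = 100000000ULL;  // 100 million documents
constexpr std::uint64_t kWordsPerDoc   = 1000ULL;       // words per document
constexpr std::uint64_t kDistinctWords = 10000000ULL;   // "perhaps 10 million" BST nodes
constexpr std::uint64_t kExcerptBytes  = 8000ULL;       // "8K" per document (decimal, assumed)

// Total word occurrences to index.
constexpr std::uint64_t total_occurrences() { return kDocuments * kWordsPerDoc; }

// Average length of the per-word list if occurrences were spread evenly.
constexpr std::uint64_t average_hits_per_word() { return total_occurrences() / kDistinctWords; }

// Size of a flat list holding one excerpt record per document.
constexpr std::uint64_t excerpt_list_bytes() { return kDocuments * kExcerptBytes; }
```

With these inputs the occurrence count is 10^11, the average list has 10,000 elements, and the excerpt list needs 8 x 10^11 bytes.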

Getting results

What should we store at the nodes of the BST?
    A “hit list” for a word? 10000 entries?
    Store a pointer to a hit list instead, to minimize BST size.
    For each hit, store document number and byte offset.
    Order hit list by relevance criteria.
    Size of hit list: 8GB?
How many disk accesses to find the hits in a BST?
    At 100 million * 20-30 bytes per node, the BST is large. Can we store it all in RAM?
How to perform a Boolean search?
    or: union two lists (merge)
    and: intersect two lists (merge-like algorithm)
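A minimal sketch of the two merge-style operations, assuming each hit list has been reduced to a vector of document numbers sorted in ascending order with no duplicates. That ordering is an assumption of this sketch: a list ordered by relevance, as suggested above, would need to be re-sorted or indexed differently before merging. `DocId`, `intersect_sorted` and `union_sorted` are illustrative names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using DocId = std::uint32_t;  // hypothetical document-number type

// "and": documents that appear in both sorted lists.
std::vector<DocId> intersect_sorted(const std::vector<DocId>& a,
                                    const std::vector<DocId>& b) {
    std::vector<DocId> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;                     // a[i] cannot be in b
        else if (b[j] < a[i]) ++j;                // b[j] cannot be in a
        else { out.push_back(a[i]); ++i; ++j; }   // present in both
    }
    return out;
}

// "or": documents that appear in either sorted list, each reported once.
std::vector<DocId> union_sorted(const std::vector<DocId>& a,
                                const std::vector<DocId>& b) {
    std::vector<DocId> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) out.push_back(a[i++]);
        else if (b[j] < a[i]) out.push_back(b[j++]);
        else { out.push_back(a[i]); ++i; ++j; }
    }
    while (i < a.size()) out.push_back(a[i++]);   // leftovers from a
    while (j < b.size()) out.push_back(b[j++]);   // leftovers from b
    return out;
}
```

Both run in time proportional to the combined length of the two lists, which is why they suit lists read sequentially from disk. The standard library offers `std::set_intersection` and `std::set_union` for the same job.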

Total disk accesses needed?
    search BST + access hit list + access each document’s info

A Better Data Structure

BSTs waste space: much duplication in the keys.
BSTs waste comparison time, for the same reason.
Can we use the ideas of Radix Sort? Search by bit? Or by letter?
Build a search tree, but…
    Go left if first bit is 0, right for 1.
    Or, nodes have 26 children, for a..z.
    Words at the leaves. (Different sort of node.)
    Each leaf node is a “hit list”.
    Don’t need to store the words!
How much space is needed?
    Suppose you have all 11.9M 5-letter words.
    Space for tree: about 1 pointer per word, 4 bytes, vs. 20(?) in BST.
Space savings possible, but what about wasted pointer space?

Radix Search (Ch. 15)

Radix-search methods provide reasonable worst-case performance without balanced-tree complexity.
Space savings are also possible.
They work by comparing pieces (“bytes”) of the key, rather than the whole key as a BST does.
Analogous to Radix Sorting methods.
Called “tries”, from “retrieval” (but, ironically, pronounced like the word “tries”).

Symbol Tables (Ch. 12 quickie)

But first, a word about symbol tables and BSTs (review).
Symbol table: store items, retrieve them by key.
    e.g. a compiler’s symbol table
    e.g. a database with primary key
    e.g. Perl’s hash data structure (essentially an array indexed by a word): $phone{"john"} = "x6789"
    Fundamental to much of computation.
Symbol table ADT (with additional desirable ops):
    insert, delete, find
    select (kth largest)
    sort
    union (of two symbol tables)
Extensively studied and still an area of active research (e.g. the web).
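For comparison with the Perl hash example, here is a rough C++ sketch of the same ADT on top of `std::map`, which keeps its keys ordered. The alias `PhoneBook` and the two helper functions are illustrative names, not part of any library.

```cpp
#include <map>
#include <string>
#include <vector>

using PhoneBook = std::map<std::string, std::string>;  // key -> value

// find: returns the stored value, or an empty string if the key is absent.
std::string find_or_empty(const PhoneBook& table, const std::string& key) {
    auto it = table.find(key);
    return it == table.end() ? std::string() : it->second;
}

// sort: std::map iterates in ascending key order, so this is one traversal.
std::vector<std::string> sorted_keys(const PhoneBook& table) {
    std::vector<std::string> keys;
    for (const auto& entry : table) keys.push_back(entry.first);
    return keys;
}
```

Insertion is `table["john"] = "x6789";`, mirroring the Perl line, and deletion is `table.erase("john");`.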

BSTs for Symbol Tables

The Binary Search Tree is a common data structure used to implement symbol tables.
Operations:
    insert, delete, find – recursive algorithms, O(n) worst case; O(log n) worst case in balanced BSTs
    sort – inorder traversal, O(n)
    kth largest? Augment tree with number of descendants stored at each node; O(log n) time in a balanced BST
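The "augment with subtree sizes" idea can be sketched as follows. This is an illustration under simplifying assumptions: integer keys, no duplicates, and a plain unbalanced insert, so the O(log n) bound only holds if the tree happens to be balanced. `SizedNode`, `bst_insert` and `kth_largest` are names chosen for this sketch.

```cpp
#include <cstddef>

struct SizedNode {
    int key;
    std::size_t size = 1;          // nodes in this subtree, including this one
    SizedNode* left = nullptr;
    SizedNode* right = nullptr;
    explicit SizedNode(int k) : key(k) {}
};

inline std::size_t subtree_size(const SizedNode* n) { return n ? n->size : 0; }

// Ordinary BST insert that also refreshes the size field on the way back up.
SizedNode* bst_insert(SizedNode* root, int key) {
    if (!root) return new SizedNode(key);
    if (key < root->key) root->left = bst_insert(root->left, key);
    else                 root->right = bst_insert(root->right, key);
    root->size = 1 + subtree_size(root->left) + subtree_size(root->right);
    return root;
}

// Node holding the k-th largest key (k = 1 is the maximum), or nullptr if
// k is 0 or exceeds the number of keys. One step per tree level.
const SizedNode* kth_largest(const SizedNode* n, std::size_t k) {
    while (n) {
        std::size_t larger = subtree_size(n->right);  // keys greater than n->key
        if (k <= larger) n = n->right;                // answer is among the larger keys
        else if (k == larger + 1) return n;           // this node is the answer
        else { k -= larger + 1; n = n->left; }        // skip this node and its right subtree
    }
    return nullptr;
}
```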

pred, succ? union?

Digital Search Trees (Ch. 15 again)

Like a BST, but go left for 0, right for 1 in the bit in question.
Store key at node.
Root is most significant bit; ith level -> ith bit from left.
Search: like BST search, but compare the appropriate bit.
Insert: ditto.
Note: not inorder! Each key is somewhere along the path specified by its bits…
Can’t support sort, select.
Search time?
    O(b), b = number of bits

Digital Search Tree Insertion

How to insert Z? Z = 11010
Trace down bits until you find an empty spot.
Runtime?
    O(b), b = number of bits
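A possible sketch of digital search tree search and insertion, assuming distinct keys that fit in 5 bits, as in the Z = 11010 example. The names `DstNode`, `bit_at`, `dst_find` and `dst_insert` are invented for illustration.

```cpp
#include <cstdint>

constexpr int kBits = 5;  // key width, matching the 5-bit example Z = 11010

struct DstNode {
    std::uint32_t key;
    DstNode* child[2] = {nullptr, nullptr};  // child[0]: bit is 0, child[1]: bit is 1
    explicit DstNode(std::uint32_t k) : key(k) {}
};

// Bit number `level` of the key, counting from the most significant (level 0).
inline int bit_at(std::uint32_t key, int level) {
    return static_cast<int>((key >> (kBits - 1 - level)) & 1u);
}

// Search: compare the whole key at each node, then branch on one bit.
const DstNode* dst_find(const DstNode* n, std::uint32_t key) {
    for (int level = 0; n != nullptr && level <= kBits; ++level) {
        if (n->key == key) return n;
        if (level == kBits) break;               // no bits left to branch on
        n = n->child[bit_at(key, level)];
    }
    return nullptr;
}

// Insert: trace down the bits until an empty link is found.
// Returns false if the key is already present.
bool dst_insert(DstNode*& root, std::uint32_t key) {
    DstNode** link = &root;
    for (int level = 0; *link != nullptr; ++level) {
        if ((*link)->key == key) return false;
        if (level >= kBits) return false;        // unreachable for distinct kBits-bit keys
        link = &(*link)->child[bit_at(key, level)];
    }
    *link = new DstNode(key);
    return true;
}
```

Both loops take at most one step per bit, which is where the O(b) bound comes from.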

Trie

How can we keep the BST order?
Trie: a binary tree with keys at the leaves:
    for an empty set, it is a null pointer
    for a single key, a leaf containing it
    for many keys, a node with the keys starting with bit 0 in its left subtree and the keys starting with bit 1 in its right subtree

Trie Insertion

Perform search as usual.
If search ends at null link, insert there.
If the search ends on a leaf, we need to add enough nodes on the way down to differentiate the leaf and the inserted node.
Runtime? O(b) -- or maybe better!
    Inserting N random bitstrings requires about lg N bit comparisons on average per insertion.
Note that leaf nodes and internal nodes are different. Wasted space if we use only one sort. (This gets especially significant in a large radix!)
Even with different node types, there may be wasted space.
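The insertion rule can be sketched like this, again assuming distinct 5-bit keys. For brevity the sketch uses a single node type with an `is_leaf` flag, which is exactly the "only one sort" layout that wastes space; a tighter version would use separate leaf and internal types. All names are illustrative.

```cpp
#include <cstdint>

constexpr int kBits = 5;  // key width assumed for this sketch

struct BitTrieNode {
    bool is_leaf = false;
    std::uint32_t key = 0;                       // meaningful only in leaves
    BitTrieNode* child[2] = {nullptr, nullptr};  // used only in internal nodes
};

inline int bit_at(std::uint32_t key, int level) {
    return static_cast<int>((key >> (kBits - 1 - level)) & 1u);
}

inline BitTrieNode* make_leaf(std::uint32_t key) {
    BitTrieNode* n = new BitTrieNode;
    n->is_leaf = true;
    n->key = key;
    return n;
}

// Returns the (possibly new) root of the subtree after inserting key.
BitTrieNode* trie_insert(BitTrieNode* n, std::uint32_t key, int level = 0) {
    if (n == nullptr) return make_leaf(key);     // search ended at a null link
    if (n->is_leaf) {
        if (n->key == key) return n;             // already present
        // Search ended on a leaf: add internal nodes until the bits differ.
        BitTrieNode* internal = new BitTrieNode;
        int old_bit = bit_at(n->key, level);
        int new_bit = bit_at(key, level);
        if (old_bit != new_bit) {
            internal->child[old_bit] = n;
            internal->child[new_bit] = make_leaf(key);
        } else {
            internal->child[old_bit] = trie_insert(n, key, level + 1);
        }
        return internal;
    }
    int b = bit_at(key, level);
    n->child[b] = trie_insert(n->child[b], key, level + 1);
    return n;
}

// Follow the key's bits to a leaf, then do one full-key comparison.
bool trie_contains(const BitTrieNode* n, std::uint32_t key) {
    int level = 0;
    while (n != nullptr && !n->is_leaf) n = n->child[bit_at(key, level++)];
    return n != nullptr && n->key == key;
}
```

Because two distinct keys must differ in some bit, the splitting recursion always stops before the bits run out.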

R-way Tries

You can save search time by using a larger radix (at the expense of wasted space…).
For example, have 26 children of each node, one for each letter of the alphabet.

Tries for strings

26 pointers per internal node, one for each letter of the alphabet.
What if one word is the prefix of another?
    Example: aardvark and aardvarkish.
    How do you represent that “aardvark” is a word if that node’s ‘i’ pointer points to another internal node?
    Add a bit per letter which means “this is a word”.
Keys are stored implicitly – by the sequence of links taken to find them.

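A Trie node for strings

struct node
{
    char isword[26];    // isword[i] != 0: the path to this node plus letter i spells a word
    node *links[26];    // links[i]: child reached by letter i, or null

    node()
    {
        for (int i = 0; i < 26; i++)
        { isword[i] = 0; links[i] = 0; }
    }
};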

But where is the word stored?

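Insertion

How do you insert a string into a trie?

#include <cstddef>   // NULL
#include <string>
using std::string;

// Map 'a'..'z' to 0..25. Assumes lowercase input.
int index(char ch) { return int(ch - 'a'); }

// Insert word[pos..] below node n (the struct node defined earlier).
// The flag for the last letter is set in the node reached by the letters before it.
void insert(const string &word, node *n, int pos)
{
    if (word.empty()) return;                  // nothing to store
    if (pos == int(word.size()) - 1) {
        n->isword[index(word[pos])] = 1;
        return;
    }
    if (n->links[index(word[pos])] == NULL)
        n->links[index(word[pos])] = new node;
    insert(word, n->links[index(word[pos])], pos + 1);
}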
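Lookup follows the same path as insertion. The sketch below restates the node layout from the slides under different names (`WordNode`, `letter_index`, `trie_insert`, `trie_contains` are all illustrative) so that it compiles on its own, and it assumes words made only of lowercase 'a'..'z'. The word itself is never stored: it is implied by the links followed plus the final flag.

```cpp
#include <cstddef>
#include <string>

struct WordNode {
    char isword[26] = {};        // isword[i]: path so far plus letter i is a word
    WordNode* links[26] = {};    // links[i]: child for letter i, or nullptr
};

inline int letter_index(char ch) { return ch - 'a'; }  // assumes 'a'..'z'

// Iterative version of the insertion shown on the slide.
void trie_insert(WordNode* n, const std::string& word) {
    if (word.empty()) return;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        int i = letter_index(word[pos]);
        if (n->links[i] == nullptr) n->links[i] = new WordNode;
        n = n->links[i];
    }
    n->isword[letter_index(word.back())] = 1;  // flag lives in the parent of the last letter
}

// Follow one link per letter except the last, then test the flag.
bool trie_contains(const WordNode* n, const std::string& word) {
    if (word.empty()) return false;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        n = n->links[letter_index(word[pos])];
        if (n == nullptr) return false;        // path does not exist
    }
    return n->isword[letter_index(word.back())] != 0;
}
```

A search for a word of length L makes L - 1 link hops and one flag test, independent of how many words are stored.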

Experimental Results

In my implementation a node used 132 bytes.
20068 words were read in.
45747 nodes were allocated.
Total space 6,038,604 bytes (compared with 200k size of /usr/dict/words).
Average word length 7.4 characters.
Average comparisons per search: 7.4 one-character comparisons (compared to 15 word comparisons for a balanced BST).
Much easier to implement than a balanced BST.
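The space figures can be reproduced with a short calculation. The sketch assumes the node holds 26 one-byte flags followed by 26 pointers and that the compiler pads the struct to a multiple of the pointer size. That is typical but not guaranteed, and the function names are made up for this illustration.

```cpp
#include <cstddef>

// Bytes per node: 26 flags + 26 pointers, rounded up to a multiple of the
// pointer size (assumed alignment rule).
constexpr std::size_t node_bytes(std::size_t pointer_bytes) {
    return (26 + 26 * pointer_bytes + pointer_bytes - 1) / pointer_bytes * pointer_bytes;
}

// Total bytes for a trie with the given number of allocated nodes.
constexpr std::size_t trie_bytes(std::size_t nodes, std::size_t pointer_bytes) {
    return nodes * node_bytes(pointer_bytes);
}
```

With 4-byte pointers this gives 132 bytes per node and 45747 * 132 = 6,038,604 bytes in total, matching the measurements above. With 8-byte pointers the same layout would need 240 bytes per node.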

Using a Trie: Examples

Spell checker: fast but big.
Symbol table with lots of short symbols.
Boggle-playing program:
    read /usr/dict/words into a trie
    generate a 4x4 square of random letters
    DFS (or BFS) starting at each square, not re-using letters, finding all words from trie…
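One way the Boggle search might look, as a sketch only. It assumes a fixed 4x4 board of lowercase letters, moves to any of the eight neighbouring squares (the slide does not say which moves are allowed), and uses the same node layout as the slides under illustrative names (`WordNode`, `Board`, `dfs`, `find_words`). Loading /usr/dict/words and generating random letters are left out.

```cpp
#include <array>
#include <cstddef>
#include <set>
#include <string>

struct WordNode {
    char isword[26] = {};
    WordNode* links[26] = {};
};

inline int letter_index(char ch) { return ch - 'a'; }  // assumes 'a'..'z'

void trie_insert(WordNode* n, const std::string& word) {
    if (word.empty()) return;
    for (std::size_t pos = 0; pos + 1 < word.size(); ++pos) {
        int i = letter_index(word[pos]);
        if (n->links[i] == nullptr) n->links[i] = new WordNode;
        n = n->links[i];
    }
    n->isword[letter_index(word.back())] = 1;
}

using Board = std::array<std::string, 4>;  // 4 rows of 4 lowercase letters

// Extend the current path with square (r, c), record a word if one ends
// here, and continue only while the trie still has a matching prefix.
void dfs(const Board& board, int r, int c, const WordNode* n, std::string& path,
         bool (&used)[4][4], std::set<std::string>& found) {
    int i = letter_index(board[r][c]);
    used[r][c] = true;
    path.push_back(board[r][c]);
    if (n->isword[i]) found.insert(path);
    if (const WordNode* next = n->links[i]) {
        for (int dr = -1; dr <= 1; ++dr)
            for (int dc = -1; dc <= 1; ++dc) {
                int nr = r + dr, nc = c + dc;
                if (nr < 0 || nr >= 4 || nc < 0 || nc >= 4 || used[nr][nc]) continue;
                dfs(board, nr, nc, next, path, used, found);
            }
    }
    path.pop_back();      // backtrack so the square can be reused on other paths
    used[r][c] = false;
}

// Start a DFS from every square and collect the distinct words found.
std::set<std::string> find_words(const Board& board, const WordNode& root) {
    std::set<std::string> found;
    bool used[4][4] = {};
    std::string path;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            dfs(board, r, c, &root, path, used, found);
    return found;
}
```

The trie is what keeps this affordable: a path is abandoned as soon as no dictionary word starts with its letters.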
