© 2014 a. haeberlen, z. ives cis 455/555: internet and web systems 1 university of pennsylvania...

University of Pennsylvania

© 2014 A. Haeberlen, Z. Ives

CIS 455/555: Internet and Web Systems

Indexing

February 5, 2014



Announcements

HW1 MS1 is due IN ONE WEEK

At this point, you should have a feature-complete prototype, so you have time to debug and test your solution

Debugging tips: When in doubt about protocol details, please look in the HTTP/1.1 spec (RFC 2616; linked from HTTP Made Really Easy)

Reminder: You have three jokers; the late penalty without jokers is 20% per day

Please: Use private questions on Piazza sparingly

Reading: D. Comer: "The Ubiquitous B-Tree"

http://dl.acm.org/citation.cfm?id=356776



Plan for today

Inverted indices (NEXT)

B+ trees

Finding data by content

We’ve seen two approaches to search:

Flood the network with requests (example: Gnutella), and do all the work at the data stores

Have a directory based on names (example: LDAP)

Which of these is the 'best'?

An alternative, two-step process:

Build a content index over what’s out there

An index is a key → value map

Typically limited in what kinds of queries can be supported

Most common instance: an index of document keywords


A common model for search

Index the words in every document

“Forward index”: document (ID) → list of words

“Inverted index”: word → document (ID)
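The two mappings can be sketched in a few lines of Python. This is only an illustration: the documents, their IDs, and the whitespace tokenizer are assumptions made for the example, not part of the lecture.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Forward index: document ID -> list of words in that document."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}

def invert(forward_index):
    """Inverted index: word -> sorted list of IDs of documents containing it."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}

# Hypothetical three-document collection.
docs = {1: "the quick fox", 2: "the lazy dog", 3: "quick dog"}
forward = build_forward_index(docs)
inverted = invert(forward)

print(forward[2])         # words of document 2
print(inverted["quick"])  # documents that contain "quick"
```

The forward index answers "which words are in this document?", while the inverted index answers the search question "which documents contain this word?".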


Inverted indices

A conceptually very simple map-multiset data structure: <keyword, {list of occurrences}>

In its simplest form, each occurrence includes a document pointer (e.g., URI), perhaps a count and/or position

What might a count be useful for? A position?

Requires two components, an indexer and a retrieval system

We’ll consider the cost of building the index, plus searching the index using a single keyword

Storage efficiency is also a concern
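A minimal sketch of the two components, assuming each occurrence stores a URI plus the word positions (the count is then the number of positions). The URIs, the tokenizer, and the choice to rank by count are assumptions for illustration; counts are commonly used for ranking and positions for phrase or proximity queries.

```python
from collections import defaultdict

def index_documents(docs):
    """Indexer: build keyword -> {URI: [positions of the keyword]}."""
    index = defaultdict(dict)
    for uri, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(uri, []).append(position)
    return index

def search(index, keyword):
    """Retrieval: (URI, count, positions) per hit, highest count first."""
    postings = index.get(keyword.lower(), {})
    hits = [(uri, len(positions), positions) for uri, positions in postings.items()]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

# Hypothetical documents keyed by made-up URIs.
docs = {
    "http://example.org/a": "b+ trees and more trees",
    "http://example.org/b": "binary trees",
}
index = index_documents(docs)
for uri, count, positions in search(index, "trees"):
    print(uri, count, positions)
```

Building the index touches every word of every document once; a single-keyword search is one map lookup plus a sort of that keyword's occurrence list.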



How do we lay out an inverted index?

Some data structures we could use:

Unordered list (e.g., a log)

Ordered list

Tree

Hash table



Unordered and ordered lists

Assume that we have entries such as: <keyword, #items, {occurrences}>

What does ordering buy us?

Assume that we adopt a model in which we use:

<keyword, item> <keyword, item>

Do we get any additional benefits?

How about: <keyword, {items}>, where we fix the size of the keyword and the number of items?
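One thing ordering buys is binary search. The sketch below assumes the one-pair-per-entry layout, held in memory as a sorted Python list; the keywords and item IDs are made up for the example.

```python
import bisect

# Sorted list of (keyword, item) pairs; items are hypothetical document IDs.
entries = sorted([("dog", 2), ("fox", 1), ("dog", 3), ("ant", 1), ("fox", 4)])

def lookup(entries, keyword):
    """Return all items stored under keyword in a sorted list of pairs."""
    # (keyword,) sorts just before every (keyword, item) pair,
    # so this lands on the first entry for the keyword, if any.
    i = bisect.bisect_left(entries, (keyword,))
    items = []
    while i < len(entries) and entries[i][0] == keyword:
        items.append(entries[i][1])
        i += 1
    return items

print(lookup(entries, "dog"))
print(lookup(entries, "cat"))
```

The lookup takes a logarithmic number of comparisons to find the first match and then reads the matches sequentially. With an unordered list the same query would have to scan every entry.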



Tree-based indices

Trees have several benefits over lists:

Potentially logarithmic search time, as with a well-designed sorted list, if the tree is balanced!

Ability to handle variable-length records

We’ve already seen how trees might make a natural way of distributing data, as well

How does a binary search tree fare?

Cost of building?

Cost of finding an item in it?
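A small experiment hints at the answer. The sketch below builds a plain binary search tree with no rebalancing, once from keys in sorted order and once from the same keys shuffled, and compares the heights. The key set and the random seed are arbitrary choices for the example.

```python
import random

class Node:
    """One node of a plain (unbalanced) binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key without any rebalancing; return the (possibly new) root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key == node.key:
            return root  # already present
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child

def height(root):
    """Number of levels, computed level by level to avoid deep recursion."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [child for node in frontier
                    for child in (node.left, node.right) if child is not None]
    return levels

def build(keys):
    """Insert the keys one by one, in the given order."""
    root = None
    for key in keys:
        root = insert(root, key)
    return root

keys = list(range(200))          # hypothetical keys, already sorted
sorted_height = height(build(keys))

random.Random(0).shuffle(keys)   # fixed seed so the run is repeatable
shuffled_height = height(build(keys))

print(sorted_height, shuffled_height)
```

Sorted input degenerates the tree into a chain with one level per key, so both building and searching lose the logarithmic bound. This is the motivation for a structure that stays balanced regardless of insertion order.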



Recap: Inverted indices

Useful for search

Different data structures can be used

Pros / cons



Plan for today

Inverted indices

B+ trees (NEXT)


The B+ tree

A flexible, height-balanced, high-fanout tree

Insert/delete at log_F N cost (F = fanout, N = # leaf pages)

Need to keep tree height-balanced

Minimum 50% occupancy (except for root)

Each node contains d <= m <= 2d entries

Inner nodes contain up to 2d+1 pointers

d is called the order of the tree

Can search efficiently based on equality (or also range, though we don’t need that here)

[Figure: index entries on top (direct search); data entries ("sequence set") at the bottom, chained in a linked list (compare to B-tree!)]


Example B+ Tree

Data (inverted list pointers) is at the leaves; intermediate nodes have copies of search keys

Search begins at root, and key comparisons direct it to a leaf

Search for be↓, bobcat↓ ...

Based on the search for bobcat↓, we know it is not in the tree!

Root: art | best | but | dog

Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
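The search on this example can be sketched as follows. The tree is hard-coded as one index node over five leaves, which is a simplification for illustration: a real B+ tree would follow child pointers through several index levels, and the leaf entries would carry inverted-list pointers rather than bare keys.

```python
import bisect

# The example tree: the root's search keys and the five leaves below it.
root_keys = ["art", "best", "but", "dog"]
leaves = [
    ["a", "am", "an", "ant"],
    ["art", "be"],
    ["best", "bit", "bob"],
    ["but", "can", "cry"],
    ["dog", "dry", "elf", "fox"],
]

def search(key):
    """Return (leaf_number, found) for key in the two-level example tree."""
    # Child i holds the keys k with root_keys[i-1] <= k < root_keys[i],
    # so a key equal to a separator belongs to the child on its right.
    child = bisect.bisect_right(root_keys, key)
    leaf = leaves[child]
    pos = bisect.bisect_left(leaf, key)
    return child, pos < len(leaf) and leaf[pos] == key

print(search("be"))      # reaches the second leaf and finds the key
print(search("bobcat"))  # reaches the third leaf; the key is not there
```

Because the comparisons at the root pick exactly one leaf, a miss in that leaf is enough to conclude that the key is nowhere in the tree.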


Inserting data into a B+ Tree

Find correct leaf L

Put data entry onto L

If L has enough space, we are done!

Else, must split leaf node L (into L and a new node L2)

Redistribute entries evenly, copy up middle key

Insert index entry pointing to L2 into parent of L

This can happen recursively

To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)

Splits “grow” tree; root split increases height

Tree growth: gets wider or one level taller at the top

Root: art | best | but | dog

Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]



Inserting “and↓” Example: Copy up

Root: art | best | but | dog

Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]

Want to insert into the first leaf; no room, so split & copy up:

Overfull leaf: a↓ am↓ an↓ and↓ ant↓

Entry to be inserted in parent node: an (Note that key “an” is copied up and continues to appear in the leaf.)

But where? Parent node is already "full"!



Inserting “and↓” Example: Push up 1/2

Root (full): art | best | but | dog; entry to insert: an

Need to split node & push up

Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]



Inserting “and↓” Example: Push up 2/2

Root: best

Index nodes: [an | art] [but | dog]

Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]

Entry to be inserted in parent node: best (Note that best is pushed up and only appears once in the index. Contrast this with a leaf split.)
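The two kinds of split in this example can be sketched as two small functions. This is a simplified illustration that handles keys only: child pointers, the leaf-level linked list, and the recursion up the tree are left out, and the function names are made up for the sketch.

```python
def split_leaf(entries):
    """Split an overfull, sorted leaf.

    The middle key is COPIED up: it goes to the parent and also stays
    in the new right-hand leaf.
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right[0], right

def split_index(keys):
    """Split an overfull, sorted index node.

    The middle key is PUSHED up: it moves to the parent and appears in
    neither half.
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Step 1: "and" arrives at the full first leaf.
leaf = sorted(["a", "am", "an", "ant"] + ["and"])
left_leaf, copied_up, right_leaf = split_leaf(leaf)
print(left_leaf, copied_up, right_leaf)

# Step 2: the copied-up key arrives at the full root.
root = sorted(["art", "best", "but", "dog"] + [copied_up])
left_node, pushed_up, right_node = split_index(root)
print(left_node, pushed_up, right_node)
```

The leaf split keeps “an” in a leaf because every data entry must remain reachable at the leaf level. The index split does not keep “best” in either half because index keys only serve as partition points.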



Summary: Copying vs. splitting

Every keyword (search key) appears in at most one intermediate node

Hence, in splitting an intermediate node, we push up

Every inverted list entry must appear in a leaf

We may also need it in an intermediate node to define a partition point in the tree

We must copy up the key of this entry

Note that B+ trees easily accommodate multiple occurrences of a keyword



Some details

How would you choose the order of the tree?

How would you find all the words starting with the letters 'com'?

How would you delete something?

Do you always have to split/merge?
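For the prefix question, one possible approach is a range scan: descend to the first key that is not smaller than the prefix, then walk the leaf-level sequence until the keys stop matching. The sketch below stands in for the linked leaves with a single sorted Python list; the word list is made up for the example.

```python
import bisect

def prefix_scan(sorted_keys, prefix):
    """Return all keys that start with prefix, given keys in sorted order."""
    # Keys sharing a prefix are contiguous in sorted order, and the
    # prefix itself sorts before all of them.
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches

# Hypothetical leaf-level key sequence.
words = sorted(["cat", "code", "com", "comet", "compute", "con", "dog"])
print(prefix_scan(words, "com"))
```

In a B+ tree the first step is an ordinary root-to-leaf search and the second step follows the links between neighboring leaves, so the cost is one descent plus the number of leaf pages that hold matches.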


Virtues of the B+ Tree

B+ tree and other indices are quite efficient:

Height-balanced; log_F N cost to search

High fanout (F) means depth rarely more than 3 or 4

Almost always better than maintaining a sorted file

Typically, 67% occupancy on average
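The claim about depth can be checked with a little arithmetic. The sketch below computes the number of index levels needed to reach N leaf pages with fanout F, that is, the smallest h with F^h >= N. The fanout of 100 and the page counts are assumed values for illustration; actual fanout depends on page and key sizes.

```python
def levels_needed(fanout, leaf_pages):
    """Smallest h such that fanout**h >= leaf_pages (ceiling of log_F N)."""
    levels, reach = 0, 1
    while reach < leaf_pages:
        reach *= fanout
        levels += 1
    return levels

# Assumed fanout of 100 versus a binary tree, for the same number of leaf pages.
for pages in (1_000_000, 100_000_000):
    print(pages, levels_needed(100, pages), levels_needed(2, pages))
```

With a fanout of 100, one million leaf pages are reachable in 3 levels and one hundred million in 4, whereas a binary tree over one million leaves needs 20 levels.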

Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you will be using later in the semester:

Interface: open B+ Tree; get and put items based on key

Handles concurrency, caching, etc.



Example: B+ tree

Insert 15, 11, 12, 32, 74

Root: 65 | 130 | 187

Index nodes: [9 | 25 | 45] [70 | 80 | 101 | 122] [138 | 150 | 159 | 180] ...

Leaves (partial): [1 4 6] [9 14 16] [25 31 38 41] [45 61 63 64] [65 67 68 69] [70 72 75 79] ...



How do we distribute a B+ Tree?

We need to host the root at one machine and distribute the rest

What are the implications for scalability?

Consider building the index as well as searching



Eliminating the root

Sometimes we don’t want a tree-structured system because the higher levels can be a central point of congestion or failure

Two strategies:

Modified tree structure (e.g., the BATON p2p tree; see Jagadish et al., VLDB 2005)

Non-hierarchical structure (distributed hash table, discussed in a couple of weeks)



Recap: B+ trees

A very common data structure for indices

Used, e.g., in many file systems and many DBMS

Very efficient

Height-balanced; log_F N cost to search

High fanout (F) means depth rarely more than 3 or 4

Almost always better than maintaining a sorted file

Typically, 67% occupancy on average
