© 2014 A. Haeberlen, Z. Ives CIS 455/555: Internet and Web Systems 1 University of Pennsylvania Indexing February 5, 2014


Page 1:

University of Pennsylvania

© 2014 A. Haeberlen, Z. Ives

CIS 455/555: Internet and Web Systems

Indexing

February 5, 2014

Page 2:

Announcements

HW1 MS1 is due IN ONE WEEK

At this point, you should have a feature-complete prototype, so you have time to debug and test your solution

Debugging tips

When in doubt about protocol details, please look in the HTTP/1.1 spec (RFC 2616; linked from HTTP Made Really Easy)

Reminder: You have three jokers; the late penalty without jokers is 20% per day

Please: Use private questions on Piazza sparingly

Reading: D. Comer: "The Ubiquitous B-Tree"

http://dl.acm.org/citation.cfm?id=356776

Page 3:

Plan for today

Inverted indices (NEXT)

B+ trees

Page 4:

Finding data by content

We’ve seen two approaches to search:

Flood the network with requests (example: Gnutella), and do all the work at the data stores

Have a directory based on names (example: LDAP)

Which of these is the 'best'?

An alternative, two-step process:

Build a content index over what’s out there

An index is a key → value map

Typically limited in what kinds of queries can be supported

Most common instance: an index of document keywords

Page 5:

A common model for search

Index the words in every document

“Forward index”: document (ID) → list of words

“Inverted index”: word → document (ID)
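The two mappings can be illustrated with a minimal sketch. The three-document corpus and the variable names are made up for illustration; the slides do not prescribe a representation.

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> text.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}

# Forward index: document ID -> list of words in that document.
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: word -> sorted list of IDs of documents containing it.
inverted = defaultdict(set)
for doc_id, words in forward_index.items():
    for word in words:
        inverted[word].add(doc_id)
inverted_index = {word: sorted(ids) for word, ids in inverted.items()}

print(forward_index[2])         # ['the', 'lazy', 'dog']
print(inverted_index["quick"])  # [1, 3]
```

Answering "which documents contain this word?" is a single lookup in the inverted index, whereas the forward index would have to be scanned in full.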


Page 6:

Inverted indices

A conceptually very simple map-multiset data structure: <keyword, {list of occurrences}>

In its simplest form, each occurrence includes a document pointer (e.g., URI), perhaps a count and/or position

What might a count be useful for? A position?

Requires two components, an indexer and a retrieval system

We’ll consider the cost of building the index, plus searching the index using a single keyword

Storage efficiency is also a concern
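As a rough sketch of the two components, the indexer below records a position list per document, and the retrieval function answers a single-keyword query. The function names, the example URIs, and the choice to rank by count are assumptions made for illustration. Ranking is one plausible use of a count; positions could support phrase or proximity queries.

```python
from collections import defaultdict

def build_index(docs):
    """Indexer: map each keyword to {document URI: [word positions]}."""
    index = defaultdict(dict)
    for uri, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(uri, []).append(pos)
    return index

def lookup(index, keyword):
    """Retrieval: list of (uri, count, positions), highest count first."""
    postings = index.get(keyword.lower(), {})
    hits = [(uri, len(positions), positions) for uri, positions in postings.items()]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

# Hypothetical documents, keyed by made-up URIs.
docs = {
    "http://example.org/a": "to be or not to be",
    "http://example.org/b": "be quick",
}
index = build_index(docs)
results = lookup(index, "be")
print(results)
# [('http://example.org/a', 2, [1, 5]), ('http://example.org/b', 1, [0])]
```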

Page 7:

How do we lay out an inverted index?

Some data structures we could use:

Unordered list (e.g., a log)

Ordered list

Tree

Hash table

Page 8:

Unordered and ordered lists

Assume that we have entries such as: <keyword, #items, {occurrences}>

What does ordering buy us?

Assume that we adopt a model in which we use:

<keyword, item> <keyword, item>

Do we get any additional benefits?

How about: <keyword, {items}> where we fix the size of the keyword and the number of items?
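One thing ordering buys is binary search. The sketch below contrasts a linear scan with a binary search over entries of the form <keyword, #items, {occurrences}>. The entries and function names are hypothetical, and the in-memory list stands in for whatever on-disk layout is actually used.

```python
from bisect import bisect_left

# Hypothetical entries (keyword, #items, occurrences), sorted by keyword.
entries = [
    ("ant", 2, ["doc1", "doc4"]),
    ("bee", 1, ["doc2"]),
    ("cat", 3, ["doc1", "doc2", "doc3"]),
    ("dog", 1, ["doc4"]),
]
keywords = [entry[0] for entry in entries]

def find_unordered(keyword):
    """Linear scan: O(n) comparisons, works for any list order."""
    for entry in entries:
        if entry[0] == keyword:
            return entry
    return None

def find_ordered(keyword):
    """Binary search: O(log n) comparisons, requires sorted keywords."""
    i = bisect_left(keywords, keyword)
    if i < len(keywords) and keywords[i] == keyword:
        return entries[i]
    return None

print(find_ordered("cat"))  # ('cat', 3, ['doc1', 'doc2', 'doc3'])
print(find_ordered("cow"))  # None
```

With fixed-size records, the i-th entry sits at a computable byte offset, so the same binary search can run directly over a file. With variable-length records, a separate offset table (or similar) would be needed.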

Page 9:

Tree-based indices

Trees have several benefits over lists:

Potentially logarithmic search time, as with a well-designed sorted list, if the tree is balanced!

Ability to handle variable-length records

We’ve already seen how trees might make a natural way of distributing data, as well

How does a binary search tree fare?

Cost of building?

Cost of finding an item in it?
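A small experiment can make the "if the tree is balanced" caveat concrete. The sketch below builds a plain, unbalanced binary search tree twice: once from keys in sorted order and once from the same keys shuffled. The key count and the random seed are arbitrary choices for illustration.

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Unbalanced BST insert (iterative); returns the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root  # duplicate key: nothing to do

def height(root):
    """Number of nodes on the longest root-to-leaf path (iterative)."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        best = max(best, depth)
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return best

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

keys = list(range(1000))
sorted_height = height(build(keys))    # every node only has a right child
random.Random(0).shuffle(keys)
shuffled_height = height(build(keys))  # much shallower, but not guaranteed
print(sorted_height, shuffled_height)
```

Inserting sorted keys degenerates the tree into a chain, so both building and searching lose the logarithmic bound. This is the motivation for a structure that stays height-balanced by construction.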

Page 10:

Recap: Inverted indices

Useful for search

Different data structures can be used

Pros / cons

Page 11:

Plan for today

Inverted indices

B+ trees (NEXT)

Page 12:

The B+ tree

A flexible, height-balanced, high-fanout tree

Insert/delete at log_F N cost (F = fanout, N = # leaf pages)

Need to keep tree height-balanced

Minimum 50% occupancy (except for root)

Each node contains m entries, where d <= m <= 2d

Inner nodes contain up to 2d+1 pointers

d is called the order of the tree

Can search efficiently based on equality (or also range, though we don’t need that here)

[Figure: index entries on top (direct search); data entries, the "sequence set", at the bottom, chained in a linked list. Compare to a B-tree!]

Page 13:

Example B+ Tree

Data (inverted list pointers) is at the leaves; intermediate nodes have copies of search keys

Search begins at root, and key comparisons direct it to a leaf

Search for be↓, bobcat↓ ...

Based on the search for bobcat↓, we know it is not in the tree!

Root: art | best | but | dog

Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
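The two searches on this slide can be traced with a minimal sketch. The example tree is hard-coded as one root plus five leaves; the names `root_keys`, `leaves`, `find_leaf`, and `contains` are made up for illustration, and a general implementation would descend through several levels.

```python
from bisect import bisect_right

# The example tree, written out by hand as root keys plus one leaf
# per child pointer (two levels only, to keep the sketch short).
root_keys = ["art", "best", "but", "dog"]
leaves = [
    ["a", "am", "an", "ant"],
    ["art", "be"],
    ["best", "bit", "bob"],
    ["but", "can", "cry"],
    ["dog", "dry", "elf", "fox"],
]

def find_leaf(key):
    """Keys below root_keys[0] go to leaf 0; a key >= root_keys[i]
    goes to a leaf to the right of that separator."""
    return leaves[bisect_right(root_keys, key)]

def contains(key):
    """Only the one leaf chosen by the key comparisons is inspected."""
    return key in find_leaf(key)

print(find_leaf("be"), contains("be"))          # ['art', 'be'] True
print(find_leaf("bobcat"), contains("bobcat"))  # ['best', 'bit', 'bob'] False
```

Because the key comparisons lead to exactly one leaf, not finding bobcat there is enough to conclude that it is not in the tree.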

Page 14:

Inserting data into a B+ Tree

Find correct leaf L

Put data entry onto L

If L has enough space, we are done!

Else, must split leaf node L (into L and a new node L2)

Redistribute entries evenly, copy up middle key

Insert index entry pointing to L2 into parent of L

This can happen recursively

To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)

Splits “grow” tree; root split increases height

Tree growth: gets wider or one level taller at the top

Root: art | best | but | dog

Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
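The insertion steps above can be sketched as follows. This is an illustrative in-memory version, not a reference implementation: the `Node` class and function names are invented, the order is fixed at d = 2 to match the example tree, and the leaf-level linked list, duplicate handling, and deletion are omitted.

```python
from bisect import bisect_right, insort

D = 2  # order d of the example tree: at most 2*D keys per node

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted keys (data entries in a leaf)
        self.children = children  # list of child nodes; None marks a leaf

def insert(node, key):
    """Insert key below node. Returns None, or (key_for_parent, new_right_node)
    if node overflowed and had to be split."""
    if node.children is None:
        insort(node.keys, key)                 # put data entry onto leaf L
        if len(node.keys) <= 2 * D:
            return None                        # enough space: done
        mid = len(node.keys) // 2
        right = Node(node.keys[mid:])          # new leaf L2
        node.keys = node.keys[:mid]
        return right.keys[0], right            # copy up: key also stays in L2
    i = bisect_right(node.keys, key)           # choose the child to descend into
    split = insert(node.children[i], key)
    if split is None:
        return None
    up_key, new_child = split
    node.keys.insert(i, up_key)                # index entry pointing to new child
    node.children.insert(i + 1, new_child)
    if len(node.keys) <= 2 * D:
        return None
    mid = len(node.keys) // 2
    up_key = node.keys[mid]                    # push up: key leaves this node
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up_key, right

def insert_into_tree(root, key):
    """Insert key and return the root; a root split adds one level on top."""
    split = insert(root, key)
    if split is None:
        return root
    up_key, right = split
    return Node([up_key], [root, right])

# The example tree from the slides, followed by the insertion of "and".
leaves = [Node(ks) for ks in (["a", "am", "an", "ant"], ["art", "be"],
                              ["best", "bit", "bob"], ["but", "can", "cry"],
                              ["dog", "dry", "elf", "fox"])]
root = Node(["art", "best", "but", "dog"], leaves)
root = insert_into_tree(root, "and")
print(root.keys, [child.keys for child in root.children])
# ['best'] [['an', 'art'], ['but', 'dog']]
```

The leaf split returns a key that still appears in the new leaf (copy up), while the inner-node split removes the middle key from both halves (push up).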

Page 15:

Inserting “and↓” Example: Copy up

Root: art | best | but | dog

Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]

Want to insert and↓ into the leaf [a↓ am↓ an↓ ant↓]; no room, so split & copy up:

[a↓ am↓] [an↓ and↓ ant↓]

Entry to be inserted in parent node: an. (Note that key “an” is copied up and continues to appear in the leaf.)

But where? Parent node is already "full"!

Page 16:

Inserting “and↓” Example: Push up 1/2

Root (overfull after adding “an”): an | art | best | but | dog

Need to split node & push up

Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]

Page 17:

Inserting “and↓” Example: Push up 2/2

Root: best

Inner nodes: [an | art] [but | dog]

Entry to be inserted in parent node: best. (Note that best is pushed up and only appears once in the index. Contrast this with a leaf split.)

Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]

Page 18:

Summary: Copying vs. splitting

Every keyword (search key) appears in at most one intermediate node

Hence, in splitting an intermediate node, we push up

Every inverted list entry must appear in a leaf

We may also need it in an intermediate node to define a partition point in the tree

We must copy up the key of this entry

Note that B+ trees easily accommodate multiple occurrences of a keyword

Page 19:

Some details

How would you choose the order of the tree?

How would you find all the words starting with the letters 'com'?

How would you delete something?

Do you always have to split/merge?
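For the prefix question, one possible approach is a range scan: locate the first key that is not smaller than the prefix, then walk forward through the sorted data entries until the prefix no longer matches. The sketch below uses a sorted Python list as a stand-in for the leaf-level sequence set; the keyword list and the function name are hypothetical.

```python
from bisect import bisect_left

# Hypothetical sorted keywords, standing in for the leaf-level sequence set.
sequence_set = ["cat", "coal", "com", "comet", "command", "common", "con", "dog"]

def with_prefix(keys, prefix):
    """Return all keys that start with prefix, assuming keys is sorted."""
    i = bisect_left(keys, prefix)  # first key >= prefix
    matches = []
    while i < len(keys) and keys[i].startswith(prefix):
        matches.append(keys[i])
        i += 1
    return matches

print(with_prefix(sequence_set, "com"))
# ['com', 'comet', 'command', 'common']
```

In a B+ tree, the `bisect_left` step corresponds to the root-to-leaf descent, and the loop corresponds to following the linked list between leaves.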

Page 20:

Virtues of the B+ Tree

B+ tree and other indices are quite efficient:

Height-balanced; log_F N cost to search

High fanout (F) means depth rarely more than 3 or 4

Almost always better than maintaining a sorted file

Typically, 67% occupancy on average

Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you will be using later in the semester:

Interface: open B+ Tree; get and put items based on key

Handles concurrency, caching, etc.
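To get a feel for the log_F N search cost and the "rarely more than 3 or 4" claim, the following back-of-the-envelope sketch counts how many index levels are needed above the leaves. The fanout of 100 and the leaf-page counts are assumed values chosen for illustration, not figures from the lecture.

```python
def levels_above_leaves(num_leaf_pages, fanout):
    """Smallest h with fanout**h >= num_leaf_pages, i.e. ceil(log_F N).
    Integer arithmetic avoids floating-point rounding in the logarithm."""
    h, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout
        h += 1
    return h

# Assumed fanout F = 100; N is the number of leaf pages.
for n in (10_000, 1_000_000, 100_000_000):
    print(n, levels_above_leaves(n, fanout=100))
# 10000 2
# 1000000 3
# 100000000 4
```

Under these assumptions, even a hundred million leaf pages are reachable through four index levels.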

Page 21:

Example: B+ tree

Insert 15, 11, 12, 32, 74

Root: 65 | 130 | 187

Inner nodes: [9 | 25 | 45] [70 | 80 | 101 | 122] [138 | 150 | 159 | 180]

Leaves shown: [1 4 6] [9 14 16] [25 31 38 41] [45 61 63 64] [65 67 68 69] [70 72 75 79]

Page 22:

How do we distribute a B+ Tree?

We need to host the root at one machine and distribute the rest

What are the implications for scalability?

Consider building the index as well as searching

Page 23:

Eliminating the root

Sometimes we don’t want a tree-structured system because the higher levels can be a central point of congestion or failure

Two strategies:

Modified tree structure (e.g., the BATON p2p tree; see Jagadish et al., VLDB 2005)

Non-hierarchical structure (distributed hash table, discussed in a couple of weeks)

Page 24:

Recap: B+ trees

A very common data structure for indices

Used, e.g., in many file systems and many DBMS

Very efficient

Height-balanced; log_F N cost to search

High fanout (F) means depth rarely more than 3 or 4

Almost always better than maintaining a sorted file

Typically, 67% occupancy on average