spring 2003 ece569 lecture 04-2.1 ece 569 database system engineering spring 2003 yanyong zhang...

Spring 2003ECE569 Lecture 04-2.1

ECE 569 Database System Engineering

Spring 2003

Yanyong Zhang www.ece.rutgers.edu/~yyzhang

Course URL www.ece.rutgers.edu/~yyzhang/spring03


Access Paths

Associative access can be realized via scan.

The class of algorithms and data structures designed for translating attribute values into TID, or into other types of internal addresses of tuples having those attribute values, is called access paths.

Depending on what kind of selection predicate is to be supported, the techniques for associative access vary greatly.


Content addressability techniques

Primary key access. A tuple of a relation must be retrieved efficiently via the

value of its primary (unique) key(s). e.g., key-sequenced files and hased files.

Point query vs. range query

Secondary key access A set of tuples are produced

Multi-table access Tuple access is often based on relationships between

different tuples.


Operations on files

Assumptions n = number of records in file

R = number of records that can fit in block

Lookup – Given a key find corresponding record On average, n / (2R) block accesses.

Insertion – add record to file (allows duplicates) Read last block; it may need to allocate a new block.

Approximately, requires 2 accesses

Deletion – delete record look up record n / (2R)

Write back to disk (1 access)

Reorganize (unpinned) – move tuple from last page to utilize space (2 disk accesses)


Hashed Files

File is divided into B buckets

Hash function h maps elements of the key space to range [0, B)

Key space is large and unevenly distributed

- SSNs as character strings

- Each character takes on at most 10 of the possible 256 values

Hash function h must map key values evenly among a relatively small number of values.

A bucket directory is an array of B pointers to the allocated buckets.

Small enough to fit entirely in memory

Buckets are allocated only as they are needed.


Hashed files

..Ksuchthath(K)=1Ksuchthath(

1

0

.

.

.

B-2

B-1

Hash Table

Tuples with key Kbsuch that h(Kb) = 1

Tuples with key Kbsuch that h(Kb) = B -22

Tuples with key Kbsuch that h(Kb) = B -1

Buckets


Hash-based associative access

FOLDING HASHING

Ran

ge

of

po

siti

ve in

teg

ers

tupleaddressspace

Range of PotentialKey Values(the shadedareas denoteused keyvalues)


Folding Convert arbitrary data types to a positive integer h can be

applied to. Reduce number of bits so that arithmetic is efficient. Example: Key is “Keefe” and 16803

Key value is the concatenation of byte representation of individual fields

Folded value of key is

0x4b 0x65 0x65 0x66 0x65 0x0 0x0 0x0 0x41 0xa3 Partition result into words and combine using XOR

0x4b 0x65 0x65 0x66

0x65 0x0 0x0 0x0

0x41 0xa3 0x0 0x0

0x6f 0xc6 0x65 0x66 = 1875273062


Hashing

goal of hasing

How to choose hash function if all the key values are uniformly distributed?

The critical issue is to produce 1:1 mapping

Collision: different inputs are mapped to the same output.

The criteria of a good hash function is to keep the collision as small as possible.


Static Hashing

Input: folded key values

Output: bytes (relative to the beginning of the file), blocks ??

Bytes are not good because of the varying tuple size.

A block/page is called a bucket.

H: {0 … 232-1} -> {0, B-1}

Continuous allocation

Fixed size: B pages are allocated at file creation time.

Insert

- Determine the bucket

- Check the bucket ( collision may happen)


How to find a good hash function

Division / remainder (Congruential hashing) H(Kb) = kb mod B where kb is folded key value and B is

the number of buckets.

Nth power Compute kb

N, and from the resulting bit string (n x 31 bits) take log2B bits from the middle.

Base transformation

Polynomial division

Numerical analysis

encryption


Performance

Assumption Perfect hash function (tuples are uniformly distributed

over B buckets)

Lookup ½ n/R 1/B To finish first match

n/R 1/B If tuple does not exist

Insertion n/R 1/B + 1 Test for duplicates

1 Otherwise

Deletion ½ n/R 1/B delete first match


Collision

Two keys collide if they hash to same value

A bucket with room for R tuples can accommodate R – 1 collisions before it overflows

Internal resolution: Place overflow blocks in another bucket

- (h(K) + 1) mod B open addressing

- (h2(h1(K)) multiple hashing


Collision - continued

External resolution: Allocation overflow block, link to overflow chain

buckets Overflow

pages


Discussion

How do you limit the number of pages accessed when retrieving a tuple, for both external and internal resolution?


How to locate a tuple in a page?

Sequential search

Page directory

hash


Extendible Hashing

The number of buckets can grow/shrink.

An intermediate data structure translates the hash results into page addresses. This data structure needs to be as compact as possible.

Hashes into an array of pointer to buckets (directory).

The array is small enough to be kept in memory.


Directory Growth

• To adapt to dynamically varying size of hash file- modify directory size

• Assume a hash function h(Kb) that produces a bit

string s.

• The directory is of size 2d. d is called the global depth and is initially 0.

• Use least significant d bits of s to determine bucket to access

• Each bucket has a corresponding local depth in the range [0, d]


Example

00x13,0x10 0

d = 0

00x13,0x10 0

d = 1

1

00x10,0x00 1

d = 1

1

0x13,0x07 1

000x10,0x00 1

d = 2

01

210

11

0x13,0x07 2

Insert 0x13, 0x10, 0x07, 0x00, 0x1fEach page can contain no more than 2 tuples


Example – insert 0x1f

000 0x10,0x00 1

d = 3

001

2 010

011

0x13 3

100

101

110

111

0x07,0x1f 3


Performance

2 steps for retrieving a tuple

If we can keep the directory in memory, each retrieval is one page access

Assuming 4 bytes per entry, 4KB pages, 1GB hash files, and we want to keep the entire directory in memory, what is the minimum buffer size?


Discussion

How easy is it to keep the directory in the memory?

How do we reduce the structure when the file shrinks?

spring 2003 ece569 lecture 04-2.1 ece 569 database system engineering spring 2003 yanyong zhang...

Documents