sistemas de informação e bases de dados - indexing...j’ssearch-keyvalues p n...

Sistemas de Informação e Bases de Dados

Indexing

SIBD 2019/2020

Departamento de Engenharia InformáticaInstituto Superior Técnico

Os slides das aulas de SIBD são baseados nos slidesoficiais dos livros Database Management Systems, deRamakrishnan e Gehrke, e Database System Con-cepts, de Silberschatz, Korth e Sudarshan

http://pages.cs.wisc.edu/~dbbook/

http://db-book.com/

http://db-book.com/

Outline

Basic Concepts

B+-Tree Indices

Hash Indices

Indices on Multiple Keys

Indices for Full-Text Search

Indices and SQL

1

Outline

Basic Concepts

B+-Tree Indices

Hash Indices



Indices and SQL

2

Indexing

• Indexing mechanisms can speed up access to desired data.• For instance, an author catalog in a library

• In a database, they serve the same purpose

• An index file consists of records (index entries) of the form

search-key pointer

• A search key is an attribute or set of attributes used to look uprecords in a file

• Index files are typically much smaller than the original file• Two basic kinds of indices:

• Ordered indices: search keys are stored in sorted order• Hash indices: search keys are distributed uniformly across

“buckets” using a “hash function”

• Specialized indexes for text, geospatial data, etc.

3

Index Evaluation Metrics

Indices are chosen according to:

• Access types supported efficiently• records with a specified value in the attribute• records with an attribute value falling in a specified range• specialized indexes (e.g., for text), for finding records with a

textual attribute matching particular terms

• Access time

• Insertion time

• Deletion time

• Space overhead

4

Types of Indices (Ordered Indices)

• In an ordered index, index entries are stored sorted on thesearch key value• For instance, author an catalog in a library

• Primary index: in a sequentially ordered file, the index whosesearch key specifies the sequential order of the file• Also called clustering index• The search key of a primary index is usually but not necessarily

the primary key.

• Secondary index: an index whose search key specifies an orderdifferent from the sequential order of the file• Also called non-clustering index

5

Secondary Indices

• Frequently, one wants to find all the records whose values in acertain field (which is not the search-key of the primary index)satisfy some condition• Example 1: In the account relation stored sequentially by

account number, we may want to find all accounts in aparticular branch

• Example 2: as above, but where we want to find all accountswith a specified balance or range of balances

• We can have a secondary index with an index record for eachsearch-key value• Index record points to a bucket that contains pointers to all

the actual records with that particular search-key value• Some queries will only require attributes from the index

records (i.e., covering indexes)

6

Secondary Indices (cont.)

Secondary index on balance field of account

7

Dense Index Files

• Dense index: index record appears for every search-key value inthe file

8

Sparse Index Files

• Sparse Index: contains index records for only some search-keyvalues

9

Sparse Index Files (cont.)

• Applicable when records are sequentially ordered on search-key

• To locate a record with search-key value K we:• Find index record with largest search-key value < K

• Search file sequentially starting at the record to which theindex record points

• Less space and less maintenance overhead for insertions anddeletions

• Generally slower than dense index for locating records

• Good tradeoff: sparse index with an index entry for every blockin file, corresponding to least search-key value in the block

• Secondary indices have to be dense

10

Using Indices

• Indices offer substantial benefits when searching for records

• Indices can also be used to speed-up join operations, or tospeed-up data sorting and aggregation

• When a file is modified, every index on the file must beupdated: updating indices imposes overhead on databasemodifications

• Sequential scan using primary index is efficient, but asequential scan (or a join) using a secondary index is expensive• each record access may fetch a new block from disk

11

Using Indices (Selection Operation)

• Linear search: Scan each file and test all records to seewhether they satisfy the selection condition.• Can be applied regardless of selection condition, or regardless

of the availability of indices

• Index scan algorithms: Several different search algorithms thatuse an index:• Clustering ordered index, equality on candidate key: Retrieve a

single record that satisfies the corresponding equality condition• Non-clustering ordered index, equality: Retrieve a single record

if the search-key is a candidate key (cheap), or retrieve multiplerecords if search-key is not a candidate key (expensive).

12

Using Indices (Join Operation)

• Nested-loop join: Examines every pair of tuples in the tworelations, scanning the inner relation once per tuple (or block)in the outer relation.

• Indexed nested-loop join: Use index lookup to replace file scanin inner relation:• Can be used if join is an equi-join or natural join• Can be used if index is available on the inner relation’s join

attributes

• Merge-join: Can be used only for equi-joins and natural joins,exploring both relations sorted on their join attributes• Can sort one of the relations just to compute the join

13

Outline

Basic Concepts

B+-Tree Indices

B+-Tree Nodes

Queries and Updates on B+-Trees

Hash Indices



Indices and SQL14

B+-Tree Index Files

B+-tree indices are an alternative to indexed-sequential files

• Disadvantage of indexed-sequential files:• Performance degrades as file grows, since many overflow

blocks get created• Periodic reorganization of entire file is required

• Advantage of B+-tree index files:• Automatically reorganizes itself with small, local, changes, in

the face of insertions and deletions• Reorganization of entire file is not required to maintain

performance

• Disadvantage of B+-trees: extra insertion and deletionoverhead, space overhead• Advantages of B+-trees outweigh disadvantages, and they are

used extensively

15

B+-Tree Index Files (cont.)

A B+-tree is a rooted tree satisfying the following properties:

• All paths from root to leaf are of the same length

• Each node that is not a root/leaf has between n/2 and n

children (i.e., n is the maximum number of node pointers)

• A leaf node has between (n − 1)/2 and n − 1 values

• Special cases:• If the root is not a leaf, it has at least 2 children• If the root is a leaf (that is, there are no other nodes in the

tree), it can have between 0 and n − 1 values

16

Example of a B+-Tree

B+-tree for account file (n = 3)

B+-tree for account file (n = 5)

17

B+-Tree Node Structure

• Typical node:

P1 K1 P2 . . . Pn−1 Kn−1 Pn

• Ki are the search-key values• Pi are pointers to children (for non-leaf nodes) or pointers to

records or buckets of records (for leaf nodes)

• The search-keys in a node are ordered

K1 < K2 < K3 < · · · < Kn−1

18

Leaf Nodes in B+-Trees

Properties of a leaf node:

• For i = 1, 2, . . . , n − 1, pointer Pi either points to a file recordwith search-key value Ki , or to a bucket of pointers to filerecords, each record having search-key value Ki

• Only needs bucket structure if search-key does not form acandidate key

• If Li , Lj are leaf nodes and i < j , Li ’s search-key values are lessthan Lj ’s search-key values• Pn points to next leaf node in search-key order

19

Non-Leaf Nodes in B+-Trees

• Non leaf nodes form a multi-level sparse index on the leafs

• For a non-leaf node with m pointers:• All the search-keys in the subtree to which P1 points are less

than K1

• For 2 ≤ i ≤ m − 1, all the search-keys in the subtree to whichPi points have values greater than or equal to Ki−1 and lessthan Ki

• All the search-keys in the subtree to which Pm points aregreater than or equal to Km−1

P1 K1 P2 . . . Pm−1 Km−1 Pm

20

Observations about B+-trees

• Since the inter-node connections are done by pointers,“logically” close blocks need not be “physically” close

• The non-leaf levels of the B+-tree form a hierarchy of sparseindices

• The B+-tree contains a relatively small number of levels(logarithmic in the size of the main file), thus searches can beconducted efficiently

• Insertions and deletions to the main file can be handledefficiently, as the index can be restructured in logarithmic time

21

Queries on B+-Trees

To find all records with a search-key value of k :

1. Start with the root node1.1 Examine the node for the smallest search-key value > k .1.2 If such a value exists (assume it is Ki ), then follow Pi to the

child node1.3 Otherwise follow Pm to the child node (k ≥ Km−1, where there

are m pointers in the node)

2. If the node reached by following the pointer above is not a leafnode, repeat step 1 on the node

3. Else we have reached a leaf node3.1 If for some i , key Ki = k follow pointer Pi to the desired

record or bucket3.2 Else no record with search-key value k exists

22

Queries on B+-Trees (cont.)

• In processing a query, a path is traversed in the tree from theroot to some leaf node

• If there are K search-key values in the file, the path is nolonger than dlogdn/2e(K )e• A node is generally the same size as a disk block, typically 4

kilobytes, and n is typically around 100 (40 bytes per indexentry)

• With 1 million search key values and n = 100, at mostlog50(1 000 000) = 4 nodes are accessed in a lookup• Contrast this with a balanced binary free with 1 million search

key values—around 20 nodes are accessed in a lookup• The above difference is significant since every node access may

need a disk I/O, costing around 20 milliseconds!

23

Insertion on a B+-Tree

• Find the leaf node in which the search-key value would appear

• If the search-key value is already there in the leaf node, recordis added to file and if necessary a pointer is inserted into thebucket.

• If the search-key value is not there, then add the record to themain file and create a bucket if necessary. Then:• If there is room in the leaf node, insert (key-value, pointer)

pair in the leaf node• Otherwise, split the node (along with the new (key-value,

pointer) entry)

24

Splitting a Node

• To split a node:• take the n (search-key,pointer) pairs (including the one being

inserted) in sorted order• place the first dn/2e in the original node, and the rest in a new

node• let the new node be p, and let k be the smallest key value in p

• Insert (k ,p) in the parent of the node being split• If the parent is full, split it and propagate the split further up

• The splitting of nodes proceeds upwards till a node that is notfull is found

• In the worst case the root node may be split increasing theheight of the tree by 1

25

Insertion Example

B+-Tree before and after insertion of Clearview

26

Splitting a Node

Result of splitting node containing Brighton and Downtown oninserting Clearview

(example)

27

http://goneill.co.nz/btree-demo.php

Outline

Basic Concepts

B+-Tree Indices

Hash Indices

Static Hashing



Indices and SQL

28

Static Hashing

• A bucket is a unit of storage containing one or more records (abucket is typically a disk block)

• In a hash file organization we obtain the bucket of a recorddirectly from its search-key value using a hash function

• Hash function h is a function from the set of all search-keyvalues K to the set of all bucket addresses B

• Hash function is used to locate records for access, insertion,and deletion

• Records with different search-key values may be mapped tothe same bucket; thus entire bucket has to be searchedsequentially to locate a record

29

Hash File Organization

Hash file organization of account file, using branch_name as key

• There are 10 buckets

• The binary representation ofthe ith character is assumed tobe the integer i

• The hash function returns thesum of the binaryrepresentations of thecharacters modulo 10

• E.g. h(Perryridge) = 5,h(Round Hill) = 3,h(Brighton) = 3

30

Hash Functions

• Worst hash function maps all search-key values to the samebucket

• An ideal hash function is uniform, i.e., each bucket is assignedthe same number of search-key values from the set of allpossible values

• Ideal hash function is random, so each bucket will have thesame number of records assigned to it irrespective of theactual distribution of search-key values in the file

• Typical hash functions perform computation on the internalbinary representation of the search-key.• For example, for a string search-key, the binary representations

of all the characters in the string could be added and the summodulo the number of buckets could be returned

31

Handling of Bucket Overflows

• Bucket overflow can occur because of• Insufficient buckets• Skew in distribution of records. This can occur due to two

reasons:• multiple records have same search-key value• hash function produces non-uniform distribution of key values

• Although the probability of bucket overflow can be reduced, itcannot be eliminated; it is handled by using overflow buckets

32

Handling of Bucket Overflows (cont.)

• Overflow chaining - the overflow buckets of a given bucket arechained together in a linked list

• Above scheme is called closed hashing• An alternative, called open hashing (linear probing), which does not

use overflow buckets, is not suitable for database applications33

Hash Indices

• Hashing can be used not only for file organization, but also forindex-structure creation

• A hash index organizes the search keys, with their associatedrecord pointers, into a hash file structure

• Strictly speaking, hash indices are always secondary indices• if the file itself is organized using hashing, a separate primary

hash index on it using the same search-key is unnecessary• however, we use the term hash index to refer to both

secondary index structures and hash organized files

34

Example of Hash Index

35

Deficiencies of Static Hashing

• In static hashing, function h maps search-key values to a fixedset of B of bucket addresses• Databases grow with time. If initial number of buckets is too

small, performance will degrade due to too much overflows• If file size at some point in the future is anticipated and

number of buckets allocated accordingly, significant amount ofspace will be wasted initially• If database shrinks, again space will be wasted• One option is periodic re-organization of the file with a new

hash function, but it is very expensive• These problems can be avoided with dynamic hashing

techniques, allowing the number of buckets to be modifieddynamically

36

Outline

Basic Concepts

B+-Tree Indices

Hash Indices



Indices and SQL

37

Multiple-Key Access

• Use multiple indices for certain types of queries

Exampleselect account_number

from account

where branch_name = ’Perryridge’ and balance = 1000

• Possible strategies for processing query using indices on singleattributes:1. Use index on branch_name to find

branch_name = ’Perryridge’; test for balance = $10002. Use index on balance to find balance = $1000; test for

branch_name = ’Perryridge’3. Use branch_name index to find pointers to all records

pertaining to the Perryridge branch; similarly, use index onbalance; take intersection of both sets of pointers obtained

38


• Composite search keys are search keys containing more thanone attribute• E.g. (branch_name, balance)

• Lexicographic ordering: (a1, a2) < (b1, b2) if either• a1 < b1, or• a1 = b1 and a2 < b2

39

Examples

• With the where clause

where branch_name = ’Perryridge’ and balance = 1000

the index can be used to fetch only records that satisfy bothconditions• Using separate indices in less efficient — we may fetch many

records that satisfy only one of the conditions.• Can also efficiently handle

where branch_name = ’Perryridge’ and balance < 1000

• But cannot efficiently handle

where branch_name < ’Perryridge’ and balance = 1000

• May fetch many records that satisfy the first but not thesecond condition

40

Outline

Basic Concepts

B+-Tree Indices

Hash Indices



Indices and SQL

41

Full-Text Search

• Several relational database management systems (DBMSs)nowadays implement full-text search functionalities• Let users run full-text queries (queries based on search terms)

against character-based data stored in tables• Specialised full-text indexes support these operations

• Implementations rely on concepts and techniques from thearea of information retrieval• The ad-hoc search task concerns with retrieving relevant

documents to user queries expressing information needs.

42

Information Retrieval (IR) Concepts

• A document collection is a set of separate documents• In relational DBMSs implementing full-text search, the values

within different records for a (set of) textual attribute(s),within a relation, can be seen as a document collection

• Documents are composed of terms, and we conceptuallymodel documents as bags of terms or as vectors

• Full-text indexes provide efficient access to documents, givenone or more search terms• A common choice of data structure is an inverted index

• Document statistics stored in the indexes can be used to sortmatching documents according to relevance estimates• Often using variations of the vector space model for IR

43

An Example: Representing Documents

Example documentI heartily accept the motto, “That government is best whichgoverns least”; and I should like to see it acted up to morerapidly and systematically. Carried out, it finally amountsto this, which also I believe—“That government is bestwhich governs not at all”; and when men are prepared forit, that will be the kind of government which they will have.

44


Modeling documents as bags of terms

I

3

accept

1

acted

2

all

3

also

1

amounts

1

and

3

are

1

at

1

be

1

believe

1

best

2

carried

1

finally

1

for

1

government

3

governs

2

have

1

heartily

1

is

2

it

3

kind

1

least

1

like

1

men

1

more

1

motto

1

not

1

of

1

out

1

prepared

1

rapidly

1

see

1

should

1

systematically

1

that

3

the

2

they

1

this

1

to

3

up

1

when

1

which

4

will

2

44



I 3 accept 1 acted 2 all 3 also 1 amounts 1and 3 are 1 at 1 be 1 believe 1 best 2

carried 1 finally 1 for 1 government 3 governs 2 have 1heartily 1 is 2 it 3 kind 1 least 1 like 1men 1 more 1 motto 1 not 1 of 1 out 1

prepared 1rapidly 1 see 1 should 1 systematically 1 that 3the 2 they 1 this 1 to 3 up 1 when 1

which 4 will 2

44



I 3

accept 1 acted 2 all 3

also 1

amounts 1

and 3

are 1

at 1

be 1 believe 1 best 2carried 1 finally 1

for 1

government 3 governs 2 have 1heartily 1 is 2

it 3

kind 1 least 1 like 1men 1 more 1 motto 1

not 1 of 1

out 1prepared 1rapidly 1 see 1 should 1 systematically 1

that 3the 2 they 1 this 1 to 3 up 1 when 1

which 4

will 2

44


Modeling documents as vectors

accept acted all . . . government governs . . .d1 1 2 3 . . . 3 2 . . .

d2 0 1 0 . . . 2 2 . . .

d3 0 2 0 . . . 1 0 . . .

d4 2 0 2 . . . 2 1 . . .. . .

44

An Example: Indexing a Document Collection

Doc. Text1 That government is best which governs least2 That government is best which governs not at all3 The mass of men serve the state not as men,

but as machines4 Wooden men can be manufactured that will serve the

purpose as well5 Government is at best but an expedient6 But most governments are usually inexpedient

45

An Inverted Index

Lexicon

Num. Term1 best2 expedient3 government4 governs5 inexpedient6 least7 machines8 manufactured9 mass10 men11 purpose12 serve13 state14 wooden

Inverted file

Num. Inverted list1 〈3; 1, 2, 5〉2 〈1; 5〉3 〈4; 1, 2, 5, 6〉4 〈2; 1, 2〉5 〈1; 6〉6 〈1; 1〉7 〈1; 3〉8 〈1; 4〉9 〈1; 3〉10 〈2; 3, 4〉11 〈1; 4〉12 〈2; 3, 4〉13 〈1; 3〉14 〈1; 4〉 46

An Inverted Index (With Term Frequency Information)

Lexicon

Num. Term Doc. Freq.1 best 32 expedient 13 government 44 governs 25 inexpedient 16 least 17 machines 18 manufactured 19 mass 110 men 211 purpose 112 serve 213 state 114 wooden 1

Inverted File

Inverted list(1; 1), (2; 1), (5; 1)(5; 1)(1; 1), (2; 1), (5; 1), (6; 1)(1; 1), (2; 1)(6; 1)(1; 1)(3; 1)(4; 1)(3; 1)(3; 2), (4; 1)(4; 1)(3; 1), (4; 1)(3; 1)(4; 1) 47

Query Processing with Inverted Indexes

• Query processing involves searching the index andmanipulating the inverted lists

• For instance:

• Query “government and men”• Query “men or machines”• Query “men and not machines”

Besides Boolean retrieval, typical implementations also offermechanisms for ranking results according to relevance estimates

48



• For instance:• Query “government and men”

• Query “men or machines”• Query “men and not machines”


48



• For instance:• Query “government and men” ⇒ intersection between inverted

lists 3 and 10 (= ∅);

• Query “men or machines”• Query “men and not machines”


48




lists 3 and 10 (= ∅);• Query “men or machines”

• Query “men and not machines”


48




lists 3 and 10 (= ∅);• Query “men or machines” ⇒ union between inverted lists 10

and 7 (= {3, 4});

• Query “men and not machines”


48





and 7 (= {3, 4});• Query “men and not machines”


48





and 7 (= {3, 4});• Query “men and not machines” ⇒ inverted list 10 except

inverted list 7 (= {4});


48

The Vector Space Model for Ranked Retrieval

Concepts:

• Documents are conceptually represented as vectors of termweights• ~dj = (w1,j ,w2,j , . . . ,wt,j)

• wi,j is the weight of term i in document j

• Queries are also conceptually represented as vectors• ~q = (w1,q,w2,q, . . . ,wt,q)

• Vector operations can be used to compare queries withdocuments (or documents with documents)

Two important questions:

1. How can we define term weights?

2. How can we compare documents to queries?

49

Defining Term Weights — TF

Term frequencyTerm frequency is a measure of term importance within adocument

DefinitionLet N be the total number of documents in the collection and ni

be the number of documents in which term ki appears. Thenormalized frequency of a term ki in document dj is given by:

fi ,j =freqi ,j

maxl freql ,j

In the previous equation, freqi ,j is the number of occurrences ofterm ki in document dj .

50

Defining Term Weights — IDF

(Inverse) Document frequencyDocument frequency is a measure of term importance within adocument collection

DefinitionThe inverse document frequency of a term ki is given by:

idfi = log

(N

ni

)

51

Defining Term Weights — TF-IDF

DefinitionThe weight of a term ki in document dj according to the vectorspace model is given by the tf-idf formula:

wi ,j = fi ,j × log

(N

ni

)

52

Document Similarity

• Similarity between documents and queries is a measure of thecorrelation between their vectors

• Documents/queries that share the same terms, with similarweights, should be more similar

• Thus, as similarity measure, we use the dot product betweenthe vectors, or the cosine of the angle between the vectors

sim(dj , q) =~dj · ~q|~dj | × |~q|

=

∑ti=1 wi ,j × wi ,q√∑t

i=1 w2i ,j ×

√∑ti=1 w

2i ,q

53

Improvements over TF-IDF and The Vector Space Model

The BM25 ModelConsider not only the term frequency and inverse documentfrequency heuristics, but also the document length as anormalization factor for the term frequency

TFi ,j =fi ,j × (k1 + 1)

fi ,j + k1 ×(1− b + b

|dj |avgdl

)IDFi = log

N − ni + 0.5ni + 0.5

sim(dj , q) =∑i∈q

IDFi × TFi ,j

54

Outline

Basic Concepts

B+-Tree Indices

Hash Indices



Indices and SQL

55

Index Definition in SQL

• Create an index:

CREATE INDEX <index-name> ON<relation-name>(<attribute-list>)

• E.g.: CREATE INDEX branch_index onbranch(branch_name)

• Can select the index type with using <type>

• To drop an index:

drop index <index-name>

56

Full-Text Indices in MySQL

• Create an inverted index over a textual attribute:

CREATE FULLTEXT INDEX <index-name> on<relation-name>(<text-attributes>)

• As shown previously, inverted indexes store a list of words, andfor each word, the list of records where the word appears in.

• To use the full-text index:

SELECT * , MATCH (<text-attributes>) AGAINST (’query’) as relevanceFROM <relation-name>

WHERE MATCH (<text-attributes>) AGAINST (’query’)ORDER BY relevance;

57

Questions?

57