implementation of relational operators/estimated cost 1.select 2.join
TRANSCRIPT
Implementation of Relational Operators/Estimated Cost
1.Select2.Join
Sailors(sid, sname, rating, age)– Each Sailors tuple is 50 bytes long (fixed length record
format)– A page size is 4K bytes– #tuples: 40,000– All pages are full, unpacked bitmap; 96 bytes are
reserved for slot directory– How many pages for Sailors?
• One page can contain at most tuples
• Sailors occupies pages
4096 9680
50
40000500
80
Cost of Selection Operators)( op . RvalueAR
Factors to Consider
)( op . RvalueAR
• No index• unsorted data• sorted data
• Index• tree index• hash-based index
No index, unsorted data
. = (R)R A valueSuppose R is Sailors
Best access path: File ScanI/O Cost: 500 pagesI/O time cost: 500 * time to access each pageComplexity: O(|R|)
Notation: |R| is the number of pages in R
No Index, sorted file on R.A
. = (R)R A valueSuppose R is Sailors
Sorted on R.ABest Access Path:
•Binary search to locate the first tuple with R.A=Value•Scan the remaining records
I/O Cost: log2(|R|)+Cost of scan for remaining tuples (0 ~ |R|)
Tree Index on R.A
Selection Cost = cost of traversing from the root to the leaf + cost of retrieving the pages in the sequence set + cost of retrieving pages containing the data records.
• Need to know– Format of the data entries in the leaf node– Clustered or unclustered – Dense or sparse
. = (R)R A value
Index entries
Data entries
Data Records
Page (node)
Alternatives for Data Entry k* in Index Alternative 1: Data entry is the actual tuple
o Tuples organized according to the search key value
o Allow only one Alternative 1 index per relationA
B
C
D
E
F
G
H
I
J
K
40
20
33
51
63
10
15
27
33
40
55
L 97
F,10
G,15
H,27
I,33
J,40
K,55
L,97
B,20
C,33
D,51
E,63
A,40
Root
Alternatives for Data Entry k* in Index
Alternative 2: Data entry is <k, rid of data record with search key value k>
o Ex. <50, (1, 10)>: search key value = 50, rid tells that the data is in page 1, slot 10
10* 15* 20* 27* 33* 33* 40* 40* 51* 55* 63* 97*
20 33 51 63
40
Root
A
B
C
DE
F
G
H
I
J
K40
20
33
5163
10
15
27
33
40
55
L 97
Alternatives for Data Entry k* in Index
Alternative 3: Data entry is <k, list of rids with search key k>
o Ex: <50, (1,10), (2,20), (3,1)>
search key value = 50; three record IDs
10* 15* 20* 27* 33* 40* 51* 55* 63* 97*
20 33 51 63
40
Root
A
B
C
DE
F
G
H
I
J
K40
20
33
5163
10
15
27
33
40
55
L 97
Index entries
Formats of Data entries
Data Records
Pros and Cons Data entries are typically much smaller than data records. So,
Alternatives 2 and 3 are better than Alternative 1 with large data records, especially if search keys are small.
If more than one index is required on a given file, at most one index can use Alternative 1; the rest must use Alternatives 2 or 3.
Alternative 3 is more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.
Index entries
Data entries
direct search for
(Index File)(Data file)
Data Records
data entriesData entries
Data Records
DENSE INDEX SPARSE INDEX
Dense or Sparse
• Dense: At least one data entry per data record• Sparse: At least one data entry per block/page
Pros and Cons:• Dense: less space-efficient, but great for both equality and range
search• Sparse: more space-efficient, but need sequential search within a
page
Dense or Sparse
10* 15* 20* 27* 33* 33* 40* 40* 51* 55* 63* 97*
20 33 51 63
40
Root
A
B
C
DE
F
G
H
I
J
K40
20
33
5163
10
15
27
33
40
55
L 97
Dense or Sparse
20 33 51 63
40
Root
AB
C
D EF
G H
I J
K4020
33
51 6310
15 27
33 40
55 L 97
Index entries
Data entries
direct search for
(Index File)(Data file)
Data Records
data entriesData entries
Data Records
CLUSTERED INDEX UNCLUSTERED INDEX
Clustered or Unclustered
Clustered Index : • The ordering of data records is organized the same as or
close to the ordering of data entries in the index
• Sparse index is always clustered (why?)• A clustered index does not have to be sparse (why?)
Pros and Cons:• Clustered: maintenance cost high, but great for range search• Unclustered: low maintenance cost, but high retrieval cost
• Retrieving one record may need to load one page
Index Classification
• Primary vs. secondary: If search key contains primary key, then called primary index.– Unique index: Search key contains a candidate key.– Non-unique index: Search key is a non-candidate
attributeo Multiple records may be associated with a same
attribute value
• Support equality and range-searches efficiently• Balanced tree• Search/Insert/delete at log F N cost (F = fanout, N = # leaf
pages)• F => 100• The cost of traversing from root to data entries is usually
regarded as 4• Minimum 50% occupancy (except for root). • Each node except root contains d <= m <= 2d entries.
The root node contains 1<= m <= 2d entries.• The parameter d is called the order of the tree.
Index Entries
Data Entries
("Sequence set")
(Direct search)
B+Tree: the Most Widely Used Index for RDBMS
doubly linked (why?)
B+ Tree Example• Search begins at root, and key comparisons direct it to a leaf.• Search for tuples whose search key = 15
• Follow the left pointer if the desired value is less than the value in the node
• Otherwise, follow the right pointer
Root
17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13
B+ Tree Example
Root
17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13
dense sparse
Dense/SparseClustered/Unclustered
Root
17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
13
B+Tree Index on R.A
– Format of data entry: alternative 2– Size of data entry = 20 bytes; – Page size=4K bytes; 96 bytes are reserved– Total number of records = 100,000; record size = 40
bytes– Reduction Factor = 0.1
• #matching entries/#total entries
Total Cost = Cost of traversing from the root to the leaf (assume 4 I/Os) +Cost of retrieving the pages in the sequence set + the cost of retrieving pages containing the data records
. = (R)R A value
• B+tree, Alternative 2, dense, unclustered• I/O cost of retrieving pages of qualifying data
entries– Matching data entries: 0.1*100000=10,000
entries– #Date entries per page: – Pages of matching data entries = 10000/200 =
50 pages• I/O cost of retrieving qualifying tuples
– 10,000 pages since the index is unclustered, the qualifying tuples are not always in the same order as the data entries.
– In the worst case, for each qualifying data entry, one I/O is needed
• Total I/O Cost = 4+ 50+10,000 pages
4096 96200
20
Data Records
UNCLUSTERED INDEX
Data entries
• Format of data entry: alternative 2• Size of data entry = 20 bytes; • Page size=4K bytes; 96 bytes are reserved• Total number of records = 100,000; record size = 40 bytes• Reduction Factor = 0.1
• B+tree, Alternative 2, dense, clustered• I/O cost of retrieving pages of qualifying data
entries– Matching data entries: 0.1*100000=10000 entries– #Date entries per page: – #Pages of matching data entries =
• I/O cost of retrieving qualifying tuples– #Matching tuples: 10000 – Since the index is dense and clustered, the
qualifying tuples are also clustered– # pages: 10000/100=100 due to
(4096-96)/40=100 tuples per page• Total I/O Cost = 4+ 50+100 pages
4096 96200
20
10000
50200
Data entries
Data Records
CLUSTERED INDEX
• B+tree, Alternative 2, sparse (must be clustered)• I/O cost of retrieving qualifying tuples
– #Matching tuples: 0.1*100,000=10,000 – Since the index is clustered, the qualifying tuples are also
clustered– # pages: due to 100 tuples per page
• I/O cost of retrieving pages of qualifying data entries– Matching data pages: 100 – #Data entries per page: – #Pages of matching data entries = =1 page
• Total I/O Cost = 4+ 1+100 pages
4096 96200
20
100
200
10000
100
Data entries
Data Records
CLUSTERED INDEX
SPARSE INDEXEach entry links to one page
Hash-based Index
• Hash-based index on A is good for equality search on attribute A– Usually cannot support range search
. = (R)R A value
Data File
3000
3000
5004
4003
2007
6003
6003
5004
Ashby, 25, 3000
Smith, 44, 3000
Bristow, 30, 2007
Basu, 33, 4003
Cass, 50, 5004
Tracy, 44, 5004
Daniels, 22, 6003
Jones, 40, 6003
h2sal
h2(sal)=00
h2(sal)=11
Hashed index
Data entriesData records
I/O costs1) Cost for retrieving the
matching data entries2) Cost for retrieving the
qualifying tuples
I/O cost = cost for retrieving the matching data entries
+ cost for retrieving the qualifying tuples
• I/O cost of retrieving pages of matching data entries– Matching data entries: 0.1*100000=10000 entries– I/O cost = 10000*1.2 = 12000 I/Os
• I/O cost for retrieving the qualifying tuples = 12000 I/Os
• Total cost = 12000+12000 = 24000 I/Os
Total number of records: 100,000Reduction Factor: 0.1Cost for searching the matching data entry: 1.2 I/OAssume dense and unclustered
Factors to Consider
)( op . RvalueAR• No index
• unsorted data• sorted data
• Index• tree index• hash-based index