datab ch3 file
TRANSCRIPT
-
7/28/2019 DataB Ch3 File
1/15
File Organization Chapter_3 (1 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
File Organization for DBMS
Physical Storage Media
Several types of data storage exist in most computer systems. These storage
media are classified by the speed with which data can be accessed, by the cost per unit
of data to buy the memory, and by how reliable they are. Among the media
available are:
Cache :This is the fastest and most costly form of storage. The size ofcache memory is very small and the use of cache is managed by theoperating system.
Main Memory : This is the storage media used for data that isavailable to be operated on. The general-purpose machine instructions
operate on main memory. The main memory is sometimes referred to
as core memory. The contents of the main memory are usually lost
if a power failure or system crash occurs.
Disk Storage :This is the primary medium for the long-term storageof data. Typically the entire data base is stored on disk. Data must be
moved from disk to main memory in order for the data to be operated
on. After operations are performed, data must be returned to disk.
Disk storage is referred to as direct-access storage because it is
possible to read data on disk in any order. Disk storage usually
survives power failure and system crashes.
Tape Storage :This is storage used primarily for backup and archivaldata. Although tape is much cheaper than disk, access to data is much
slower, since the tape must be read sequentially from the beginning.
For this reason, tape storage is referred to as sequential-access
storage and is used primarily for recovery from failures. Tape devices
are less complex than disks; thus, they are more reliable.
The figure below shows a simple disk. The head is a device which stays closeto the surface of the platter and reads or writes information encoded magnetically on
-
7/28/2019 DataB Ch3 File
2/15
File Organization Chapter_3 (2 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
the platter. The platter is organized into concentric tracks of data. The arm can be
positioned over any one of the tracks. The platter is spun at a high speed. To read or
write information, the arm is positioned over the correct track; and, when the data to
be accessed passes under the head, the read or write operation is performed.
Since the platter rotates at a high speed, it does not take very long for the
contents of an entire track to pass under the head. This time is known as the disk
latency time. The time for repositioning the arm, known as seek time, is relatively
longer than the latencytime. It is useful to store related information on the same track
or on physically close tracks in order to minimize the seek time.
It is typical to have high-capacity disks with multiple platters. Multiple-platter
disks are called disk packs. All the disk arms are moved as a unit by the actuator.Each arm has two heads: one to read and write the top surface of the platter below it,
Platter
Head
Arm
Track 0
Track 1
Track n
Disk latency time
Seek time
Platter 0
actuator
Arm 0
Arm 1
Arm 2
Platter 1Platter 2
Platter 3Disk pack
-
7/28/2019 DataB Ch3 File
3/15
File Organization Chapter_3 (3 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
and one to read and write the bottom surface of the platter above it. At any moment,
the set of tracks over which the heads are located form a cylinder. This cylinder holds
the data that is accessible without any movement of the actuator. That is, all data
within this cylinder is accessible within the disk latency time. Just as it is efficient to
store related data in a single track or a collection of close tracks, so it is efficient to
store related data in the same cylinder, or in cylinders that are close to one another.
Data is transferred between disk and main memory in units called blocks. A
block is a contiguous sequence of bytes from a single track of one platter. Block sizes
range from 512 bytes to several thousand bytes.
File Organization
A f i leis organized logically as a sequence of records. These recordsare mapped onto disk blocks. Files are provided as a basic construct in operating
systems. Although blocks are of a fixed size determined by the physical properties of
the disk and by the operating system, record sizes vary. For example, in a network
database, it is likely that he owner record type is of a different size than the member
record type. One approach to mapping the database to files is to use several files and
store records of only one fixed length in any file. An alternative is to structure the
files such that we can accommodate multiple lengths for records.
Let us consider ofdepositrecords for our bank database. Each record of this
file is defined as follows:
type deposi t= recordbranch_name : string[20];account_number : integer;customer_name : string[20];balance : real;end;
Table 1 :
record 0 Aden 102 Ali 12,000record 1 Arwa 200 Ahmed 50,000
blocks
-
7/28/2019 DataB Ch3 File
4/15
File Organization Chapter_3 (4 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
record 2 National 115 Abdulla 20,000
record 3 Watni 333 Salem 1000
record 4 Taiz 209 Omer 30,000
record 5 Islamic 191 Aziz 10,000
record 6 Crater 555 Mohd 10,000
record 7 Maala 524 Jamil 12,000
record 8 Sanaa 882 Salah 15,000
If we assume that each character occupies a byte, an integer occupies 4 bytes,
and a real 8 bytes, our deposit record is 52 bytes long. A simple approach is to use
the first 52 bytes for the first record, next 52 bytes for the second record, and so on
(Table 1). The problem here are :
It is difficult to delete a record.
Unless the block size happens to be a multiple of 52, some records will
cross block boundaries. That is part of the record will be stored in oneblock and part in another. It would thus require two block accesses to
read or write such a record.
When a record is deleted, we could move the record that came after it into the
space formerly occupied by the deleted record, and so on, until every record following
the deleted record has been moved ahead. (Table 2). This approach will require
moving a large number of records. The second approach is to move the last record of
the file into the space of the deleted record.(Table 3).
Table 2 :record 0 Aden 102 Ali 12,000
record 1 Arwa 200 Ahmed 50,000
record 3 Watni 333 Salem 1000
record 4 Taiz 209 Omer 30,000
record 5 Islamic 191 Aziz 10,000
record 6 Crater 555 Mohd 10,000
record 7 Maala 524 Jamil 12,000
record 8 Sanaa 882 Salah 15,000
Table 3 :record 0 Aden 102 Ali 12,000
record 1 Arwa 200 Ahmed 50,000
record 8 Sanaa 882 Salah 15,000
record 3 Watni 333 Salem 1000
record 4 Taiz 209 Omer 30,000
record 5 Islamic 191 Aziz 10,000
record 6 Crater 555 Mohd 10,000
record 7 Maala 524 Jamil 12,000
Since the insertion tent to be more frequent than deletions, it is acceptable to
leave the space of the deleted record open. At the beginning of the file, we allocate af i le header. This header contains a variety of information about the file. It may store
-
7/28/2019 DataB Ch3 File
5/15
File Organization Chapter_3 (5 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
the address of the first record whose contents are deleted. In addition, we use this first
record to keep the address of the next record whose contents are deleted, and so on.
We may think of these stored addresses as pointers. (Table 4).
Table 4:
record 0 Aden 102 Ali 12,000
record 2 National 115 Abdulla 20,000
record 3 Watni 333 Salem 1000
record 5 Islamic 191 Aziz 10,000
record 7 Maala 524 Jamil 12,000
record 8 Sanaa 882 Salah 15,000
f i le organization
Just as arrays, trees, and other data structures are used to implement data
organization in main memory. A number of strategies are used to support theorganization of data in secondary memory. The four fundamental file organization
techniques are:
(1) Sequential,(2) Indexed Sequential,(3) Relative,(4) Multi-Key.
The file organizations differ two basic ways. First the organization determines
the filesrecordsequencing, which is the physical ordering of the records onstorage. Second, the file organization determines the set of operation necessary
to fine particular records. The organization most appropriate for a particular
file is determine by the operational characteristics of the storage medium used
and the nature of the operations to be performed on the data. The most
important storage characteristic of the storage device is that whether the device
allows direct accessto particular record, or allows only sequenti al accessto
record occurrences. Magnetic disks are examples of direct access storage
devices (DASDs), and magnetic tapes are examples of sequential storage
devices.
-
7/28/2019 DataB Ch3 File
6/15
File Organization Chapter_3 (6 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
SEQUENTIAL FILEORGANIZATION
A sequentialf i leis designed for efficient processing of records in sorted order
based on some searchkey. The file records are written consecutively when thefile is created and must be accessed consecutively when the file used. The
records are maintained in the logical sequence of theirprimarykey values.
The processing of a sequential file is simple but inefficient for random access.
However, if access to the file is strictly sequential, a sequential file is suitable.
The file could be stored on a sequential storage device such as a magnetic tape.
Search for a given record in a sequential file requires, on average, access to
half the records in the file. Updating usually requires the creation of a new file.
To maintain file sequence, records are copied to the point where the updating is
required. The changes are then made and copied into the new file. Followingthis, the remaining records are copied. This method of updating a sequential file
creates an automatic backup copy. Addition can be handled similar to updating.
Inversely, deletion of a record requires a compression of the file space, achieved
by shifting of records.
The basic advantage offered by a sequential file is the ease of access to the
next record, the simplicity of organization and the absence of auxiliary data
structure. However, replies to simple queries are time consuming for large files.
A single update is an expensive proposition if a new file must be created. To
reduce the cost per update, all the requests are batched, sorted in the order of the
sequential file and then executed in one pass. Such a file that contains the
updates is called a transaction f il e.
A possible method of reducing the creation of a new file at each update run is to
create original file with holes (space left for the addition of new records). As such,
if a block could hold K records, then at initial creation it is made to contain only L*Krecords, where 0
-
7/28/2019 DataB Ch3 File
7/15
File Organization Chapter_3 (7 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
An index is a set ( x , address ) pairs. Indexing associates a set of objects toa set of orderable quantities, which are usually smaller in number or their
properties provide a mechanism for faster search. The purpose of indexing is to
speed the search process. A sequential (or sorted on primary keys) file that is
indexed is called an index sequential file. The index provides for randomaccess to records, while the sequential nature of the file provides easy access to
the subsequent records. An additional feature of this file system is the
OVER_FLOWarea, which provides additional space for record addition without
necessitating the creation of a new file.
TYPES OF INDEXES
The idea behind an index access structure is similar to that behind the indexesused in textbooks. An index is usually defined on a single field of a file, called an
indexing field. The index stores each value of the indexfieldalong with a list ofpointers to all disks blocks that contain a record with that field value. The values
of the index are ordered so that we can do a binarysearch on the index. There
are several types of indexes.
Primary index: Is an index specified on the orderingkey field of an orderedfile of records. The ordering key field is used to physically order the file records
on disk, and every record has a unique value for that field.
Clustering index: If the ordering field is not a key field, that is, severalrecords in the file can have the same value for the ordering field.
Secondary index: Can be specified on any non-ordering field of a file. Thefile can have several secondary indexes in addition to its primary access method.
a) PRIMARY INDEXES
-
7/28/2019 DataB Ch3 File
8/15
File Organization Chapter_3 (8 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
A primary index is an ordered file whose records are of fixed length with two
fields. Thefirst fieldis of the same data types as the ordering key field of the data
file, and thesecond fieldis a pointer to a disk block. The ordering key field is called
the primary key of the data file.
The first record in each block of the data file is called the anchor record of the
block.
BLOCK BLOCK
ANCHOR POINTER
Aaron
Adams
Wright
NAME SSN SALARY
ADDRESS
Aaron 100 3000 112 road
Abbot 101 12000 Keen 10
Acosta
Adams
.
Akers
Wright
.
.
Zimmer
b) CLUSTERING INDEXES
If records of a file are physically ordered on a non-key field that does not have
a distinct value for each record, that field is called the clustering field of the file.
A clustering index is an ordered file with two fields, the first field is of the same type
as the clustering field of the data file, and the second field is a block pointer.
CLUSTERING BLOCK
FIELD POINTER
1
2
8DEPARTMENT NAME SSN SALARY ADDRESS
1 Aaron 100 3000 112 road
1 Abbot 101 12000 Keen 10
1
1 Acosta
2
-
7/28/2019 DataB Ch3 File
9/15
File Organization Chapter_3 (9 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
2
2
2
8
8
8
8
c) SECONDARY INDEXES
A secondary index is also an ordered file with two fields, the second field is a
pointer to a disk block.
1
2
3
45
6
9 Aaron 100
6 Abbot 101
2
1
3
4
5
15
..
The first field is of the same data type as some non-ordering field of the data
file. The field on which the secondary index is constructed is called an indexing
field of the file, whether its values are distinct for every record or not.
A secondary index will usually need substantially more storage space than a primary
index because of its larger number of entries.
d) MULTI-LEVELINDEXING SCHEMES
For a large file, it is possible to create a hierarchy of indexes with the lowest
level index pointing to the records, while the higher-level index point to the indexes
below them. The lowest level index consists of the pair of each record
in the file.
key key key key
-
7/28/2019 DataB Ch3 File
10/15
File Organization Chapter_3 (10 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
1000 I11 200 I21 60 I31 10 P11
2000 I12 450 I22 99 I32 29 P12
780 I23 150 55
1000 200 60
999
1100 210 7688
2000 450 99
-
7/28/2019 DataB Ch3 File
11/15
File Organization Chapter_3 (11 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
ISAM Technique
When a record is stored by ISAM (Index Sequential Access Method), itsrecord key must be one of the fields in the record. Each record is stored on one of the
tracks of a disk. If room does not permit, they are spilled over onto the next track in
the same cylinder.
Figure below shows two cylinders of records, but only their keys are shown.
When ISAM retrieves a record, it needs to know the cylinder, the track address, and
the record key.
CYLINDER
1
1 50 60 70 80 90
2 100 110 120 130 1403 150 160
930 940
19 950 960 970 980 990
20 1000 1010 1020 1050 1060
CYLIND
ER
2
1 1090 1100 1230 1250 1300
2 1345 1560 1600 1700 1711
3 1900
19
20 2990 3001
Cylinder 0Cylinder 1
Cylinder 2Top not used
Track
Track
TrackTrack
-
7/28/2019 DataB Ch3 File
12/15
File Organization Chapter_3 (12 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
DIRECT FILEORGANIZATION
In the index-sequential file organization, the mapping from the search-key
value to the storage location is via index entries.In direct file organization, the key value is mapped directly to the storage location.
The usual method of direct mapping is by performing some arithmetic manipulation
of the key value. This process is called HASHING.
Let us consider a hash function hthat maps the key valueKto the value h(K). The
value h(K) is used as an address.
It is obvious that a hash function that maps many different key values to a
single address is a bad hash function. A collision is said to occur when two distinct
key values are mapped to the same storage location.
Consider the hash key h(K) = K % 100 which produce a set of indices from 0 to 999.
Position0 4967000
1 4967700
2 8421845
990 00009978
999 00018773
RESOLVING HASH CLASHES
The simplest method of resolving hash clashes is to place the record in the
next available position in the array of locations.
Othermethod is called L inear Probing or Rehashing. In general, with a rehash
function rh, if the location h(K) is already occupied by a record, a rh(K)is used.
Another solution is to use a linked list for every position.
HashAddressKey
-
7/28/2019 DataB Ch3 File
13/15
File Organization Chapter_3 (13 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
MULTI-KEY FILEORGANIZATION
The organization, which enables a single data file to support multiple access
paths, each by a different key. Consider the following table;
Recor
d
K# Name Occup at io
n
Degre
e
Sex Salary
A 800 Aaa Programmer
M.Tech
M 10000
B 510 Bbb Analyst B.Sc F 15000
C 950 Ccc Analyst B.Sc F 12000
D 750 Ddd Programmer
M.Sc F 12000
E 620 Eee programmer
B.Sc M 9000
K# Upper Value
700 Record B Record E
900 Record D Record A
1100 Record C
Linking together all records of the same type.
K# INDEX OCCUPATIONINDEX
Max K 700 900 1100 Value Analyst Programmer
Numbers
2 2 1 Length 2 3
Pointer B D C Pointer B E
K# 0 0 0
Occupation C 0 0
Sex index D A C 0 0
Salaryindex
O 0 C D 0
B E D A C
Value Female Male Value
-
7/28/2019 DataB Ch3 File
14/15
File Organization Chapter_3 (14 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University
Length
3 2 Length 1 3 1
Pointer
B E Pointer E A B
SEX INDEX SALARY INDEX
-
7/28/2019 DataB Ch3 File
15/15
File Organization Chapter_3 (15 / 15)
___________________________________________________________________________________
_________________________________________________________________________________
Dr. Mohammed Fadle AbdullaComputer Sci & Engg. Department, Aden University