slides 1 introduction to file structures
TRANSCRIPT
Introduction to File Structures
Objectives
At the end of the lesson, the student should be able to:
• Explain the concepts behind physical file organization
• Differentiate logical and physical files
• Determine the advantages of using primary over secondary storage devices
• Identify the characteristics, capacities and access costs of different storage devices
• Explain the concepts behind file organization and file access
• Identify the different indexing mechanisms
• Use file compression techniques to improve space utilization and file access performance
CS 165 Database Systems 1 Lesson 1: Introduction to File Structures
Issues in Physical File Organization
• Slow disk access• Volative primary storage• Virtually infinite secondary storage• Cheaper secondary storage• Dynamic nature of data• Need for fast searching and retrieval
Ideally, we would like to get information with one access to the disk. If it is not possible, even with just as few accesses as possible. We also want to have file structures with information groups, or what we call records, to get a bulk of information in just one access to the disk.
Solutions Made
• Sequential access• Indexed access• Binary trees• Tree-structured file organization: B-tree, B*-tree and B+-tree• Direct access design in the form of hashing
CS 165 Database Systems 2 Lesson 1: Introduction to File Structures
Logical vs. Physical Files
physical file ⇒ a file in a disk, which means it exists physicallylogical file ⇒ a file as viewed by the user
Example: assign(input_file,’input.txt’);
input_file ⇒ logical, input.txt ⇒ physical
Secondary Storage Devices
⇒ hold files that are not currently being used
⇒ used for mid-term to long-term storage
Types of Storage Devices
• Direct Access Storage Devices (DASDs) ⇒ also known as random access devices⇒ magnetic disks such as hard disks, floppy disks and optical disks
• Serial Devices ⇒ use media such as magnetic tapes that permit only serial access.
CS 165 Database Systems 3 Lesson 1: Introduction to File Structures
Disks
Use magnetism to store the data on a magnetic surfaceAdvantages: high storage capacity, reliable, gives direct access to data A drive spins the disk very quickly underneath a read/write head
Organization of Disks
All magnetic disks are similarly formatted, or divided into areas, called tracks, sectors and cylinders.
• Platters • Tracks • Sectors• Disk pack •
• Cylinder
Seeking ⇒ Moving the r/w head to find the right location on the disk
⇒ Usually the slowest part of reading information
CS 165 Database Systems 4 Lesson 1: Introduction to File Structures
Disk Capacity
• Amount of data on a track depends upon how densely bits can be stored Low density High density
• To compute track, cylinder and drive capacity:
CT = Track Capacity = #sectors/track x #bytes/sectorCC = Cylinder Capacity = #tracks/cylinder x CT
CD = Drive Capacity = #cylinders/drive x CC
Example: What is the capacity of the drive with the following characteristics?
#bytes/sector = 512#sectors/track = 40#tracks/cylinder = 11#cylinders = 1331
CS 165 Database Systems 5 Lesson 1: Introduction to File Structures
Cost of a Disk Access (Latency)
Factors affecting disk access time: seek time [ts], rotational delay [tr], block transfer time [tbt]
Step Measured as:(in ms)
1. seek move the head to proper track
seek time
2. rotaterotate disk under the head to the correct sector
rotational delay
3. settlehead lowers to disk;wait for vibrations from moving to stop (actually touches only on floppies)
settling time
4. data transfercopy data to main memory
data transfer time
CS 165 Database Systems 6 Lesson 1: Introduction to File Structures
Seek Time
• Time it takes to move the arm to the correct cylinder
• Depends on the physical characteristics of the disk drive as well as on the #cylinders.
• Assumption: all cylinders are equally likely to be sources or destinations.
minimum seek time ⇒ time it takes to move the access arm from a track to its adjacent track.
maximum seek time ⇒ time it takes to move the access arm from the outermost to the innermost track.
average seek time ⇒ the average of the minimum and maximum seek times.
Rotational Delay
• Time it takes for the head to reach the right place on the track.
• Assumption: Correct block can be recognized by some flag or block identifier at the beginning.
rotational time (tr ) ⇒ time needed for a disk pack to complete one whole revolution.
CS 165 Database Systems 7 Lesson 1: Introduction to File Structures
• Average latency time: [tr/2]= time it takes the disk drive to make ½ revolution.
tr/2 = (1/2) (60 x 1000) / rpm
where rpm is revolutions per minute
or if tr is given:
tr/2 = tr / 2
• Hard disks usually rotates at about 7,200 rpm ~ 1 rev. per 8.33 ms so the average rotational delay is 4.167 msec.
• For floppy disk: 360 rpm, tr/2 = 83.3 ms.
Block Transfer Time
• Time it takes for the head to pass over a block
• The bytes of data can be transferred from disk to RAM and vice versa.
• Depends on rotational speed, recording density (#bytes/track), and #bytes in a block.
CS 165 Database Systems 8 Lesson 1: Introduction to File Structures
Let tbt = block transfer time rbt = block transfer rate B = #bytes transferred
tbt = B / rbt
Example: Assume a block size of 1,000 bytes and that the blocks are stored randomly. The disk drive has the following characteristics:
ts/2 = 30 mstr = 16.67 mstr/2 = 8.3 msrbt = 806 KB/sec = 825,344 bytes/sec
• What is the average access time per block?
Average access time per block= seek + rotational latency + transfer= ts/2 + tr/2 + tbt
=30 ms/block + 8.3 ms/block + 1,000bytes/block 825,344bytes/sec
= (30 + 8.3 + 1.21) ms /block= 39.51 ms/block
CS 165 Database Systems 9 Lesson 1: Introduction to File Structures
Magnetic Tape
• Sequential access
• Logical position of a byte within a file corresponds directly to its physical location relative to the start of the file.
• Good medium for archival storage, for transportation of data
• Tape drives come in many shapes, sizes and speeds. Performance of each drive can be measured in terms of the following quantities:Tape (or recording) density ⇒ number of characters or bytes of data that can be stored per inch (bpi)
• Tape Speed ⇒ commonly 30 to 200 inches per second (ips)
Disk vs. Tape
DISK TAPERandom Access Sequential AccessUse to store files in shorter terms Long-term storage of filesGenerally serves many processes Dedicated to one process Expensive Less expensiveUsed as main secondary storage Considered to be a tertiary storage
CS 165 Database Systems 10 Lesson 1: Introduction to File Structures
RAID: Improving File Access Performance by Parallel Processing
Redundant Array of Inexpensive Disks
A set of physical disk drives that appear to the database users and programs as if they form one large logical storage unit.
Parallel database processing speeds up writes and reads. This can be accomplished through using a RAID. RAID does not change the logical or physical structure of application programs or database queries.
CS 165 Database Systems 11 Lesson 1: Introduction to File Structures
RAID Levels
CS 165 Database Systems 12 Lesson 1: Introduction to File Structures
CS 165 Database Systems 13 Lesson 1: Introduction to File Structures
Fundamental File Structure Concepts
Data is stored in a physical storage for later retrieval. In order to organize the data for easy retrieval, physical file organization and access mechanism have to be determined.
Field and Record Organization⇒ Refers to the physical structure of records to store⇒ Fixed or variable in length
A Stream File
PROGRAM: writstrmget output file name and open it with the logical name OUTPUTget LAST name as inputwhile (LAST name has a length > 0)
get FIRST name, ADDRESS, CITY, STATE, and ZIP as inputwrite LAST to the file OUTPUTwrite FIRST to the file OUTPUTwrite ADDRESS to the file OUTPUTwrite CITY to the file OUTPUTwrite STATE to the file OUTPUTwrite ZIP to the file OUTPUTget LAST name as input
endwhileclose OUTPUTend PROGRAM
CS 165 Database Systems 14 Lesson 1: Introduction to File Structures
⇒ Writes a stream of bytes containing no added information: hard to get the data back since there is no integrity of the fundamental organizational units of the input data.
Field Structures
FieldA conceptual tool Smallest logically meaningful unit of information in a fileA subdivision of a record containing a single attribute of the entity the record
describes
Methods of Field Organization
1. Fixed-length fields
• Disadvantage: wasted space, insufficient space
2.Length indicator at the beginning of each field
3.Delimiters to separate fields
4.“keyword = value” expression to identify fields
• used in combination with delimiters• advantage: self-descriptive• disadvantage: wasted space
CS 165 Database Systems 15 Lesson 1: Introduction to File Structures
Record Structures
Records
Also a conceptual toolSet of fields that belong together when the file is viewed in terms of a higher level of
organization
Methods of Record Organization
1. Fixed-length records
• Does not imply that the sizes or number of fields in the record must be fixed
2. Fixed number of fields
3. Length indicator at the start of each record
4. Use of index to keep track of addresses
• Index in another file or at the file's header.
5. Delimiter at the end of each record
CS 165 Database Systems 16 Lesson 1: Introduction to File Structures
File Organization
The following are the main types of file organization:• Sequential • Relative
Sequential File Organization
⇒ Oldest type of file organization since during the 1950s and 1960s, the foundation of many information systems was sequential processing
Sequential Files ⇒Files that are read from beginning to end
Physical Characteristics
• Physical order of the records is the same as the logical order• Logical representation: (record 1) ⇒ (record 2) ⇒ (record 3) ⇒ … ⇒ (record n)
Two Types of Sequential Files
• Unordered pile files• Sorted sequential files
CS 165 Database Systems 17 Lesson 1: Introduction to File Structures
Relative File Organization
• There exists a predictable relationship between the key used to identify a record and that record’s absolute address on an external file
• Allows access to a record directly given only the key, regardless of the position of the record in the file
• Characterized as providing random access because the logical organization of the file need not correspond to its physical organization
• Simplest relative file organization: key value corresponds directly to the physical location of the record in the file
Useful for dense keys, i.e., values of consecutive keys differ only by one.
If the key collection is not dense, it might result to wasted space. This can be solved by mapping the large range of non-dense key values into smaller range of record positions in the file. ⇒ The key is no longer the address of the record in the file. Hashing or indexing may be used.
CS 165 Database Systems 18 Lesson 1: Introduction to File Structures
File Access
⇒ Refers to the manner data is retrieved
File Access Methods
Sequential Access • O(n) – expensive if done directly to the disk
Relative Access
• Addresses of records can be obtained directly from a key
• Uses an indexing technique that allows a user to locate a record in a file with as few accesses as possible, ideally with just one access
• The indexing scheme could be one of the following: • Direct addressing• Binary search tree• B-trees• Multiple-key indexing• Hashing
If access is sequential, file organization used will not have significant effect on the access cost. However, relative access entails fixing the size of records or using an index.
CS 165 Database Systems 19 Lesson 1: Introduction to File Structures
Indexing Mechanisms
Index
Consists of keys and reference fieldsWorks by indirectionUsed to provide multiple access path to a file Gives keyed access to variable-length record files
Key
Identifies a record based on the record’s contents not on the sequence of the records
Primary keys
Keys that uniquely identify a single record
Primary keys should be unchanging
Secondary keys
Used to overcome the shortcomings of the primary key
Used to access records according to data content
Types of Indexes
• Primary Index on Unordered Files • Primary Index on Ordered Files• Clustering Index on Ordered Files• Secondary Index
CS 165 Database Systems 20 Lesson 1: Introduction to File Structures
Primary Index on Unordered Files
• The index is a simple arrays of structures that contain the keys and reference fields• Allows binary search• The index provides the order to unordered physical records• Physical files are entry-sequenced
CS 165 Database Systems 21 Lesson 1: Introduction to File Structures
Primary Index on Ordered Files
• Physical records are ordered based on the primary key• The index is ordered but only one index record for each block• Reduces the index requirement, enabling binary search over the values without
having to read the entire file to perform binary search
CS 165 Database Systems 22 Lesson 1: Introduction to File Structures
Clustering Index on Ordered Files
CS 165 Database Systems 23 Lesson 1: Introduction to File Structures
Secondary Index
• Used to facilitate faster access to commonly queried non-primary fields• Typically point to the primary index
Advantage: Record deletion and update cause less workDisadvantage: Less efficient
Retrieval Using Combination of Secondary Keys⇒ use boolean AND operation, specifying the intersection of two subsets of the data file
CS 165 Database Systems 24 Lesson 1: Introduction to File Structures
Hashing
• Used to obtain addresses from keys. A hash function is used to map a range of key values into a smaller range of relative addresses.
• Hashing is like indexing in that it involves associating a key with a relative record address.
• Unlike indexing, with hashing, there is no obvious connection between the key and the address generated since the function "randomly selects" a relative address for a specific key value, without regard to the physical sequence of the records in the file. Thus it is also referred to as randomizing scheme.
• Problem in hashing: Presence of collision. Collision happens when two or more input keys generate the same address when the same hash function is used.
• Solution: Collision resolution technique or use of perfect hash functions.
CS 165 Database Systems 25 Lesson 1: Introduction to File Structures
Common Hashing Techniques Perfect Hashing TechniquesHash functions which are able to generate random addresses (not necessarily unique)
Hash functions which are able to generate a perfectly uniform distribution of addresses
• Prime Number Division Method • Digit Extraction• Folding• Mid-Square• Radix Conversion
• Quotient Reduction• Remainder Reduction• Associated Value• Reciprocal Hashing
AdvantagesEasier to implement
Provide 1-1 mapping of keys into addresses (no collisions)
Disadvantages• Records may collide• Records may be unevenly
distributed. In the worst case, all records hash into a single address
• Requires knowledge of the set of key values in order to generate a perfectly uniform distribution of keys
• Quite complicated to use• Has a lot of pre-requisites • It is hard to find a function that produces no
collision
CS 165 Database Systems 26 Lesson 1: Introduction to File Structures
Other Indexing Mechanisms
• Binary Search Trees: BST, Height-Balanced (AVL, Red-Black)
• B-Trees
Indexes That Are Too Large to Hold in Memory
Index too large for memory must be maintained on secondary storage ⇒ a lot of seeking is required.
Alternatives: Use hashing for direct access or tree-structured index for both keyed access and sequential access
CS 165 Database Systems 27 Lesson 1: Introduction to File Structures
Organizing Files for Performance
• Used to improve space utilization and file access times• Some methods: data compression and reclaiming unused space
Data Compression⇒ The process of making files smaller by encoding the information to take up less space
Methods• Using a Different Notation• Suppressing Repeated Sequences• Assigning Variable-Length Codes
Redundancy Reduction – compression by reducing redundancy in data representation
Using a Different Notation
Compact Notation ⇒ a compression technique which we decrease the number of bits in data representation.
Fixed-length fields of a record are good candidates for compressiono Choosing minimal length that is still conservative enougho Using alternative values: for example, course code instead of course name
CS 165 Database Systems 28 Lesson 1: Introduction to File Structures
Suppressing Repeated Sequences
Run-Length Encoding (RLE)
• A compression technique in which runs of repeated codes are replaced by a count of the number of repetitions of the code, followed by the code that is repeated
• Images with repeated pixels are good candidates for RLE
• The Algorithm: Read through the pixels in the image, copying the values to the file sequence,
except where the same pixel value occurs more that one successively. Substitute with the following 3 bytes the pixel values which occurred in succession:
• Run-length code indicator;• Pixel value that is repeated; and• The number of times it is repeated.
For example, we want to compress the following, with 0xFF not included in the image:22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
The compressed version is22 23 FF 24 07 25 FF 26 06 25 24
• RLE does not guarantee any particular amount of space savings
CS 165 Database Systems 29 Lesson 1: Introduction to File Structures
Assigning Variable-Length Codes
The most frequently used piece of data is assigned to have the shortest code
Example 1: Morse Code
E • T I • • M S • • • O H • • • •
However, Morse Code still need some delimiter to recognize characters
CS 165 Database Systems 30 Lesson 1: Introduction to File Structures
Example 2: Huffman Code
Suppose we have an alphabet consisting of only seven characters:
Letter a b c d e f gProbability 0.4 0.1 0.1 0.1 0.1 0.1 0.1Code 1 010 011 0000 0001 0010 0011
• Each letter here occurs with the probability indicated
• Letter a has the greatest probability to occur more frequently, so it is assigned the one bit code
• So the string abacde is represented as ‘1010101100000001’
• Seven letters can be stored using three bits only, but in this example, as much as four bits are used to ensure that the distinct codes can be stored together without delimiters and still could be recognized
CS 165 Database Systems 31 Lesson 1: Introduction to File Structures