slides 1 introduction to file structures

Introduction to File Structures

Objectives

At the end of the lesson, the student should be able to:

• Explain the concepts behind physical file organization

• Differentiate logical and physical files

• Determine the advantages of using primary over secondary storage devices

• Identify the characteristics, capacities and access costs of different storage devices

• Explain the concepts behind file organization and file access

• Identify the different indexing mechanisms

• Use file compression techniques to improve space utilization and file access performance

CS 165 Database Systems 1 Lesson 1: Introduction to File Structures

Issues in Physical File Organization

• Slow disk access• Volative primary storage• Virtually infinite secondary storage• Cheaper secondary storage• Dynamic nature of data• Need for fast searching and retrieval

Ideally, we would like to get information with one access to the disk. If it is not possible, even with just as few accesses as possible. We also want to have file structures with information groups, or what we call records, to get a bulk of information in just one access to the disk.

Solutions Made

• Sequential access• Indexed access• Binary trees• Tree-structured file organization: B-tree, B*-tree and B+-tree• Direct access design in the form of hashing


Logical vs. Physical Files

physical file ⇒ a file in a disk, which means it exists physicallylogical file ⇒ a file as viewed by the user

Example: assign(input_file,’input.txt’);

input_file ⇒ logical, input.txt ⇒ physical

Secondary Storage Devices

⇒ hold files that are not currently being used

⇒ used for mid-term to long-term storage

Types of Storage Devices

• Direct Access Storage Devices (DASDs) ⇒ also known as random access devices⇒ magnetic disks such as hard disks, floppy disks and optical disks

• Serial Devices ⇒ use media such as magnetic tapes that permit only serial access.


Disks

Use magnetism to store the data on a magnetic surfaceAdvantages: high storage capacity, reliable, gives direct access to data A drive spins the disk very quickly underneath a read/write head

Organization of Disks

All magnetic disks are similarly formatted, or divided into areas, called tracks, sectors and cylinders.

• Platters • Tracks • Sectors• Disk pack •

• Cylinder

Seeking ⇒ Moving the r/w head to find the right location on the disk

⇒ Usually the slowest part of reading information


Disk Capacity

• Amount of data on a track depends upon how densely bits can be stored Low density High density

• To compute track, cylinder and drive capacity:

CT = Track Capacity = #sectors/track x #bytes/sectorCC = Cylinder Capacity = #tracks/cylinder x CT

CD = Drive Capacity = #cylinders/drive x CC

Example: What is the capacity of the drive with the following characteristics?

#bytes/sector = 512#sectors/track = 40#tracks/cylinder = 11#cylinders = 1331


Cost of a Disk Access (Latency)

Factors affecting disk access time: seek time [ts], rotational delay [tr], block transfer time [tbt]

Step Measured as:(in ms)

1. seek move the head to proper track

seek time

2. rotaterotate disk under the head to the correct sector

rotational delay

3. settlehead lowers to disk;wait for vibrations from moving to stop (actually touches only on floppies)

settling time

4. data transfercopy data to main memory

data transfer time


Seek Time

• Time it takes to move the arm to the correct cylinder

• Depends on the physical characteristics of the disk drive as well as on the #cylinders.

• Assumption: all cylinders are equally likely to be sources or destinations.

minimum seek time ⇒ time it takes to move the access arm from a track to its adjacent track.

maximum seek time ⇒ time it takes to move the access arm from the outermost to the innermost track.

average seek time ⇒ the average of the minimum and maximum seek times.

Rotational Delay

• Time it takes for the head to reach the right place on the track.

• Assumption: Correct block can be recognized by some flag or block identifier at the beginning.

rotational time (tr ) ⇒ time needed for a disk pack to complete one whole revolution.


• Average latency time: [tr/2]= time it takes the disk drive to make ½ revolution.

tr/2 = (1/2) (60 x 1000) / rpm

where rpm is revolutions per minute

or if tr is given:

tr/2 = tr / 2

• Hard disks usually rotates at about 7,200 rpm ~ 1 rev. per 8.33 ms so the average rotational delay is 4.167 msec.

• For floppy disk: 360 rpm, tr/2 = 83.3 ms.

Block Transfer Time

• Time it takes for the head to pass over a block

• The bytes of data can be transferred from disk to RAM and vice versa.

• Depends on rotational speed, recording density (#bytes/track), and #bytes in a block.


Let tbt = block transfer time rbt = block transfer rate B = #bytes transferred

tbt = B / rbt

Example: Assume a block size of 1,000 bytes and that the blocks are stored randomly. The disk drive has the following characteristics:

ts/2 = 30 mstr = 16.67 mstr/2 = 8.3 msrbt = 806 KB/sec = 825,344 bytes/sec

• What is the average access time per block?

Average access time per block= seek + rotational latency + transfer= ts/2 + tr/2 + tbt

=30 ms/block + 8.3 ms/block + 1,000bytes/block 825,344bytes/sec

= (30 + 8.3 + 1.21) ms /block= 39.51 ms/block


Magnetic Tape

• Sequential access

• Logical position of a byte within a file corresponds directly to its physical location relative to the start of the file.

• Good medium for archival storage, for transportation of data

• Tape drives come in many shapes, sizes and speeds. Performance of each drive can be measured in terms of the following quantities:Tape (or recording) density ⇒ number of characters or bytes of data that can be stored per inch (bpi)

• Tape Speed ⇒ commonly 30 to 200 inches per second (ips)

Disk vs. Tape

DISK TAPERandom Access Sequential AccessUse to store files in shorter terms Long-term storage of filesGenerally serves many processes Dedicated to one process Expensive Less expensiveUsed as main secondary storage Considered to be a tertiary storage


RAID: Improving File Access Performance by Parallel Processing

Redundant Array of Inexpensive Disks

A set of physical disk drives that appear to the database users and programs as if they form one large logical storage unit.

Parallel database processing speeds up writes and reads. This can be accomplished through using a RAID. RAID does not change the logical or physical structure of application programs or database queries.


RAID Levels


Fundamental File Structure Concepts

Data is stored in a physical storage for later retrieval. In order to organize the data for easy retrieval, physical file organization and access mechanism have to be determined.

Field and Record Organization⇒ Refers to the physical structure of records to store⇒ Fixed or variable in length

A Stream File

PROGRAM: writstrmget output file name and open it with the logical name OUTPUTget LAST name as inputwhile (LAST name has a length > 0)

get FIRST name, ADDRESS, CITY, STATE, and ZIP as inputwrite LAST to the file OUTPUTwrite FIRST to the file OUTPUTwrite ADDRESS to the file OUTPUTwrite CITY to the file OUTPUTwrite STATE to the file OUTPUTwrite ZIP to the file OUTPUTget LAST name as input

endwhileclose OUTPUTend PROGRAM


⇒ Writes a stream of bytes containing no added information: hard to get the data back since there is no integrity of the fundamental organizational units of the input data.

Field Structures

FieldA conceptual tool Smallest logically meaningful unit of information in a fileA subdivision of a record containing a single attribute of the entity the record

describes

Methods of Field Organization

1. Fixed-length fields

• Disadvantage: wasted space, insufficient space

2.Length indicator at the beginning of each field

3.Delimiters to separate fields

4.“keyword = value” expression to identify fields

• used in combination with delimiters• advantage: self-descriptive• disadvantage: wasted space


Record Structures

Records

Also a conceptual toolSet of fields that belong together when the file is viewed in terms of a higher level of

organization

Methods of Record Organization

1. Fixed-length records

• Does not imply that the sizes or number of fields in the record must be fixed

2. Fixed number of fields

3. Length indicator at the start of each record

4. Use of index to keep track of addresses

• Index in another file or at the file's header.

5. Delimiter at the end of each record


File Organization

The following are the main types of file organization:• Sequential • Relative

Sequential File Organization

⇒ Oldest type of file organization since during the 1950s and 1960s, the foundation of many information systems was sequential processing

Sequential Files ⇒Files that are read from beginning to end

Physical Characteristics

• Physical order of the records is the same as the logical order• Logical representation: (record 1) ⇒ (record 2) ⇒ (record 3) ⇒ … ⇒ (record n)

Two Types of Sequential Files

• Unordered pile files• Sorted sequential files


Relative File Organization

• There exists a predictable relationship between the key used to identify a record and that record’s absolute address on an external file

• Allows access to a record directly given only the key, regardless of the position of the record in the file

• Characterized as providing random access because the logical organization of the file need not correspond to its physical organization

• Simplest relative file organization: key value corresponds directly to the physical location of the record in the file

Useful for dense keys, i.e., values of consecutive keys differ only by one.

If the key collection is not dense, it might result to wasted space. This can be solved by mapping the large range of non-dense key values into smaller range of record positions in the file. ⇒ The key is no longer the address of the record in the file. Hashing or indexing may be used.


File Access

⇒ Refers to the manner data is retrieved

File Access Methods

Sequential Access • O(n) – expensive if done directly to the disk

Relative Access

• Addresses of records can be obtained directly from a key

• Uses an indexing technique that allows a user to locate a record in a file with as few accesses as possible, ideally with just one access

• The indexing scheme could be one of the following: • Direct addressing• Binary search tree• B-trees• Multiple-key indexing• Hashing

If access is sequential, file organization used will not have significant effect on the access cost. However, relative access entails fixing the size of records or using an index.


Indexing Mechanisms

Index

Consists of keys and reference fieldsWorks by indirectionUsed to provide multiple access path to a file Gives keyed access to variable-length record files

Key

Identifies a record based on the record’s contents not on the sequence of the records

Primary keys

Keys that uniquely identify a single record

Primary keys should be unchanging

Secondary keys

Used to overcome the shortcomings of the primary key

Used to access records according to data content

Types of Indexes

• Primary Index on Unordered Files • Primary Index on Ordered Files• Clustering Index on Ordered Files• Secondary Index


Primary Index on Unordered Files

• The index is a simple arrays of structures that contain the keys and reference fields• Allows binary search• The index provides the order to unordered physical records• Physical files are entry-sequenced


Primary Index on Ordered Files

• Physical records are ordered based on the primary key• The index is ordered but only one index record for each block• Reduces the index requirement, enabling binary search over the values without

having to read the entire file to perform binary search


Clustering Index on Ordered Files


Secondary Index

• Used to facilitate faster access to commonly queried non-primary fields• Typically point to the primary index

Advantage: Record deletion and update cause less workDisadvantage: Less efficient

Retrieval Using Combination of Secondary Keys⇒ use boolean AND operation, specifying the intersection of two subsets of the data file


Hashing

• Used to obtain addresses from keys. A hash function is used to map a range of key values into a smaller range of relative addresses.

• Hashing is like indexing in that it involves associating a key with a relative record address.

• Unlike indexing, with hashing, there is no obvious connection between the key and the address generated since the function "randomly selects" a relative address for a specific key value, without regard to the physical sequence of the records in the file. Thus it is also referred to as randomizing scheme.

• Problem in hashing: Presence of collision. Collision happens when two or more input keys generate the same address when the same hash function is used.

• Solution: Collision resolution technique or use of perfect hash functions.


Common Hashing Techniques Perfect Hashing TechniquesHash functions which are able to generate random addresses (not necessarily unique)

Hash functions which are able to generate a perfectly uniform distribution of addresses

• Prime Number Division Method • Digit Extraction• Folding• Mid-Square• Radix Conversion

• Quotient Reduction• Remainder Reduction• Associated Value• Reciprocal Hashing

AdvantagesEasier to implement

Provide 1-1 mapping of keys into addresses (no collisions)

Disadvantages• Records may collide• Records may be unevenly

distributed. In the worst case, all records hash into a single address

• Requires knowledge of the set of key values in order to generate a perfectly uniform distribution of keys

• Quite complicated to use• Has a lot of pre-requisites • It is hard to find a function that produces no

collision


Other Indexing Mechanisms

• Binary Search Trees: BST, Height-Balanced (AVL, Red-Black)

• B-Trees

Indexes That Are Too Large to Hold in Memory

Index too large for memory must be maintained on secondary storage ⇒ a lot of seeking is required.

Alternatives: Use hashing for direct access or tree-structured index for both keyed access and sequential access


Organizing Files for Performance

• Used to improve space utilization and file access times• Some methods: data compression and reclaiming unused space

Data Compression⇒ The process of making files smaller by encoding the information to take up less space

Methods• Using a Different Notation• Suppressing Repeated Sequences• Assigning Variable-Length Codes

Redundancy Reduction – compression by reducing redundancy in data representation

Using a Different Notation

Compact Notation ⇒ a compression technique which we decrease the number of bits in data representation.

Fixed-length fields of a record are good candidates for compressiono Choosing minimal length that is still conservative enougho Using alternative values: for example, course code instead of course name


Suppressing Repeated Sequences

Run-Length Encoding (RLE)

• A compression technique in which runs of repeated codes are replaced by a count of the number of repetitions of the code, followed by the code that is repeated

• Images with repeated pixels are good candidates for RLE

• The Algorithm: Read through the pixels in the image, copying the values to the file sequence,

except where the same pixel value occurs more that one successively. Substitute with the following 3 bytes the pixel values which occurred in succession:

• Run-length code indicator;• Pixel value that is repeated; and• The number of times it is repeated.

For example, we want to compress the following, with 0xFF not included in the image:22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24

The compressed version is22 23 FF 24 07 25 FF 26 06 25 24

• RLE does not guarantee any particular amount of space savings


Assigning Variable-Length Codes

The most frequently used piece of data is assigned to have the shortest code

Example 1: Morse Code

E • T I • • M S • • • O H • • • •

However, Morse Code still need some delimiter to recognize characters


Example 2: Huffman Code

Suppose we have an alphabet consisting of only seven characters:

Letter a b c d e f gProbability 0.4 0.1 0.1 0.1 0.1 0.1 0.1Code 1 010 011 0000 0001 0010 0011

• Each letter here occurs with the probability indicated

• Letter a has the greatest probability to occur more frequently, so it is assigned the one bit code

• So the string abacde is represented as ‘1010101100000001’

• Seven letters can be stored using three bits only, but in this example, as much as four bits are used to ensure that the distinct codes can be stored together without delimiters and still could be recognized


slides 1 introduction to file structures

Documents