chapter 5 – managing files of records

Chapter 5 – Managing Files of Records

What’s Up for This Chapter?

This Chapter’s Material
– Accessing records in files
– Record structures for access
– File access methods vs. file organizations
– Some real-world examples of file structures
– File portability issues

The Central Problem

Locating Stored Data
– Once the data has been stored in a file, how do you find it to retrieve it?
– What does “find the data” even mean?
    How do you decide what you want to find?
    How do you look for it?
    What if it’s not there?
    What if something very much like it is there?
    What if there are lots of “it” there?
– And, of course, there are efficiency considerations
    How fast is your search algorithm?
    What would you have to do to the file to use a faster one?
    Which will you do more often, add records or find them?
– All of which brings you back to the design of the file itself

Record Keys

What Is a Key?
– Data stored in a record by which you look for the record
– Can be one field or a set of fields
    Examples – { name } or { last_name + first_name }

Two Types of Keys
– Primary key
    Key value, unique in the entire file, by which an individual record can be located or determined to be absent
– Secondary key
    Key value by which one or more records can be located

Primary Keys

Required Characteristics
– Unique across the entire file
    Can never have 2 records with the same primary key
    Error to try to add a record with a duplicate primary key
– In “canonical” form
    Format precisely known, so search candidates can be brought into that same format before the search
    Example – words (names, etc.) in all upper-case
        – Not often used any more: rather, program the system to do the search independently of case
– Unchanging
    Value for a given record should never change
        – A given primary key value should always identify the same record
        – Example – Texas Driver’s License number stays with you, even if you move away from Texas, then come back
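
The uniqueness and canonical-form rules above can be sketched in a few lines. This is only an illustration under stated assumptions: Python is used for brevity (the slides refer to C++), the canonical form chosen here (collapsed whitespace, case-folded) is one possible choice, and the names canonical_key, add_record, and find_record are hypothetical.

```python
def canonical_key(raw: str) -> str:
    """Bring a candidate key into one agreed form before comparing.

    Assumed canonical form: single spaces between parts, no leading or
    trailing whitespace, case folded.
    """
    return " ".join(raw.split()).casefold()


def add_record(table: dict, key: str, record: dict) -> None:
    """Insert a record; adding a duplicate primary key is an error."""
    k = canonical_key(key)
    if k in table:
        raise KeyError(f"duplicate primary key: {key!r}")
    table[k] = record


def find_record(table: dict, key: str):
    """Return the record for this key, or None if it is absent."""
    return table.get(canonical_key(key))
```

Because both the stored key and the search candidate pass through canonical_key, a lookup for "tx 0123 4567" finds a record that was added as "TX  0123 4567".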

Primary Keys, cont’d.

Implication on File Design
– Don’t use possibly non-unique field(s) as primary key
    Bad – name, birth date, etc.
– Don’t use anything that can possibly change
    Bad – name, address, etc.
– What can we use?
    Best – artificial identifier
        – Student number
        – Driver’s license number
        – Other artificially created unique value

Secondary Keys

Not Such Stringent Rules
– Duplicates allowed
    Still have to define what “find” means if duplicates are allowed
– Usually real data, as opposed to primary keys
    The kinds of thing you’d want to search for in real life
– Not used to impose any order on the file
    Can return results based on secondary key(s)
        – Selected by secondary key value(s)
        – Sorted on secondary key value(s)
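
A minimal sketch of selecting and sorting on a secondary key without reordering the file itself. The record layout, the sample data, and the function names are assumptions made for illustration; the "id" field plays the role of the primary key.

```python
from collections import defaultdict

# Hypothetical records; "id" stands in for the primary key.
RECORDS = [
    {"id": 101, "last_name": "Mason", "city": "Ada"},
    {"id": 102, "last_name": "Ames", "city": "Stillwater"},
    {"id": 103, "last_name": "Ames", "city": "Ada"},
]


def build_secondary_index(records, field):
    """Map each secondary key value to the primary keys that carry it."""
    index = defaultdict(list)
    for rec in records:
        index[rec[field]].append(rec["id"])  # duplicates are allowed
    return dict(index)


def select_by(records, field, value):
    """Return every record whose secondary key equals value."""
    return [rec for rec in records if rec[field] == value]


def sorted_by(records, field):
    """Return a sorted copy; the stored order is left untouched."""
    return sorted(records, key=lambda rec: rec[field])
```

Here "find" is defined as "return all matches", which is one of the choices the slide says has to be made when duplicates are allowed.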

Searching

From 2325 – Two Major Methods
– Sequential
    Start at beginning, look until you find what you’re after
    Choices:
        – Non-unique keys allowed?
        – Return first match or all of them?
– Binary
    Start in middle, remove half the list each time through
    Requires:
        – Primary key values unique across file
        – File sorted on primary key
        – Records directly accessible
There are others, but …
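
The two methods can be sketched as follows, assuming in-memory records stored as (key, data) tuples; the function names are illustrative. The first_only flag stands for the "first match or all of them?" choice, and binary_search assumes the list is already sorted on a unique key.

```python
def sequential_search(records, key, first_only=True):
    """Scan from the start; return the positions of matching records."""
    matches = []
    for position, (rec_key, _data) in enumerate(records):
        if rec_key == key:
            if first_only:
                return [position]
            matches.append(position)
    return matches


def binary_search(sorted_records, key):
    """Return the position of key in a list sorted on unique keys, or None."""
    low, high = 0, len(sorted_records) - 1
    while low <= high:
        mid = (low + high) // 2
        mid_key = sorted_records[mid][0]
        if mid_key == key:
            return mid
        if mid_key < key:
            low = mid + 1    # discard the lower half
        else:
            high = mid - 1   # discard the upper half
    return None
```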

Sequential Searching

Performance
– It might take 1 try; it might take N tries
    Average number of tries = N / 2 if:
        – Searching on a unique key
        – Returning first match
    Average number of tries = N if:
        – Returning all matches
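
As a quick check of the N / 2 figure: if every key is equally likely to be the target and is present in the file, the exact average is (N + 1) / 2, which the slide rounds to N / 2. The sketch below simply counts the tries; the function names are illustrative.

```python
def tries_to_find(keys, target):
    """Number of records examined before a sequential search stops."""
    for tries, key in enumerate(keys, start=1):
        if key == target:
            return tries
    return len(keys)  # not found: every record was examined


def average_tries(n):
    """Average over searching once for each of n unique keys."""
    keys = list(range(n))
    return sum(tries_to_find(keys, target) for target in keys) / n
```

For n = 1000 this gives 500.5, that is (1000 + 1) / 2.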

Sequential Searching

Performance
– Disk access is a big factor
    Worst case:
        – File fragmented around the disk
        – Each program read takes one physical read
    Best case:
        – File fairly contiguous on disk
        – I/O System buffers things so very few (1?) actual reads are done
        – In multi-user OSs, this seldom happens
    However:
        – If read/write head didn’t move between accesses
            • Rotational latency & transfer times small compared to seek time
            • Multiple physical reads wouldn’t have as much of an impact
        – However, most OSs are multi-tasking now
            • Can’t rely on read/write head’s being where you left it
            • Must assume N physical reads take N full disk accesses

Improving Sequential Searches

Reduce Number of Physical Reads
– We can’t do anything about:
    File fragmentation
        – If file’s clusters are scattered around disk, multiple seeks are necessary
    Multi-tasking environment
        – Have to assume each program read causes a physical read
        – (May not be true, if I/O System has good internal caching)
– So what do we do?
    Increase the number of records pulled in by each physical read
        – Saw this with magnetic tape – group the records into blocks
        – Similar to the way we collected fields into records, but …
            • Grouping fields into records is dependent on data characteristics
            • Grouping records into blocks is dependent on I/O system & disk
        – Block size should be:
            • Multiple of disk sector size
            • Compatible with I/O System’s ability to read
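
A minimal sketch of record blocking, assuming fixed-length records, a 512-byte sector, and a block of 8 sectors. All three sizes are illustrative values, not figures from the slides. Each call to read() asks for a whole block, and the records are then carved out of the block in memory.

```python
import io

SECTOR_SIZE = 512             # assumed sector size
RECORD_SIZE = 64              # assumed fixed record length
BLOCK_SIZE = 8 * SECTOR_SIZE  # a multiple of the sector size


def read_records_blocked(f, record_size=RECORD_SIZE, block_size=BLOCK_SIZE):
    """Yield fixed-length records, issuing one read() per block."""
    if block_size % record_size != 0:
        raise ValueError("block size must hold a whole number of records")
    while True:
        block = f.read(block_size)
        if not block:
            break
        for offset in range(0, len(block), record_size):
            record = block[offset:offset + record_size]
            if len(record) == record_size:  # ignore a trailing partial record
                yield record


# Usage: 100 records of 64 bytes are delivered by only two block reads.
demo = io.BytesIO(b"".join(bytes([i]) * RECORD_SIZE for i in range(100)))
record_count = sum(1 for _ in read_records_blocked(demo))
```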

When to Use Sequential Searching

Sequential Searching is Good for:
– Text files where you’re looking for a pattern
    Unix ‘grep’ command (named for the ed command g/re/p: globally search for a regular expression and print)
– Small files
    Like you use in labs here
– Files that are searched very infrequently
    Not worth the effort to sort to make binary search work
– When you expect a large number of matches
    Example – searching on a secondary key

It’s Not so Good for:
– Binary files
– Sorted files
– Big files

Unix Tools for Sequential Access

cat
– Seen this one – concatenate files
– cat F1 F2 >F3
wc
– Word count (also character & line count)
– wc article.txt
grep
– Search file for occurrences of regular expression pattern
– grep "Ames" personlist.txt
od
– Octal dump – or hex, or …
– od -ch list.dat

Direct Access

What is it?
– Go straight to the record you want in the file
    No searching
    No unnecessary disk accesses
– What’s its “order”?
    Time to find a record is independent of number of records
However, it can be harder to do

Direct Access

How to Do It?
– At I/O System level, seek to record
    C++ seek operations go to relative byte address (RBA) in file
    Variants:
        – Seek with “get” pointer vs. seek with “put” pointer
        – Relative to start, current position, or end of file (default: start)
– But that still doesn’t answer the question
    How do we know what RBA a particular record starts at?
    We’ve talked about index files – but that’s for later
    We could move the problem up one level
        – Use relative record number (RRN)
    But that’s no real help
        – Still need some kind of index – way to find record’s RRN
        – Also requires use of fixed-length records:
            RBA = RRN * Record_Size
            (assuming, of course, that the first RRN is 0)
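
The RBA formula translates directly into a seek followed by a read. This sketch uses Python's file seek() in place of the C++ seek operations mentioned above; the 32-byte record size and the function name read_by_rrn are assumptions.

```python
import io

RECORD_SIZE = 32  # assumed fixed record length in bytes


def read_by_rrn(f, rrn, record_size=RECORD_SIZE):
    """Read one fixed-length record by relative record number."""
    if rrn < 0:
        raise IndexError("RRN must be non-negative")
    f.seek(rrn * record_size)   # RBA = RRN * Record_Size, first RRN is 0
    data = f.read(record_size)
    if len(data) != record_size:
        raise IndexError(f"no complete record at RRN {rrn}")
    return data


# Usage: a 4-record file; RRN 2 starts at RBA 64.
demo_file = io.BytesIO(b"".join(bytes([65 + i]) * RECORD_SIZE for i in range(4)))
third = read_by_rrn(demo_file, 2)
```

The time taken does not depend on how many records precede the one requested, which is the point of direct access. Knowing which RRN to ask for is still a separate problem.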

Building a File of Records

Like Building a Record of Fields
– Same problem, up one level
    Fixed-length or specified-length records?
    How to directly access records?
– But wait – there’s more:
    Want to require software to know as few details about the file as possible
    To do that, those details need to be stored with (in) the file
– File header records
    Store file-specific information at start of file
    Header record format
        – Constant across all file types within one system
        – Why?

File Header Records

Things a Header Record Might Contain
– File structure
    Type of record structure
    Number of data records
    Length of records (if fixed-length)
    Record delimiter (if delimited)
– Record structure (if records have consistent structure)
    Number of fields
    Length of each field or delimiter between each field
    Format of each field
    Key information – if needed
        – Primary key field
        – Secondary key field(s), if any
– Date/time of most recent access
– Date/time of most recent update
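
One way a small binary header could be laid out is sketched below. The layout is entirely hypothetical: a 4-byte file-type marker, record count, record length, field count, and time of the most recent update, all little-endian with no padding. It covers only some of the items listed above.

```python
import struct

# Hypothetical layout: marker, record count, record length, field count,
# last update time in seconds. "<" means little-endian with no padding.
HEADER = struct.Struct("<4sIHHQ")


def pack_header(record_count, record_length, field_count, last_update):
    """Build the bytes of a header record."""
    return HEADER.pack(b"RECF", record_count, record_length,
                       field_count, last_update)


def unpack_header(raw):
    """Decode a header record into a dictionary of its fields."""
    marker, count, length, fields, updated = HEADER.unpack(raw[:HEADER.size])
    if marker != b"RECF":
        raise ValueError("not a file of this (hypothetical) type")
    return {"record_count": count, "record_length": length,
            "field_count": fields, "last_update": updated}
```

With a header like this at the start of the file, the first data record begins at byte HEADER.size (20 here), so the direct-access formula becomes RBA = header size + RRN * record length.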

File Header Records, continued

Header Record Format
– Binary or character?
    Depends – is it important for people to read it?
– Here’s a place where HTML-style format might work
    Lets files of different formats have different headers (in some ways)
    Only invokes that parse overhead once per file

What’s the Difference?

File Organization
– Format of the file itself
    Fixed-length, specified-length, or delimited records
    ASCII or binary character encoding

File Access Method
– Way(s) software can get at contents of file
    Sequential vs. direct
    Indexed sequential

Designing a File

Access Affects Organization
– If sequential access is all we need
    Pretty much any organization is OK
    Subject, of course, to application needs
– If we need direct access
    Need fixed-length records
    Can also use indexed files, but that’s for later on

But Organization Also Affects Access
– What if data to be stored in a record is wildly variable?
    Fixed-length records would be extremely wasteful
    But if we use specified-length records, how to do direct access?
        – Just about have to use indexing then

Metadata

Data About Data
– Usually in the form of a file header
– Example in text
    Astronomy image storage format
    HTML format (name = value)
    But look on page 177: coding style makes a BIG difference
– Parsing this kind of data
    Read field name; read field value
    Convert ASCII value to type required for storage & use
    Store converted value into right variable
– Why use this type of header?
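
The three parsing steps listed above (read name and value, convert, store) can be sketched as below. The one-pair-per-line layout and the field names used in the example are assumptions, not the textbook's actual header.

```python
def parse_header(text, converters):
    """Parse "name = value" lines into a dictionary of converted values.

    converters maps a field name to the function that turns its ASCII
    value into the type needed for use; other fields stay as strings.
    """
    header = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a name = value line: {line!r}")
        name = name.strip()
        convert = converters.get(name, str)    # step 2: convert the value
        header[name] = convert(value.strip())  # step 3: store it
    return header


# Usage with hypothetical field names.
example = parse_header(
    "WIDTH = 512\nHEIGHT = 256\nSCALE = 0.5\nTITLE = test image",
    {"WIDTH": int, "HEIGHT": int, "SCALE": float},
)
```

A header in this style can gain new fields without breaking older readers, since fields a reader does not know about are simply carried along as strings.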

More Metadata

PC Graphics Storage Formats
– Data
    Color values for each pixel in image
    Data compression often used (GIF, JPG)
    Different color “depth” possibilities
– Metadata
    Height, width
    Number of bits per pixel (color depth)
    If not true color (24 bits / pixel)
        – Color look-up table
            • Normally 256 entries
            • Indexed by values stored for each pixel (normally 1 byte)
            • Contains R/G/B values for color combination
        – Formatted to be loaded directly into PC graphics RAM
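
To make the look-up table idea concrete: with one byte stored per pixel, each pixel value is just an index into a 256-entry table of R/G/B triples. The grayscale table below is an arbitrary example, and both function names are illustrative.

```python
def grayscale_palette():
    """A 256-entry look-up table in which entry i is the gray level i."""
    return [(i, i, i) for i in range(256)]


def expand_indexed_pixels(pixels: bytes, palette):
    """Turn 1-byte-per-pixel indexed data into (R, G, B) triples."""
    if len(palette) != 256:
        raise ValueError("expected a 256-entry color look-up table")
    return [palette[index] for index in pixels]
```

The table is metadata: without it, the pixel bytes say nothing about which colors to display.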

Mixing Data Objects in a File

Objective
– Store different types of data in the same file
– Textbook example – mix of astronomy data
    “File” header (HTML-style)
    “File” of notes – lines of ASCII text
    “File” of image data – in whatever format
– So our data file becomes a file of files
    Each individual “file” (header, notes, or image) looks like a record in this new “mega-file”
    These “mega-records” are of varying length
    How do we store the “records” in the “mega-records”?
        – Could use another level of specified-length record software
        – Or, …

Our “Mega-File”

Organization
[Diagram: the mega-file starts with a Mega-file Header, followed by a sequence of Notes Sub-files and Image Sub-files (…). Each Notes Sub-file is a Notes Header followed by text lines and a terminator line. Each Image Sub-file is an Image Header followed by Image Data.]

More on Our Mega-File

Access
– Can we just read it sequentially?
    Why or why not?
    What if we wanted to skip a notes sub-file?
    What if some image didn’t even have a notes sub-file?
– Can we access it directly?
    What would the header have to include to allow that?
        – An index of the “records” in the file
        – We call the entries in that index “tags”
    Each tag in the tag list has:
        – Type of sub-file referred to
            • Special-case type: end of file
        – RBA of sub-file in mega-file
        – Length of sub-file (not necessary, but helpful)
        – Key information, if any, for sub-file

More on Our Mega-File

Access, continued
– So how do we access the mega-file now?
    Read and process the header
        – Get whole-file information
        – Build in-memory tag table for sub-files
    Sequential access
        – Same as before
        – May be able to program in some speed-ups from tag table
    Direct access
        – Locate sub-file in tag table
        – Go right to it
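
A minimal sketch of a tagged mega-file, under several assumptions: each tag is packed as a 4-byte type, a 4-byte RBA, and a 4-byte length; the tag list sits at the very start of the file; a tag of type "END " closes the list; and key information is left out. None of this layout comes from the slides, which only list what a tag contains.

```python
import io
import struct

TAG = struct.Struct("<4sII")  # assumed tag layout: type, RBA, length
END = b"END "                 # special-case type: end of the tag list


def build_megafile(subfiles):
    """subfiles: list of (4-byte type, payload). Returns the file's bytes."""
    rba = TAG.size * (len(subfiles) + 1)  # sub-files start after the tags
    tags, body = [], []
    for kind, payload in subfiles:
        tags.append(TAG.pack(kind, rba, len(payload)))
        body.append(payload)
        rba += len(payload)
    tags.append(TAG.pack(END, 0, 0))
    return b"".join(tags + body)


def read_tag_table(f):
    """Read the header and build the in-memory tag table."""
    f.seek(0)
    table = []
    while True:
        raw = f.read(TAG.size)
        if len(raw) < TAG.size:
            raise ValueError("missing end-of-file tag")
        kind, rba, length = TAG.unpack(raw)
        if kind == END:
            return table
        table.append((kind, rba, length))


def read_subfile(f, tag):
    """Direct access: seek to the sub-file's RBA and read its bytes."""
    _kind, rba, length = tag
    f.seek(rba)
    return f.read(length)
```

Skipping a notes sub-file, or coping with an image that has none, then needs no special handling: the reader looks up the tags it wants and seeks straight to them.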

Extensibility

Look at Our “Mega-File” Format Again
– Header tells us things about the sub-files:
    What kinds of files they are
    Where to find them
– Files themselves
    To the mega-file processor, just random bytes
    To the sub-file processor, meaningful information

What if we need a new type of sub-file?
– Define a new type of header entry
– Extend header processor to understand that entry
– Write (or borrow or buy) code to handle new sub-file

Cardinal Rule:
– Everything changes – file types, data types, ...
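
One way to keep a mega-file processor extensible is a table of handlers keyed by sub-file type, so that supporting a new type means registering one more handler rather than rewriting the processor. The type codes and function names below are hypothetical.

```python
HANDLERS = {}  # sub-file type -> function that understands its bytes


def register(kind):
    """Decorator that records a handler for one sub-file type."""
    def decorator(func):
        HANDLERS[kind] = func
        return func
    return decorator


@register(b"NOTE")
def handle_notes(payload):
    """Notes sub-file: lines of ASCII text."""
    return payload.decode("ascii").splitlines()


def process_subfile(kind, payload):
    """Dispatch to the right handler; unknown types are skipped."""
    handler = HANDLERS.get(kind)
    if handler is None:
        return None  # to this processor the payload is just bytes
    return handler(payload)
```

Returning None for an unknown type lets an older program read a newer file and quietly pass over the sub-files it does not understand.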

Factors Affecting Portability - 1

Operating System Differences
– Example – text lines
    End with line-feed character
    End with carriage-return and line-feed
    Prefixed by a count of characters in the line

Natural Language Differences
– Example – character coding
    Single-byte coding – ASCII, EBCDIC
    Multi-byte coding – Unicode (UTF-16 uses 2 or 4 bytes per character; UTF-8 uses 1 to 4)

Programming Language Differences
– Pascal can’t directly process varying-length records
– Different C++ compilers use different byte lengths for the standard data types

Factors Affecting Portability - 2

Computer Architecture Differences
– Byte order in 16-bit and 32-bit integer values
    Big-endian – leftmost (lowest-address) byte is most significant
    Little-endian – rightmost (highest-address) byte is most significant
– Storage of data in memory
    Some architectures require values that are N bytes long to start at a byte whose address is divisible by N

Example – the two bytes 0x15 0x32, stored in that order
    Big-endian interpretation:    0x1532
    Little-endian interpretation: 0x3215
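
The two interpretations of the bytes 0x15 0x32 can be reproduced with Python's built-in int.from_bytes; the function name interpret_u16 is illustrative.

```python
def interpret_u16(raw: bytes):
    """Return (big-endian value, little-endian value) of two bytes."""
    if len(raw) != 2:
        raise ValueError("expected exactly 2 bytes")
    big = int.from_bytes(raw, byteorder="big")
    little = int.from_bytes(raw, byteorder="little")
    return big, little


big_value, little_value = interpret_u16(bytes([0x15, 0x32]))
# big_value is 0x1532 and little_value is 0x3215
```

A file format that states its byte order explicitly can be read correctly on either kind of machine, because the reader converts rather than assuming its own native order.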

How to Port Files

Define Your Format C*A*R*E*F*U*L*L*Y
– Once a format is defined, never change it
    If you need a new format, add it so as not to invalidate the existing formats
    If you need to change a format, add a new one instead, and let programs that need the new version use it
– Decide on a standard format for data elements
    Text lines
        – ASCII, EBCDIC, or Unicode?
        – Which character(s) to end lines?
    Binary
        – Tightly packed or multiple-of-N addressing?
        – Which “endian”?
– You can always write code to convert to & from the standard format on a new language, computer, etc.
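
A sketch of converting text to and from a standard format, assuming the standard chosen is ASCII with a single line-feed ending each line. That choice, and both function names, are assumptions made for illustration.

```python
def to_standard_text(text: str) -> bytes:
    """Convert local text to the assumed standard: ASCII, LF line ends."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.encode("ascii")  # raises UnicodeEncodeError if not ASCII


def from_standard_text(data: bytes, newline: str = "\n") -> str:
    """Convert standard-format bytes back, using the local line ending."""
    return data.decode("ascii").replace("\n", newline)
```

Each environment needs only this one pair of routines, whatever other environments the file may later travel to.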

The Conversion Problem

Few Environments – can do directly
[Diagram: IBM and VAX, with direct converters IBM → VAX and VAX → IBM]

Many Env’ts. – need intermediate form
[Diagram: IBM, VAX, IA-32, IA-64, … each convert to and from XML (or some other standard format)]
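
The arithmetic behind the two pictures: with n environments, converting directly between every ordered pair takes n × (n − 1) converters, while going through one intermediate format takes 2 × n (one "to" and one "from" per environment). The function names are illustrative.

```python
def direct_converters(n: int) -> int:
    """Converters needed when every environment converts to every other."""
    return n * (n - 1)


def converters_via_standard(n: int) -> int:
    """Converters needed when each environment converts to and from one
    standard intermediate format."""
    return 2 * n
```

For the two environments in the first picture, direct conversion needs 2 converters against 4 via a standard format. For the four named in the second picture it is 12 against 8, and the gap widens with every environment added.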
