Algorithms and data structures

Protected by http://creativecommons.org/licenses/by-nc-sa/3.0/hr/

Creative Commons

You are free to:
– share: copy and redistribute the material in any medium or format
– adapt: remix, transform, and build upon the material

Under the following terms:
– Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
– NonCommercial: You may not use the material for commercial purposes.
– ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Text copied from http://creativecommons.org/licenses/by-nc-sa/3.0/

Algorithms and data structures, FER, 20.04.23

Notices:

You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.

No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.


Addressing techniques

– Basics
– Retrieval procedures
– Hashing


Basics

Knowing the key of some record, the question arises how to find this record

Primary key: defines a record uniquely
– E.g. StudentID

Concatenated (composite) keys: necessary for unique identification of some types of records
– E.g. StudentID & CourseCode & ExamDate uniquely define an examination record (a possible examination by a committee following a student's complaint is neglected!)

Secondary key: need not uniquely define the record, but it points at some attribute value
– E.g. YearOfStudy in the record with course data



Sequential search

Searching the file record by record: the most primitive way
It is used for sequential files where all the records have to be read anyhow
Other terms: linear, serial search
The records need not be sorted

On average n/2 records are read
Complexity: O(n)
– Best case: O(1)
– Worst case: O(n)

IspisiTrazi (PrintSearch)

Repeat for all the records
    If the current record equals the searched one
        Record is found
        Leave the loop
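The loop above can be sketched in Python as follows. This is a minimal illustration, not the course's IspisiTrazi program; the function name `sequential_search` and the use of an in-memory list instead of a file are assumptions.

```python
from typing import Optional, Sequence


def sequential_search(records: Sequence[int], key: int) -> Optional[int]:
    """Return the position of the first record equal to key, or None."""
    for position, record in enumerate(records):
        if record == key:
            # Record is found: leave the loop immediately.
            return position
    # All records were read without a match.
    return None
```

The best case (key in the first position) needs one comparison, the worst case (key last or absent) needs n.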



Sequential searching of sorted records

How to improve the sequential search?
Sort the records according to a key!

What are the complexities in the best, worst and average case while searching sorted records?

Sort the records
Repeat for all records
    If the current record equals the searched one
        Record is found
        Leave the loop
    If the current record is larger than the searched one
        Record does not exist
        Leave the loop
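A hedged Python sketch of the same idea, assuming the records are already sorted in ascending order (the name `sorted_sequential_search` is illustrative):

```python
from typing import Optional, Sequence


def sorted_sequential_search(records: Sequence[int], key: int) -> Optional[int]:
    """Sequential search in ascending data with early exit on a larger record."""
    for position, record in enumerate(records):
        if record == key:
            return position
        if record > key:
            # Every later record is larger still, so the key cannot exist.
            return None
    return None
```

The early exit shortens unsuccessful searches; the complexity class for a successful search does not change.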




Questions and exercises

1. A colleague tells you that they wrote a sequential search algorithm with complexity O(log n). Should you congratulate them or laugh at them?
2. In the best case, the record will be found after the minimum number of comparisons. Where was this record located?
3. Where was the record found after the maximum number of comparisons?



Block-wise reading

In direct/random access files (all the records are of the same length!) sorted by the primary key, it is not necessary to check all the records
– E.g. only each hundredth record is examined
When the block containing the record with the searched key is located, the block is searched sequentially

How to find the optimal block size?



Example – places in Croatia

We are looking for the city of Malinska in a list of F = 6935 places; each page contains B = 60 places
There are F / B leading records (and corresponding pages): F / B = 116

[Figure: the sorted list of places split into pages]
Page 1: Ada, Adamovec, Adžamovci, ..., Bair, Bajagić, Bajčići
Page 2: Bajići, Bajkini, Bakar-dio, ..., Barilović, Barkovići, Barlabaševec
Page 60: Mali Gradac, Mali Grđevac, Mali Iž, ..., Malinska, ..., Manja Vas, Manjadvorci, Manjerovići
Page 61: Maovice, Maovice, Maračići, ..., Martin, Martina, Martinac
Page 115: Zvijerci, Zvjerinac, Zvoneća, ..., Žitomir, Živaja, Živike
Page 116: Živković Kosa, Živogošće, Žlebec Gorički, ..., Žutnica, Žužići, Žužići

CitanjePoBlokovima (ReadingBlockWise)
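The block-wise procedure can be sketched as below. This is an illustrative Python version working on an in-memory sorted sequence, not the course's CitanjePoBlokovima program; the name `block_search` and its parameters are assumptions.

```python
from typing import Optional, Sequence


def block_search(records: Sequence[int], key: int, block_size: int) -> Optional[int]:
    """Search sorted records by reading one leading record per block."""
    if not records:
        return None
    # Step 1: skip whole blocks while the next leading record is not past the key.
    start = 0
    while start + block_size < len(records) and records[start + block_size] <= key:
        start += block_size
    # Step 2: search the located block sequentially.
    for position in range(start, min(start + block_size, len(records))):
        if records[position] == key:
            return position
        if records[position] > key:
            break
    return None
```

With string keys such as place names the same code applies, provided the list is sorted by the same ordering that the comparison operators use.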



Optimal block size

In case of F records and the block size B, there are F / B leading records and blocks
It is expected that during the search by blocks it is necessary to read a half of the existing leading records of the blocks
– On average, the searched leading record will be found after having read (F / B) / 2 = F / (2 B) records

Within the located block there are B records, so it can be expected that the searched record would be found after on average B / 2 (sequential!) readings within that block

The total expected number of readings is F / (2 B) + B / 2
Setting the derivative with respect to B equal to zero gives the optimal block size:
B = √F

What is the optimal block size for the list of places in Croatia?
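The minimisation step can be written out as follows; the symbol T(B) for the total expected number of readings is introduced here only for illustration.

```latex
\begin{align*}
T(B) &= \frac{F}{2B} + \frac{B}{2} \\
\frac{dT}{dB} &= -\frac{F}{2B^{2}} + \frac{1}{2} = 0
  \;\Longrightarrow\; B^{2} = F
  \;\Longrightarrow\; B = \sqrt{F} \\
T(\sqrt{F}) &= \frac{\sqrt{F}}{2} + \frac{\sqrt{F}}{2} = \sqrt{F}
\end{align*}
```

For comparison, with F = 6935 and the page size B = 60 from the example, T(60) = 6935/120 + 30 ≈ 87.8 readings, while the optimum is √6935 ≈ 83.3.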



Binary search

The binary search starts at the middle of the file/array and continues with repeated halving of the interval where the searched record could be found
– Prerequisite: the data are sorted!
Average number of searching steps: log2 n
The procedure is inappropriate for disk memories with direct access due to the time-consuming positioning of reading heads. It is recommended for main (core) memory.
The fact is used that the array is sorted, and in each step the search area is halved
Complexity is O(log2 n)

[Figure: an array with number of elements = n and the searched element marked; search steps = log2 n]



Example of binary search

Looking for 25

2 5 6 8 9 12 15 21 23 25 31 39



Algorithm for binary search

lower_bound = 0
upper_bound = total_elements_count - 1
Repeat
    Find the middle record
    If the middle record equals the searched one
        Record is found
        Leave the loop
    If the lower bound is greater than or equal to the upper bound
        Record is not found
        Leave the loop
    If the middle record is smaller than the searched one
        Set the lower bound to the position of the current record + 1
    If the middle record is larger than the searched one
        Set the upper bound to the position of the current record - 1

BinarnoPretrazivanje (BinarySearch)
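A minimal Python sketch of binary search on a sorted list follows. It is not the course's BinarnoPretrazivanje program; the name `binary_search` is an assumption, and it uses the common `while lower <= upper` loop form with inclusive bounds rather than the explicit "leave the loop" checks.

```python
from typing import Optional, Sequence


def binary_search(records: Sequence[int], key: int) -> Optional[int]:
    """Return the position of key in ascending records, or None."""
    lower, upper = 0, len(records) - 1  # inclusive bounds
    while lower <= upper:
        middle = (lower + upper) // 2
        if records[middle] == key:
            return middle
        if records[middle] < key:
            lower = middle + 1   # key can only be in the upper half
        else:
            upper = middle - 1   # key can only be in the lower half
    return None
```

For the 12-element example list and the key 25, the middle positions visited are 5, 8, 10 and 9.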



Questions

1. What are the execution times in binary search of n records for the best, worst and average case?
   The average case takes only about one step fewer than the worst case. A proof can be found at:
   http://www.mcs.sdsmt.edu/ecorwin/cs251/binavg/binavg.htm (March 24th, 2014)
2. For the search of places in Croatia (list of 6935 places), what is the maximum number of steps necessary to locate the searched place?
3. Is binary search always faster than sequential search, even for a large set of data?



Problem

Suppose that n unsorted data in a set can be sorted in time O(n log2 n). You have to perform n searches in this data set. What is better:
– To use the sequential search?
– To sort the data and then apply the binary search?

Solution: it makes more sense to sort and then use binary search!
– n sequential searches: n * O(n) = O(n^2)
– sort + binary: O(n log2 n) + n * O(log2 n) = O(n log2 n)

O(n log2 n) < O(n^2)



Index-sequential files

Every record contains a key as a unique identifier
If a file is sorted by the key, it is appropriate to form a look-up table
– Input for the table is the key of the searched record
– Output is the information regarding the more precise location of the searched record
Such a table is called an index
The index need not point to each record but only to a block
– In the example (places in Croatia), it was shown that the optimal block size was √F
For large files there are indices on multiple levels. With the optimal organisation, in the worst case the same number of records is read in each of the indices and in the data file
– Optimal sizes of indices on 2 levels are $\sqrt[3]{F}$ and $\sqrt[3]{F^{2}}$, giving $O(\sqrt[3]{F})$
– For k levels: $\sqrt[k+1]{F},\ \sqrt[k+1]{F^{2}},\ \sqrt[k+1]{F^{3}},\ \ldots,\ \sqrt[k+1]{F^{k-1}},\ \sqrt[k+1]{F^{k}}$, giving $O(\sqrt[k+1]{F})$



Index-sequential files

Insertion and deletion
In a traditional sequential file (magnetic tape), they are possible only by copying the whole file with addition and/or deletion of records
In direct (random) access files, deletion is performed logically, i.e. a tag is written to mark a deleted record
After a certain number of revisions, the file has to be reorganised
– Data are written sequentially and by the key value, and indexing is repeated (so-called file maintenance)



Search procedures

Index non-sequential files
If search by multiple keys is required, or if addition and deletion of records is frequent (volatile files), it is hard to maintain the requirement that the records are sorted and contained within their initial block: very difficult (if not impossible) in the first case, and difficult in the second
In that case, the index should contain the address (relative or absolute) of each single record

The key contains the address
The simplest case is to form the key so that some part of it contains the record address
– E.g. at an entrance examination for enrolment, the application number can serve as the key, and simultaneously it can be the ordinal number of the record in a direct access file
– Very often, such a simple procedure is not possible because the coding scheme cannot be adapted to each single application



The idea of hashing

Problem: a company employs about a hundred thousand employees. Every person has a unique identifier (ID), generated from the interval [0, 1 000 000]. Reading and writing of records must be fast. How to organise the file?
Direct file with the key equal to the ID?
– 1 000 000 x 4 bytes ~ 4 MB, 90% of space is unused!
It is possible to devise procedures to transform the key into an address or, even better, into some ordinal number
– The position of the record is stored under this ordinal number
– This modification improves the flexibility



Hashing

Let us suppose that M buckets (blocks of records with the same starting address of the block) are available
A pseudo-random number from the interval 0 to M-1 is calculated from the key value using a hash function
This number is the address of a group of data (a bucket) holding all records whose keys are transformed into the same pseudo-random number
A collision happens when two different keys are transformed into the same address
If a bucket is full, it is possible to insert a pointer to the overflow area, or insertion is attempted in the next neighbouring bucket (bad neighbour policy)

In hashing the following parameters can vary:
– Bucket capacity
– Packing density



Example

Store the names into a hash table
hash-function = (sum of ASCII codes) % (number of buckets)

[Figure: a hash table with buckets 0, 1, 2 and 3; the names Vanja, Matija, Andrea, Doris, Saša, Alex, Sandi, Perica and Iva are to be placed into the buckets, with a "?" marking the bucket to be determined]
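The hash function from this example can be sketched in Python as below. The names `name_hash` and `build_table` are illustrative. One assumption matters: `ord` returns Unicode code points, which coincide with ASCII codes only for ASCII characters, so the bucket for a name such as Saša depends on the character encoding actually used.

```python
from typing import Dict, List


def name_hash(name: str, bucket_count: int = 4) -> int:
    """Sum of the character codes modulo the number of buckets."""
    return sum(ord(ch) for ch in name) % bucket_count


def build_table(names: List[str], bucket_count: int = 4) -> Dict[int, List[str]]:
    """Group names by bucket address; each bucket is an unbounded list here."""
    table: Dict[int, List[str]] = {address: [] for address in range(bucket_count)}
    for name in names:
        table[name_hash(name, bucket_count)].append(name)
    return table
```

Because the sum ignores the order of the characters, any two names that are anagrams of each other land in the same bucket.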



Bucket capacity

A pseudo-random number is generated through a transformation of the key, yielding the bucket address
If the bucket capacity equals 1, overflow is frequent
With increasing bucket size, the probability of overflow decreases, but reading a single bucket takes more time and the amount of sequential search within a bucket increases
It is recommended to match the bucket size with the physical size of the record on external memory (disk block)



Packing density

After the bucket size has been chosen, the packing density can be selected, i.e. the number of buckets to store the foreseen number of records
To reduce the number of overflows, a larger capacity is chosen
Packing density = number of records / total capacity
– N = number of records to be stored
– M = number of buckets
– C = number of records within a bucket
Packing density = N / (M * C)



How to deal with overflow?

Using the primary area
– If a bucket is full, use the next one, etc.
– After the last one comes the first one
– Efficient if the bucket size exceeds 10

Separate chaining
– Buckets are organised as linear lists
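The first strategy (overflow into the next bucket of the primary area, cyclically) can be sketched as follows. The class `BucketTable`, its method names, and the placeholder hash `key % M` are assumptions made for illustration; the sketch keeps the table in memory and does not support deletion.

```python
from typing import List, Optional, Tuple


class BucketTable:
    """M buckets of capacity C; a full bucket overflows into the next one."""

    def __init__(self, bucket_count: int, capacity: int) -> None:
        self.capacity = capacity
        self.buckets: List[List[Tuple[int, str]]] = [[] for _ in range(bucket_count)]

    def _home(self, key: int) -> int:
        # Placeholder hash: remainder after division by the bucket count.
        return key % len(self.buckets)

    def insert(self, key: int, value: str) -> bool:
        """Store the record; return False if every bucket is full."""
        home = self._home(key)
        for step in range(len(self.buckets)):
            bucket = self.buckets[(home + step) % len(self.buckets)]
            if len(bucket) < self.capacity:
                bucket.append((key, value))
                return True
        return False

    def find(self, key: int) -> Optional[str]:
        """Probe buckets cyclically, starting at the home bucket."""
        home = self._home(key)
        for step in range(len(self.buckets)):
            bucket = self.buckets[(home + step) % len(self.buckets)]
            for stored_key, value in bucket:
                if stored_key == key:
                    return value
            if len(bucket) < self.capacity:
                # Without deletions, the record would have been stored here.
                return None
        return None
```

Separate chaining would instead let each bucket grow as a linked list, so no record ever leaves its home bucket.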



Statistics of hashing

Let M be the number of buckets and N the amount of input data. The probability that x records are directed into a certain bucket obeys the binomial distribution:

$$P(x) = \frac{N!}{x!\,(N-x)!} \left(\frac{1}{M}\right)^{x} \left(1 - \frac{1}{M}\right)^{N-x}$$

The probability for Y overflows is P(C + Y), so the expected number of overflows from a given bucket is:

$$s = \sum_{Y \ge 1} P(C + Y) \cdot Y$$

Total expected number of overflows (in percent of the N records): 100 s M / N
The average number of records to be entered into the hash table before a collision ~ 1.25 √M
The average total number of entered records before every bucket contains at least 1 record is M ln M
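These quantities are easy to evaluate numerically. The sketch below assumes the binomial model stated above; the function names are illustrative, and the sum over Y stops at N - C because P(x) is zero for x > N.

```python
from math import comb


def bucket_probability(x: int, n: int, m: int) -> float:
    """P(x): probability that exactly x of n records fall into one of m buckets."""
    return comb(n, x) * (1 / m) ** x * (1 - 1 / m) ** (n - x)


def expected_overflows(n: int, m: int, c: int) -> float:
    """s: expected number of records overflowing a bucket of capacity c."""
    return sum(y * bucket_probability(c + y, n, m) for y in range(1, n - c + 1))


def overflow_percentage(n: int, m: int, c: int) -> float:
    """100 * s * M / N: overflowing records as a percentage of all records."""
    return 100 * expected_overflows(n, m, c) * m / n
```

As a sanity check, with capacity 0 every record overflows, so the expected overflow per bucket equals the mean N / M and the percentage is 100.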



Transformation of the key into address

Generally, the key is transformed into the address in 3 steps:
1. If the key is not numeric, it should be transformed into a number, preferably without loss of information
2. An algorithm is applied to transform the key, as uniformly as possible, into a pseudo-random number with an order of magnitude of the bucket count
3. The result is multiplied by a constant ≤ 1 for transformation into an interval of relative addresses, equal to the number of buckets
   – Relative addresses are converted into absolute ones on a concrete physical unit and, as a rule, that is the task of system programs

An ideal transformation: the probability of transforming 2 different keys in a table of size M into the same address is 1/M



Characteristics of a good transformation

The output value depends only on the input data
– If it also depended on some other variable, that value would also have to be known at search time
The function uses all the information from the input data
– If it did not, a small variation of the input data would produce a large number of equal outputs, and the distribution would depart from the desired one
Uniformly distributes the output values
– Otherwise efficiency is decreased
For similar input data, results in very different output values
– In reality, the input data are often very similar, while a uniform distribution at the output is required

Which of these requirements are not fulfilled by our hash function from the previous example?


Usage of hashing

When is it appropriate?
– Compilers use it for recording declared variables
– For spelling checkers and dictionaries
– In games to store the positions of players
– For equality checks (in information security)
  – If two elements result in different hash values, they must be different
– When quick and frequent search is required

When is it not appropriate?
– When records are searched by a non-key attribute value
– When the data need to be sorted
  – E.g. to find the minimum value of the key



Example: Transformation of key into address

6-digit key, 7000 buckets; key: 172148
Method: middle digits of the key squared
– The key squared yields a 12-digit number. Digits from the 5th to the 8th are used
– 172148^2 = 0296 3493 3904 (middle digits: 3493)
The middle 4 digits should be transformed into the interval [0, 6999]
– As the pseudo-random number takes values from the interval [0, 9999], while the bucket addresses can be from the interval [0, 6999], the multiplication factor is 6999/9999 ≈ 0.7
– Bucket address = 3493 * 0.7 ≈ 2445
The results correspond to the behaviour of a roulette
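The worked example can be reproduced with a short Python sketch. The function name `mid_square_address` is an assumption, and the scaling uses the rounded factor 0.7 (7000/10000) with integer arithmetic, as in the slide's final calculation.

```python
def mid_square_address(key: int, bucket_count: int = 7000) -> int:
    """Middle-square method for a 6-digit key and a 4-digit middle part."""
    squared = f"{key * key:012d}"   # pad the square to 12 digits
    middle = int(squared[4:8])      # digits 5 to 8, a value in [0, 9999]
    # Scale [0, 9999] into [0, bucket_count - 1] using integer arithmetic.
    return middle * bucket_count // 10000
```

For the key 172148 the padded square is 029634933904, the middle part is 3493, and the address is 2445.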



Methods for transformation of key into address

Root from the middle digits of the key squared
– Like in the previous example, but after squaring, the square root of the 8 middle digits is calculated and truncated to obtain a four-digit integer: sqrt(96349339) = 9815

Division
– The key is divided by a prime number slightly less than or equal to the number of buckets (e.g. 6997)
– The remainder after division is the bucket address
  – Bucket address = 172148 mod 6997 = 4220
– Keys having a sequence of values are well distributed

Shifting of digits and addition
– E.g. key = 17207359
  – 1720 + 7359 = 9079
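The three methods above can be checked with the following sketch. The function names and default parameters are illustrative assumptions; only the numeric examples come from the slide.

```python
from math import isqrt


def root_of_middle_digits(key: int) -> int:
    """Truncated square root of the 8 middle digits of the 12-digit square."""
    squared = f"{key * key:012d}"
    return isqrt(int(squared[2:10]))


def division_address(key: int, prime: int = 6997) -> int:
    """Remainder after division by a prime close to the bucket count."""
    return key % prime


def shift_and_add(key: int, part_length: int = 4) -> int:
    """Cut the key into groups of digits and add the groups."""
    digits = str(key)
    parts = [digits[i:i + part_length] for i in range(0, len(digits), part_length)]
    return sum(int(part) for part in parts)
```

The results of the root and shifting methods still have to be scaled into the address interval, as in the previous example.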



Methods for transformation of key into address

Overlapping
– Overlapping resembles shifting, but is more appropriate for long keys
– E.g. the key = 172407359; the outer groups 172 and 359 are reversed and added to the middle group
  – 407 + 953 + 271 = 1631

Change of the counting base
– The number is calculated as if having another counting base B
– E.g. B = 11, key = 172148
  – 1*11^5 + 7*11^4 + 2*11^3 + 1*11^2 + 4*11^1 + 8*11^0 = 266373
– The necessary number of least significant digits is selected and transformed into the address interval: bucket address = 6373 * 0.7 ≈ 4461

The best method can be chosen after simulation for a concrete application
– Generally, division gives the best results
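Overlapping and the change of counting base can be sketched as below. The function names and the way the key is passed (three digit groups as strings for overlapping) are assumptions made to keep the example short.

```python
def overlap(first: str, middle: str, last: str) -> int:
    """Reverse the outer digit groups and add all three groups."""
    return int(first[::-1]) + int(middle) + int(last[::-1])


def change_base(key: int, base: int = 11) -> int:
    """Read the decimal digits of key as digits of a number in another base."""
    value = 0
    for digit in str(key):
        value = value * base + int(digit)
    return value


def base_change_address(key: int, base: int = 11) -> int:
    """Keep the 4 least significant digits and scale them by 0.7."""
    return (change_base(key, base) % 10000) * 7 // 10
```

Here `overlap("172", "407", "359")` reproduces 1631 and `change_base(172148)` reproduces 266373.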



Determination of parameters

Example:
– There are 350 students enrolled in a study. Their ID (identification number, 11 characters) and family name (14 characters) should be stored, under the requirement to retrieve the records fast using their ID.

Remark:
– The ID contains 11 digits
  – The last digit is a control digit; it can but need not be stored if the rule for calculating it from the rest of the digits is known



Solution

A record contains 11+1 + 14+1 = 27 bytes
Let the physical block on the disk be 512 bytes

The bucket size should be equal to or less than that
– 512/27 = 18.963

Therefore, a bucket shall contain data for 18 students and 26 bytes of unused space

A somewhat larger table capacity, e.g. 30%, shall be provided to reduce the number of expected overflows

– That means there are 350/18 * 1.3 = 25.28, i.e. 25 buckets
– ID should be transformed into a bucket address from the interval [0, 24]

ID is rather long, so overlapping can be considered
– Methods can be combined: after overlapping, division can be performed
– Address shall be calculated by division with a prime number close to the bucket number, e.g. 23
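The arithmetic of this solution can be written down as C constants. This is only a sketch: all macro names and the `struct record` layout are assumptions made for illustration, and the struct is assumed to be laid out without padding (typical for a struct that contains only `char` arrays).

```c
#define ID_LEN        11     /* characters in the ID */
#define NAME_LEN      14     /* characters in the family name */
#define BLOCK_SIZE    512    /* bytes in one physical disk block */
#define N_STUDENTS    350    /* expected number of records */
#define SPARE_FACTOR  1.3    /* 30% extra capacity against overflows */

/* One stored record; each string carries a terminating '\0'. */
struct record {
    char id[ID_LEN + 1];
    char family_name[NAME_LEN + 1];
};

/* 11+1 + 14+1 = 27 bytes */
#define RECORD_SIZE      ((int)sizeof(struct record))

/* Whole records that fit into one block: 512 / 27 -> 18 */
#define BUCKET_CAPACITY  (BLOCK_SIZE / RECORD_SIZE)

/* Bytes left over in each block: 512 - 18 * 27 = 26 */
#define UNUSED_BYTES     (BLOCK_SIZE - BUCKET_CAPACITY * RECORD_SIZE)

/* 350 / 18 * 1.3 = 25.28, truncated to 25 buckets */
#define BUCKET_COUNT     ((int)(N_STUDENTS * SPARE_FACTOR / BUCKET_CAPACITY))
```

Changing `BLOCK_SIZE` or `N_STUDENTS` recomputes the other parameters, which is convenient when simulating different configurations.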



Writing of records into buckets of the hash table

[Figure: a record (ID, FamilyName) passes through HASH to a table of buckets numbered 0 to M-1, each stored as a BLOCK on disk; further labels in the figure: C, 1, M]



Examples of key transformation

If the bucket is full, entering into the next bucket is attempted cyclically (bad neighbour policy)

The ten digits in front of the control digit x are split into groups of 3 + 4 + 3 digits. The middle group is kept, the two outer groups are written in reverse, the three numbers are added, and the sum is divided by 23.

ID = 5702926036x:  2926 + 630 + 075 = 3631,  3631 mod 23 = 20
ID = 6702926037x:  2926 + 730 + 076 = 3732,  3732 mod 23 = 6
ID = 5702926037x:  2926 + 730 + 075 = 3731,  3731 mod 23 = 5
ID = 6702926036x:  2926 + 630 + 076 = 3632,  3632 mod 23 = 21
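The four transformations above can be reproduced with a small C function. This is a sketch under assumptions: the name `id_address` is hypothetical, the ID is passed as a string of at least ten digit characters, and the control digit in position 11 is not used.

```c
#define PRIME 23   /* divisor used in the examples */

/* Overlapping followed by division for an ID given as a string.
   Only the first ten characters are read; the control digit is ignored. */
int id_address(const char *id)
{
    /* middle four digits as they stand, e.g. "2926" -> 2926 */
    int middle = (id[3] - '0') * 1000 + (id[4] - '0') * 100
               + (id[5] - '0') * 10   + (id[6] - '0');

    /* last three digits read backwards, e.g. "036" -> 630 */
    int right_reversed = (id[9] - '0') * 100 + (id[8] - '0') * 10 + (id[7] - '0');

    /* first three digits read backwards, e.g. "570" -> 075 */
    int left_reversed  = (id[2] - '0') * 100 + (id[1] - '0') * 10 + (id[0] - '0');

    return (middle + right_reversed + left_reversed) % PRIME;
}
```

For example, `id_address("5702926036x")` returns 20 and `id_address("6702926037x")` returns 6. IDs that differ in a single digit land in different buckets.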



Algorithm - 1

Create an empty table on a disk
Read ID and family name sequentially, as long as there are data
    If the control digit is not correct
        "Incorrect ID"
    Else
        Set the tag that the record is not written
        Calculate the bucket address
        Remember it as the initial address
        Repeat
            Read the existing records from the bucket
            Repeat for all the records from the bucket
                If the record is not empty
                    If the already written ID is equal to the input one
                        "Record already exists"
                        Put the tag that the record is written
                Else



Algorithm - 2

                    Write the input record
                    Put the tag that the record is written
            If the record is not written
                Increment the bucket address by 1 and calculate the modulo (bucket count)
                If the achieved address equals the initial
                    Table is full
        Until record written or table full
End

Hash
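The core loop of the algorithm can be sketched in C. To keep the example self-contained, the table lives in memory instead of on disk, the control-digit check is left out, and the bucket address is computed by the caller. All names (`table`, `insert_record`, `N_BUCKETS`, `BUCKET_CAPACITY`) and the return codes are assumptions for illustration, not the contents of the Hash program.

```c
#include <stdio.h>
#include <string.h>

#define N_BUCKETS        25   /* number of buckets in the table */
#define BUCKET_CAPACITY  18   /* records per bucket */

struct record {
    char id[11 + 1];          /* empty slot: id[0] == '\0' */
    char family_name[14 + 1];
};

/* In-memory stand-in for the table on disk; static storage is
   zero-initialised, so every slot starts out empty. */
static struct record table[N_BUCKETS][BUCKET_CAPACITY];

/* Returns 1 if the record was written, 0 if the ID already exists,
   -1 if every bucket was visited and no free slot was found. */
int insert_record(int address, const char *id, const char *family_name)
{
    int initial = address;            /* remember the initial address */

    do {
        struct record *bucket = table[address];   /* "read" the bucket */

        for (int j = 0; j < BUCKET_CAPACITY; j++) {
            if (bucket[j].id[0] != '\0') {
                if (strcmp(bucket[j].id, id) == 0)
                    return 0;         /* record already exists */
            } else {
                /* first empty slot: write the input record */
                snprintf(bucket[j].id, sizeof bucket[j].id, "%s", id);
                snprintf(bucket[j].family_name,
                         sizeof bucket[j].family_name, "%s", family_name);
                return 1;
            }
        }
        /* bucket full: try the next one cyclically */
        address = (address + 1) % N_BUCKETS;
    } while (address != initial);

    return -1;                        /* table is full */
}
```

A disk-based version would replace the direct access to `table[address]` with reading and writing one block of the file, which is the step the pseudocode calls "Read the existing records from the bucket".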



Exercises

Update Hash such that instead of JMBG (13 character ID) it uses OIB (11 character ID). Update the function "Kontrola" for checking the control digit. Implement deletion by ID.

Let products be the name of the unformatted file organised using hashing. Each record consists of ID (int), name (50+1 char), quantity (int) and price (float). The record is considered empty if its ID equals zero. Count the number of non-empty records in the file. The size of the block on the disk is BLOCK, and the expected maximum number of records is MAXREC. These parameters are contained in parameters.h.

Let an unformatted file be organised using hashing. Each record of the file consists of name (50+1 char), quantity (int) and price (float). The record is considered empty if its quantity equals zero. The size of the block on the disk is BLOCK, which is defined in parameters.h. Write the function that will find the packing density. The prototype of the function is:

float density (const char *file_name);



Exercises

Write the function for the insertion of the ID (int) and the name (char 20+1) into the memory resident hash table with 500 buckets. Each bucket consists of a single record. If the bucket is full, the next bucket is used (cyclically). The input arguments are the bucket address (previously computed), ID and name. The function returns 1 if the insertion is successful, 0 if the record already exists, and -1 if the table is full so the record cannot be inserted.

Write the function that will find the ID (int) and the company name (30+1) in the memory resident hash table with 200 buckets. Each bucket consists of a single record. If the bucket is full and does not contain the searched key value, the next bucket is used (cyclically). Input arguments are the bucket address (previously computed) and the ID. The output argument is the company name. The function returns 1 if the record is found, and 0 if it is not found.



Exercises

Let a key be a 7-digit telephone number. Write the function for transformation of the key into an address. The hash table consists of M buckets. The division method should be used. The prototype of the function is:

int address (int m, long teleph);

Let products be the name of the unformatted file organised using hashing. A record consists of ID (4-digit integer), name (up to 30 char) and price (float). The record is considered empty if its ID equals zero. Write the function for emptying the file. The size of the block on the disk is BLOCK, and the expected maximum number of records is MAXREC. These parameters are contained in parameters.h.

Write the function that will compute the bucket address in the table of 500 buckets. The key is a 4-digit ID, and the method is the root from the central digits of the key squared.

