UNIVERSIDAD DE BELGRANO
The Relational Model
(El Modelo Relacional de Base de Datos)
Author: Prof. PAULA M. ANGELERI
Chair: Systems Design
- 2001 -
Executive Summary
This paper is, basically, a good introduction to relational database theory. It explains the main relational issues in an easy-to-follow manner. It is structured in five main sections. The first one explains the title of the paper itself and defines each of its terms. The second section is an Introduction that explains the benefits of using a database approach. The third section explains, through an explanatory example, why a relational database needs to be normalized. The next section explains how we can normalize a relational database and the rules involved in the normalization process; again, the theory is accompanied by an explanatory example. After these examples, another mechanism that can help in understanding the way the database is structured is provided: a kind of guideline for mapping the Entity Relationship data model into the relational model. This guideline is explained with the purpose of helping readers who are familiar with data modeling but not yet familiar with normalization theory. The idea is to take advantage of this data modeling knowledge and provide such readers a mechanism they can understand, in order to check whether the normalization process was well applied. The last section describes the problem associated with the normalization process (low performance in queries) and then provides a mechanism to solve it. After the paper body, an Appendixes section is included, in which the origins of the relational model are explained. In addition, a Glossary describes all the relational terms that are not explained in the paper.
Table of Contents
Do Relational Databases Need Normalization?
What Is a Database?
What Does Relational Mean?
The Relational Database Model
What Is Normalization?
Introduction
Why Should We Apply Normalization?
Data Redundancy
Inconsistency among Data
Advantages of Further Normalization
Explanatory Example
Major New Concepts
How to Apply Normalization in a Relational Database?
The Normalization Theory
Mapping the ER Model into the Relational Model
Normalization Implications
Query Optimization
The Optimization Process
Conclusion
Appendixes
The Origins of the Relational Model
Glossary
References
Do Relational Databases Need Normalization?

The question of whether to apply normalization in relational databases is a controversial one in the database field. Some people believe that relational databases must be normalized because normalization ensures that the data stored in the database are accurate. This paper will provide enough evidence to show that a relational database needs normalization in order to maintain consistent data. However, there is still an important problem that normalization theory has not resolved: poor performance in queries.

Those who do not believe in normalization theory support their argument by saying that applying normalization brings two disadvantages. The first is that a normalized database needs more time to answer a query. The second is that the programmer must write more code in the query programs.
This paper will discuss, in detail, both positions. In addition, the terms database,
relational and normalization will be defined, for those who are not familiar with them.
What Is a Database?
A database is a collection of computerized data. More precisely, a database is a collection of logically related files. These files contain data that are stored permanently so that they can be used whenever necessary. Access to these data is provided by a database system. A
database system is a computerized system composed of data, hardware, software and users (Date
4). The main goal of a database system is to provide the company centralized control of its data
(9).
As an example of a database, we can consider the student database of a university. It has a
file that contains the personal data of the students (name, address, telephone, etc.). There is
another file that contains the information about the courses that each student is taking, and there
is a file with the information about the courses that the students have taken in the past.
What Does Relational Mean?

In this context, the term relational implies that two or more files have logical links: one
file has a reference to another file. For instance, one file contains the student name, address and
zip code, and the other file contains the zip code and the name of the city. In the first file we have
a logical link to the second file: the zip code, because it can be used as a connector between both
files.
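As a sketch of this logical link, the two files can be connected in a query on the shared zip code. The table and column names below are illustrative (not from the paper), using SQLite:

```python
import sqlite3

# Two "files" (tables) linked logically by a shared zip code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, address TEXT, zip_code TEXT)")
conn.execute("CREATE TABLE city (zip_code TEXT PRIMARY KEY, city_name TEXT)")
conn.execute("INSERT INTO student VALUES ('Angeleri, Paula', '1111 Dartmouth Ave.', '91711')")
conn.execute("INSERT INTO city VALUES ('91711', 'Claremont')")

# The zip code acts as the connector between both files.
row = conn.execute("""
    SELECT s.name, c.city_name
    FROM student s JOIN city c ON s.zip_code = c.zip_code
""").fetchone()
print(row)  # ('Angeleri, Paula', 'Claremont')
```

The join condition `s.zip_code = c.zip_code` is exactly the logical link described above: no physical pointer connects the files, only the shared value.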
The Relational Database Model

The relational model was developed by Codd in the early 1970s. The power and completeness of this model derive from the richness of the relational calculus and the algebraic operators on which it is based. According to Haderle, D.J., "Many users say: 'It's intuitive.' The simplicity is derived from Codd's development of the model on a sound theoretical base."

The strength of the relational model lies in the fact that the data records are not physically dependent; they are linked by meaning. In other words, records reference one another through related data values. The relational model has logical bonds between data.
According to Atre, one of the major advantages of the relational approach is its
simplicity.[...] The end users do not have to be concerned about the physical storage structure;
they can be oriented toward the information content of their data and need not worry about the
physical representation details. The ability to use the database without knowing the
representation details is called data independence. (Atre, 93)
In the relational model, the data structure is represented in a table or relation, which consists of a set of tuples (often called "rows" or "records") intersected by a set of columns (often called "attributes"). Summarizing, a table is divided horizontally into rows and vertically into columns.

The intersection of a row and a column is a "field". In this model each field contains a single data value, which may be null. For a given column, all data values are of the same type or "domain." Consequently, for a given table, all rows are of the same type. Another important concept is that in the relational model a "primary key" is selected for each table. A primary key is the attribute or shortest group of attributes that identifies each row (Nijssen and Halpin 247-256).
To clarify all these new terms, I will present an example of a table:

ID    Name              Telephone
1576  Angeleri, Paula   788-3395
1598  Paulon, Daniela   773-6889   <= rows or tuples
1783  Rabossi, Marcelo  797-0345

Value of a field: "1576"
Domain: numbers
Primary key: the attribute ID, because it is the only column that identifies a tuple. (We can have two persons with the same name, or the same telephone, but each person has his/her own
identification number.) In the example above, the primary key value is 1576, and there is no other record with this value.
What Is Normalization?

Normalization is the process of converting the files of a database into a more desirable form (Date 238). Normalization indicates whether a file needs to be split into two or more files. The following example shows how a file can be split:

Not Normalized Approach:

ID    Name              Address              Zip Code  City
1576  Angeleri, Paula   1111 Dartmouth Ave.  91711     Claremont
1598  Paulon, Daniela   124 Foothill Bld.    91711     Claremont
1783  Rabossi, Marcelo  7666 Arrow Hwy.      91888     Upland

Normalized Approach:

ID    Name              Address              Zip Code
1576  Angeleri, Paula   1111 Dartmouth Ave.  91711
1598  Paulon, Daniela   124 Foothill Bld.    91711
1783  Rabossi, Marcelo  7666 Arrow Hwy.      91888

Zip Code  City
91711     Claremont
91888     Upland

This example shows how a file can be split into two files, by storing the name of the city in a separate file.
Introduction
Today all companies need databases to store their data because they handle a large volume of information. The purpose of this stored information is to be retrieved when it is needed, so it is important to have the data organized in order to have easier and quicker access to it. This data organization is known as a "data model." A model is an abstraction of reality, often represented in a diagram. Data are represented in a model in order to become clearer for the database designer.
The purpose of this research is to provide enough information to convince the reader that
relational databases must be normalized in order to improve the benefits of having a database.
Some of these benefits are:
- Providing quick access to data.
- Sharing data.
- Having consistent information.
- Reducing redundancy.
- Giving on-line access.
- Having up-to-date information.
- Providing standards. (Date, 1990, 13-22)
- Applying security restrictions.
- Preventing failure condition errors.
Quick access to data
The database system is equipped with a Data Query Language (DQL) that provides easy and quick access to the data.
Shared data
Sharing data means that the data stored in the database can be accessed by different systems and
users.
Consistent information
The information is consistent when the data are accurate. To understand this, suppose that the database has duplicated information. There could be occasions on which this information is updated only once, leaving the other duplicated data unchanged. This is called an update anomaly. (Data inconsistency will be explained later in more detail.)
Redundancy
In a non-database approach, each computer program has its own private files. That implies redundant information, because data used in two different programs must be stored twice. In a database, data are shared by many users and many programs, reducing the redundancy that can lead to inconsistencies. (Data redundancy is also explained later on.)
On-line access
Information is available on demand at any time, in an interactive way.
Up-to-date information
Because the information is centralized, data are updated from different system programs operated by users of different departments. Therefore the data are current.
Standards
Applicable standards are used to make data and documentation uniform. The database approach also provides standard reports and forms that help system programmers.
Security restrictions
The database system provides effective control of data access. Each user of the database has a limited view of the data and restricted access authorization. Therefore the database system provides security (preventing data damage) and confidentiality (data are seen only by authorized users) (Davis, 42).
Failure condition errors
The database system provides a mechanism to recover data in the event of either hardware or software failure (Davis, 43).
Why Should we Apply Normalization?
Databases help to resolve problems, but it is necessary to follow some patterns in order to avoid other problems. For example, normalization eliminates data redundancy.

Another advantage of applying normalization patterns is that normalization reduces problems concerning data manipulation: inserting, deleting and updating data (Dutka and Hanson 9).

Finally, normalization is applied in order to avoid a problem called inconsistency among data. When data are not consistent, they are not reliable.

Summarizing, the normalization process is applied in order to achieve the following objectives:
1. It ensures that the database does not present insert, update, or delete anomalies.
2. It allows new kinds of data to be introduced in the database without the need to restructure it. This is important because it extends the lifetime of the applications.
3. It stresses the benefits of having a database, by making it more informative for the user.
4. It ensures that the database is free of any design bias that favors certain queries at the expense of others.
5. It allows any relation to be represented as a table, by eliminating repeating groups.
6. It simplifies the data access operations.
However, the normalization process can bring a couple of possible disadvantages (Codd):
1. It incurs extra relation names.
2. It makes it necessary for some queries to use more join operations than in a denormalized database.[1] According to Date, Codd explained that this inconvenience could be solved by providing predefined views for heavily used queries (Date, March 1999).
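Date's suggestion of predefined views can be sketched as follows. The table, column, and view names are my own illustration (not from the paper), using SQLite: the view hides the extra join from users of a heavily used query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, zip_code TEXT);
    CREATE TABLE city (zip_code TEXT PRIMARY KEY, city_name TEXT);
    INSERT INTO student VALUES (1576, 'Angeleri, Paula', '91711');
    INSERT INTO city VALUES ('91711', 'Claremont');

    -- A predefined view for a heavily used query: users see one "table"
    -- and never write the join themselves.
    CREATE VIEW student_city AS
        SELECT s.id, s.name, c.city_name
        FROM student s JOIN city c ON s.zip_code = c.zip_code;
""")
result = conn.execute("SELECT name, city_name FROM student_city").fetchall()
print(result)  # [('Angeleri, Paula', 'Claremont')]
```

The data stay normalized in two tables; only the query interface is "denormalized" through the view.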
Data Redundancy

Data redundancy means that the same data are stored in the database more than once. This is not recommended, first because it means a duplication of effort (the same update operation must be performed as many times as the data are duplicated). Secondly, data redundancy means a waste of storage space, because the data are stored repeatedly. Finally, it is possible that an inconsistency among data could occur, as data may not be updated in every place where they appear. (Elmasri and Navathe, 12)
Inconsistency among data

As I have mentioned before, a database can have multiple values of the same data, and these values may not agree with each other. This phenomenon is called inconsistency. It is clear that an inconsistent database has contradictions; therefore it will supply incorrect information to its users (Date 10).
Advantages of Further Normalization

Normalizing a relational database certainly gives many advantages, as I explained before. Now I want to clarify that these advantages accrue when adopting a higher level of normalization, meaning the use of the Second and Third Normal Forms (NF), rather than
[1] The term "denormalized" is not used by Date in this article, but it is used in at least one of his books.
merely the First NF. This explanation is necessary because a database is considered normalized when its tables are in First NF.
Explanatory example

A comparison will further prove the need for normalization. I will present an example of a relational database that is not normalized, point out the problems that this database presents, and then show the same situation in a normalized relational database.

Suppose we have information about students and courses, and we use an approach that is not normalized to store them:

Student_name     Address             Course          Mark  Year
Fini, Mercedes   343 6th Street      Mathematics I   A     1997
Fini, Mercedes   343 6th Street      Physics I       A-    1997
Smith, John      23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter    86432 Two Nods Rd.  Mathematics I   B     1997
Fini, Mercedes   343 6th Street      Mathematics II  B-    1998

Now, suppose that Mercedes Fini has changed her address. We must update her address in the database. So the user searches for Mercedes Fini's record and finds:
Student name: Fini, Mercedes
Address: 343 6th Street
Course: Mathematics I
Mark: A
Year: 1997

Then, he/she will change Mercedes Fini's address:

Student name: Fini, Mercedes
Address: 567 Main Street
Course: Mathematics I
Mark: A
Year: 1997

And now we have:

Student_name     Address             Course          Mark  Year
Fini, Mercedes   567 Main Street     Mathematics I   A     1997
Fini, Mercedes   343 6th Street      Physics I       A-    1997
Smith, John      23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter    86432 Two Nods Rd.  Mathematics I   B     1997
Fini, Mercedes   343 6th Street      Mathematics II  B-    1998
At this moment, I ask: What is Mercedes Fini's real address?

This database does not provide a good answer to my question. It will tell me that Mercedes Fini has two different addresses, or it will tell me that Mercedes Fini lives at 343 6th Street (which is not true anymore), or it will tell me that Mercedes Fini lives at 567 Main Street in a first query while subsequent queries could give me a different answer. The answer will depend on the way the user expresses the query. At this moment you may ask: Why can we not use an approach like the one below?
Student_name     Address             Course          Mark  Year
Fini, Mercedes   343 6th. Street     Mathematics I   A     1997
                                     Physics I       A-    1997
                                     Mathematics II  B-    1998
Smith, John      23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter    86432 Two Nods Rd.  Mathematics I   B     1997
In this approach, Mercedes Fini has one address, and the user will not have any problem updating it. However, the user will be in trouble if he/she tries to update, add or delete a course, because all of Mercedes Fini's courses are together in a single cell, and we cannot use them separately.

At this moment, the question "Why is it necessary to apply normalization?" is answered: it is necessary in order to avoid the update, delete, and insert anomalies that introduce inconsistency into the database.
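The update anomaly described above can be reproduced in a few lines. This is an illustrative sketch (table and column names are my own), using SQLite: only one of Mercedes Fini's duplicated rows is updated, and the database is left with two contradictory addresses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Unnormalized table: the address is repeated in every course row.
conn.execute("CREATE TABLE studies (student_name TEXT, address TEXT, course TEXT)")
conn.executemany("INSERT INTO studies VALUES (?, ?, ?)", [
    ("Fini, Mercedes", "343 6th Street", "Mathematics I"),
    ("Fini, Mercedes", "343 6th Street", "Physics I"),
    ("Fini, Mercedes", "343 6th Street", "Mathematics II"),
])

# Update anomaly: only one of the duplicated rows is changed.
conn.execute("""
    UPDATE studies SET address = '567 Main Street'
    WHERE student_name = 'Fini, Mercedes' AND course = 'Mathematics I'
""")

addresses = conn.execute("""
    SELECT DISTINCT address FROM studies WHERE student_name = 'Fini, Mercedes'
""").fetchall()
print(len(addresses))  # 2 -- the database now gives two contradictory answers
```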
At this point we have another problem: how to apply normalization? Before answering this question, I will introduce some database concepts, so the vocabulary used in the next pages can be easily understood.
Major New Concepts

Functional Dependence: given R.A and R.B, where A and B are attributes or sets of attributes of the relation R, the notation R.A → R.B means that B is functionally dependent on A, or that A functionally determines B in R. This functional dependence implies that, knowing the value of A, we can obtain the value of B (Kroenke, 112).

We can simplify the notation if R is understood: A → B.
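Functional dependence can be checked mechanically: A → B holds in a set of rows exactly when no two rows agree on A but differ on B. The following is a minimal sketch of such a check (the function name and the dict-based rows are my own illustration, not from the paper):

```python
def functionally_determines(rows, a, b):
    """Return True if attribute a -> attribute b holds in this set of rows.

    A determines B when no value of A maps to two different values of B.
    Each row is a dict; a and b are column names.
    """
    seen = {}
    for row in rows:
        if row[a] in seen and seen[row[a]] != row[b]:
            return False
        seen[row[a]] = row[b]
    return True

rows = [
    {"zip": "91711", "city": "Claremont"},
    {"zip": "91712", "city": "Claremont"},
    {"zip": "91888", "city": "Upland"},
]
print(functionally_determines(rows, "zip", "city"))  # True: zip -> city
print(functionally_determines(rows, "city", "zip"))  # False: two zips share one city
```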
Transitive Dependence: A → C is a transitive dependence if A → B and B → C. According to Codd, C is strictly transitively dependent on A.
Candidate key: a candidate key K is the attribute or set of attributes that identifies a unique row in a relation (Kroenke, 502). A candidate key is required to be unique and nonredundant. This last term is defined by Date as "no attribute can be discarded from K without destroying the uniqueness property." Date states that the terms "unique" and "irreducible" better suit the definition of a candidate key. Additionally, Date remarks on the importance of Codd's observations regarding the candidate key concept: (a) each attribute of relation R is functionally dependent on each candidate key of R, and (b) within any given candidate key, no proper subset of the attributes is functionally dependent on any other.

According to Date, these two statements imply that each attribute of R is functionally dependent on each superkey of R, a superkey being a superset of a candidate key (which implies that a superkey is obtained by discarding the nonredundancy requirement of a candidate key).
Primary key: the primary key is one of the candidate keys, arbitrarily chosen by the database designer. The operational difference between a primary key and a candidate key is that a tuple cannot have an undefined value for any of the attributes that compose the primary key, whereas any of the components of a candidate key can have an undefined value.
Insert/Update/Delete Anomaly: according to Date, none of these terms was formally defined, so he did not define them. For this reason I will take the definition from Kroenke (501): an anomaly is an undesired consequence of a data modification.

An insertion anomaly occurs when it is necessary to add facts about two or more different things to a single row of a relation. This means that in order to add a row to a table, values must be given for attributes that are not really needed for this particular event. For example, to incorporate a particular department into a company, data about an employee are required.

A deletion anomaly occurs when some particular information is lost when deleting a single row. This happens when, for example, it is required to delete an employee and, in the process, the database loses the information regarding his department (which is not desired).

An update anomaly occurs when there is more than one entry for a particular piece of data, and the value of one of these entries changes while the rest remain unaltered. For example, I keep the address of an employee twice, and I change its value only once.

According to Date, Codd stated that the anomalies are due to redundant information. Date attributes this undesired situation to a low level of normalization in the database (Date, March 1999).
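The deletion anomaly can be sketched with the employee/department example (the data and table names are hypothetical): deleting the only employee of a department also deletes the only record of the department's existence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One wide table mixing employee facts and department facts.
conn.execute("CREATE TABLE emp_dept (employee TEXT, department TEXT)")
conn.execute("INSERT INTO emp_dept VALUES ('Perez, Juan', 'Research')")

# Deleting the employee also loses the fact that 'Research' exists.
conn.execute("DELETE FROM emp_dept WHERE employee = 'Perez, Juan'")
depts = conn.execute("SELECT DISTINCT department FROM emp_dept").fetchall()
print(depts)  # [] -- the department information is gone
```

With a separate DEPARTMENT table, the department row would survive the employee's deletion.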
How to Apply Normalization in a Relational Database?

Normalization implies decomposition. That means transforming tables that are too large into smaller ones that are free of update anomalies. The normalization technique provides a complete theory for decomposing a table (Nijssen and Halpin 273). To explain how to apply this theory I will use the same student-courses example:

TABLE: STUDIES

Student_name     Address             Course          Mark  Year
Fini, Mercedes   343 6th. Street     Mathematics I   A     1997
                                     Physics I       A-    1997
                                     Mathematics II  B-    1998
Smith, John      23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter    86432 Two Nods Rd.  Mathematics I   B     1997
The Normalization Theory

The normalization theory involves six general rules, called Normal Forms. The following picture shows the relation among them:

[Figure: nested circles, from outermost to innermost: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF]
The circles imply that a relational database is in Fifth Normal Form (5NF) if it is in 4NF (and of course satisfies other conditions).
A relational database is in 4NF if it is in BCNF (Boyce-Codd).
A relational database is in BCNF if it is in 3NF.
A relational database is in 3NF if it is in 2NF.
A relational database is in 2NF if it is in 1NF.
For the purpose of this paper I will only explain the first three Normal Forms: 1NF, 2NF, 3NF.
First Normal Form: to satisfy the 1NF, a table or relation must have a primary key (unique
identifier). In addition, it cannot have null values for the primary key attributes, and each row-
column intersection must have single values (no repeating groups are allowed). Finally, all non-
key attributes must be functionally dependent on the primary key. The term functionally
dependent means that knowing the value of the attribute A, you can know the value of the
attribute B. Therefore the attribute B is functionally dependent on the attribute A, and the
attribute A determines the value of the attribute B.
Following this rule, the example above must be modified in order to avoid a "repeating group" (a single field used to store more than one value). In the example, Mercedes Fini has three course values: Mathematics I, Physics I and Mathematics II. The only way to store these three values is to have three different tuples or records with Mercedes Fini's courses:
TABLE: STUDIES

student_name     Address             Course          Mark  Year
Fini, Mercedes   343 6th Street      Mathematics I   A     1997
Fini, Mercedes   343 6th Street      Physics I       A-    1997
Fini, Mercedes   343 6th Street      Mathematics II  B-    1998
Smith, John      23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter    86432 Two Nods Rd.  Mathematics I   B     1997
Second Normal Form: in a 2NF table or relation, each non-key attribute (neither a key nor a part
of a composite key) must be functionally dependent on the whole of the key (not just a part of it).
In other words: all the attributes that are not a part of the primary key must depend on all the
attributes that compose the primary key. And, of course, the relation must be in 1NF.
Consider the table STUDIES: the primary key of the table is the set of attributes (student_name + course), because knowing the values of both of them we can determine the whole record. Notice that, by knowing just the student_name (e.g. Mercedes Fini), we cannot know that Mercedes Fini's grade in 1997, in the course Physics I, was A-. This is due to the fact that Mercedes Fini has more than one record. Also, by knowing a value of the attribute course (e.g. Mathematics I), we cannot predict the values of the other attributes of the record, because there are many students who studied Mathematics I.
We have selected the primary key for the table STUDIES. Now we have to apply the 2NF test: do both the student_name and the course determine the non-key attribute address? No: we can know the address of the student by knowing only the student_name. Therefore this example does not follow the 2NF rule.

To solve this problem we must divide the table STUDIES into two tables:
TABLE: STUDENT

student_name     Address
Fini, Mercedes   343 6th Street
Smith, John      23 Sunset Avenue
Roland, Peter    86432 Two Nods Rd.

TABLE: STUDIES

Student_name     Course          Mark  Year
Fini, Mercedes   Mathematics I   A     1997
Fini, Mercedes   Physics I       A-    1997
Fini, Mercedes   Mathematics II  B-    1998
Smith, John      Mathematics I   A-    1997
Roland, Peter    Mathematics I   B     1997

Now, Mercedes Fini has only one address because we have eliminated the redundancy. The user can update her address without any problem.
Third Normal Form: to be in 3NF, a relation must be in 2NF, and all non-key attributes must
depend on the primary key directly (no transitive dependencies are allowed).
Observing the next table, we can see that the mark of the student depends on the primary key (student_name + Course + year). The problem here is that the attribute concept depends on the attribute mark, which is not the primary key of the relation. This implies that this table is not in 3NF.
TABLE: STUDIES
Primary key: student_name + Course + year

student_name     Course          Mark  Year  Concept
Fini, Mercedes   Mathematics I   A     1997  Outstanding
Fini, Mercedes   Physics I       A-    1997  Excellent
Fini, Mercedes   Mathematics II  B-    1998  Good
Smith, John      Mathematics I   A-    1997  Excellent
Roland, Peter    Mathematics I   B     1997  Very Good
The following two tables show how the previous table was decomposed to satisfy the 3NF. We can see how the attribute concept was moved into another table, so we no longer have a transitive dependency.

TABLE: STUDIES
Primary key: student_name + course + year

Student_name     Course          Mark  Year
Fini, Mercedes   Mathematics I   A     1997
Fini, Mercedes   Physics I       A-    1997
Fini, Mercedes   Mathematics II  B-    1998
Smith, John      Mathematics I   A-    1997
Roland, Peter    Mathematics I   B     1997

TABLE: MARKS
Primary key: mark

Mark  Concept
A     Outstanding
A-    Excellent
B-    Good
B     Very Good
Now we have seen how to apply normalization. Before we move to the next section, I want to explain that it is strongly recommended to use a numeric (or alphanumeric) identifier for the primary key, rather than the name, the description, or other fields like them. This is because you can probably have more than one person/customer/student/etc. with the same name. Have you never seen a boy and his father sharing the same name?

Another problem you can have, when using a name or description for identifying the tuples in a given relation, is that sometimes a person can type them differently. When this happens, if someone wants to compute a statistic, the identifiers will look different to the processor, as it will not understand that different spellings can express the same thing. Think of the
example: "Engineering" or "Enginering" or "engineerng". For the processor, these are basically three different names.
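This misspelling problem can be demonstrated in a couple of lines (the department records are hypothetical): a processor grouping by name counts three departments where there is really one.

```python
from collections import Counter

# Hypothetical records typed by different users: one department, three spellings.
records = ["Engineering", "Enginering", "engineerng"]

counts = Counter(records)
print(len(counts))  # 3 -- the processor sees three different departments
```

A stable numeric identifier for the department would make all three records group together regardless of how the name was typed.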
For those who are not familiar with normalization theory but are familiar with data modeling techniques, the following section explains how to convert an Entity Relationship model into a relational model. After following this process, it is recommended to check each table, one by one, to see whether a mistake was made during the conversion and the database is not well normalized.
Mapping the ER Model into the Relational Model

To map the ER model to the relational model you can follow these guidelines, taken from Elmasri and Navathe (172-178):

Step 1: each entity type is mapped into a relation or table in the relational model. Include the simple attributes of the entity as columns (attributes) of the table, and select one or more attributes as the primary key (unique identifier).

Step 2: each weak entity (or dependent entity) is mapped into a relation. The primary key of this relation will be formed by a combination of the primary key of the owner entity (the strong entity) and the partial identifier of the weak entity.

Step 3: each binary 1:1 relationship is mapped by duplicating the primary key of one relation as a foreign key in the other relation. It can also be mapped by migrating both primary keys as foreign keys, but I do not recommend this approach, as it can lead to inconsistencies. Finally, it can be mapped by merging both entities and the relationship into one single entity (this approach can be
used when neither entity participates in any other relationship, and both entities are total). I still recommend the first approach.
Step 4: each strong binary relationship with cardinality 1:N is mapped by migrating the attributes of the primary key at the 1-side of the relationship as a foreign key into the table at the N-side of the relationship. It is also possible to build a table to represent the relationship. This second approach is particularly selected when:
- the relationship could become M:N in the future;
- many records would have a null value for this foreign key, so it would be almost always null.
Step 5: each binary M:N relationship is mapped as a new table whose primary key is composed of
the primary key of one entity's table and the primary key of the second entity's table.
Additionally, if the relationship has its own attributes, they are transformed into columns of the
new table.
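Step 5 can be sketched with the STUDENT, COURSES, and STUDIES tables that appear later in this paper. A minimal sketch via sqlite3, assuming a student takes a given course only once (so that (StudentID, CourseID) can serve as the composite key):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE STUDENT (StudentID INTEGER PRIMARY KEY, Student_Name TEXT)")
cur.execute("CREATE TABLE COURSES (CourseID INTEGER PRIMARY KEY, Course_Name TEXT)")

# Step 5: the M:N relationship becomes its own table. Its primary key is
# composed of both entities' keys, and the relationship's own attributes
# (Mark, Year) become columns.
cur.execute("""CREATE TABLE STUDIES (
    StudentID INTEGER REFERENCES STUDENT(StudentID),
    CourseID  INTEGER REFERENCES COURSES(CourseID),
    Mark TEXT,
    Year INTEGER,
    PRIMARY KEY (StudentID, CourseID))""")

cur.execute("INSERT INTO STUDIES VALUES (10, 101, 'A', 1997)")
try:
    # A second row with the same (StudentID, CourseID) pair violates
    # the composite primary key.
    cur.execute("INSERT INTO STUDIES VALUES (10, 101, 'B', 1998)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```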
Step 6: each multivalued attribute M is mapped into a new table whose primary key is composed of
the primary key of the entity or relationship that has the multivalued attribute M, plus an
additional attribute of M that makes this primary key unique. If M is a composite attribute, all its
simple components are included in this new table.
Step 7: each n-ary relationship is mapped into a new table. In this new table, all the attributes
that compose the primary key of each of the n entities (usually three entities) become columns.
In addition, each attribute belonging to the relationship itself is mapped as a new column of the
new table. To identify the primary key, we need to take into consideration the cardinality of
the relationship:
To determine the cardinality, ask yourself:
(Figure: the ternary relationship among SUPPLIER, PART, and PROJECT.)
- How many PARTS of the same (one) supplier are needed for one project? The answer depends on the business rules. For instance, MANY (0..M) parts are provided by supplier S1 for the project P1.
Then ask yourself:
- In how many PROJECTS does the same (one) supplier provide the same (one) part? Again, the answer depends on the business rules. For instance, the same part (P1) is provided by one supplier (S1) for MANY (1..M) projects.
Finally, you must ask about the third entity:
- How many SUPPLIERS provide the same (one) part for the same (one) project?
The answer also depends on the business rules. For instance, MANY (1..M) suppliers can
provide the same part (P1) for the same project (PJ1).
Following my example (which is similar to the one proposed by Elmasri and Navathe, and to
the ternary relationship described in many data modeling books), the cardinality of this
relationship is M:N:N. Notice that we have one maximum cardinality for each entity that
participates in the relationship. In this example we have three entities, and three maximum
cardinalities: M, N, N.
In a ternary relationship, when the cardinality is M:N:N, the primary key of the table that
represents the relationship in the relational model will be composed of the concatenation of the
primary keys of all the participating entities.
If the business rules establish that the supplier who provides a particular part (P1) for a
particular project (PJ1) is always the same (meaning that no other supplier can provide the
part P1 for the project PJ1), it is easy to know who the supplier is by knowing the part identifier
(Part#) and the project identifier (Proj#). In functional dependency terms, the part number and
the project number determine the supplier number.
(Part#, Proj#) => Supplier#
In this particular case, the primary key of the relationship table
SUPPLIERS_FOR_PROJECTS could be composed of the primary key of the PART table and the
primary key of the PROJECT table: (Part#, Proj#). However, I still recommend using all the
foreign keys as the composite primary key (instead of only two fields), as the business rules can
change, and it would then be very hard to modify all the existing applications.
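The recommended mapping for the M:N:N case can be sketched in DDL. Quantity is a hypothetical relationship attribute I add for illustration; the table and key follow the paper's Step 7:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The M:N:N ternary relationship becomes a table whose primary key
# concatenates the keys of all three participating entities.
# Quantity is a hypothetical relationship attribute.
cur.execute("""CREATE TABLE SUPPLIERS_FOR_PROJECTS (
    SupplierNo INTEGER,
    PartNo     INTEGER,
    ProjNo     INTEGER,
    Quantity   INTEGER,
    PRIMARY KEY (SupplierNo, PartNo, ProjNo))""")

# Under M:N:N, two different suppliers may provide the same part
# for the same project, so both rows below are accepted:
cur.execute("INSERT INTO SUPPLIERS_FOR_PROJECTS VALUES (1, 1, 1, 100)")
cur.execute("INSERT INTO SUPPLIERS_FOR_PROJECTS VALUES (2, 1, 1, 50)")
rows = cur.execute("SELECT COUNT(*) FROM SUPPLIERS_FOR_PROJECTS").fetchone()[0]
```

Had the key been only (PartNo, ProjNo), the second insert would have been rejected, which is exactly the dependency (Part#, Proj#) => Supplier# discussed above.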
Normalization Implications
We have seen that normalization is important because it helps us to avoid inconsistency
among data. However, it has other effects that are not desired. One of the problems of
normalizing a database is that you lose performance during queries. As Nijssen and
Halpin say, reducing the number of tables reduces query time; conversely, increasing the
number of tables during the normalization process will increase query execution time (254). In
other words, you spend more time during each query because you need to link (join) more tables.
Some people believe that a database must be denormalized in order to achieve better
performance during query processing. According to Elmasri and Navathe, denormalization
means that the attributes that are frequently required together in a query operation are stored
in the same table or relation. However, it is then necessary to ensure that each update transaction
modifies all the copies of a duplicated attribute in order to keep the database consistent (469).
In the non-normalized example presented before, we saw that Mercedes Fini has her address
stored three times in the database. This implies that every program that modifies the
address attribute must ensure that the value is updated as many times as it appears in the
database. As you can see, this denormalized approach is risky. For example, a new programmer
can forget to include in his/her program the controlling routine that ensures that all the copies
of the same data are updated simultaneously, thereby introducing the undesirable inconsistency
into the database.
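The update anomaly described above can be shown with plain Python data. The rows below use the values from the paper's own example (Mercedes Fini, three courses, one repeated address); the new address is a hypothetical value:

```python
# Denormalized rows for Mercedes Fini: the address is repeated once per course.
rows = [
    {"student": "Fini, Mercedes", "address": "343 6th Street", "course": 101},
    {"student": "Fini, Mercedes", "address": "343 6th Street", "course": 104},
    {"student": "Fini, Mercedes", "address": "343 6th Street", "course": 102},
]

# A careless program updates only the first matching row...
rows[0]["address"] = "99 Hypothetical Ave."
# ...leaving the same student with two different addresses:
inconsistent = len({r["address"] for r in rows}) > 1

# The controlling routine must instead touch every copy:
for r in rows:
    if r["student"] == "Fini, Mercedes":
        r["address"] = "99 Hypothetical Ave."
consistent_again = len({r["address"] for r in rows}) == 1
```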
The second problem is clearly stated by Yao, when he says:
"The fragmentation of data into normalized relations unfortunately makes it necessary to
write tedious and lengthy queries to obtain useful information." (107)
At this point I cannot disagree with this statement. It is true that the query will be longer in a
normalized approach. This is due to the fact that the query language requires the programmer to
clearly state which tables he/she will use, and which attribute or group of attributes links
them.
For example, in a query language known as SQL (Structured Query Language), you would use
the following instruction to obtain the data from only one table:

SELECT student_name, address, course
FROM STUDIES
WHERE student_name = 'Fini, Mercedes';
This query will give me a report with Mercedes Fini's address and courses.
In a normalized approach, the same report is obtained through the following instruction:

SELECT student_name, address, course
FROM STUDENT, STUDIES
WHERE STUDENT.student_name = 'Fini, Mercedes'
AND STUDENT.student_name = STUDIES.student_name;
The last clause shows the relation between the two tables (the primary-key attribute or
attributes that link them). You will have one of these clauses for each primary-key attribute
that links each pair of tables containing the attributes you want to list. As a consequence, the
code of the query is longer. However, it takes only a few more minutes to write these
additional clauses.
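The two queries above can be run side by side to confirm they return the same report. A sketch via sqlite3, with a hypothetical STUDIES_FLAT table standing in for the denormalized design:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Denormalized design: the address travels with every STUDIES row.
cur.execute("CREATE TABLE STUDIES_FLAT (student_name TEXT, address TEXT, course INTEGER)")
cur.executemany("INSERT INTO STUDIES_FLAT VALUES (?, ?, ?)",
                [("Fini, Mercedes", "343 6th Street", 101),
                 ("Fini, Mercedes", "343 6th Street", 104),
                 ("Fini, Mercedes", "343 6th Street", 102)])

# Normalized design: the address is stored once, in STUDENT.
cur.execute("CREATE TABLE STUDENT (student_name TEXT, address TEXT)")
cur.execute("CREATE TABLE STUDIES (student_name TEXT, course INTEGER)")
cur.execute("INSERT INTO STUDENT VALUES ('Fini, Mercedes', '343 6th Street')")
cur.executemany("INSERT INTO STUDIES VALUES (?, ?)",
                [("Fini, Mercedes", 101),
                 ("Fini, Mercedes", 104),
                 ("Fini, Mercedes", 102)])

# The short one-table query versus the longer two-table (join) query:
flat = cur.execute("""SELECT student_name, address, course
                      FROM STUDIES_FLAT
                      WHERE student_name = 'Fini, Mercedes'""").fetchall()
joined = cur.execute("""SELECT S.student_name, S.address, E.course
                        FROM STUDENT S, STUDIES E
                        WHERE S.student_name = 'Fini, Mercedes'
                          AND S.student_name = E.student_name""").fetchall()
```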
In summary, I strongly recommend normalizing your relational database, even if the
query code becomes longer or the query performance is slower.
The following section explains the Query Optimization process that is provided by almost
every RDBMS.
Query Optimization
Did you ever try to run a query joining two tables with a large number of fields and a large
number of records? If you did, you will probably agree that the performance of the query
(response time) is not very convenient. Did you try the same query using an RDBMS that
provides a Query Optimization capability? If so, you must have noticed that the response time
improved considerably. I will try to explain the way the Query Optimization capability
works through an example. Consider the following tables:
TABLE: STUDENT
StudentID  Student_Name     Address             ZipCode
10         Fini, Mercedes   343 6th Street      91711
20         Smith, John      23 Sunset Avenue    91711
30         Roland, Peter    86432 Two Nods Rd.  91716
40         Funson, Larry    345 8th Street      91711
50         Neuman, Paul     1010 Foothill Ave.  91711
60         Gibson, Mel      863 Two Birds Rd.   91716
70         Tapia, Roland    343 6th Street      91711
80         Banderas, Peter  273 La Luna St.     91714
90         Sony, Benson     842 Main Boulevard  91716

TABLE: STUDIES
StudentID  CourseID  Mark  Year
10         101       A     1997
10         104       A-    1997
10         102       B-    1998
20         103       A-    1997
30         101       B     1997
40         103       A     1997
40         104       A-    1997
50         105       B-    1998
60         101       A-    1998
60         105       C     1999
70         104       A+    1999
70         106       A     1999
80         104       B     1999
90         103       A+    1999
90         105       B     1999

TABLE: COURSES
CourseID  Course_Name
101       Mathematics I
102       Mathematics II
103       Mathematics III
104       Physics I
105       Physics II
106       Physics III
Suppose we want to perform the following query:
SELECT Student_Name
FROM STUDENT, STUDIES
WHERE STUDENT.StudentID = STUDIES.StudentID
AND STUDIES.CourseID = 101;
The corresponding Algebraic expression for this query is:
(( STUDIES JOIN STUDENT ) WHERE CourseID=101 ) [Student_Name]
In this example we have 9 students, 15 courses taken by the students (studies), and only 3
students that have taken the course 101. Now we will see how this query is performed without
any optimization at all:
First: JOIN relations STUDIES and STUDENT (over StudentID).
During this step, the 15 tuples from the STUDIES table are read. Then each of the 9 student
tuples (records) is read 15 times, once per STUDIES tuple. This operation involves reading 135
tuples. The intermediate result consists of 15 joined tuples (that is, tuples that contain the fields
of both tables and satisfy the condition STUDENT.StudentID = STUDIES.StudentID).
Second: RESTRICT the result of the first step to just the tuples that satisfy the condition (tuples
where CourseID = 101).
During this step, the 15 joined tuples are read in order to find the tuples that have
CourseID = 101. The result is a relation that contains only 3 tuples.
Third: PROJECT the result of the second step over Student_Name.
During this step, the 3 tuples of the obtained relation are read in order to select the
Student_Name of each. The output is a new relation containing only the Student_Name attribute,
with duplicate tuples eliminated (in this example there are no duplicates, so the new relation
contains 3 tuples).
Now, if we count the processed tuples, we find that this query, performed in this way, involves
189 (15 + 135 + 15 + 15 + 3 + 3 + 3) processed tuples.
Resolving this query by following the optimization process would involve the following
steps:
First: RESTRICT relation STUDIES to just the tuples for course 101.
During this step, the 15 tuples of the STUDIES relation are read in order to produce the 3-tuple
intermediate relation that satisfies the condition CourseID = 101.
Second: JOIN the result obtained during the first step to the STUDENT relation (over
StudentID).
During this step, the 9 tuples from the STUDENT relation are read. Then each of the 9
student tuples (records) is read 3 times, once per course-101 tuple. This operation involves
reading 27 tuples. The intermediate result consists of 3 joined tuples (that is, tuples that contain
the fields of both tables and satisfy the condition STUDENT.StudentID = STUDIES.StudentID).
Third: PROJECT the result obtained during the second step over Student_Name.
During this step, the 3 tuples of the obtained relation are read in order to select the
Student_Name of each. The output is a new relation containing only the Student_Name attribute,
with duplicate tuples eliminated (in this example there are no duplicates, so the new relation
contains 3 tuples).
Now, if we count the processed tuples, we find that this query, performed in this second way,
involves 63 (15 + 3 + 9 + 27 + 3 + 3 + 3) processed tuples.
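The two totals can be reproduced with the tuple-accounting model implied by the text (each step counts the tuples it reads plus the tuples it materializes). The breakdown below is my reading of the step descriptions above:

```python
# Unoptimized plan: JOIN, then RESTRICT, then PROJECT.
unoptimized = (15        # scan the 15 STUDIES tuples (outer relation)
               + 15 * 9  # scan the 9 STUDENT tuples once per STUDIES tuple
               + 15      # materialize the 15 joined tuples
               + 15      # re-read them for the restriction
               + 3       # materialize the 3 restricted tuples
               + 3       # re-read them for the projection
               + 3)      # the 3 result tuples

# Optimized plan: RESTRICT first, then JOIN, then PROJECT.
optimized = (15          # scan STUDIES for the restriction
             + 3         # materialize the 3 course-101 tuples
             + 9         # scan the 9 STUDENT tuples
             + 3 * 9     # re-read each student once per restricted tuple
             + 3         # materialize the 3 joined tuples
             + 3         # re-read them for the projection
             + 3)        # the 3 result tuples
```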
In this example, the tables are small enough to be kept in main memory (only the 135-tuple
intermediate result would probably be kept on disk, depending on the hardware capabilities).
Still, it is easy to see that the second approach is better, as the total number of tuples processed
is smaller. In addition, we can improve the performance even more by indexing the STUDIES
relation on CourseID, so that the total number of tuples read from STUDIES will be only 3
(in this example). Similarly, an index or hash on StudentID can be used in the STUDENT
relation to improve the second step, by reading only the three needed tuples instead of 9.
In real life this optimization is very important, as we will surely have a larger number of
records. For instance, suppose our STUDENT table contains 9,000 records, our STUDIES table
has 34,000 records, and only 800 records match the condition "students who have taken course
101". Now, the maximum number of processed tuples is 306,104,400 for a non-optimized query,
and 7,246,200 for a query that follows the optimization process. This is because processing this
query without optimization involves the following steps:
First, read the 34,000 STUDIES records; then read the 9,000 STUDENT records 34,000 times
(that is, 306,000,000 tuples in total), and construct the intermediate 34,000-joined-tuple table.
Due to the large number of records, we cannot keep all these tuples in main memory, so we need
to write them out to disk and then read them back from disk (which takes longer than keeping
them in memory). Once we have generated the intermediate 34,000-joined-tuple table, we read it
back into memory and produce an 800-tuple table (by performing the restriction operation). We
assume this intermediate relation cannot be kept in memory either, given its size, so it also needs
to be saved out to disk. Finally, we need to read these 800 tuples again in order to perform the
projection, resulting in a new 800-tuple relation. In total we have: 34,000 + 306,000,000 +
34,000 + 34,000 + 800 + 800 + 800 = 306,104,400 processed tuples (this is the maximum
number, when no duplicate tuples are found).
Following the optimization process we have:
First, read the 34,000 tuples of the STUDIES relation and produce the 800-tuple intermediate
relation. Then read the 9,000 tuples from the STUDENT relation, and read each of the 9,000
student tuples 800 times, once per restricted tuple (7,200,000 tuples processed in total),
producing a new 800-joined-tuple relation. Take these 800 tuples and perform the projection in
order to obtain the desired relation (which has at most 800 tuples, if no duplicates are found). In
total we have: 34,000 + 800 + 9,000 + 7,200,000 + 800 + 800 + 800 = 7,246,200 processed
tuples. You can easily see how the optimization mechanism has considerably improved this
query's performance. When the tables involved in the query have a larger number of records and
attributes, the advantage of using the query optimization mechanism is even more important
(and easily noticed). Sometimes this performance can be improved still further, as the smaller
number of tuples read makes it possible for the intermediate results to be kept in main memory
instead of on disk (so no I/O accesses to disk storage are needed). Additionally, the use of index
and hash mechanisms allows the performance to be improved even more, as the number of
tuples read can be reduced still further.
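The scaled totals follow from the same accounting used in the small example, applied to 9,000 students, 34,000 enrollments, and 800 matching records:

```python
# Unoptimized: join first (9,000 students scanned once per STUDIES record).
unoptimized = 34_000 + 9_000 * 34_000 + 34_000 + 34_000 + 800 + 800 + 800

# Optimized: restrict first, then join against only the 800 matching records.
optimized = 34_000 + 800 + 9_000 + 9_000 * 800 + 800 + 800 + 800
```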
The Optimization Process
The optimization process can be described as a mechanism by which a query is broken
down into smaller queries that involve a lower number of processed tuples and a lower number
of processed fields. In other words, the optimization process tries to avoid the Cartesian product
operation (which involves a considerably large number of tuples and fields to be processed), and
tries to reduce to a minimum the total number of tuples to be scanned at each stage (by applying
a restriction operation first). This mechanism considerably improves the performance of query
resolution.
Sometimes the query optimization process is called the query decomposition process.
During this process, a query that involves multiple tuple variables is transformed into
smaller queries that involve one or two variables each, by using the detachment and tuple
substitution mechanisms. During the detachment process, the query is decomposed into smaller
components that have one variable in common with the rest of the query and that cannot be
decomposed any further (this is often called reduction). These components are separated from the
query and then substituted by a new subquery in which a chosen variable is assigned a set of
possible computed values through the tuple substitution mechanism (generally implemented as
nested loops). It is important to notice that detachment generates two queries from one; for this
reason it is applied before the tuple substitution mechanism. This second mechanism is applied
once it is no longer possible to apply detachment. These two mechanisms are applied repeatedly
in order to decompose the query into irreducible subqueries (components), gaining the flexibility
needed to improve the query performance. The idea is to apply first the operations that reduce
the absolute sizes of the intermediate results (considering both their attribute sizes and their
number). As a rule, the one-variable queries are resolved first, and then the two-variable
queries. The following example explains this process:
Consider the following tables:
TABLE: STUDENT
StudentID  Student_Name     Address             ZipCode
10         Fini, Mercedes   343 6th Street      91711
20         Smith, John      23 Sunset Avenue    91711
30         Roland, Peter    86432 Two Nods Rd.  91716
40         Funson, Larry    345 8th Street      91711
50         Neuman, Paul     1010 Foothill Ave.  91711
60         Gibson, Mel      863 Two Birds Rd.   91716
70         Tapia, Roland    343 6th Street      91711
80         Banderas, Peter  273 La Luna St.     91714
90         Sony, Benson     842 Main Boulevard  91716

TABLE: STUDIES
StudentID  CourseID  Mark  Year
10         101       A     1997
10         104       A-    1997
10         102       B-    1998
20         103       A-    1997
30         101       B     1997
40         103       A     1997
40         104       A-    1997
50         105       B-    1998
60         101       A-    1998
60         105       C     1999
70         104       A+    1999
70         106       A     1999
80         104       B     1999
90         103       A+    1999
90         105       B     1999

TABLE: COURSES
CourseID  Course_Name
101       Mathematics I
102       Mathematics II
103       Mathematics III
104       Physics I
105       Physics II
106       Physics III
TABLE: CITY
ZipCode  City_Name
91711    Claremont
91712    Rancho Cucamonga
91713    Upland
91714    Montclair
91715    San Dimas
91716    Pomona

Suppose we want to perform the following query Q0: get the names of the students that live in
Pomona and have taken Mathematics I during 1997.
SELECT Student_Name
FROM STUDENT S, STUDIES E, COURSES C, CITY L
WHERE L.City_Name = 'Pomona'
AND L.ZipCode = S.ZipCode
AND S.StudentID = E.StudentID
AND E.Year = 1997
AND E.CourseID = C.CourseID
AND C.Course_Name = 'Mathematics I';
The tuple variables in query Q0 are S, E, C, and L, where S ranges over the relation
STUDENT, E ranges over the relation STUDIES, C ranges over COURSES, and L ranges over
the relation CITY.
The first step involves the detachment of the one-variable queries on L, C, and E, as
follows:
D1: L = Select ZipCode from CITY where City_Name = 'Pomona';
D2: C = Select CourseID from COURSES where Course_Name = 'Mathematics I';
D3: E = Select CourseID, StudentID from STUDIES where Year=1997;
Now the remaining query will be:
Q2: SELECT Student_Name
FROM STUDENT S, L, E, C
WHERE S.ZipCode=L.ZipCode
AND S.StudentID=E.StudentID
AND E.CourseID=C.CourseID;
Now we process the two-variable detachments:
D4: S = Select StudentID, Student_Name from S where S.ZipCode = L.ZipCode;
D5: E = Select StudentID from E where E.CourseID = C.CourseID;
And the remaining query will be:
Q3: Select Student_Name from S where S.StudentID = E.StudentID;
Another way to do it is to process only:
D4: E = Select StudentID from E where E.CourseID = C.CourseID;
And the remaining query will be:
Q3: Select Student_Name from S where S.ZipCode = L.ZipCode
And S.StudentID = E.StudentID;
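The second decomposition variant can be traced by hand over an abbreviated subset of the tables above (only the rows that matter for Q0 are kept; values are taken from the paper's tables):

```python
# Abbreviated subsets of the example tables.
STUDENT = [(10, "Fini, Mercedes", 91711),
           (30, "Roland, Peter", 91716),
           (60, "Gibson, Mel", 91716),
           (90, "Sony, Benson", 91716)]
STUDIES = [(10, 101, 1997), (30, 101, 1997), (60, 101, 1998), (90, 103, 1999)]
COURSES = [(101, "Mathematics I"), (103, "Mathematics III")]
CITY = [(91711, "Claremont"), (91716, "Pomona")]

# D1, D2, D3: the one-variable detachments.
L = {zc for zc, city in CITY if city == "Pomona"}                  # zip codes
C = {cid for cid, name in COURSES if name == "Mathematics I"}      # course ids
E = [(sid, cid) for sid, cid, year in STUDIES if year == 1997]

# D4 (second variant): the two-variable detachment of E against C.
E2 = {sid for sid, cid in E if cid in C}

# Q3: the remaining query over S, L, and E (nested-loop tuple substitution).
result = sorted(name for sid, name, zc in STUDENT if zc in L and sid in E2)
```

With this data, only student 30 lives in Pomona and took course 101 in 1997, so the overall result is ["Roland, Peter"].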
The following picture shows the Decomposition Tree for query Q0:
(Figure: decomposition tree. The one-variable detachments D1, D2, and D3 restrict L, C, and E;
the two-variable detachments D4 and D5 combine them with S and E; the root Q3 produces the
overall result.)
As a general rule, first break down any RESTRICTION operation with more than one
condition into a cascade of RESTRICTION operations, to obtain the flexibility needed for
moving the RESTRICTION operations down the tree. Then reorder the tree nodes so that the
most restrictive RESTRICTIONS are processed first (those that produce the relation with the
smallest number of tuples). Next, break down the PROJECTION attributes and push them as far
down the tree as possible by creating new PROJECTION operations; this will reduce the overall
number of processed attributes. Finally, identify the groups of operations that can be processed
together.
Conclusion
A relational database needs to be consistent in order to make its data reliable and useful.
Consistency can be pursued through two different approaches: (1) applying normalization, or (2)
not applying normalization in the relational database. In the first approach, the database will
always be consistent. If we take the second position, denormalization, we need to include control
routines in every program that updates the database. This means that the second approach implies
that the developer of the database is taking a risk. As I have mentioned, a new programmer can
forget to add the controlling routine in a program, allowing the same data to be updated in one
tuple but not in others. This means that the second approach DOES NOT ALWAYS ensure
consistency among data. Therefore, it is recommended to select the first approach (applying
normalization), since this approach guarantees data consistency. If you still have some doubts
about query performance, remember that it can be improved through the Query Optimization
mechanisms. This process not only improves the query response time; it also allows the
subqueries to be processed simultaneously in a multiprocessor environment.
Appendixes
In this section of the paper I will give an overview of the origins of the Relational
Model. I did not add this material to the paper body, because I am trying to focus only on the
important issues of the Relational Model: basic concepts, normalization, and query optimization.
The first thing you need to know about the Relational Model is that the most prestigious
researchers in this field of study are E.F. Codd and C.J. Date (that is the reason I mention
them so many times).
The Origins of the Relational Model
According to Date, database management has evolved into a science thanks to Codd's
work, which has provided a theoretical framework (the relational model) within which a variety
of important problems could be attacked in a scientific manner. He also recognizes that the main
importance of Codd's work is that he introduced not only the Relational Model, but also the
concept of a "data model", needed to deal with the problems associated with the storage of
large amounts of data.
E.F. Codd was the first researcher to introduce predicate logic into the database
management field, enlarging his contribution with the introduction of the relational algebra and
the relational calculus into the relational data model. Furthermore, Codd developed the first
relational data language (ALPHA), which allowed the manipulation of data independently of its
physical representation (this independence is nowadays called "physical data independence",
mentioned as one of the advantages that the relational model has over the network model). In
addition, Codd's papers introduced the term "functional dependence" and the first three normal
forms of the Normalization Theory.
Codd's second relational paper, "A Relational Model of Data for Large Shared Data
Banks", published in 1970 (Communications of the ACM 13, No. 6), is probably the most famous
paper in the history of database management (Date, Dec. 1998).
Codd's first paper focused on data redundancy and related issues. The second paper
reads like a revision of the first one, but it stresses the importance of providing physical data
independence within a data model. In addition, this second paper includes some new concepts.
For instance, it introduces the idea that data must always be normalized. As a result, this paper
has had a great repercussion in the database field of study.
Codd's third paper, in 1972, provides a formal definition for both the relational algebra and
the relational calculus. It then presents an algorithm for mapping a calculus expression into an
algebra expression. Through this, Codd proved that the algebra is as powerful as the calculus.
The algebra includes both objects and operators that follow particular principles. The
relational algebra has relations as its objects, and has operators such as restriction, projection,
union, difference, join, etc. These operators satisfy certain laws, such as associativity,
commutativity, and closure. Closure is very important for the relational algebra: it implies
that the result of an operation between two relations is always another relation (Date, Jan. 1999).
Glossary
Alternate key: this term was apparently introduced by Date. An alternate key is any of the
candidate keys that was not chosen as the primary key of the relation (Date, March 1999).
Cartesian product: the relational Cartesian product produces a set of tuples, instead of simple
pairs. Each tuple in the result of a Cartesian product expression contains all the components of a
tuple of the first relation together with all the components of a tuple of the second relation. This
algebraic operator is commutative and associative (Date, Jan. 1999).
Composition: the composition of the relation A{X,Y} with the relation B{Y,Z} is the projection
on X and Z of a join of A with B, always considering that X,Y and Y,Z are attributes of the
corresponding relations A and B (Date, Nov. 1998).
Consistency: at this point, Date states that Codd's definition of consistency does not capture all
aspects of integrity. My interpretation of Codd's definition of consistency is as follows: a set of
relations is consistent when the redundancies match; by this I mean that the redundant data do
not have different values. According to Codd, to ensure consistency the system should be
informed about the redundancies in order to control them (Date, Nov. 1998).
Derivability: a relation is said to be derivable if and only if it is expressible (Date, Nov. 1998).
Difference: the difference of two relations is a third relation formed by the tuples of the first
relation that do not appear in the second relation (Kroenke).
Division: the division is the inverse of the Cartesian product, as you can see in the following
formula quoted from Date's article: ( R TIMES S ) DIVIDEBY S = R (Date, Jan. 1999).
Factoring: this operation is also called nesting. The factoring operator transforms a normalized
relation into an unnormalized relation (Date, Jan.1999)
First Normal Form (1NF): the definition of 1NF is not provided in the article; however, I
will add it in my summary. According to Kroenke, a table is said to be in 1NF when its cells have
single values (neither repeating groups nor arrays are allowed as values) (118).
Full Dependence: the functional dependence A -> B is a full dependence when B is not
functionally dependent on any proper subset of A (Date, March 1999).
Intersection: the intersection of two relations is a third relation formed by the tuples contained in
both relations (Kroenke).
Join: the join of a relation A{X,Y} with a relation B{Y,Z} is a relation C{X,Y,Z} where
C{X,Y}=A and C{Y,Z}=B. The natural join is a special case of the join operation; it is the
only join possible when the attribute X is functionally dependent on the attribute Y (relation A)
or the attribute Z is functionally dependent on the attribute Y (relation B) (Date, Nov. 1998). In
other words, the join of two relations is a third relation formed by applying the product operator
and then selecting only the tuples that follow a specified criterion (Kroenke). In the natural join
operation, redundant attributes must be eliminated.
Key: the unique identifier of a tuple in the relation. The key is said to be nonredundant
when each attribute that constitutes the key is necessary for unique identification. This
definition implies that if you remove an attribute from the key, you will no longer have a unique
identifier (Date, Oct. 1998).
Normalization: Date defines the normalization process as "a process of taking projections", and
further states that we can recover the original relation by taking the natural join of those
projections, which implies that the process is reversible: the decomposition is nonloss (March
1999).
Permutation: this operation allows the reordering of the attributes of a given relation. One year
after this first paper, Codd stated that the order of the attributes of a given relation is irrelevant,
so this operation would remain for the internal use of the language (Date, Nov. 1998).
Projection: it implies that the points in n-dimensional space, corresponding to a relation of degree
n, can be projected over m of its attributes, onto the m axes (having always that m < n).
Second Normal Form (2NF): Codd's definition of 2NF is that a relation must be in 1NF and
every nonprime attribute must be fully dependent on each candidate key.
A prime attribute is an attribute that composes at least one candidate key. A nonprime attribute is
an attribute that is not prime (Date, March 1999).
Third Normal Form (3NF): Codd's definition of 3NF is that a relation R is in 3NF if it is in 2NF
and every nonprime attribute is nontransitively dependent on each candidate key of R (Date,
March 1999).
Tie: it implies that given a relation with n attributes (X1 to Xn), the result of this operation is the
restriction of the relation to just the particular rows in which Xn=X1 (Date, Nov.1998).
Trivial Dependence: A -> B is a trivial functional dependence if B is a subset of A.
Union: the union of two relations is a third relation formed by adding the tuples of both relations
and eliminating duplicate tuples (Kroenke).
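The set-theoretic operators defined in this glossary (union, intersection, difference) can be sketched directly with Python sets of tuples, assuming two union-compatible relations A and B with made-up values:

```python
# Relations as Python sets of tuples.
A = {(1, "x"), (2, "y"), (3, "z")}
B = {(2, "y"), (4, "w")}

union = A | B          # tuples of both relations, duplicates eliminated
intersection = A & B   # tuples contained in both relations
difference = A - B     # tuples of A that do not appear in B
```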
References
Atre, S., Data Base: Structured Techniques for Design, Performance, and Management, John
Wiley & Sons, Inc., 1980.
Date C.J., An Introduction to Database Systems. Reading, Massachusetts: Addison-Wesley
Publishing Company, 1982.
Davis B., Data Base Management Systems: User Experience in the U.S.A., 1975, NCC
Publications.
Dutka, Alan F. and Hanson, Howard H., Fundamentals of Data Normalization. Reading,
Massachusetts: Addison-Wesley Publishing Company, 1989.
Elmasri, R. and Navathe, S.B., Fundamentals of Database Systems. Redwood City, California:
The Benjamin/Cummings Publishing Company, Inc., 1994.
Kroenke, D.M., Database Processing: Fundamentals, Design, and Implementation (6th edition).
Upper Saddle River, N.J.: Prentice Hall, 1998.
Haderle, Donald J., "Database Role in Information Systems: The Evolution of Database
Technology and its Impact on Enterprise Information Systems", in Goos, G. and
Hartmanis, J. (eds.), Lecture Notes in Computer Science, Springer-Verlag, 1990.
Hawryszkiewycz I.T., Database Analysis and Design, Macmillan Publishing Company, 1991.
Nijssen G.M. and Halpin T.A., Conceptual Schema and Relational Database Design: A Fact
Oriented Approach. New York: Prentice Hall, 1989.
Turk Thomas A., Planning And Designing The Data Base Environment, Van Nostrand
Reinhold Company Inc., 1985.
Yao, S. Bing, Principles of Database Design, Volume I: Logical Organizations, New Jersey:
Prentice Hall, 1985.
Articles:
Date, C.J., "The Birth of the Relational Model (Part 1 of 3)", Intelligent Enterprise, October
1998. In this article Date explains Codd's first relational paper: "Derivability, Redundancy, and
Consistency of Relations Stored in Large Data Banks", IBM Research Report RJ599, 1969.
Date, C.J., "The Birth of the Relational Model (Part 2 of 3)", Intelligent Enterprise, November
1998. In this article Date explains Codd's first relational paper: "Derivability, Redundancy, and
Consistency of Relations Stored in Large Data Banks", IBM Research Report RJ599, 1969.
Date, C.J., "The Birth of the Relational Model (Part 3 of 3)", Intelligent Enterprise, December
1998. This article talks about Codd's papers regarding the Relational Model. In particular, it
refers to the following paper: Codd, E.F., "A Relational Model of Data for Large Shared Data
Banks", Communications of the ACM 13, No. 6, 1970.
Date, C.J., "Thirty Years of Relational: Codd's Relational Algebra", Intelligent Enterprise,
January 1999. In this article, Date analyzes Codd's third famous paper: "Relational Completeness
of Data Base Sublanguages", published in 1972 (IBM Research Report RJ987).
Date, C.J., "Thirty Years of Relational: The First Three Normal Forms", Intelligent Enterprise,
March 1999. In this article, Date analyzes three of Codd's relational papers: "The Second and
Third Normal Forms for the Relational Model", IBM technical memo (October 6th, 1970);
"Further Normalization of the Data Base Relational Model", IBM Research Report RJ909
(August 31st, 1971); and "Normalized Data Base Structure: A Brief Tutorial", ACM SIGFIDET
Workshop on Data Description, Access, and Control, San Diego (November 11th-12th, 1971).