
UNIVERSIDAD DE BELGRANO

The Relational Model
(The Relational Database Model)

Author: Prof. PAULA M. ANGELERI

Course: Diseño de Sistemas (Systems Design)

- 2001 -


    Executive Summary

This paper is an introduction to relational database theory. It explains the main relational concepts in an easy-to-follow manner and is structured in five main sections. The first explains the title of the paper itself and defines each of its terms. The second is an Introduction that explains the benefits of using a database approach. The third explains, through an explanatory example, why a relational database needs to be normalized. The next section explains how to normalize a relational database and the rules involved in the normalization process; again, the theory is accompanied by an explanatory example. After these examples, another mechanism that can help readers understand the way the database is structured is provided: a set of guidelines for mapping the Entity Relationship data model into the relational model. This guideline is intended for readers who are familiar with data modeling but not yet familiar with normalization theory; the idea is to take advantage of their data modeling knowledge and give them a mechanism they can use to check whether the normalization process was applied correctly. The last section describes the main problem associated with normalization (slow query performance) and then presents a mechanism to solve it. After the body of the paper, an Appendixes section explains the origins of the relational model, and a Glossary describes the relational terms not explained in the paper.

  • UNIVERSIDAD DE BELGRANO

    3

    Table of Contents

Do Relational Databases Need Normalization?
    What Is a Database?
    What Does Relational Mean?
    The Relational Database Model
    What Is Normalization?
Introduction
Why Should We Apply Normalization?
    Data Redundancy
    Inconsistency Among Data
    Advantages of Further Normalization
    Explanatory Example
    Major New Concepts
How to Apply Normalization in a Relational Database?
    The Normalization Theory
    Mapping the ER Model into the Relational Model
Normalization Implications
Query Optimization
    The Optimization Process
Conclusion
Appendixes
    The Origins of the Relational Model
    Glossary
References


Do Relational Databases Need Normalization?

Whether to apply normalization in relational databases is a controversial question in the database field of study. Some people believe that relational databases must be normalized because normalization ensures that the data stored in the database are accurate. This paper will provide enough evidence to show that a relational database needs normalization in order to maintain consistent data. However, there is still an important problem that normalization theory does not resolve: poor query performance.

Those who do not believe in normalization support their position by pointing out two disadvantages. The first is that a normalized database needs more time to answer a query. The second is that the programmer must write longer code in the query programs.

This paper will discuss both positions in detail. In addition, the terms database, relational, and normalization will be defined for those who are not familiar with them.

    What Is a Database?

A database is a collection of computerized data. More precisely, a database is a collection of logically related files. These files contain data that are stored permanently so that they can be used whenever necessary. Access to these data is provided by a database system. A database system is a computerized system composed of data, hardware, software, and users (Date 4). The main goal of a database system is to provide the company centralized control of its data (9).


    As an example of a database, we can consider the student database of a university. It has a

    file that contains the personal data of the students (name, address, telephone, etc.). There is

    another file that contains the information about the courses that each student is taking, and there

    is a file with the information about the courses that the students have taken in the past.

What Does Relational Mean?

In this context, the term relational implies that two or more files have logical links: one file has a reference to another file. For instance, one file contains the student name, address, and zip code, and another file contains the zip code and the name of the city. In the first file we have a logical link to the second file: the zip code, because it can be used as a connector between both files.
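To make this idea concrete, here is a minimal SQL sketch (the table and column names are my own illustration, not part of the original example) showing how the zip code acts as the connector between the two files:

-- One file with students, one file with cities; zip_code is the logical link.
SELECT s.name, s.address, c.city_name
FROM student s, city c
WHERE s.zip_code = c.zip_code;

The join condition s.zip_code = c.zip_code is exactly the logical link described above: neither file points physically at the other; they are connected only through the shared value.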

The Relational Database Model

The relational model was developed by Codd in the early 1970s. The power and completeness of this model derive from the richness of the relational calculus and the algebraic operators on which the model is based. According to Haderle, D.J., "Many users say: 'It's intuitive.' The simplicity is derived from Codd's development of the model on a sound theoretical base."

The strengths of the relational model rest on the fact that data records are not physically dependent on one another. They are linked by meaning: records reference other records through their data values. The relational model has logical bonds between data.

According to Atre, one of the major advantages of the relational approach is its simplicity. [...] The end users do not have to be concerned about the physical storage structure;


they can be oriented toward the information content of their data and need not worry about the physical representation details. The ability to use the database without knowing the representation details is called data independence. (Atre, 93)

In the relational model, a data structure is represented as a table or relation, which consists of a set of tuples (often called "rows" or "records") intersected by a set of columns (often called "attributes"). Summarizing, a table is divided horizontally into rows and vertically into columns.

The intersection of a row and a column is a "field". In this model each field contains a single data value, which may be null. For a given column all data values are of the same type or "domain." Consequently, for a given table all rows are of the same type. Another important concept is that in the relational model a "primary key" is selected for each table. A primary key is the attribute or shortest group of attributes that identifies each row (Nijssen and Halpin 247-256).

To clarify all these new terms, I will present an example of a table:

ID    Name              Telephone
1576  Angeleri, Paula   788-3395
1598  Paulon, Daniela   773-6889     => rows or tuples
1783  Rabossi, Marcelo  797-0345

Value of a field: "1576"
Domain: numbers
Primary key: the attribute ID, because it is the only column that identifies a tuple. (We can have two persons with the same name, or the same telephone, but each person has his/her own identification number.) In the example above, the primary key value is 1576, and there is no other record with this value.

What Is Normalization?

Normalization is the process of converting the files of a database into a more desirable form (Date 238). Normalization indicates whether a file needs to be split into two or more files. The following example shows how a file can be split:

Not Normalized Approach:

ID    Name              Address              Zip Code  City
1576  Angeleri, Paula   1111 Dartmouth Ave.  91711     Claremont
1598  Paulon, Daniela   124 Foothill Bld.    91711     Claremont
1783  Rabossi, Marcelo  7666 Arrow Hwy.      91888     Upland

Normalized Approach:

ID    Name              Address              Zip Code
1576  Angeleri, Paula   1111 Dartmouth Ave.  91711
1598  Paulon, Daniela   124 Foothill Bld.    91711
1783  Rabossi, Marcelo  7666 Arrow Hwy.      91888

Zip Code  City
91711     Claremont
91888     Upland

This example shows how a file can be split into two files by storing the name of the city in another file.
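In SQL terms, the normalized approach above corresponds to two tables, with the zip code of the person table acting as a foreign key into the city table. The CREATE TABLE statements below are a minimal sketch I am adding for illustration (the column types are assumptions):

CREATE TABLE CITY (
    ZipCode  CHAR(5)      PRIMARY KEY,
    City     VARCHAR(40)
);

CREATE TABLE PERSON (
    ID       INTEGER      PRIMARY KEY,
    Name     VARCHAR(60),
    Address  VARCHAR(80),
    ZipCode  CHAR(5)      REFERENCES CITY (ZipCode)  -- city name stored once, in CITY
);

Because the city name is stored only once per zip code, renaming a city requires updating a single row in CITY.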


    Introduction

Today all companies need databases to store their data, because they handle a large volume of information. The purpose of this stored information is to be recovered when it is needed, so it is important to have the data organized in order to have easier and quicker access to it. This organization of data is expressed through "data models." A model is an abstraction of reality, often represented in a diagram. Data is represented in a model in order to become clearer for the database designer.

The purpose of this research is to provide enough information to convince the reader that relational databases must be normalized in order to obtain the full benefits of having a database.

    Some of these benefits are:

    - Providing quick access to data.

    - Sharing data.

    - Having consistent information.

    - Reducing redundancy.

    - Giving on-line access.

    - Having up-to-date information.

    - Providing standards. (Date, 1990, 13-22)

    - Applying security restrictions.

    - Preventing failure condition errors.

    Quick access to data


The database system is equipped with a Data Query Language (DQL) that provides easy and quick access to the data.

    Data shared

    Sharing data means that the data stored in the database can be accessed by different systems and

    users.

    Consistent information

The information is consistent when the data are accurate. To understand this, suppose that the database has duplicated information. There could be occasions on which this information is updated only once, leaving the other duplicated data unchanged. This is called an update anomaly. (Data inconsistency will be explained later in more detail.)

    Redundancy

In a non-database approach, each computer program has its own private files. That implies redundant information, because data used in two different programs must be stored twice. In a database, data is shared by many users and many programs, reducing the redundancy that can lead to inconsistencies. (Data redundancy is also explained later on.)

    On-line access

    Information is available on demand at any time, in an interactive way.


    Up-to-date information

As the information is centralized, data are updated from different system programs operated by users of different departments. Therefore the data are current.

    Standards

Applicable standards are used to make data and documentation uniform. The database approach also provides standard reports and forms that help system programmers.

    Security restrictions

The database system provides effective control of data access. Each user of the database has a limited view of the data and restricted access authorization. Therefore the database system provides security (preventing data damage) and confidentiality (data are seen only by authorized users) (Davis, 42).

    Failure condition errors

The database system provides a mechanism to recover data in the event of either hardware or software failure (Davis, 43).


    Why Should we Apply Normalization?

Databases help to resolve problems, but it is necessary to follow certain patterns in order to avoid other problems. For example, normalization eliminates data redundancy.

Another advantage of applying normalization patterns is that normalization reduces problems concerning data manipulation: inserting, deleting, and updating data (Dutka and Hanson 9).

Finally, normalization is applied in order to avoid a problem called inconsistency among data. When data are not consistent, they are not reliable.

Summarizing, the normalization process is applied in order to achieve the following objectives:

1. It ensures that the database does not present insert, update, or delete anomalies.

2. It allows new kinds of data to be introduced into the database without restructuring it. This is important because it extends the lifetime of the applications.

3. It stresses the benefits of having a database by making it more informative for the user.

4. It ensures that the database is free of any design bias that favors certain queries at the expense of others.

5. It allows any relation to be represented as a table, by eliminating repeating groups.

6. It simplifies the data access operations.

However, further normalization can bring a couple of possible disadvantages (Codd):

1. It incurs extra relation names.

2. It makes it necessary for some queries to use more join operations than in a denormalized database.¹ According to Date, Codd explained that this inconvenience could be solved by providing predefined views for heavily used queries (Date, March 1999).

Data Redundancy

Data redundancy means that the same data is stored in the database more than once. This is not recommended, first because it means duplicated effort (the same update operation must be performed as many times as duplicates of the data exist). Secondly, data redundancy means a waste of storage space, because the data is stored repeatedly. Finally, it is possible for an inconsistency among data to occur, as the data may not be updated everywhere it appears. (Elmasri and Navathe, 12)

    Inconsistency among data

As I have mentioned before, a database can have multiple values of the same data, and these values may not agree with each other. This phenomenon is called inconsistency. It is clear that an inconsistent database has contradictions; therefore it will supply incorrect information to its users (Date 10).

Advantages of Further Normalization

Normalizing a relational database certainly gives many advantages, as I explained before. Now I want to clarify that these advantages are accrued when adopting a higher level of normalization, that is, the Second and Third Normal Forms (NF), rather than merely the First NF. This explanation is necessary because a database is already considered normalized when its tables are in First NF.

¹ The term denormalized is not used by Date in this article, but it is used in at least one of his books.

    Explanatory example

A comparison will further prove the need for normalization. I will present an example of a relational database that is not normalized, point out the problems that this database presents, and then show the same situation in a normalized relational database.

Suppose we have information about students and courses, and we use an approach that is not normalized to store them:

Student_name    Address             Course          Mark  Year
Fini, Mercedes  343 6th Street      Mathematics I   A     1997
Fini, Mercedes  343 6th Street      Physics I       A-    1997
Smith, John     23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter   86432 Two Nods Rd.  Mathematics I   B     1997
Fini, Mercedes  343 6th Street      Mathematics II  B-    1998

Now, suppose that Mercedes Fini has changed her address. We must update her address in the database. So the user searches for Mercedes Fini's record and finds:


Student name: Fini, Mercedes
Address: 343 6th Street
Course: Mathematics I
Mark: A
Year: 1997

Then, he/she will change Mercedes Fini's address:

Student name: Fini, Mercedes
Address: 567 Main Street
Course: Mathematics I
Mark: A
Year: 1997

And now we have:

Student_name    Address             Course          Mark  Year
Fini, Mercedes  567 Main Street     Mathematics I   A     1997
Fini, Mercedes  343 6th Street      Physics I       A-    1997
Smith, John     23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter   86432 Two Nods Rd.  Mathematics I   B     1997
Fini, Mercedes  343 6th Street      Mathematics II  B-    1998


At this moment, I ask: what is Mercedes Fini's real address?

This database does not provide a good answer to my question. It will tell me that Mercedes Fini has two different addresses; or it will tell me that Mercedes Fini lives at 343 6th Street (which is not true anymore); or it will tell me that Mercedes Fini lives at 567 Main Street in a first query, while subsequent queries could give me a different answer. The answer will depend on the way the user expresses the query. At this moment you may ask: why can we not use an approach like the one below?

Student_name    Address             Course          Mark  Year
Fini, Mercedes  343 6th Street      Mathematics I   A     1997
                                    Physics I       A-    1997
                                    Mathematics II  B-    1998
Smith, John     23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter   86432 Two Nods Rd.  Mathematics I   B     1997

In this approach, Mercedes Fini has one address, and the user will have no problem updating it. However, the user will be in trouble if he/she tries to update, add, or delete a course, because all of Mercedes Fini's courses are together in a single cell, and we cannot use them separately.

At this point, the question "Why is it necessary to apply normalization?" is answered: it is necessary to apply normalization in order to avoid the update, delete, and insert anomalies that would introduce inconsistency into the database.

Now we face another problem: how do we apply normalization? Before answering this question, I will introduce some database concepts, so that the vocabulary used in the next pages can be easily understood.


    Major New Concepts

Functional Dependence: given R.A and R.B, where A and B are attributes or sets of attributes of the relation R, the notation R.A → R.B means that B is functionally dependent on A, or that A functionally determines B in R. This functional dependence implies that, knowing the value of A, we can obtain the value of B (Kroenke, 112).

We can simplify the notation if R is understood: A → B.

Transitive Dependence: A → C is a transitive dependence if A → B and B → C. According to Codd, C is strictly transitively dependent on A.
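As a small worked illustration (my own, reusing the attributes of the earlier address example): if student_name → zip_code and zip_code → city, then student_name → city is a transitive dependence, and city is strictly transitively dependent on student_name.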

Candidate key: a candidate key K is the attribute or set of attributes that identifies a unique row in a relation (Kroenke, 502). A candidate key is required to be unique and nonredundant. This latter term is defined by Date as: no attribute can be discarded from K without destroying the uniqueness property. Date states that the terms unique and irreducible would better suit the definition of a candidate key. Additionally, Date remarks on the importance of Codd's observations regarding the candidate key concept: (a) each attribute of relation R is functionally dependent on each candidate key of R, and (b) within any given candidate key, no proper subset of the attributes is functionally dependent on any other.

According to Date, these two statements imply that each attribute of R is functionally dependent on each superkey of R, a superkey being a superset of a candidate key (which implies that a superkey is obtained by discarding the nonredundancy requirement of a candidate key).


Primary key: one of the candidate keys, arbitrarily chosen by the database designer. The operational difference between a primary key and a candidate key is that a tuple cannot have an undefined value for any of the attributes that compose the primary key, whereas any of the components of a candidate key can have an undefined value.

Insert/Update/Delete Anomaly: according to Date, none of these terms was ever formally defined, so he did not define them. For this reason I will take the definitions from Kroenke (501): an anomaly is an undesired consequence of a data modification.

An insertion anomaly occurs when it is necessary to add facts about two or more different things to a single row of a relation. This means that, in order to add a row to a table, values must be given for attributes that are not really needed for this particular event. For example, to incorporate a particular department in a company, data about an employee is required.

A deletion anomaly occurs when some particular information is lost when deleting a single row. This happens when, for example, it is required to delete an employee and, in the process, the database loses the information regarding his department (which is not desired).

An update anomaly occurs when there is more than one entry for a particular piece of data, and the value of one of these entries changes while the rest remain unaltered. For example, I keep the address of an employee twice, and I change its value only once.

According to Date, Codd stated that the anomalies are due to redundant information. Date attributes this undesired situation to a low level of normalization in the database. (Date, March 1999)


    How to Apply Normalization in a Relational Database?

Normalization implies decomposition; that means transforming tables that are too large into smaller ones that are free of update anomalies. The normalization technique provides a complete theory for decomposing a table (Nijssen and Halpin 273). To explain how to apply this theory I will use the same student-courses example:

TABLE: STUDIES

Student_name    Address             Course          Mark  Year
Fini, Mercedes  343 6th Street      Mathematics I   A     1997
                                    Physics I       A-    1997
                                    Mathematics II  B-    1998
Smith, John     23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter   86432 Two Nods Rd.  Mathematics I   B     1997

The Normalization Theory

The normalization theory involves six general rules, called Normal Forms. The following picture shows the relation among them:

[Figure: nested circles showing that 5NF is contained in 4NF, 4NF in BCNF, BCNF in 3NF, 3NF in 2NF, and 2NF in 1NF]


The circles imply that a relational database is in Fifth Normal Form (5NF) only if it is in 4NF (and, of course, satisfies additional conditions).

A relational database is in 4NF only if it is in BCNF (Boyce-Codd Normal Form).
A relational database is in BCNF only if it is in 3NF.
A relational database is in 3NF only if it is in 2NF.
A relational database is in 2NF only if it is in 1NF.

For the purpose of this paper I will only explain the first three Normal Forms: 1NF, 2NF, and 3NF.

    First Normal Form: to satisfy the 1NF, a table or relation must have a primary key (unique

    identifier). In addition, it cannot have null values for the primary key attributes, and each row-

    column intersection must have single values (no repeating groups are allowed). Finally, all non-

    key attributes must be functionally dependent on the primary key. The term functionally

    dependent means that knowing the value of the attribute A, you can know the value of the

    attribute B. Therefore the attribute B is functionally dependent on the attribute A, and the

    attribute A determines the value of the attribute B.

Following this rule, the example above must be modified in order to avoid a "repeating group" (when a single field is used to store more than one value). In the example, Mercedes Fini has 3 course values: Mathematics I, Physics I, and Mathematics II. The only way to store these 3 values is to have 3 different tuples or records with Mercedes Fini's courses:


    TABLE: STUDIES

student_name    Address             Course          Mark  Year
Fini, Mercedes  343 6th Street      Mathematics I   A     1997
Fini, Mercedes  343 6th Street      Physics I       A-    1997
Fini, Mercedes  343 6th Street      Mathematics II  B-    1998
Smith, John     23 Sunset Avenue    Mathematics I   A-    1997
Roland, Peter   86432 Two Nods Rd.  Mathematics I   B     1997

    Second Normal Form: in a 2NF table or relation, each non-key attribute (neither a key nor a part

    of a composite key) must be functionally dependent on the whole of the key (not just a part of it).

    In other words: all the attributes that are not a part of the primary key must depend on all the

    attributes that compose the primary key. And, of course, the relation must be in 1NF.

Consider the table STUDIES: the primary key of the table is the set of attributes (student_name + course), because knowing the values of both of them we can determine the whole record. Notice that, knowing just the student_name (e.g., Mercedes Fini), we cannot know that Mercedes Fini's mark in Physics I in 1997 was A-; this is because Mercedes Fini has more than one record. Likewise, knowing only a value of the attribute course (e.g., Mathematics I), we cannot predict the values of the other attributes of the record, because there are many students who studied Mathematics I.

We have selected the primary key for the table STUDIES. Now we have to apply the 2NF: do the student_name and the course together determine the non-key attribute address?

No: we can know the address of the student by knowing only the student_name. Therefore this example does not follow the 2NF rule.

To solve this problem we must divide the table STUDIES into two tables:


TABLE: STUDENT

student_name    Address
Fini, Mercedes  343 6th Street
Smith, John     23 Sunset Avenue
Roland, Peter   86432 Two Nods Rd.

TABLE: STUDIES

Student_name    Course          Mark  Year
Fini, Mercedes  Mathematics I   A     1997
Fini, Mercedes  Physics I       A-    1997
Fini, Mercedes  Mathematics II  B-    1998
Smith, John     Mathematics I   A-    1997
Roland, Peter   Mathematics I   B     1997

Now Mercedes Fini has only one address, because we have eliminated the redundancy. The user can update her address without any problem.
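As a minimal SQL sketch of this 2NF decomposition (the statements and column types below are my illustration; the paper itself shows only the resulting tables):

CREATE TABLE STUDENT (
    student_name  VARCHAR(60) PRIMARY KEY,
    address       VARCHAR(80)
);

CREATE TABLE STUDIES (
    student_name  VARCHAR(60) REFERENCES STUDENT (student_name),
    course        VARCHAR(40),
    mark          CHAR(2),
    year          INTEGER,
    PRIMARY KEY (student_name, course)  -- mark and year depend on the whole key
);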

    Third Normal Form: to be in 3NF, a relation must be in 2NF, and all non-key attributes must

    depend on the primary key directly (no transitive dependencies are allowed).

By observing the next table, we can see that the mark of the student depends on the primary key (student_name + Course + year). The problem here is that the attribute concept depends on the attribute mark, which is not the primary key of the relation. This implies that this table is not in 3NF.

TABLE: STUDIES (primary key: student_name + Course + year)

student_name    Course          Mark  Year  Concept
Fini, Mercedes  Mathematics I   A     1997  Outstanding
Fini, Mercedes  Physics I       A-    1997  Excellent
Fini, Mercedes  Mathematics II  B-    1998  Good
Smith, John     Mathematics I   A-    1997  Excellent
Roland, Peter   Mathematics I   B     1997  Very Good


The following two tables show how the previous table was decomposed to achieve 3NF. We can see how the attribute concept was moved into another table, so we no longer have a transitive dependency.

TABLE: STUDIES (primary key: student_name + course + year)

Student_name    Course          Mark  Year
Fini, Mercedes  Mathematics I   A     1997
Fini, Mercedes  Physics I       A-    1997
Fini, Mercedes  Mathematics II  B-    1998
Smith, John     Mathematics I   A-    1997
Roland, Peter   Mathematics I   B     1997

TABLE: MARKS (primary key: mark)

mark  concept
A     Outstanding
A-    Excellent
B-    Good
B     Very Good
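A corresponding SQL sketch for the 3NF step (again my own illustration, with assumed column types): the concept column moves to its own table, and the mark column of STUDIES becomes a foreign key into it, so each mark-to-concept association is stored exactly once:

CREATE TABLE MARKS (
    mark     CHAR(2)     PRIMARY KEY,
    concept  VARCHAR(20)
);

-- In STUDIES, the mark column would now be declared as:
--     mark CHAR(2) REFERENCES MARKS (mark)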

Now we have seen how to apply normalization. Before we move to the next section, I want to explain that it is strongly recommended to use a numeric (or letter) identifier for the primary key, rather than the name, the description, or other fields like them. This is because you can easily have more than one person/customer/student/etc. with the same name. Have you never seen a boy and his father sharing the same name?

Another problem you can have when using a name or description to identify the tuples in a given relation is that sometimes a person can type them differently. When this happens, if someone wants to compute a statistic, the identifiers will look different to the processor, as it will not understand that different spellings can express the same thing. Think of the


example: "Engineering", "Enginering", or "engineerng". For the processor, these are basically three different names.

For those who are not familiar with normalization theory but are familiar with data modeling techniques, I add the following section, which explains how to convert an Entity Relationship model into a relational model. After following this process, it is recommended to check each table, one by one, to see whether a mistake was made during the conversion and the database is not well normalized.

    Mapping the ER Model into the Relational Model:

To map the ER model into the relational model (RM) you can follow these guidelines, taken from Elmasri and Navathe (172-178):

Step 1: each entity type is mapped into a relation or table in the RM, including the simple attributes of the entity as columns (attributes) of the table, and selecting one or more attributes as the primary key (unique identifier).

Step 2: each weak entity (or dependent entity) is mapped into a relation. The primary key of this relation will be formed by a combination of the primary key of the owner entity (the strong entity) and the partial identifier of the weak entity.

Step 3: each binary 1:1 relationship is mapped by duplicating the primary key of one relation as a foreign key in the other relation. It can also be mapped by migrating both primary keys as foreign keys, but I do not recommend this approach, as it can lead to inconsistencies. Finally, it can be mapped by merging both entities and the relationship into one single entity (this approach can be


used when neither entity participates in any other relationship and both entities are total). I still recommend the first approach.

Step 4: each strong binary relationship with cardinality 1:N is mapped by migrating the attributes of the primary key at the 1-side of the relationship as a foreign key into the table at the N-side of the relationship. It is also possible to build a table to represent the relationship. This second approach is particularly selected when:

- the relationship could become M:N in the future;

- many records would have a null value for this foreign key.

Step 5: each binary M:N relationship is mapped as a new table whose primary key is composed of the primary key of one entity's table and the primary key of the second entity's table. Additionally, if the relationship has its own attributes, they are transformed into columns of the new table.
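As an illustrative SQL sketch of Step 5 (the relationship name ENROLLS is hypothetical; the STUDENT and COURSES tables are the ones used in the examples of this paper), an M:N relationship with Year as an attribute of the relationship itself becomes:

CREATE TABLE ENROLLS (
    StudentID  INTEGER REFERENCES STUDENT (StudentID),
    CourseID   INTEGER REFERENCES COURSES (CourseID),
    Year       INTEGER,  -- attribute belonging to the relationship itself
    PRIMARY KEY (StudentID, CourseID)  -- both entities' keys compose the new key
);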

Step 6: each multivalued attribute M is mapped into a new table whose primary key is composed of the primary key of the entity or relationship that has the multivalued attribute M, plus an additional attribute of M that makes this primary key unique. If M is a composite attribute, all its simple components are included in this new table.

Step 7: each n-ary relationship is mapped into a new table. In this new table, all the attributes that compose the primary key of each of the n entities (usually three entities) become columns. In addition, each attribute belonging to the relationship itself is mapped as a new column of the new table. To identify the primary key, we need to take into consideration the cardinality of the relationship.

To determine the cardinality, you must ask yourself:


[ER diagram: ternary relationship among Supplier, Part, and Project]

- How many PARTS of the same (one) supplier are needed for one project? The answer will depend on the business rules. For instance, MANY (0..M) parts are provided by supplier S1 for the project P1.

Then ask yourself:

- In how many PROJECTS does the same (one) supplier provide the same (one) part? Again, the answer will depend on the business rules. For instance, the same part (P1) is provided by one supplier (S1) for MANY (1..M) projects.


Finally, you must ask about the third entity:

- How many SUPPLIERS provide the same (one) part for the same (one) project?

The answer will also depend on the business rules. For instance, MANY (1..M) suppliers can provide the same part (P1) for the same project (PJ1).

Following my example (which is similar to the one proposed by Elmasri and Navathe, and the same ternary relationship described in many data modeling books), the cardinality of this relationship is M:N:N. Notice that we have one maximum cardinality for each entity that participates in the relationship. In this example we have three entities and three maximum cardinalities: M, N, N.

In a ternary relationship, when the cardinality is M:N:N, the primary key of the table that represents this relationship in the relational model will be composed of the concatenation of the primary keys of each entity.

If the business rules establish that the supplier that provides a particular part (P1) for a particular project (PJ1) is always the same (which means that no other supplier can provide the


part P1 for the project PJ1), it is easy to know who the supplier is by knowing the part identifier (Part#) and the project identifier (Proj#). In functional dependence terms, the part number and the project number determine the supplier number:

(Part#, Proj#) → Supplier#

In this particular case, the primary key of the relationship table SUPPLIERS_FOR_PROJECTS can be composed of the primary key of the PART table and the primary key of the PROJECT table: (Part#, Proj#). However, I still recommend using all the foreign keys as the composite primary key (instead of only two fields), because the business rules can change, and it would then be very hard to modify all the existing applications.
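A SQL sketch of the recommended mapping (my illustration; the SUPPLIER, PART, and PROJECT tables are assumed, and I write Part#, Proj#, and Supplier# as PartNo, ProjNo, and SupplierNo, since "#" is not a legal character in SQL identifiers):

CREATE TABLE SUPPLIERS_FOR_PROJECTS (
    SupplierNo  INTEGER REFERENCES SUPPLIER (SupplierNo),
    PartNo      INTEGER REFERENCES PART (PartNo),
    ProjNo      INTEGER REFERENCES PROJECT (ProjNo),
    -- All three foreign keys compose the primary key, so the table survives
    -- a later change of the business rules to M:N:N:
    PRIMARY KEY (SupplierNo, PartNo, ProjNo)
);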

    Normalization Implications

We have seen how important normalization is, because it helps us avoid inconsistency among data. However, it has other effects that are not desired. One of the problems of normalizing a database is that you lose query performance. As Nijssen and Halpin say, reducing the number of tables reduces query time; conversely, increasing the number of tables during the normalization process will increase query execution time (254). In other words, you spend more time on each query because you need to link (join) more tables.

Some people believe that a database must be denormalized in order to have better performance during query processing. According to Elmasri and Navathe, denormalization means that the attributes that are frequently required together in a query operation must be stored in the same table or relation. However, it is then necessary to ensure that each update transaction


modifies all the values of a duplicated attribute, in order to keep the database consistent (469). In the non-normalized example presented before, we saw that Mercedes Fini's address is stored three times in the database. This implies that every program that modifies the address attribute must check that the value of this attribute is updated as many times as it appears in the database. As you can understand, this denormalized approach is risky. For example, a new programmer may forget to include in his/her program the controlling routine that ensures all values of the same data are updated simultaneously. By doing so, he/she will introduce undesirable inconsistency into the database.

The second problem is clearly stated by Yao, when he says:

The fragmentation of data into normalized relations unfortunately makes it necessary to write tedious and lengthy queries to obtain useful information. (107)

At this point I cannot disagree with this statement. It is true that the query will be longer in a normalized approach. This is due to the fact that the query language requires the programmer to state clearly which tables he/she will use, and which attribute or group of attributes links them.

For example, in the query language known as SQL (Structured Query Language), you will have the following instruction to obtain data from only one table:

SELECT student_name, address, course
FROM STUDIES
WHERE student_name = 'Fini, Mercedes';

This query will give me a report with Mercedes Fini's address and courses.

    In a normalized approach the same report will be obtained through the following instruction:


SELECT STUDENT.student_name, address, course
FROM STUDENT, STUDIES
WHERE STUDENT.student_name = 'Fini, Mercedes'
  AND STUDENT.student_name = STUDIES.student_name;

The last condition shows the relation between the two tables (the primary key attribute or attributes that link them). You will have one of these conditions for each primary key attribute that links each pair of tables containing the attributes you want to list. As a consequence, the code of the query is longer. However, it takes only a few minutes more to write these few additional lines of code.

In summary, I strongly recommend normalizing your relational database, even if the query code becomes longer or the query performance is slower.

    The following section explains the Query Optimization process that is provided by almost

    every RDBMS.

Query Optimization:

Have you ever tried a query joining two tables with a large number of fields and a large number of records? If you have, you will probably agree that the performance of the query (its response time) is not very good. Have you tried the same query using an RDBMS that provides a query optimization capability? If so, you must have noticed that the response time improves considerably. I will try to explain the way the query optimization capability works through an example. Consider the following tables:

TABLE: STUDENT

StudentID  Student_Name     Address             ZipCode
10         Fini, Mercedes   343 6th Street      91711
20         Smith, John      23 Sunset Avenue    91711
30         Roland, Peter    86432 Two Nods Rd.  91716
40         Funson, Larry    345 8th Street      91711
50         Neuman, Paul     1010 Foothill Ave.  91711
60         Gibson, Mel      863 Two Birds Rd.   91716
70         Tapia, Roland    343 6th Street      91711
80         Banderas, Peter  273 La Luna St.     91714
90         Sony, Benson     842 Main Boulevard  91716

TABLE: STUDIES

StudentID  CourseID  Mark  Year
10         101       A     1997
10         104       A-    1997
10         102       B-    1998
20         103       A-    1997
30         101       B     1997
40         103       A     1997
40         104       A-    1997
50         105       B-    1998
60         101       A-    1998
60         105       C     1999
70         104       A+    1999
70         106       A     1999
80         104       B     1999
90         103       A+    1999
90         105       B     1999

TABLE: COURSES

CourseID  Course_Name
101       Mathematics I
102       Mathematics II
103       Mathematics III
104       Physics I
105       Physics II
106       Physics III

    Suppose we want to perform the following query:

SELECT Student_Name
FROM STUDENT, STUDIES
WHERE STUDENT.StudentID = STUDIES.StudentID
  AND STUDIES.CourseID = 101;


    The corresponding Algebraic expression for this query is:

    (( STUDIES JOIN STUDENT ) WHERE CourseID=101 ) [Student_Name]

In this example we have 9 students, 15 courses taken by the students (studies), and only 3 students who have taken course 101. Now we will see how this query is performed without any optimization at all:

First: JOIN relations STUDIES and STUDENT (over StudentID).

During this step, the 15 tuples from the STUDIES table are read. Then each of the 9 student tuples (records) is read 15 times, once per STUDIES tuple. This operation involves reading 135 tuples. The intermediate result consists of 15 joined tuples (that is, tuples that contain the fields of both tables and satisfy the condition STUDENT.StudentID = STUDIES.StudentID).

Second: RESTRICT the result of the first step to just the tuples that satisfy the condition (tuples where CourseID = 101).

During this step, the 15 joined tuples are read in order to find the tuples that have CourseID = 101. The result is a relation that contains only 3 tuples.

Third: PROJECT the result of the second step over Student_Name.

During this step, the 3 tuples of the relation obtained are read in order to select the Student_Name of each one. The output is a new relation containing only the Student_Name attribute, with duplicate tuples removed (in this example there are no duplicate tuples, so the new relation will contain 3 tuples).

Now, if we count the processed tuples, we find that this query, performed in this way, involves 189 (15 + 135 + 15 + 15 + 3 + 3 + 3) processed tuples.


Resolving this query following the optimization process would involve these steps:

First: RESTRICT relation STUDIES to just the tuples for course 101.

During this step, the 15 tuples of the STUDIES relation are read in order to produce the 3-tuple intermediate relation that satisfies the condition CourseID = 101.

Second: JOIN the result obtained during the first step to the STUDENT relation (over StudentID).

During this step, the 9 tuples from the STUDENT relation are read. Then each of the 9 student tuples (records) is read 3 times, once per course-101 tuple. This operation involves reading 27 tuples. The intermediate result consists of 3 joined tuples (that is, tuples that contain the fields of both tables and satisfy the condition STUDENT.StudentID = STUDIES.StudentID).

Third: PROJECT the result obtained during the second step over Student_Name.

During this step, the 3 tuples of the obtained relation are read in order to select the Student_Name of each one. The output is a new relation containing only the Student_Name attribute, with duplicate tuples removed (in this example there are no duplicate tuples, so the new relation will contain 3 tuples).

Now, if we count the processed tuples, we find that this query, performed in this second way, involves 63 (15 + 3 + 9 + 27 + 3 + 3 + 3) processed tuples.
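A programmer can also make this restrict-first order explicit in the SQL itself, for example with a subquery (a sketch I am adding for illustration; a modern optimizer will usually produce this order on its own):

SELECT Student_Name
FROM STUDENT
WHERE StudentID IN
      (SELECT StudentID
       FROM STUDIES
       WHERE CourseID = 101);  -- restriction applied before matching against STUDENT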

In this example, the tables are small enough to be kept in main memory (only the 135-tuple relation would probably be kept on disk, depending on the hardware capabilities). Still, it is easy to see that the second approach is better, as the total number of tuples processed is smaller. In addition, we can improve the performance even more by indexing the STUDIES


relation on CourseID, so that the total number of tuples read from STUDIES will be only 3 (in this example). Likewise, an index or hash on StudentID can be used in the STUDENT relation to improve the second step, by reading only the three needed tuples instead of 9.
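In SQL, such indexes could be created as follows (a sketch with hypothetical index names; in most systems the primary key of STUDENT is already indexed automatically):

CREATE INDEX idx_studies_course ON STUDIES (CourseID);   -- speeds up the restriction
CREATE INDEX idx_student_id     ON STUDENT (StudentID);  -- speeds up the join lookup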

In real life this optimization is very important, as we will surely have a larger number of records. For instance, suppose our STUDENT table contains 9,000 records, our STUDIES table has 34,000 records, and only 800 records match the condition "students who have taken course 101." Now, the maximum number of processed tuples can be 306,104,400 for a non-optimized query, and 7,246,200 for a query that follows the optimization process. This is because processing this query without optimizing it involves the following steps:

First, read the 34,000 STUDIES records; then read the 9,000 STUDENT records 34,000 times (that is, 306,000,000 tuples in total) and construct the intermediate table of 34,000 joined tuples. Due to the large number of records, we cannot keep all these tuples in main memory, so we need to write them out to disk and then read them back from disk (which takes longer than keeping them in memory). Once we have generated the intermediate table of 34,000 joined tuples, we read this table back into memory and produce an 800-tuple table (by performing the restriction operation). We assume this intermediate relation cannot be kept in memory, given its size, so it will need to be saved out to disk. Finally, we need to read these 800 tuples again in order to perform the projection, resulting in a new 800-tuple relation. In total we have: 34,000 + 306,000,000 + 34,000 + 34,000 + 800 + 800 + 800 = 306,104,400 processed tuples (this is the maximum number, when no duplicate tuples are found).

Following the optimization process we have:


First, read the 34,000 tuples of the STUDIES relation and produce the 800-tuple output relation. Then read the 9,000 tuples from the STUDENT relation and read each of those tuples 800 times, once per course-101 tuple (7,200,000 tuples processed in total), producing a new relation of 800 joined tuples. Take these 800 tuples and perform the projection, in order to obtain the desired relation (which has at most 800 tuples, if no duplicate tuples are found). In total we have: 34,000 + 800 + 9,000 + 7,200,000 + 800 + 800 + 800 = 7,246,200 processed tuples. You can easily see how the optimization mechanism has considerably improved this query's performance. When the tables involved in the query have a larger number of records and attributes, the advantage of using the query optimization mechanism is even more important (and more easily noticed). Sometimes query performance can be improved still further, as the smaller number of tuples read makes it possible to keep the intermediate results in main memory instead of on disk (so no I/O accesses to disk storage are needed). Additionally, the use of index and hash mechanisms improves this performance even more, as the number of tuples read can be reduced still further.

    The Optimization Process

The optimization process can be described as a mechanism by which a query is broken down into smaller queries that involve a lower number of processed tuples and a lower number of processed fields. In other words, the optimization process tries to avoid the Cartesian product operation (which involves a considerably large number of tuples and fields to be processed), and tries to reduce to a minimum the total number of tuples to be scanned at each stage (by applying a restriction operation first). This mechanism considerably improves the performance in resolving the query.

The query optimization process is sometimes called the query decomposition process. During this process, a query that involves multiple tuple variables is transformed into smaller queries that involve one or two variables each, by using the detachment and tuple substitution mechanisms. During the detachment process, the query is decomposed into smaller components that have one variable in common with the rest of the query and that cannot be decomposed any further (this is often called reduction). These components are separated from the query and then substituted by a new subquery in which a chosen variable is assigned a set of possible calculated values through the tuple substitution mechanism (generally called nested loops). It is important to notice that detachment generates two queries from one; for this reason it is applied before the tuple substitution mechanism. The second mechanism is applied once it is no longer possible to apply detachment. These two mechanisms are applied repeatedly in order to decompose the query into irreducible subqueries (components), gaining the flexibility needed to improve the query performance. The idea is to apply first the operations that reduce the absolute sizes of the intermediate results, considering both the size of their attributes and their number. As a rule, the one-variable queries are resolved first, and then the two-variable queries. The following example explains this process:

Consider again the STUDENT, STUDIES, and COURSES tables presented in the previous section, together with the following CITY table:


TABLE: CITY

ZipCode  City_Name
91711    Claremont
91712    Rancho Cucamonga
91713    Upland
91714    Montclair
91715    San Dimas
91716    Pomona

Suppose we want to perform the following query Q0: get the names of the students who live in Pomona and have taken Mathematics I during 1997.

SELECT Student_Name
FROM STUDENT S, STUDIES E, COURSES C, CITY L
WHERE L.City_Name = 'Pomona'
  AND L.ZipCode = S.ZipCode
  AND S.StudentID = E.StudentID
  AND E.Year = 1997
  AND E.CourseID = C.CourseID
  AND C.Course_Name = 'Mathematics I';

The tuple variables in query Q0 are S, E, C, and L, where S ranges over the relation STUDENT, E ranges over the relation STUDIES, C ranges over COURSES, and L ranges over the relation CITY.

The first step involves the detachment of the one-variable queries L, C, and E, as follows:

D1: L = SELECT ZipCode FROM CITY WHERE City_Name = 'Pomona';


D2: C = SELECT CourseID FROM COURSES WHERE Course_Name = 'Mathematics I';

D3: E = SELECT CourseID, StudentID FROM STUDIES WHERE Year = 1997;

Now the remaining query will be:

Q2: SELECT Student_Name
    FROM STUDENT S, L, E, C
    WHERE S.ZipCode = L.ZipCode
      AND S.StudentID = E.StudentID
      AND E.CourseID = C.CourseID;

Now we process the two-variable detachments:

D4: S = SELECT StudentID, Student_Name FROM S WHERE S.ZipCode = L.ZipCode;

D5: E = SELECT StudentID FROM E WHERE CourseID = C.CourseID;

And the remaining query will be:

Q3: SELECT Student_Name FROM S WHERE S.StudentID = E.StudentID;

Another way to do it is to process only:

D4: E = SELECT StudentID FROM E WHERE CourseID = C.CourseID;

And the remaining query will be:

Q3: SELECT Student_Name FROM S
    WHERE S.ZipCode = L.ZipCode
      AND S.StudentID = E.StudentID;


The following picture shows the decomposition tree for query Q0:

[Figure: decomposition tree for Q0 — the one-variable detachments D1 (over CITY, yielding L), D2 (over COURSES, yielding C), and D3 (over STUDIES, yielding E) feed the two-variable detachments D4 (yielding S) and D5 (yielding E), which combine into Q3 to produce the overall result]

As a general rule, first break down any RESTRICTION operation with more than one condition into a cascade of RESTRICTION operations, to obtain the flexibility needed for moving the RESTRICTION operations down the tree. Then reorder the tree nodes so that the most restrictive RESTRICTIONs (those that produce the relation with the smallest number of tuples) are processed first. Next, break down the PROJECTION attributes and push them as far down the tree as possible by creating new PROJECTION operations; this reduces the overall number of processed attributes. Finally, identify the groups of operations that can be processed together.


    Conclusion

    A relational database needs to be consistent in order to make its data reliable and useful. Consistency can be pursued under two different approaches: (1) applying normalization, or (2) not applying normalization (denormalization). With the first approach, the database will always be consistent. If we take the second position, we need to include control routines in every program that updates the database (one such control is sketched below), which means that the developer of the database is taking a risk: as I have mentioned, a new programmer can forget to add the controlling routine to a program, allowing the same data to be updated in one tuple but not in others. The second approach therefore DOES NOT ALWAYS ensure consistency among data. For that reason, it is recommended to select the first approach, applying normalization, since it guarantees the data consistency. If you still have some doubts about query performance, remember that it can be improved through the Query Optimization mechanisms: this process not only improves the query response time, it also allows the subqueries to be processed simultaneously in a multiprocessor environment.
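    As a minimal sketch of what such a control routine might look like when centralized in the DBMS itself (the denormalized table STUDIES_D, which redundantly stores Student_Name, is hypothetical, and trigger syntax varies across products; SQLite-style syntax is shown):

    -- Keep the redundant copy of Student_Name in the hypothetical table
    -- STUDIES_D in sync whenever the master copy in STUDENTS changes.
    -- Without a control like this in every update path, the copies diverge.
    CREATE TRIGGER sync_student_name
    AFTER UPDATE OF Student_Name ON STUDENTS
    FOR EACH ROW
    BEGIN
        UPDATE STUDIES_D
           SET Student_Name = NEW.Student_Name
         WHERE StudentID = NEW.StudentID;
    END;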


    Appendixes

    In this section of the paper I will give an overview of the origins of the Relational Model. I did not include this material in the paper body because I wanted to focus only on the central issues of the Relational Model: basic concepts, normalization, and query optimization.

    The first thing you need to know about the Relational Model is that the most prestigious researchers in this field of study are E.F. Codd and C.J. Date (which is why I mention them so many times).

    The Origins of the Relational Model

    According to Date, database management has evolved into a science thanks to Codd's work, which provided a theoretical framework (the relational model) within which a variety of important problems could be attacked in a scientific manner. He also observes that the main importance of Codd's work is that he introduced not only the Relational Model but also the very concept of a data model, needed to deal with the problems associated with the storage of large amounts of data.

    E.F. Codd was the first researcher to introduce the logic of predicates into the database management field, and he enlarged this contribution by introducing the relational algebra and the relational calculus into the relational data model. Furthermore, Codd developed the first relational data language (ALPHA), which allowed data to be manipulated independently of its physical representation (this independence is nowadays called physical data independence, and it is mentioned as one of the advantages that the relational model has over the network model). In addition, Codd's papers introduced the term functional dependence and the first three normal forms of the Normalization Theory.

    Codd's second relational paper, "A Relational Model of Data for Large Shared Data Banks", published in 1970 (Communications of the ACM 13, No. 6), is probably the most famous paper in the history of database management (Date, Dec. 1998).

    Codd's first paper focused on data redundancy and related issues. The second paper reads like a revision of the first, but it stresses the importance of providing physical data independence within a data model. In addition, this second paper includes some new concepts; for instance, it introduces the idea that data must always be normalized. As a result, this paper has had a great impact on the database field of study.

    Codd's third paper, in 1972, provides a formal definition of both the relational algebra and the relational calculus. It then presents an algorithm for mapping a calculus expression into an algebra expression; through this, Codd proved that the algebra is as powerful as the calculus.

    The algebra includes both objects and operators that follow particular principles. The relational algebra has relations as its objects, and it has operations such as restriction, projection, union, difference, join, etc. These operators satisfy certain laws, such as associativity, commutativity, and closure. Closure is very important for the relational algebra: it implies that the result of an operation between two relations is always another relation (Date, Jan. 1999).
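    Closure is also what makes nested queries possible. As a small illustration (a sketch in SQL, which here merely stands in for the algebra; the ZipCode value is hypothetical), the result of one operation can serve directly as the operand of another:

    -- The inner restriction yields a relation, which the outer
    -- projection consumes exactly as if it were a base table.
    SELECT Student_Name
    FROM (SELECT * FROM STUDENTS WHERE ZipCode = '91768') AS T;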


    Glossary

    Alternate key: this term was apparently introduced by Date. An alternate key is any of the

    candidate keys that were not chosen as the primary key of the relation (Date, March 1999).

    Cartesian product: the relational Cartesian product produces a set of tuples instead of simple pairs. Each tuple in the result contains all the components of a tuple of the first relation together with all the components of a tuple of the second relation. This algebraic operator is commutative and associative (Date, Jan. 1999).
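    In SQL (a sketch using the paper's STUDENTS and COURSES tables), a comma-separated FROM list with no join condition computes exactly this product:

    -- Every student paired with every course:
    SELECT S.StudentID, C.CourseID
    FROM STUDENTS S, COURSES C;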

    Composition: the composition of the relation A{X,Y} with the relation B{Y,Z} is the projection on X and Z of a join of A with B, always considering that X,Y and Y,Z are attributes of the corresponding relations A and B (Date, Nov. 1998).

    Consistency: at this point, Date states that Codd's definition of consistency doesn't capture all aspects of integrity. My interpretation of Codd's definition of consistency is as follows: a set of relations is consistent when the redundancies match; by this I mean that the redundant data do not have different values. According to Codd, to ensure consistency the system should be informed about the redundancies in order to control them (Date, Nov. 1998).

    Derivability: a relation is said to be derivable if and only if it is expressible (Date, Nov. 1998).

    Difference: the difference of two relations is a third relation formed by the tuples of the first

    relation that do not appear in the second relation (Kroenke).
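    In SQL this corresponds to the EXCEPT operator (spelled MINUS in some dialects); a sketch, where GRADUATES is a hypothetical table union-compatible with the projection of STUDENTS:

    -- Student IDs present in STUDENTS but absent from GRADUATES:
    SELECT StudentID FROM STUDENTS
    EXCEPT
    SELECT StudentID FROM GRADUATES;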

    Division: the division is the inverse of the Cartesian product, as you can see in the following formula quoted from Date's article: ( R TIMES S ) DIVIDEBY S = R (Date, Jan. 1999).
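    Division has no direct SQL operator; a common sketch uses a double NOT EXISTS. For example, to find the students who study every course (an illustrative use of the paper's STUDIES and COURSES tables):

    -- STUDIES DIVIDEBY COURSES: students with no course they do not study.
    SELECT DISTINCT E.StudentID
    FROM STUDIES E
    WHERE NOT EXISTS
          (SELECT * FROM COURSES C
           WHERE NOT EXISTS
                 (SELECT * FROM STUDIES E2
                  WHERE E2.StudentID = E.StudentID
                    AND E2.CourseID = C.CourseID));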


    Factoring: this operation is also called nesting. The factoring operator transforms a normalized

    relation into an unnormalized relation (Date, Jan.1999)

    First Normal Form (1NF): the definition of 1NF is not provided in the article; however, I will add it here. According to Kroenke, a table is said to be in 1NF when its cells have single values (neither repeating groups nor arrays are allowed as values) (118).

    Full Dependence: the functional dependence A → B is a full dependence when B is not functionally dependent on any proper subset of A (Date, March 1999).

    Intersection: the intersection of two relations is a third relation formed by the tuples contained in

    both relations (Kroenke).
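    In SQL this is the INTERSECT operator (a sketch, reusing the hypothetical GRADUATES table introduced under Difference):

    -- Student IDs present in both tables:
    SELECT StudentID FROM STUDENTS
    INTERSECT
    SELECT StudentID FROM GRADUATES;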

    Join: the join of a relation A{X,Y} with a relation B{Y,Z} is a relation C{X,Y,Z} where C{X,Y}=A and C{Y,Z}=B. The natural join is a special case of the join operation; it is the only join possible when the attribute X is functionally dependent on the attribute Y (in relation A) or the attribute Z is functionally dependent on the attribute Y (in relation B) (Date, Nov. 1998). In other words, the join of two relations is a third relation formed by applying the product operator and then selecting only the tuples that satisfy a specified criterion (Kroenke). In the natural join operation, redundant attributes must be eliminated.
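    A sketch in SQL, joining the paper's STUDENTS and CITY tables over their common ZipCode attribute:

    -- Natural join of STUDENTS and CITY; the common column is
    -- projected only once, eliminating the redundant copy.
    SELECT S.StudentID, S.Student_Name, S.ZipCode, L.City_Name
    FROM STUDENTS S, CITY L
    WHERE S.ZipCode = L.ZipCode;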

    Key: the unique identifier of a tuple in the relation. The key is said to be nonredundant when each attribute that constitutes the key is necessary for unique identification. This definition implies that if you remove an attribute from the key, then you no longer have a unique identifier (Date, Oct. 1998).


    Normalization: Date defines the normalization process as a process of taking projections, noting further that we can recover the original relation by taking the natural join of those projections, which implies that the process is reversible: the decomposition is nonloss (Date, March 1999).
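    A sketch of this in SQL, assuming a hypothetical unnormalized relation STUDENTS_CITY{StudentID, Student_Name, ZipCode, City_Name} (CREATE TABLE ... AS is a common, though not universal, SQL form):

    -- Decomposition by taking projections:
    CREATE TABLE STUDENTS AS
        SELECT DISTINCT StudentID, Student_Name, ZipCode FROM STUDENTS_CITY;
    CREATE TABLE CITY AS
        SELECT DISTINCT ZipCode, City_Name FROM STUDENTS_CITY;
    -- The natural join of these projections recovers the original
    -- relation: the decomposition is nonloss.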

    Permutation: this operation allows the reordering of the attributes of a given relation. One year after his first paper, Codd stated that the order of the attributes of a given relation is irrelevant, so this operation would remain for the internal use of the language (Date, Nov. 1998).

    Projection: it implies that the points in n-dimensional space corresponding to a relation of degree n can be projected over m of its attributes, onto the corresponding m axes (where m is always less than or equal to n).


    Second Normal Form (2NF): Codd's definition of 2NF is that a relation must be in 1NF and every nonprime attribute must be fully dependent on each candidate key. A prime attribute is an attribute that forms part of at least one candidate key; a nonprime attribute is an attribute that is not prime (Date, March 1999).

    Third Normal Form (3NF): Codd's definition of 3NF is that a relation R is in 3NF if it is in 2NF and every nonprime attribute is nontransitively dependent on each candidate key of R (Date, March 1999).

    Tie: it implies that given a relation with n attributes (X1 to Xn), the result of this operation is the

    restriction of the relation to just the particular rows in which Xn=X1 (Date, Nov.1998).

    Trivial Dependence: A → B is a trivial functional dependence if B is a subset of A.

    Union: the union of two relations is a third relation formed by adding the tuples of both relations

    and eliminating duplicate tuples (Kroenke).
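    In SQL this is the UNION operator (a sketch, again with the hypothetical GRADUATES table):

    -- All student IDs occurring in either table, duplicates eliminated:
    SELECT StudentID FROM STUDENTS
    UNION
    SELECT StudentID FROM GRADUATES;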


    References

    Atre, S., Data Base: Structured Techniques for Design, Performance, and Management. John Wiley & Sons, Inc., 1980.

    Date C.J., An Introduction to Database Systems. Reading, Massachusetts: Addison-Wesley

    Publishing Company, 1982.

    Davis, B., Data Base Management Systems: User Experience in the U.S.A. NCC Publications, 1975.

    Dutka, Alan F. and Hanson, Howard H., Fundamentals of Data Normalization. Reading,

    Massachusetts: Addison-Wesley Publishing Company, 1989.

    Elmasri, R. and Navathe, S.B., Fundamentals of Database Systems. Redwood City, California: The Benjamin/Cummings Publishing Company, Inc., 1994.

    Kroenke, D.M., Database Processing: Fundamentals, Design, and Implementation (6th edition).

    Upper Saddle River, N.J.: Prentice Hall, 1998.

    Haderle, Donald J., "Database Role in Information Systems: The Evolution of Database Technology and Its Impact on Enterprise Information Systems". In Goos, G. and Hartmanis, J. (eds.), Lecture Notes in Computer Science. Springer-Verlag, 1990.

    Hawryszkiewycz I.T., Database Analysis and Design, Macmillan Publishing Company, 1991.

    Nijssen G.M. and Halpin T.A., Conceptual Schema and Relational Database Design: A Fact

    Oriented Approach. New York: Prentice Hall, 1989.


    Turk, Thomas A., Planning and Designing the Data Base Environment. Van Nostrand Reinhold Company Inc., 1985.

    Yao, S. Bing, Principles of Database Design, Volume I: Logical Organizations, New Jersey:

    Prentice Hall, 1985.

    Articles:

    Date, C.J., "The Birth of the Relational Model" (part 1 of 3), Intelligent Enterprise, October 1998. In this article Date explains Codd's first relational paper: "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks", IBM Research Report RJ599, 1969.

    Date, C.J., "The Birth of the Relational Model" (part 2 of 3), Intelligent Enterprise, November 1998. This article also examines Codd's first relational paper: "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks", IBM Research Report RJ599, 1969.

    Date, C.J., "The Birth of the Relational Model" (part 3 of 3), Intelligent Enterprise, December 1998. This article discusses Codd's papers regarding the Relational Model; in particular, it refers to the following paper: Codd, E.F., "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM 13, No. 6, 1970.

    Date, C.J., "Thirty Years of Relational: Codd's Relational Algebra", Intelligent Enterprise, January 1999. In this article, Date analyzes Codd's third famous paper: "Relational Completeness of Data Base Sublanguages", published in 1972 (IBM Research Report RJ987).

    Date, C.J., "Thirty Years of Relational: The First Three Normal Forms", Intelligent Enterprise, March 1999. In this article, Date analyzes three of Codd's relational papers: "The Second and Third Normal Forms for the Relational Model", IBM technical memo (October 6, 1970); "Further Normalization of the Data Base Relational Model", IBM Research Report RJ909 (August 31, 1971); and "Normalized Data Base Structure: A Brief Tutorial", ACM SIGFIDET Workshop on Data Description, Access, and Control, San Diego (November 11-12, 1971).