Let’s Define a Few Terms
A database is a collection of persistent data, collected and stored for its usefulness in some endeavor.
A database management system (DBMS) is database and a set of procedures (usually computer programs) whose purpose is to extract utility from the database.
Example Databases My “to do” list, including
Referee three papers Prepare for algebraic statistics seminar Paint the kitchen
My fish collection Name Species Habitat
Stock market transactions 1950—2011 Stock symbol Date Closing price Shares traded Etc.
Example DBMSs
My “to do” list A 35 index card
My fish collection MS Access
Stock market transactions 1950—2011 An Oracle or MySQL system, along
with lots of other stuff…
Computer DBMSs “Flat-file” systems, like dBASE (forever) “Hierarchical” systems like IDMS (late 1960s) “Network” systems (usually called CODASYL, as in
COBOL) like IMS (1971) “Relational” systems like DB2, mySQL, and Oracle (and
yes, MS Access). Original papers published by Codd, 1974, but it took 20 years before computers were fast enough to implement.
Object-oriented DBMS, like ObjectStore Data mining software, open source and from e.g., SAS
and Oracle Hadoop
Some Elementary Definitions Data item—the smallest named
unit of data with meaning: last name, first name, ID number, etc. (a.k.a. attribute)
Record—a group of related data items: salesperson, customer, product, etc. (a.k.a. row)
File—a collection of records of a single type (a.k.a. table)
The Database Lifecycle
Requirements analysis Logical design
Entity-relationship (E-R) modeling View integration Transform to SQL tables Normalization
Physical design
Components of the ER Model Entity: person, place, thing, event, idea,
etc. The principal data objects of interest. A particular occurrence is an entity instance
Relationship: an association between one or more entities. Relationships are described in terms of Degree Connectivity Existence
Components of the ER Model
Attribute: a characteristic of entity Employee has emp_id, name, age,
position, etc. Department has dept_id, name, main
office address, manager’s name, etc. An attribute may be an identifier or a
descriptor.
Relationships Degree: the number of entities associated
in the relationship: binary, ternary, n -ary. Connectivity: a constraint on the mapping
of one entity to another; one or many are the choices
Existence: a condition on the association describing whether or not an instance of one entity must be associated with an instance of another entity; mandatory and optional are the choices
ER Diagrams Encode “Business Rules” (Not “truth”)
department employeeis-managed-by
0,1 1
“A department is managed by exactly one employee, andan employee manages zero or one departments.”
dept_num dept_name mgr_id
10 Tresuries 330
11 Corp. bonds 15
12 Morgage-backed 221
emp_id emp_name school salary
15 Dave CMU Big
110 Dan NYU Ok
560 Xue Haas Bigger
330 Rohit CMU Ok
221 Xin CMU Really big
451 Eduardo Sloan Ok
Another Business Rule
department divisionis-subunit-of
11..N
“A department is a subunit of exactly one division, anda division consists of many departments.”
dept_num dept_name mgr_id div_num
10 Tresuries 330 1
11 Corp. bonds 15 1
12 Morgage-backed
221 1
32 Swaptions 533 2
div_num div_name mgr_id
1 Fixed income 224
2 Options 114
A Different Way to Organize Managers
employee is-manager-of1
“An employee is managed by exactly one employee, andan employee is the manager of zero or more employees.”
0..N
This is implies a recursive table: mgr_id refers to emp_id
emp_id emp_name
school salary mgr_id
15 Dave CMU Big 554
110 Dan NYU Ok 15
560 Xue Haas Bigger 321
330 Rohit CMU Ok 560
221 Xin CMU Really big 554
451 Eduardo Sloan Ok 221
More ER Diagrams
project employeeworks-on
1…4 N
“An employee works on at least one but at most four projects, anda project is worked on by many employees”
employee project
skill
skill-usedM N
P
“Employees can use many skills on any one of many projects,and each project has many employees with various skills”
More ER Diagrams
technician project
notebook
uses-notebook
1 1
1
“A technician uses exactly one notebook for each project. Eachnotebook belongs to one technician for each project.” Note thata technician may still work on many projects and maintain differentnotebooks for each project.
More ER Diagrams
project employee
location
assigned-to1 N
1
“Each employee assigned to a project works at only onelocation for that project but can be at different locations for different projects. At a particular location, an employeeworks on only one project and many employees can beassigned to a given project.”
Adding Attributes
For small ER diagrams, or for fragments of larger ones, it may be useful to include attribute names.
department employeeis-managed-by
1 1
dept_id
dept_name
dept_address
emp_id
emp_name
job_class
Fancier Ideas Composite attributes have several “sub-
attributes”. Ex: address has street_address, city, state, zip
Multivalued attributes may have a set of values. Ex. college_degree may have several values
Derived attributes can be computed from others. Ex: age can be computed from date_of_birth
Null values indicate that either the attribute value is not applicable (apartment_number), or unknown (either missing but we know it exists, like a person’s height, or not known, like a cell phone number).
Key Attributes The rows in a table need to be unique. Uniqueness is determined over some
set of attributes. Often we introduce an “artificial”
attribute to enforce uniqueness (e.g., dept_id).
A key is one or more attributes which, together, are guaranteed to be unique for rows in the table.
Weak Entities Entities that don’t have key attributes of
their own are called weak entities. “Normal” entities are called strong. Ex: Dependent (i.e., son or daughter or
spouse) is related to Employee, but may have attributes that are not unique amongst all dependents.
Uniqueness is achieved through the attachment to the employee (who has a unique key).
Weak Entities The ER diagram notation is this:
dependent employeeis-dependent-of
N 1
name
birth_date
relationship
emp_id
emp_name
job_class
A Larger Example A flight has a flight number, and is associated
with a particular airline. A flight can have one or more legs, which are
trips between two airports. A leg instance is a leg of a flight, flown on a
particular day. Each leg instance has some seats available. A particular airplane is assigned to each leg
instance. Each airplane is of a particular airplane type. A particular airplane type can land at some,
but not all, airports.
airport flight_leg
departure-airport
arrival-airport
1
1 N
N
flight
leg_no
N
1
number
airline
airplane_type
can-land
N
M
airplane
airplane_id num_seats
type_name max_seats
1
N
airport_code
type
leg_instance
date
assigned N1
instance-of
N
1
seatseat_no
reservation 1
N
More Details
What if a relationship needs attributes?
leg_instanceseat
seat_no date
reservation
cust_name cust_phone
N 1
Basic Transformation Rules Transform each entity into a table
containing its key and non-key attributes. Transform every many-to-many binary or
binary recursive relationship into a relationship table with the keys of the entities and the attributes of the relationship.
Transform every ternary or n -ary relationship into a relationship table.
The Basic Relational Database Move (Tattoo On Your Arm!)
student enrolled_in courseN M
student_id student_name course_id course_name
student course1 N
student_id student_name course_id course_name
enroll
student_id course_id
N 1
student(student_id, student_name,…)course(course_id, course_name,…)enroll(student_id*, course_id*,…)
This Works for Recursive Relationships Too
professor is-coauthor-withN
professor
coauthor
M
2
N
professor( prof_id, prof_name)coauthor( author_id*, coauthor_id*)
Transformation to Actual Tables Here come a series of examples, showing
An ER diagram fragment, and SQL to create an actual set of tables, and Some alternate notation we’ll use later
We don’t use the create table statements in MS Access (we would use the ACCESS GUI instead), but we will use them when declaring MySQL tables.
1-1, Both Entities Mandatory
report
has-abbr
abbreviation
Every report has exactly one abbreviation, andevery abbreviation represents exactly one report
1
1
create table report (report_no integer, report_name char(256), primary key(report_no));create table abbreviation (abbr char(6), report_no integer not null unique, primary key(abbr), foreign key(report_no) references report on delete cascade on update cascade);
report(report_no, report_name)abbreviation(abbr, report_no*)
If the referencedreport is deleted,delete the abbr
If the referencedreport has its primarykey updated, update thisforeign key
1-1, One Mandatory, One Not
department
managed-by
employee
Every department must have a manager, but anemployee can be a manager of at most one dept.
1
1
create table department (dept_no integer, dept_name char(20), mgr_id char(10) not null unique, primary key(dept_no); foreign key(mgr_id) references employee on delete set default on update cascade);create table employee (emp_id char(10), emp_name char(20), primary key(emp_id));
department(dept_no, dept_name, mgr_id*)employee(emp_id, emp_name)
default is nullby default
1-1, Both Entities Optional
employee
has-allocated
desktop
Some desktop computers are allocated to employees,but some aren’t; and some employees might not get one.
1
1
create table employee (emp_id char(10), desktop_no integer, primary key(emp_id));create table desktop (desktop_no integer, emp_id char(10), primary key(desktop_no), foreign key(emp_id) references employee on delete set null on update cascade);
employee(emp_id, desktop_no)desktop(desktop_no, emp_id*)
Here?
Or here?
1-N, Both Entities Mandatory
department
has
employee
Every employee works in exactly one department,and every department has at least one employee
1
N
create table department (dept_no integer, dept_name char(20), primary key(dept_no));create table employee (emp_id char(10), emp_name char(20), dept_no integer not null, primary key(emp_id), foreign key(dept_no) references department on delete set default on update cascade);department(dept_no, dept_name)
employee(emp_name, dept_no*)
1-N, One Mandatory, One Optional
department
publishes
report
Each department publishes one or more reports. Agiven report may not necessarily be published bya department.
1
N
create table department (dept_no integer, dept_name char(20), primary key(dept_no));create table report (report_no integer, dept_no integer, primary key(report_no), foreign key(dept_no) references department on delete set null on update cascade);
department(dept_no, dept_name)report(report_no, dept_no*)
N-N, Both Entities Optional
engineer
belongs-to
prof-assoc
Every professional association has none, one or manyengineer members. Likewise for engineers and associations.
N
M
create table engineer (emp_id char(10), primary key(emp_id));create table prof_assoc (assoc_name varchar(256), primary key(assoc_name));create table belongs_to (emp_id char(10), assoc_name varchar(256), primary key(emp_id, assoc_name), foreign key(emp_id) references engineer on delete cascade on update cascade foreign key(assoc_name) references prof_assoc on delete cascade on update cascade);
engineer(emp_id)prof_assoc(assoc_name)belongs_to(emp_id*, assoc_name*)
Recursive Relationships
professor is-coauthor-withN
professor
coauthor
M
2
N
create table professor (prof_id char(10), prof_name char(20), primary key(prof_id));create table coauthor (author_id (char(10), coauthor_id char(10), primary key(author_id, coauthor_id), foreign key(author_id) references professor on delete cascade on update cascade foreign key(coauthor_id) references professor on delete cascade on update cascade);
Weak Entities
dependent employeeis-dependent-of
N 1
create table employee (emp_id char(10), emp_name char(20), primary key(emp_id));create table dependent (dep_id char(10), dep_name(char(20), emp_id char(10), primary key(dep_id, emp_id), foreign key(emp_id) references employee on delete cascade on update cascade); employee(emp_id, emp_name)
dependent(dep_id, emp_id*, dep_name)
Normalization
The issues here are database Integrity Performance Maintainability
“Normalization” is a theory that pushes integrity, but it recognizes the other virtues.
Normalization & Redundancy
product_name order_no
cust_name
cust_addr credit
date sales_name
vacuum cleaner
1458 Bachmann
Austin 6 5-5-02 Bloch
computer 2730 Huang Mt. View 10 5-6-02 Hanss
refrigerator 2460 Stolarchuck
Ann Arbor 8 7-3-02 Phillips
television 519 Honeyman
Detroit 3 9-5-02 Remley
radio 1986 Antonelli Chicago 7 9-18-02 Metz
CD player 1817 Raviman Bombay 8 1-3-03 Basile
vacuum cleaner
1865 Antonelli Chicago 7 4-18-03 Bloch
vacuum cleaner
1885 Blower Detroit 8 5-13-03 Bloch
refrigerator 1943 Bachmann
Austin 6 6-19-03 Phillips
television 2315 Bachmann
Austin 6 7-15-03 Remley
A “flat file” database
Some Problems If we ask “Which customers ordered vacuum
cleaners last month?” we have to search the entire table.
If Bachmann’s address changes, there are lots of updates to be done.
If we delete the order of Huang (who buys expensive things), we lose her address and credit rating.
Can we do better? Yes! Let’s break up the table into non-redundant
chunks, without losing information. This is “normalization”.
First Normal Form: No Repeating Groups
paper_num
paper_title
auth_id auth_name editor dept_no dept_name
03-01-02 Buy Low 001 Black Woolf 15 Design
03-01-02 Buy Low 002 Scholes Woolf 15 Design
04-21-02 Sell High 002 Scholes Koenig 27 Analysis
04-21-02 Sell High 001 Black koenig 27 Analysis
paper_num
paper_title
auth1_id
auth1_name
auth2_id
auth2_name
03-01-02 Buy Low 001 Black 002 Scholes
04-21-02 Sell High 002 Scholes 001 Black
editor dept_no dept_name
Woolf 15 Design
Keonig 27 Analysis
Non-First Normal Form
Equivalent First Normal Form
This is a different,simpler example.
Primary Keys A superkey is a set of one or
more attributes that, collectively, uniquely identifies each row.
A candidate key is a minimal (w.r.t. inclusion) superkey.
A primary key is selected arbitrarily from the collection of candidate keys; it is used as an index for the table.
Primary Keys
In this example, the pair {paper_num, auth_id} is a candidate key.
Is it the only one, i.e., must it be the primary key? (What assumptions?)
paper_num
paper_title
auth_id auth_name
editor dept_no dept_name
03-01-02 Buy Low 001 Black Woolf 15 Design
03-01-02 Buy Low 002 Scholes Woolf 15 Design
04-21-02 Sell High 002 Scholes Koenig 27 Analysis
04-21-02 Sell High 001 Black Koenig 27 Analysis
Second Normal Form If attribute A uniquely determines the
value of attribute B, then B is functionally dependent on A; we write AB
A table is in second normal form (2NF) if it is in first normal form and every non-key attribute is fully dependent on the primary key (transitive dependencies allowed).
We “decompose” a table into several new tables in order to achieve 2NF.
Conversion to 2NF
paper_num
paper_title
editor dept_no dept_name
03-01-02 Buy Low woolf 15 Design
04-21-02 Sell High koenig 27 Analysisauth_id auth_nam
e
001 Black
002 Scholes
paper_num
auth_id
03-01-02 001
03-01-02 002
04-21-02 001
04-21-02 002
paper_num
paper_title
auth_id auth_name
editor dept_no dept_name
03-01-02 Buy Low 001 Black woolf 15 Design
03-01-02 Buy Low 002 Scholes woolf 15 Design
04-21-02 Sell High 002 Scholes koenig 27 Analysis
04-21-02 Sell High 001 Black koenig 27 Analysis
Third Normal Form
A table is in third normal form (3NF) if it is in second normal form and there are no transitive dependencies.
In our example, we have
So decompose further
report_no editor, dept_nodept_no dept_name
Example in 3NF
paper_num paper_title
editor dept_no
03-01-02 Buy Low woolf 15
04-21-02 Sell High koenig
27
auth_id auth_name
001 Black
002 Scholes
paper_num
auth_id
03-01-02 001
03-01-02 002
04-21-02 001
04-21-02 002dept_no dept_nam
e
15 Design
27 Analysis There is absolutely no loss ofinformation here. It’s just abit more work to answer somekinds of questions.
Why 3NF? It avoids update anomalies
No duplication of report number, editor and department number, so we can’t forget to update every instance
It avoids insert anomalies In the 1NF table, a new department can’t be
added unless it has originated a paper It avoids delete anomalies
In the 1NF table, if we delete a row, information about, e.g., department number and department name may be lost.
“Steve, this is fun. We want more normal forms!”
Boyce-Codd normal form Fourth normal form Fifth normal form Even higher! Nobody ever really uses them, it
seems Examples violating them are
contrived.
Generalization
Just like the O-O idea of inheritance.
A manager is an employee, as are engineers, technicians and secretaries.
employee
manager engineer technician secretary
d
“d” means “distinct.There is also “o” foroverlapping
Generalization
individual
employee customer
o
create table individual (indiv_id char(10), indiv_name char(20), indiv_addr char(20), primary key(indiv_id));create table employee (emp_id char(10), job_title char(15), primary key (emp_id), foreign key (emp_id) references individual on delete cascade on update cascade);create table customer (cust_no char(10), cust_credit char(12), primary key (cust_no), foreign key (cust_no) references individual on delete cascade on update cascade);
“overlapping”