1 csci485 – file & database management systems bahram zartoshty office: sal 346 phone: tba...
TRANSCRIPT
1
CSCI485 – File & Database Management Systems
Bahram ZartoshtyOffice: SAL 346
Phone: TBAOffice Hours: TTH 1:15-2:50pm
Note: Parts of this lecture were developed by Professor Ghandeharizadeh
Logistics
Required text book: Database System Concepts, Silberschatz,
Korth & Sudarshan, Fifth edition.
Pre-req for the course: CS201: Data Structures Knowledge of an object-oriented
programming language such as C++, Java, C#
3
Teaching Assistant
Shahin ShayandeOffice: (Microsoft Lab)SAL 200C
Office Hours: TBA
4
Grading
Midterm 1: 35% Midterm 2: 35% Project & Assignments: 30%
5
What to do immediately?
Register with the web sitehttp://dblab.usc.edu/csci485
6
Database Management System (DBMS)
Database: An integrated collection of data, usually stored on secondary storage, typically describing the activities of one or more related organizations.
Database management system (DBMS): A collection of software/programs designed to assist in maintaining and utilizing large collections of data.
DBMS contains information about a particular enterprise
Used almost on a daily basis for either individual or business use.
Relational database vendors were one of the fastest growing sectors during the .COM boom!
7
BEFORE DBMSIn the early days, database applications were built on top of file systems
Data
DataUser
1
User 2
Application programs
Application programs
8
Data managed by DBMS
AFTER DBMS
User 1
User 2
DBMS
Application programs
Application programs
9
WHY A DBMS?1. Reduced application development time2. Data independence: Application programs not dependent
on data representation and storage details3. Data sharing: data is better utilized (discovered and
reused), redundancy of data is minimized4. Data integrity and consistency: one may enforce
consistency constraints on data, e.g., number of seats sold ≤ number of seats on the plane × 1.1
5. Centralized control: DBA tunes the database to balance user's needs
6. Security: mechanisms to prevent unauthorized access. These mechanisms are based on content instead of file-oriented approach.
7. Concurrency control: avoids undesirable race conditions that arise with simultaneous access/updates to data
8. Crash recovery: ensures the integrity of data in the presence of failures
10
DATABASE MANAGEMENT SYSTEMS ARCHITECTURE
User 1
User n
Conceptual schema
Conceptual schema
Physical data
DBDBMS
11
Data Models
A collection of tools for describing Data Data relationships Data semantics Data constraints
Relational model Entity-Relationship data model (mainly for database
design) Object-based data models (Object-oriented and Object-
relational) Semistructured data model (XML) Other older models:
Network model Hierarchical model
ChallengesConceptual
Logical
Physical
Abstraction, Inheritance, Encapsulation
Reduction to tables with minimal: data duplication, potential for data loss and update anomalies
Effective use of a DBMS, management of mismatch between tables and OO constructs, Index structures, CC & Crash recovery, Optimization techniques
Conceptual Data Models
Entity-Relationship (ER) data model Entities, Attributes, Relationships
Emp
SS#
name
address
Conceptual Data Models
Entity-Relationship (ER) data model Entities, Attributes, Relationships
Enrolledin
Emp
SS#
name
address
Health Plan name
Co-Pay
Conceptual Data Models
Entity-Relationship (ER) data model Entities, Attributes, Relationships Recursive relationships
Marriedto
Emp
SS#
name
address
Conceptual Data Models
Entity-Relationship (ER) data model Entities, Attributes, Relationships Recursive relationships
Worksfor
Emp
SS#
name
address
Conceptual Data Models
Entity-Relationship (ER) data model Entities, Attributes, Relationships Recursive relationships
Worksfor
Emp
SS#
name
address
date
Conceptual Data Models
Entity-Relationship (ER) data model Entities, Attributes, Relationships Recursive relationships Inheritance
sidstudent
name
ISA
graduateUndergrad
SpecializationGeneralization
Conceptual Data Models
Abstraction, Inheritance, Encapsulation
Exercise these concepts using in-class examples and homework assignments A library database contains a listing of authors who have written
books on various subjects (one author per book). It also contains information about libraries that carry books on various subjects.
Conceptual Data Models
Abstraction, Inheritance, Encapsulation
Exercise these concepts using in-class examples and homework assignments A library database contains a listing of authors who have written
books on various subjects (one author per book). It also contains information about libraries that carry books on various subjects.
Entity sets: authors, subjects, books, libraries Relationship sets: wrote, carry, indexed
Conceptual Data Models
Abstraction, Inheritance, Encapsulation
Exercise these concepts using in-class examples and homework assignments A library database contains a listing of authors who have written
books on various subjects (one author per book). It also contains information about libraries that carry books on various subjects.
carry
books indexwrote subjectauthorsSS#
name
title isbn
Subjectmatter
librariesaddress
Data Models
Logical
Physical
WorksforEmp
SS#nameaddress
Relational Data Model
Prevalent in today’s market place. Why? Performance!
Everything is a table!
Logical data design is the process of reducing an ER diagram to a collection of tables.
Logical Data Design
Trivial reduction: An entity set = a table A relationship set = a table
Pitfalls: Duplication of data Unintentional loss of data Data ambiguity that impacts software
design, resulting in update anomalies
Data Duplication
Worksfor
Emp
SS#
name
address
396 Shahram Seattle
400 Asoke Chicago
200 Joe New York
396 400
200 400
120 400
SS# Name Address SS# MGRSS#
Data Duplication
The SS# column is duplicated!
Worksfor
Emp
SS#
name
address
396 Shahram Seattle
400 Asoke Chicago
200 Joe New York
396 400
200 400
120 400
SS# Name Address SS# MGRSS#
Data Duplication: Solution Merge the two tables into one:
396 Shahram Seattle 400
400 Asoke Chicago NULL
200 Joe New York 400
SS# Name Address MGRSS#
Worksfor
Emp
SS#
name
address
Data Loss
Ford maintains warehouses containing different automobile parts
Records are inserted and deleted based on availability of a part at a warehouse
123 Piston Tijuana
203 Cylinder Michigan
877 Bumper Michigan
389 Seats Arizona
Part# Description Location
Data Loss (Cont…)
When a warehouse becomes empty, it is lost from the database:
Solution: utilize two different tables
123 Piston Tijuana
389 Seats Arizona
Part# Description Location
123 Piston 12
389 Seats 45
Part# Description WHID
12 Tijuana
45 Arizona
WHID Location
Data Ambiguity
Represent faculty of a department as:
A change of address for a faculty might be for the entire department. This cannot be differentiated with this table design!
Ghandeharizadeh Comp Sci SAL
Zartoshty Comp Sci SAL
Bohem Comp Sci SAL
Faculty Department Location
Data Ambiguity
Utilize two tables:
Ghandeharizadeh Comp Sci
Zartoshty Comp Sci
Jenkins Bio Medical
Bohem Comp Sci
Faculty Department
Comp Sci SAL
Sex Ed BOVARD
Bio Medical HEDCO
Department Location
Data Ambiguity (Cont…)
Employees of a bi-lingual company having different skills.
Update anomalies!
Asoke Teach Hindi
Asoke Cook French
Asoke Null German
Asoke Program English
Employee Skill Language
Data Ambiguity: Solution
Utilize two tables:
Asoke Teach
Asoke Cook
Asoke Program
Employee SkillAsoke Hindi
Asoke French
Asoke German
Asoke English
Employee Language
Logical Data Design
A quest to flatten objects with minimal data duplication, loss of data, and update anomalies!
William Kent, “A Simple Guide to Five Normal Forms in Relational Database Theory”, Communications of the ACM 26(2), Feb 1983, 120-125.
Data Models
Physical
WorksforEmp
SS#nameaddress
Logical Data Design
396 Shahram Seattle 400
400 Asoke Chicago Null
SS# Name Address MGR SS#
Physical Implementation
Reconstruct main memory objects for manipulation and presentation: Specify class definitions
Typically correspond to entity-sets Populate an instance of a class by issuing
SQL queries to a DBMS Update instances in memory Flush dirty instances back to DBMS
Potential use of transactions
Type Mismatch
A column of a row must be a primitive such as an integer, real, etc. It may NOT be an array of integers or
object pointers A property (attribute) of a class might
be of a multi-valued type, e.g., an array, a vector, etc.
Changes in software may impact the design of tables. (Management of type mismatch by the system designer.)
Implementation
Set operators in the DBMS Does set A contain set B? Does value v1 appear in set A?
Aggregates in the DBMS Compute average employee salary Count the number of employees Find the oldest employee
ChallengesConceptual
Logical
Physical
Abstraction, Inheritance, Encapsulation
Reduction to tables with minimal: data duplication, potential for data loss and update anomalies
Effective use of a DBMS, management of mismatch between tables and OO constructs, Index structures, CC & Crash recovery, Optimization techniques
40
Entity-Relationship Model
Example of schema in the entity-relationship model
41
Entity Relationship Model (Cont.)
E-R model of real world Entities (objects)
E.g. customers, accounts, bank branch Relationships between entities
E.g. Account A-101 is held by customer Johnson
Relationship set depositor associates customers with accounts
Widely used for database design Database design in E-R model usually converted
to design in the relational model (coming up next) which is used for storage and processing
42
Relational Model
Example of tabular data in the relational model
Attributes
43
A Sample Relational Database
44
DATA INDEPENDENCE1. Physical data independence: modify the
physical scheme (data structures, e.g., B-tree or hash index) without causing application programs to be rewritten. These modifications are necessary to enhance performance and new software releases. Most relational vendors support this kind of data independence.
2. Logical data independence: Modify the conceptual scheme (e.g., add a new attribute to a table, rename an attribute) without causing application programs to be rewritten. This kind of data independence is harder to achieve.
45
DATABASE LANGUAGES
There are several languages associated with a database:1. Data Definition Language (DDL): The database scheme is
specified by a set of definitions that are expressed by a special language named DDL. The result of compiling DDL statements is a set of tables stored in a file called data dictionary. This file contains meta-data (data about the data stored in the database).
2. Data Manipulation Language (DML): a language that enables users to access or manipulate data (retrieve, insert, replace, delete) as organized by a certain data model. We will look at a commercial DML named SQL. In general, there are two types of DML:
• Procedural: Describes what data is needed and how to get it: e.g., relational algebra
• Non-procedural: Describes what data is needed without specifying how to get it: e.g., tuple relational calculus
46
SQL SQL: widely used non-procedural language
E.g. find the name of the customer with customer-id 192-83-7465select customer.customer-namefrom customerwhere customer.customer-id = ‘192-83-7465’
E.g. find the balances of all accounts held by the customer with customer-id 192-83-7465
select account.balancefrom depositor, accountwhere depositor.customer-id = ‘192-83-7465’ and depositor.account-number = account.account-
number
Application programs generally access databases through one of Language extensions to allow embedded SQL Application program interface (e.g. ODBC/JDBC) which allow SQL
queries to be sent to a database
47
Storage Management
Storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.
The storage manager is responsible to the following tasks: Interaction with the file manager Efficient storing, retrieving and updating of data
Issues: Storage access File organization Indexing and hashing
48
Query Processing
1. Parsing and translation2. Optimization3. Evaluation
49
Query Processing (Cont.) Alternative ways of evaluating a given query
Equivalent expressions Different algorithms for each operation
Cost difference between a good and a bad way of evaluating a query can be enormous
Need to estimate the cost of operations Depends critically on statistical information
about relations which the database must maintain
Need to estimate statistics for intermediate results to compute cost of complex expressions
50
Transaction Management A transaction is a collection of operations that
performs a single logical function in a database application
Transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures.
Concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database.
51
SYSTEM USERS
There are several kind of users associated with a system:
Database administrator: defines schemas, storage structures and access method definitions, physical organization, authorization, integrity constraints.
Application programmers: they write a program and make it available to the end-users
Sophisticated users: they use a query language (SQL) to access the database interactively
Naive (end) users: they invoke the application programs
52
Overall System Structure