databasesdatabases intd 388 – spring 2008 by a.guercio and e. walker

29
Databases Databases INTD 388 – Spring 2008 INTD 388 – Spring 2008 by A.Guercio and E. Walker

Upload: lorena-pitts

Post on 14-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

DatabasesDatabasesDatabasesDatabases

INTD 388 – Spring 2008INTD 388 – Spring 2008

by A.Guercio and E. Walker

Definitions • A database is a collection of related data

• A database can be of any size and of varying complexity

• A database management system (DBMS) is a collection of programs that enables users to define, create and maintain a database

DatabaseNomenclature

Data Mart: A searchable databasesystem organized according to the user's needs. It has data that focus onspecific to a particular research project

Data Repository: A database used as an information storage facility, with minimal analysis or querying functionality.Used to mirror existing databasefor concurrent access to data.

Data Warehouse: A central database,frequently large, that can provide authenticated researchers with access to all of an institution's information. It is populated of data from a variety of non-compatible sources.

Data Management

Data Life Cycle

Data Dictionary: Defines data format standards.Ex. Date of Birth: MM/DD/YYis tranformed to the standardData of Birth: DD/MM/YY

Creation & Acquisition:collection and insertionof data in the system

Use: Many, including simulation,statistical analysis, visualization,communication. Of course private data(i.e. not releasable) should be protected at this stage.

Archiving: Making data available for future use.Issues: Space Requirements, Completeness of Data, Longevity, Security, Archival Process, …

Repurposing: When data are archived, available data can be used for different purposes.Ex. Linear sequences captured to discover new genes are commonly repurposed to support 3D visualization of protein structures

Data Life Cycle

Centralized: Concentrates all organizational activity in one location.Disadvantages: When the central system crashes all the data are not available. Distributed: Supports a

variety of hardware and software and uses physically disparate storage media, each computationally independentDisadvantages: We need to develop interfaces between databases (n DB require 2n-1 interfaces)

Database Architecture

What is a Database System?

• Computerized System for the management of records.– Example: Oracle, Sybase, SQL Server, DB2…

• The user of the system can perform various operations:

•add a new file to the database;•insert new data in an existing file;•search for data;•update data;•remove data; remove file.

Advantages of Database Systems

• Redundancy Control• Data Consistency and Integrity• Data Sharing and Integration • Security• Improved Maintenance• Concurrency (without data loss)• Backup & Recovery services

Disadvantages…• Complexity• Size• Cost (Hardware & Software)• Performance • Conversion• Risk of failure

ExampleBin Wine Producer Year Bottles Ready

2 Chardonnay Buena Vista 88 1 91

3 Chardonnay Louis Martini 89 5 90

21 Fume Blanc Robt. Mondavi

88 4 91

22 Fume Blanc Mirassou 87 2 90

72 Gamay Robt. Mondavi

85 2 90

SearchSearchSearchSearch

SELECT WINE, BIN, PRODUCERSELECT WINE, BIN, PRODUCER

FROM CELLAR WHEREFROM CELLAR WHERE

READY=91;READY=91;

Result

Wine Bin Producer

Chardonnay 2 Buena Vista

Fume Blanc 21 Robt. Mondavi

InsertionInsertionInsertionInsertion

INSERT INTO CELLAR INSERT INTO CELLAR

VALUES (S3, ‘PINOT NOIR’,VALUES (S3, ‘PINOT NOIR’,

‘‘SAINTSBURY’, 87, 1, 93);SAINTSBURY’, 87, 1, 93);

UpdateUpdateUpdateUpdate

UPDATE CELLAR UPDATE CELLAR

SET BOTTLES=4SET BOTTLES=4

WHERE BIN=3;WHERE BIN=3;

DeletionDeletionDeletionDeletion

DELETEDELETE

FROM CELLARFROM CELLAR

WHERE BIN=2;WHERE BIN=2;

Data Models

The Relational Data Model

• The relational model was introduced by E.F. Codd in 1970

• The model is based on the mathematical structure known as the relation – Informally, a relation resembles a table– More precisely, a relation is a set of n-tuples r = {t1, t2, …, tm} (i.e. a set of rows) and there is no order in the rows or in the tuples.Therefore in order to distinguish one row from another you need a unique key

•Ex. of a customer record:S.S. First_Name Last_NameAddress

– S.S. is the unique key that distinguishes each customer record

Parts of a Relational Database

Schema

Table(relation)

name address age

Column*(attribute)

Row (tuple)

* Values of an attribute are chosen from its domain,e.g. the domain of “age” is integers > 0

Relational Model Constraints

• No two tuples are equal, thus some set of attributes must be unique, they are the key– If there are multiple keys in a schema, they are called candidate keys

– One of the candidate keys may be designated as a primary key and be used to identify tuples in the relation

– A foreign key is an attribute in a relation which is the primary key of another relation Ex. Customer(S.S., First_Name, Last_Name, Address)Checking Account(Account_Number, S.S., Balance)

Relational Model Constraints

• The entity integrity constraint states that no primary key value can be null– We don’t store descriptions of entities we can’t identify• Ex. We do not allow to insert a customer without inserting the S.S.

• The referential integrity constraint is used to maintain consistency among the tuples of two relations

• Ex. We do not insert the checking account of a customer for whom there is no data in the Customer table

• The integrity constraints may be displayed diagrammatically by drawing an arrow from the relation with the foreign key to the relation with the primary key

What is good in relational DB?

• Advantage: We have data organized efficiently and in a way that saves space.– Ex. We do need to repeat all the information about the customer who has 500 accounts in our bank!

– Flat files with their independent tables would not allow that.

Flat File (One Table)

Name Class Location Professor

Minnie Mouse

English 101

BC H. Higgins

Minnie Mouse

Psych 240

Gerst. 120 Lucy van Pelt

Minnie Mouse

Accounting 110

Hins. 209 Scrooge McD

Goofy Psych 240

Gerst. 120 Lucy van Pelt

Goofy English 101

BC H. Higgins

Relational Database Has Multiple Tables

Class Location

Prof

English 101

BC H. Higgins

Psych 240

Gerst. 120

Lucy van Pelt

Accounting 110

Hins. 209

Scrooge McD

Name Class

Minnie Mouse

English 101

Minnie Mouse

Psych 240

Minnie Mouse

Accounting 110

Goofy Psych 240

Goofy English 101

Schedule Enrollment

SQL A Relational DB Language

• SQL (Structured Query Language) was developed at IBM as part of an experimental relational system called System R

• Along with QUEL (based on the relational calculus), SQL is one of the earliest DB languages

• SQL includes statements to define, manipulate and interrogate the DB

• Ex: create table Student ( Name char[20], GradYear integer, )

Basic SQL Select statement

• SELECT attribute• FROM table(s)• WHERE conditions

• Examples– Select Class from Enrollment where Name = ‘Minnie Mouse’

– Select Class from Schedule where Location = ‘BC’

More Complex Queries

• Multiple Tables joined together• More conditions in where, joined by and, or, not

– Select Name, Class, Locationfrom Schedule, Enrollmentwhere Name = Goofy and

Schedule.Class = Enrollment.Class

– Select Prof, Namefrom Schedule, Enrollmentwhere Prof = ‘Scrooge McD’ and

Schedule.Class = Enrollment.Class

More WHERE conditions

SELECT staffNO, fName, lName, salaryFROM StaffWHERE…

…salary BETWEEN 20,000 and 30,000…position IN {‘Manager’, ‘Supervisor’}…fName LIKE {‘Al%n’} Allen or Alan or Allison …

…lName LIKE {‘_ _ _ _’} all 4-letter names…mName IS NOT NULL has a middle name

Sorting the Results• List all rooms, sorted by price (most expensive first)SELECT hotelName, roomNo, roomType, priceFROM Hotel, RoomWHERE Hotel.hotelNo = Room.hotelNoORDER BY price DESC;

• Multiple sort keys can be specified, e.g.…ORDER BY lName, fName ASC;

Databases in Bioinformatics

• NCBI– GenBank (sequences)– PubMed (Literature)– Human Genome– Etc!