introduction zachary g. ives university of pennsylvania cis 550 – database & information...
Post on 19-Dec-2015
213 views
TRANSCRIPT
Introduction
Zachary G. IvesUniversity of Pennsylvania
CIS 550 – Database & Information Systems
September 9, 2004
Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
2
Welcome to CIS 550, Database and Information Systems!
Instructor: Zachary Ives, zives@cis 576 Levine Hall North Office hours: Tuesday, 3:00-4:00PM (after class)
TA: T.J. Green, tjgreen@cis Office hours: Thursday, 3:00-4:00PM
Newsgroup: upenn.cis.cis550Home page: www.seas.upenn.edu/~zives/cis550/
Texts and readings: Ramakrishnan & Gerke, Database Systems, 3rd ed. Supplementary papers (to be handed out in class) Other books may be useful, esp. Brundage’s Using
XQuery
3
Course Format and Grading
Roughly one major topic area per week to two weeks Readings in the text & research papers Occasionally, summaries/commentary on papers (5%) Homework assignment for each topic area (30%)
One midterm (10%), one final exam (20%) Project (30%) – groups of 3-4:
Build a “GMail”/Hotmail clone on top of a database, or Build a P2P system for synchronizing tables (Or propose your own idea)
General participation, discussion, intangibles (5%)
4
Why This Course?
Most CS courses concentrate on code – our interest is managing and representing data
Warning: this course doesn’t focus on teaching SQL or how to be an Oracle DBA (though it will get you started)
… So what in the world are we studying for 14 weeks???
7
Example: An Encyclopedia Entry(www.wikipedia.com)
A database is an information set with a regular structure. Its front-end allows data access, searching and sorting routines. Its back-end affords data inputting and updating. A database is usually but not necessarily stored in some machine-readable format accessed by a computer. There are a wide variety of databases, from simple tables stored in a single file to very large databases with many millions of records, stored in rooms full of disk drives or other peripheral electronic storage devices.
Databases resembling modern versions were first developed in the 1960s. A pioneer in the field was Charles Bachman.
The most useful way of classifying databases is by the programming model associated with the database. Several models have been in wide use for some time. Historically, the hierarchical model was implemented first, then the network model, then the relational model overcame with the so-called flat model accompanying it for low-end usage…
8
Example: To-Do List
Buy school supplies due 9/7 Go to orientation on 9/7 Exercise every M/W/F Buy Philly postcards
How does this differ from the plain text model? What might you do with it that you couldn’t?
9
Example: Your PDA/Cell Phone
Event Day When Who WhereLunch 10/24 1pm Zack Cavanaugh’sAdvice10/25 9am Dr. Smith 599 LevineBiking 10/26 9am Jane PottruckDinner 10/26 6PM Jane Food Court
Calendar
Who Phone Email OfficeZack 6-2789 zives 576 Levine NDr. Smith 6-1234 drsmith 599 LevineJane 543-2198 jane 2220 Walnut St.
Contacts
10
What If We Want to Include Contact Info on Our Calendar?
Do we also want to keep e-mail addresses, telephone numbers etc.?
Should we expand the number of columns in our table:
Event When Who-name Who-email Who-tel …. Where
Lunch 1pm Zack zives 6-2789…. Cav……
What is the trade-off in terms of entering data?
11
“Link” Calendar with Contacts?
Why can’t we “link” calendar entries with contact info, and show the results of the two? The link could be based on something as simple as
the person's name (What’s the danger here? What else might work
better?)
This brings up an issue – how to “follow links” If we were to do this in Java, how might it be done?
12
Another Kind of Link: Classes and Subclasses
Person has attributes: ssn PennID set of user IDs given name family name …
Student IS A person who: takes courses is given grades is taught listens to lectures in class, OR over the Web, OR on videotape
This is yet another kind of information How have you previously seen such relationships encoded?
13
Data Representation and Modeling
All of the data we’ve seen have an implicit data model
The data model includes some basic assumptions about what’s an “item” of data, how to interpret it, and so on
The relational data model was the first model for data that is independent of its data structures and implementation A theory of normalization guides you in designing
relations Concepts from the relational data model have been
adapted to form object-oriented data models (with classes and subclasses), XML models, etc.
There are “sibling” fields to databases that consider: natural language models (how to understand words) document models (how to match words and documents) ontologies (how to define relationships between classes)
14
The DBMS Provides an Interfaceover the Database
A database (DB) is a large, integrated collection of data Generally is cohesive in “some” way
A DB models a real-world organization or unit A database management system (DBMS) is a
software package designed to store and manage databases Reliable storage & recovery of 100s of GB Querying/updating interface and API (for applications and
Web pages) Support for many concurrent users
Why do we need a DBMS, instead of coding in Java?
15
DBMS Benefit #1: Generality and Declarativity
Don’t require the programmer or user to know details like indices, sort orders, machine speeds, disk speeds, concurrent users, etc.
Instead, the programmer/user programs with a logical model in mind
The DBMS “makes it happen” based on an understanding of relative costs of different methods
16
Benefit #2: Efficiency and Scale
Size of personal address book is probably less than 100 entries, but there are things we'd like to do quickly and efficiently: “Give me all appointments on 10/28” “When am I next meeting Jim?”
“Program” these as quickly as possible (and make them resilient to data format changes)
Scale to a corporate calendar with hundreds of thousands of entries
17
Benefit #3: Management of Concurrency and Reliability
Suppose other people are allowed access to your calendar and are allowed to modify it? How do we stop two people changing the file at the same time and leaving it in a physical (or logical) mess?
Suppose the system crashes while we are changing the calendar. How do we recover our work?
This requires a basic concept…
18
Transactions
Key concept for concurrency is that of a transaction : an atomic sequence of database actions (read/write) on data items (e.g. calendar entry).
Key concept for recoverability is that of a log: keeping track of all actions carried out by the db.
19
The Layers of the DBMS
API/GUI
Optimizer
Storage Mgr
Exec. Engine
Storage
Catalog
Query
Physical plan
Pages
RequestsData
Pages
Stats
Schemas
(Simplification!)
Buffer Mgr
Index/file/rec MgrData/etc Requests
RequestsData/etc
Logging, recovery
Red = logicalBlue = physical
20
The Database Abstraction Provided by the DBMS
We think of databases at two levels: Logical structure:
What users/programmers see – program or query interface
Physical structure: Organization on disk, indices, etc.
The logical level is further split into: Overall database design (conceptual; seen by
the DB designer) Views that various users get to see
21
The Three-level Architecture forDatabases
View 1 View 2 … View N
Physical Level(file organization, indexing)
Schema Logical,Conceptual Level
22
Data Independence
A user of a relational database system should be able to use the database without knowing about how the precisely how data is stored, e.g.
After all, you don't worry IEEE floating-point when you do division in a Java program or with a calculator
SELECT When, WhereFROM CalendarWHERE Who = “Jane"
23
More on Data Independence
Logical data independence Protects the user from changes in the logical structure of the data:
could reorganize the calendar “schema” without changing how we query it
Physical data independenceProtects the user from changes in the physical structure of data:
could add an index on who (or sort by when) without changing how the user would write the query, but the query would execute faster (query optimization)
24
Presentation Layer (4th Tier): Data-Driven Web Sites
“Data driven web sites” also add an HTML “presentation” layer on top of what we’ve seen
Or they use XML plus “style sheets” to get the same effect
view
HTML
Processing
25
An Issue: 80% of the World’s Data is Not in a DB!
Examples: scientific data
(large images, complex programs that analyze the data) personal data WWW and email
(some of it is stored in something resembling a DBMS)
Data management is expanding to tackle these problems
Flexibility – data management imposes many constraints to make problems solvable
Must deal with entities outside our control
In this course, we’ll start by focusing on databases, but eventually look “outside the box” at the Web and at gluing together data from many places
26
Combining Databases with Mediators(a kind of middleware)
A layer above the three-tiered architecture, to combine multiple databases/sources on the Web Some of these are databases over which we have no control Some must be accessed in special ways We generally need to think about how to translate between
different database formats
XML
“Mediated Schema”
27
How Does One Build a Database?
Start with a conceptual model Design & implement schema Write applications using DBMS and other tools
Many ways of doing this where the hard problems are taken care of by other people (DBMS, API writers, library authors, web server, etc.)
Common applications include PHP/JSP/servlet-driven web sites
The DBMS takes care of query optimization and execution
29
Designing a Schema (Set of Relations)
Convert to tables +constraints
Then need to do “physical” design: the layout on disk, indices, etc.
sid name
1 Jill
2 Bo
3 Maya
fid name
1 Ives
2 Saul
8 Roth
sid cid
1 550-0103
1 700-1003
3 500-0103
cid name sem
550-0103 DB F03
700-1003 AI S03
501-0103 Arch F03
fid cid
1 550-0103
2 700-1003
8 501-0103
STUDENT Takes COURSE
PROFESSOR Teaches
30
Applications Use Queries in SQL
<html><body> <!-- hypotheticalEmbeddedSQL: SELECT *
FROM STUDENT, Takes, COURSE WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid --></body></html>
Structured Query Language Based on restricted first-order logic expressions over relations Not procedural – defines constraints on the output
Converted into a query plan that exploits properties; run over the data by the query optimizer and query execution engine
31
Processing the Query
SELECT * FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid
STUDENT
Takes COURSE
Merge
Hash
by cid by cidOptimizer
ExecutionEngine
StorageSubsystem
Web Server / UI / etc
32
DBMSs in the Real World
A huge industry for 20% of the world’s data! Big, mature relational databases
IBM, Oracle, Microsoft “Middleware” above these
SAP, PeopleSoft, dozens of special-purpose apps “Application servers” Integration and warehousing systems Current trends:
Web services; XML everywhere Smarter, self-tuning systems
33
So What about Database Research?
Not focusing on the problems of Oracle… Understanding what’s possible to do with XML
Better query processing Better languages for meta-info (e.g., constraints)
Data streams Peer-to-peer architectures Integrating data from different formats Lots of theory and systems-building
You’ll see familiar concepts in this course from operating systems and from complexity theory/logic
… And from programming languages, AI planning, …
34
In this Course...
Study relational databases, their design, how to query, what forms of indices to use.
Beyond relational algebra: a logical model of data (Datalog), recursion
XML and semi-structured data models Understanding DB internals
How DBs are built Performance implications
Integrating and mediating between databases (a huge problem today)