Post on 16-Jan-2016


Modern Databases

Willem Visser RW334

The Web is Changing the Game

• Databases used to be the domain of corporations with limited amounts of data and limited numbers of users
  – Very valuable information, but not a lot of it
  – Important users, but not many of them

• In the modern web-driven world
  – Enormous amounts of data are being generated
  – Millions of users are interested in that data

What is Wrong here?

How to make the DB scale?

Partition and Distribute the Data

Distributed Database

• A single logical database spread physically across computers in multiple locations that are connected by a data communications link

Major Objectives

• Location Transparency
  – User does not have to know the location of the data
  – Data requests automatically forwarded to appropriate sites

• Local Autonomy
  – Local site can operate with its database when network connections fail
  – Each site controls its own data, security, logging, recovery

Distributed Databases Advantages

• Increased reliability/availability
• Local control over data
• Modular growth
• Lower communication costs
• Faster response for certain queries

Distributed Database Disadvantages

• Software cost and complexity
• Processing overhead
• Data integrity exposure
• Slower response for certain queries

Options for Distributing a Database

• Data replication
  – Copies of data distributed to different sites

• Horizontal partitioning/Sharding
  – Different rows of a table distributed to different sites

• Vertical partitioning
  – Different columns of a table distributed to different sites

• Combinations of the above
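
The three options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CUSTOMERS` table, the site names, and the modulo placement rule are hypothetical and do not come from the slides.

```python
# Hypothetical sample table; the column names are made up for illustration.
CUSTOMERS = [
    {"id": 1, "name": "Ann", "city": "Cape Town", "balance": 120.0},
    {"id": 2, "name": "Ben", "city": "Durban", "balance": 75.5},
    {"id": 3, "name": "Cat", "city": "Cape Town", "balance": 10.0},
    {"id": 4, "name": "Dan", "city": "Pretoria", "balance": 0.0},
]


def replicate(rows, sites):
    """Data replication: every site gets a full copy of the table."""
    return {site: [dict(r) for r in rows] for site in sites}


def shard(rows, sites, key="id"):
    """Horizontal partitioning: each row is stored at exactly one site."""
    parts = {site: [] for site in sites}
    for r in rows:
        site = sites[r[key] % len(sites)]  # assumed placement rule: key modulo site count
        parts[site].append(r)
    return parts


def split_columns(rows, layout, key="id"):
    """Vertical partitioning: each site stores some columns plus the key."""
    return {
        site: [{c: r[c] for c in [key, *cols]} for r in rows]
        for site, cols in layout.items()
    }


sites = ["site_a", "site_b"]
copies = replicate(CUSTOMERS, sites)
shards = shard(CUSTOMERS, sites)
columns = split_columns(
    CUSTOMERS, {"site_a": ["name", "city"], "site_b": ["balance"]}
)
```

Keeping the key in every vertical fragment is what allows the original rows to be reassembled by joining the fragments.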

Data Replication

• Advantages:
  – Reliability
  – Fast response
  – May avoid complicated distributed transaction integrity routines (if replicated data is refreshed at scheduled intervals)
  – Decouples nodes (transactions proceed even if some nodes are down)
  – Reduced network traffic at prime time (if updates can be delayed)

Data Replication (cont.)

• Disadvantages:
  – Additional requirements for storage space
  – Additional time for update operations
  – Complexity and cost of updating
  – Integrity exposure of getting incorrect data if replicated data is not updated simultaneously

Therefore, better when used for non-volatile (read-only) data

Factors in Choice of Distributed Strategy

• Funding, autonomy, security
• Site data referencing patterns
• Growth and expansion needs
• Technological capabilities
• Costs of managing complex technologies
• Need for reliable service

Distributed DBMS

• Distributed database requires distributed DBMS
• Functions of a distributed DBMS:
  – Locate data with a distributed data dictionary
  – Determine location from which to retrieve data and process query components
  – DBMS translation between nodes with different local DBMSs (using middleware)
  – Data management functions: security, concurrency, deadlock control, query optimization, failure recovery
  – Data consistency (via multiphase commit protocols)
  – Global primary key control
  – Scalability
  – Data and stored procedure replication
  – Allowing for different DBMSs and application code at different nodes

Distributed DBMS Transparency Objectives

• Location Transparency
  – User/application does not need to know where data resides

• Replication Transparency
  – User/application does not need to know about duplication

• Failure Transparency
  – Either all or none of the actions of a transaction are committed
  – Each site has a transaction manager
    • Logs transactions and before and after images
    • Concurrency control scheme to ensure data integrity
  – Requires special commit protocol
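
The slide does not name the commit protocol. Two-phase commit is one well-known choice, and the sketch below assumes it. The `Participant` class, its `can_commit` flag, and the in-memory log are hypothetical stand-ins for a site's transaction manager, not a real DBMS interface.

```python
class Participant:
    """One site's transaction manager (hypothetical, in-memory)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a site that must refuse
        self.data = {}
        self.pending = None
        self.log = []

    def prepare(self, updates):
        # Phase 1: log the intended change and vote yes or no.
        self.log.append(("prepare", dict(updates)))
        if self.can_commit:
            self.pending = dict(updates)
            return True
        return False

    def commit(self):
        # Phase 2 (all voted yes): make the pending change permanent.
        self.data.update(self.pending or {})
        self.pending = None
        self.log.append(("commit",))

    def abort(self):
        # Phase 2 (someone voted no): discard the pending change.
        self.pending = None
        self.log.append(("abort",))


def two_phase_commit(participants, updates):
    """Commit at every site or at none of them."""
    votes = [p.prepare(updates) for p in participants]  # ask every site
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single "no" vote aborts the transaction everywhere, which gives the all-or-nothing behaviour listed under failure transparency. A real protocol also has to handle coordinator crashes and timeouts, which this sketch leaves out.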

Query Optimization

• In a query involving a multi-site join and, possibly, a distributed database with replicated files, the distributed DBMS must decide where to access the data and how to proceed with the join. Three-step process:
  1. Query decomposition: rewritten and simplified
  2. Data localization: query fragmented so that fragments reference data at only one site
  3. Global optimization:
    • Order in which to execute query fragments
    • Data movement between sites
    • Where parts of the query will be executed

• Semi join operation: only the joining attribute of the query is sent from one site to the other, rather than all selected attributes
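
A semi join can be sketched with two in-memory relations standing in for two sites. The relation names, columns, and values below are hypothetical. The point is only to show which data would cross the network at each step.

```python
# Hypothetical relations held at two different sites.
SITE_A_ORDERS = [
    {"order_id": 10, "cust_id": 1, "total": 99.0},
    {"order_id": 11, "cust_id": 3, "total": 15.0},
    {"order_id": 12, "cust_id": 1, "total": 42.0},
]
SITE_B_CUSTOMERS = [
    {"cust_id": 1, "name": "Ann", "address": "1 Long St"},
    {"cust_id": 2, "name": "Ben", "address": "2 High St"},
    {"cust_id": 3, "name": "Cat", "address": "3 Main Rd"},
    {"cust_id": 4, "name": "Dan", "address": "4 Oak Ave"},
]


def semijoin(orders, customers, attr="cust_id"):
    # Step 1 (site A -> site B): ship only the distinct join values.
    keys = {o[attr] for o in orders}
    # Step 2 (site B -> site A): ship back only the rows that match.
    matching = [c for c in customers if c[attr] in keys]
    # Step 3 (at site A): finish the join locally.
    by_key = {c[attr]: c for c in matching}
    joined = [{**o, **by_key[o[attr]]} for o in orders if o[attr] in by_key]
    return keys, matching, joined


keys, matching, joined = semijoin(SITE_A_ORDERS, SITE_B_CUSTOMERS)
```

Here only two key values travel to site B and two customer rows travel back, instead of shipping either relation whole.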

Brewer’s CAP Theorem

• Eric Brewer, Keynote at ACM Symposium on the Principles of Distributed Computing 2000
• You cannot have all three of:
  – Consistency
  – Availability
  – Partition Tolerance
    • The system must keep functioning under anything short of complete network failure
• Theorem proven in 2002 by Gilbert and Lynch
• See http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

Dealing with CAP?

• Drop Partition Tolerance
  – Don’t partition, but then you have serious scalability issues, which is probably why you want to partition in the first place

• Drop Availability
  – Wait for all the partitions to sync before allowing any usage
  – This is as bad for scalability as having no partitioning

• Drop Consistency
  – Eventual Consistency seems to work in most cases
  – If you have to drop one, this is the preferred option
  – Goes against most DB principles
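
Dropping consistency can be illustrated with two replicas that accept writes independently and reconcile later. The slides do not name a mechanism. This sketch assumes last-writer-wins with caller-supplied logical timestamps, and the `Replica` class and `sync` function are hypothetical.

```python
class Replica:
    """A key-value replica that keeps the newest write per key."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions; the newest timestamp wins."""
    for key, (ts, value) in list(a.store.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.store.items()):
        a.write(key, value, ts)


r1, r2 = Replica(), Replica()
r1.write("x", "old", ts=1)
sync(r1, r2)                 # both replicas agree on "old"
r2.write("x", "new", ts=2)   # the write reaches only r2 (e.g. during a partition)
stale = r1.read("x")         # r1 stays available but answers with the old value
sync(r1, r2)                 # once the replicas can talk again, they converge
```

Both replicas keep answering during the partition, and the price is the stale read from `r1` before the final `sync`.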

Relational DB?

• Seems like we are assuming the DB must still be relational

• Web also forces a new concept
  – Not all data look the same anymore!
    • Email messages, Images, News documents, Facebook updates, Tweets, …

• Relations are too rigid
• Semi-structured data

The Information-Integration Problem

• Related data exists in many places and could, in principle, work together.
• But different databases differ in:
  – Model (relational, object-oriented?).
  – Schema (normalized/not normalized?).
  – Terminology: are consultants employees? Retirees? Subcontractors?
  – Conventions (meters versus feet?).

• How do we model information residing in heterogeneous sources (if we cannot combine it all in a single new database)?

Example

• Suppose we are integrating information about bars in some town.
• Every bar has a database.
  – One may use a relational DBMS; another keeps the menu in an MS-Word document.
  – One stores the phones of distributors, another does not.
  – One distinguishes ales from other beers, another doesn’t.
  – One counts beer inventory by bottles, another by cases.

Semi-structured Data

• Purpose: represent data from independent sources more flexibly than either relational or object-oriented models.

• Think of objects, but with the type of each object its own business, not that of its “class.”

• Labels to indicate meaning of substructures.
• Data is self-describing: structural information is part of the data.

Graphs of Semistructured Data

• Nodes = objects.
• Labels on arcs (attributes, relationships).
• Atomic values at leaf nodes (nodes with no arcs out).
• Flexibility: no restriction on:
  – Labels out of a node.
  – Number of successors with a given label.

Example: Data Graph

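[Figure: data graph for the bar/beer example. Values shown: Bud, A.B., Gold, 1995, Maple, Joe’s, M’lob. Arc labels shown: root, beer (twice), bar, manf (twice), servedAt, name (three times), addr, prize, year, award. Callouts: “The bar object for Joe’s Bar”; “The beer object for Bud”; “Notice a new kind of data.”]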
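
A data graph like this one can be written down as nested Python dictionaries. The labels and values come from the figure, but the transcript does not record which arc connects which nodes, so the wiring below is an assumed reading. The `follow` helper is hypothetical.

```python
# Inner nodes are dicts mapping an arc label to a list of successors, so a
# label may occur many times or not at all. Leaves are atomic values.
# ASSUMPTION: the arc wiring is one plausible reading of the lost figure.
joes = {"name": ["Joe's"], "addr": ["Maple"]}
bud = {
    "name": ["Bud"],
    "manf": ["A.B."],
    "servedAt": [joes],  # shared node: this is a graph, not a tree
    "prize": [{"year": [1995], "award": ["Gold"]}],
}
mlob = {"name": ["M'lob"], "manf": ["A.B."]}
root = {"beer": [bud, mlob], "bar": [joes]}


def follow(node, *labels):
    """Return every node reached from `node` by following the labels in order."""
    frontier = [node]
    for label in labels:
        frontier = [
            nxt
            for n in frontier
            if isinstance(n, dict)
            for nxt in n.get(label, [])
        ]
    return frontier
```

For example, `follow(root, "beer", "prize", "year")` finds a value for one beer only. The other beer has no `prize` arc at all, which is the flexibility the previous slide describes.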

XML

• XML = Extensible Markup Language.
• While HTML uses tags for formatting (e.g., “italic”), XML uses tags for semantics (e.g., “this is an address”).
• Key idea: create tag sets for a domain (e.g., bars), and translate all data into properly tagged XML documents.
• Well-formed XML: XML which is syntactically correct
  – tags and their nesting totally arbitrary.

• Valid XML: XML which has, and conforms to, a DTD (document type definition)
  – imposes some structure on the tags, but much more flexible than relational database schema.

• DTD and XML Schema
  – Meta-data for XML
  – Describe what are valid XML structures

XML and Semi-structured Data

• Well-Formed XML with nested tags is exactly the same idea as trees of semi-structured data.

• XML also enables non-tree structures (with references to IDs of nodes), as does the semi-structured data model.

Example: Well-Formed XML

<?xml version = "1.0" standalone = "yes" ?>
<BARS>
  <BAR>
    <NAME>Joe’s Bar</NAME>
    <BEER><NAME>Bud</NAME>
      <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME>
      <PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR> …
</BARS>

Callouts on the slide:
• Root tag
• Tags surrounding a BAR element
• A NAME subelement
• A BEER subelement
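
Well-formedness can be checked mechanically: a parser either accepts the document or rejects it. A minimal sketch using Python's standard `xml.etree.ElementTree`, assuming the document contains only the first BAR from the slide (the elided second BAR is left out). The `BROKEN` string and the `is_well_formed` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Only the first BAR from the slide; the elided one is omitted.
BARS_XML = """<?xml version="1.0" standalone="yes"?>
<BARS>
  <BAR>
    <NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""

# Hypothetical counter-example: the closing tags are wrongly nested.
BROKEN = "<BAR><NAME>Joe's Bar</BAR></NAME>"


def is_well_formed(text):
    """True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(BARS_XML)

# Walk the nested tags: each BAR has a NAME and some BEER subelements.
menu = {
    bar.findtext("NAME"): {
        beer.findtext("NAME"): float(beer.findtext("PRICE"))
        for beer in bar.findall("BEER")
    }
    for bar in root.findall("BAR")
}
```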

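Example

• The <BARS> XML document is:

[Figure: tree of the document. The root BARS has BAR children, one drawn in full and the others elided. That BAR has a NAME leaf (Joe’s Bar) and two BEER children, each with a NAME and a PRICE leaf (Bud 2.50, Miller 3.00).]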

DTD Elements

• The description of an element consists of its name (tag), and a parenthesized description of any nested tags.
  – Includes order of subtags and their multiplicity.

• Leaves (text elements) have #PCDATA (Parsed Character DATA) in place of nested tags.

Example: DTD

<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>

• A BARS object has zero or more BAR’s nested within.
• A BAR has one NAME and one or more BEER subobjects.
• A BEER has a NAME and a PRICE.
• NAME and PRICE are text.
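
Python's standard library parses XML but does not validate against a DTD, so the sketch below hand-translates the five content models above into regular expressions over the sequence of child tags. This is an illustrative stand-in for a validating parser, not a general DTD implementation. `CONTENT_MODELS`, `conforms`, and the sample documents are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Each content model is a regex over the space-terminated list of child tags.
# None stands for #PCDATA (text only, no nested tags).
CONTENT_MODELS = {
    "BARS": r"(BAR )*",        # (BAR*)
    "BAR": r"NAME (BEER )+",   # (NAME, BEER+)
    "BEER": r"NAME PRICE ",    # (NAME, PRICE)
    "NAME": None,              # (#PCDATA)
    "PRICE": None,             # (#PCDATA)
}


def conforms(elem):
    """Check an element and all its descendants against CONTENT_MODELS."""
    if elem.tag not in CONTENT_MODELS:
        return False
    model = CONTENT_MODELS[elem.tag]
    children = list(elem)
    if model is None:
        return not children  # a text leaf must have no subelements
    sequence = "".join(child.tag + " " for child in children)
    if re.fullmatch(model, sequence) is None:
        return False
    return all(conforms(child) for child in children)


GOOD = (
    "<BARS><BAR><NAME>Joe's Bar</NAME>"
    "<BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER></BAR></BARS>"
)
NO_BEER = "<BARS><BAR><NAME>Joe's Bar</NAME></BAR></BARS>"
WRONG_ORDER = (
    "<BARS><BAR><NAME>Joe's Bar</NAME>"
    "<BEER><PRICE>2.50</PRICE><NAME>Bud</NAME></BEER></BAR></BARS>"
)
```

All three sample documents are well-formed, but only `GOOD` is valid: `NO_BEER` violates `BEER+`, and `WRONG_ORDER` violates the order required by `(NAME, PRICE)`.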

Querying XML

• Why query XML documents?
  – special XML databases
  – major DBMSs “speak” XML;

• Does the world need a new query language?
• Most of the world's business data is stored in relational databases;
• The relational language SQL is mature and well-established;
• Can SQL be adapted to query XML data?
  – Leverage existing software
  – Leverage existing user skills

XML vs Relational Data

• Relational data is "flat": rows and columns;
• XML data is nested, and its depth may be irregular and unpredictable;
• Relations can represent hierarchic data by foreign keys or by structured datatypes;
• In XML it is natural to search for objects at unknown levels of the hierarchy:
  – "Find all the red things";
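
"Find all the red things" can be tried out with the descendant axis. A small sketch using the XPath subset in Python's `xml.etree.ElementTree`. The `HOUSE` document and its `color` attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: red things sit at three different depths.
HOUSE = """<house>
  <door color="red"/>
  <kitchen>
    <bowl color="white"><cherry color="red"/></bowl>
  </kitchen>
  <garage><car color="red"><seat color="black"/></car></garage>
</house>"""

root = ET.fromstring(HOUSE)

# ".//*" means "any element at any depth below the root".
red_things = [e.tag for e in root.findall(".//*[@color='red']")]
```

The query never mentions `kitchen`, `bowl`, or `garage`, so it keeps working however deeply the red elements are nested.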

XML vs Relational Data (cont.)

• Relational data is uniform and repetitive;
  – All bank accounts are similar in structure;
  – Metadata can be factored out to a system catalog;

• XML data is highly variable;
  – Every web page is different;
  – Each XML object needs to be self-describing;
  – Metadata is distributed throughout the document;

• Queries may access metadata as well as data: "Find elements whose name is the same as their content":
  – //*[name(.) = string(.)]
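
The XPath subset in Python's `xml.etree.ElementTree` does not include the `name()` and `string()` functions, so the sketch below emulates the query in plain Python rather than running the XPath expression itself. The `DOC` sample and the helper name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <red> and <blue> contain their own names as text.
DOC = "<doc><red>red</red><color>red</color><note><blue>blue</blue></note></doc>"
root = ET.fromstring(DOC)


def name_equals_content(tree_root):
    """Elements whose tag name equals their full text content."""
    # "".join(e.itertext()) concatenates all descendant text, like string(.).
    return [e for e in tree_root.iter() if e.tag == "".join(e.itertext())]


matches = [e.tag for e in name_equals_content(root)]
```

The comparison mixes metadata (the tag name) with data (the text), which a relational query could only do by consulting the system catalog.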

XML vs Relational Data (cont.)

• Relational queries return uniform sets of rows;
• The results of an XML query may have mixed types and complex structures;
  – "Red things": a flag, a cherry, a stopsign, ...
  – Elements can be mixed with atomic values

• XML queries need to be able to perform structural transformations
  – Example: invert a hierarchy;
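
Inverting a hierarchy can be sketched on the bars document: the input groups beers under bars, and the output groups bars under beers. The first BAR is the slides' example. "Sue's Bar", its price, and the output tag `BEERS` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The first BAR is from the slides; "Sue's Bar" and its price are made up.
BARS_XML = """<BARS>
  <BAR>
    <NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR>
    <NAME>Sue's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.75</PRICE></BEER>
  </BAR>
</BARS>"""


def invert(bars_root):
    """Turn BARS/BAR/BEER into BEERS/BEER/BAR."""
    by_beer = {}  # beer name -> list of (bar name, price)
    for bar in bars_root.findall("BAR"):
        for beer in bar.findall("BEER"):
            by_beer.setdefault(beer.findtext("NAME"), []).append(
                (bar.findtext("NAME"), beer.findtext("PRICE"))
            )
    beers = ET.Element("BEERS")
    for beer_name, sold_at in by_beer.items():
        beer = ET.SubElement(beers, "BEER")
        ET.SubElement(beer, "NAME").text = beer_name
        for bar_name, price in sold_at:
            bar = ET.SubElement(beer, "BAR")
            ET.SubElement(bar, "NAME").text = bar_name
            ET.SubElement(bar, "PRICE").text = price
    return beers


beers_root = invert(ET.fromstring(BARS_XML))
```

The result has a different shape from the input, not just a different selection of its parts. That is the kind of restructuring a query language for XML needs to express.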

XML vs Relational Data (cont.)

• The rows of a relation are unordered
  – Any desired output ordering must be derived from values;

• The elements in an XML document are ordered
• Implications for query:
  – Preserve input order in query results
  – Specify an output ordering at multiple levels
  – "Find the fifth step"
  – "Find all the tools used before the hammer";
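
Both sample queries depend only on document order. A small sketch with `xml.etree.ElementTree`, using a made-up `procedure` document in which each step names one tool:

```python
import xml.etree.ElementTree as ET

# Hypothetical procedure; the order of <step> elements carries meaning.
PROCEDURE = """<procedure>
  <step>Measure with the <tool>ruler</tool></step>
  <step>Cut with the <tool>saw</tool></step>
  <step>Smooth with the <tool>sandpaper</tool></step>
  <step>Make holes with the <tool>drill</tool></step>
  <step>Nail with the <tool>hammer</tool></step>
  <step>Paint with the <tool>brush</tool></step>
</procedure>"""

root = ET.fromstring(PROCEDURE)

# "Find the fifth step": positional predicates are 1-based.
fifth_step = root.find("step[5]")

# "Find all the tools used before the hammer": iter() yields document order.
tools = [t.text for t in root.iter("tool")]
before_hammer = tools[: tools.index("hammer")]
```

Neither query inspects a sort key. In a relation, the same questions would need an explicit step-number column.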

XML vs Relational Data (cont.)

• Relational data is "dense"
  – Every row has a value in every column;
  – A "null" value is needed for missing or inapplicable data

• XML data can be "sparse"
  – Missing or inapplicable elements can be "empty" or "not there"

• This gives XML a degree of freedom not present in relational databases
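
The difference between "empty" and "not there" shows up directly when the data is read back. A small sketch: the `people` document, the `phone` element, and its value are hypothetical.

```python
import xml.etree.ElementTree as ET

# A relational row must fill every column, so a missing phone becomes NULL.
ROW = {"name": "Ann", "phone": None}

# In XML the element can be absent, present but empty, or present with text.
PEOPLE = """<people>
  <person><name>Ann</name></person>
  <person><name>Ben</name><phone/></person>
  <person><name>Cat</name><phone>555</phone></person>
</people>"""


def phone_status(person):
    phone = person.find("phone")
    if phone is None:
        return "not there"
    if not (phone.text or "").strip():
        return "empty"
    return phone.text


statuses = [phone_status(p) for p in ET.fromstring(PEOPLE).findall("person")]
```

The relational row has one way to say "no phone", while the XML document has two, and an application may or may not want to treat them differently.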

'Modern' Databases

XPATH and XQUERY

• XPATH is a language for describing paths in XML documents.
  – Really think of the semi-structured data graph and its paths.
  – Why do we need a path description language: can’t get at the data using just Relation.Attribute expressions.

• XQUERY is a full query language for XML documents with power similar to OQL (Object Query Language, query language for object-oriented databases).
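
Path expressions can be tried out with Python's `xml.etree.ElementTree`, which implements only a limited subset of XPath and none of XQUERY. The document below is the first BAR from the earlier example; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

BARS_XML = """<BARS>
  <BAR>
    <NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""

root = ET.fromstring(BARS_XML)  # root is the BARS element

# A fixed path from the root: BAR, then BEER, then NAME.
beer_names = [e.text for e in root.findall("BAR/BEER/NAME")]

# Any depth: every NAME element, whether it names a bar or a beer.
all_names = [e.text for e in root.findall(".//NAME")]

# A path with a condition: the PRICE of each BEER whose NAME is Bud.
bud_prices = [e.text for e in root.findall(".//BEER[NAME='Bud']/PRICE")]
```

The second query returns bar names and beer names together, something a Relation.Attribute expression cannot do because it names exactly one column of one relation.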