name of the subject : database management system subject code : cs8492 … · 2021. 2. 26. · this...

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

NAME OF THE SUBJECT : DATABASE MANAGEMENT SYSTEM

SUBJECT CODE : CS8492

REGULATION : 2017

UNIT V - ADVANCED TOPICS

UNIT V

ADVANCED TOPICS

5.1 DISTRIBUTED DATABASES:

In a distributed database system, the database is stored on several computers.

The computers in a distributed system communicate with one another through various

communication media, such as high-speed networks or telephone lines.

They do not share main memory or disk.

Types of Distributed Databases

They are classified as:

(a) Homogeneous Distributed Database

In this, all sites have identical database management system software, are aware of one

another, and agree to cooperate in processing user’s requests.

In such a system, local sites surrender a portion of their autonomy in terms of their right to

change schemas or database management system software.

This software must also cooperate with other sites in exchanging information about

transactions, to make transaction processing possible across multiple sites.

It appears to user as a single system.

(b) Heterogeneous Distributed Databases

In this, different sites may use different schemas, and different database

management system software.

The sites may not be aware of one another, and they may provide only limited

facilities for cooperation in transaction processing.

The differences in schemas are often a major problem for query processing, while

the difference in software becomes a major problem for transaction processing.

ARCHITECTURE:

Distributed database system consists of loosely coupled sites that share no physical

component.

The computers in distributed system are referred by names such as sites or nodes.

Database system that run on each site are independent of each other.

Transactions may access data at one or more sites.

DATA STORAGE:

Consider a relation r that is to be stored in the database. There are two approaches to storing

this relation in the distributed database:

1. Replication. The system maintains several identical replicas (copies) of the relation, and

stores each replica at a different site.

2. Fragmentation. The system partitions the relation into several fragments, and stores each

fragment at a different site.

Data Replication If relation r is replicated, a copy of relation r is stored in two or more sites. In full

replication a copy is stored in every site in the system.

There are a number of advantages and disadvantages to replication.

Availability. If one of the sites containing relation r fails, then the relation r can be found in

another site.

Increased parallelism. The several sites can process queries involving r in parallel. The

more replicas of r there are, the greater the chance that the needed data will be found in the

site where the transaction is executing. Hence, data replication minimizes movement of data

between sites.

Increased overhead on update. The system must ensure that all replicas of a relation r are

consistent; otherwise, erroneous computations may result. Thus, whenever r is updated, the

update must be propagated to all sites containing replicas. The result is increased overhead.

Data Fragmentation If relation r is fragmented, r is divided into a number of fragments r1, r2, . . . , rn. These

fragments contain sufficient information to allow reconstruction of the original relation r.

There are two different schemes for fragmenting a relation: horizontal fragmentation and

vertical fragmentation.

Horizontal fragmentation splits the relation by assigning each tuple of r to one or more

fragments.

Vertical fragmentation splits the relation by decomposing the scheme R of relation r.

In horizontal fragmentation, a relation r is partitioned into a number of subsets, r1, r2, . . .

, rn. Each tuple of relation r must belong to at least one of the fragments, so that the original

relation can be reconstructed, if needed.

Example:

The account relation can be divided into several different fragments, each of which consists

of tuples of accounts belonging to a particular branch. If the banking system has only two

branches—Hillside and Valleyview —then there are two different fragments:

account1 = branch name = “Hillside” (account)

account2 = branch name = “Valleyview” (account)

Vertical fragmentation of r(R) involves the definition of several subsets of attributes R1,

R2, . . . , Rn of the schema R so that: R = R1 ∪ R2 ∪ · · · ∪ Rn. Each fragment ri of r is defined

by:ri = Ri (r ) . The fragmentation should be done in such a way that we can reconstruct relation r

from the fragments by taking the natural join: r = r1 r2 r3 · · · rn

Example:

Consider a university database with a relation employee info that stores, for each employee,

employee id, name, designation, and salary. For privacy reasons, this relation may be fragmented

into a relation employee private info containing employee id and salary, and another relation

employee public info containing attributes employee id, name, and designation. These may be stored

at different sites, again, possibly for security reasons.

Transparency:

The user of a distributed database system should not be required to know where the data are

physically located nor how the data can be accessed at the specific local site. This characteristic,

called data transparency, can take several forms:

• Fragmentation transparency. Users are not required to know how a relation has been

fragmented.

• Replication transparency. Users view each data object as logically unique.

• Location transparency. Users are not required to know the physical location of the data.

TRANSACTION PROCESSING:

Access to the various data items in a distributed system is usually accomplished through

transactions, which must preserve the ACID properties.

There are two types of transaction.

Local transactions

Global transactions

The local transactions are those that access and update data in only one local database.

The global transactions are those that access and update data in several local databases.

System Structure:

Each site has its own local transaction manager, whose function is to ensure the ACID

properties of those transactions that execute at that site.

The various transaction managers cooperate to execute global transactions.

An abstract model of a transaction system, in which each site contains two subsystems:

• The transaction manager manages the execution of those transactions that access

data stored in a local site.

• The transaction coordinator coordinates the execution of the various transactions

initiated at that site.

Each transaction manager is responsible for: • Maintaining a log for recovery purposes.

• Maintaining a log for recovery purposes.

• Participating in an appropriate concurrency-control scheme to coordinate the concurrent

execution of the transactions executing at that site.

For each such transaction, the coordinator is responsible for:

• Starting the execution of the transaction.

• Breaking the transaction into a number of subtransactions and distributing these

subtransactions to the appropriate sites for execution.

• Coordinating the termination of the transaction, which may result in the transaction being

committed at all sites or aborted at all sites.

System Failure Modes

A distributed system may suffer from the some types of failure.

The basic failure types are:

• Failure of a site.

• Loss of messages.

• Failure of a communication link.

• Network partition.

5.2 OBJECT-BASED DATABASES:

Loosely speaking, an object corresponds to an entity in the E-R model.

The object-oriented paradigm is based on encapsulating code and data related to an

object into single unit.

The object-oriented data model is a logical data model (like the E-R model).

Adaptation of the object-oriented programming paradigm (e.g., Smalltalk, C++) to

database systems.

OBJECT DATABASE CONCEPTS:

Object Structure An object has associated with it:

A set of variables that contain the data for the object. The value of each variable

is itself an object.

A set of messages to which the object responds; each message may have zero,

one, or more parameters.

A set of methods, each of which is a body of code to implement a message; a

method returns a value as the response to the message

The physical representation of data is visible only to the implementor of the object

Messages and responses provide the only external interface to an object.

The term message does not necessarily imply physical message passing. Messages can

be implemented as procedure invocations.

Messages and Methods Methods are programs written in general-purpose language with the following features

only variables in the object itself may be referenced directly

data in other objects are referenced only by sending messages.

Methods can be read-only or update methods

Read-only methods do not change the value of the object

Strictly speaking, every attribute of an entity must be represented by a variable and two

methods, one to read and the other to update the attribute

e.g., the attribute address is represented by a variable address and two messages

get-address and set-address.

For convenience, many object-oriented data models permit direct access to

variables of other objects.

Object Classes Similar objects are grouped into a class; each such object is called an instance of its class

All objects in a class have the same

Variables, with the same types

message interface

methods

The may differ in the values assigned to variables

Example: Group objects for people into a person class

Classes are analogous to entity sets in the E-R model

Class Definition Example

class employee {

/*Variables */

string name;

string address;

date start-date;

int salary;

/* Messages */

int annual-salary();

string get-name();

string get-address();

int set-address(string new-address);

int employment-length();

};

Fig. Definition of the class employee

Methods to read and set the other variables are also needed with strict encapsulation

Methods are defined separately

string get-address( ) {

return address;

}

int set-address(string new-address) {

address = new-address;

}

While the method employment-length( ) would be defined as:

int employment-length() {

return today( ) – start-date;

}

OBJECT-RELATIONAL FEATURES:

Inheritance

E.g., class of bank customers is similar to class of bank employees, although there are

differences

both share some variables and messages, e.g., name and address.

But there are variables and messages specific to each class e.g., salary for employees and

credit-rating for customers.

Every employee is a person; thus employee is a specialization of person

Similarly, customer is a specialization of person.

Create classes person, employee and customer

variables/messages applicable to all persons associated with class person.

variables/messages specific to employees associated with class employee; similarly for

customer

Place classes into a specialization/IS-A hierarchy

variables/messages belonging to class person are inherited by class employee as well as

customer

Result is a class hierarchy

Fig. 9.1 Class hierarchy for the bank example

Note analogy with ISA Hierarchy in the E-R model

Class Hierarchy Definition

class person{

string name;

string address:

string get-name( );

string get-address( );

int set-address(string new-address);

};

class customer isa person {

int credit-rating;

};

class employee isa person {

date start-date;

int salary;

int annual-salary( )

int employment-length( );

};

class officer isa employee {

int office-number,

int expense-account-number,

};

class teller isa employee {

int hours-per-week;

int station-number;

};

Class secretary isa employee {

int hours-per-week;

string manager;

};

Full variable list for objects in the class officer:

office-number, expense-account-number: defined locally

start-date, salary: inherited from employee

name, address: inherited from person

Methods inherited similar to variables.

Substitutability — any method of a class, say person, can be invoked equally well with any

object belonging to any subclass, such as subclass officer of person.

Class extent: set of all objects in the class. Two options:

1. Class extent of employee includes all officer, teller and secretary objects.

2. Class extent of employee includes only employee objects that are not in a subclass such as

officer, teller, or secretary

This is the usual choice in OO systems

Can access extents of subclasses to find all objects of

subtypes of employee

Multiple Inheritances

With multiple inheritances a class may have more than one superclass.

The class/subclass relationship is represented by a directed acyclic graph (DAG)

Particularly useful when objects can be classified in more than one way, which

are independent of each other

Example of Multiple Inheritance

Fig. 9.2 Class DAG for banking example.

A class inherits variables and methods from all its superclasses

There is potential for ambiguity when a variable/message N with the same name is

inherited from two super classes A and B

No problem if the variable/message is defined in a shared superclass

Otherwise, do one of the following

flag as an error,

rename variables (A.N and B.N)

choose one.

Object Identity An object retains its identity even if some or all of the values of variables or definitions

of methods change over time.

Object identity is a stronger notion of identity than in programming languages or data

models not based on object orientation.

Value – data value; e.g. primary key value used in relational systems.

Name – supplied by user; used for variables in procedures.

Built-in – identity built into data model or programming language.

no user-supplied identifier is required.

Is the form of identity used in object-oriented systems.

Object Identifiers Object identifiers used to uniquely identify objects

Object identifiers are unique:

no two objects have the same identifier

each object has only one object identifier

E.g., the spouse field of a person object may be an identifier of another person

object.

can be stored as a field of an object, to refer to another object.

Can be

system generated (created by database) or

external (such as social-security number)

System generated identifiers:

Are easier to use, but cannot be used across database systems

May be redundant if unique identifier already exists

Object Containment

Each component in a design may contain other components

Can be modeled as containment of objects. Objects containing; other objects are called

composite objects.

Multiple levels of containment create a containment hierarchy

links interpreted as is-part-of, not is-a.

Allows data to be viewed at different granularities by different users.

Fig. 9.3 Containment hierarchy for bicycle design database

ODMG OBJECT MODEL, ODL, OQL: ODMG (Object Data Management Group) is a proposed standard that is known as the

ODMG-93 or ODMG 1.0 standard.

The standard is made up of several parts, including the object model, the object definition

language (ODL), the object query language (OQL), and the bindings to object-oriented

programming languages

Object Model of ODMG: The ODMG object model is the data model upon which the object definition language

(ODL) and object query language (OQL) are based.

It is meant to provide a standard data model for object databases, just as SQL describes a

standard data model for relational databases.

It also provides a standard terminology in a field where the same terms were sometimes

used to describe different concepts.

Objects and Literals

Objects and literals are the basic building blocks of the object model.

The main difference between the two is that an object has both an object identifier and

a state (or current value), whereas a literal has a value (state) but no object identifier.

An object has five aspects: Identifier, name, lifetime, structure, and creation

IDENTIFIER:

The object identifier is a unique system-wide identifier (or Object_id).

Every object must have an object identifier.

NAMES:

ODMS—this name can be used to locate the object and the system should return the object

given that name.

LIFETIME:

The lifetime of an object specifies whether it is a persistent object.

STRUCTURE:

The structure of an object specifies how the object is constructed by using the type

constructors.

CREATION:

Object creation refers to the manner in which an object can be created.

There are three types of literals: atomic, structured, and collection.

ATOMIC LITERALS:

Atomic literals correspond to the values of basic data types and are predefined. The basic

data types of the object model include long, short, and unsigned integer numbers regular and

double precision floating point numbers (float, double), Boolean values (boolean), single

characters (char), character strings (string), and enumeration types (enum), among others.

STRUCTURED LITERALS:

Structured literals correspond roughly to values that are constructed using the tuple

constructor. The built-in structured literals include Date, Interval, Time, and Timestamp

COLLECTION LITERALS:

Collection literals specify a literal value that is a collection of objects or values but the

ollection itself does not have an Object_id.

Inheritance in the Object Model of ODMG

In the ODMG object model, two types of inheritance relationships exist:

Behavior - only inheritance

State plus behavior inheritance.

Behavior inheritance is also known as ISA or interface inheritance, and is specified by the

colon (:) notation.

The other inheritance relationship, called EXTENDS inheritance, is specified by the

keyword extends. It is used to inherit both state and behavior strictly among classes, so both the

super type and the subtype must be classes.

Built-in Interfaces and Classes in the Object Model:

All interfaces, such as Collection, Date, and Time, inherit the basic Object interface. In the

object model, there is a distinction between collection objects, whose state contains multiple objects

or literals, versus atomic (and structured) objects, whose state is an individual object or literal.

The ODMG object model uses exceptions for reporting errors or particular conditions.

ATOMIC(USER-DEFINED)INTERFACE:

In the object model, any user-defined object that is not a collection object is called

an atomic object.

Extents, Keys, and Factory Objects

1.EXTENDS

In the ODMG object model, the database designer can declare an extent for any object type

that is defined via a class declaration

2. KEY

A key consists of one or more properties (attributes or relationships) whose values are

constrained to be unique for each object in the extent.

3. FACTORY OBJECT:

Factory object—an object that can be used to generate or create individual objects via its

operations.

Object-Oriented Data Model(ODL):

The ODL is designed to support the semantic constructs of the ODMG object model and is

independent of any particular programming language.

Its main use is to create object specifications—that is, classes and interfaces. Hence, ODL is

not a full programming language. A user can specify a database schema in ODL independently of

any programming language, and then use the specific language bindings to specify how ODL

constructs can be mapped to constructs in specific programming languages, such as C++, Smalltalk,

and Java.

The object-oriented paradigm is based on encapsulating code and data related to an object

into a single unit.

The object-oriented data model is a logical model (like the E-R model). .

Object Query Language (OQL): The object query language OQL is the query language proposed for the ODMG object

model.

It is designed to work closely with the programming languages for which an ODMG binding

is defined, such as C++, Smalltalk, and Java.

The OQL syntax for queries is similar to the syntax of the relational standard query language

SQL with additional features for ODMG concepts, such as object identity, complex objects,

operations, inheritance, polymorphism, and relationships.

1. Simple OQL Queries, Database Entry Points, and Iterator Variables

The basic OQL syntax is a select ... from ... where ... structure, as it is for SQL.

For example, the query to retrieve the names of all departments in the college of ‘Engineering’ can

be written as follows:

Q0: select D .Dname

from D in DEPARTMENTS

where D.College = ‘Engineering’;

There are three syntactic options for specifying iterator variables:

D in DEPARTMENTS

DEPARTMENTS D

DEPARTMENTS AS D

Query Results and Path Expressions

In general, the result of a query can be of any type that can be expressed in the ODMG

object model.

A query does not have to follow the select ... from ... where ...structure. For example, the

query

Q1: DEPARTMENTS;

Returns a reference to the collection of all persistent DEPARTMENT objects.

Q1A: CS_DEPARTMENT;

Returns a reference to that individual object of type DEPARTMENT.

Q2: CS_DEPARTMENT.Chair;

Q2A: CS_DEPARTMENT.Chair.Rank;

Q2B: CS_DEPARTMENT.Has_faculty;

The first expression Q2 returns an object of type FACULTY, because that is the type of the

attribute Chair of the DEPARTMENT class

The second expression Q2A is similar, except that it returns the Rank of this FACULTY object (the

Computer Science chair) rather than the object reference

Path expressions Q2 and Q2A return single values, because the

attributes Chair (of DEPARTMENT) and Rank (of FACULTY) are both single-valued and they are

applied to a single object. The third expression, Q2B, is different; it returns an object of

type set<FACULTY> even when applied to a single object, because that is the type of the

relationship Has faculty of the DEPARTMENT class.

Other Features of OQL Specifying Views as Named Queries. The view mechanism in OQL uses the concept of

a named query. The define keyword is used to specify an identifier of the named query, which

must be a unique name among all named objects, class names, method names, and function names

in the schema. If the identifier has the same name as an existing named query, then the new

definition replaces the previous def-inition. Once defined, a query definition is persistent until it is

redefined or deleted. A view can also have parameters (arguments) in its definition.

For example, the following view V1 defines a named query Has_minors to retrieve the set of

objects for students minoring in a given department:

V1: define Has_minors(Dept_name) as

select S

from S in STUDENTS

where S.Minors_in.Dname = Dept_name;

5.3 XML DATABASES: XML: Extensible Markup Language

Defined by the WWW Consortium (W3C)

Derived from SGML (Standard Generalized Markup Language), but simpler to use than

SGML

XML HIERARCHICAL MODEL:

Documents have tags giving extra information about sections of the document

E.g. <title> XML </title> <slide> Introduction …</slide>

Extensible, unlike HTML

Users can add new tags, and separately specify how the tag should be handled

for display

The ability to specify new tags, and to create nested tag structures make XML a great

way to exchange data, not just documents.

Much of the use of XML has been in data exchange applications, not as a

replacement for HTML

Tags make data (relatively) self-documenting

E.g.

<university>

<department>

<dept_name> Comp. Sci. </dept_name>

<building> Taylor </building>

<budget> 100000 </budget>

</department>

<course>

<course_id> CS-101 </course_id>

<title> Intro. to Computer Science </title>

<dept_name> Comp. Sci </dept_name>

<credits> 4 </credits>

</course>

</university>

Tag: label for a section of data

Element: section of data beginning with <tagname> and ending with matching

</tagname>

Elements must be properly nested

Proper nesting

<course> … <title> …. </title> </course>

Improper nesting

<course> … <title> …. </course> </title>

Formally: every start tag must have a unique matching end tag, that is in the

context of the same parent element.

Every document must have a single top-level element

Example of Nested Elements

<purchase_order>

<identifier> P-101 </identifier>

<purchaser> …. </purchaser>

<itemlist>

<item>

<identifier> RS1 </identifier>

<description> Atom powered rocket sled </description>

<quantity> 2 </quantity>

<price> 199.95 </price>

</item>

<item>

<identifier> SG2 </identifier>

<description> Superb glue </description>

<quantity> 1 </quantity>

<unit-of-measure> liter </unit-of-measure>

<price> 29.95 </price>

</item>

</itemlist>

</purchase_order>

Attributes

Elements can have attributes

<course course_id= “CS-101”>

<title> Intro. to Computer Science</title>

<dept name> Comp. Sci. </dept name>

<credits> 4 </credits>

</course>

Attributes are specified by name=value pairs inside the starting tag of an element

An element may have several attributes, but each attribute name can only occur once

<course course_id = “CS-101” credits=“4”>

XML SCHEMA and DTD:

Database schemas constrain what information can be stored, and the data types of stored

values

XML documents are not required to have an associated schema

However, schemas are very important for XML data exchange

Otherwise, a site cannot automatically interpret data received from another site

Two mechanisms for specifying XML schema

Document Type Definition (DTD)

XML Schema

DTD: The type of an XML document can be specified using a DTD

DTD constraints structure of XML data

What elements can occur

What attributes can/must an element have

What sub elements can/must occur inside each element, and how many times.

DTD does not constrain data types

All values represented as strings in XML

DTD syntax

<!ELEMENT element (subelements-specification) >

<!ATTLIST element (attributes) >

Element Specification in DTD

Sub elements can be specified as

names of elements, or

#PCDATA (parsed character data), i.e., character strings

EMPTY (no subelements) or ANY (anything can be a subelement)

Example

<! ELEMENT department (dept_name building, budget)>

<! ELEMENT dept_name (#PCDATA)>

<! ELEMENT budget (#PCDATA)>

Sub element specification may have regular expressions

<!ELEMENT university ( ( department | course | instructor | teaches )+)>

Notation:

– “|” - alternatives

– “+” - 1 or more occurrences

– “*” - 0 or more occurrences

University DTD

<!DOCTYPE university [

<!ELEMENT university ( (department|course|instructor|teaches)+)>

<!ELEMENT department ( dept name, building, budget)>

<!ELEMENT course ( course id, title, dept name, credits)>

<!ELEMENT instructor (IID, name, dept name, salary)>

<!ELEMENT teaches (IID, course id)>

<!ELEMENT dept name( #PCDATA )>

<!ELEMENT building( #PCDATA )>

<!ELEMENT budget( #PCDATA )>

<!ELEMENT course id ( #PCDATA )>

<!ELEMENT title ( #PCDATA )>

<!ELEMENT credits( #PCDATA )>

<!ELEMENT IID( #PCDATA )>

<!ELEMENT name( #PCDATA )>

<!ELEMENT salary( #PCDATA )>

]>

Attribute Specification in DTD

Attribute specification : for each attribute

Name

Type of attribute

CDATA

ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)

n more on this later

Whether

mandatory (#REQUIRED)

has a default value (value),

or neither (#IMPLIED)

Examples

<!ATTLIST course course_id CDATA #REQUIRED>, or

<!ATTLIST course

course_id ID #REQUIRED

dept_name IDREF #REQUIRED

instructors IDREFS #IMPLIED >

IDs and IDREFs

An element can have at most one attribute of type ID

The ID attribute value of each element in an XML document must be distinct

Thus the ID attribute value is an object identifier

An attribute of type IDREF must contain the ID value of an element in the same

document

An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value

must contain the ID value of an element in the same document

University DTD with Attributes

University DTD with ID and IDREF attribute types.

<!DOCTYPE university-3 [

<!ELEMENT university ( (department|course|instructor)+)>

<!ELEMENT department ( building, budget )>

<!ATTLIST department

dept_name ID #REQUIRED >

<!ELEMENT course (title, credits )>

<!ATTLIST course

course_id ID #REQUIRED

dept_name IDREF #REQUIRED

instructors IDREFS #IMPLIED >

<!ELEMENT instructor ( name, salary )>

<!ATTLIST instructor

IID ID #REQUIRED

dept_name IDREF #REQUIRED >

· · · declarations for title, credits, building,

budget, name and salary · · ·

]>

Limitations of DTDs

No typing of text elements and attributes

All values are strings, no integers, reals, etc.

Difficult to specify unordered sets of subelements

Order is usually irrelevant in databases (unlike in the document-layout

environment from which XML evolved)

(A | B)* allows specification of an unordered set, but

Cannot ensure that each of A and B occurs only once

IDs and IDREFs are untyped

The instructors attribute of an course may contain a reference to another course,

which is meaningless

instructors attribute should ideally be constrained to refer to instructor

elements

XML SCHEMA: XML Schema is a more sophisticated schema language which addresses the drawbacks

of DTDs. Supports

Typing of values

E.g. integer, string, etc

Also, constraints on min/max values

User-defined, comlex types

Many more features, including

uniqueness and foreign key constraints, inheritance

XML Schema is itself specified in XML syntax, unlike DTDs

More-standard representation, but verbose

XML Scheme is integrated with namespaces

BUT: XML Schema is significantly more complicated than DTDs.

XML Schema Version of Univ. DTD

<xs:schema xmlns:xs=“http://www.w3.org/2001/XMLSchema”>

<xs:element name=“university” type=“universityType” />

<xs:element name=“department”>

<xs:complexType>

<xs:sequence>

<xs:element name=“dept name” type=“xs:string”/>

<xs:element name=“building” type=“xs:string”/>

<xs:element name=“budget” type=“xs:decimal”/>

</xs:sequence>

</xs:complexType>

</xs:element>

….

<xs:element name=“instructor”>

<xs:complexType>

<xs:sequence>

<xs:element name=“IID” type=“xs:string”/>

<xs:element name=“name” type=“xs:string”/>

<xs:element name=“dept name” type=“xs:string”/>

<xs:element name=“salary” type=“xs:decimal”/>

</xs:sequence>

</xs:complexType>

</xs:element>

….

<xs:complexType name=“UniversityType”>

<xs:sequence>

<xs:element ref=“department” minOccurs=“0” maxOccurs=“unbounded”/>

<xs:element ref=“course” minOccurs=“0” maxOccurs=“unbounded”/>

<xs:element ref=“instructor” minOccurs=“0” maxOccurs=“unbounded”/>

<xs:element ref=“teaches” minOccurs=“0” maxOccurs=“unbounded”/>

</xs:sequence>

</xs:complexType>

</xs:schema>

Choice of “xs:” was ours -- any other namespace prefix could be chosen

Element “university” has type “universityType”, which is defined separately

xs:complexType is used later to create the named complex type

“UniversityType”

XQUERY: Translation of information from one XML schema to another

Querying on XML data

Above two are closely related, and handled by the same tools

Standard XML querying/translation languages

XPath: Simple language consisting of path expressions

XSLT: Simple language designed for translation from XML to XML and XML

to HTML

XQuery: An XML query language with a rich set of features

XPath

XPath is used to address (select) parts of documents using path expressions

A path expression is a sequence of steps separated by “/”

Think of file names in a directory hierarchy

Result of path expression: set of values that along with their containing

elements/attributes match the specified path

E.g. /university-3/instructor/name evaluated on the university-3 data we saw earlier

returns

<name>Srinivasan</name>

<name>Brandt</name>

E.g. /university-3/instructor/name/text( ) returns the same names, but without the

enclosing tags

The initial “/” denotes root of the document (above the top-level tag)

Path expressions are evaluated left to right

Each step operates on the set of instances produced by the previous step

Selection predicates may follow any step in a path, in [ ]

E.g. /university-3/course[credits >= 4]

returns account elements with a balance value greater than 400

/university-3/course[credits] returns account elements containing a

credits subelement

Attributes are accessed using “@”

E.g. /university-3/course[credits >= 4]/@course_id

returns the course identifiers of courses with credits >= 4

IDREF attributes are not dereferenced automatically (more on this later)

Functions in XPath

XPath provides several functions

The function count() at the end of a path counts the number of elements in the

set generated by the path

E.g. /university-2/instructor[count(./teaches/course)> 2]

– Returns instructors teaching more than 2 courses (on university-2

schema)

Also function for testing position (1, 2, ..) of node w.r.t. siblings

Boolean connectives and and or and function not() can be used in predicates

IDREFs can be referenced using function id()

id() can also be applied to sets of references such as IDREFS and even to strings

containing multiple references separated by blanks

E.g. /university-3/course/id(@dept_name)

returns all department elements referred to from the dept_name attribute

of course elements.

XQuery

XQuery is a general purpose query language for XML data

Currently being standardized by the World Wide Web Consortium (W3C)

The textbook description is based on a January 2005 draft of the standard. The final version

may differ, but major features likely to stay unchanged.

XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and

XML-QL

XQuery uses a

for … let … where … order by …result …

syntax

for SQL from

where SQL where

order by SQL order by

result SQL select

lets allow temporary variables, and has no equivalent in SQL

FLWOR Syntax in XQuery

For clause uses XPath expressions, and variable in for clause ranges over values in the

set returned by XPath

Simple FLWOR expression in XQuery

find all courses with credits > 3, with each result enclosed in an <course_id> ..

</course_id> tag

for $x in /university-3/course

let $courseId := $x/@course_id

where $x/credits > 3

return <course_id> { $courseId } </course id>

Items in the return clause are XML text unless enclosed in {}, in which case they

are evaluated

Let clause not really needed in this query, and selection can be done In XPath. Query

can be written as:

for $x in /university-3/course[credits > 3]

return <course_id> { $x/@course_id } </course_id>

Alternative notation for constructing elements:

return element course_id { element $x/@course_id }

XSLT

A stylesheet stores formatting options for a document, usually separately from document

E.g. an HTML style sheet may specify font colors and sizes for headings, etc.

The XML Stylesheet Language (XSL) was originally designed for generating HTML

from XML

XSLT is a general-purpose transformation language

Can translate XML to XML, and XML to HTML

XSLT transformations are expressed using rules called templates

Templates combine selection using XPath with construction of results

Storage of XML Data

XML data can be stored in

Non-relational data stores

Flat files

– Natural for storing XML

– But has all problems discussed in Chapter 1 (no concurrency, no

recovery, …)

XML database

– Database built specifically for storing XML data, supporting

DOM model and declarative querying

– Currently no commercial-grade systems

Relational databases

Data must be translated into relational form

Advantage: mature database systems

Disadvantages: overhead of translating data and queries

5.4 INFORMATION RETRIEVAL:

Information retrieval is the process of retrieving documents from a collection in response to

a query (or a search request) by a user.

IR CONCEPTS: IR systems use a user’s information need expressed as a free-form search request.

An IR system can be characterized at different levels: by types of users, types of data,

and the types of the information need, along with the size and scale of the information repository it

addresses.

Types of Users. The user may be an expert user , who is searching for specific information

that is clear in his/her mind and forms relevant queries for the task, or a layperson user with a

generic information need.

Types of Data. Search systems can be tailored to specific types of data. For example, the

problem of retrieving information about a specific topic may be handled more efficiently by

customized search systems that are built to collect and retrieve only information related to that

specific topic.

Types of Information Need. In the context of Web search, users’ information needs may be

defined as navigational, informational, or transactional.

▪ Navigational search refers to finding a particular piece of information that a user

needs quickly.

▪ The purpose of informational search is to find current information about a topic.

▪ The goal of transactional search is to reach a site where further interaction happens.

RETRIEVAL MODELS:

There are the three main statistical models—Boolean, vector space, and probabilistic—and

the semantic model.

Boolean Model In this model, documents are represented as a set of terms.

Queries are formulated as a combination of terms using the standard Boolean logic set-

theoretic operators such as AND, OR and NOT.

Retrieval and relevance are considered as binary concepts in this model, so the retrieved

elements are an “exact match” retrieval of relevant documents. There is no notion of ranking of

resulting documents. All retrieved documents are considered equally important.

Boolean retrieval models lack sophisticated ranking algorithms and are among the earliest

and simplest information retrieval models.

These models make it easy to associate metadata information and write queries that match

the contents of the documents as well as other properties of documents, such as date of creation,

author, and type of document.

Vector Space Model The vector space model provides a framework in which term weighting, ranking of retrieved

documents, and relevance feedback are possible.

Documents are represented as features and weights of term features in an n-dimensional

vector space of terms.

Features are a subset of the terms in a set of documents that are deemed most relevant to an

IR search for this particular set of documents.

The process of selecting these important terms (features) and their properties as a sparse

(limited) list out of the very large number of available terms is independent of the model

specification.

The query is also specified as a terms vector (vector of features), and this is compared to the

document vectors for similarity/relevance assessment.

In the vector model, the document term weight wij (for term i in document j) is represented

based on some variation of the TF (term frequency) or TF-IDF (term frequency-inverse document

frequency) scheme (as we will describe below).

TF-IDF is a statistical weight measure that is used to evaluate the importance of a document

word in a collection of documents. The following formula is typically used:

TF-IDF uses the product of normalized frequency of a term i (TFij) in document Dj and the

inverse document frequency of the term i (IDFi) to weight a term in a document.

IDF values can be easily computed for a fixed collection of documents. In case of Web

search engines, taking a representative sample of documents approximates IDF computation. The

following formulas can be used:

The relevance of the document with respect to a query (rel(Dj,Q)) is directly measured as the sum

of the TF-IDF values of the terms in the Query Q:

Probabilistic Model In the probabilistic model, a more concrete and definitive approach is taken:

ranking documents by their estimated probability of relevance with respect to the query and the

document. This is the basis of the Probability Ranking Principle developed by Robertson.

In the probabilistic model, a more concrete and definitive approach is taken: ranking

documents by their estimated probability of relevance with respect to the query and the document.

This is the basis of the Probability Ranking Principle developed by Robertson.

In the probabilistic framework, the IR system has to decide whether the documents belong

to the relevant set or the nonrelevant set for a query.

To make this decision, it is assumed that a predefined relevant set and non relevant set exist

for the query, and the task is to calculate the probability that the document belongs to the relevant

set and compare that with the probability that the document belongs to the non relevant set.

Given the document representation D of a document, estimating the relevance R and

nonrelevance NR of that document involves computation of conditional probability P(R|D) and

P(NR|D).

These conditional probabilities can be calculated using Bayes’ Rule:

P(R|D) = P(D|R) × P(R)/P(D)

P(NR|D) = P(D|NR) × P(NR)/P(D)

A document D is classified as relevant if P(R|D) > P(NR|D). Discarding the constant P(D),

this is equivalent to saying that a document is relevant if:

P(D|R) × P(R) > P(D|NR) × P(NR)

The likelihood ratio P(D|R)/P(D|NR) is used as a score to determine the likelihood of the

document with representation D belonging to the relevant set.

The term independence or Naïve Bayes assumption is used to estimate P(D|R) using

computation of P(ti|R) for term ti. The likelihood ratios P(D|R)/P(D|NR) of documents are used as a

proxy for ranking based on the assumption that highly ranked documents will have a high

likelihood of belonging to the relevant set.

Semantic Model

Semantic approaches include different levels of analysis, such as morphological, syntactic,

and semantic analysis, to retrieve documents more effectively.

In morphological analysis, roots and affixes are analyzed to determine the parts of speech

of the words.

Following morphological analysis, syntactic analysis follows to parse and analyze complete

phrases in documents.

Finally, the semantic methods have to resolve word ambiguities and/or generate relevant

synonyms based on the semantic relationships between levels of structural entities in documents.

The development of a sophisticated semantic system requires complex knowledge bases of

semantic information as well as retrieval heuristics.

These systems often require techniques from artificial intelligence and expert systems.

Knowledge bases like Cyc and WordNet have been developed for use in knowledge-based IR

systems based on semantic models.

QUERIES IN IR SYSTEMS:

Keyword Queries

Keyword-based queries are the simplest and most commonly used forms of IR queries: the

user just enters keyword combinations to retrieve documents.

The query keyword terms are implicitly connected by a logical AND operator. A query

such as ‘database concepts’ retrieves documents that contain both the words ‘database’

and ‘concepts’ at the top of the retrieved results. In addition, most systems also

retrieve documents that contain only ‘database’ or only ‘concepts’ in their text. Some

systems remove most commonly occurring words (such as a, the, of, and so on,

called stopwords) as a preprocessing step before sending the filtered query keywords

to the IR engine. Most IR systems do not pay attention to the ordering of

these words in the query. All retrieval models provide support for keyword queries.

27.3.2 Boolean Queries

Some IR systems allow using the AND, OR, NOT, ( ), + , and – Boolean operators in

combinations of keyword formulations. AND requires that both terms be found.

OR lets either term be found. NOT means any record containing the second term

will be excluded. ‘( )’ means the Boolean operators can be nested using parentheses.

‘+’ is equivalent to AND, requiring the term; the ‘+’ should be placed directly in front

of the search term. ‘–’ is equivalent to AND NOT and means to exclude the term; the ‘–’ should be placed directly in front of the search term not wanted. Complex

Boolean queries can be built out of these operators and their combinations, and

they are evaluated according to the classical rules of Boolean algebra. No ranking is

possible, because a document either satisfies such a query (is “relevant”) or does not

satisfy it (is “nonrelevant”). A document is retrieved for a Boolean query if the

query is logically true as an exact match in the document.Users generally do not use

combinations of these complex Boolean operators, and IR systems support a

restricted version of these set operators. Boolean retrieval models can directly support

different Boolean operator implementations for these kinds of queries.

27.3.3 Phrase Queries When documents are represented using an inverted keyword index for searching,

the relative order of the terms in the document is lost. In order to perform exact

phrase retrieval, these phrases should be encoded in the inverted index or implemented

differently (with relative positions of word occurrences in documents). A

phrase query consists of a sequence of words that makes up a phrase. The phrase is

generally enclosed within double quotes. Each retrieved document must contain at

least one instance of the exact phrase. Phrase searching is a more restricted and specific

version of proximity searching that we mention below. For example, a phrase

searching query could be ‘conceptual database design’. If phrases are indexed by the

retrieval model, any retrieval model can be used for these query types. A phrase thesaurus

may also be used in semantic models for fast dictionary searching for

phrases.

27.3.4 Proximity Queries

Proximity search refers to a search that accounts for how close within a record multiple

terms should be to each other. The most commonly used proximity search

option is a phrase search that requires terms to be in the exact order. Other proximity

operators can specify how close terms should be to each other. Some will also

specify the order of the search terms. Each search engine can define proximity operators

differently, and the search engines use various operator names such as NEAR,

ADJ(adjacent), or AFTER. In some cases, a sequence of single words is given,

together with a maximum allowed distance between them.Vector space models that

also maintain information about positions and offsets of tokens (words) have

robust implementations for this query type. However, providing support for complex

proximity operators becomes computationally expensive because it requires

the time-consuming preprocessing of documents, and is thus suitable for smaller

document collections rather than for the Web.

27.3.5 Wildcard Queries

Wildcard searching is generally meant to support regular expressions and pattern

matching-based searching in text. In IR systems, certain kinds of wildcard search

support may be implemented—usually words with any trailing characters (for

example, ‘data*’ would retrieve data, database, datapoint, dataset, and so on).

Providing support for wildcard searches in IR systems involves preprocessing overhead

and is not considered worth the cost by many Web search engines today.

Retrieval models do not directly provide support for this query type.

27.3.6 Natural Language Queries

There are a few natural language search engines that aim to understand the structure

and meaning of queries written in natural language text, generally as a question

or narrative. This is an active area of research that employs techniques like shallow

semantic parsing of text, or query reformulations based on natural language understanding.

The system tries to formulate answers for such queries from retrieved

results. Some search systems are starting to provide natural language interfaces to

provide answers to specific types of questions, such as definition and factoid questions,

which ask for definitions of technical terms or common facts that can be

retrieved from specialized databases. Such questions are usually easier to answer

because there are strong linguistic patterns giving clues to specific types of sentences—

for example, ‘defined as’ or ‘refers to’. Semantic models can provide support

for this query type.