DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
NAME OF THE SUBJECT : DATABASE MANAGEMENT SYSTEM
SUBJECT CODE : CS8492
REGULATION : 2017
UNIT V - ADVANCED TOPICS
5.1 DISTRIBUTED DATABASES:
In a distributed database system, the database is stored on several computers.
The computers in a distributed system communicate with one another through various
communication media, such as high-speed networks or telephone lines.
They do not share main memory or disk.
Types of Distributed Databases
They are classified as:
(a) Homogeneous Distributed Database
In this, all sites have identical database management system software, are aware of one
another, and agree to cooperate in processing user’s requests.
In such a system, local sites surrender a portion of their autonomy in terms of their right to
change schemas or database management system software.
This software must also cooperate with other sites in exchanging information about
transactions, to make transaction processing possible across multiple sites.
It appears to the user as a single system.
(b) Heterogeneous Distributed Databases
In this, different sites may use different schemas, and different database
management system software.
The sites may not be aware of one another, and they may provide only limited
facilities for cooperation in transaction processing.
The differences in schemas are often a major problem for query processing, while
the difference in software becomes a major problem for transaction processing.
ARCHITECTURE:
A distributed database system consists of loosely coupled sites that share no physical
component.
The computers in a distributed system are referred to by names such as sites or nodes.
The database systems that run on each site are independent of each other.
Transactions may access data at one or more sites.
DATA STORAGE:
Consider a relation r that is to be stored in the database. There are two approaches to storing
this relation in the distributed database:
1. Replication. The system maintains several identical replicas (copies) of the relation, and
stores each replica at a different site.
2. Fragmentation. The system partitions the relation into several fragments, and stores each
fragment at a different site.
Data Replication
If relation r is replicated, a copy of relation r is stored in two or more sites. In full
replication a copy is stored in every site in the system.
There are a number of advantages and disadvantages to replication.
Availability. If one of the sites containing relation r fails, then the relation r can be found in
another site.
Increased parallelism. The several sites can process queries involving r in parallel. The
more replicas of r there are, the greater the chance that the needed data will be found in the
site where the transaction is executing. Hence, data replication minimizes movement of data
between sites.
Increased overhead on update. The system must ensure that all replicas of a relation r are
consistent; otherwise, erroneous computations may result. Thus, whenever r is updated, the
update must be propagated to all sites containing replicas. The result is increased overhead.
Data Fragmentation
If relation r is fragmented, r is divided into a number of fragments r1, r2, . . . , rn. These
fragments contain sufficient information to allow reconstruction of the original relation r.
There are two different schemes for fragmenting a relation: horizontal fragmentation and
vertical fragmentation.
Horizontal fragmentation splits the relation by assigning each tuple of r to one or more
fragments.
Vertical fragmentation splits the relation by decomposing the scheme R of relation r.
In horizontal fragmentation, a relation r is partitioned into a number of subsets, r1, r2, . . .
, rn. Each tuple of relation r must belong to at least one of the fragments, so that the original
relation can be reconstructed, if needed.
Example:
The account relation can be divided into several different fragments, each of which consists
of tuples of accounts belonging to a particular branch. If the banking system has only two
branches—Hillside and Valleyview —then there are two different fragments:
account1 = σ_{branch_name = “Hillside”} (account)
account2 = σ_{branch_name = “Valleyview”} (account)
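As a minimal sketch of the horizontal fragmentation above, the following Python fragment treats the account relation as a set of tuples. The account numbers and balances are made up for illustration; only the branch names come from the example. Each fragment is a selection on the branch name, and the union of the fragments gives back the original relation.

```python
# Hypothetical account relation: (account_number, branch_name, balance).
account = {
    ("A-305", "Hillside", 500),
    ("A-226", "Hillside", 336),
    ("A-177", "Valleyview", 205),
    ("A-402", "Valleyview", 10000),
}

def select_branch(relation, branch):
    """Selection: keep only the tuples whose branch_name equals `branch`."""
    return {t for t in relation if t[1] == branch}

account1 = select_branch(account, "Hillside")
account2 = select_branch(account, "Valleyview")

# Reconstruction: the union of all fragments is the original relation.
reconstructed = account1 | account2
print(reconstructed == account)  # True
```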
Vertical fragmentation of r(R) involves the definition of several subsets of attributes R1,
R2, . . . , Rn of the schema R so that: R = R1 ∪ R2 ∪ · · · ∪ Rn. Each fragment ri of r is defined
by: ri = Π_{Ri} (r). The fragmentation should be done in such a way that we can reconstruct relation r
from the fragments by taking the natural join: r = r1 ⋈ r2 ⋈ r3 ⋈ · · · ⋈ rn
Example:
Consider a university database with a relation employee_info that stores, for each employee,
employee_id, name, designation, and salary. For privacy reasons, this relation may be fragmented
into a relation employee_private_info containing employee_id and salary, and another relation
employee_public_info containing attributes employee_id, name, and designation. These may be stored
at different sites, again, possibly for security reasons.
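The vertical fragmentation of the employee relation can be sketched in Python as a projection onto two attribute subsets followed by a natural join. The employee rows and the helper names `project` and `natural_join` are assumptions made for this illustration. Because both fragments keep the key employee_id, the join rebuilds the original rows.

```python
# Hypothetical rows of the employee relation.
employee_info = [
    {"employee_id": 1, "name": "Asha", "designation": "Professor", "salary": 90000},
    {"employee_id": 2, "name": "Ravi", "designation": "Lecturer", "salary": 60000},
]

def project(relation, attrs):
    """Projection: keep only the listed attributes of every row."""
    return [{a: row[a] for a in attrs} for row in relation]

def natural_join(r1, r2):
    """Natural join: combine rows that agree on all common attributes."""
    joined = []
    for a in r1:
        for b in r2:
            common = set(a) & set(b)
            if all(a[k] == b[k] for k in common):
                joined.append({**a, **b})
    return joined

private_info = project(employee_info, ["employee_id", "salary"])
public_info = project(employee_info, ["employee_id", "name", "designation"])

rebuilt = natural_join(public_info, private_info)
print(rebuilt == employee_info)  # True
```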
Transparency:
The user of a distributed database system should not be required to know where the data are
physically located nor how the data can be accessed at the specific local site. This characteristic,
called data transparency, can take several forms:
• Fragmentation transparency. Users are not required to know how a relation has been
fragmented.
• Replication transparency. Users view each data object as logically unique.
• Location transparency. Users are not required to know the physical location of the data.
TRANSACTION PROCESSING:
Access to the various data items in a distributed system is usually accomplished through
transactions, which must preserve the ACID properties.
There are two types of transaction.
Local transactions
Global transactions
The local transactions are those that access and update data in only one local database.
The global transactions are those that access and update data in several local databases.
System Structure:
Each site has its own local transaction manager, whose function is to ensure the ACID
properties of those transactions that execute at that site.
The various transaction managers cooperate to execute global transactions.
Consider an abstract model of a transaction system in which each site contains two subsystems:
• The transaction manager manages the execution of those transactions that access
data stored in a local site.
• The transaction coordinator coordinates the execution of the various transactions
initiated at that site.
Each transaction manager is responsible for:
• Maintaining a log for recovery purposes.
• Participating in an appropriate concurrency-control scheme to coordinate the concurrent
execution of the transactions executing at that site.
For each such transaction, the coordinator is responsible for:
• Starting the execution of the transaction.
• Breaking the transaction into a number of subtransactions and distributing these
subtransactions to the appropriate sites for execution.
• Coordinating the termination of the transaction, which may result in the transaction being
committed at all sites or aborted at all sites.
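The division of work between transaction managers and a coordinator can be sketched as follows. This is a deliberately simplified illustration, not a real commit protocol: the class `TransactionManager`, the function `run_global`, and the rule that a subtransaction fails when it touches an item the site does not store are all assumptions made for the example.

```python
class TransactionManager:
    """Local transaction manager for one site (simplified)."""

    def __init__(self, data):
        self.data = dict(data)   # committed local data items
        self.pending = {}        # uncommitted writes per transaction id
        self.log = []            # recovery log, here just a list of records

    def execute(self, txn_id, writes):
        """Run a subtransaction; fail if it touches an item not stored here."""
        if any(item not in self.data for item in writes):
            self.log.append((txn_id, "failed"))
            return False
        self.pending[txn_id] = dict(writes)
        self.log.append((txn_id, "ready"))
        return True

    def commit(self, txn_id):
        self.data.update(self.pending.pop(txn_id, {}))
        self.log.append((txn_id, "commit"))

    def abort(self, txn_id):
        self.pending.pop(txn_id, None)
        self.log.append((txn_id, "abort"))


def run_global(txn_id, subtransactions, managers):
    """Coordinator: distribute subtransactions, then commit or abort at all sites."""
    results = [managers[site].execute(txn_id, writes)
               for site, writes in subtransactions.items()]
    ok = all(results)
    for site in subtransactions:
        if ok:
            managers[site].commit(txn_id)
        else:
            managers[site].abort(txn_id)
    return ok


managers = {"S1": TransactionManager({"A": 100}),
            "S2": TransactionManager({"B": 50})}
t1 = run_global("T1", {"S1": {"A": 90}, "S2": {"B": 60}}, managers)
t2 = run_global("T2", {"S1": {"A": 0}, "S2": {"C": 1}}, managers)
print(t1, t2, managers["S1"].data, managers["S2"].data)
```

In this sketch T1 commits at both sites, while T2 is aborted at both sites because site S2 does not hold item C, so the write to A at site S1 is discarded as well.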
System Failure Modes
A distributed system may suffer from the same types of failure as a centralized system, plus failures specific to a distributed environment.
The basic failure types are:
• Failure of a site.
• Loss of messages.
• Failure of a communication link.
• Network partition.
5.2 OBJECT-BASED DATABASES:
Loosely speaking, an object corresponds to an entity in the E-R model.
The object-oriented paradigm is based on encapsulating code and data related to an
object into a single unit.
The object-oriented data model is a logical data model (like the E-R model).
It adapts the object-oriented programming paradigm (e.g., Smalltalk, C++) to
database systems.
OBJECT DATABASE CONCEPTS:
Object Structure
An object has associated with it:
A set of variables that contain the data for the object. The value of each variable
is itself an object.
A set of messages to which the object responds; each message may have zero,
one, or more parameters.
A set of methods, each of which is a body of code to implement a message; a
method returns a value as the response to the message
The physical representation of data is visible only to the implementor of the object
Messages and responses provide the only external interface to an object.
The term message does not necessarily imply physical message passing. Messages can
be implemented as procedure invocations.
Messages and Methods
Methods are programs written in a general-purpose language with the following features:
only variables in the object itself may be referenced directly
data in other objects are referenced only by sending messages.
Methods can be read-only or update methods
Read-only methods do not change the value of the object
Strictly speaking, every attribute of an entity must be represented by a variable and two
methods, one to read and the other to update the attribute
e.g., the attribute address is represented by a variable address and two messages
get-address and set-address.
For convenience, many object-oriented data models permit direct access to
variables of other objects.
Object Classes
Similar objects are grouped into a class; each such object is called an instance of its class
All objects in a class have the same
Variables, with the same types
message interface
methods
They may differ in the values assigned to variables
Example: Group objects for people into a person class
Classes are analogous to entity sets in the E-R model
Class Definition Example
class employee {
/*Variables */
string name;
string address;
date start-date;
int salary;
/* Messages */
int annual-salary();
string get-name();
string get-address();
int set-address(string new-address);
int employment-length();
};
Fig. Definition of the class employee
Methods to read and set the other variables are also needed with strict encapsulation
Methods are defined separately
string get-address( ) {
return address;
}
int set-address(string new-address) {
address = new-address;
}
While the method employment-length( ) would be defined as:
int employment-length() {
return today( ) - start-date;
}
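For comparison, the employee class can be sketched in Python. Underscores replace the hyphens because Python identifiers cannot contain hyphens. Two details are assumptions, since the notes do not define them: annual_salary treats salary as a monthly figure, and employment_length returns a number of days and accepts an optional `today` argument so the result is reproducible.

```python
from datetime import date


class Employee:
    def __init__(self, name, address, start_date, salary):
        # Leading underscores mark the variables as internal to the object.
        self._name = name
        self._address = address
        self._start_date = start_date
        self._salary = salary  # assumed to be a monthly figure

    # Read-only methods: they do not change the value of the object.
    def get_name(self):
        return self._name

    def get_address(self):
        return self._address

    def annual_salary(self):
        return 12 * self._salary

    def employment_length(self, today=None):
        today = today or date.today()
        return (today - self._start_date).days

    # Update method: changes the value of the object.
    def set_address(self, new_address):
        self._address = new_address


e = Employee("Ann", "1 Main St", date(2020, 1, 1), 1000)
e.set_address("2 Oak Ave")
print(e.get_address(), e.annual_salary(), e.employment_length(date(2020, 1, 11)))
```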
OBJECT-RELATIONAL FEATURES:
Inheritance
E.g., class of bank customers is similar to class of bank employees, although there are
differences
both share some variables and messages, e.g., name and address.
But there are variables and messages specific to each class e.g., salary for employees and
credit-rating for customers.
Every employee is a person; thus employee is a specialization of person
Similarly, customer is a specialization of person.
Create classes person, employee and customer
variables/messages applicable to all persons associated with class person.
variables/messages specific to employees associated with class employee; similarly for
customer
Place classes into a specialization/IS-A hierarchy
variables/messages belonging to class person are inherited by class employee as well as
customer
Result is a class hierarchy
Fig. 9.1 Class hierarchy for the bank example
Note analogy with ISA Hierarchy in the E-R model
Class Hierarchy Definition
class person{
string name;
string address;
string get-name( );
string get-address( );
int set-address(string new-address);
};
class customer isa person {
int credit-rating;
};
class employee isa person {
date start-date;
int salary;
int annual-salary( );
int employment-length( );
};
class officer isa employee {
int office-number;
int expense-account-number;
};
class teller isa employee {
int hours-per-week;
int station-number;
};
class secretary isa employee {
int hours-per-week;
string manager;
};
Full variable list for objects in the class officer:
office-number, expense-account-number: defined locally
start-date, salary: inherited from employee
name, address: inherited from person
Methods are inherited in the same way as variables.
Substitutability — any method of a class, say person, can be invoked equally well with any
object belonging to any subclass, such as subclass officer of person.
Class extent: set of all objects in the class. Two options:
1. Class extent of employee includes all officer, teller and secretary objects.
2. Class extent of employee includes only employee objects that are not in a subclass such as
officer, teller, or secretary
This is the usual choice in OO systems
Can access extents of subclasses to find all objects of
subtypes of employee
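The hierarchy, substitutability, and the two class-extent options can be illustrated with a short Python sketch. Only part of the hierarchy is modeled, and the sample objects and attribute values are made up for the example.

```python
class Person:
    def __init__(self, name, address):
        self.name, self.address = name, address

    def get_name(self):
        return self.name


class Employee(Person):
    def __init__(self, name, address, salary):
        super().__init__(name, address)
        self.salary = salary


class Officer(Employee):
    def __init__(self, name, address, salary, office_number):
        super().__init__(name, address, salary)
        self.office_number = office_number


objects = [
    Person("Lee", "12 Elm"),
    Employee("Kim", "3 Oak", 4000),
    Officer("Raj", "9 Pine", 7000, 21),
]

# Substitutability: get_name is defined on Person but works for every subclass.
names = [o.get_name() for o in objects]

# Option 1: the extent of Employee includes objects of its subclasses.
extent_with_subclasses = [o.name for o in objects if isinstance(o, Employee)]

# Option 2: the extent holds only objects whose most specific class is Employee.
extent_own = [o.name for o in objects if type(o) is Employee]

print(names, extent_with_subclasses, extent_own)
```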
Multiple Inheritance
With multiple inheritance a class may have more than one superclass.
The class/subclass relationship is represented by a directed acyclic graph (DAG)
Particularly useful when objects can be classified in more than one way, which
are independent of each other
Example of Multiple Inheritance
Fig. 9.2 Class DAG for banking example.
A class inherits variables and methods from all its superclasses
There is potential for ambiguity when a variable/message N with the same name is
inherited from two superclasses A and B
No problem if the variable/message is defined in a shared superclass
Otherwise, do one of the following
flag as an error,
rename variables (A.N and B.N)
choose one.
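Python is one language that takes the "choose one" route: when a name is inherited from two superclasses, the superclass listed first wins. The sketch below uses the placeholder classes A and B from the text, and also shows the renaming alternative through two explicitly named methods (`a_n` and `b_n` are hypothetical names).

```python
class A:
    def n(self):
        return "A.n"


class B:
    def n(self):
        return "B.n"


class C(A, B):
    # "Rename" option: expose both inherited versions under distinct names.
    def a_n(self):
        return A.n(self)

    def b_n(self):
        return B.n(self)


c = C()
print(c.n())              # A.n  (Python picks the first superclass listed)
print(c.a_n(), c.b_n())   # A.n B.n
```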
Object Identity
An object retains its identity even if some or all of the values of variables or definitions
of methods change over time.
Object identity is a stronger notion of identity than in programming languages or data
models not based on object orientation.
Value – data value; e.g. primary key value used in relational systems.
Name – supplied by user; used for variables in procedures.
Built-in – identity built into data model or programming language.
No user-supplied identifier is required.
This is the form of identity used in object-oriented systems.
Object Identifiers
Object identifiers are used to uniquely identify objects
Object identifiers are unique:
no two objects have the same identifier
each object has only one object identifier
Object identifiers can be stored as a field of an object, to refer to another object.
E.g., the spouse field of a person object may be an identifier of another person
object.
Object identifiers can be
system generated (created by database) or
external (such as social-security number)
System generated identifiers:
Are easier to use, but cannot be used across database systems
May be redundant if unique identifier already exists
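The difference between value-based and built-in identity can be seen in a small Python sketch. It is only an analogy: Python's built-in identity lasts for the lifetime of the in-memory object, whereas a database object identifier is persistent. The class and names are made up for the example.

```python
class Person:
    def __init__(self, name):
        self.name = name
        self.spouse = None  # holds a reference to another Person object

    def __eq__(self, other):
        # Value-based comparison: two persons with the same name are "equal".
        return isinstance(other, Person) and self.name == other.name


pat = Person("Pat")
other_pat = Person("Pat")
same_value = pat == other_pat     # equal by value
same_object = pat is other_pat    # but two distinct objects

sam = Person("Sam")
pat.spouse = sam
sam.name = "Samuel"               # the value changes, the identity does not
still_linked = pat.spouse is sam

print(same_value, same_object, still_linked, pat.spouse.name)
```

The last line shows that the reference stored in the spouse field still reaches the same object after its name has changed.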
Object Containment
Each component in a design may contain other components
Can be modeled as containment of objects. Objects containing other objects are called
composite objects.
Multiple levels of containment create a containment hierarchy
links interpreted as is-part-of, not is-a.
Allows data to be viewed at different granularities by different users.
Fig. 9.3 Containment hierarchy for bicycle design database
ODMG OBJECT MODEL, ODL, OQL:
The ODMG (Object Data Management Group) proposed a standard that is known as the
ODMG-93 or ODMG 1.0 standard.
The standard is made up of several parts, including the object model, the object definition
language (ODL), the object query language (OQL), and the bindings to object-oriented
programming languages
Object Model of ODMG:
The ODMG object model is the data model upon which the object definition language
(ODL) and object query language (OQL) are based.
It is meant to provide a standard data model for object databases, just as SQL describes a
standard data model for relational databases.
It also provides a standard terminology in a field where the same terms were sometimes
used to describe different concepts.
Objects and Literals
Objects and literals are the basic building blocks of the object model.
The main difference between the two is that an object has both an object identifier and
a state (or current value), whereas a literal has a value (state) but no object identifier.
An object has five aspects: Identifier, name, lifetime, structure, and creation
IDENTIFIER:
The object identifier is a unique system-wide identifier (or Object_id).
Every object must have an object identifier.
NAMES:
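Some objects may optionally be given a unique name within a particular ODMS (object data
management system); this name can be used to locate the object, and the system should return
the object given that name.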
LIFETIME:
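The lifetime of an object specifies whether it is a persistent object (a database object) or a
transient object (an object in an executing program that disappears after the program terminates).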
STRUCTURE:
The structure of an object specifies how the object is constructed by using the type
constructors.
CREATION:
Object creation refers to the manner in which an object can be created.
There are three types of literals: atomic, structured, and collection.
ATOMIC LITERALS:
Atomic literals correspond to the values of basic data types and are predefined. The basic
data types of the object model include long, short, and unsigned integer numbers, regular and
double precision floating point numbers (float, double), Boolean values (boolean), single
characters (char), character strings (string), and enumeration types (enum), among others.
STRUCTURED LITERALS:
Structured literals correspond roughly to values that are constructed using the tuple
constructor. The built-in structured literals include Date, Interval, Time, and Timestamp
COLLECTION LITERALS:
Collection literals specify a literal value that is a collection of objects or values but the
collection itself does not have an Object_id.
Inheritance in the Object Model of ODMG
In the ODMG object model, two types of inheritance relationships exist:
Behavior-only inheritance
State plus behavior inheritance.
Behavior inheritance is also known as ISA or interface inheritance, and is specified by the
colon (:) notation.
The other inheritance relationship, called EXTENDS inheritance, is specified by the
keyword extends. It is used to inherit both state and behavior strictly among classes, so both the
supertype and the subtype must be classes.
Built-in Interfaces and Classes in the Object Model:
All interfaces, such as Collection, Date, and Time, inherit the basic Object interface. In the
object model, there is a distinction between collection objects, whose state contains multiple objects
or literals, versus atomic (and structured) objects, whose state is an individual object or literal.
The ODMG object model uses exceptions for reporting errors or particular conditions.
ATOMIC (USER-DEFINED) OBJECTS:
In the object model, any user-defined object that is not a collection object is called
an atomic object.
Extents, Keys, and Factory Objects
1. EXTENT
In the ODMG object model, the database designer can declare an extent for any object type
that is defined via a class declaration
2. KEY
A key consists of one or more properties (attributes or relationships) whose values are
constrained to be unique for each object in the extent.
3. FACTORY OBJECT:
Factory object—an object that can be used to generate or create individual objects via its
operations.
Object Definition Language (ODL):
The ODL is designed to support the semantic constructs of the ODMG object model and is
independent of any particular programming language.
Its main use is to create object specifications—that is, classes and interfaces. Hence, ODL is
not a full programming language. A user can specify a database schema in ODL independently of
any programming language, and then use the specific language bindings to specify how ODL
constructs can be mapped to constructs in specific programming languages, such as C++, Smalltalk,
and Java.
Object Query Language (OQL):
The object query language OQL is the query language proposed for the ODMG object
model.
It is designed to work closely with the programming languages for which an ODMG binding
is defined, such as C++, Smalltalk, and Java.
The OQL syntax for queries is similar to the syntax of the relational standard query language
SQL with additional features for ODMG concepts, such as object identity, complex objects,
operations, inheritance, polymorphism, and relationships.
1. Simple OQL Queries, Database Entry Points, and Iterator Variables
The basic OQL syntax is a select ... from ... where ... structure, as it is for SQL.
For example, the query to retrieve the names of all departments in the college of ‘Engineering’ can
be written as follows:
Q0: select D.Dname
from D in DEPARTMENTS
where D.College = 'Engineering';
There are three syntactic options for specifying iterator variables:
D in DEPARTMENTS
DEPARTMENTS D
DEPARTMENTS AS D
Query Results and Path Expressions
In general, the result of a query can be of any type that can be expressed in the ODMG
object model.
A query does not have to follow the select ... from ... where ...structure. For example, the
query
Q1: DEPARTMENTS;
Returns a reference to the collection of all persistent DEPARTMENT objects.
Q1A: CS_DEPARTMENT;
Returns a reference to that individual object of type DEPARTMENT.
Q2: CS_DEPARTMENT.Chair;
Q2A: CS_DEPARTMENT.Chair.Rank;
Q2B: CS_DEPARTMENT.Has_faculty;
The first expression Q2 returns an object of type FACULTY, because that is the type of the
attribute Chair of the DEPARTMENT class.
The second expression Q2A is similar, except that it returns the Rank of this FACULTY object (the
Computer Science chair) rather than the object reference.
Path expressions Q2 and Q2A return single values, because the
attributes Chair (of DEPARTMENT) and Rank (of FACULTY) are both single-valued and they are
applied to a single object. The third expression, Q2B, is different; it returns an object of
type set<FACULTY> even when applied to a single object, because that is the type of the
relationship Has_faculty of the DEPARTMENT class.
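Query Q0 and the path expressions Q2, Q2A, and Q2B have close analogues over ordinary in-memory objects. The Python sketch below is an illustration only: the dataclasses, the sample faculty and departments, and the lowercase attribute names are assumptions, and no OQL engine is involved.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so objects can be in sets
class Faculty:
    name: str
    rank: str


@dataclass(eq=False)
class Department:
    dname: str
    college: str
    chair: Faculty
    has_faculty: set = field(default_factory=set)


smith = Faculty("Smith", "Professor")
jones = Faculty("Jones", "Associate Professor")
cs_department = Department("Computer Science", "Engineering", smith, {smith, jones})
mathematics = Department("Mathematics", "Science", Faculty("Wong", "Professor"))

# Entry point: a named collection of all department objects.
departments = [cs_department, mathematics]

# Q0: select D.Dname from D in DEPARTMENTS where D.College = 'Engineering'
q0 = [d.dname for d in departments if d.college == "Engineering"]

q2 = cs_department.chair          # Q2: a single Faculty object
q2a = cs_department.chair.rank    # Q2A: a single value
q2b = cs_department.has_faculty   # Q2B: a set of Faculty objects

print(q0, q2.name, q2a, len(q2b))
```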
Other Features of OQL
Specifying Views as Named Queries. The view mechanism in OQL uses the concept of
a named query. The define keyword is used to specify an identifier of the named query, which
must be a unique name among all named objects, class names, method names, and function names
in the schema. If the identifier has the same name as an existing named query, then the new
definition replaces the previous definition. Once defined, a query definition is persistent until it is
redefined or deleted. A view can also have parameters (arguments) in its definition.
For example, the following view V1 defines a named query Has_minors to retrieve the set of
objects for students minoring in a given department:
V1: define Has_minors(Dept_name) as
select S
from S in STUDENTS
where S.Minors_in.Dname = Dept_name;
5.3 XML DATABASES:
XML: Extensible Markup Language
Defined by the WWW Consortium (W3C)
Derived from SGML (Standard Generalized Markup Language), but simpler to use than
SGML
XML HIERARCHICAL MODEL:
Documents have tags giving extra information about sections of the document
E.g. <title> XML </title> <slide> Introduction …</slide>
Extensible, unlike HTML
Users can add new tags, and separately specify how the tag should be handled
for display
The ability to specify new tags, and to create nested tag structures make XML a great
way to exchange data, not just documents.
Much of the use of XML has been in data exchange applications, not as a
replacement for HTML
Tags make data (relatively) self-documenting
E.g.
<university>
<department>
<dept_name> Comp. Sci. </dept_name>
<building> Taylor </building>
<budget> 100000 </budget>
</department>
<course>
<course_id> CS-101 </course_id>
<title> Intro. to Computer Science </title>
<dept_name> Comp. Sci. </dept_name>
<credits> 4 </credits>
</course>
</university>
Tag: label for a section of data
Element: section of data beginning with <tagname> and ending with matching
</tagname>
Elements must be properly nested
Proper nesting
<course> … <title> …. </title> </course>
Improper nesting
<course> … <title> …. </course> </title>
Formally: every start tag must have a unique matching end tag, that is in the
context of the same parent element.
Every document must have a single top-level element
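These well-formedness rules can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module, which raises ParseError on a document that is not well formed; the helper name `is_well_formed` and the sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


proper = "<course><title>Intro</title></course>"
improper = "<course><title>Intro</course></title>"   # tags closed in the wrong order
two_roots = "<course/><course/>"                     # more than one top-level element

print(is_well_formed(proper), is_well_formed(improper), is_well_formed(two_roots))
```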
Example of Nested Elements
<purchase_order>
<identifier> P-101 </identifier>
<purchaser> …. </purchaser>
<itemlist>
<item>
<identifier> RS1 </identifier>
<description> Atom powered rocket sled </description>
<quantity> 2 </quantity>
<price> 199.95 </price>
</item>
<item>
<identifier> SG2 </identifier>
<description> Superb glue </description>
<quantity> 1 </quantity>
<unit-of-measure> liter </unit-of-measure>
<price> 29.95 </price>
</item>
</itemlist>
</purchase_order>
Attributes
Elements can have attributes
<course course_id="CS-101">
<title> Intro. to Computer Science</title>
<dept_name> Comp. Sci. </dept_name>
<credits> 4 </credits>
</course>
Attributes are specified by name=value pairs inside the starting tag of an element
An element may have several attributes, but each attribute name can only occur once
<course course_id="CS-101" credits="4">
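A parser exposes attributes separately from subelements. The following sketch, again using Python's xml.etree.ElementTree, reads the attributes of a course element and shows that repeating an attribute name makes the document ill formed. The variable names are chosen for the example.

```python
import xml.etree.ElementTree as ET

course = ET.fromstring(
    '<course course_id="CS-101" credits="4">'
    "<title>Intro. to Computer Science</title>"
    "</course>"
)

course_id = course.get("course_id")   # attribute value, always a string
attributes = dict(course.attrib)      # all name=value pairs of the start tag
title = course.findtext("title")      # text of a subelement

# An attribute name may occur only once per element.
try:
    ET.fromstring('<course course_id="CS-101" course_id="CS-102"/>')
    duplicate_allowed = True
except ET.ParseError:
    duplicate_allowed = False

print(course_id, attributes, title, duplicate_allowed)
```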
XML SCHEMA and DTD:
Database schemas constrain what information can be stored, and the data types of stored
values
XML documents are not required to have an associated schema
However, schemas are very important for XML data exchange
Otherwise, a site cannot automatically interpret data received from another site
Two mechanisms for specifying XML schema
Document Type Definition (DTD)
XML Schema
DTD: The type of an XML document can be specified using a DTD
A DTD constrains the structure of XML data
What elements can occur
What attributes can/must an element have
What sub elements can/must occur inside each element, and how many times.
DTD does not constrain data types
All values represented as strings in XML
DTD syntax
<!ELEMENT element (subelements-specification) >
<!ATTLIST element (attributes) >
Element Specification in DTD
Sub elements can be specified as
names of elements, or
#PCDATA (parsed character data), i.e., character strings
EMPTY (no subelements) or ANY (anything can be a subelement)
Example
<!ELEMENT department (dept_name, building, budget)>
<!ELEMENT dept_name (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
Sub element specification may have regular expressions
<!ELEMENT university ( ( department | course | instructor | teaches )+)>
Notation:
– “|” - alternatives
– “+” - 1 or more occurrences
– “*” - 0 or more occurrences
University DTD
<!DOCTYPE university [
<!ELEMENT university ( (department|course|instructor|teaches)+)>
<!ELEMENT department (dept_name, building, budget)>
<!ELEMENT course (course_id, title, dept_name, credits)>
<!ELEMENT instructor (IID, name, dept_name, salary)>
<!ELEMENT teaches (IID, course_id)>
<!ELEMENT dept_name (#PCDATA)>
<!ELEMENT building (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT course_id (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT credits (#PCDATA)>
<!ELEMENT IID (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT salary (#PCDATA)>
]>
Attribute Specification in DTD
Attribute specification : for each attribute
Name
Type of attribute
CDATA
ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
(more on this later)
Whether
mandatory (#REQUIRED)
has a default value (value),
or neither (#IMPLIED)
Examples
<!ATTLIST course course_id CDATA #REQUIRED>, or
<!ATTLIST course
course_id ID #REQUIRED
dept_name IDREF #REQUIRED
instructors IDREFS #IMPLIED >
IDs and IDREFs
An element can have at most one attribute of type ID
The ID attribute value of each element in an XML document must be distinct
Thus the ID attribute value is an object identifier
An attribute of type IDREF must contain the ID value of an element in the same
document
An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value
must contain the ID value of an element in the same document
University DTD with Attributes
University DTD with ID and IDREF attribute types.
<!DOCTYPE university-3 [
<!ELEMENT university ( (department|course|instructor)+)>
<!ELEMENT department ( building, budget )>
<!ATTLIST department
dept_name ID #REQUIRED >
<!ELEMENT course (title, credits )>
<!ATTLIST course
course_id ID #REQUIRED
dept_name IDREF #REQUIRED
instructors IDREFS #IMPLIED >
<!ELEMENT instructor ( name, salary )>
<!ATTLIST instructor
IID ID #REQUIRED
dept_name IDREF #REQUIRED >
· · · declarations for title, credits, building,
budget, name and salary · · ·
]>
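Python's standard library parses XML but does not validate documents against a DTD, so the ID and IDREF rules have to be checked by hand in a sketch like the one below. The tables `ID_ATTR` and `REF_ATTRS` are copied manually from the DTD above, and the sample document, including the ID values, is made up for the example.

```python
import xml.etree.ElementTree as ET

# Which attribute is the ID, and which are IDREF/IDREFS, per element type.
ID_ATTR = {"department": "dept_name", "course": "course_id", "instructor": "IID"}
REF_ATTRS = {"course": ["dept_name", "instructors"], "instructor": ["dept_name"]}


def check_ids(root):
    """Return (duplicate IDs, referenced IDs that do not exist)."""
    ids = [e.get(ID_ATTR[e.tag]) for e in root if e.tag in ID_ATTR]
    duplicates = {i for i in ids if ids.count(i) > 1}
    dangling = set()
    for e in root:
        for attr in REF_ATTRS.get(e.tag, []):
            # An IDREFS value is a blank-separated list of IDs.
            for ref in (e.get(attr) or "").split():
                if ref not in ids:
                    dangling.add(ref)
    return duplicates, dangling


doc = ET.fromstring("""
<university>
  <department dept_name="CompSci"><building>Taylor</building><budget>100000</budget></department>
  <course course_id="CS-101" dept_name="CompSci" instructors="I10101 I83821">
    <title>Intro. to Computer Science</title><credits>4</credits>
  </course>
  <instructor IID="I10101" dept_name="CompSci"><name>Srinivasan</name><salary>65000</salary></instructor>
</university>
""")

duplicates, dangling = check_ids(doc)
print(duplicates, dangling)
```

Here the reference I83821 is reported as dangling because no element in the document carries that ID.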
Limitations of DTDs
No typing of text elements and attributes
All values are strings, no integers, reals, etc.
Difficult to specify unordered sets of subelements
Order is usually irrelevant in databases (unlike in the document-layout
environment from which XML evolved)
(A | B)* allows specification of an unordered set, but
Cannot ensure that each of A and B occurs only once
IDs and IDREFs are untyped
The instructors attribute of a course may contain a reference to another course,
which is meaningless
instructors attribute should ideally be constrained to refer to instructor
elements
XML SCHEMA:
XML Schema is a more sophisticated schema language which addresses the drawbacks
of DTDs. Supports
Typing of values
E.g. integer, string, etc
Also, constraints on min/max values
User-defined, complex types
Many more features, including
uniqueness and foreign key constraints, inheritance
XML Schema is itself specified in XML syntax, unlike DTDs
More-standard representation, but verbose
XML Schema is integrated with namespaces
BUT: XML Schema is significantly more complicated than DTDs.
XML Schema Version of Univ. DTD
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="university" type="UniversityType" />
<xs:element name="department">
<xs:complexType>
<xs:sequence>
<xs:element name="dept_name" type="xs:string"/>
<xs:element name="building" type="xs:string"/>
<xs:element name="budget" type="xs:decimal"/>
</xs:sequence>
</xs:complexType>
</xs:element>
….
<xs:element name="instructor">
<xs:complexType>
<xs:sequence>
<xs:element name="IID" type="xs:string"/>
<xs:element name="name" type="xs:string"/>
<xs:element name="dept_name" type="xs:string"/>
<xs:element name="salary" type="xs:decimal"/>
</xs:sequence>
</xs:complexType>
</xs:element>
….
<xs:complexType name="UniversityType">
<xs:sequence>
<xs:element ref="department" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="course" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="instructor" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="teaches" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:schema>
Choice of "xs:" was ours -- any other namespace prefix could be chosen
Element "university" has type "UniversityType", which is defined separately
xs:complexType is used later to create the named complex type
"UniversityType"
XQUERY:
Translation of information from one XML schema to another
Querying on XML data
The above two are closely related, and handled by the same tools
Standard XML querying/translation languages
XPath: Simple language consisting of path expressions
XSLT: Simple language designed for translation from XML to XML and XML
to HTML
XQuery: An XML query language with a rich set of features
XPath
XPath is used to address (select) parts of documents using path expressions
A path expression is a sequence of steps separated by “/”
Think of file names in a directory hierarchy
Result of path expression: set of values that along with their containing
elements/attributes match the specified path
E.g. /university-3/instructor/name evaluated on the university-3 data we saw earlier
returns
<name>Srinivasan</name>
<name>Brandt</name>
E.g. /university-3/instructor/name/text( ) returns the same names, but without the
enclosing tags
The initial “/” denotes root of the document (above the top-level tag)
Path expressions are evaluated left to right
Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
E.g. /university-3/course[credits >= 4]
returns course elements with a credits value greater than or equal to 4
/university-3/course[credits] returns course elements containing a
credits subelement
Attributes are accessed using “@”
E.g. /university-3/course[credits >= 4]/@course_id
returns the course identifiers of courses with credits >= 4
IDREF attributes are not dereferenced automatically (more on this later)
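Python's xml.etree.ElementTree supports a limited subset of XPath, which is enough to try the path expressions above. Two caveats apply: paths are written relative to the root element rather than starting with "/", and numeric comparisons such as credits >= 4 are not supported inside predicates, so that filter is done in Python here. The sample data is made up, and one course is deliberately left without a credits subelement to show the [credits] predicate.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<university-3>
  <instructor IID="10101"><name>Srinivasan</name></instructor>
  <instructor IID="83821"><name>Brandt</name></instructor>
  <course course_id="CS-101"><title>Intro. to Computer Science</title><credits>4</credits></course>
  <course course_id="BIO-301"><title>Genetics</title><credits>3</credits></course>
  <course course_id="SEM-001"><title>Seminar</title></course>
</university-3>
""")

# /university-3/instructor/name/text()
names = [n.text for n in doc.findall("instructor/name")]

# /university-3/course[credits]: courses that contain a credits subelement
with_credits = doc.findall("course[credits]")

# /university-3/course[credits >= 4]/@course_id, with the comparison done in Python
ids = [c.get("course_id") for c in with_credits
       if int(c.findtext("credits")) >= 4]

print(names, len(with_credits), ids)
```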
Functions in XPath
XPath provides several functions
The function count() at the end of a path counts the number of elements in the
set generated by the path
E.g. /university-2/instructor[count(./teaches/course)> 2]
– Returns instructors teaching more than 2 courses (on university-2
schema)
There is also a function for testing the position (1, 2, ..) of a node w.r.t. its siblings
The Boolean connectives “and” and “or” and the function not() can be used in predicates
IDREFs can be referenced using function id()
id() can also be applied to sets of references such as IDREFS and even to strings
containing multiple references separated by blanks
E.g. /university-3/course/id(@dept_name)
returns all department elements referred to from the dept_name attribute
of course elements.
XQuery
XQuery is a general purpose query language for XML data
Standardized by the World Wide Web Consortium (W3C); XQuery 1.0 became a W3C Recommendation in January 2007
The textbook description is based on a January 2005 draft of the standard. The final version
may differ, but the major features are likely to stay unchanged.
XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and
XML-QL
XQuery uses a
for … let … where … order by … return …
syntax
for ⇔ SQL from
where ⇔ SQL where
order by ⇔ SQL order by
return ⇔ SQL select
let allows temporary variables, and has no equivalent in SQL
FLWOR Syntax in XQuery
For clause uses XPath expressions, and variable in for clause ranges over values in the
set returned by XPath
Simple FLWOR expression in XQuery
Find all courses with credits > 3, with each result enclosed in a <course_id> ..
</course_id> tag
for $x in /university-3/course
let $courseId := $x/@course_id
where $x/credits > 3
return <course_id> { $courseId } </course_id>
Items in the return clause are XML text unless enclosed in {}, in which case they
are evaluated
The let clause is not really needed in this query, and the selection can be done in XPath. The query
can be written as:
for $x in /university-3/course[credits > 3]
return <course_id> { $x/@course_id } </course_id>
Alternative notation for constructing elements:
return element course_id { $x/@course_id }
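The roles of the for, let, where and return clauses can be mirrored step by step in ordinary Python. The sketch below is not XQuery; it reuses a hypothetical university-3 fragment and, as a simplification, writes the course identifier as the text of the new course_id element.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of the university-3 data.
DOC = """
<university-3>
  <course course_id="CS-101"><title>Intro</title><credits>4</credits></course>
  <course course_id="BIO-301"><title>Genetics</title><credits>3</credits></course>
  <course course_id="HIS-100"><title>Seminar</title></course>
</university-3>
"""

root = ET.fromstring(DOC)

results = []
for x in root.findall("course"):             # for $x in /university-3/course
    course_id = x.get("course_id")           # let $courseId := $x/@course_id
    credits = x.findtext("credits")          # None if there is no credits subelement
    if credits is not None and int(credits) > 3:   # where $x/credits > 3
        out = ET.Element("course_id")        # return <course_id> { ... } </course_id>
        out.text = course_id                 # simplification: identifier stored as element text
        results.append(ET.tostring(out, encoding="unicode"))

print(results)
```

The loop variable plays the part of $x, and the if test plays the part of the where clause.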
XSLT
A stylesheet stores formatting options for a document, usually separately from the document
E.g. an HTML style sheet may specify font colors and sizes for headings, etc.
The XML Stylesheet Language (XSL) was originally designed for generating HTML
from XML
XSLT is a general-purpose transformation language
Can translate XML to XML, and XML to HTML
XSLT transformations are expressed using rules called templates
Templates combine selection using XPath with construction of results
Storage of XML Data
XML data can be stored in
Non-relational data stores
Flat files
– Natural for storing XML
– But has all problems discussed in Chapter 1 (no concurrency, no
recovery, …)
XML database
– Database built specifically for storing XML data, supporting
DOM model and declarative querying
– Currently no commercial-grade systems
Relational databases
Data must be translated into relational form
Advantage: mature database systems
Disadvantages: overhead of translating data and queries
5.4 INFORMATION RETRIEVAL:
Information retrieval is the process of retrieving documents from a collection in response to
a query (or a search request) by a user.
IR CONCEPTS:
IR systems use a user’s information need expressed as a free-form search request.
An IR system can be characterized at different levels: by types of users, types of data,
and the types of the information need, along with the size and scale of the information repository it
addresses.
Types of Users. The user may be an expert user, who is searching for specific information
that is clear in his/her mind and forms relevant queries for the task, or a layperson user with a
generic information need.
Types of Data. Search systems can be tailored to specific types of data. For example, the
problem of retrieving information about a specific topic may be handled more efficiently by
customized search systems that are built to collect and retrieve only information related to that
specific topic.
Types of Information Need. In the context of Web search, users’ information needs may be
defined as navigational, informational, or transactional.
▪ Navigational search refers to finding a particular piece of information that a user
needs quickly.
▪ The purpose of informational search is to find current information about a topic.
▪ The goal of transactional search is to reach a site where further interaction happens.
RETRIEVAL MODELS:
There are three main statistical models (Boolean, vector space, and probabilistic) and
the semantic model.
Boolean Model
In this model, documents are represented as a set of terms.
Queries are formulated as a combination of terms using the standard Boolean logic set-
theoretic operators such as AND, OR and NOT.
Retrieval and relevance are binary concepts in this model, so the retrieved documents are an
“exact match” for the query: a document either satisfies the query or it does not. There is no
notion of ranking of resulting documents. All retrieved documents are considered equally important.
Boolean retrieval models lack sophisticated ranking algorithms and are among the earliest
and simplest information retrieval models.
These models make it easy to associate metadata information and write queries that match
the contents of the documents as well as other properties of documents, such as date of creation,
author, and type of document.
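A minimal sketch of Boolean retrieval, assuming a hypothetical three-document collection in which each document is already reduced to a set of terms. AND, OR and NOT map onto set intersection, union and complement; the result is an unranked set of document identifiers.

```python
# Hypothetical collection: each document is represented as a set of terms.
docs = {
    "d1": {"database", "concepts", "design"},
    "d2": {"database", "retrieval"},
    "d3": {"information", "retrieval", "concepts"},
}

def matching(term):
    """Return the set of document ids whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_ids = set(docs)

# database AND concepts: set intersection
q_and = matching("database") & matching("concepts")

# database OR retrieval: set union
q_or = matching("database") | matching("retrieval")

# retrieval AND NOT database: NOT is the complement with respect to all documents
q_not = matching("retrieval") & (all_ids - matching("database"))

print(sorted(q_and), sorted(q_or), sorted(q_not))
```

Because the answers are plain sets, every retrieved document is treated as equally important, as stated above.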
Vector Space Model
The vector space model provides a framework in which term weighting, ranking of retrieved
documents, and relevance feedback are possible.
Documents are represented as features and weights of term features in an n-dimensional
vector space of terms.
Features are a subset of the terms in a set of documents that are deemed most relevant to an
IR search for this particular set of documents.
The process of selecting these important terms (features) and their properties as a sparse
(limited) list out of the very large number of available terms is independent of the model
specification.
The query is also specified as a terms vector (vector of features), and this is compared to the
document vectors for similarity/relevance assessment.
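One common way to compare a query vector with document vectors is cosine similarity. The notes do not name a particular similarity measure at this point, so treat the choice of cosine, the three features and the weights below as assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical weights over the features (database, retrieval, design).
doc_vectors = {
    "d1": [2, 0, 1],
    "d2": [1, 1, 0],
    "d3": [0, 3, 0],
}
query = [1, 0, 1]

# Rank documents by decreasing similarity to the query vector.
ranked = sorted(doc_vectors,
                key=lambda d: cosine(doc_vectors[d], query),
                reverse=True)
print(ranked)
```

Unlike the Boolean model, the result is an ordered list, so partially matching documents can still be returned with a lower rank.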
In the vector model, the document term weight wij (for term i in document j) is based on
some variation of the TF (term frequency) or TF-IDF (term frequency-inverse document
frequency) scheme.
TF-IDF is a statistical weight measure that is used to evaluate the importance of a document
word in a collection of documents. It weights term i in document Dj by the product of the
normalized frequency of the term in the document (TFij) and the inverse document frequency of
the term (IDFi):
wij = TFij × IDFi
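IDF values can be easily computed for a fixed collection of documents. In the case of Web
search engines, taking a representative sample of documents approximates IDF computation. The
following formulas can be used:
TFij = fij / Σk fkj
IDFi = log(N / ni)
Here fij is the number of occurrences of term i in document Dj, the sum in the denominator runs
over all terms k in Dj, N is the total number of documents in the collection, and ni is the
number of documents that contain term i.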
The relevance of the document with respect to a query (rel(Dj,Q)) is directly measured as the sum
of the TF-IDF values of the terms in the query Q:
rel(Dj,Q) = Σ(i in Q) TFij × IDFi
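The TF, IDF and relevance computations described above can be sketched as follows. The collection is hypothetical, and two choices are assumptions of this sketch: TF is normalized by the document length, and IDF is taken as log(N / n_i).

```python
import math
from collections import Counter

# Hypothetical collection: document id -> list of terms.
docs = {
    "D1": "database systems store data".split(),
    "D2": "information retrieval systems rank documents".split(),
    "D3": "database retrieval".split(),
}
N = len(docs)  # total number of documents

def tf(term, doc_id):
    """Normalized frequency of `term` in the document (assumed: count / length)."""
    counts = Counter(docs[doc_id])
    return counts[term] / len(docs[doc_id])

def idf(term):
    """Inverse document frequency (assumed: log(N / n_i)); 0 for unseen terms."""
    n_i = sum(1 for words in docs.values() if term in words)
    return math.log(N / n_i) if n_i else 0.0

def rel(doc_id, query_terms):
    """Relevance: sum of the TF-IDF values of the query terms in the document."""
    return sum(tf(t, doc_id) * idf(t) for t in query_terms)

query = ["database", "retrieval"]
scores = {d: rel(d, query) for d in docs}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

D3 ranks first because both query terms make up its whole content; D1 and D2 each contain only one of the two terms.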
Probabilistic Model
In the probabilistic model, a more concrete and definitive approach is taken:
ranking documents by their estimated probability of relevance with respect to the query and the
document. This is the basis of the Probability Ranking Principle developed by Robertson.
In the probabilistic framework, the IR system has to decide whether the documents belong
to the relevant set or the nonrelevant set for a query.
To make this decision, it is assumed that a predefined relevant set and nonrelevant set exist
for the query, and the task is to calculate the probability that the document belongs to the relevant
set and compare that with the probability that the document belongs to the nonrelevant set.
Given the document representation D of a document, estimating the relevance R and
nonrelevance NR of that document involves computation of conditional probability P(R|D) and
P(NR|D).
These conditional probabilities can be calculated using Bayes’ Rule:
P(R|D) = P(D|R) × P(R)/P(D)
P(NR|D) = P(D|NR) × P(NR)/P(D)
A document D is classified as relevant if P(R|D) > P(NR|D). Discarding the constant P(D),
this is equivalent to saying that a document is relevant if:
P(D|R) × P(R) > P(D|NR) × P(NR)
The likelihood ratio P(D|R)/P(D|NR) is used as a score to determine the likelihood of the
document with representation D belonging to the relevant set.
The term independence or Naïve Bayes assumption is used to estimate P(D|R) using
computation of P(ti|R) for term ti. The likelihood ratios P(D|R)/P(D|NR) of documents are used as a
proxy for ranking based on the assumption that highly ranked documents will have a high
likelihood of belonging to the relevant set.
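A minimal sketch of ranking by the likelihood ratio under the term independence assumption. The per-term probabilities P(t|R) and P(t|NR) are hypothetical numbers, and as a simplification only the terms present in a document contribute to its score. Logarithms are used so that the product of per-term ratios becomes a sum.

```python
import math

# Hypothetical estimates of P(t|R) and P(t|NR) for each term t.
p_rel = {"database": 0.8, "retrieval": 0.6, "cooking": 0.1}
p_nonrel = {"database": 0.3, "retrieval": 0.2, "cooking": 0.5}

def log_likelihood_ratio(doc_terms):
    """log(P(D|R) / P(D|NR)), with P(D|.) taken as a product over the terms in D."""
    return sum(math.log(p_rel[t] / p_nonrel[t]) for t in doc_terms)

def is_relevant(doc_terms, prior_rel=0.5):
    """Decide P(D|R) x P(R) > P(D|NR) x P(NR), rearranged as a test on the log ratio."""
    threshold = math.log((1 - prior_rel) / prior_rel)
    return log_likelihood_ratio(doc_terms) > threshold

docs = {
    "d1": ["database", "retrieval"],
    "d2": ["cooking"],
    "d3": ["database", "cooking"],
}

# Rank by decreasing likelihood ratio.
ranked = sorted(docs, key=lambda d: log_likelihood_ratio(docs[d]), reverse=True)
print(ranked, [is_relevant(docs[d]) for d in ranked])
```

With equal priors the threshold is 0, so a document is classified as relevant exactly when its likelihood ratio exceeds 1.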
Semantic Model
Semantic approaches include different levels of analysis, such as morphological, syntactic,
and semantic analysis, to retrieve documents more effectively.
In morphological analysis, roots and affixes are analyzed to determine the parts of speech
of the words.
Morphological analysis is followed by syntactic analysis, which parses and analyzes complete
phrases in documents.
Finally, the semantic methods have to resolve word ambiguities and/or generate relevant
synonyms based on the semantic relationships between levels of structural entities in documents.
The development of a sophisticated semantic system requires complex knowledge bases of
semantic information as well as retrieval heuristics.
These systems often require techniques from artificial intelligence and expert systems.
Knowledge bases like Cyc and WordNet have been developed for use in knowledge-based IR
systems based on semantic models.
QUERIES IN IR SYSTEMS:
Keyword Queries
Keyword-based queries are the simplest and most commonly used forms of IR queries: the
user just enters keyword combinations to retrieve documents.
The query keyword terms are implicitly connected by a logical AND operator. A query
such as ‘database concepts’ retrieves documents that contain both the words ‘database’
and ‘concepts’ at the top of the retrieved results. In addition, most systems also
retrieve documents that contain only ‘database’ or only ‘concepts’ in their text. Some
systems remove most commonly occurring words (such as a, the, of, and so on,
called stopwords) as a preprocessing step before sending the filtered query keywords
to the IR engine. Most IR systems do not pay attention to the ordering of
these words in the query. All retrieval models provide support for keyword queries.
Boolean Queries
Some IR systems allow using the AND, OR, NOT, ( ), +, and – Boolean operators in
combinations of keyword formulations. AND requires that both terms be found.
OR lets either term be found. NOT means any record containing the second term
will be excluded. ‘( )’ means the Boolean operators can be nested using parentheses.
‘+’ is equivalent to AND, requiring the term; the ‘+’ should be placed directly in front
of the search term. ‘–’ is equivalent to AND NOT and means to exclude the term; the ‘–’ should be placed directly in front of the search term not wanted. Complex
Boolean queries can be built out of these operators and their combinations, and
they are evaluated according to the classical rules of Boolean algebra. No ranking is
possible, because a document either satisfies such a query (is “relevant”) or does not
satisfy it (is “nonrelevant”). A document is retrieved for a Boolean query if the
query is logically true as an exact match in the document. Users generally do not use
combinations of these complex Boolean operators, and IR systems support a
restricted version of these set operators. Boolean retrieval models can directly support
different Boolean operator implementations for these kinds of queries.
Phrase Queries
When documents are represented using an inverted keyword index for searching,
the relative order of the terms in the document is lost. In order to perform exact
phrase retrieval, these phrases should be encoded in the inverted index or implemented
differently (with relative positions of word occurrences in documents). A
phrase query consists of a sequence of words that makes up a phrase. The phrase is
generally enclosed within double quotes. Each retrieved document must contain at
least one instance of the exact phrase. Phrase searching is a more restricted and specific
version of proximity searching that we mention below. For example, a phrase
searching query could be ‘conceptual database design’. If phrases are indexed by the
retrieval model, any retrieval model can be used for these query types. A phrase thesaurus
may also be used in semantic models for fast dictionary searching for
phrases.
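A minimal sketch of exact phrase retrieval using an inverted index that also records the relative positions of word occurrences. The two documents and the function name phrase_search are hypothetical.

```python
from collections import defaultdict

# Hypothetical collection: document id -> text.
docs = {
    1: "conceptual database design is hard",
    2: "database design can be conceptual",
}

# Positional inverted index: term -> {doc_id: [positions of the term]}.
index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        index[word][doc_id].append(pos)

def phrase_search(phrase):
    """Return ids of documents containing the words of `phrase` consecutively."""
    words = phrase.split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # The k-th word of the phrase must occur at position start + k.
            if all(start + k in index[w].get(doc_id, [])
                   for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

print(phrase_search("conceptual database design"))
print(phrase_search("database design"))
```

Document 2 contains all three words of the first phrase but not in consecutive order, so only the positional information can rule it out.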
Proximity Queries
Proximity search refers to a search that accounts for how close within a record multiple
terms should be to each other. The most commonly used proximity search
option is a phrase search that requires terms to be in the exact order. Other proximity
operators can specify how close terms should be to each other. Some will also
specify the order of the search terms. Each search engine can define proximity operators
differently, and the search engines use various operator names such as NEAR,
ADJ(adjacent), or AFTER. In some cases, a sequence of single words is given,
together with a maximum allowed distance between them. Vector space models that
also maintain information about positions and offsets of tokens (words) have
robust implementations for this query type. However, providing support for complex
proximity operators becomes computationally expensive because it requires
the time-consuming preprocessing of documents, and is thus suitable for smaller
document collections rather than for the Web.
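A minimal sketch of a proximity test over word positions. The function name near and its parameters are hypothetical; real search engines define their proximity operators differently, as noted above.

```python
def positions(text):
    """Map each word of the text to the list of positions where it occurs."""
    pos = {}
    for i, word in enumerate(text.lower().split()):
        pos.setdefault(word, []).append(i)
    return pos

def near(text, a, b, max_dist, ordered=False):
    """True if a and b occur within max_dist words of each other.

    With ordered=True, a must also appear before b.
    """
    pos = positions(text)
    for i in pos.get(a, []):
        for j in pos.get(b, []):
            gap = j - i
            if ordered and 0 < gap <= max_dist:
                return True
            if not ordered and 0 < abs(gap) <= max_dist:
                return True
    return False

text = "database systems support efficient information retrieval"
print(near(text, "database", "retrieval", 3))
print(near(text, "database", "retrieval", 5))
print(near(text, "retrieval", "information", 1, ordered=True))
```

The nested loop over positions hints at why complex proximity operators become expensive on large collections.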
Wildcard Queries
Wildcard searching is generally meant to support regular expressions and pattern
matching-based searching in text. In IR systems, certain kinds of wildcard search
support may be implemented—usually words with any trailing characters (for
example, ‘data*’ would retrieve data, database, datapoint, dataset, and so on).
Providing support for wildcard searches in IR systems involves preprocessing overhead
and is not considered worth the cost by many Web search engines today.
Retrieval models do not directly provide support for this query type.
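A minimal sketch of trailing-wildcard expansion, assuming the index vocabulary is kept as a sorted list (this sorting is the preprocessing step). The vocabulary and the function name prefix_matches are hypothetical. The expanded terms would then be looked up like ordinary keywords.

```python
from bisect import bisect_left

# Hypothetical index vocabulary, kept sorted so prefixes form a contiguous run.
vocabulary = sorted(["data", "database", "datapoint", "dataset",
                     "date", "design", "document"])

def prefix_matches(pattern):
    """Expand a pattern such as 'data*' into the vocabulary words it matches."""
    if not pattern.endswith("*"):
        return [pattern] if pattern in vocabulary else []
    prefix = pattern[:-1]
    # Binary search for the first word that is >= the prefix.
    start = bisect_left(vocabulary, prefix)
    matches = []
    for word in vocabulary[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        matches.append(word)
    return matches

print(prefix_matches("data*"))
```

Only trailing wildcards are handled; a leading or embedded wildcard would need a different index structure.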
Natural Language Queries
There are a few natural language search engines that aim to understand the structure
and meaning of queries written in natural language text, generally as a question
or narrative. This is an active area of research that employs techniques like shallow
semantic parsing of text, or query reformulations based on natural language understanding.
The system tries to formulate answers for such queries from retrieved
results. Some search systems are starting to provide natural language interfaces to
provide answers to specific types of questions, such as definition and factoid questions,
which ask for definitions of technical terms or common facts that can be
retrieved from specialized databases. Such questions are usually easier to answer
because there are strong linguistic patterns giving clues to specific types of sentences—
for example, ‘defined as’ or ‘refers to’. Semantic models can provide support
for this query type.