Inexact Querying of XML
XML Data May be Irregular
• Relational data is regular and organized. XML may be
very different.
– Data is incomplete: Missing values of attributes in elements
– Data has structural variations: Relationships between
elements are represented differently in different parts of the
document
– Data has ontology variations: Different labels are used to
describe nodes of the same type
• (Note: In some of the upcoming slides, we have labels
on edges instead of on nodes.)
Incomplete Data
[Figure: Movie Database tree. Movie, Film, T.V. Series and Actor entries with Title, Name and Year labels; values include Dune, Star Wars, Twin Peaks, Léon, Magnolia, Kyle MacLachlan, Natalie Portman, Harrison Ford, Mark Hamill, 1977 and 1984. Callouts: "The movie has a year attribute" on one movie; "The year of the movie is missing" on another.]
Variations in Structure
[Figure: the same Movie Database tree. Callouts: "Movie below Actor" in one part of the document; "Actor below Movie" in another.]
Ontology Variations
[Figure: the same Movie Database tree. Callouts: "A movie label" and "A film label" on nodes of the same type.]
• The description of the schema is large (e.g., a DTD of XML)
– It is difficult to use the schema when formulating queries
• Data is contributed by many users in a variety of designs
– The query should deal with different structures of data
• The structure of the database is changed frequently
– Queries should be rewritten frequently
• Need to allow the user to write an “approximate query” and have the query processor deal with it
The Problem
• In many different domains, we are given the option
to query some source of information
• Usually, the user only gets results if the query can
be completely answered (satisfied)
• In many domains, this is not appropriate, e.g.,
– The user is not familiar with the database
– The database does not contain complete information
– There is a mismatch between the ontology of the user
and that of the database
Example 1
Locality: Be'er Sheva   Area code: 03
The selected locality does not appear in the selected area code!
Boarding: Haifa - Technion   Alighting: Eilat
There is no direct line connecting the selected points
Boarding:   Alighting: Eilat
Course details: Databases
No matching courses were found
What Do Users Need?
• Users need a way to get interesting partial answers
to their queries, especially if a complete answer does
not exist
• These partial answers should contain maximal
information
• Problem:
– It is easy to define when an answer satisfies a query
– Hard to say when an answer that does not satisfy a query is
of interest
– Hard to say which incomplete answers are better than others
Modeling a Database and a Query
• It is useful to model both databases and
queries as labeled directed graphs
– Clean mathematical modeling!
– Captures the essentials of XPath, XQuery
University Database
[Figure: edge-labeled graph of a university. Labels: University, Name, Dept, Faculty, Professor, Lecturer, Teaches. Values: Technion, Computer Science, Biology, Chana Israeli, Avi Levy, Databases, Bioinformatics, Molecular Biology.]
Query
[Figure: query graph, a path with the labels University, Dept, Faculty, Name.]
• Exact answers are defined by exact matchings, i.e., subgraph homomorphisms
• This query asks for the names of all faculty members (of any type)
• How would you write this in XPath?
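One possible answer to the XPath question, as a minimal sketch. It assumes the university graph is serialized as XML with element names equal to the edge labels; this encoding, the sample document `DOC`, and the placement of each person in a department are assumptions made for illustration, not taken from the slides. A wildcard step stands in for "of any type".

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of the University Database figure.
DOC = """
<University>
  <Name>Technion</Name>
  <Dept>
    <Name>Computer Science</Name>
    <Faculty>
      <Professor>
        <Name>Chana Israeli</Name>
        <Teaches>Databases</Teaches>
        <Teaches>Bioinformatics</Teaches>
      </Professor>
    </Faculty>
  </Dept>
  <Dept>
    <Name>Biology</Name>
    <Faculty>
      <Lecturer>
        <Name>Avi Levy</Name>
        <Teaches>Molecular Biology</Teaches>
      </Lecturer>
    </Faculty>
  </Dept>
</University>
"""

root = ET.fromstring(DOC)  # root is the <University> element

# In full XPath the query would read /University/Dept/Faculty/*/Name.
# ElementTree evaluates paths relative to the root element, so the
# leading "University" step is implicit here.
faculty_names = [e.text for e in root.findall("./Dept/Faculty/*/Name")]
print(faculty_names)
```

The `*` step matches both Professor and Lecturer, which is how the query stays independent of the faculty type.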
Exact Answers
[Figure, shown on two consecutive slides: the University Database graph with the query graph (University, Dept, Faculty, Name) matched against it.]
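To make "exact matching" concrete, here is a small sketch of a homomorphism search for tree-shaped queries over edge-labeled graphs. The dictionary representation, the node identifiers, the `"*"` wildcard label and the function name `matchings` are all assumptions for illustration; the slides do not prescribe a data structure or an algorithm.

```python
from itertools import product

# Edge-labeled graphs as {node: [(label, child), ...]}. Node ids are made up.
DB = {
    "root": [("University", "u")],
    "u": [("Name", "Technion"), ("Dept", "d1"), ("Dept", "d2")],
    "d1": [("Name", "Computer Science"), ("Faculty", "f1")],
    "d2": [("Name", "Biology"), ("Faculty", "f2")],
    "f1": [("Professor", "p")],
    "f2": [("Lecturer", "l")],
    "p": [("Name", "Chana Israeli"), ("Teaches", "Databases"),
          ("Teaches", "Bioinformatics")],
    "l": [("Name", "Avi Levy"), ("Teaches", "Molecular Biology")],
}

# Query path University/Dept/Faculty/<any type>/Name; "*" is an assumed wildcard.
QUERY = {
    "q0": [("University", "q1")],
    "q1": [("Dept", "q2")],
    "q2": [("Faculty", "q3")],
    "q3": [("*", "q4")],
    "q4": [("Name", "q5")],
}


def matchings(query, q_node, db, d_node):
    """Yield dicts mapping query nodes to database nodes, rooted at (q_node, d_node)."""
    per_edge = []
    for label, q_child in query.get(q_node, []):
        # Every query edge must map to a database edge with the same label.
        options = [
            m
            for d_label, d_child in db.get(d_node, [])
            if label in ("*", d_label)
            for m in matchings(query, q_child, db, d_child)
        ]
        if not options:
            return  # a query edge has no image, so no exact matching here
        per_edge.append(options)
    # Combine one choice per outgoing query edge into a full matching.
    for combo in product(*per_edge):
        result = {q_node: d_node}
        for m in combo:
            result.update(m)
        yield result


answers = [m["q5"] for m in matchings(QUERY, "q0", DB, "root")]
print(answers)
```

An answer exists only when every query edge finds an image; a single missing edge removes the whole result, which is the behaviour the following slides question.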
Slightly More Complex Query
[Figure: the previous query graph (University, Dept, Faculty, Name) with an added condition that the department is Biology.]
• Returns faculty members only from the Biology Department
Exact Answers Are Not Always Useful
• Problems with exact answers:
– labels are not always known
– content may be unknown, misspelled, etc.
– structure may be unknown, or may vary from one
representation to another
– we may actually want to perform a search, since the
query is a vague hypothesis
– do not allow users to get partial/vague answers
where none better exist
Manually Adding Inexactness
• One can use language constructs in order to
get more flexible queries
• Example: Suppose we want to find courses,
with teachers that teach them but we don’t
know which hierarchy exists in the database:
– for each teacher, there is a list of courses or
– for each course, there is a list of teachers
– or both…
[Figure: university database in which Course appears below Teacher. Query needed: Teacher, with Course below it.]
[Figure: university database in which Teacher appears below Course. Query needed: Course, with Teacher below it.]
Manually Adding Inexactness (cont.)
• If we don’t know the hierarchy, we need:
[Figure: the union of two queries, one with Course below Teacher and one with Teacher below Course.]
• If we don’t know what exactly the labels are, we might need:
[Figure: the same union, with Teacher replaced by "Teacher or Lecturer or Professor" and Course replaced by "Course or Seminar or Lab".]
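The growth of such a hand-written union can be sketched as follows. This is an illustration only: the path syntax, the two sample documents and the function names `manual_union_paths` and `matching_paths` are assumptions, while the label alternatives are the ones listed on the slide.

```python
from itertools import product
import xml.etree.ElementTree as ET

TEACHER_LABELS = ["Teacher", "Lecturer", "Professor"]
COURSE_LABELS = ["Course", "Seminar", "Lab"]


def manual_union_paths():
    """Enumerate every path the user would have to write by hand."""
    paths = []
    for teacher, course in product(TEACHER_LABELS, COURSE_LABELS):
        paths.append(f".//{teacher}/{course}")  # course listed under teacher
        paths.append(f".//{course}/{teacher}")  # teacher listed under course
    return paths


def matching_paths(xml_text):
    """Return the paths from the union that match something in the document."""
    root = ET.fromstring(xml_text)
    return [p for p in manual_union_paths() if root.findall(p)]


# Two hypothetical documents with opposite hierarchies and different labels.
DOC_A = "<Dept><Teacher><Name>Chana Israeli</Name><Course>Databases</Course></Teacher></Dept>"
DOC_B = "<Dept><Course><Name>Molecular Biology</Name><Lecturer>Avi Levy</Lecturer></Course></Dept>"

print(len(manual_union_paths()))
print(matching_paths(DOC_A))
print(matching_paths(DOC_B))
```

Three alternatives per label and two hierarchies already give 18 paths, and each additional unknown multiplies the count further.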
Help!
Intuition
• Users write regular queries, stating what
they are looking for
• The query processor uses a built-in strategy
to find answers that exactly satisfy the query
or inexactly satisfy the query
• Burden is on the query processor, not on the
user
Inexact Answers
• Many different definitions have been given
– For each definition, query processing algorithms have been
defined
• Examples:
– Allow some of the nodes of the query to be unmatched
– Allow edges in the query to be matched to paths in the
database
– Allow nodes to be matched to nodes with labels that have a
similar meaning
• Be careful so that answers are meaningful!
Allow Unmatched Nodes: Bezeq Query
[Figure: query graph with the labels Name, City, Area Code and Phone Number; the values given are Shmulevitz, Be'er Sheva and 03.]
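A rough sketch of the "unmatched nodes" idea, simplified to flat records rather than graphs. The directory entries, the phone numbers and the function name `partial_answers` are invented for illustration. The strategy shown (keep the largest subsets of query conditions that some record satisfies) is one possible reading of "maximal information", not a definition from the slides.

```python
from itertools import combinations

# Hypothetical directory entries; names and phone numbers are made up.
DIRECTORY = [
    {"Name": "Shmulevitz", "City": "Be'er Sheva", "Area Code": "08",
     "Phone Number": "555-0101"},
    {"Name": "Shmulevitz", "City": "Tel Aviv", "Area Code": "03",
     "Phone Number": "555-0102"},
    {"Name": "Cohen", "City": "Be'er Sheva", "Area Code": "08",
     "Phone Number": "555-0103"},
]


def partial_answers(query, records):
    """Return (matched_conditions, records) for the largest satisfiable subsets."""
    keys = sorted(query)
    # Try to match all conditions first, then leave more and more unmatched.
    for size in range(len(keys), 0, -1):
        found = []
        for subset in combinations(keys, size):
            hits = [r for r in records
                    if all(r.get(k) == query[k] for k in subset)]
            if hits:
                found.append((subset, hits))
        if found:
            return found
    return []


QUERY = {"Name": "Shmulevitz", "City": "Be'er Sheva", "Area Code": "03"}
result = partial_answers(QUERY, DIRECTORY)
for matched, hits in result:
    print(matched, [r["Phone Number"] for r in hits])
```

No entry satisfies all three conditions, so the sketch returns the entries that satisfy two of them instead of an empty result.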
Matching Edges to Paths: Egged Query
[Figure: query graph with the labels Source and Destination; the values given are Technion-Haifa and Eilat.]
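Matching a query edge to a database path can be sketched with a breadth-first search. The route table below is illustrative, not real timetable data, and the function name `match_edge_to_path` is an assumption.

```python
from collections import deque

# Hypothetical direct bus lines: stop -> stops reachable without changing.
DIRECT_LINES = {
    "Technion-Haifa": ["Tel Aviv"],
    "Tel Aviv": ["Be'er Sheva", "Jerusalem"],
    "Be'er Sheva": ["Eilat"],
    "Jerusalem": [],
    "Eilat": [],
}


def match_edge_to_path(graph, source, destination):
    """Return a shortest path (list of stops) from source to destination, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Exact matching: the query edge must be a single direct line.
has_direct_line = "Eilat" in DIRECT_LINES["Technion-Haifa"]
# Inexact matching: the query edge may be matched to a path of direct lines.
route = match_edge_to_path(DIRECT_LINES, "Technion-Haifa", "Eilat")
print(has_direct_line, route)
```

The exact query fails because no direct line exists, while the relaxed query returns a route with changes.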
Similar Meaning Labels
[Figure: query graph for a Course with the labels Name and Details; the value given is "Databases".]
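Label similarity can be sketched with a hand-made synonym table. The groups below reuse label alternatives that appear in earlier slides (Teacher, Lecturer, Professor; Course, Seminar, Lab; Movie, Film). The table itself and the functions `similar` and `match_path` are assumptions; a real system might derive similarity from an ontology or a thesaurus instead.

```python
# Hypothetical groups of labels considered to have a similar meaning.
SYNONYM_GROUPS = [
    {"Teacher", "Lecturer", "Professor"},
    {"Course", "Seminar", "Lab"},
    {"Movie", "Film"},
]


def similar(label_a, label_b):
    """True if the labels are equal or belong to the same synonym group."""
    if label_a == label_b:
        return True
    return any(label_a in g and label_b in g for g in SYNONYM_GROUPS)


def match_path(query_labels, db_labels):
    """Match two label paths step by step, allowing similar labels."""
    return len(query_labels) == len(db_labels) and all(
        similar(q, d) for q, d in zip(query_labels, db_labels)
    )


print(match_path(["Teacher", "Course"], ["Lecturer", "Seminar"]))
print(match_path(["Teacher", "Course"], ["Course", "Lecturer"]))
```

This relaxation changes only the label test; the second call still fails because the hierarchy is reversed, which is a structural variation rather than a label variation.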
Other Types of Inexactness
• Many other definitions have been given, e.g.,
– allow permutations of nodes in the query
– allow child nodes to be promoted
– interconnection
• Summary: Inexactness basically means that
we relax some of the query requirements!