georgios fakas department of computing and mathematics, manchester metropolitan university...
TRANSCRIPT
GEORGIOS FAKAS
Department of Computing and Mathematics, Manchester Metropolitan University
Manchester, UK. [email protected]
Automated Generation of Object Summaries from Relational Databases: A Novel Keyword
Searching Paradigm
Related Work:Keyword Search in Relational DBs
Full-text Search (e.g. Oracle 9i Text)Kw Searching in Relational DB (DISCOVER, BANKS)
Kw Search:Leverling, Peacock
Result: e3-o2-c2 e4-06-c2
Territories Region
Employees
OrdersCustomers
EmployeeTerritories
Order Details
Shippers
Products
Categories
Suppliers
e1
e2
e3
e4
et1
et2
et3
o1
o2
o3
o4
o5
o6
o7
od1
od2
od3
p1
ca1
su1
s1
s2
s3
c1
c2
c3
p2
od4
od5
od6
et4
t1
t2
t3
t4
r1
r2
Related Work:Web Search Engines: Keyword Search
Kw Search: Peacock
Result: A ranked set of web pages
Related Work:Web Search Engines: Keyword Search
Kw Search: Peacock
Result: A ranked set of web pages
A Novel Keyword Searching Paradigm: Object Summaries (OSs)
Kw Search: Peacock Result: A Ranked set of OSs
Result 1: 4 tuples out of 27
Result 2: 4 tuples out of 24
Result 3: 4 tuples out of 9
CustomerID
Quick
CompanyName
QUICK-Stop
ContactName
Margaret Peacock
Address
Taucherstae 10
...
...
Customer
Orders
OrderID
10418
ShipName
QUICK-Stop
FirstName
Margaret
ShipAddress
Taucherstae 10
...
...
LastName
Peacock
...
...
CompanyName
Speedy Express
...
...
... ...
Employees Shippers
Employee
EmployeeID
4
LastName
Peacock
FirstName
Margaret
Title
Sales Representative
TitleOfCourtesy
Mrs.
Address
4110 Old Redmond Rd.
...
...
EmployeeTerritories
TerritoryDescription
Rockville
ShipName
QUICK-Stop
Region
Eastern
OrderID
10418
ShipAddress
Taucherstae 10
...
...
... ...
OrderDate
1996-07-15
Region Orders
Employee
EmployeeID
3
LastName
Peacock
FirstName
Janet
Title
Sales Representative
TitleOfCourtesy
Ms.
Address
722 Moss Bay Blvd.
...
...
EmployeeTerritories
TerritoryDescription
Atlanta
ShipName
QUICK-Stop
Region
Southern
OrderID
10273
ShipAddress
Taucherstae 10
...
...
... ...
OrderDate
1996-08-05
Region Orders
A Novel Keyword Searching Paradigm: Object Summaries (OSs)
Employees
Employees (Reports To)
(e3)
(e2)
Territories, Region
(et1, t4, r2)
Orders
(o2)
Customers
(c2)
Shippers
Order Details
(s3)
(od1)
(p2)
Products
Kw Search: Peacock Result: A Ranked set of OSs
Result 1: 4 tuples out of 27
Result 2: 4 tuples out of 24
Result 3: 4 tuples out of 9
CustomerID
Quick
CompanyName
QUICK-Stop
ContactName
Margaret Peacock
Address
Taucherstae 10
...
...
Customer
Orders
OrderID
10418
ShipName
QUICK-Stop
FirstName
Margaret
ShipAddress
Taucherstae 10
...
...
LastName
Peacock
...
...
CompanyName
Speedy Express
...
...
... ...
Employees Shippers
Employee
EmployeeID
4
LastName
Peacock
FirstName
Margaret
Title
Sales Representative
TitleOfCourtesy
Mrs.
Address
4110 Old Redmond Rd.
...
...
EmployeeTerritories
TerritoryDescription
Rockville
ShipName
QUICK-Stop
Region
Eastern
OrderID
10418
ShipAddress
Taucherstae 10
...
...
... ...
OrderDate
1996-07-15
Region Orders
Employee
EmployeeID
3
LastName
Peacock
FirstName
Janet
Title
Sales Representative
TitleOfCourtesy
Ms.
Address
722 Moss Bay Blvd.
...
...
EmployeeTerritories
TerritoryDescription
Atlanta
ShipName
QUICK-Stop
Region
Southern
OrderID
10273
ShipAddress
Taucherstae 10
...
...
... ...
OrderDate
1996-08-05
Region Orders
Problems-Challenges:How can we automatically
(1) Generate and (2) Rank OSsliberating users from knowledge of:
(1) Schema and(2) Query Language
??
Territories Region
Employees
OrdersCustomers
EmployeeTerritories
Order Details
Shippers
Products
Categories
Suppliers
e1
e2
e3
e4
et1
et2
et3
o1
o2
o3
o4
o5
o6
o7
od1
od2
od3
p1
ca1
su1
s1
s2
s3
c1
c2
c3
p2
od4
od5
od6
et4
t1
t2
t3
t4
r1
r2
KW-ID = “Janet Leverling”
CustomerCustomerDemoOrders
Employees
Customers
Order Details Products
Categories
Suppliers
EmployeeTerritories
CustomerDemographics
Shippers
RegionTerritories
OS Generation - Methodology
tDS a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.
RDS the corresponding central Relation; similarly Relations around contain additional information.
OS Generation - Methodology
Territories Region
Employees
OrdersCustomers
EmployeeTerritories
Order Details
Shippers
Products
Categories
Suppliers
e1
e2
e3
e4
et1
et2
et3
o1
o2
o3
o4
o5
o6
o7
od1
od2
od3
p1
ca1
su1
s1
s2
s3
c1
c2
c3
p2
od4
od5
od6
et4
t1
t2
t3
t4
r1
r2
KW-ID = “Janet Leverling”
CustomerCustomerDemoOrders
Employees
Customers
Order Details Products
Categories
Suppliers
EmployeeTerritories
CustomerDemographics
Shippers
RegionTerritories
tDS a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.
RDS the corresponding central Relation; similarly Relations around contain additional information.
OS Generation - Methodology
Territories Region
Employees
OrdersCustomers
EmployeeTerritories
Order Details
Shippers
Products
Categories
Suppliers
e1
e2
e3
e4
et1
et2
et3
o1
o2
o3
o4
o5
o6
o7
od1
od2
od3
p1
ca1
su1
s1
s2
s3
c1
c2
c3
p2
od4
od5
od6
et4
t1
t2
t3
t4
r1
r2
KW-ID = “Janet Leverling”
CustomerCustomerDemoOrders
Employees
Customers
Order Details Products
Categories
Suppliers
EmployeeTerritories
CustomerDemographics
Shippers
RegionTerritories
CustomerCustomerDemoOrders
Employees
Customers
Order Details Products
Categories
Suppliers
EmployeeTerritories
CustomerDemographics
Shippers
RegionTerritories
tDS a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.
RDS the corresponding central Relation; similarly Relations around contain additional information.
OS Generation - Methodology
Territories Region
Employees
OrdersCustomers
EmployeeTerritories
Order Details
Shippers
Products
Categories
Suppliers
e1
e2
e3
e4
et1
et2
et3
o1
o2
o3
o4
o5
o6
o7
od1
od2
od3
p1
ca1
su1
s1
s2
s3
c1
c2
c3
p2
od4
od5
od6
et4
t1
t2
t3
t4
r1
r2
KW-ID = “Janet Leverling”
CustomerCustomerDemoOrders
Employees
Customers
Order Details Products
Categories
Suppliers
EmployeeTerritories
CustomerDemographics
Shippers
RegionTerritories
tDS a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.
RDS the corresponding central Relation; similarly Relations around contain additional information.
OS Generation - Methodology
Territories Region
Employees
OrdersCustomers
EmployeeTerritories
Order Details
Shippers
Products
Categories
Suppliers
e1
e2
e3
e4
et1
et2
et3
o1
o2
o3
o4
o5
o6
o7
od1
od2
od3
p1
ca1
su1
s1
s2
s3
c1
c2
c3
p2
od4
od5
od6
et4
t1
t2
t3
t4
r1
r2
KW-ID = “Janet Leverling”
CustomerCustomerDemoOrders
Employees
Customers
Order Details Products
Categories
Suppliers
EmployeeTerritories
CustomerDemographics
Shippers
RegionTerritories
OS for “Janet Leverling”Employees
Employees (Reports To)
(e3)
(e2)
Territories, Region
(et1, t4, r2)
Orders
(o2)
Customers
(c2)
Shippers
Order Details
(s3)
(od1)
(p2)
(ca1)
Products
Categories
CustomerCustomerDemo(0)
Products(49)
Customer(91)
Order Details(2155)
Shippers(3)
Orders(830)
Employees(9)
EmployeeTerritories(49)
Territories(53)
Region(4)
Employees(9)
Employees(2)
Suppliers(29)
Categories(8)
GD
S
OS Generation - Methodology GDS
Problem: Problem: Not all Relations in GDS are relevant:
How do I decide 1) What relations to select or not
2) When to Stop Traversing
Solution: Solution: Investigate Relational Semantics:
Schema Connectivity, Cardinality, Related Cardinality etc.
Quantify Affinity of Relations
CustomerCustomerDemo(0)
Products(49)
Customer(91)
Order Details(2155)
Shippers(3)
Orders(830)
Employees(9)
EmployeeTerritories(49)
Territories(53)
Region(4)
Employees(9)
Employees(2)
Suppliers(29)
Categories(8)
: Affinity of Relations to RDS in GDS
DSi RR
Af
DistancePhysical (fd), Logical (ld),
ld=fd-|M:N|
CustomerCustomerDemo(0)
Products(49)
Customer(91)
Order Details(2155)
Shippers(3)
Orders(830)
Employees(9)
EmployeeTerritories(49)
Territories(53)
Region(4)
fdi=0
fdi=1
fdi=2
fdi=3
Employees(9)
Employees(2)
Suppliers(29)
Categories(8)
fdi=4
: Affinity of Relations to RDS in GDS
DSi RR
Af
DistancePhysical (fd), Logical (ld),
ld=fd-|M:N|E.g. Orders closer than
Customer and CustomerDemo to Employees
CustomerCustomerDemo(0)
Products(49)
Customer(91)
Order Details(2155)
Shippers(3)
Orders(830)
Employees(9)
EmployeeTerritories(49)
Territories(53)
Region(4)
fdi=0
fdi=1
fdi=2
fdi=3
Employees(9)
Employees(2)
Suppliers(29)
Categories(8)
fdi=4
: Affinity of Relations to RDS in GDS
DSi RR
Af
DistancePhysical (fd), Logical (ld),
ld=fd-|M:N|E.g. Orders closer than
Customer and CustomerDemo to Employees
Hubs: spurious shortcutsRather irrelevant or lateral
information RC(R1, R2)
Nation(25)
Partsupp(800,000)
Supplier(10,000)
Region(5)
Customer(150,000)
Lineitem(6,001,215)
Part(200,00*)
Orders(1,500,000)
1 (400) 80 (1)
1 (5) 6000 (1) 1 (4) 7.5 (1)
10 (1)1 (400)
Orders(1,500,000)
Lineitem(6,001,215)
4 (1)
Lineitem(6,001,215)
4 (1)
2M:1
hub1:NDS RR...R
: Affinity of Relations to RDS in GDS
DSi RR
Af
ConnectivitySchema Connectivity (Coi)Data-graph Connectivity:
Relative Cardinality (RCi→j), i.e. the average number of tuples of Ri that are connected with each tuple from Rj
for 1:M RCi→j=|Ri|/|Rj|
for M:1 RCi→j=1 Reverse Relative Cardinality
(RRCi→j) is the reverse of RCi→j
i.e. RRCi→j=RCi→j).
CustomerCustomerDemo(0)
Products(49)
Customer(91)
Order Details(2155)
Shippers(3)
Orders(830)
Employees(9)
EmployeeTerritories(49)
Territories(53)
Region(4)
fdi=0
fdi=1
fdi=2
fdi=3
92.2 (1) 5.4 (1)
1 (9.1) 2.5 (1) 1 (276.6)
1 (27.9)
1 (0.9)
1 (13.2)
Employees(9)
Employees(2)
1 (1)
Suppliers(29)
Categories(8)
1 (2.6)1 (9.6)
fdi=4
1 (1)
: Affinity of Relations to RDS in GDS
DSi RR
Af
DAf(Ri)={(m1, w1), (m2, w2),.. (mn, wn)} m1=f1(ldi), m2=f1(log(10*RCi), m3=f1(log(10*RRCi),
m4=f1(log(10*Coi) f1(α)=(11- α)/10 For a hub-child m1=f1(ldi *hi) and m2=f1(RCi)
Formula 1 (Semantic Affinity):Formula 1 (Semantic Affinity): The affinity of Ri to RDS, denoted as , with respect to a
schema and a database conforming to the schema, can be calculated with the following formula:
Where is the affinity of the Ri’s Parent to RDS or is 1 if RParent≡RDS.□
DSi RR
Af
DSParent RR
Af
DSParent
DSi RRj
jjRR
AfwmAf
: Affinity of Relations to RDS in GDS
DSi RR
Af
CustomerCustomerDemoNull (0)
Products0.74 (49)
Customer0.85 (91)
Order Details0.84 (2155)
Shippers0.85 (3)
Orders0.90 (830)
Employees1 (9)
EmployeeTerritories(49)
Territories0.96 (53)
Region0.91 (4)
fdi=0
fdi=1
fdi=2
fdi=3
92.2 (1) 5.4 (1)
1 (9.1) 2.5 (1) 1 (276.6)
1 (27.9)
1 (0.9)
1 (13.2)
Employees0.98 (9)
Employees0.98 (2)
1 (1)
Suppliers0.62 (29)
Categories0.63 (8)
1 (2.6)1 (9.6)
fdi=4
1 (1) GDS
(θ)
Employees
Employees (Reports To)
(e3)
(e2)
Territories, Region
(et1, t4, r2)
Orders
(o2)
Customers
(c2)
Shippers
Order Details
(s3)
(od1)
(p2)
Products
A Ranked set of Partial OSs - A complete OS
Result 1: 4 tuples out of 27
Result 2: 4 tuples out of 24
Result 3: 4 tuples out of 9
CustomerID
Quick
CompanyName
QUICK-Stop
ContactName
Margaret Peacock
Address
Taucherstae 10
...
...
Customer
Orders
OrderID
10418
ShipName
QUICK-Stop
FirstName
Margaret
ShipAddress
Taucherstae 10
...
...
LastName
Peacock
...
...
CompanyName
Speedy Express
...
...
... ...
Employees Shippers
Employee
EmployeeID
4
LastName
Peacock
FirstName
Margaret
Title
Sales Representative
TitleOfCourtesy
Mrs.
Address
4110 Old Redmond Rd.
...
...
EmployeeTerritories
TerritoryDescription
Rockville
ShipName
QUICK-Stop
Region
Eastern
OrderID
10418
ShipAddress
Taucherstae 10
...
...
... ...
OrderDate
1996-07-15
Region Orders
Employee
EmployeeID
3
LastName
Peacock
FirstName
Janet
Title
Sales Representative
TitleOfCourtesy
Ms.
Address
722 Moss Bay Blvd.
...
...
EmployeeTerritories
TerritoryDescription
Atlanta
ShipName
QUICK-Stop
Region
Southern
OrderID
10273
ShipAddress
Taucherstae 10
...
...
... ...
OrderDate
1996-08-05
Region Orders
OS Ranking
OS Ranking- Problems and Challenges
Existing Keyword Searching ranking semantics the smaller size the higher ranking
In contrast, in the proposed paradigm an OS containing many and well connected tuples should have certainly greater importance than an OS with less tuples.
For instance, a Customer or Employee OS involved in many Orders or an Author authored many important papers and books.
OS Ranking- Importance
Im(OS)=
ti is a tuple of OS
Im(ti) is the Importance of ti (i.e. PageRank)
|OS| is the amount of tuples in OS, AfR(ti) is the affinity of R that ti belongs to
1|)OSlog(|
)t(Af)tIm( iRi
Experimental Evaluation
MS Northwind and TPC-H DBsPrecision, Recall, F-ScoreCompare GDSs and OSs produced by 12 GDS(θ) v
GDS(h)GDS(h) was proposed by 10 participantsGDS: average F-score 86.77, OS aver F-score 83
GDS Precision, Recall and F-score (Averages)<0.5, 0.4, 0.05, 0.05>
0
20
40
60
80
100
Cus
tom
ers
Em
ploy
ees
Sup
plie
rs
Shi
pper
s
Ord
ers
Pro
duct
s
Cus
tom
er
Sup
plie
r
Par
ts
Ord
ers
Nat
ion
Reg
ion
Northwind TPC-H
Precision
Recall
F-Score
OSs Precision, Recall and F-score (Averages) <0.5, 0.4, 0.05, 0.05>
0
20
40
60
80
100
Cu
sto
me
rs
Em
plo
yee
s
Su
pp
liers
Sh
ipp
ers
Ord
ers
Pro
du
cts
Cu
sto
me
r
Su
pp
lier
Pa
rts
Ord
ers
Na
tion
Re
gio
n
Northwind TPC-H
Precision
Recall
F-Score
Conclusions –Future Work
Top-k OS results Top-k size of an OS
Challenge: the weights of new tuples are not monotonic
(since a tuple’s PageRank may increase while its Affinity decrease).
Alternative to PageRank weighting systems are currently investigated; i.e. ObjectRanks
Conclusions -Novel Contributions
The formal definition of the novel Searching Paradigm which automatically produces a ranked set of OSs for a Data Subject.
minimum contribution from the user (i.e. only a Kw) no prior knowledge of the DB schema or query language needed. Excellent Precision, Recall and F-score results
The formal definition and quantification of Relation’s Affinity in the context of GDS
consider both Schema Design and Data distributions
A novel ranking paradigm to calculate Im(OS). The quantification of tuples’ and OSs’ Importance. A Combine Function that considers:
the weight (e.g. PageRank) of tuples, Affinity and size of OS
: Affinity of Relations to RDS in GDS
DSi RR
Af
Employees Customer Order Shipper RDS
Ri
ldi, RCi,
iRC , Coi
m1..m4 AfRi AfRi (rRi) AfRi (rRi) AfRi (rRi)
Employees RDS RDS 1.00 0.88 (3) 0.97 (4) 0.82 (4) Employees (ReportsTo) 1, 1, 0.9, 4 1, 1, 1, 0.7 0.98 0.78 (5) 0.91 (5) 0.73 (5)
Employees (ReportedBy) 1, 0.9, 1, 4 1, 1, 1, 0.7 0.98 0.70(7) 0.85 (7) 0.66 (7) Territories 1, 5.4, 1, 2 1, 0.9, 1, 0.9 0.96 0.55 (10) 0.66 (10) 0.51 (10)
Region 2, 1, 13.2, 1 0.9, 1, 0.88, 1 0.91 0.46 (11) 0.59 (11) 0.43 (11) Order 1, 92.2, 1, 4 1, 0.8, 1, 0.7 0.90 0.94 (1) 1 (RDS) 0.89 (1)
Customer 2, 1, 9.1, 2 0.9, 1, 0.9, 0.9 0.85 1 (RDS) 0.99 (1) 0.83 (2) Shipper 2, 1, 276.6, 1 0.9, 1, 0.75, 1 0.85 0.88 (2) 0.98 (2) 1 (RDS)
OrderDetails 2, 2.5, 1, 2 0.9, 0.96, 1, 0.9 0.84 0.88 (4) 0.97 (3) 0.82 (3) Product 3, 1, 43.9, 4 0.8, 1, 0.83, 0.8 0.74 0.77 (6) 0.91 (6) 0.73 (6) Supplier 4, 1, 1.6, 1 0.7, 1, 0.9, 1 0.63 0.65 (8) 0.82 (8) 0.62 (8)
Categories 4, 1, 6.1, 1 0.7, 1, 0.92, 1 0.62 0.65 (9) 0.81 (9) 0.61 (9) CustDemographics 3, null, null, 1 0.8, null, null, 1 Null Null Null Null
Affinity Ranking Correctness (Average)
Affinity Ranking Correctness (Averages)
0102030405060708090
100C
ust
om
ers
Em
plo
yee
s
Su
pp
liers
Sh
ipp
ers
Ord
ers
Pro
du
cts
Cu
sto
me
r
Su
pp
lier
Pa
rts
Ord
ers
Na
tion
Re
gio
n
Northwind TPC-H
i
rrd hRi
AfRi ),(*100
100