georgios fakas department of computing and mathematics, manchester metropolitan university...

Post on 13-Jan-2016

215 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

GEORGIOS FAKAS

Department of Computing and Mathematics, Manchester Metropolitan University

Manchester, UK. g.fakas@mmu.ac.uk

Automated Generation of Object Summaries from Relational Databases: A Novel Keyword

Searching Paradigm

Related Work:Keyword Search in Relational DBs

Full-text Search (e.g. Oracle 9i Text)Kw Searching in Relational DB (DISCOVER, BANKS)

Kw Search:Leverling, Peacock

Result: e3-o2-c2 e4-06-c2

Territories Region

Employees

OrdersCustomers

EmployeeTerritories

Order Details

Shippers

Products

Categories

Suppliers

e1

e2

e3

e4

et1

et2

et3

o1

o2

o3

o4

o5

o6

o7

od1

od2

od3

p1

ca1

su1

s1

s2

s3

c1

c2

c3

p2

od4

od5

od6

et4

t1

t2

t3

t4

r1

r2

Related Work:Web Search Engines: Keyword Search

Kw Search: Peacock

Result: A ranked set of web pages

Related Work:Web Search Engines: Keyword Search

Kw Search: Peacock

Result: A ranked set of web pages

A Novel Keyword Searching Paradigm: Object Summaries (OSs)

Kw Search: Peacock Result: A Ranked set of OSs

Result 1: 4 tuples out of 27

Result 2: 4 tuples out of 24

Result 3: 4 tuples out of 9

CustomerID

Quick

CompanyName

QUICK-Stop

ContactName

Margaret Peacock

Address

Taucherstae 10

...

...

Customer

Orders

OrderID

10418

ShipName

QUICK-Stop

FirstName

Margaret

ShipAddress

Taucherstae 10

...

...

LastName

Peacock

...

...

CompanyName

Speedy Express

...

...

... ...

Employees Shippers

Employee

EmployeeID

4

LastName

Peacock

FirstName

Margaret

Title

Sales Representative

TitleOfCourtesy

Mrs.

Address

4110 Old Redmond Rd.

...

...

EmployeeTerritories

TerritoryDescription

Rockville

ShipName

QUICK-Stop

Region

Eastern

OrderID

10418

ShipAddress

Taucherstae 10

...

...

... ...

OrderDate

1996-07-15

Region Orders

Employee

EmployeeID

3

LastName

Peacock

FirstName

Janet

Title

Sales Representative

TitleOfCourtesy

Ms.

Address

722 Moss Bay Blvd.

...

...

EmployeeTerritories

TerritoryDescription

Atlanta

ShipName

QUICK-Stop

Region

Southern

OrderID

10273

ShipAddress

Taucherstae 10

...

...

... ...

OrderDate

1996-08-05

Region Orders

A Novel Keyword Searching Paradigm: Object Summaries (OSs)

Employees

Employees (Reports To)

(e3)

(e2)

Territories, Region

(et1, t4, r2)

Orders

(o2)

Customers

(c2)

Shippers

Order Details

(s3)

(od1)

(p2)

Products

Kw Search: Peacock Result: A Ranked set of OSs

Result 1: 4 tuples out of 27

Result 2: 4 tuples out of 24

Result 3: 4 tuples out of 9

CustomerID

Quick

CompanyName

QUICK-Stop

ContactName

Margaret Peacock

Address

Taucherstae 10

...

...

Customer

Orders

OrderID

10418

ShipName

QUICK-Stop

FirstName

Margaret

ShipAddress

Taucherstae 10

...

...

LastName

Peacock

...

...

CompanyName

Speedy Express

...

...

... ...

Employees Shippers

Employee

EmployeeID

4

LastName

Peacock

FirstName

Margaret

Title

Sales Representative

TitleOfCourtesy

Mrs.

Address

4110 Old Redmond Rd.

...

...

EmployeeTerritories

TerritoryDescription

Rockville

ShipName

QUICK-Stop

Region

Eastern

OrderID

10418

ShipAddress

Taucherstae 10

...

...

... ...

OrderDate

1996-07-15

Region Orders

Employee

EmployeeID

3

LastName

Peacock

FirstName

Janet

Title

Sales Representative

TitleOfCourtesy

Ms.

Address

722 Moss Bay Blvd.

...

...

EmployeeTerritories

TerritoryDescription

Atlanta

ShipName

QUICK-Stop

Region

Southern

OrderID

10273

ShipAddress

Taucherstae 10

...

...

... ...

OrderDate

1996-08-05

Region Orders

Problems-Challenges:How can we automatically

(1) Generate and (2) Rank OSsliberating users from knowledge of:

(1) Schema and(2) Query Language

??

Territories Region

Employees

OrdersCustomers

EmployeeTerritories

Order Details

Shippers

Products

Categories

Suppliers

e1

e2

e3

e4

et1

et2

et3

o1

o2

o3

o4

o5

o6

o7

od1

od2

od3

p1

ca1

su1

s1

s2

s3

c1

c2

c3

p2

od4

od5

od6

et4

t1

t2

t3

t4

r1

r2

KW-ID = “Janet Leverling”

CustomerCustomerDemoOrders

Employees

Customers

Order Details Products

Categories

Suppliers

EmployeeTerritories

CustomerDemographics

Shippers

RegionTerritories

OS Generation - Methodology

tDS a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.

RDS the corresponding central Relation; similarly Relations around contain additional information.

OS Generation - Methodology

Territories Region

Employees

OrdersCustomers

EmployeeTerritories

Order Details

Shippers

Products

Categories

Suppliers

e1

e2

e3

e4

et1

et2

et3

o1

o2

o3

o4

o5

o6

o7

od1

od2

od3

p1

ca1

su1

s1

s2

s3

c1

c2

c3

p2

od4

od5

od6

et4

t1

t2

t3

t4

r1

r2

KW-ID = “Janet Leverling”

CustomerCustomerDemoOrders

Employees

Customers

Order Details Products

Categories

Suppliers

EmployeeTerritories

CustomerDemographics

Shippers

RegionTerritories

tDS a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.

RDS the corresponding central Relation; similarly Relations around contain additional information.

OS Generation - Methodology

Territories Region

Employees

OrdersCustomers

EmployeeTerritories

Order Details

Shippers

Products

Categories

Suppliers

e1

e2

e3

e4

et1

et2

et3

o1

o2

o3

o4

o5

o6

o7

od1

od2

od3

p1

ca1

su1

s1

s2

s3

c1

c2

c3

p2

od4

od5

od6

et4

t1

t2

t3

t4

r1

r2

KW-ID = “Janet Leverling”

CustomerCustomerDemoOrders

Employees

Customers

Order Details Products

Categories

Suppliers

EmployeeTerritories

CustomerDemographics

Shippers

RegionTerritories

CustomerCustomerDemoOrders

Employees

Customers

Order Details Products

Categories

Suppliers

EmployeeTerritories

CustomerDemographics

Shippers

RegionTerritories

tDS a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.

RDS the corresponding central Relation; similarly Relations around contain additional information.

OS Generation - Methodology

Territories Region

Employees

OrdersCustomers

EmployeeTerritories

Order Details

Shippers

Products

Categories

Suppliers

e1

e2

e3

e4

et1

et2

et3

o1

o2

o3

o4

o5

o6

o7

od1

od2

od3

p1

ca1

su1

s1

s2

s3

c1

c2

c3

p2

od4

od5

od6

et4

t1

t2

t3

t4

r1

r2

KW-ID = “Janet Leverling”

CustomerCustomerDemoOrders

Employees

Customers

Order Details Products

Categories

Suppliers

EmployeeTerritories

CustomerDemographics

Shippers

RegionTerritories

tDS a central tuple containing the Kw; tuples around tDS contain additional information about the Data Subject.

RDS the corresponding central Relation; similarly Relations around contain additional information.

OS Generation - Methodology

Territories Region

Employees

OrdersCustomers

EmployeeTerritories

Order Details

Shippers

Products

Categories

Suppliers

e1

e2

e3

e4

et1

et2

et3

o1

o2

o3

o4

o5

o6

o7

od1

od2

od3

p1

ca1

su1

s1

s2

s3

c1

c2

c3

p2

od4

od5

od6

et4

t1

t2

t3

t4

r1

r2

KW-ID = “Janet Leverling”

CustomerCustomerDemoOrders

Employees

Customers

Order Details Products

Categories

Suppliers

EmployeeTerritories

CustomerDemographics

Shippers

RegionTerritories

OS for “Janet Leverling”Employees

Employees (Reports To)

(e3)

(e2)

Territories, Region

(et1, t4, r2)

Orders

(o2)

Customers

(c2)

Shippers

Order Details

(s3)

(od1)

(p2)

(ca1)

Products

Categories

CustomerCustomerDemo(0)

Products(49)

Customer(91)

Order Details(2155)

Shippers(3)

Orders(830)

Employees(9)

EmployeeTerritories(49)

Territories(53)

Region(4)

Employees(9)

Employees(2)

Suppliers(29)

Categories(8)

GD

S

OS Generation - Methodology GDS

Problem: Problem: Not all Relations in GDS are relevant:

How do I decide 1) What relations to select or not

2) When to Stop Traversing

Solution: Solution: Investigate Relational Semantics:

Schema Connectivity, Cardinality, Related Cardinality etc.

Quantify Affinity of Relations

CustomerCustomerDemo(0)

Products(49)

Customer(91)

Order Details(2155)

Shippers(3)

Orders(830)

Employees(9)

EmployeeTerritories(49)

Territories(53)

Region(4)

Employees(9)

Employees(2)

Suppliers(29)

Categories(8)

: Affinity of Relations to RDS in GDS

DSi RR

Af

DistancePhysical (fd), Logical (ld),

ld=fd-|M:N|

CustomerCustomerDemo(0)

Products(49)

Customer(91)

Order Details(2155)

Shippers(3)

Orders(830)

Employees(9)

EmployeeTerritories(49)

Territories(53)

Region(4)

fdi=0

fdi=1

fdi=2

fdi=3

Employees(9)

Employees(2)

Suppliers(29)

Categories(8)

fdi=4

: Affinity of Relations to RDS in GDS

DSi RR

Af

DistancePhysical (fd), Logical (ld),

ld=fd-|M:N|E.g. Orders closer than

Customer and CustomerDemo to Employees

CustomerCustomerDemo(0)

Products(49)

Customer(91)

Order Details(2155)

Shippers(3)

Orders(830)

Employees(9)

EmployeeTerritories(49)

Territories(53)

Region(4)

fdi=0

fdi=1

fdi=2

fdi=3

Employees(9)

Employees(2)

Suppliers(29)

Categories(8)

fdi=4

: Affinity of Relations to RDS in GDS

DSi RR

Af

DistancePhysical (fd), Logical (ld),

ld=fd-|M:N|E.g. Orders closer than

Customer and CustomerDemo to Employees

Hubs: spurious shortcutsRather irrelevant or lateral

information RC(R1, R2)

Nation(25)

Partsupp(800,000)

Supplier(10,000)

Region(5)

Customer(150,000)

Lineitem(6,001,215)

Part(200,00*)

Orders(1,500,000)

1 (400) 80 (1)

1 (5) 6000 (1) 1 (4) 7.5 (1)

10 (1)1 (400)

Orders(1,500,000)

Lineitem(6,001,215)

4 (1)

Lineitem(6,001,215)

4 (1)

2M:1

hub1:NDS RR...R

: Affinity of Relations to RDS in GDS

DSi RR

Af

ConnectivitySchema Connectivity (Coi)Data-graph Connectivity:

Relative Cardinality (RCi→j), i.e. the average number of tuples of Ri that are connected with each tuple from Rj

for 1:M RCi→j=|Ri|/|Rj|

for M:1 RCi→j=1 Reverse Relative Cardinality

(RRCi→j) is the reverse of RCi→j

i.e. RRCi→j=RCi→j).

CustomerCustomerDemo(0)

Products(49)

Customer(91)

Order Details(2155)

Shippers(3)

Orders(830)

Employees(9)

EmployeeTerritories(49)

Territories(53)

Region(4)

fdi=0

fdi=1

fdi=2

fdi=3

92.2 (1) 5.4 (1)

1 (9.1) 2.5 (1) 1 (276.6)

1 (27.9)

1 (0.9)

1 (13.2)

Employees(9)

Employees(2)

1 (1)

Suppliers(29)

Categories(8)

1 (2.6)1 (9.6)

fdi=4

1 (1)

: Affinity of Relations to RDS in GDS

DSi RR

Af

DAf(Ri)={(m1, w1), (m2, w2),.. (mn, wn)} m1=f1(ldi), m2=f1(log(10*RCi), m3=f1(log(10*RRCi),

m4=f1(log(10*Coi) f1(α)=(11- α)/10 For a hub-child m1=f1(ldi *hi) and m2=f1(RCi)

Formula 1 (Semantic Affinity):Formula 1 (Semantic Affinity): The affinity of Ri to RDS, denoted as , with respect to a

schema and a database conforming to the schema, can be calculated with the following formula:

Where is the affinity of the Ri’s Parent to RDS or is 1 if RParent≡RDS.□

DSi RR

Af

DSParent RR

Af

DSParent

DSi RRj

jjRR

AfwmAf

: Affinity of Relations to RDS in GDS

DSi RR

Af

CustomerCustomerDemoNull (0)

Products0.74 (49)

Customer0.85 (91)

Order Details0.84 (2155)

Shippers0.85 (3)

Orders0.90 (830)

Employees1 (9)

EmployeeTerritories(49)

Territories0.96 (53)

Region0.91 (4)

fdi=0

fdi=1

fdi=2

fdi=3

92.2 (1) 5.4 (1)

1 (9.1) 2.5 (1) 1 (276.6)

1 (27.9)

1 (0.9)

1 (13.2)

Employees0.98 (9)

Employees0.98 (2)

1 (1)

Suppliers0.62 (29)

Categories0.63 (8)

1 (2.6)1 (9.6)

fdi=4

1 (1) GDS

(θ)

Employees

Employees (Reports To)

(e3)

(e2)

Territories, Region

(et1, t4, r2)

Orders

(o2)

Customers

(c2)

Shippers

Order Details

(s3)

(od1)

(p2)

Products

A Ranked set of Partial OSs - A complete OS

Result 1: 4 tuples out of 27

Result 2: 4 tuples out of 24

Result 3: 4 tuples out of 9

CustomerID

Quick

CompanyName

QUICK-Stop

ContactName

Margaret Peacock

Address

Taucherstae 10

...

...

Customer

Orders

OrderID

10418

ShipName

QUICK-Stop

FirstName

Margaret

ShipAddress

Taucherstae 10

...

...

LastName

Peacock

...

...

CompanyName

Speedy Express

...

...

... ...

Employees Shippers

Employee

EmployeeID

4

LastName

Peacock

FirstName

Margaret

Title

Sales Representative

TitleOfCourtesy

Mrs.

Address

4110 Old Redmond Rd.

...

...

EmployeeTerritories

TerritoryDescription

Rockville

ShipName

QUICK-Stop

Region

Eastern

OrderID

10418

ShipAddress

Taucherstae 10

...

...

... ...

OrderDate

1996-07-15

Region Orders

Employee

EmployeeID

3

LastName

Peacock

FirstName

Janet

Title

Sales Representative

TitleOfCourtesy

Ms.

Address

722 Moss Bay Blvd.

...

...

EmployeeTerritories

TerritoryDescription

Atlanta

ShipName

QUICK-Stop

Region

Southern

OrderID

10273

ShipAddress

Taucherstae 10

...

...

... ...

OrderDate

1996-08-05

Region Orders

OS Ranking

OS Ranking- Problems and Challenges

Existing Keyword Searching ranking semantics the smaller size the higher ranking

In contrast, in the proposed paradigm an OS containing many and well connected tuples should have certainly greater importance than an OS with less tuples.

For instance, a Customer or Employee OS involved in many Orders or an Author authored many important papers and books.

OS Ranking- Importance

Im(OS)=

ti is a tuple of OS

Im(ti) is the Importance of ti (i.e. PageRank)

|OS| is the amount of tuples in OS, AfR(ti) is the affinity of R that ti belongs to

1|)OSlog(|

)t(Af)tIm( iRi

Experimental Evaluation

MS Northwind and TPC-H DBsPrecision, Recall, F-ScoreCompare GDSs and OSs produced by 12 GDS(θ) v

GDS(h)GDS(h) was proposed by 10 participantsGDS: average F-score 86.77, OS aver F-score 83

GDS Precision, Recall and F-score (Averages)<0.5, 0.4, 0.05, 0.05>

0

20

40

60

80

100

Cus

tom

ers

Em

ploy

ees

Sup

plie

rs

Shi

pper

s

Ord

ers

Pro

duct

s

Cus

tom

er

Sup

plie

r

Par

ts

Ord

ers

Nat

ion

Reg

ion

Northwind TPC-H

Precision

Recall

F-Score

OSs Precision, Recall and F-score (Averages) <0.5, 0.4, 0.05, 0.05>

0

20

40

60

80

100

Cu

sto

me

rs

Em

plo

yee

s

Su

pp

liers

Sh

ipp

ers

Ord

ers

Pro

du

cts

Cu

sto

me

r

Su

pp

lier

Pa

rts

Ord

ers

Na

tion

Re

gio

n

Northwind TPC-H

Precision

Recall

F-Score

Conclusions –Future Work

Top-k OS results Top-k size of an OS

Challenge: the weights of new tuples are not monotonic

(since a tuple’s PageRank may increase while its Affinity decrease).

Alternative to PageRank weighting systems are currently investigated; i.e. ObjectRanks

Conclusions -Novel Contributions

The formal definition of the novel Searching Paradigm which automatically produces a ranked set of OSs for a Data Subject.

minimum contribution from the user (i.e. only a Kw) no prior knowledge of the DB schema or query language needed. Excellent Precision, Recall and F-score results

The formal definition and quantification of Relation’s Affinity in the context of GDS

consider both Schema Design and Data distributions

A novel ranking paradigm to calculate Im(OS). The quantification of tuples’ and OSs’ Importance. A Combine Function that considers:

the weight (e.g. PageRank) of tuples, Affinity and size of OS

: Affinity of Relations to RDS in GDS

DSi RR

Af

Employees Customer Order Shipper RDS

Ri

ldi, RCi,

iRC , Coi

m1..m4 AfRi AfRi (rRi) AfRi (rRi) AfRi (rRi)

Employees RDS RDS 1.00 0.88 (3) 0.97 (4) 0.82 (4) Employees (ReportsTo) 1, 1, 0.9, 4 1, 1, 1, 0.7 0.98 0.78 (5) 0.91 (5) 0.73 (5)

Employees (ReportedBy) 1, 0.9, 1, 4 1, 1, 1, 0.7 0.98 0.70(7) 0.85 (7) 0.66 (7) Territories 1, 5.4, 1, 2 1, 0.9, 1, 0.9 0.96 0.55 (10) 0.66 (10) 0.51 (10)

Region 2, 1, 13.2, 1 0.9, 1, 0.88, 1 0.91 0.46 (11) 0.59 (11) 0.43 (11) Order 1, 92.2, 1, 4 1, 0.8, 1, 0.7 0.90 0.94 (1) 1 (RDS) 0.89 (1)

Customer 2, 1, 9.1, 2 0.9, 1, 0.9, 0.9 0.85 1 (RDS) 0.99 (1) 0.83 (2) Shipper 2, 1, 276.6, 1 0.9, 1, 0.75, 1 0.85 0.88 (2) 0.98 (2) 1 (RDS)

OrderDetails 2, 2.5, 1, 2 0.9, 0.96, 1, 0.9 0.84 0.88 (4) 0.97 (3) 0.82 (3) Product 3, 1, 43.9, 4 0.8, 1, 0.83, 0.8 0.74 0.77 (6) 0.91 (6) 0.73 (6) Supplier 4, 1, 1.6, 1 0.7, 1, 0.9, 1 0.63 0.65 (8) 0.82 (8) 0.62 (8)

Categories 4, 1, 6.1, 1 0.7, 1, 0.92, 1 0.62 0.65 (9) 0.81 (9) 0.61 (9) CustDemographics 3, null, null, 1 0.8, null, null, 1 Null Null Null Null

Affinity Ranking Correctness (Average)

Affinity Ranking Correctness (Averages)

0102030405060708090

100C

ust

om

ers

Em

plo

yee

s

Su

pp

liers

Sh

ipp

ers

Ord

ers

Pro

du

cts

Cu

sto

me

r

Su

pp

lier

Pa

rts

Ord

ers

Na

tion

Re

gio

n

Northwind TPC-H

i

rrd hRi

AfRi ),(*100

100

top related