
Atanu Roy

Chandrima Sarkar

Rafal A. Angryk

Using Taxonomies to Perform Aggregated Querying over Imprecise Data

Presented by: Rafal A. Angryk

Date: 2010-12-14

Roy, Sarkar, Angryk. Using Taxonomies to Perform Aggregated Querying over Imprecise Data

Outline of the Presentation

Idea
Imprecision
Motivation
Limitations of Previous Work
Definitions
Approach
Experimental Setup & Results
Conclusion and Future Work


Idea of the Project

This paper provides a framework for answering queries over imprecise data found in common databases.

We propose to solve this by classifying the data into taxonomical hierarchies and then capturing it in a weighted hierarchical hypergraph.


Imprecision in Databases: An Example

ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   Fall               Above-sec-node
R4   July               Absent


Taxonomy for Germination Time:

Germination Time
    Summer: June, July
    Fall: August, September

R3 has an imprecise germination time (Fall), so it can take either leaf value under Fall.

Possible world with R3 = August:

ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   August             Above-sec-node
R4   July               Absent

Possible world with R3 = September:

ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   September          Above-sec-node
R4   July               Absent

Constraint: All soybean seeds with the same kind of stem canker should germinate in the same month of the season.
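The step from possible worlds to valid worlds can be traced in a few lines of Python. This is only a sketch under stated assumptions: the names `TAXONOMY`, `possible_worlds` and `is_valid` are hypothetical, and the constraint is read as "records in the same season with the same stem canker must share a month", which is one interpretation of the slide and not necessarily the authors' formalisation.

```python
from itertools import product

# Taxonomy from the slide: each season-level value maps to its leaf months.
TAXONOMY = {"Summer": ["June", "July"], "Fall": ["August", "September"]}
SEASON_OF = {month: season for season, months in TAXONOMY.items() for month in months}

# The example database; R3 carries an imprecise germination time ("Fall").
RECORDS = {
    "R1": ("August", "Above-sec-node"),
    "R2": ("September", "Absent"),
    "R3": ("Fall", "Above-sec-node"),
    "R4": ("July", "Absent"),
}

def possible_worlds(records):
    """Yield every database obtained by replacing imprecise values with leaf values."""
    ids = list(records)
    # A precise month is its own single choice; a season expands to its months.
    choices = [TAXONOMY.get(records[r][0], [records[r][0]]) for r in ids]
    for months in product(*choices):
        yield {r: (m, records[r][1]) for r, m in zip(ids, months)}

def is_valid(world):
    """Assumed constraint: same season and same stem canker implies same month."""
    month_for = {}
    for month, canker in world.values():
        key = (SEASON_OF[month], canker)
        if month_for.setdefault(key, month) != month:
            return False
    return True

worlds = list(possible_worlds(RECORDS))
valid = [w for w in worlds if is_valid(w)]
```

Under this reading there are two possible worlds, and only the one in which R3 germinates in August is valid, because R1 already fixes August for Above-sec-node within Fall.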


Motivation

Several recent papers have focused on retrieval of imprecise data, where every fact can be a region, instead of a point, in a multi-dimensional space.

The most prominent one is [BDRV07].

They solve the problem by constructing marginal databases (MDBs) from extended databases (EDBs) with the help of a constraint hypergraph.


Limitations of Previous Work

Creating marginal databases using a weighted hierarchical hypergraph employs a brute-force method for retrieving connected facts (tuples).

This increases the overall time complexity and the processing time of the queries.

[BDRV07] follows a data-specific technique, whereas we propose to rely on domain-specific knowledge.


Definitions

Background knowledge: Knowledge required to generate taxonomies.

Expert knowledge: Domain-specific human expertise.

Data-derived knowledge: Knowledge derived from a historic precise database and used to generate mutually exclusive probabilities.

Possible worlds: All the possible combinations of values that an imprecise record can assume.

Valid worlds: All the possible worlds that satisfy a given set of constraints.




Assignment of Probabilities


EDB Creation



The probability of a possible world is the product of the unconditional probabilities of occurrence of the values chosen for all imprecise attributes.

The probabilities of all possible worlds of an imprecise record sum to 1.
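A minimal sketch of this rule follows. It assumes that each imprecise attribute comes with a probability for every candidate leaf value; the function name `expand_record` and the numbers used for the record are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod

def expand_record(candidates):
    """Expand one imprecise record into its possible worlds.

    `candidates` maps each attribute to {leaf value: unconditional probability}.
    Returns a list of (attribute values, probability) pairs, where the
    probability is the product over all attributes.
    """
    attrs = list(candidates)
    worlds = []
    for combo in product(*(candidates[a].items() for a in attrs)):
        values = {a: value for a, (value, _) in zip(attrs, combo)}
        probability = prod(p for _, p in combo)
        worlds.append((values, probability))
    return worlds

# Hypothetical record that is imprecise in both attributes.
record = {
    "Germination Time": {"August": 0.8, "September": 0.2},
    "Stem Canker": {"Above-sec-node": 0.6, "Absent": 0.4},
}
worlds = expand_record(record)
total = sum(p for _, p in worlds)  # should be 1 up to rounding
```

Because the probabilities of each attribute sum to 1, the products over all combinations also sum to 1, which is the second property stated above.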

Probability assignment rule creates a set of tuples using


Hyperedge Creation



MDB Creation

A weighted hierarchical hypergraph is defined as H(L, E), where L represents the nodes and E is the set of hyperedges between different taxonomies.

Each hyperedge signifies a distinct combination of attribute values. The weight of a possible world assigned to a hyperedge [AC10] needs to preserve a few properties.

All t-norms [AC10] (e.g. minimum, product) fulfill these requirements. We choose product for the purposes of our preliminary investigation.
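The two t-norms named above can be compared with a short sketch. The slides do not spell out how weights enter a hyperedge, so the function `hyperedge_weight` and the idea of combining one weight per attribute value are assumptions made for illustration only.

```python
from functools import reduce

def t_min(a, b):
    """Minimum t-norm."""
    return min(a, b)

def t_product(a, b):
    """Product t-norm, the choice made in the presentation."""
    return a * b

def hyperedge_weight(weights, t_norm=t_product):
    """Combine per-attribute weights in [0, 1] with a t-norm.

    1.0 is the neutral element of every t-norm, so it is a safe start value.
    """
    return reduce(t_norm, weights, 1.0)

# Hypothetical weights of the two attribute values joined by one hyperedge.
weights = [0.8, 0.6]
by_product = hyperedge_weight(weights)         # product t-norm
by_minimum = hyperedge_weight(weights, t_min)  # minimum t-norm
```

The product never exceeds the minimum on [0, 1], so it penalises hyperedges that join several uncertain values more strongly.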


EDB to MDB

ID    TempID   Germination Time   Stem Canker      ProbEDB
R1    T1       August             Above-sec-node   0.80
R1    T2       September          Above-sec-node   0.20
...
R2    T4       August             Absent           0.40
R3    T5       August             Above-sec-node   0.48
...
R3    T8       September          Absent           0.08

[MDB table with columns ID, TempID, Germination Time, Stem Canker, ProbMDB; rows not recovered.]

Aggregated Querying

For aggregated querying, we aggregate tuples based on their uniqueness.

Two tuples are grouped only when all their attribute values and the corresponding probabilities are the same.

Example query: find the total number of plants grown in August that have stem canker Above-sec-node.

(44*0.9057) + (25*0.6429) ≈ 56

GID   Germination Time   Stem Canker      Marginal probability   No. of plants
G1    August             Above-sec-node   0.9057                 44
G1    August             Above-sec-node   0.0943                 44
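The arithmetic of the example query can be reproduced as an expected count over groups. In this sketch the two groups are taken from the terms of the formula (44 plants at 0.9057 and 25 plants at 0.6429); the field names and the function `expected_count` are hypothetical.

```python
# Groups matching the two terms of the formula in the example query.
groups = [
    {"germination_time": "August", "stem_canker": "Above-sec-node",
     "marginal_probability": 0.9057, "plants": 44},
    {"germination_time": "August", "stem_canker": "Above-sec-node",
     "marginal_probability": 0.6429, "plants": 25},
]

def expected_count(groups, **conditions):
    """Sum plants * marginal probability over the groups that match all conditions."""
    total = 0.0
    for group in groups:
        if all(group[key] == value for key, value in conditions.items()):
            total += group["plants"] * group["marginal_probability"]
    return total

answer = expected_count(
    groups, germination_time="August", stem_canker="Above-sec-node"
)
rounded = round(answer)  # 44*0.9057 + 25*0.6429 = 55.9233, which rounds to 56
```

Each group contributes its size weighted by its marginal probability, so a group of 44 plants that is only 90.57% likely to match counts as roughly 39.85 plants.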


Experimental Setup

Census-Income dataset from the UCI Machine Learning Repository.

Seven dimensions were used in the final experiments.

The precise database has 191239 records.

The test dataset has 99762 records.

Imprecision was randomly inserted into the test dataset to make it imprecise.
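The slides do not describe how imprecision was inserted. One plausible procedure, shown here purely as an assumption, is to pick a random subset of records and replace a leaf value with its ancestor in the taxonomy. The `PARENT` map reuses the small germination-time taxonomy from the earlier example, not the census attributes, and all names are hypothetical.

```python
import random

# Child -> parent links of the toy germination-time taxonomy.
PARENT = {
    "June": "Summer", "July": "Summer",
    "August": "Fall", "September": "Fall",
    "Summer": "Germination Time", "Fall": "Germination Time",
}

def generalize(value, levels=1, parent=PARENT):
    """Climb `levels` steps up the taxonomy, stopping at the root."""
    for _ in range(levels):
        if value not in parent:
            break
        value = parent[value]
    return value

def insert_imprecision(records, attribute, fraction, rng):
    """Return a copy of `records` with `fraction` of them made imprecise."""
    result = [dict(r) for r in records]
    how_many = round(fraction * len(result))
    for index in rng.sample(range(len(result)), how_many):
        result[index][attribute] = generalize(result[index][attribute])
    return result

precise = [{"id": i, "germination_time": "August"} for i in range(10)]
imprecise = insert_imprecision(precise, "germination_time", 0.3, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible and leaves the precise input untouched, which matters when the precise copy is later used as ground truth.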


Distribution of Imprecision

Attributes       age          mic          edu          cbf          cbm          cbs          wwy
Level 1 (Root)   150 (1)      107 (1)      136 (1)      100 (1)      100 (1)      100 (1)      375 (1)
Level 2          450 (3)      1393 (16)    409 (3)      144 (2)      144 (2)      144 (2)      1125 (3)
Level 3          900 (6)      8500 (24)    955 (7)      289 (4)      289 (4)      289 (4)      8500 (9)
Level 4          8500 (12)    -            8500 (17)    867 (12)     867 (12)     867 (12)     -
Level 5          -            -            -            8500 (42)    8500 (42)    8500 (42)    -
Total            10000 (22)   10000 (41)   10000 (28)   10000 (61)   10000 (61)   10000 (61)   10000 (13)


Imprecision Characteristics

[Bar chart; y-axis: No. of Tuples, from 0 to 45000; x-axis label not recovered.]

Bin        No. of Tuples
1          34701
2-10       38485
11-50      16661
51-100     2979
101-250    2306
251-500    978
>500       3652


Scalability Test

[Figure: Runtime (in sec.) plotted over 10k to 99.7k on the x-axis, with a linear trendline (R² = 0.995651793028629); y-axis from 0 to 40000.]


Extended Database Analysis

[Figure: Tuples with a linear trendline, and Outliers with their own linear trendline; y-axis ticks from 100 to 10000000.]


Influence of Imprecision

[Figure: No. of MDB tuples with a degree-4 polynomial trendline (R² = 0.997607922001129); y-axis from 100000 to 500000.]


Absolute Percentage Error

[Figure: Absolute Percentage Error (y-axis, 0 to 9) for x-axis values 150, 300, 600, 900, 1200 and 1500.]
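Absolute percentage error is conventionally the absolute difference between an estimate and the true value, expressed as a percentage of the true value. The sketch below uses that standard definition; the slides do not state the exact formula the authors used, and the sample numbers are hypothetical.

```python
def absolute_percentage_error(actual, estimated):
    """Return |estimated - actual| / |actual| * 100."""
    if actual == 0:
        raise ValueError("absolute percentage error is undefined for actual == 0")
    return abs(estimated - actual) / abs(actual) * 100.0

# Hypothetical query: the precise database gives 50, the MDB estimate is 47.
ape = absolute_percentage_error(actual=50, estimated=47)
```

Here the estimate misses by 3 out of 50, an error of about 6 percent.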


Conclusion and Future Work

In this research we present a framework for efficient querying over imprecise data, with an average accuracy of ≈ 94%.

We intend to extend this research to use ontologies in place of taxonomies.

We also intend to use Associative Weight Mining to assign weights to hyperedges.


Questions?


References

[BDRV07]: Douglas Burdick, AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP over Imprecise Data with Domain Constraints. VLDB 2007: 39-50

[AC10]: Rafal A. Angryk, Jacek Czerniak: Heuristic Algorithm for Interpretation of Multi-Valued Attributes in Similarity-based Fuzzy Relational Databases. International Journal of Approximate Reasoning 51: 895-911 (2010)

