Atanu Roy, Chandrima Sarkar, Rafal A. Angryk
Using Taxonomies to Perform Aggregated Querying over Imprecise Data
Presented by: Rafal A. Angryk
Date: 2010-12-14
Roy, Sarkar, Angryk. Using Taxonomies to Perform Aggregated Querying over Imprecise Data
Outline of the Presentation
Idea
Imprecision
Motivation
Limitations of Previous Work
Definitions
Approach
Experimental Setup & Results
Conclusion and Future Work
Idea of the Project
This paper provides a framework for answering queries over imprecise data found in common databases.
We propose to solve this by classifying the data into taxonomical hierarchies and then capturing them in a weighted hierarchical hypergraph.
Imprecision in Databases: An Example
ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   Fall               Above-sec-node
R4   July               Absent
ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   Fall               Above-sec-node
R4   July               Absent

Germination Time taxonomy:
Summer: June, July
Fall: August, September

Possible world 1 (R3 germinates in August):
ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   August             Above-sec-node
R4   July               Absent

Possible world 2 (R3 germinates in September):
ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   September          Above-sec-node
R4   July               Absent

Constraint: All soybean seeds with the same kind of stem canker should germinate in the same month of the season.
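The expansion of an imprecise record into possible worlds, and the filtering of those worlds by the constraint, can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes one reading of the constraint: records that share a stem canker value and fall in the same season must share a germination month. The names `TAXONOMY`, `RECORDS`, `possible_worlds`, and `is_valid` are hypothetical.

```python
from itertools import product

# Taxonomy from the slide: each season expands to its months.
TAXONOMY = {"Summer": ["June", "July"], "Fall": ["August", "September"]}
# Reverse lookup: month -> season.
SEASON_OF = {m: s for s, months in TAXONOMY.items() for m in months}

# Records from the example table: (id, germination time, stem canker).
RECORDS = [
    ("R1", "August", "Above-sec-node"),
    ("R2", "September", "Absent"),
    ("R3", "Fall", "Above-sec-node"),  # imprecise: "Fall" is a season
    ("R4", "July", "Absent"),
]


def possible_worlds(records, taxonomy):
    """Yield every table obtained by replacing each season with one of its months."""
    choices = [taxonomy.get(time, [time]) for _, time, _ in records]
    for months in product(*choices):
        yield [(rid, m, canker) for (rid, _, canker), m in zip(records, months)]


def is_valid(world):
    """Assumed constraint: same stem canker and same season imply the same month."""
    month_for = {}
    for _, month, canker in world:
        key = (canker, SEASON_OF[month])
        if month_for.setdefault(key, month) != month:
            return False
    return True


worlds = list(possible_worlds(RECORDS, TAXONOMY))
valid = [w for w in worlds if is_valid(w)]
```

Under this reading, the example has two possible worlds, and only the one in which R3 germinates in August is retained, because R1 already fixes August for Above-sec-node in Fall.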
Motivation
Several recent papers have focused on retrieval of imprecise data, where every fact can be a region, instead of a point, in a multi-dimensional space.
The most prominent one is [BDRV07].
They solved it by constructing marginal databases (MDBs) from extended databases (EDBs) with the help of a constraint hypergraph.
Limitations of Previous Work
Creating marginal databases using a weighted hierarchical hypergraph employs a brute-force method for retrieving connected facts (tuples).
This increases the overall time complexity and the processing time of the queries.
[BDRV07] follows a data-specific technique, whereas we propose to use domain-specific knowledge.
Definitions
Background knowledge: Knowledge required to generate taxonomies.
Expert knowledge: Domain-specific human expertise.
Data-derived knowledge: Derived from a historic precise database and used to generate mutually exclusive probabilities.
Possible worlds: All the possible combinations of values that an imprecise record can assume.
Valid worlds: All the possible worlds that satisfy a given set of constraints.
Assignment of Probabilities
EDB Creation
The probability of a possible world is the product of the unconditional probabilities of occurrence of all its imprecise attribute values.
The sum of the probabilities of all possible worlds of an imprecise record is 1.
Probability assignment rule creates a set of tuples using
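The product rule for possible-world probabilities can be illustrated with a short sketch. The per-value probabilities used here (August 0.8, September 0.2, Above-sec-node 0.6, Absent 0.4) are the ones that appear for records R1 and R2 in the EDB example table of these slides. The function name `expand_record` and the data layout are hypothetical.

```python
from itertools import product


def expand_record(record_id, attribute_options):
    """Expand one imprecise record into EDB tuples.

    attribute_options holds one dict per attribute, mapping each candidate
    value to its unconditional probability. The probability of a combination
    is the product of the probabilities of its values.
    """
    tuples = []
    for combo in product(*(opts.items() for opts in attribute_options)):
        values = tuple(value for value, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        tuples.append((record_id, values, prob))
    return tuples


# Candidate values and probabilities as read from the EDB example table.
germination = {"August": 0.8, "September": 0.2}
stem_canker = {"Above-sec-node": 0.6, "Absent": 0.4}

r3_tuples = expand_record("R3", [germination, stem_canker])
r3_total = sum(prob for _, _, prob in r3_tuples)
```

For a record that is imprecise in both attributes, this yields four tuples with probabilities of about 0.48, 0.32, 0.12, and 0.08, which sum to 1 and agree with the R3 rows of the EDB table.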
Hyperedge Creation
MDB Creation
A weighted hierarchical hypergraph is defined as H(L, E), where L represents the nodes and E is the set of hyperedges between different taxonomies.
Each hyperedge signifies a distinct combination of attribute values. The weight of a possible world assigned to a hyperedge [AC10] needs to preserve a few properties.
All t-norms [AC10] (e.g. minimum, product) fulfill these requirements. We choose the product for our preliminary investigation.
EDB → MDB

Extended database (EDB):
ID   TempID   Germination Time   Stem Canker      ProbEDB
R1   T1       August             Above-sec-node   0.80
R1   T2       September          Above-sec-node   0.20
R2   T3       August             Above-sec-node   0.60
R2   T4       August             Absent           0.40
R3   T5       August             Above-sec-node   0.48
R3   T6       August             Absent           0.32
R3   T7       September          Above-sec-node   0.12
R3   T8       September          Absent           0.08
R4   T9       September          Absent           1.00

Marginal database (MDB):
ID   TempID   Germination Time   Stem Canker      ProbMDB
R1   T1       August             Above-sec-node   0.9057
R1   T2       September          Above-sec-node   0.0943
R2   T3       August             Above-sec-node   0.6429
R2   T4       August             Absent           0.3571
R3   T5       August             Above-sec-node   0.4983
R3   T6       August             Absent           0.2768
R3   T7       September          Above-sec-node   0.0519
R3   T8       September          Absent           0.1730
R4   T9       September          Absent           1.0000
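One way to relate the two tables is to multiply each EDB probability by the weight of its hyperedge and renormalize within each record. The sketch below is an assumption, not the authors' stated algorithm. The slides do not give the hyperedge weights, so the relative values in `WEIGHTS` are back-calculated from the ProbEDB and ProbMDB columns. The names `WEIGHTS` and `to_mdb` are hypothetical.

```python
# EDB tuples per record: (germination time, stem canker, ProbEDB).
EDB = {
    "R1": [("August", "Above-sec-node", 0.80), ("September", "Above-sec-node", 0.20)],
    "R2": [("August", "Above-sec-node", 0.60), ("August", "Absent", 0.40)],
    "R3": [
        ("August", "Above-sec-node", 0.48),
        ("August", "Absent", 0.32),
        ("September", "Above-sec-node", 0.12),
        ("September", "Absent", 0.08),
    ],
    "R4": [("September", "Absent", 1.00)],
}

# ASSUMPTION: relative hyperedge weights inferred from the two tables,
# not taken from the slides. Only their ratios matter.
WEIGHTS = {
    ("August", "Above-sec-node"): 2.4,
    ("August", "Absent"): 2.0,
    ("September", "Above-sec-node"): 1.0,
    ("September", "Absent"): 5.0,
}


def to_mdb(edb, weights):
    """Scale each EDB probability by its hyperedge weight, then renormalize per record."""
    mdb = {}
    for rid, tuples in edb.items():
        scored = [(t, c, p * weights[(t, c)]) for t, c, p in tuples]
        total = sum(s for _, _, s in scored)
        mdb[rid] = [(t, c, s / total) for t, c, s in scored]
    return mdb


mdb = to_mdb(EDB, WEIGHTS)
mdb_rounded = {rid: [round(p, 4) for _, _, p in rows] for rid, rows in mdb.items()}
```

With these inferred weights, the rounded results match the ProbMDB column (for example 0.9057 and 0.0943 for R1). That shows the rule is consistent with the example, but it does not confirm how the weights were actually derived.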
Aggregated Querying
We aggregate tuples for aggregated querying based on their uniqueness.
Group two tuples only when all their attribute values and the corresponding probabilities are the same.
Example query: find the total number of plants grown in August that have a stem canker above-sec-node.
(44 * 0.9057) + (25 * 0.6429) ≈ 56
GID   Germination Time   Stem Canker      Marginal probability   No. of plants
G1    August             Above-sec-node   0.9057                 44
G1    August             Above-sec-node   0.0943                 44
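The arithmetic of the example query can be reproduced with a small sketch that sums, over all matching groups, the marginal probability times the number of plants. The second group (probability 0.6429, 25 plants) is taken from the arithmetic shown on the slide rather than from the table. The names `groups` and `expected_count` are hypothetical.

```python
# Each group: attribute values, marginal probability, and number of plants.
groups = [
    {"time": "August", "canker": "Above-sec-node", "prob": 0.9057, "plants": 44},
    {"time": "August", "canker": "Above-sec-node", "prob": 0.6429, "plants": 25},
]


def expected_count(groups, time, canker):
    """Expected number of plants matching the query, weighted by marginal probability."""
    return sum(
        g["prob"] * g["plants"]
        for g in groups
        if g["time"] == time and g["canker"] == canker
    )


total = expected_count(groups, "August", "Above-sec-node")
answer = round(total)
```

The two groups stay separate even though their attribute values are identical, because their marginal probabilities differ. The weighted sum is about 55.92, which rounds to the 56 given on the slide.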
Experimental Setup
Census-Income dataset from the UCI Machine Learning Repository.
Finally used 7 dimensions.
Precise database has 191239 records.
Test dataset has 99762 records.
Randomly inserted imprecision into the test dataset to make it imprecise.
Distribution of Imprecision
Attributes       age          mic          edu          cbf          cbm          cbs          wwy
Level 1 (Root)   150 (1)      107 (1)      136 (1)      100 (1)      100 (1)      100 (1)      375 (1)
Level 2          450 (3)      1393 (16)    409 (3)      144 (2)      144 (2)      144 (2)      1125 (3)
Level 3          900 (6)      8500 (24)    955 (7)      289 (4)      289 (4)      289 (4)      8500 (9)
Level 4          8500 (12)    -            8500 (17)    867 (12)     867 (12)     867 (12)     -
Level 5          -            -            -            8500 (42)    8500 (42)    8500 (42)    -
Total            10000 (22)   10000 (41)   10000 (28)   10000 (61)   10000 (61)   10000 (61)   10000 (13)
Imprecision Characteristics
[Figure: bar chart, y-axis "No. of Tuples" (0 to 45000). Bar labels by x-axis bin:]
Bin       No. of Tuples
1         34701
2-10      38485
11-50     16661
51-100    2979
101-250   2306
251-500   978
>500      3652
Scalability Test
[Figure: runtime (in sec.) for dataset sizes from 10k to 99.7k, with a linear trendline, R² = 0.9957.]
Extended Database Analysis
[Figure: chart with a logarithmic axis from 100 to 10000000. Series: Tuples with a linear trendline, and Outliers with a linear trendline.]
Influence of Imprecision
[Figure: No. of MDB tuples (axis from 100000 to 500000), with a degree-4 polynomial trendline, R² = 0.9976.]
Absolute Percentage Error
[Figure: absolute percentage error (axis from 0 to 9) at x-axis values 150, 300, 600, 900, 1200, and 1500.]
Conclusion and Future Work
In this research we present a framework for efficient querying over imprecise data, with an average accuracy of ≈ 94%.
We intend to extend this research to use ontologies in place of taxonomies.
We also intend to use Associative Weight Mining to assign weights to hyperedges.
Questions?
References
[BDRV07]: Douglas Burdick, AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP over Imprecise Data with Domain Constraints. VLDB 2007: 39-50
[AC10]: Rafal A. Angryk, Jacek Czerniak: Heuristic Algorithm for Interpretation of Multi-Valued Attributes in Similarity-based Fuzzy Relational Databases. International Journal of Approximate Reasoning 51: 895-911 (2010)