SEEDEEP: A System for Exploring and Querying Deep Web Data Sources
![Page 1: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/1.jpg)
SEEDEEP: A System for Exploring and Querying Deep Web Data Sources
Gagan Agrawal
Fan Wang, Tantan Liu
Ohio State University
![Page 2: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/2.jpg)
The Deep Web
The definition of “the deep web” from Wikipedia
The deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines.
![Page 3: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/3.jpg)
The Deep Web is Huge
500 times larger than the surface web
7500 terabytes of information (19 terabytes in the surface web)
550 billion documents (1 billion in the surface web)
More than 200,000 deep web sites
Relevant to every domain: scientific, e-commerce, market
![Page 4: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/4.jpg)
The Deep Web is Informative
Deeper content than the surface web
  Surface web: text format
  Deep web: specific and relational information
More than half of the deep web content is in topic-specific databases
  Biology, Chemistry, Medical, Travel, Business, Academia, and many more…
95 percent of the deep web is publicly accessible
![Page 5: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/5.jpg)
Hard to Use the Deep Web
Challenges for Integration
  Self-maintained and created
  Heterogeneous and hidden metadata
  Dynamically updated metadata
Challenges for Searching
  Standard input format
  Data redundancy and data source ranking
  Data source dependency
Challenges for Performance
  Network latency and caching mechanism
  Fault tolerance issue
![Page 6: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/6.jpg)
Motivating Example (1)
Biologists have identified that gene X and protein Y contribute to a disease. They want to examine the SNPs (Single Nucleotide Polymorphisms) located in the genes that share the same functions as either X or Y.
In particular, for all SNPs that are located in each gene with functions similar to either X or Y and that have a heterozygosity value greater than 0.01, biologists want to know the maximal SNP frequency in the Asian population.
![Page 7: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/7.jpg)
Motivating Example (2)
[Figure: query plan for the motivating example]
Query Plan Part 1 (subquery 1): the genes that have the same functions as X
  Data sources: NCBI Gene, GO
  Edge labels: gene X, function, gene name
Query Plan Part 2 (subquery 2): the genes that have the same functions as Y
  Data sources: Human Protein, NCBI Gene, GO
  Edge labels: protein Y, gene name, function, gene name
Query Plan Part 3 (main query): the frequency information of the SNPs located in these genes, filtered by heterozygosity values
  Data sources: SNP500Cancer, dbSNP
  Edge labels: filtering by heterozygosity, SNP frequency
![Page 8: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/8.jpg)
Motivating Example (3)
[Figure: the three-part query plan from Motivating Example (2), annotated with the questions below]
How do you know NCBI Gene could provide gene function information given the gene name?
Do NCBI Gene and GO data source both use “function” to represent the meaning of “gene function”?
Three data sources, dbSNP, Alfred, and Seattle, could provide SNP frequency data. Why choose dbSNP?
I cannot filter SNPs by heterozygosity values on dbSNP
A path clearly guides the search
What if the SNP500Cancer data source is unavailable?
![Page 9: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/9.jpg)
Our Contribution: SEEDEEP System
[Figure: SEEDEEP system architecture]
Exploring part of SEEDEEP
  Deep Web
  Schema Mining, Schema Matching
  System Models: Data Source Model, Data Source Dependency Model
Data Source Model (example from the figure)
  Source  Input  Output  Constraint
  S1      A1     B1,B2   C2
  S2      A1     B2,B3   C1
Data Source Dependency Model (example from the figure): nodes D1, D2, D3, D4
Querying part of SEEDEEP
  Query, Query Planning, Plan Execution, Results
  Plan Base, Plan Reuse, Incremental Plan Generation
Annotations
  Discover data source metadata
  Discover data source inter-dependency
  Generate query plans for search
  Query caching mechanism
  Fault tolerance mechanism
![Page 10: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/10.jpg)
Outline
Introduction and Motivation
System Core
  Query planning problem
  Query planning algorithms
Other system components
  Query caching
  Fault tolerance
  Schema mining
Other Issues
![Page 11: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/11.jpg)
What queries does our system support?
They want to examine the SNPs located in the genes that share the same functions as either X or Y. In particular, for all SNPs that are located in each gene with functions similar to either X or Y and that have a heterozygosity value greater than 0.01, biologists want to know the maximal SNP frequency in the Asian population.
Selection-Projection-Join (SPJ) queries
Aggregation-Groupby queries
Nested queries: Condition and Entity
![Page 12: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/12.jpg)
Data Source Model (1): Single Data Source
Each data source is a virtual relational table
Virtual relational data elements
  MI: must fill-in input attributes
  OI: optional fill-in input attributes
  O: output attributes
  C: inherent data source constraints
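The four elements above can be pictured as a small record type. The following is a minimal sketch, not the SEEDEEP implementation: the class name `DataSource`, the `can_query` helper, and the example attribute names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """A deep web data source viewed as a virtual relational table."""
    name: str
    must_inputs: frozenset        # MI: attributes that must be filled in
    optional_inputs: frozenset    # OI: attributes that may be filled in
    outputs: frozenset            # O: attributes returned by the source
    constraints: tuple = ()       # C: inherent constraints, kept as plain strings here

    def can_query(self, available):
        """True if every must fill-in input is among the available attributes."""
        return self.must_inputs <= set(available)


# Hypothetical source description, for illustration only.
snp_source = DataSource(
    name="ExampleSNPSource",
    must_inputs=frozenset({"snp_id"}),
    optional_inputs=frozenset({"population"}),
    outputs=frozenset({"snp_frequency", "heterozygosity"}),
)
```

With this shape, a planner can ask `snp_source.can_query({"snp_id"})` before placing the source in a plan.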
![Page 13: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/13.jpg)
Data Source Model (2): Correlated Sources
Hyper-graph dependency model
  Multi-source dependency
Dependency relations for data sources D1 and D2
  Type 1: D1 provides must fill-in inputs for D2
  Type 2: D1 provides optional fill-in inputs for D2
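The two dependency types can be told apart by intersecting D1's outputs with D2's inputs. This is a sketch under assumptions: the function name and attribute names are hypothetical, and giving Type 1 priority when both kinds of input are provided is a choice made here, not something the slides state.

```python
def dependency_type(d1_outputs, d2_must_inputs, d2_optional_inputs):
    """Classify how D1 can feed D2; returns 1, 2, or None."""
    d1_outputs = set(d1_outputs)
    if d1_outputs & set(d2_must_inputs):
        return 1      # Type 1: D1 provides must fill-in inputs for D2
    if d1_outputs & set(d2_optional_inputs):
        return 2      # Type 2: D1 provides optional fill-in inputs for D2
    return None       # no dependency between the two sources


# Hypothetical attribute names, for illustration only.
gene_outputs = {"gene_name", "function"}
edge_type = dependency_type(gene_outputs, {"gene_name"}, {"population"})
```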
![Page 14: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/14.jpg)
Planning Algorithm Overview
Tree representation of user query
1. Each node represents a simple query
2. A divide-and-conquer approach
3. A final combination step generates the final query plan
Query Types:
1. Aggregation query
2. Nested entity sub-query
3. Ordinary query
![Page 15: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/15.jpg)
Query Planning Problem for Ordinary Query
Ordinary query format
  Entity keywords, attribute keywords, comparison predicates
  Standard select-project-join SQL query style
Formulation
  Sub-graph set cover problem, NP-hard
[Figure labels: starting data source, target data source]
![Page 16: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/16.jpg)
Bidirectional Query Planning Algorithm (1)
Heuristic algorithm based on the algorithm introduced by Kacholia et al.
Algorithm overview
  Starting nodes
  Target nodes
  Bidirectional graph traversal
[Figure: example graph whose nodes are labeled with the keywords they cover (k1, k2, k3)]
![Page 17: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/17.jpg)
Bidirectional Query Planning Algorithm (2)
How to find the minimal sub-graph
  Find the shortest paths from starting nodes to target nodes
  Dijkstra’s shortest path algorithm
Benefit function
  Data source coverage
  Data source data quality, ontology based
  User constraints matching
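The shortest-path step can be sketched with a textbook Dijkstra search. This is only an illustration of that one step, not the bidirectional planner itself: the graph edges and costs below are hypothetical, and treating an edge cost as an inverse of the benefit function is an assumption.

```python
import heapq


def shortest_path(graph, start, target):
    """Dijkstra's algorithm on a dict graph {node: {neighbor: cost}}.

    Returns (total_cost, path), or (float("inf"), []) if target is unreachable.
    """
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, edge_cost in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical dependency graph; edge costs stand in for an inverse benefit score.
graph = {
    "NCBI Gene": {"GO": 1.0, "dbSNP": 4.0},
    "GO": {"dbSNP": 1.5},
    "dbSNP": {},
}
cost, path = shortest_path(graph, "NCBI Gene", "dbSNP")
```

Here the two-hop route through GO costs 2.5 and beats the direct edge of cost 4.0.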
![Page 18: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/18.jpg)
Query Planning Problem for Aggregation Query
Node connection property
  The aggregation data source(s) must be directly or indirectly connected with the grouping data source.
Formulation
  Sub-graph set cover problem with node connection property constraint
  NP-hard
![Page 19: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/19.jpg)
Center-spread Query Planning Algorithm (1)
Algorithm initialization
  Starting nodes
  Target nodes
  Center nodes: aggregation data source nodes
Algorithm overview
  Graph traversal starts from the center nodes
  Gradually add center nodes’ neighbors adhering to node connection property
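One way to picture the center-spread idea is a breadth-first expansion from a center node that only ever adds nodes together with their path back to the center, so the chosen set stays connected. This is a simplified sketch under assumptions (a single center, an unweighted graph, hypothetical node and keyword names); it is not the algorithm as published.

```python
from collections import deque


def center_spread(graph, center, covers, required):
    """Grow a connected set of nodes outward from `center` until the
    `required` keywords are covered (or the reachable graph is exhausted)."""
    required = set(required)
    chosen = {center}
    covered = set(covers.get(center, ())) & required
    parent = {center: None}
    queue = deque([center])
    while queue and covered != required:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            queue.append(neighbor)
            gained = (set(covers.get(neighbor, ())) & required) - covered
            if gained:
                covered |= gained
                # Add the neighbor and every node on its path back to the
                # center, so the chosen set stays connected to the center.
                step = neighbor
                while step is not None and step not in chosen:
                    chosen.add(step)
                    step = parent[step]
    return chosen, covered


# Hypothetical graph: AGG is the aggregation (center) source.
graph = {"AGG": ["A", "B"], "A": ["G"], "B": [], "G": []}
covers = {"AGG": {"frequency"}, "B": {"other"}, "G": {"gene_name"}}
chosen, covered = center_spread(graph, "AGG", covers, {"frequency", "gene_name"})
```

Node B is skipped because it covers nothing that is required, while A is kept only because it links G back to the center.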
![Page 20: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/20.jpg)
Center-spread Query Planning Algorithm (2)
[Figure: center-spread traversal example. Node labels: AGG, S, k1, grouping data source]
![Page 21: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/21.jpg)
Query Planning Problem for Nested Entity Query (1)
“SNP_Frequency, Gene {Function, X}”
Find the genes which have the same functions as X
Find the entities specified by b that have the same value on attribute a as the entities that are specified by e1,…,ek
{Gene, Function, X}
![Page 22: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/22.jpg)
Query Planning Problem for Nested Entity Query (2)
Node linking property
  The linking data source, which is the data source covering keyword a, must be topologically before the data source covering the entity keyword b
  “Gene {Function, Protein X}”, where b = Gene, a = Function, e = Protein X
Formulation
  Sub-graph set cover problem with node linking property constraint
  NP-hard
![Page 23: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/23.jpg)
Plan Combination
[Figure labels: ending nodes, receiving nodes]
![Page 24: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/24.jpg)
Plan Merging
Query plans for sub-queries can be similar
Reduce the network transmission cost of a query plan
Two edges can be merged if the inputs and outputs used by the paired data sources are the same
Mergeable edge weights
Optimal Merging
  Compatibility graph CG
  Maximal node weighted clique in CG
  Modified reactive local search (tabu search) algorithm
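To make the clique formulation concrete, the sketch below finds a maximum node-weighted clique by exhaustive search on a tiny compatibility graph. Exhaustive search is swapped in here purely for readability; the slides name a modified reactive local search (tabu search), which this does not implement. Node names and weights are hypothetical.

```python
from itertools import combinations


def max_weight_clique(weights, compatible):
    """Exhaustively find the clique with the largest total node weight.

    weights: {node: weight}; compatible: set of frozenset({u, v}) edges.
    """
    nodes = list(weights)
    best, best_weight = (), 0
    for size in range(1, len(nodes) + 1):
        for group in combinations(nodes, size):
            is_clique = all(frozenset(pair) in compatible
                            for pair in combinations(group, 2))
            total = sum(weights[n] for n in group)
            if is_clique and total > best_weight:
                best, best_weight = group, total
    return set(best), best_weight


# Hypothetical compatibility graph: each node is a pair of mergeable edges.
weights = {"m1": 3, "m2": 2, "m3": 4, "m4": 2}
compatible = {frozenset({"m1", "m2"}), frozenset({"m2", "m3"}),
              frozenset({"m1", "m4"}), frozenset({"m2", "m4"})}
clique, total = max_weight_clique(weights, compatible)
```

The search is exponential in the number of nodes, which is why a heuristic is needed for realistic plans.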
![Page 25: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/25.jpg)
Query Execution Optimization: Pipelined Aggregation
Performing aggregation in a pipelined manner
Reduce transmission cost by early pruning
Grouping-first query plans
[Figure (a): Example for Pipelined Aggregation. Labels: SNP500Cancer, dbSNP, gene names, SNP IDs, SNP frequency, grouping on gene name, aggregation on SNP frequency]
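The core of pipelined aggregation is that each incoming row updates a running aggregate instead of being stored. The sketch below shows only that streaming step for a per-group maximum; the row values and function names are hypothetical, and the cross-source pruning described on the slide is not modeled.

```python
def pipelined_max(rows):
    """Consume (group, value) rows one at a time and keep a running
    maximum per group, so no row has to be buffered."""
    running = {}
    for group, value in rows:
        if group not in running or value > running[group]:
            running[group] = value
    return running


def snp_rows():
    # Hypothetical (gene name, SNP frequency) rows arriving from a source.
    yield ("geneA", 0.12)
    yield ("geneB", 0.30)
    yield ("geneA", 0.25)
    yield ("geneB", 0.05)


result = pipelined_max(snp_rows())
```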
![Page 26: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/26.jpg)
Query Execution Optimization: Moving Partial Grouping Forward
Aggregation-first query plans
Conditions
  Aggregation data source AD covers a term pga
  1 to 1 relation between the entity specified by pga and the entity specified by the grouping attribute
  N to 1 relation between the entity specified by the aggregation attribute and the entity specified by pga
[Figure (b): Example for Moving Partial Group-by Forward. Labels: dbSNP, NCBI Gene, gene name, SNP IDs, chromosome, partial grouping on gene name, grouping on chromosome, aggregation on SNP frequency]
![Page 27: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/27.jpg)
Query Planning Evaluation (1)
Cost model evaluation: query plan size
![Page 28: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/28.jpg)
Query Planning Evaluation (2)
Planning Algorithm Scalability
0.03% query planning overhead
![Page 29: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/29.jpg)
Query Planning Evaluation (3)
Optimization techniques
  NO: No optimization technique used
  Merging: Only perform plan merging
  Grouping: Only perform two grouping optimizations
  M+G: Perform both merging and grouping
![Page 30: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/30.jpg)
Outline
Introduction and Motivation
System Core
  Query planning problem
  Query planning algorithms
Other system components
  Query caching
  Fault tolerance
  Schema mining
Proposed work
![Page 31: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/31.jpg)
Query Caching: Motivation
High response time for deep web queries
Motivating observations
  Data source redundancy
  Data sources return answers in an All-In-One fashion
  Users issue similar queries in one session
Query-Plan-Driven query caching method
  Cache not only previous data but also query plans
  Caching query plans increases the possibility of data reuse
![Page 32: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/32.jpg)
Query Caching: Strategy Overview
We are given a list of n previously issued queries, each of which has a query plan Pi
Given a new query q, we want to generate a query plan for q in the following way
  Define a reusability metric to identify the previous query plans that are beneficial to reuse
  Select a set of reusable previous queries and query plans
  Use a selection function to obtain the sub-query plans we would like to reuse
  Use a modified query planning algorithm to generate a query plan for the new query based on reusable plan templates
![Page 33: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/33.jpg)
Query Caching: Evaluation
Three mechanisms compared
  NC: No Caching
  DDC: Data Driven Caching
  PDC: Plan Driven Caching
![Page 34: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/34.jpg)
Fault Tolerance: Motivation
Remote data sources are vulnerable to unavailability or inaccessibility
Data redundancy across multiple data sources, partial redundancy
Use similar data sources to hide unavailable or inaccessible data sources
Data redundancy based incremental query processing
  Do not generate a new plan from scratch
  Inaccessible part is suspended
  Incrementally generate a new part to replace the inaccessible part
![Page 35: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/35.jpg)
Fault Tolerance: Strategy Overview
System Model: data redundancy graph model
  Nodes: data sources
  Edges: redundancy usage between data source pair
Given a query plan P and a set of unavailable data sources UDS, find the minimal impacted sub-plan MISubP
  Impacted sub-plan: the sub-plan of the original plan P which is rooted at unavailable data sources UDS
  Minimal impacted sub-plan: an impacted sub-plan with no usable data sources
Generate the maximal fixable sub-query of the minimal impacted sub-plan
  Maximal fixable sub-query doesn’t contain any dead attributes which are covered by the minimal impacted sub-plan
Generate a query plan for the maximal fixable sub-query as the new partial query plan
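The first step, finding the part of a plan rooted at unavailable sources, amounts to a reachability walk over the plan. The sketch below covers only that step, assuming the plan is given as a mapping from each source to the sources it feeds; it does not compute the minimal impacted sub-plan or the maximal fixable sub-query, and the node names are hypothetical.

```python
def impacted_subplan(plan, unavailable):
    """Return every node of `plan` that is an unavailable source or lies
    downstream of one. `plan` maps a source to the sources it feeds."""
    impacted = set()
    stack = [s for s in unavailable if s in plan]
    while stack:
        node = stack.pop()
        if node in impacted:
            continue
        impacted.add(node)
        stack.extend(plan.get(node, ()))
    return impacted


# Hypothetical plan: D1 feeds D2 and D3, D2 feeds D4.
plan = {"D1": ["D2", "D3"], "D2": ["D4"], "D3": [], "D4": []}
impacted = impacted_subplan(plan, {"D2"})
```

Everything outside the returned set can keep running while a replacement for the impacted part is generated.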
![Page 36: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/36.jpg)
Fault Tolerance: Evaluation
Query plan execution time
  Generate new plan from scratch
  Our incremental query processing strategy
![Page 37: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/37.jpg)
Schema Mining: Motivation
Data source metadata reveals data source coverage information
  Metadata: input and output attributes
Data sources only return a partial set of output attributes in response to a query
  Only the ones that have non-NULL values for the input
Find an approximate complete output attribute set
![Page 38: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/38.jpg)
Schema Mining: Strategy Overview
Sampling based method
  A modest sized sample could discover most deep web data source output schema
  Rejection sampling method to choose the sample
  A sample size estimator is constructed
Mixture model method
  Sample is not enough
  Output attributes could be shared among different data sources
  Data source: probabilistic data source model generates output attributes with certain probability
  Borrowability among data sources: an output attribute is generated from a mixture of different probabilistic data source models
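The sampling idea can be illustrated by taking the union of output attributes seen over a handful of inputs and measuring recall against a known schema. This is a minimal sketch with made-up responses; it uses a fixed input list rather than rejection sampling and does not include the sample size estimator or the mixture model.

```python
def mine_output_schema(query_source, sample_inputs):
    """Union the output attributes seen across a sample of input values."""
    discovered = set()
    for value in sample_inputs:
        discovered |= set(query_source(value))
    return discovered


def recall(discovered, full_schema):
    """Fraction of the full output schema that the sample discovered."""
    return len(discovered & set(full_schema)) / len(full_schema)


# Hypothetical source: each input returns only its non-NULL attributes.
RESPONSES = {
    "rs1": {"snp_id", "frequency"},
    "rs2": {"snp_id", "heterozygosity"},
    "rs3": {"snp_id", "frequency", "chromosome"},
}
FULL_SCHEMA = {"snp_id", "frequency", "heterozygosity", "chromosome", "allele"}

found = mine_output_schema(RESPONSES.get, ["rs1", "rs2"])
```

In this toy setup the attribute `allele` is never returned by any input, which is the kind of gap the mixture model is meant to address.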
![Page 39: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/39.jpg)
Schema Mining: Evaluation
Four methods compared
  SamplePC: Sampling + Perfect label classifier
  SampleRC: Sampling + Real label classifier
  Mixture: Mixture model method
  Mixture + Sample: SampleRC + Mixture
[Figure: Output Schema Attribute Mining Recall for dbSNP. x-axis: Number of Inputs Used (0 to 20); y-axis: Recall of Schema Output Attributes (0.4 to 1.0); series: SamplePC, SampleRC, Mixture, Sample+Mix]
![Page 40: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/40.jpg)
Outline
Introduction and Motivation
System Core
  Query planning problem
  Query planning algorithms
Other system components
  Query caching
  Fault tolerance
  Schema mining
Proposed work
![Page 41: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/41.jpg)
Answering Relationship Search over Deep Web Data Sources: Motivation and Formulation
Knowledge is only useful when it is related
Linked web data
Deep web data sources are ideal sources for linked data
  Supported by backend relational databases
  Data on output pages are related
  Deep web data sources are correlated through input and output relations
  Deep web data source output pages are hyperlinked with output pages from other data sources
Problem Formulation
  A relationship query RQ={ke1,ke2}
  Find the terms that relate ke1 with ke2
![Page 42: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/42.jpg)
Relationship Query: Proposed Method 1
Use correlation among data sources
  Q={MSMB, RET}
  Find the relation between these two genes
Connect the data sources taking two genes as input
Connect the data source taking one gene as input and another data source taking the other gene as output
A modified query planning algorithm introduced in the current work
![Page 43: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/43.jpg)
Relationship Query: Proposed Method 2
Use hyperlinks among different output pages to build relation
Two-level source-object graph model
  Sampled output pages
  Extract objects (entities) represented as (data source, object name) pair
  Extract hyperlinks on output pages, pointing from one object to another object in different output pages
  Data source nodes and object nodes
  Data source virtual link edges connect correlated data sources
  Hyperlink edges connect hyperlinked object nodes or connect a data source node with its corresponding object nodes
  Edges are weighted
![Page 44: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/44.jpg)
Relationship Query: Graph Model
[Figure labels: data source node, object node, data source virtual link edge, hyperlink edge, edge weight]
![Page 45: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/45.jpg)
Relationship Query: Method 2 Algorithm
Identify two nodes in the graph as path ends
Path weight: multiplication of edge weights
Shortest N paths: NP-hard problem
[Figure label: Shortest Paths]
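The multiplicative path weight can be illustrated on a toy graph by enumerating simple paths and ranking them by the product of their edge weights. This is a brute-force sketch for small graphs only, not the proposed algorithm. The graph, its weights, and the choice to treat a smaller product as "shorter" are assumptions, since the slides do not define the edge weights.

```python
def ranked_paths(graph, source, target, n):
    """Enumerate simple paths from source to target and return the n
    with the smallest weight, where weight = product of edge weights."""
    found = []

    def walk(node, path, weight):
        if node == target:
            found.append((weight, path))
            return
        for neighbor, edge_weight in graph.get(node, {}).items():
            if neighbor not in path:          # keep the path simple
                walk(neighbor, path + [neighbor], weight * edge_weight)

    walk(source, [source], 1.0)
    found.sort(key=lambda item: item[0])
    return found[:n]


# Hypothetical weighted graph over data source and object nodes.
graph = {
    "MSMB": {"s1": 0.5, "s2": 2.0},
    "s1": {"RET": 0.5},
    "s2": {"RET": 2.0},
}
top = ranked_paths(graph, "MSMB", "RET", 1)
```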
![Page 46: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/46.jpg)
Quality-Aware Data Source Selection based on Functional Dependency Analysis: Motivation
Current data source selection method
  Coverage, overlap, relevance
Quality-aware data source selection
  Data richness
  Both sources A and B provide information on genes and their encoded proteins
  A only considers one encoding schema, but B considers two
  B is better than A, but how can this be detected?
Which one is better?
Can we find the information we need?
![Page 47: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/47.jpg)
Quality-Aware Data Source Selection: Proposed Method (1)
Functional dependency
  A functional dependency $X \rightarrow Y$: any two tuples $t_1$ and $t_2$ that have $t_1[X] = t_2[X]$ must have $t_1[Y] = t_2[Y]$
  The previous example: data source A, data source B
Extract functional dependencies
  Sampling: data tuples from deep web data sources
  Discover functional dependencies
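Checking whether a candidate functional dependency holds on sampled tuples is a single pass with a dictionary. The sketch below is a generic check, assuming the sample is a list of dicts; the attribute names and values are hypothetical and do not reproduce the data source A and B example.

```python
def holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds on `rows`,
    a list of dicts sampled from a data source."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False      # same lhs values, different rhs values
    return True


# Hypothetical sampled tuples, for illustration only.
sample = [
    {"gene": "g1", "schema": "s1", "protein": "p1"},
    {"gene": "g1", "schema": "s2", "protein": "p2"},
    {"gene": "g2", "schema": "s1", "protein": "p3"},
]
```

On this sample, `holds(sample, ["gene"], ["protein"])` is false while `holds(sample, ["gene", "schema"], ["protein"])` is true. A dependency that holds on a sample may still fail on the full source.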
![Page 48: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/48.jpg)
Quality-Aware Data Source Selection: Proposed Method (2)
A set of data sources
Each has a set of functional dependencies
Functional dependency lattice
  An attribute set
  Data source has functional dependency set on
![Page 49: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/49.jpg)
Optimized Query Answering over Deep Web Data Sources: Motivation and Formulation
Current technique: minimize the number of data sources with benefit function
A more interesting aspect: minimize the total query plan execution time
Optimization problem 1: single query
  Minimize response time (ERT), maximize plan quality (RS)
  Maximize the plan gain per execution unit
Optimization problem 2: multiple queries
  Minimize total response time for multiple queries
  Scheduling problem, don’t assume similarity among queries
![Page 50: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/50.jpg)
Optimized Query Answering: Proposed Methods
Optimization for single query
  Tabu search framework to find the optimal plan
Optimization for multiple queries
  Query as a job with a list of tasks
  Data sources as machines
  Dependencies among tasks, each task can be performed on a set of machines
  Data source response time as machine working time
  Job scheduling problem
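The mapping of queries to jobs and data sources to machines can be sketched with a simple greedy list scheduler. This is an illustrative heuristic chosen here, not the method proposed in the slides; it assumes each query's tasks run strictly in order, and the source names and response times are hypothetical.

```python
def greedy_schedule(jobs, response_time):
    """Greedy list scheduling. `jobs` maps a query to its ordered tasks;
    each task is a list of data sources able to perform it.
    `response_time` maps a data source to its processing time."""
    free_at = {source: 0.0 for source in response_time}
    finish = {}
    for query, tasks in jobs.items():
        ready = 0.0                      # when the previous task of this query ends
        for candidates in tasks:
            # Pick the candidate source that would finish this task first.
            source = min(candidates,
                         key=lambda s: max(ready, free_at[s]) + response_time[s])
            ready = max(ready, free_at[source]) + response_time[source]
            free_at[source] = ready
        finish[query] = ready
    return finish


# Hypothetical sources and queries, for illustration only.
response_time = {"A": 2.0, "B": 3.0, "C": 1.0}
jobs = {"q1": [["A", "B"], ["C"]], "q2": [["A", "B"], ["C"]]}
finish = greedy_schedule(jobs, response_time)
```

In this run the second query is routed to the slower source B because A is still busy with the first query.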
![Page 51: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources](https://reader035.vdocuments.us/reader035/viewer/2022062719/56813050550346895d95feae/html5/thumbnails/51.jpg)
Conclusion
SEEDEEP: A System for Exploring and quErying DEEP web data sources
Query Planning
  Three query planning algorithms
  Query planning and execution optimization techniques
Other components
  Query caching: query-plan-driven
  Fault tolerance: redundancy based incremental query processing
  Schema Mining: sampling and mixture model approach
Proposed work
  New query types: relationship query
  Data source selection: quality-aware
  New optimization problems: single query and multi-queries