![Page 1: Wangfa@cse.ohio-state.edu Query Planning for Searching Inter- Dependent Deep-Web Databases Fan Wang 1, Gagan Agrawal 1, Ruoming Jin 2 1 Department of Computer](https://reader036.vdocuments.us/reader036/viewer/2022062804/56649dcf5503460f94ac3084/html5/thumbnails/1.jpg)
Query Planning for Searching Inter-Dependent Deep-Web Databases
Fan Wang¹, Gagan Agrawal¹, Ruoming Jin²
¹ Department of Computer Science and Engineering, Ohio State University, Columbus, OH 43210
² Department of Computer Science, Kent State University, Kent, OH 44242
Introduction
• The emergence of the deep web
– The deep web is huge
– Different from the surface web
– Challenges for integration
• Not accessible through search engines
• Inter-dependences among deep web sources
Motivation Example
Given a gene ERCC6, we want to know the amino acid occurring in the corresponding position in the orthologous gene of non-human mammals.
[Figure: starting from ERCC6, the query draws on dbSNP, Entrez Gene, a Sequence Database, and an Alignment Database; intermediate data labels: AA Positions for Nonsynonymous SNP, Encoded Protein, Encoded Orthologous Protein, Protein Sequence]
Observations
• Inter-dependences between sources
• Time consuming if done manually
• Intelligent order of querying
• Implicit sub-goals in user query
Contributions
• Formulate the query planning problem for deep web databases with dependences
• Propose a dynamic query planner
• Develop cost models and an approximate planning algorithm
• Integrate the algorithm with a deep web mining tool
Roadmap
• Introduction and Motivation
• Problem Formulation
• Planning Algorithm
• Evaluation
• Related Work
• Conclusion
Formulation
• Universal Term Set $T = \{t_1, t_2, \ldots, t_n\}$
• Query Q is composed of two parts
– Query Key Term: focus of the query (ERCC6)
– Query Target Terms: attributes of interest (Alignment)
• Data sources
– Each data source D covers an output set
– Each data source D requires an input set
• Find a query plan, an ordered list of data sources
– Covers the query target terms with maximal benefit
– As short as possible
– NP-Complete problem
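The formulation can be made concrete with a small data model. The sketch below is illustrative only: the `DataSource` class, the `plan_covers` helper, and the three toy sources are hypothetical names, and it assumes a source can be queried only once every term in its input set is known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    inputs: frozenset   # terms required before the source can be queried
    outputs: frozenset  # terms the source can return


def plan_covers(plan, key_term, target_terms):
    """Return True if the ordered plan is executable and covers all targets."""
    known = {key_term}
    for source in plan:
        if not source.inputs <= known:
            return False  # some input term is not available yet
        known |= source.outputs
    return set(target_terms) <= known


# Hypothetical sources, loosely modeled on the motivating example.
gene_db = DataSource("gene_db", frozenset({"gene"}), frozenset({"protein"}))
seq_db = DataSource("seq_db", frozenset({"protein"}), frozenset({"sequence"}))
align_db = DataSource("align_db", frozenset({"sequence"}), frozenset({"alignment"}))

good_order = plan_covers([gene_db, seq_db, align_db], "gene", {"alignment"})
bad_order = plan_covers([align_db, gene_db, seq_db], "gene", {"alignment"})
```

The same three sources give a valid plan in one order and an invalid plan in another, which is why the plan is an ordered list rather than a set.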
Problem Scenario
[Figure: Initial Knowledge (Key Term) and Target Data (Target Terms)]
Production System
• Working Memory
• Target Space
• Production Rules
• Recognize-Act Control
[Figure: rules R1, R2, R3 applied over Database 1, Database 2, Database 3; labels: Key Term, Working Memory, Intermediate Result, Final Result, Target Terms]
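A recognize-act control loop over working memory can be sketched as follows. This is a minimal illustration, not the planner from the talk: the rule encoding (name mapped to an input set and an output set), the `recognize_act` function, and the `new_targets` scoring function are all assumptions made for the example.

```python
def recognize_act(rules, key_term, targets, score):
    """Fire one rule per iteration until the targets are covered.

    rules: dict mapping rule name -> (input terms, output terms).
    score: function(outputs, remaining_targets) -> number, higher is better.
    Returns (ordered list of fired rule names, targets left uncovered).
    """
    working_memory = {key_term}
    remaining = set(targets)
    unfired = {name: (set(i), set(o)) for name, (i, o) in rules.items()}
    plan = []
    while remaining:
        # Recognize: rules whose inputs are all present in working memory.
        candidates = [n for n, (ins, _) in unfired.items() if ins <= working_memory]
        if not candidates:
            break  # no rule can fire; some targets stay uncovered
        # Act: fire the highest-scoring candidate and update working memory.
        best = max(candidates, key=lambda n: score(unfired[n][1], remaining))
        _, outputs = unfired.pop(best)
        working_memory |= outputs
        remaining -= outputs
        plan.append(best)
    return plan, remaining


def new_targets(outputs, remaining):
    # Placeholder score: number of still-missing target terms a rule returns.
    return len(outputs & remaining)


rules = {
    "R1": ({"key"}, {"a"}),
    "R2": ({"a"}, {"b"}),
    "R3": ({"b"}, {"target"}),
}
plan, missing = recognize_act(rules, "key", {"target"}, new_targets)
```

In this toy chain each rule becomes fireable only after the previous one has added its intermediate result to working memory.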
Dependency Graph
• Dependency relation
– Format: $\{D_i, D_{i+1}, \ldots, D_{i+m}\} \xrightarrow{DR} D_j$
– Hypergraph
• Hyperarc: ordered pair (parents, child)
• AND node
• Neighbors
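One possible encoding of such a hypergraph is a list of hyperarcs, each an ordered pair (parents, child). The sketch below assumes AND semantics, meaning a child database becomes reachable only when all of its parents are reachable; the `reachable` helper and the database names are hypothetical.

```python
def reachable(hyperarcs, start):
    """Return all databases reachable from `start` under AND semantics."""
    reached = set(start)
    changed = True
    while changed:
        changed = False
        for parents, child in hyperarcs:
            # A hyperarc fires only when every parent has been reached.
            if child not in reached and parents <= reached:
                reached.add(child)
                changed = True
    return reached


hyperarcs = [
    (frozenset({"D1"}), "D2"),
    (frozenset({"D2", "D3"}), "D4"),  # AND node: needs both D2 and D3
]
from_d1 = reachable(hyperarcs, {"D1"})
from_d1_d3 = reachable(hyperarcs, {"D1", "D3"})
```

Starting from D1 alone, D4 stays out of reach because its hyperarc also requires D3.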
Concepts
• Database Necessity (DN)
– Each term is associated with a DN value
– Measures the extraction priority of a term and the importance of a database scheme
– For term t, if k database schemes can provide it, the DN value is $1/k$
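The DN definition translates directly into a few lines of code. The sketch below is illustrative: the function name `database_necessity` and the toy sources are assumptions, and exact fractions are used so that DN = 1 can be tested without rounding.

```python
from fractions import Fraction


def database_necessity(sources):
    """sources: dict mapping database name -> set of output terms.

    Returns a dict mapping each term to its DN value 1/k, where k is the
    number of databases that can provide the term.
    """
    providers = {}
    for name, outputs in sources.items():
        for term in outputs:
            providers.setdefault(term, set()).add(name)
    return {term: Fraction(1, len(names)) for term, names in providers.items()}


dn = database_necessity({"D1": {"t1", "t2"}, "D2": {"t2"}})
```

Here `t1` has DN = 1 because only D1 provides it, so D1 is necessary for any plan that needs `t1`, while `t2` has DN = 1/2.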
Concepts
• Hidden Nodes
– Nodes connecting current working state and the target space
• Partially Visualize Hidden Nodes
– Multiple layers of hidden nodes bring difficulty
Visualize Hidden Nodes
• Target Space Enlargement
1. Find a target term t with DN=1
2. Visualize the database D which provides t
3. Add D’s input set to target space
4. Repeat above steps till done
[Figure: dependency graph over databases D1 to D8; the target space grows {t1} → {t1,t2,t3} → {t1,t2,t3,t4} → {t1,t2,t3,t4,t5}]
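The four enlargement steps can be sketched as a fixed-point loop. This is a hedged illustration: `enlarge_target_space` and the database topology below are hypothetical and were chosen only so that the target space grows through the same sequence of sets as on the slide; "DN=1" is implemented as "exactly one database provides the term".

```python
def enlarge_target_space(sources, targets):
    """sources: dict mapping database name -> (input terms, output terms).

    Returns (enlarged target space, databases visualized in order).
    """
    target_space = set(targets)
    visualized = []
    changed = True
    while changed:
        changed = False
        for term in sorted(target_space):
            # Step 1: DN = 1 means exactly one database provides the term.
            providers = [n for n, (_, outs) in sources.items() if term in outs]
            if len(providers) == 1 and providers[0] not in visualized:
                db = providers[0]
                visualized.append(db)                # Step 2: visualize D
                target_space |= set(sources[db][0])  # Step 3: add D's inputs
                changed = True                       # Step 4: repeat
    return target_space, visualized


sources = {
    "D8": ({"t2", "t3"}, {"t1"}),
    "D6": ({"t4"}, {"t2"}),
    "D5": ({"t4"}, {"t3"}),
    "D3": ({"t5"}, {"t4"}),
    "D1": (set(), {"t5"}),
    "D2": (set(), {"t5"}),
}
space, order = enlarge_target_space(sources, {"t1"})
```

The loop stops at `t5` because two databases provide it, so neither is necessary and no further hidden node can be revealed with certainty.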
Benefit Model
• Select an appropriate rule at each iteration of the planning algorithm
• Four metrics
– Database Availability
– Data Coverage (DC)
– User Preference (UP)
– Potential Importance (PI)
Data Coverage
• The number of query target terms covered by the current rule that have not yet been covered by previously selected rules
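As a small worked example, data coverage is a set difference followed by a count. The function name and the toy term sets are assumptions for illustration.

```python
def data_coverage(rule_outputs, target_terms, already_covered):
    """Count target terms the rule returns that no earlier rule has covered."""
    return len((set(rule_outputs) & set(target_terms)) - set(already_covered))


# The rule returns a, b, c; the targets are a, b, d; a is already covered.
dc = data_coverage({"a", "b", "c"}, {"a", "b", "d"}, {"a"})
```

Only `b` counts here: `a` was covered earlier and `c` is not a target term.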
User Preference
• Domain users prefer certain databases (rules) for a particular term
• A collaborating biologist provides the preference values
• Term $t$ provided by $r$ databases: $0 \le UP_t^i \le 1$, $\sum_{i=1}^{r} UP_t^i = 1$
• Rule $R$ covers the following unfound target terms
– Preference for $R$ is
Potential Importance
• Some databases are more important because they link to other important databases (e.g.)
• A database $D$ is more important
– Find the necessary databases which provide unfound target terms
– More such necessary databases can be reached from $D$
• The potential importance for a rule
Experiment Setup
• SNPMiner System
– Integrates 8 deep web databases
– Provides a unified user interface
• Experimental Queries
Planning Algorithm Comparison
• Naïve Algorithm (NA)
– Select all rules which can be fired at each iteration until all requested terms are covered
– No rule selection strategy used
• Optimal Algorithm (OA)
– Search the entire space
• Production Rule Algorithm (PRA)
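The naïve baseline is simple enough to sketch directly from its description. The code below is an illustration under assumptions: rules are encoded as name mapped to (input terms, output terms), and `naive_plan` and the toy rules are hypothetical names.

```python
def naive_plan(rules, key_term, targets):
    """Fire every fireable rule at each iteration until targets are covered."""
    known = {key_term}
    unfired = dict(rules)
    plan = []
    while not set(targets) <= known:
        fireable = [n for n, (ins, _) in unfired.items() if set(ins) <= known]
        if not fireable:
            break  # remaining targets cannot be reached
        for name in fireable:  # no selection strategy: fire them all
            _, outputs = unfired.pop(name)
            known |= set(outputs)
            plan.append(name)
    return plan


rules = {
    "R1": ({"k"}, {"a"}),
    "R2": ({"k"}, {"x"}),
    "R3": ({"a"}, {"t"}),
    "R4": ({"x"}, {"y"}),
}
plan = naive_plan(rules, "k", {"t"})
```

On this toy input the naïve plan queries all four rules even though R1 followed by R3 would already reach the target, which is the kind of overhead a rule selection strategy aims to avoid.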
PRA vs. NA
Query Plan Execution Time Comparison
[Figure: ETRatio (Production/Naive) per query; x-axis: Queries (0 to 30), y-axis: Ratio (0 to 1.2)]
1. All ratio data points smaller than 1
2. PRA generates much faster query plans than NA
PRA vs. OA
Query Plan Execution Time Comparison
[Figure: ETRatio (Production/Optimal) per query; x-axis: Queries (0 to 30), y-axis: Ratio (0 to 1.6)]
1. All ratio data points distributed around 1
2. In terms of query plan execution time, PRA has performance close to OA
3. In most cases, PRA generates exactly the same plan as OA
Enlarge Target Space
Execution Time Comparison
[Figure: execution time per query, With Enhancement vs. Without Enhancement; x-axis: Queries (1 to 8), y-axis: Time (s) (0 to 500)]
1. Query plans generated with enlargement run faster
2. Query plans generated with enlargement are shorter
Related Work
• Query Planning
– Navigational based query planning
– SQL based query planning
– Bucket Algorithm
• Deep Web Mining
– Database selection
– E-commerce oriented, no dependency
• Keyword Search on Relational Databases
• Select-Project-Join Query Optimization
Conclusion
• Formulate and solve the query planning problem for deep web databases with dependencies
• Develop a dynamic planning algorithm with an approximation ratio of ½
• Our benefit model is effective
• Our algorithm outperforms the naïve algorithm, and obtains optimal results for most cases