crawling deep web content through query forms
Crawling Deep Web Content Through Query Forms. Jun Liu, Zhaohui Wu, Lu Jiang, Qinghua Zheng and Xiao Liu. Speaker: Lu Jiang, Xi’an Jiaotong University, P.R. China. Outline: Background, Related work, Minimum Executable Pattern, Adaptive Crawling Algorithm, Experimental results, Conclusions.
![Page 1: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/1.jpg)
Crawling Deep Web Content Through Query Forms
Jun Liu, Zhaohui Wu, Lu Jiang, Qinghua Zheng and Xiao Liu
Speaker: Lu Jiang
Xi’an Jiaotong University
P.R. China
![Page 2: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/2.jpg)
Outline
Background
Related work
Minimum Executable Pattern
Adaptive Crawling Algorithm
Experimental results
Conclusions
![Page 4: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/4.jpg)
What is the Deep Web?
The Deep Web (or Hidden Web) refers to World Wide Web content that is not part of the surface Web, i.e. the part that is directly indexed by search engines.
![Page 5: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/5.jpg)
Why the Deep Web
Data retrieval in the Deep Web [Michael K. Bergman, 2001]
Organizes high-quality content
Significant piece of the Web
![Page 6: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/6.jpg)
What is the problem?
Ordinary crawlers retrieve content only from the surface Web.
Challenge: make the Deep Web accessible to web search.
A practical solution: Deep Web crawling
![Page 8: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/8.jpg)
Related Work
Prior-knowledge-based query methods generate queries under the guidance of prior knowledge. E.g. Hidden Web Exposer [Raghavan, 2001]
Non-prior-knowledge methods generate new queries by analyzing the data records returned from previous queries. E.g. Deep Web crawler [Ntoulas, 2005]
![Page 10: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/10.jpg)
The idea of the MEP
Previous work is based on either the generic text box or the entire query form.
For the generic text box: the queries are simple and their harvest rate (the capability of obtaining new records) is relatively low.
For the entire form: filling out the whole form is excessively error-prone.
A pattern of proper granularity is required.
![Page 11: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/11.jpg)
What is the MEP
Query Form. A query form F is a query interface of the Deep Web, defined as the set of all elements in it: $F = \{e_1, \ldots, e_n\}$, where $e_i$ is an element of F such as a checkbox, text box or radio button.
Executable Pattern (EP). $\{e_1, \ldots, e_m\}$ is an executable pattern if the Deep Web database returns the corresponding results after a query with value assignments for the elements in it is issued.
Minimum Executable Pattern (MEP). Given that $\{e_1, \ldots, e_m\}$ is an executable pattern, it is an MEP iff no proper subset of it is an executable pattern.
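The minimality condition can be illustrated with a brute-force sketch. It assumes a hypothetical `is_executable` predicate that stands in for submitting a pattern to the database and checking that results come back, and the element names in the toy form are made up. It is not the MEP generation algorithm from the talk, only a direct reading of the definition.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Pattern = FrozenSet[str]


def minimum_executable_patterns(
    form: Iterable[str],
    is_executable: Callable[[Pattern], bool],
) -> List[Pattern]:
    """Return every executable pattern that has no executable proper subset."""
    elements = sorted(set(form))
    meps: List[Pattern] = []
    # Visit candidates from smallest to largest, so that every smaller MEP
    # is already known when a larger candidate is examined.
    for size in range(1, len(elements) + 1):
        for combo in combinations(elements, size):
            candidate = frozenset(combo)
            if not is_executable(candidate):
                continue
            # An executable candidate is minimal iff it does not strictly
            # contain an MEP found earlier.
            if not any(mep < candidate for mep in meps):
                meps.append(candidate)
    return meps


# Hypothetical form and toy executability rule (assumptions for illustration):
# the database answers if "title" is filled, or if both "year" and "category" are.
def toy_is_executable(pattern: Pattern) -> bool:
    return "title" in pattern or {"year", "category"} <= pattern


toy_form = ["title", "author", "year", "category"]
toy_meps = minimum_executable_patterns(toy_form, toy_is_executable)
```

In a real crawler each `is_executable` call costs one or more form submissions, so the exhaustive enumeration above is only practical for small forms.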
![Page 12: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/12.jpg)
MEP Classification
There are two types of MEP.
If the MEP contains an infinite domain element (e.g. a text box), it is called an infinite domain MEP (IMEP).
If all of its elements are finite domain elements (e.g. radio buttons, check boxes), it is called a finite domain MEP (FMEP).
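The IMEP/FMEP distinction reduces to a single check over the element types. The sketch below assumes a hypothetical mapping from element names to HTML-like type strings, and assumes that only text boxes have an infinite domain.

```python
from typing import Iterable, Mapping

# Assumption: text boxes are the only infinite domain elements.
INFINITE_DOMAIN_TYPES = {"text"}


def classify_mep(mep: Iterable[str], element_types: Mapping[str, str]) -> str:
    """Return "IMEP" if any element has an infinite domain, otherwise "FMEP"."""
    if any(element_types[name] in INFINITE_DOMAIN_TYPES for name in mep):
        return "IMEP"
    return "FMEP"


# Hypothetical form elements, for illustration only.
element_types = {
    "keywords": "text",
    "language": "radio",
    "fulltext_only": "checkbox",
}
kind_a = classify_mep(["keywords", "language"], element_types)
kind_b = classify_mep(["language", "fulltext_only"], element_types)
```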
![Page 13: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/13.jpg)
What is the MEP
5 IMEPs
6 FMEPs
1 IMEP
![Page 15: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/15.jpg)
What is a Query
The i-th query to the database is issued using an MEP mep and its corresponding keyword vector kv, e.g. qi(mep(keywords), "art").
The harvest rate of a query is its ability to obtain new records.
![Page 16: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/16.jpg)
Overall Algorithm
Prepare to submit the query qi.
Is i < s?
Data Accumulation Phase (i < s):
Load a set of the most promising values from the LVS to the corresponding labels.
Use the Probabilistic Ranking Function to pick the keyword vector kv.
Does kv match any mepj in Smep?
Prediction Phase (otherwise):
Predict the pattern harvest rate of each mepj in Smep.
Estimate the keyword harvest rate of all possible (kv, mep) pairs already known.
Pick the (kv, mepj) pair that has the maximum value of Efficient.
Return kv and mepj of qi.
![Page 17: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/17.jpg)
How does a Crawler Work
[Figure: crawler architecture in two stages (Stage I, Stage II). Components: Form Analysis, Query Selector, Submitter, Wrapper, Predictor, Deep Web Database. Labels: Form, MEP Set, Submit queries, Response results, Extract records, Prediction information, Next query.]
Example query: q_art = (mep(keywords), "art").
The query obtained x new records while accessing y records.
Harvest rate = x/y.
The harvest rate and the extracted records are used to evaluate query candidates.
Iteration goes on until the stop condition is satisfied.
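The harvest rate x/y of a single query can be computed by comparing the returned records with those already downloaded. The sketch below assumes records are identified by hypothetical string IDs. The function and variable names are illustrative, not taken from the talk.

```python
from typing import Iterable, Set


def harvest_rate(returned_ids: Iterable[str], seen_ids: Set[str]) -> float:
    """Return x / y for one query and record the new IDs in seen_ids.

    y is the number of records accessed by the query and x is the number
    of those records that had not been downloaded before.
    """
    returned = list(returned_ids)
    if not returned:
        return 0.0
    new_ids = set(returned) - seen_ids
    seen_ids |= new_ids  # update the crawler's local record store
    return len(new_ids) / len(returned)


# Hypothetical data: two records are already known, the query returns five.
seen = {"r1", "r2"}
rate = harvest_rate(["r1", "r2", "r3", "r4", "r5"], seen)
```

Here three of the five accessed records are new, so the rate is 3/5. Issuing the same query again would give a rate of 0, since `seen` now contains all five records.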
![Page 19: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/19.jpg)
Pattern Harvest Rate
The pattern harvest rate of an mep depends on the pattern itself rather than on the choice of keyword vectors. E.g. MEP(Keywords) and MEP(Abstract)
Two approaches to predict the value:
Continuous prediction
Weighted prediction
![Page 20: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/20.jpg)
Keyword Vector Harvest Rate
The keyword vector harvest rate represents the conditional harvest rate of kv among all candidate keyword vectors of the given mep.
E.g. given MEP(keywords), find out which kv will bring the most new records.
The estimation of the kv harvest rate consists of two parts:
Calculate how many records containing kv have been downloaded (SampleDF): Sampling
Estimate how many records containing kv reside in the Deep Web (Keyword Capability): Zipf's law
Keyword Vector Harvest Rate = Keyword Capability - SampleDF
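As a small illustration of the subtraction, the sketch below ranks candidate keyword vectors by Keyword Capability minus SampleDF. The candidate keywords and all numbers are made up for the example. How the two inputs are estimated is outside its scope.

```python
from typing import Dict, Tuple


def keyword_vector_harvest_rate(keyword_capability: float, sample_df: float) -> float:
    """Estimated number of still-unseen records containing the keyword vector."""
    return keyword_capability - sample_df


def best_keyword_vector(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Pick the candidate with the largest (capability - SampleDF) value.

    candidates maps a keyword vector label to (keyword_capability, sample_df).
    """
    return max(
        candidates,
        key=lambda kv: keyword_vector_harvest_rate(*candidates[kv]),
    )


# Hypothetical estimates for MEP(keywords): (keyword capability, SampleDF).
candidates = {
    "art": (120.0, 100.0),   # popular, but mostly downloaded already
    "music": (80.0, 10.0),   # fewer matches, almost all still unseen
    "film": (40.0, 0.0),
}
chosen = best_keyword_vector(candidates)
```

In this toy data "art" matches the most records overall, yet "music" is chosen because far more of its matching records are still missing from the local sample.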
![Page 21: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/21.jpg)
Convergence Analysis
When should crawling of a Deep Web database terminate, especially when the size of the target database is unknown?
S is the number of records in the Deep Web database
a_k is the cumulative number of new records obtained before the k-th query, so a_k/S is the coverage
m_k is the number of records returned by the k-th query
$$a_{k+1} = a_k + \frac{S - a_k}{S}\, m_k$$
If we assume m_k is constant (m_k = m) and a_1 = 0, we have:
$$\frac{a_{k+1}}{S} = 1 - \left(1 - \frac{m}{S}\right)^{k}$$
Crawler Bottleneck!
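The convergence behaviour can be checked numerically. The sketch below iterates the recurrence a_{k+1} = a_k + (S - a_k)/S * m from a_1 = 0 and compares it with the closed form a_{k+1}/S = 1 - (1 - m/S)^k. It assumes a_k and m are record counts, that every query returns the same number m of records, and that returned records are spread uniformly over the database. The values of S and m are arbitrary.

```python
import math
from typing import List


def simulate_coverage(S: float, m: float, queries: int) -> List[float]:
    """Iterate a_{k+1} = a_k + (S - a_k) / S * m starting from a_1 = 0.

    Returns the coverage a_{k+1} / S after k = 1, ..., queries queries.
    """
    a = 0.0
    coverage = []
    for _ in range(queries):
        a = a + (S - a) / S * m
        coverage.append(a / S)
    return coverage


def closed_form_coverage(S: float, m: float, k: int) -> float:
    """Coverage after k queries: 1 - (1 - m / S) ** k."""
    return 1.0 - (1.0 - m / S) ** k


def queries_for_coverage(S: float, m: float, target: float) -> int:
    """Smallest k whose closed-form coverage reaches the target (0 < target < 1)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - m / S))


# Arbitrary illustrative values: 10,000 records, 100 records per query.
S, m = 10_000, 100
simulated = simulate_coverage(S, m, 500)
k90 = queries_for_coverage(S, m, 0.90)
k99 = queries_for_coverage(S, m, 0.99)
```

Under these assumptions the gap to full coverage shrinks by the same factor (1 - m/S) with every query, so going from 90% to 99% coverage costs about as many queries as reaching 90% in the first place. This is one way to read the bottleneck remark on the slide.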
![Page 23: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/23.jpg)
Effectiveness
| URL | Size | Harvest | No. of Queries |
| --- | --- | --- | --- |
| http://www.jos.org.cn | 1,380 | 1,380 | 143 |
| http://cjc.ict.ac.cn | 2,523 | 2,523 | 13 |
| http://www.jdxb.cn | 424 | 424 | 16 |
| http://www.paperopen.com | 743,444 | 730,000 | 399 |
| http://vod.xjtu.edu.cn | 700 | 700 | 311 |
| http://music.xjtu.edu.cn | 154,000 | 146,967 | 386 |
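Two derived figures make the table easier to compare across sites: coverage (Harvest divided by Size) and records harvested per query. Neither column appears on the slide. The sketch below only does the arithmetic on the reported numbers.

```python
# (URL, database size, harvested records, number of queries) as reported above.
results = [
    ("http://www.jos.org.cn", 1_380, 1_380, 143),
    ("http://cjc.ict.ac.cn", 2_523, 2_523, 13),
    ("http://www.jdxb.cn", 424, 424, 16),
    ("http://www.paperopen.com", 743_444, 730_000, 399),
    ("http://vod.xjtu.edu.cn", 700, 700, 311),
    ("http://music.xjtu.edu.cn", 154_000, 146_967, 386),
]


def summarize(rows):
    """Return {url: {"coverage": ..., "records_per_query": ...}} for each row."""
    summary = {}
    for url, size, harvest, queries in rows:
        summary[url] = {
            "coverage": harvest / size,
            "records_per_query": harvest / queries,
        }
    return summary


summary = summarize(results)
for url, stats in summary.items():
    print(
        f"{url:28s} coverage={stats['coverage']:.3f} "
        f"records/query={stats['records_per_query']:.1f}"
    )
```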
![Page 24: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/24.jpg)
Comparison with state-of-the-art method
[Figure: coverage of Deep Web database (0 to 0.9) vs. query number (1 to 442); series: MEP, IDE]
[Figure: coverage of Deep Web database (0 to 1) vs. query number (1 to 289); series: MEP, IDE1, IDE2, IDE3]
We believe the MEP method using multiple MEPs outperforms the same method restricted to any single one of those MEPs.
![Page 26: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/26.jpg)
Conclusion
The novel concept of MEP provides a foundation to study Deep Web crawling through query forms.
The adaptive crawling method and its related prediction algorithm offer an efficient way to crawl Deep Web content through query forms.
![Page 27: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/27.jpg)
Thank You!
![Page 28: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/28.jpg)
Appendix
Here comes the Appendix
![Page 29: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/29.jpg)
MEP Generation Algorithm
![Page 30: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/30.jpg)
Examples of Prediction
![Page 31: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/31.jpg)
Comparison with LVS method
[Figure: coverage of Deep Web database (0 to 0.9) vs. query number (1 to 136); series: MEP, Enhanced LVS, Classical LVS]
![Page 32: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/32.jpg)
Continuous Prediction
The current harvest rate of an MEP depends entirely on the harvest rate of the latest query issued through that MEP.
Initial values: mep1 = 0.33, mep2 = 0.33, mep3 = 0.33
Issue a query via mep1 and get 200 new records while accessing 250 records.
New record rate = 200/250 = 0.8
mep1 = 0.8/(0.33+0.33+0.8) = 0.55
mep2 = 0.33/(0.33+0.33+0.8) = 0.22
mep3 = 0.33/(0.33+0.33+0.8) = 0.22
Updated values: mep1 = 0.55, mep2 = 0.22, mep3 = 0.22
Issue a query via mep1 and get 30 new records while accessing 100 records.
New record rate = 30/100 = 0.3
mep1 = 0.3/(0.22+0.22+0.3) = 0.40
mep2 = 0.22/(0.22+0.22+0.3) = 0.29
mep3 = 0.22/(0.22+0.22+0.3) = 0.29
Updated values: mep1 = 0.40, mep2 = 0.29, mep3 = 0.29
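The update in this example can be written as a short function: overwrite the queried MEP's value with its latest new-record rate, then renormalize so the values sum to 1. This is a sketch of the arithmetic shown on the slide. The function and parameter names are assumptions. Because the slide truncates intermediate values to two decimals while the code keeps full precision, the second-step results can differ in the second decimal place.

```python
from typing import Dict


def continuous_update(
    rates: Dict[str, float],
    mep: str,
    new_records: int,
    accessed_records: int,
) -> Dict[str, float]:
    """Replace the value of `mep` by its latest harvest rate and renormalize."""
    updated = dict(rates)
    updated[mep] = new_records / accessed_records
    total = sum(updated.values())
    return {name: value / total for name, value in updated.items()}


rates = {"mep1": 0.33, "mep2": 0.33, "mep3": 0.33}
step1 = continuous_update(rates, "mep1", new_records=200, accessed_records=250)
step2 = continuous_update(step1, "mep1", new_records=30, accessed_records=100)
```

Only the most recent query through an MEP influences its value, so a single poor query (30 new records out of 100) immediately lowers the share of mep1.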
![Page 33: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/33.jpg)
Weighted Prediction
The current harvest rate of an MEP depends on the harvest rates of all queries previously issued through that MEP.
![Page 34: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/34.jpg)
SampleDF Calculation
SampleDF is the document frequency of the observed keyword vector kv in the sample corpus {d1,...,ds}:
$$\mathrm{SampleDF}(kv \mid mep) = \sum_{k=1}^{s} \cos\langle kvx_k, mepx \rangle$$
where kvx_k is the Boolean vector corresponding to kv in d_k, and similarly mepx is the Boolean vector of mep.
If the i-th component of kv is contained in the document, the corresponding component of kvx is set to 1, and to 0 otherwise;
If the i-th element of mep is an infinite domain element, the corresponding component of mepx is set to 1, and to 0 otherwise.
![Page 35: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/35.jpg)
SampleDF Calculation Example
kv = (a,b), mep = (student id, exam id, subject)
Four documents D1, D2, D3 and D4:
D1 has both Student ID a and Exam ID b
D2 has only Student ID a
D3 has only Exam ID b
D4 has neither Student ID a nor Exam ID b
mepx = (1,1,0)
D1: kvx1 = (1,1,0), cos<(1,1,0),(1,1,0)> = 1
D2: kvx2 = (1,0,0), cos<(1,0,0),(1,1,0)> = 0.707
D3: kvx3 = (0,1,0), cos<(0,1,0),(1,1,0)> = 0.707
D4: kvx4 = (0,0,0), cos<(0,0,0),(1,1,0)> = 0
SampleDF((a,b) | mep) = 1 + 0.707 + 0.707 + 0 = 2.414
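The example's arithmetic can be reproduced with a plain cosine similarity summed over the sample documents. The sketch follows the slide's convention that the cosine involving an all-zero vector counts as 0. The function names are illustrative.

```python
import math
from typing import Iterable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_df(kvx_vectors: Iterable[Sequence[float]], mepx: Sequence[float]) -> float:
    """Sum of cos<kvx_k, mepx> over all sample documents."""
    return sum(cosine(kvx, mepx) for kvx in kvx_vectors)


mepx = (1, 1, 0)
kvx_vectors = [
    (1, 1, 0),  # D1: contains both a and b
    (1, 0, 0),  # D2: contains only a
    (0, 1, 0),  # D3: contains only b
    (0, 0, 0),  # D4: contains neither
]
value = sample_df(kvx_vectors, mepx)
print(round(value, 3))
```

A document that matches only part of the keyword vector still contributes a partial count (0.707 here) rather than 0 or 1.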
![Page 36: Crawling Deep Web Content Through Query Forms](https://reader036.vdocuments.us/reader036/viewer/2022062407/56812d5f550346895d926c00/html5/thumbnails/36.jpg)
Keyword Capability Estimation
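Keyword capability denotes the capability of obtaining records (this differs from the harvest rate).
|D_t| is the number of values of the t-th finite domain element in the MEP, so the product over t is the size of the Cartesian product of those values
For FMEP: f = 1
For IMEP: the Zipf-Mandelbrot law is used to estimate f
$$\text{Keyword capability} = \frac{f}{\prod_{t=1}^{n} |D_t|}$$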