seedeep: a system for exploring and querying deep web data sources phd defense the ohio state...

97
SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and Querying Deep Web Data Sources Fan Wang Advisor: Prof. Gagan Agrawal Ohio State University

Upload: eunice-ward

Post on 16-Jan-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010

SEEDEEP: A System for Exploring and Querying

Deep Web Data Sources

Fan Wang

Advisor: Prof. Gagan Agrawal

Ohio State University

Page 2: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 2

The Deep Web

• The definition of “the deep web” from Wikipedia

The deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines.

• Some Examples: Expedia, Priceline

Page 3: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 3

The Deep Web is Huge and Informative• 500 times larger than the surface

web• 7500 terabytes of information (19

terabytes in the surface web)• 550 billion documents (1 billion in

the surface web)• More than 200,000 deep web

sites• Relevant to every domain:

scientific, e-commerce, market• 95 percent of the deep web is

publicly accessible (with access limitations)

Page 4: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 4

How to Access Deep Web Data1. A user issues query through input interfaces of deep web data sources

2. Query is translated into SQL style query

3. Trigger search on backend database

4. Answers returned through network

Select priceFrom ExpediaWhere depart=CMH and arrive=SEA and dedate=“7/13/10” and redate=“7/16/10”

Page 5: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 5

Drawbacks• Constrained flexibility

– Types of queries: aggregation query, nested query, queries with grouping requirement

– User specified predicates

• Inter-dependent data sources– Users may want data from multiple correlated data

sources

• Long latency– Network transmission time– Denial of service

Page 6: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 6

Goal

Develop a deep web search tool which could support online (real time)structured, and high level queries (semi)automatically

Page 7: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 7

Challenges• Challenges for Integration

– Self-maintained and created– Heterogeneous, hidden and dynamically updated

metadata

• Challenges for Searching– Limited data access pattern– Data redundancy and data quality– Data source dependency

• Challenges for Performance– Network latency– Fault tolerance issue

Page 8: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 8

Our Contributions (1)• Support online aggregation over deep web

– Answer deep web aggregation search in a OLAP fashion

– Propose novel sampling techniques to find accurate approximate answers in a timely manner

• Support low selectivity query over deep web– Answer low selectivity query in presence of

limited data access– Propose a novel Bayesian method to find optimal

stratification for hidden selective attribute

Page 9: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 9

Our Contributions (2)• Support structured SQL queries over the

deep web– Support SPJ, aggregation, and nested queries

• Automatic hidden schema mining– A statistic framework to discover hidden metadata

from deep web data sources

• Novel query caching mechanism for query optimization

• Effective fault tolerance handling mechanism

Page 10: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 10

System OverviewD eepW eb

S chem a M in ing

Source Input Ouput Constraint

S1 A1 B1,B2 C2

S2 A1 B2,B3 C1

D1

D2

D3

D 4

D ata S ource M odel D ata S ourceD ependency M ode l

S ys tem M odels

QueryP lann ing

QueryOptim iza tion

FaultTo lerance

C o m plexS tructured Q uery

A pproxim ateQuery

A nsw ering

E xploring P art of S E E DE E P Q uerying P art of S E E DE E P

A ggregatio n/Lo wS electiv ity Q uery

Hidden schema discoveryData source integration

Structured SQL querySampling the deep webOnline aggregationLow selectivity query

Page 11: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 11

Outline• Introduction• Sampling methods for online aggregation query

– Motivation– ANS and TPS methods– Evaluation

• Stratification methods for low selectivity query– Motivation– Harmony search and Bayesian based adaptation– Evaluation

• Future work and conclusion

Page 12: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 12

Online Aggregation Motivation• Aggregation queries requiring data enumeration

I want to know the average airfare from US to Europe across all major US airline flights in the next week

Select AVG(airfare)From AirTable ATWhere AT.depart=any US city and AT.arrive=any European city

Need Enumeration!!!

Relational DatabaseDeep Web Data Source

NYC, London

Boston, Paris

LA, Rome

AA: 500UA: 550

USAir: 450Delta: 400

UA: 600AA: 650

Where do you get these names?

How long can you wait?

What if the data is updated dynamically?

Page 13: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 13

Initial Thoughts• Sampling: Approximate answers

• Simple random sampling (SRS)– Every data record has the same probability to

be selected

• Drawbacks of SRS– Bad performance on skewed data– High sampling cost to perform SRS on deep

web (Dasgupta et al, HDSampler)

Page 14: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 14

We Want To Achieve• Handle data with (probably high) skew

– Top 20 IT companies account for 80% of the sales among all top 100 IT companies in 2005

– Hidden data (hard to gather statistical information)– Has skew or not? – Unknown data distribution– Pilot sample, how much can you trust your pilot

sample?

• Lower sampling cost for sampling deep web data

Page 15: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 15

Our Contributions

• Two Sampling Algorithms– ANS (Adaptive Neighborhood Sampling): handling skewed

data, sample skew causing data easier

– TPS (Two Phase adaptive Sampling): lower sampling cost

• Performance– Accurate estimates without prior knowledge

– ANS and TPS outperform HDSampler by a factor of 4 on skewed data

– TPS has one-third of the sampling cost of HDSampler

Page 16: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 16

Background Knowledge A survey on a type of rare monkey, which only lives in a small but

dangerous area in southern China Associated Samples

Page 17: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 17

Why this is good and Can we use it?• Sample more rare but significant data records

– Good for handling with skewed data

• Associated samples have relatively low sampling cost– Cheaper than SRS with the same sample size

• Yes, we can use it! ~~ with modification– Much real world data has skew (IT company, household income)– Rare data often form clusters– Deep web data sources often return multiple records w.r.t. one

input sample

Page 18: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 18

Drawbacks

• Performance depends on the initial sample• Initial sample is simple random sample• No cost limit explicitly considered

– What is the size of the initial sample?– How many associated samples should be added?

Page 19: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 19

• Select a random sample– Stop random sampling if any of the two termination rules applies

• We have sampled k number of units of interest• We have reached the cost limit

– Take the sampled data record, add it to our sample

– If this data record is a unit of interest• Obtain its neighbors (neighborhood sampling)• For each data records obtained from neighborhood sampling

– Add it to our sample

– Perform recursive neighborhood sampling if necessary

– If neighborhoods are too large• Increase unit of interest threshold

– If neighborhoods are too small• Decrease unit of interest threshold

The ANS Sampling Algorithm

Aggressively sample skew causing data

Control sampling cost

Page 20: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 20

ANS Example

• Estimate the total sale of IT companies in 2005

• Each point represents a company’s sale record

• Color shows the scale of the sale value, the darker, the higher

• Neighborhood of data records is defined according to some rules

Page 21: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 21

1. Select initial random samples sequentially until we have k units of interest (k=3)

Unit of interest: sales larger than a threshold

2. Explore the neighborhood recursively for all units of interest until the total number of samples reach a limit

3. If too many neighbors are Included, we increase the threshold of unit of interest

Page 22: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 22

Estimators and Analysis for ANS• Estimator for AVG, fixed unit of interest threshold

Lemma 1: The above estimator is a biased estimator, when k is small, the bias is very small

• We also proposed estimator for variable unit of interest threshold using post-stratification

is the estimated average value from the hth stratum (all samples corresponding to one specific unit of interest threshold value)

β is the percentage of units ofInterest w.r.t. entire data setβ=(k-1)/(n-1)

Page 23: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 23

Drawbacks of ANS

• Initial samples are simple random samples

• SRS: one input search only gets one sample from the output page

• High cost

Page 24: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 24

The TPS Sampling Algorithm• Partition data set D into M sub-spaces

– According to combinations of input attribute values

• Random select m sub-spaces• Select a sample of size n1 from each of the m selected sub-spaces

– First phase sampling

– For one selected sub-space, if any such selected data record is a unit of interest, proceed

• Select a sample of size n2 from the corresponding sub-spaces– Second phase sampling

– Sub-spaces contains units of interest in the first sampling phase may give us more units of interests in the second sampling phase

Aggressively draw skewCausing data

Low sampling cost

Page 25: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 25

Unit of interest: sales larger than a threshold

1. Random select the subspaces (input search)

2. Random select N1 samplesIn each selected subspaceN1=4

3. If any sample selected in asub space is a unit of interest,select N2 samples from thesub space again, N2=3

Page 26: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 26

Evaluation• Data sets

– Synthetic data sets: generated using MINITAB, varying data skew from 1 to 9

– US Census data: 2002 US economic census data on wholesale trade product lines listed by the kind of business (skew=8)

– Yahoo! Auto: prices of used Ford cars from 2000 to 2009 located within 50 miles of a zipcode (skew=0.7)

• Metrics– AER: Absolute Error Rate– Sampling cost: number of input samples needed

• Methods– ANS– TPS– SRS

Page 27: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 27

ANS Performance w.r.t. Data Skew

1. AER increases moderately with the increase of data skew

2. When k=8, AER is consistently smaller than 19%

3. Larger k doesn’t help much to improve

Page 28: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 28

TPS Performance w.r.t. Data Skew

1. With the increase of data skew, AER increases moderately

2. For sub-space sample size of 30%, AER always smaller than 17%

Page 29: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 29

AER Comparison on Synthetic Data

1. All methods work well on small skew data2. HDSampler has bad performance on data with skew>2

3. Our two methods outperform HDSampler by a factor of 5

Page 30: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 30

AER Comparison on US Census Data

Page 31: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 31

AER Comparison on Yahoo! Data

1. For AVG, three methods are comparable in terms of accuracy

2. For MAX, our methods are better (by a factor of 4)

Page 32: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 32

Sampling Cost Comparison on Yahoo Data

1. To achieve a low AER, TPS has one-third of the sampling cost of HDSampler

2. The total number of samples TPS obtained (with same cost in time) is twice the number of samples HDSampler obtained

More Results

Page 33: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 33

Outline• Introduction• Sampling methods for online aggregation query

– Motivation– ANS and TPS methods– Evaluation

• Stratification methods for low selectivity query– Motivation– Harmony search and Bayesian based adaptation– Evaluation

• Future work and conclusion

Page 34: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 34

Motivating Example: Low Selectivity

• Random Sampling– None of the selected records satisfies the low selectivity predicate

• Stratified Sampling– Partitioning attribute, selective attribute– How to perform stratification

• Distance based methods (clustering, outlier indexing)• Selective attribute is not queriable

– Auxiliary attribute based stratification» Dalenius and Hodges’s method, Ekman’s method, Gunning and Horgan’s

geometric method– Require strong correlation between auxiliary and selective attribute

Page 35: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 35

Our Contributions• We focus on queries in the following format

• Propose a Bayesian adaptive harmony search stratification method to stratify a hidden selective attribute based on an auxiliary attribute

• The stratification accurately reflects the distribution of the hidden selective attribute when correlation is weak

• Estimations from our stratification outperforms existing methods by a factor of 5

• Estimation accuracy obtained from our method is higher than 95% for 0.01% selectivity queries

Page 36: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 36

Background: Stratification

• Purpose– Within stratum homogeneity

• Partition data set R into k strata– Partitioning attribute x has value range , find k-1

breaking points

• Sampling allocation– Neyman allocation

– Bayesian Neyman allocation

Page 37: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 37

Background: Harmony Search (1)• A phenomenon-mimicking meta-heuristic algorithm

inspired by the improvisation process of musicians• Optimize an objective function

• Initialize the harmony memory, M random guesses of the decision variable vector

Page 38: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 38

Background: Harmony Search (2)• Improvise a new harmony from the harmony memory

– A new harmony vector is generated from two parameters HMCR and PAR.

• Update harmony memory• Termination

Page 39: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 39

Algorithm Overview (1)

• Find the best stratification of the selective attribute

X

Per

cent

2.11.81.51.20.90.60.30.0

181614121086420

Stratification of A Selective Attribute

X

Per

cent

2.11.81.51.20.90.60.30.0

1816141210

86420

VariableSelective AttributeAuxiliary Attribute

Stratification of An Auxiliary Attribute0 0.45 0.75 1.1 2.1 0.85 1.4 2.10 0.45

Page 40: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 40

Algorithm Overview (2)• Consider an auxiliary attribute as the partitioning

attribute• Harmony memory

– A list of breaking points vectors of the auxiliary attribute

Page 41: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 41

Harmony Objective Function

• What is a good stratification?– Condition 1: homogenous data within stratum

• Small sample variance summation across strata

– Condition 2: stratification with high precision, i.e., the low selective region of the distribution is exclusively covered by some strata

Page 42: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 42

Sample Allocation• Determine the number of samples assigned to

each stratum• What stratum should be put heavier sampling

weight?– Stratum with diversity (more heterogeneous)

• High sample variance

– Stratum which covers large percentage of the low selective region

• High precision

Page 43: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 43

Parameter Adaptation: Overview• Two parameters: HMCR and PAR

X: value of HMCR parameter

Y: precentage of cases in which we obtain better harmony vector

Page 44: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 44

Bayesian Method Overview• Estimate unknown parameter (posterior distribution)

based on prior knowledge or belief (prior distribution)

• Observed data y, unknown parameter θ

• In our scenario– Observed data: the harmony parameter values which yield

better harmony vector

– Unknown parameter: adaptation pattern of harmony parameters

Page 45: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 45

Bayesian Adaptation (1)• Assume the adaptation patterns of parameters HMCR and

PAR follows probability functions

• Represent our belief in θ as a prior probability distribution

• Observe our data: the HMCR and PAR parameters which yield the best new harmony vector based on the current harmony memory, denoted as

• Based on , compute θ (posterior distribution), denoted as (Details)

• Compute the adapted parameter value as

Page 46: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 46

Bayesian Adaptation (2)• Probability function

• Prior distribution of θ

Details

Page 47: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 47

Evaluation• Data sets

– Synthetic data sets: generated using MINITAB, varying correlation between auxiliary attribute and selective attribute from 1 to 0.3

– US Census data, correlation (number of settlements and sale) 0.56

– Yahoo! Auto, correlation (year and mileage) 0.7

• Metric– AER: Absolute Error Rate

• Methods– Leaps and Bounds (L&B)– Dalenius and Hodges (D&H)– Random Sampling (No Stratification)– Ours (HarmonyAdp)

Page 48: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 48

HarmonyAdp Performance (1)

1. Higher accuracy with more iterations

2. Iteration>40 (2% of totaldata size), low AER for allselectivity values

3. Robust with respect to data correlation

Page 49: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 49

HarmonyAdp Performance (2)

Page 50: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 50

Four Methods Comparison (1)1. All methods work well on easy queries

2. When queries get harder,D&H, L&B and Random degrade severely. Ours has good performance

3. For 0.1% queries, our method outperforms others by a factor of 5

4. For extreme low selectivity queries, the error rate from our method is always lower than 18%

Page 51: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 51

Four Methods Comparison (2)

1. Similar pattern

2. Performance degradeof other three algorithmson Auto data is small,because data correlationis high

More Results

Page 52: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 52

Conclusion• Propose a system for exploring and querying deep

web data sources

• Propose novel sampling methods to handle online aggregation and low selectivity queries

• Propose novel query planning algorithms to answer structured SQL queries over the deep web

• System also has the following features– Self-healing from unavailable/inaccessible data sources

– Query optimization

– Automatic hidden schema mining and integration

Page 53: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 53

Future Work• Better understanding the deep web data

– Data quality• Reliability, authority, data provenance models• Data source correlation

– Data distribution• Learning approximate data distribution with a bounded confidence interval• Handling consistency issue

• Better querying the deep web– Structured query

• Existence (Is-there) queries, data mining queries

– Semantic query, incorporate semantic web with deep web• Automatic ontology generating from deep web• Semantic query answering

– Result ranking• Ranking results in heterogeneous formats• Combining results from different/correlated data sources

Page 54: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 54

Thank You!

Page 55: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 55

Backup• ANS Estimator

• TPS Estimator

• SPJ NP Hard

• Aggregation NP Hard

• Bayesian Inference

• Other System Modules

Page 56: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 56

Bayesian Adaptation (2)• Probability function

– HMCR ranges from 0 to 1

– HMCR varies from small values to large values

– Beta distribution best mimics this adaptation pattern

Page 57: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 57

Bayesian Adaptation (3)

• Prior distribution of θ– Belief about θ before any data is observed

– Belief about θ before the execution of harmony search algorithm

– What is our belief?• For Beta distribution, θ<β, peak occurs before HMCR=0.5• θ>β, peak occurs after HMCR=0.5• At the beginning, HMCR is small

Page 58: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 58

Lemma on Biasness

Page 59: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 59

Page 60: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 60

Page 61: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 61

Page 62: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 62

Page 63: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 63

Page 64: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 64

Page 65: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 65

Page 66: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 66

TPS Estimator

Page 67: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 67

Page 68: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 68

Details on theorem, sufficient statistics and etc

Page 69: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 69

Page 70: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 70

Details on max

Page 71: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 71

SPJ NP hardLet C={C1,C2,…,Cm} be an instance of the set cover problem.

The node set V of G is given by V={C1,…Cm}

Page 72: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 72

Aggregation NP hardLet C={C1,C2,…,Cm} be an instance of the set cover problem.

The node set V of G is given by V={C1,…Cm, W}

The grouping attribute is s*, and node W covers s*

Page 73: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 73

ANS Estimator Bias Evaluation

1. k=8, we achieve the best estimation

2. k>10, AER increases moderately, bias occurs

3. k=10, we reach the cost limit, bias occurs

Page 74: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 74

Harmony Objective Function• What is a good stratification?

– Condition 1: homogenous data within stratum• Small sample variance summation across strata

– Condition 2: stratification with high precision, i.e., the low selective region of the distribution is exclusively covered by some strata

Selective Attribute

Perc

ent

2.11.81.51.20.90.60.30.0

181614121086420

0 0.45 0.75 1.1 2.1Ideal Stratification

Selective Attribute

Per

cent

2.11.81.51.20.90.60.30.0

181614121086420

0 0.45 0.75 2.1

A Homogenous But Not Precise Stratification

Page 75: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 75

Outline• Introduction• Sampling methods for online aggregation query

– Motivation– ANS and TPS methods– Evaluation

• Stratification methods for low selectivity query– Motivation– Harmony search and Bayesian based adaptation– Evaluation

• Answering complex structured query• Query optimization and fault tolerance• Future work and conclusion

Page 76: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 76

Motivating Example: Structured Query

Biologists have identified the gene X and protein Y are contributors of a disease. They want to examine the SNPs (Single Nucleotide Polymorphisms) located in the genes that share the same functions as either X or Y.

Particularly, for all SNPs located in each such gene functions similar to either X or Y, and those have a heterozygosity value greater than 0.01, biologists want to know the maximal SNP frequency in the Asian population.

Page 77: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 77

N C B IG E N E GO

Q u ery P lan P art 1 (su b q u ery 1 )

gen e X fu n c tio n gene

nam e

H um anP ro tein

N C B IG E N E GO

Q u ery P lan P art 2 (su b q u ery 2 )gen en am e

p ro te in Y fu n ctio n genenam e

The gene has the same functions as XThe gene has the same functions as YThe frequency information of the SNPs located in these genes and filtered by heterozygosity values

S N P 5 0 0C a nc e r

d bS N P

Q u ery P lan P art 3 (m ain q u ery)

fi l te rin g b y H etero zygo s ity

S N P freq u en cy

Page 78: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 78

Query Planning Algorithm• We propose query planning algorithms to support

– Select-Project-Join query– Aggregation query– Uncorrelated nested query

• We also extend our algorithm to the following SQL operators– Union and OR– Order by– Having

• Planning problem: NP hard sub-graph search problem– We propose heuristic graph search algorithms

Page 79: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 79

Query Optimization Techniques• Data redundancy across data sources

• “All-In-One” result page of deep web data sources– Deep web data sources: return all in one page no matter you

want it or not

• Query plan driven caching mechanism to optimization query execution– Reuse similar previous query plans– Increase the possibility of previous data reuse

• Results– Our method achieves twice the speed of a baseline method for

half the cases

Page 80: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 80

Fault Tolerance Techniques• Inaccessibility and unavailability of deep web data

sources• Self-healing approach for deep web query execution• Data redundancy based fault tolerance techniques

– Hide unavailable and/or inaccessible data sources

• Steps– Find minimal impacted subplan– Find maximal fixable query– Generate a replacement plan

• Result– For half the cases, our method achieves 50% speedup

compared to a baseline method

Page 81: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 81

Initial Thoughts• Sampling: Approximate answers• Simple random sampling (SRS)

– Every data record has the same probability to be selected

• Drawbacks of SRS– Bad performance on skewed data

– High sampling cost to perform SRS on deep web (Dasgupta et al, HDSampler)

D=(1,1,1,1,1,1,1,1,10,1000) True average: 101.8

SRS samples of size 2:

S1(1,1)=1, S2(1,10)=5.5, S3(1,1000)=500.5, S4(10,1000)=505

Page 82: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 82

The Deep Web is Informative• Structured Data

– Surface web: text format– Deep web: relational data in backend relational

databases

• Topic Specific Data– Biology, Chemistry, Medical, Travel, Business,

Academia, and many more…

• Publicly Accessible Data– 95 percent of the deep web is publicly accessible (with

access limitations)

Page 83: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 83

Sufficient Statistics No other statistic which can be calculated from the same

sample provides any additional information as to the value of the parameter

The concept is most general when defined as follows: A statistic T(X) is sufficient for underlying parameter θ precisely if the conditional probability distribution of the data X, given the statistic T(X), is not a function of the parameter θ

This is the reason that the final sample s i is a sufficient statistics for the estimator (all information must be contained in the final sample)

Page 84: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 84

Rao–Blackwell theorem • In statistics, the Rao–Blackwell theorem, is a result which

characterizes the transformation of an arbitrarily crude estimator into an estimator that is optimal by the mean-squared-error criterion or any of a variety of similar criteria.

• The Rao–Blackwell theorem states that if g(X) is any kind of estimator of a parameter θ, then the conditional expectation of g(X) given T(X), where T is a sufficient statistic, is typically a better estimator of θ, and is never worse.

• The mean squared error of the Rao–Blackwell estimator does not exceed that of the original estimator.

Page 85: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 85

Conditional Expectation• Let X and Y be discrete random variables, then the

conditional expectation of X given the event Y=y is a function of y over the domain of Y

Page 86: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 86

Chebyshev Inequality

Page 87: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 87

Page 88: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 88

Page 89: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 89

Page 90: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 90

Bayesian Computation (1)• The posterior estimation for θ can be computed using

• Using Monte Carlo Sampling– Generate a sequence of random variables ,

having common density – Using strong law of large numbers, we could approximate the

above formula as

Page 91: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 91

Bayesian Computation (2)

• We choose to be equal to the prior distribution

Page 92: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 92

Strong Law of Large Numbers• In probability theory, the law of large numbers

(LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

Page 93: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 93

Page 94: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 94

Page 95: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 95

Neyman Allocation

Page 96: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 96

Leaps and Bounds• Assumptions

– Strong correlation– Uniform distribution within stratum (auxiliary attribute)– Equal coefficients of variance in each stratum

(auxiliary attribute)

• Breaks follows geometric progression

Page 97: SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and

SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

PhD Defense The Ohio State University Summer 2010 97

Dalenius and Hodges• Assumptions

– Strong correlation– Auxiliary attribute uniform distribution within stratum– So we have

• Breaks in following positions