
Similarity Search for Web Services

Xin (Luna) Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang

University of Washington

Web Service Search
  Web services are getting popular within organizations and on the web
  The growing number of web services raises the problem of web-service search.
  First-generation web-service search engines do keyword search on web-service descriptions
    BindingPoint, Grand Central, Web Service List, Salcentral, Web Service of the Day, Remote Methods, etc.

Keyword Search does not Capture the Underlying Semantics
  [Screenshots: keyword search for "zip" (50) and for "zipcode" (18)]

Keyword Search does not Accurately Specify Users’ Information Needs

Users Need to Drill Down to Find the Desired Operations
  1) Choose a web service
  2) Choose an operation
  3) Enter the input parameters
  4) Results – output

How to Improve Web Service Search?
  Offer users more flexibility by providing similar operations
  Base the similarity comparison on the underlying semantics

1) Provide Similar WS Operations
  Op1: GetTemperature
    Input: Zip, Authorization
    Output: Return
  Op2: WeatherFetcher
    Input: PostCode
    Output: TemperatureF, WindChill, Humidity
  Similar operations: select the most appropriate one

2) Provide Operations with Similar Inputs/Outputs
  Op1: GetTemperature
    Input: Zip, Authorization
    Output: Return
  Op2: WeatherFetcher
    Input: PostCode
    Output: TemperatureF, WindChill, Humidity
  Op3: LocalTimeByZipcode
    Input: Zipcode
    Output: LocalTimeByZipCodeResult
  Op4: ZipCodeToCityState
    Input: ZipCode
    Output: City, State
  Similar inputs: aggregate the results of the operations

3) Provide Composable WS Operations
  Op1: GetTemperature
    Input: Zip, Authorization
    Output: Return
  Op2: WeatherFetcher
    Input: PostCode
    Output: TemperatureF, WindChill, Humidity
  Op3: LocalTimeByZipcode
    Input: Zipcode
    Output: LocalTimeByZipCodeResult
  Op4: ZipCodeToCityState
    Input: ZipCode
    Output: City, State
  Op5: CityStateToZipCode
    Input: City, State
    Output: ZipCode
  Input of Op2 is similar to output of Op5: compose web-service operations

Searching with Woogle
  Similar operations, inputs, outputs
  Composable with input, output
  A sample list of similar operations
  Jump from operation to operation

Elementary Problems
  Two elementary problems:
    Operation matching: given a web-service operation, return a list of similar operations
    Input/output matching: given the input/output of a web-service operation, return a list of web-service operations with similar inputs/outputs
  Goal:
    High recall: return potentially similar operations
    Good ranking: rank closer operations higher

Can We Apply Previous Work?
  Software component matching
    Requires knowledge of the implementation – we only know the interface
  Schema matching
    Similarity at a different granularity
    Web services are more loosely related
  Text document matching
    TF/IDF: term frequency analysis
    E.g. Google
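To make the TF/IDF baseline concrete, here is a minimal sketch in plain Python. The three one-line descriptions are invented for illustration, and the weighting (relative term frequency times log inverse document frequency) is one common variant, not necessarily the one any particular search engine uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical short descriptions, in the spirit of the operations on the slides.
docs = [
    "get temperature by zip",
    "weather fetcher by postcode",
    "local time by zipcode",
]
vecs = tfidf_vectors(docs)
similarity_0_1 = cosine(vecs[0], vecs[1])
```

In this toy corpus the only shared word, "by", occurs in every description and gets weight zero, so the first two descriptions score no similarity even though both concern weather at a postal code.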

Why Does Text Matching Not Apply?
  Web page: often long text
  Web service: very brief description
    Lack of information

Web Services Have Very Brief Descriptions

Why Does Text Matching Not Apply? (continued)
  Web page: mainly plain text
  Web service: more complex structure
    Finding term frequency is not enough

Operations Have More Complex Structures
  Op1: GetTemperature
    Input: Zip, Authorization
    Output: Return
  Op2: WeatherFetcher
    Input: PostCode
    Output: TemperatureF, WindChill, Humidity
  Op3: LocalTimeByZipcode
    Input: Zipcode
    Output: LocalTimeByZipCodeResult
  Op4: ZipCodeToCityState
    Input: ZipCode
    Output: City, State
  Op5: CityStateToZipCode
    Input: City, State
    Output: ZipCode
  Similar use of words, but opposite functionality

Our Solution Part 1: Exploit Structure
  Web Service Corpus
    Web service description
    Operation name and description
    Input parameter names
    Output parameter names
  Operation Similarity

Why Does Text Matching Not Apply? (continued)
  Operation and parameter names are highly varied
    Finding word usage patterns is hard

Parameter Names Are Highly Varied
  Op1: GetTemperature
    Input: Zip, Authorization
    Output: Return
  Op2: WeatherFetcher
    Input: PostCode
    Output: TemperatureF, WindChill, Humidity
  Op3: LocalTimeByZipcode
    Input: Zipcode
    Output: LocalTimeByZipCodeResult
  Op4: ZipCodeToCityState
    Input: ZipCode
    Output: City, State
  Op5: CityStateToZipCode
    Input: City, State
    Output: ZipCode
  [Callouts: input parameter names, output parameter names]

Our Solution Part 2: Cluster Parameters into Concepts
  Web Service Corpus
    Web service description
    Operation name and description
    Input parameter names & concepts
    Output parameter names & concepts
  Concepts
  Operation Similarity
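As a rough illustration of why concepts help, the sketch below splits parameter names into terms, maps terms to concepts, and compares two inputs by Jaccard overlap. The term-to-concept table, the concept label C_postal, and the choice of Jaccard are all assumptions made for this example; the slides do not specify how concept sets are compared.

```python
import re

def split_terms(name):
    """Split a camel-case parameter name into lower-case terms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]

def concepts_of(param_names, term_to_concept):
    """Map every term of every parameter name to its concept (or to itself)."""
    concepts = set()
    for name in param_names:
        for term in split_terms(name):
            concepts.add(term_to_concept.get(term, term))
    return concepts

def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical clustering output: three terms assigned to one concept.
TERM_TO_CONCEPT = {"zip": "C_postal", "post": "C_postal", "zipcode": "C_postal"}

op1_input = ["Zip", "Authorization"]   # GetTemperature
op2_input = ["PostCode"]               # WeatherFetcher

by_name = jaccard(concepts_of(op1_input, {}), concepts_of(op2_input, {}))
by_concept = jaccard(concepts_of(op1_input, TERM_TO_CONCEPT),
                     concepts_of(op2_input, TERM_TO_CONCEPT))
```

With raw terms the two inputs share nothing; once zip and post map to the same concept, they overlap on one of three concepts.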

Outline
  Overview
  Clustering parameter names
  Experimental evaluation
  Conclusions and ongoing work

Clustering Parameter Names
  Heuristic: parameter terms tend to express the same concept if they occur together often
  Strategy: cluster parameter terms into concepts based on their co-occurrences
  Given terms p and q, similarity from p to q:
    $\mathrm{Sim}(p \to q) = P(q \mid p)$
    Directional: e.g. $\mathrm{Sim}(\text{zip} \to \text{code}) > \mathrm{Sim}(\text{code} \to \text{zip})$ (ZipCode vs. TeamCode, ProxyCode, BarCode, etc.)
  Term p is close to q: $\mathrm{Sim}(p \to q) > \text{Threshold}$, e.g. city is close to state.
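The directional similarity can be estimated from co-occurrence counts. The sketch below assumes P(q | p) is taken as the fraction of inputs/outputs containing p that also contain q; the function name and the toy term sets are invented for illustration.

```python
from collections import Counter
from itertools import permutations

def directional_similarity(term_sets):
    """Estimate Sim(p -> q) = P(q | p) from co-occurrence in inputs/outputs."""
    occurrences = Counter()      # number of term sets containing p
    co_occurrences = Counter()   # number of term sets containing both p and q
    for terms in term_sets:
        unique = set(terms)
        occurrences.update(unique)
        co_occurrences.update(permutations(unique, 2))
    return {(p, q): n / occurrences[p] for (p, q), n in co_occurrences.items()}

# Toy inputs/outputs, each already split into terms (hypothetical data).
term_sets = [
    {"zip", "code"},
    {"zip", "code"},
    {"team", "code"},
    {"bar", "code"},
    {"city", "state"},
]
sim = directional_similarity(term_sets)
zip_to_code = sim[("zip", "code")]   # zip always appears with code
code_to_zip = sim[("code", "zip")]   # code appears with zip half the time
```

Here zip implies code more strongly than code implies zip, which mirrors the ZipCode vs. TeamCode, BarCode example.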

Criteria for an Ideal Clustering
  High cohesion and low correlation
    cohesion measures the intra-cluster term similarity
    correlation measures the inter-cluster term similarity
  $\text{cohesion/correlation score} = \dfrac{\mathrm{avg}(\text{cohesion})}{\mathrm{avg}(\text{correlation})}$
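The slide gives only the ratio, so the sketch below fills in the two ingredients with assumed definitions: cohesion of a cluster is the fraction of ordered term pairs inside it that are close, and correlation of two clusters is the fraction of ordered cross-cluster pairs that are close. The set `close` of ordered pairs is hypothetical input.

```python
import math
from itertools import permutations, product

def cohesion(cluster, close):
    """Fraction of ordered intra-cluster pairs (p, q) with p close to q."""
    pairs = list(permutations(cluster, 2))
    if not pairs:
        return 1.0  # assumption: a single-term cluster is fully cohesive
    return sum((p, q) in close for p, q in pairs) / len(pairs)

def correlation(a, b, close):
    """Fraction of ordered cross-cluster pairs (p, q) with p close to q."""
    pairs = list(product(a, b)) + list(product(b, a))
    return sum((p, q) in close for p, q in pairs) / len(pairs)

def cohesion_correlation_score(clusters, close):
    """avg(cohesion) / avg(correlation); infinite when no clusters correlate."""
    avg_cohesion = sum(cohesion(c, close) for c in clusters) / len(clusters)
    cross = [correlation(a, b, close)
             for i, a in enumerate(clusters) for b in clusters[i + 1:]]
    avg_correlation = sum(cross) / len(cross) if cross else 0.0
    if avg_correlation == 0:
        return math.inf
    return avg_cohesion / avg_correlation

# Hypothetical closeness relation over four terms.
close = {("city", "state"), ("state", "city"), ("zip", "code"), ("zip", "city")}
clusters = [["city", "state"], ["zip", "code"]]
score = cohesion_correlation_score(clusters, close)
```

A higher score means terms are tightly related within clusters and loosely related across them.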

Clustering Algorithm (I)
  Algorithm: a series of refinements of the classic agglomerative clustering
  Basic agglomerative clustering: merge clusters I and J if term i in I is close to term j in J
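A minimal sketch of the basic merge rule, before any refinement. It starts from singleton clusters and merges whenever any term of one cluster is close to any term of the other; checking closeness in both directions is an assumption of this sketch, as is the toy `close` relation.

```python
def basic_agglomerative(terms, close):
    """Merge clusters while some term in one is close to some term in another."""
    clusters = [{t} for t in terms]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Assumption: closeness in either direction triggers a merge.
                if any((p, q) in close or (q, p) in close
                       for p in clusters[i] for q in clusters[j]):
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return clusters

# Hypothetical closeness relation (ordered pairs: first is close to second).
close = {
    ("temperature", "windchill"), ("windchill", "temperature"),
    ("zip", "temperature"),
    ("city", "state"),
}
terms = ["temperature", "windchill", "zip", "city", "state"]
result = basic_agglomerative(terms, close)
```

A single close pair is enough to pull zip into the weather cluster, which is the weakness the later refinements address.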

Clustering Algorithm (II)
  Problem:
    {temperature, windchill} + {zip} => {temperature, windchill, zip}
  Solution:
    Cohesion condition: each term in the result cluster is close to most (e.g. half) of the other terms in the cluster
    Refined algorithm: merge clusters I and J only if the result cluster satisfies the cohesion condition
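The cohesion condition can be checked directly before a merge is accepted. In this sketch, "most (e.g. half)" is read as "at least the given fraction of the other terms", which is an assumption, and the `close` relation is toy data.

```python
def satisfies_cohesion(cluster, close, fraction=0.5):
    """True if every term is close to at least `fraction` of the other terms."""
    for p in cluster:
        others = [q for q in cluster if q != p]
        if not others:
            continue  # a single-term cluster is trivially cohesive
        n_close = sum((p, q) in close for q in others)
        if n_close < fraction * len(others):
            return False
    return True

def try_merge(i_cluster, j_cluster, close, fraction=0.5):
    """Return the merged cluster, or None if it violates the cohesion condition."""
    merged = set(i_cluster) | set(j_cluster)
    return merged if satisfies_cohesion(merged, close, fraction) else None

# Hypothetical closeness relation: temperature is close to zip, but zip is
# close to nothing in the weather cluster.
close = {
    ("temperature", "windchill"), ("windchill", "temperature"),
    ("temperature", "zip"),
}
rejected = try_merge({"temperature", "windchill"}, {"zip"}, close)
accepted = try_merge({"temperature"}, {"windchill"}, close)
```

The basic rule would merge {temperature, windchill} with {zip} because one close pair exists; the cohesion check rejects it because zip is close to none of the other terms.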

Clustering Algorithm (III)
  Problem:
    {code, zip} + {city, state, street} => {code} + {zip, city, state, street}
  Solution: split before merge
  [Diagram: clusters I and J are split into I', I-I' and J', J-J' before merging]

Clustering Algorithm (IV)
  Problem:
    {city, state, street} + {zip, code} => {city, state, street, zip, code}
  Solution:
    Noise terms: most (e.g. half) of the occurrences are not accompanied by other terms in the concept
    After a pass of splitting and merging, remove noise terms.
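Noise-term detection, as described above, can be sketched as a count over occurrences. The reading of "most (e.g. half)" as "more than the given fraction" is an assumption, and the concept and term sets below are invented.

```python
def noise_terms(concept, term_sets, fraction=0.5):
    """Terms whose occurrences are mostly unaccompanied by other concept terms."""
    concept = set(concept)
    noise = set()
    for term in concept:
        occurrences = [set(s) for s in term_sets if term in s]
        if not occurrences:
            continue
        others = concept - {term}
        # Count occurrences in which no other term of the concept appears.
        alone = sum(1 for s in occurrences if not (others & s))
        if alone > fraction * len(occurrences):
            noise.add(term)
    return noise

# Hypothetical concept and inputs/outputs (already split into terms).
concept = {"city", "state", "code"}
term_sets = [
    {"city", "state"},
    {"city", "state"},
    {"city", "state", "code"},
    {"team", "code"},
    {"bar", "code"},
]
noisy = noise_terms(concept, term_sets)
```

In this toy data code appears three times but only once alongside city or state, so it is flagged and would be removed from the concept.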

Clustering Algorithm (V)
  Problems:
    The cohesion condition is too strict for large concepts
    The terms taken off during splitting lose the chance to merge with other terms
  Solution: run the algorithm iteratively
    do {
      refined agglomerative clustering (a set of splitting-and-merging);
      remove noise terms;
      replace each term with its concept;
    } while (merges still occur)

Outline
  Overview
  Clustering parameter names
  Experimental evaluation
  Conclusions and ongoing work

Experiment Data and Clustering Results
  Data set:
    790 web services (431 are active)
    1574 distinct operations
    3148 inputs/outputs
  Clustering results:
    1599 parameter terms
    623 concepts
      441 single-term concepts (54 frequent terms and 387 infrequent terms)
      182 multi-term concepts (59 concepts with more than 5 terms)

Example Clusters
  (temperature, heatindex, icon, chance, precipe, uv, like, temprature, dew, feel, weather, wind, humid, visible, pressure, condition, windchill, dewpoint, moonset, sunrise, moonrise, sunset, heat, precipit, extend, forecast, china, local, update)
  (entere, enter, pitcher, situation, overall, hit, double, strike, stolen, ball, rb, homerun, triple, caught, steal, pct, op, slug, player, bat, season, stats, position, experience, throw, players, draft, experier, birth, modifier)
  (state, city)
  (zip)
  (code)


Measuring Top-K Precision
  Benchmark
    25 web-service operations
      From several domains
      With different input/output sizes and description sizes
    Manually label whether the top hits are similar
  Measure
    Top-k precision: precision for the top-k hits
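Top-k precision is straightforward to compute from a ranked result list and a set of manually labeled similar operations. The sketch divides by k, the usual convention; the ranking and labels below are made up for illustration and are not results from the experiments.

```python
def top_k_precision(ranked, relevant, k):
    """Fraction of the top-k returned operations that are labeled similar."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for op in ranked[:k] if op in relevant) / k

# Hypothetical ranked hits and manual labels for one query operation.
ranked = ["WeatherFetcher", "LocalTimeByZipcode", "GetTemperature", "ZipCodeToCityState"]
relevant = {"WeatherFetcher", "GetTemperature"}

p_at_1 = top_k_precision(ranked, relevant, 1)
p_at_2 = top_k_precision(ranked, relevant, 2)
p_at_4 = top_k_precision(ranked, relevant, 4)
```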

Top-k Precision for Operation Matching
  [Chart; compared methods: Woogle, text matching on descriptions, ignore structure]

Top-k Precision for Input/output Matching

Measuring Precision and Recall
  Benchmark:
    8 web-service operations and 15 inputs/outputs
      From 6 domains
      With different popularity
      Inputs/outputs convey different numbers of concepts, and concepts have varied popularity
    Manually label similar operations and inputs/outputs.
  Measure: R-P (Recall-Precision) curve
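A recall-precision curve can be traced by walking down the ranked list and recording a point at every relevant hit. This is a generic sketch of that procedure with invented data, not the evaluation code behind the plots.

```python
def recall_precision_points(ranked, relevant):
    """Return (recall, precision) at each rank where a relevant item appears."""
    points = []
    hits = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

# Hypothetical ranked list and manual labels.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
points = recall_precision_points(ranked, relevant)
```

The first relevant hit at rank 1 gives recall 0.5 at precision 1.0; the second, at rank 3, gives recall 1.0 at precision 2/3.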

Impact of Multiple Sources of Evidence in Operation Matching
  [Figure: recall-precision curves; x-axis Recall, y-axis Precision, both from 0 to 1. Legend: Func, Comb, ParOnly, Woogle. Annotations: Woogle without clustering; ignore structure; text matching on descriptions.]

Impact of Parameter Clustering in Input/output Matching
  [Figure: recall-precision curves; x-axis Recall, y-axis Precision, both from 0 to 1. Legend: ParIO, ConIO, Woogle. Annotations: Woogle; compare only concepts; compare only parameter names.]

Conclusions
  Defined primitives for web-service search
  Algorithms for similarity search on web-service operations
    Exploit structure information
    Cluster parameter names into concepts based on their co-occurrences
  Experiments show that the algorithm obtains high recall and precision.

Ongoing Work I – Template search on Operations
  Input: city state
  Output: weather
  Description: forecast in the next nine days
  GetWeatherByCityState

Ongoing Work II – Composition search on Operations
  See compositions
  getZIPInfoByAddress + GetNineDayForecastInfo

Ongoing Work III – Automatic Web Service Invocation
  city=“Seattle” state=“WA”

Similarity Search for Web Services
@VLDB 2004
Xin (Luna) Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang
University of Washington
www.cs.washington.edu/woogle

Ongoing Work I – Template search on Operations
  Italian CAP
  Location Information
  Holiday Information
  Get Weather Forecast
