Similarity Search for Web Services
Xin (Luna) Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang
University of Washington
Web Service Search
Web services are getting popular within organizations and on the web
The growing number of web services raises the problem of web-service search.
First-generation web-service search engines do keyword search on web-service descriptions
  BindingPoint, Grand Central, Web Service List, Salcentral, Web Service of the Day, Remote Methods, etc.
Keyword Search does not Capture the Underlying Semantics
  zip: 50
  zipcode: 18
Keyword Search does not Accurately Specify Users’ Information Needs
Users Need to Drill Down to Find the Desired Operations
  Choose a web service
  Choose an operation
  Enter the input parameters
  Results – output
How to Improve Web Service Search?
Offer users more flexibility by providing similar operations
Base the similarity comparison on the underlying semantics
1) Provide Similar WS Operations
Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return
Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity
Similar Operations
Select the most appropriate one
2) Provide Operations with Similar Inputs/Outputs
Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return
Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
  Input: Zipcode
  Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
  Input: ZipCode
  Output: City, State
Similar Inputs
Aggregate the results of the operations
3) Provide Composable WS Operations
Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return
Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
  Input: Zipcode
  Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
  Input: ZipCode
  Output: City, State
Op5: CityStateToZipCode
  Input: City, State
  Output: ZipCode
Input of Op2 is similar to Output of Op5
Compose web-service operations
Searching with Woogle
  Similar Operations, Inputs, Outputs
  Composable with Input, Output
  A sample list of similar operations
  Jump from operation to operation
Elementary Problems
Two elementary problems:
  Operation matching: Given a web-service operation, return a list of similar operations
  Input/output matching: Given the input/output of a web-service operation, return a list of web-service operations with similar inputs/outputs
Goal:
  High recall: Return potentially similar operations
  Good ranking: Rank closer operations higher
Can We Apply Previous Work?
Software component matching
  Requires knowledge of the implementation; we only know the interface
Schema matching
  Similarity at a different granularity
  Web services are more loosely related
Text document matching
  TF/IDF: term frequency analysis
  E.g. Google
Why Text Matching Does not Apply?
Web page: often long text
Web service: very brief description
  Lack of information
Web Services Have Very Brief Descriptions
Why Text Matching Does not Apply?
Web page: often long text
Web service: very brief description
  Lack of information
Web page: mainly plain text
Web service: more complex structure
  Finding term frequency is not enough
Operations Have More Complex Structures
Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return
Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
  Input: Zipcode
  Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
  Input: ZipCode
  Output: City, State
Op5: CityStateToZipCode
  Input: City, State
  Output: ZipCode
Similar use of words, but opposite functionality
Our Solution Part 1: Exploit Structure
Web Service Corpus
  Web service description
  Operation name and description
  Input parameter names
  Output parameter names
Operation Similarity
Why Text Matching Does not Apply?
Web page: often long text
Web service: very brief description
  Lack of information
Web page: mainly plain text
Web service: more complex structure
  Finding term frequency is not enough
Operation and parameter names are highly varied
  Finding word usage patterns is hard
Parameter Names Are Highly Varied
Op1: GetTemperature
  Input: Zip, Authorization
  Output: Return
Op2: WeatherFetcher
  Input: PostCode
  Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode
  Input: Zipcode
  Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState
  Input: ZipCode
  Output: City, State
Op5: CityStateToZipCode
  Input: City, State
  Output: ZipCode
Input parameter names
Output parameter names
Our Solution Part 2: Cluster Parameters into Concepts
Web Service Corpus
  Web service description
  Operation name and description
  Input parameter names & concepts
  Output parameter names & concepts
Operation Similarity
Concepts
Outline
  Overview
  Clustering parameter names
  Experimental evaluation
  Conclusions and ongoing work
Clustering Parameter Names
Heuristic: Parameter terms tend to express the same concept if they occur together often
Strategy: Cluster parameter terms into concepts based on their co-occurrences
Given terms p and q, similarity from p to q:
  Sim(p→q) = P(q|p)
  Directional: e.g. Sim(zip→code) > Sim(code→zip)
    (ZipCode vs. TeamCode, ProxyCode, BarCode, etc.)
Term p is close to q: Sim(p→q) > Threshold
  e.g. city is close to state.
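The directional similarity above can be estimated from co-occurrence counts. The following is a minimal sketch, assuming each input or output is represented as a set of terms and that P(q|p) is estimated as the fraction of term sets containing p that also contain q. The helper name `directional_sim` and the sample term sets are illustrative assumptions, not taken from the talk.

```python
from collections import Counter
from itertools import permutations

def directional_sim(term_sets):
    """Estimate Sim(p->q) = P(q|p) from a list of term sets (hypothetical helper)."""
    occurs = Counter()    # number of term sets that contain p
    together = Counter()  # number of term sets that contain both p and q, keyed (p, q)
    for terms in term_sets:
        terms = set(terms)
        occurs.update(terms)
        together.update(permutations(terms, 2))
    return {(p, q): n / occurs[p] for (p, q), n in together.items()}

# Made-up parameter term sets, one per input/output.
samples = [
    {"zip", "code"},
    {"zip", "code"},
    {"team", "code"},
    {"bar", "code"},
    {"city", "state"},
]
sim = directional_sim(samples)
print(sim[("zip", "code")], sim[("code", "zip")])
```

With this toy data, zip is always accompanied by code, while code appears with zip in only half of its occurrences, which matches the directional example on the slide.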
Criteria for an Ideal Clustering
High cohesion and low correlation
  cohesion measures the intra-cluster term similarity
  correlation measures the inter-cluster term similarity
cohesion/correlation score = avg(cohesion) / avg(correlation)
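The score can be made concrete with a small sketch. The slide does not define cohesion and correlation precisely, so this example assumes cohesion is the fraction of ordered term pairs inside a cluster that are close, and correlation is the fraction of ordered cross-cluster pairs that are close. The function names, the 0.5 threshold, and the sample similarities are all illustrative assumptions.

```python
from itertools import combinations

def is_close(sim, p, q, threshold=0.5):
    """p is close to q when Sim(p->q) exceeds the threshold."""
    return sim.get((p, q), 0.0) > threshold

def cohesion(cluster, sim, threshold=0.5):
    """Assumed definition: fraction of ordered intra-cluster pairs that are close."""
    pairs = [(p, q) for p in cluster for q in cluster if p != q]
    if not pairs:
        return 1.0  # assumption: a single-term cluster counts as fully cohesive
    return sum(is_close(sim, p, q, threshold) for p, q in pairs) / len(pairs)

def correlation(a, b, sim, threshold=0.5):
    """Assumed definition: fraction of ordered cross-cluster pairs that are close."""
    pairs = [(p, q) for p in a for q in b] + [(q, p) for p in a for q in b]
    return sum(is_close(sim, p, q, threshold) for p, q in pairs) / len(pairs)

def score(clusters, sim, threshold=0.5):
    """avg(cohesion) / avg(correlation); higher is better."""
    avg_coh = sum(cohesion(c, sim, threshold) for c in clusters) / len(clusters)
    cross = list(combinations(clusters, 2))
    if not cross:
        return float("inf")
    avg_cor = sum(correlation(a, b, sim, threshold) for a, b in cross) / len(cross)
    return avg_coh / avg_cor if avg_cor > 0 else float("inf")

# Made-up directional similarities.
example_sim = {("city", "state"): 0.9, ("state", "city"): 0.8, ("zip", "city"): 0.6}
example_clusters = [{"city", "state"}, {"zip"}]
example_score = score(example_clusters, example_sim)
print(example_score)
```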
Clustering Algorithm (I)
Algorithm: a series of refinements of the classic agglomerative clustering
Basic agglomerative clustering: merge clusters I and J if term i in I is close to term j in J
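The basic merging rule can be sketched as follows. This is an illustration only: it assumes closeness has already been computed as a set of ordered pairs, and that a close pair in either direction triggers a merge. The name `basic_agglomerative` and the sample pairs are hypothetical.

```python
def basic_agglomerative(terms, close_pairs):
    """Merge clusters I and J whenever some i in I is close to some j in J."""
    clusters = [{t} for t in terms]  # start with one cluster per term
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                I, J = clusters[a], clusters[b]
                # Assumption: closeness in either direction is enough to merge.
                if any((i, j) in close_pairs or (j, i) in close_pairs
                       for i in I for j in J):
                    clusters[a] = I | J
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return clusters

terms = ["temperature", "windchill", "zip", "city", "state"]
close_pairs = {("temperature", "windchill"), ("windchill", "zip"), ("city", "state")}
result = basic_agglomerative(terms, close_pairs)
print(result)
```

Because a single close pair is enough, one link between windchill and zip pulls zip into the weather cluster, which is why the basic rule needs refinement.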
Clustering Algorithm (II)
Problem:
  {temperature, windchill} + {zip} => {temperature, windchill, zip}
Solution:
  Cohesion condition: each term in the result cluster is close to most (e.g. half) of the other terms in the cluster
  Refined Algorithm: merge clusters I and J only if the result cluster satisfies the cohesion condition
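A check for the cohesion condition might look like the sketch below. It assumes closeness is given as a set of ordered pairs (p, q) meaning "p is close to q", and it reads "most (e.g. half)" as "at least the given fraction". The function name and the sample pairs are illustrative assumptions.

```python
def satisfies_cohesion(cluster, close_pairs, fraction=0.5):
    """True if every term is close to at least `fraction` of the other terms."""
    for p in cluster:
        others = [q for q in cluster if q != p]
        if not others:
            continue  # a single-term cluster trivially satisfies the condition
        n_close = sum((p, q) in close_pairs for q in others)
        if n_close < fraction * len(others):
            return False
    return True

close_pairs = {
    ("temperature", "windchill"),
    ("windchill", "temperature"),
    ("windchill", "zip"),
}
ok_small = satisfies_cohesion({"temperature", "windchill"}, close_pairs)
ok_merged = satisfies_cohesion({"temperature", "windchill", "zip"}, close_pairs)
print(ok_small, ok_merged)
```

Here the merged cluster is rejected because zip is not close to either of the other two terms, so the refined algorithm would not perform that merge.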
Clustering Algorithm (III)
Problem:
  {code, zip} + {city, state, street} => {code} + {zip, city, state, street}
Solution: split before merge
[Figure: clusters I and J, with sub-clusters I' and J' split off from I-I' and J-J' before merging]
Clustering Algorithm (IV)
Problem:
  {city, state, street} + {zip, code} => {city, state, street, zip, code}
Solution:
  noise terms: most (e.g. half) of the occurrences are not accompanied by other terms in the concept
  After a pass of splitting and merging, remove noise terms.
Clustering Algorithm (V)
Problems:
  The cohesion condition is too strict for large concepts
  The terms taken off during splitting lose the chance to merge with other terms
Solution: Run the algorithm iteratively
  do {
    refined agglomerative clustering (a set of splitting-and-merging);
    remove noise terms;
    replace each term with its concept;
  } until (no more merges)
Outline
  Overview
  Clustering parameter names
  Experimental evaluation
  Conclusions and ongoing work
Experiment Data and Clustering Results
Data set:
  790 web services (431 are active)
  1574 distinct operations
  3148 inputs/outputs
Clustering results:
  1599 parameter terms
  623 concepts
    441 single-term concepts (54 frequent terms and 387 infrequent terms)
    182 multi-term concepts (59 concepts with more than 5 terms)
Example Clusters
  (temperature, heatindex, icon, chance, precipe, uv, like, temprature, dew, feel, weather, wind, humid, visible, pressure, condition, windchill, dewpoint, moonset, sunrise, moonrise, sunset, heat, precipit, extend, forecast, china, local, update)
  (entere, enter, pitcher, situation, overall, hit, double, strike, stolen, ball, rb, homerun, triple, caught, steal, pct, op, slug, player, bat, season, stats, position, experience, throw, players, draft, experier, birth, modifier)
  (state, city)
  (zip)
  (code)
Measuring Top-K Precision
Benchmark
  25 web-service operations
    From several domains
    With different input/output sizes and description sizes
  Manually label whether the top hits are similar
Measure
  Top-k precision: precision for the top-k hits
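Top-k precision is the fraction of the first k returned hits that were labeled similar. A minimal sketch follows; the function name and the sample ranking are illustrative, and dividing by the number of hits actually returned (when fewer than k) is an assumption rather than something stated in the talk.

```python
def top_k_precision(ranked_hits, relevant, k):
    """Fraction of the top-k hits that are labeled as similar (relevant)."""
    top = ranked_hits[:k]
    if not top:
        return 0.0
    # Assumption: if fewer than k hits were returned, divide by the number returned.
    return sum(1 for hit in top if hit in relevant) / len(top)

# Made-up ranking for a query operation, with manually labeled similar operations.
ranked = ["WeatherFetcher", "LocalTimeByZipcode", "GetTemperature", "ZipCodeToCityState"]
labeled_similar = {"WeatherFetcher", "GetTemperature"}
p_at_2 = top_k_precision(ranked, labeled_similar, 2)
p_at_3 = top_k_precision(ranked, labeled_similar, 3)
print(p_at_2, p_at_3)
```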
Top-k Precision for Operation Matching
[Chart comparing: Woogle; Text matching on descriptions; Ignore structure]
Top-k Precision for Input/output Matching
Measuring Precision and Recall
Benchmark:
  8 web-service operations and 15 inputs/outputs
    From 6 domains
    With different popularity
    Inputs/outputs convey different numbers of concepts, and concepts have varied popularity
  Manually label similar operations and inputs/outputs.
Measure: R-P (Recall-Precision) curve
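The points of a recall-precision curve can be computed from a ranked result list and the manually labeled set of similar items. This sketch records one (recall, precision) point each time a relevant item is found. The function name and sample data are illustrative assumptions, and the talk does not say how its curves were interpolated or averaged across queries.

```python
def recall_precision_points(ranked_hits, relevant):
    """Return (recall, precision) pairs, one per relevant hit in rank order."""
    points = []
    found = 0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

# Made-up ranking: "a" and "b" are retrieved, "c" is relevant but never returned.
points = recall_precision_points(["a", "x", "b", "y"], {"a", "b", "c"})
print(points)
```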
Impact of Multiple Sources of Evidence in Operation Matching
[Recall-precision plot; x-axis: Recall (0 to 1); y-axis: Precision (0 to 1); curves: Func, Comb, ParOnly, Woogle; callouts: Woogle without clustering, Ignore structure, Text matching on descriptions]
Impact of Parameter Clustering in Input/output Matching
[Recall-precision plot; x-axis: Recall (0 to 1); y-axis: Precision (0 to 1); curves: ParIO, ConIO, Woogle; callouts: Woogle, Compare only concepts, Compare only parameter names]
Conclusions
Defined primitives for web-service search
Algorithms for similarity search on web-service operations
  Exploit structure information
  Cluster parameter names into concepts based on their co-occurrences
Experiments show that the algorithm obtains high recall and precision.
Ongoing Work I – Template search on Operations
  Input: city state
  Output: weather
  Description: forecast in the next nine days
  GetWeatherByCityState
Ongoing Work II – Composition search on Operations
  See compositions
  getZIPInfoByAddress + GetNineDayForecastInfo
Ongoing Work III – Automatic Web Service Invocation
  city=“Seattle” state=“WA”
Similarity Search for Web Services
@VLDB 2004
Xin (Luna) Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang
University of Washington
www.cs.washington.edu/woogle
Ongoing Work I – Template search on Operations
  Italian CAP
  Location Information
  Holiday Information
  Get Weather Forecast