seekda's web service search engine

Upload: nathalie-steinmetz

Post on 05-Dec-2014

DESCRIPTION

Presentation at the Semantic Web meetup in Seattle, WA, USA, in March 2012: http://www.meetup.com/Semantically-Webbed-Seattle-Meetup-Group/events/52635992/

TRANSCRIPT

Page 1: seekda's Web Service search engine

© Copyright 2012 SEEKDA GmbH – www.seekda.com

seekda's Web Service Search Engine

Nathalie Steinmetz

seekda GmbH

Page 2: seekda's Web Service search engine

seekda Web Service Search Engine

Page 3: seekda's Web Service search engine

Motivation

“Web of services”: growing amount of public services & data on the Web
Problem: How do I find the service I need?
  - General search engine: services hard to identify, not much information on results page
  - Specific portals: access to restricted sets of registered and editorially maintained services
Use semantic technologies for better search experience
  - No to heavy-weight, expressive semantic Web Service languages such as OWL-S or WSML
  - Yes to simple light-weight semantic annotations in RDF
  - Scalability!

Page 4: seekda's Web Service search engine

Outline

Web Service search engine - basics
  - Focused Crawling
  - WSDL-based services
  - Web APIs
seekda's search engine & experimental prototype
Crowdsourcing Web Service annotations
  - Web Service Annotation wizard
  - Amazon Mechanical Turk crowdsourcing
Service ontologies

Page 5: seekda's Web Service search engine

Service Location

Locating Web Services on the Web (approach adopted by the European projects Service-Finder & SOA4All)
  - Crawling the Web for services
  - Aggregate information
  - Annotate services
Supported services:
  - WSDL descriptions
  - Web APIs (a.k.a. RESTful services)

Page 6: seekda's Web Service search engine

Service Crawler Architecture

[Figure: service crawler architecture. Labeled components: Collecting Seeds, Crawl Operator, Crawling, Data Post-Processing, ARCs, Index, RDF meta-data, Configuration & Monitoring]

Page 7: seekda's Web Service search engine

Crawling the Web for Services

Basic crawling process:
  - Start with a set of seed URLs
  - Check whether a page should be fetched or not
  - Fetch the document the URL points to
  - Extract links from the fetched document
  - Decide whether or not to store fetched documents
  - Feed crawler queues with newly extracted links
  - Assign costs/priorities to single URLs and queues
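
The basic crawling process above can be sketched as a small loop. This is a minimal illustration, not seekda's crawler: the names `crawl`, `extract_links`, `should_fetch` and `should_store` are hypothetical, the page fetcher is passed in as a function so the sketch runs without network access, and the cost/priority step is left out (a plain FIFO queue is used instead).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links and drop fragments so a page is not queued twice.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seeds, fetch, should_fetch, should_store, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns the document text or None.

    All parameters are assumptions made for this sketch.
    """
    frontier = deque(seeds)      # start with a set of seed URLs
    seen = set(seeds)
    stored = {}
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        if not should_fetch(url):            # check whether to fetch
            continue
        document = fetch(url)                # fetch the document
        fetched += 1
        if document is None:
            continue
        if should_store(url, document):      # decide whether to store
            stored[url] = document
        for link in extract_links(url, document):   # extract links
            if link not in seen:
                seen.add(link)
                frontier.append(link)        # feed the queue
    return stored
```

For a dry run, `fetch` can be the `get` method of a dictionary that maps URLs to HTML strings, and `should_store` can be any predicate, such as one that keeps only URLs ending in `.wsdl`.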

Page 8: seekda's Web Service search engine

Focused Crawling Techniques

Seed Collection
  - Collecting seeds from specialized portals
  - Reuse known Web Service descriptions and related documents
URL Scheduling
  - Use clever means to prioritize URLs to focus the crawls on the relevant part of the Web
  - Assign costs that influence the priority of a URL in a queue
  - Based on:
      - Building term vectors of pages to assess similarity to WS domain
      - URL characteristics
Queue Scheduling
  - One queue per host
  - Prioritize queues with low-cost URLs
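
One way to picture URL and queue scheduling is the sketch below. It is an illustration under stated assumptions, not the actual scheduler: the cost formula, the `HINTS` keyword list, and the names `term_vector`, `cosine`, `url_cost` and `HostQueues` are all hypothetical. It combines a term-vector similarity score with simple URL characteristics into a cost, keeps one priority queue per host, and always serves the host whose cheapest URL has the lowest cost.

```python
import heapq
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Assumed URL hints; a real crawler would tune these.
HINTS = ("wsdl", "api", "service", "soap")


def term_vector(text):
    """Bag-of-words term vector of a page."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def url_cost(url, page_similarity):
    """Lower cost means higher priority (assumed formula).

    `page_similarity` is the similarity in [0, 1] of the linking page
    to a reference vector for the Web Service domain.
    """
    cost = 1.0 - page_similarity
    if any(hint in url.lower() for hint in HINTS):
        cost -= 0.5  # URL characteristics make the link more promising
    return cost


class HostQueues:
    """One priority queue per host; queues with low-cost URLs go first."""

    def __init__(self):
        self._queues = defaultdict(list)  # host -> heap of (cost, url)

    def add(self, url, cost):
        heapq.heappush(self._queues[urlsplit(url).netloc], (cost, url))

    def pop(self):
        if not self._queues:
            return None
        # Pick the host whose cheapest entry has the lowest cost.
        host = min(self._queues, key=lambda h: self._queues[h][0])
        _, url = heapq.heappop(self._queues[host])
        if not self._queues[host]:
            del self._queues[host]
        return url
```

A production crawler would also apply per-host politeness delays, which is a common reason for keeping one queue per host; the sketch omits them.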

Page 9: seekda's Web Service search engine

Identify WSDLs and Related Information

WSDL identification
  - Check whether a fetched page is XML and valid WSDL
Related documents identification
  - Definition of related document:
      - Inlink to the WSDL
      - Outlink from the WSDL
      - Associated by term vector similarity
  - Task split between crawl run-time and post-processing of the crawl data
  - Task implies the deeper crawling of service provider domains
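
A lightweight version of the WSDL check could look like the sketch below. It assumes that "valid WSDL" can be approximated by parsing the page as XML and inspecting the root element: `definitions` in the WSDL 1.1 namespace or `description` in the WSDL 2.0 namespace. This is not full schema validation, and the function name `looks_like_wsdl` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Root elements of WSDL documents, in ElementTree's {namespace}tag notation.
WSDL_ROOTS = {
    "{http://schemas.xmlsoap.org/wsdl/}definitions",  # WSDL 1.1
    "{http://www.w3.org/ns/wsdl}description",         # WSDL 2.0
}


def looks_like_wsdl(document):
    """Return True if the text is well-formed XML with a WSDL root element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return False  # not XML at all, e.g. a regular HTML page
    return root.tag in WSDL_ROOTS
```

A check like this is cheap enough to run at crawl time, while the more expensive related-document analysis can be deferred to post-processing.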

Page 10: seekda's Web Service search engine

Unique Service Objects

Building unique service objects
  - Collect all similar WSDLs (deduplication)
  - One service = all WSDLs with same provider and service
  - Example:
      Unique Service: http://seekda.com/providers/cdyne.com/IP2Geo
      Endpoint: http://ws.cdyne.com/ip2geo/ip2geo.asmx
      Provider: cdyne.com
      Service: IP2Geo
      WSDLs:
        http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl
        http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl
        ...
Create unique service identifiers: http://seekda.com/providers/<providerName>/<serviceName>
Assemble related information
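
The grouping of WSDLs into unique service objects can be sketched as follows. The function names `service_id` and `group_wsdls` are hypothetical, and the sketch assumes the provider and service names have already been extracted for each WSDL, since the slides do not say how that is done.

```python
from collections import defaultdict

BASE = "http://seekda.com/providers"


def service_id(provider, service):
    """Identifier pattern from the slide: <base>/<providerName>/<serviceName>."""
    return f"{BASE}/{provider}/{service}"


def group_wsdls(records):
    """Group (provider, service, wsdl_url) records into unique services.

    Duplicate WSDL URLs for the same service are collapsed.
    """
    grouped = defaultdict(set)
    for provider, service, wsdl_url in records:
        grouped[service_id(provider, service)].add(wsdl_url)
    return {sid: sorted(urls) for sid, urls in grouped.items()}


records = [
    ("cdyne.com", "IP2Geo", "http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl"),
    ("cdyne.com", "IP2Geo",
     "http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl"),
    ("cdyne.com", "IP2Geo", "http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl"),
]
services = group_wsdls(records)
```

With the example records, both WSDL copies end up under the single identifier http://seekda.com/providers/cdyne.com/IP2Geo.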

Page 11: seekda's Web Service search engine

Search Results

Page 12: seekda's Web Service search engine

Service Overview

Page 13: seekda's Web Service search engine

seekda Web Service Search Engine

WSDL ONLY

Page 14: seekda's Web Service search engine

Why crawl for Web APIs?

Significant growth of Web APIs
  - > 5,400 Web APIs on ProgrammableWeb (including SOAP and REST APIs) [end of 2009: ca. 1,500 Web APIs]
  - > 6,500 Mashups on ProgrammableWeb (combining Web APIs from one or more sources)
  - SOAP services are only a small part of the overall available public services

Page 15: seekda's Web Service search engine

Web API – Example (1/3)

Page 16: seekda's Web Service search engine

Web API – Example (2/3)

Page 17: seekda's Web Service search engine

Web API – Example (3/3)

Problem: Web APIs are described by regular HTML pages
  - No standardized structure that helps with the identification

Page 18: seekda's Web Service search engine

Web API Identification

Solution: Crawl for Web APIs
Approach 1: Manual Feature Identification Approach
  - Taking into account HTML structure (e.g., title, mark-up), syntactical properties of the language used (e.g., camel-cased words), and link properties of pages (ratio of external links to internal links)
Approach 2: Automatic Classification Approach
  - Text classification, supervised learning (Support Vector Machine model)
  - Training set: APIs from ProgrammableWeb
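
The kind of features named in Approach 1 can be extracted with the standard library alone, as in the sketch below. The concrete feature set, the camel-case pattern, and the names `page_features` and `_PageScanner` are assumptions for illustration; the slides do not list the exact features or thresholds. The resulting dictionary could serve as input to hand-written rules or to a classifier.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Lower camel case such as getLocation (assumed pattern).
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b")


class _PageScanner(HTMLParser):
    """Collects the title, the visible text, and link targets of a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        self.text.append(data)


def page_features(url, html):
    scanner = _PageScanner()
    scanner.feed(html)
    host = urlsplit(url).netloc
    # A link is external when it resolves to a different host than the page.
    external = sum(
        1 for h in scanner.hrefs if urlsplit(urljoin(url, h)).netloc != host
    )
    internal = len(scanner.hrefs) - external
    text = " ".join(scanner.text)
    return {
        "title_mentions_api": "api" in scanner.title.lower(),
        "camel_case_words": len(CAMEL_CASE.findall(text)),
        "external_internal_ratio": external / internal if internal else float(external),
    }
```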

Page 19: seekda's Web Service search engine

Unique Service Objects – Web APIs

Create unique identifiers:
  - Again using the provider name (from the Web API homepage)
  - We do not know the service name, so the hash value of the URL is used instead
  - http://seekda.com/providers/<providerName>/<hashValueOfURL>

But: human confirmation is still needed to be sure
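
A sketch of the hash-based identifier follows. The slides do not name the hash function, so SHA-1 is an assumption here, as is the function name `web_api_id`.

```python
import hashlib


def web_api_id(provider, url):
    """Build <base>/<providerName>/<hashValueOfURL> for a Web API page.

    SHA-1 is an assumed choice; any stable hash of the URL would do.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"http://seekda.com/providers/{provider}/{digest}"
```

Because the hash depends only on the URL, crawling the same page again yields the same identifier, while two pages of the same provider get distinct ones.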

Page 20: seekda's Web Service search engine

New Search Engine Prototype

Page 21: seekda's Web Service search engine

Prototype – User Contributions

Web API – yes/no: confirmation from human needed!

Other annotations that help improve the search for Web Services

  - Categories
  - Tags
  - Natural Language descriptions
  - Cost: Free or paid service

Page 22: seekda's Web Service search engine

Problem - User Contribution

Problem: Users/developers don’t contribute enough
  - Hard to motivate them to provide annotations
  - Community recognition or peer respect not enough
Solution: crowdsourcing the annotations, pay people to provide annotations
  - Use Amazon Mechanical Turk
  - Bootstrap annotations quickly and cheaply

Page 23: seekda's Web Service search engine

Service Annotation Wizard (1/4)

Page 24: seekda's Web Service search engine

Service Annotation Wizard (2/4)

Page 25: seekda's Web Service search engine

Service Annotation Wizard (3/4)

Page 26: seekda's Web Service search engine

Service Annotation Wizard (4/4)

Page 27: seekda's Web Service search engine

Amazon Mechanical Turk – Iteration 1

Annotation Wizard
  - Web API Yes/No
  - Assign a category
  - Assign tags
  - Provide a natural language description
  - Determine whether page is documentation, pricing or listing
  - Rate the service

Number of submissions   70
Reward per task         $0.10
Restrictions            none

Page 28: seekda's Web Service search engine

Amazon Mechanical Turk – Iteration 1

Results
  - 21 APIs correctly identified as APIs
  - 28 Web documents (non-APIs) correctly identified as non-APIs
  - 49/70 correctly identified (70% accuracy)
  - Average task completion time: 2:20 min

But, only:
  - 4 well-done & complete annotations
  - 8 acceptable annotations (incomplete)

Page 29: seekda's Web Service search engine

Amazon Mechanical Turk – Iterations 2 & 3

Annotation Wizard
  - Removed page type identification & service rating
  - For a task to be accepted:
      - At least one category must be assigned
      - At least 2 tags must be provided
      - A meaningful description must be provided

                        Iteration 2   Iteration 3
Number of submissions   100           150
Reward per task         $0.20         $0.20
Restrictions            yes           yes
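
The acceptance rules for iterations 2 and 3 can be expressed as a simple check, sketched below. The function name `accept_submission` is hypothetical. The slides do not say how a "meaningful" description was judged, so the minimum word count is a stand-in assumption for that judgment.

```python
def accept_submission(categories, tags, description, min_description_words=5):
    """Apply the three acceptance rules to one annotation submission.

    `min_description_words` is an assumed proxy for "meaningful".
    """
    has_category = len(categories) >= 1
    # Count distinct, non-empty tags, ignoring case and surrounding spaces.
    distinct_tags = {t.strip().lower() for t in tags if t.strip()}
    has_tags = len(distinct_tags) >= 2
    meaningful = len(description.split()) >= min_description_words
    return has_category and has_tags and meaningful
```

For example, a submission with one category, the tags "ip" and "geolocation", and a one-sentence description passes, while a submission whose two tags differ only in case does not.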

Page 30: seekda's Web Service search engine

Amazon Mechanical Turk – Iterations 2 & 3

Results of iterations 2 & 3:
  - Ca. 80% of documents correctly identified
  - Very satisfying annotations
  - Average completion time: 2:36 min

Page 31: seekda's Web Service search engine

Amazon Mechanical Turk – Survey

48 survey submissions
  - Female 18, Male 30
  - Most popular origins: India (27) and USA (9)
  - Popular age groups:
      - 15-22 (12)
      - 23-30 (18)
      - 31-50 (16)
  - Most of them worked in some IT profession
      - Provided best quality annotations

Page 32: seekda's Web Service search engine

Amazon Mechanical Turk

Recommendations for further improvement:
  - Improve task description, especially ‘what is a Web API’
  - Better examples (e.g., hinting what makes a false page false)
  - Allow assignment of multiple categories
  - Restrict to workers in IT professions?

Conclusion:
  - Very positive results: a good way to get quality annotations
  - Results will help provide better search experience to users
  - Results can be used as positive set for automatic classification

Page 33: seekda's Web Service search engine

Service Ontologies (1/2)

Page 34: seekda's Web Service search engine

Service Ontologies (2/2)

http://www.service-finder.eu/ontologies/ServiceCategories

Page 35: seekda's Web Service search engine

Questions?
