

MASHING UP THE DEEP WEB
Research in Progress

Thomas Hornung, Kai Simon and Georg Lausen
Institut für Informatik, Universität Freiburg, Georges-Köhler-Allee, Gebäude 51, 79110 Freiburg i.Br., Germany

{hornungt, ksimon, lausen}@informatik.uni-freiburg.de

Keywords: Assisted Mashup Generation, Deep Web, Information Integration.

Abstract: Deep Web (DW) sources offer a wealth of structured, high-quality data, which is hidden behind human-centric user interfaces. Mashups, the combination of data from different Web services with formally defined query interfaces (QIs), are very popular today. If it were possible to use DW sources as QIs, a whole new set of data services would be feasible. We present in this paper a framework that enables non-expert users to convert DW sources into machine-processable QIs. In the next step these QIs can be used to build a mashup graph, where each vertex represents a QI and edges organize the data flow between the QIs. To reduce the modeling time and increase the likelihood of meaningful combinations, the user is assisted by a recommendation function during mashup modeling. Finally, an execution strategy is proposed that queries the most likely value combinations for each QI in parallel.

1 INTRODUCTION

Aggregating information about a specific topic on the Web can easily turn into a frustrating and time-consuming task when the relevant data is spread over multiple Web sites. Even in a simple scenario, if we try to answer the query "What are the best-rated movies which are currently in theaters nearby?", it might be necessary to first query one Web site which returns a list of all movies and then iteratively access another site to find the ratings for each movie. One characteristic that underlies many such ad hoc queries is the dynamic nature of their results, e.g. information about movies that were shown last week is of no avail. Additionally it is often necessary to fill out HTML forms to navigate to the result pages which contain the desired information.

The part of the Web that exhibits the abovementioned properties is often referred to as the Deep or Hidden Web (DW) (Raghavan and Garcia-Molina, 2001). Due to its exponential growth and great subject diversity, as recently reported in (He et al., 2007), it is an excellent source of high-quality information. This sparked several proposals for DW query engines, which either support a (vertical) search confined to a specific area with a unified query interface, e.g. (He et al., 2005) and (Chang et al., 2005), or offer ad hoc query capabilities over multiple DW sources such as (Davulcu et al., 1999).

Although these proposals are quite elaborate and powerful, they either constrain the possible combinations of DW sources or are based on a complex architecture which needs to be administrated. For instance, domain experts might be necessary to define the salient concepts for a specific area or to provide wrappers for data extraction from result pages. Hence, a framework which allows non-expert users to administer the system, while retaining the possibility to pose complex queries and solve real-life information aggregation problems in an intuitive fashion, would be beneficial. A promising approach for this could be mashup tools such as Yahoo Pipes1, which offer a graphical interface to arrange different Web query interfaces (QIs) in a graph, where nodes represent QIs and arcs model the data flow. This enables users with little or no programming skills to intuitively aggregate information in an ad hoc fashion. Unfortunately, mashups are nowadays based on a fixed set of pre-defined Web QIs that return structured data with a formal, known semantics. Therefore users have to rely on content providers to supply these programmable interfaces, which unnecessarily limits the

1 http://pipes.yahoo.com/pipes/


number of combinable data sources. In this paper we propose a framework which enables non-expert users to generate mashups based on DW sources. All aspects of the mashup lifecycle are supported: starting with the acquisition of new sources, via the assisted generation of mashup graphs, through to the final execution.

1.1 Challenges

As our approach is focused on DW sources that are designed for human visitors, neither the fully automated access to sources nor their combination is trivial. In particular, we have to cope with the following issues:

• Form interaction: In order to fill out the respective form fields, the user input has to be matched to a legal combination of input element assignments. We assume in this context a single-stage interaction, i.e. the form is filled with meaningful combinations and submitted, which directly leads to the result page.

• Data record extraction and labeling: Each result page usually contains multiple data records which cluster related information, similar to a row in a labeled table. These data records need to be identified, extracted and labeled correctly. For this purpose we use the fully automated Web data extraction tool ViPER (Simon and Lausen, 2005). The interested reader is referred to (Laender et al., 2002) for a brief survey of other existing tools and their capabilities.

• Fuzzy result lists: A formally defined QI always returns exact results, i.e. if more than one result is returned, all results are known to be equally relevant. Deep Web sources, however, normally return a list of results that match the input criteria to some extent, often ranked by relevance.

• Data cleaning: For each column of a data record the correct datatype has to be determined and the data has to be transformed into a canonical representation, e.g. to bridge different representations of numbers. Additionally it might be necessary to convert the data to another reference system, for instance another currency.

• Assisted mashup generation: Manually combining different QIs into a mashup graph is only feasible in a small-world scenario, because the user easily loses track of possible and meaningful combinations. Therefore it is of paramount importance to assist the user during the generation phase.

• Combinatorial explosion: As mentioned above, DW sources normally return result lists. This can lead to a combinatorial explosion in possible value combinations. E.g. let source Q3 be the sink in a mashup graph with two incoming data edges from sources Q1 and Q2. For an exhaustive search we would have to query Q3 with |Q1| ∗ |Q2| value combinations2, or in the general case, for a data mashup graph with one sink and n incoming data edges, Π_{i=1}^{n} |Qi| value combinations need to be considered. Moreover, a typical data mashup graph would more likely have multiple levels, as the one shown in Figure 1, which means that the combinations additionally multiply along each path as well.
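To make the blow-up concrete, the following is a minimal sketch of the exhaustive-search count for a single sink; the source names and result-list sizes are hypothetical, not taken from the paper.

```python
# Sketch: how value combinations multiply at a sink QI
# (illustrative; result-list sizes are hypothetical).
from math import prod

# Hypothetical number of results returned by each input source.
results = {"Q1": 4, "Q2": 3}

# A sink Q3 fed by Q1 and Q2 must, for an exhaustive search,
# be queried with |Q1| * |Q2| value combinations.
combinations_for_sink = prod(results[q] for q in ("Q1", "Q2"))
```

For multi-level graphs the same product is taken along each path, which is why an exhaustive strategy quickly becomes infeasible.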

1.2 Contributions

We propose a DW-based mashup framework called FireSearch, which:

• Supports the semi-automatic acquisition of new DW sources (Section 2),

• Assists the user in combining these sources into a mashup graph based on a recommendation function (Section 3),

• And finally executes mashup graphs with special consideration for the limiting factors of an online Web scenario (Section 4).

The main components of the framework have been implemented as an extension for the Firefox3 browser and tested on Linux and Windows platforms.

1.3 Running Example

When planning to buy an electronic device, it is often desirable to aggregate information from several trustworthy sources considering different aspects. It might for instance not be advisable to buy the device at the store that offers the cheapest price if the customer service is disappointing. Therefore the user wants to define a mashup graph that assembles the necessary information based on the following DW sources:

• Source Q1 is the official Web site of a trustworthy magazine, which regularly performs extensive tests of new electronic devices and publishes the results on the Web site,

• Source Q2 is a searchable, community-driven Web portal with user experience-based ratings and reviews of Web retailers,

2 Here |Qi| denotes the number of results of source Qi.
3 http://www.mozilla.com/en-US/firefox/


Figure 1: Data mashup graph that collects data about TV sets.

• Source Q3 is an online database of Web retailers and their addresses, searchable by the name of the store,

• Source Q4 is a route and map service, and

• Source Q5 is a price comparison service which searches multiple stores for the cheapest price of electronic devices.

Figure 1 shows a mashup graph that yields the desired results. The user is presented with an interface where she can enter the necessary data that is needed to initialize the mashup. First Q1 is accessed to find the best-rated TV sets, and then for each TV set the cheapest stores are determined by Q5. Afterwards Q2 and Q3 are queried in parallel to obtain reviews and addresses for each store, and finally Q4 computes the distance between the initially entered address and each store. As a result the user is presented with a tabular view of the aggregated data and can now decide which TV set meets her desired criteria.

2 ACQUISITION OF NEW DEEP WEB SOURCES

Because our framework is implemented as an extension to the Firefox browser, new DW sources can be integrated while surfing the Web, similar to PiggyBank (Huynh et al., 2007). But unlike PiggyBank we consider dynamic Web pages that can only be accessed by filling out HTML forms. To transform such a DW source into a machine-processable QI, we have to identify the implicitly associated signature, i.e. the labels and datatypes of the input arguments of the form and the labels and datatypes of the data records that are buried in the result page. Additionally each signature has to be associated with a configuration that assures that the correct form fields are filled in and that the data records in the result page are extracted and labeled correctly. Because our framework is geared towards non-expert users, the technical details of the configuration generation are transparent. It is sufficient to label the relevant input form field elements and the column headers of a tabular representation of the result page.

2.1 User Vocabulary

The labeling of DW sources is inspired by the idea of social bookmarking: each user has a personal, evolving vocabulary of tags. Here a tag is the combination of a string label with an XML datatype (Biron and Malhotra, 2004). The user can generate new tags at any time during the labeling process and can thereby successively build up her own terminology. The associated datatype is used for data cleaning purposes and to check that only meaningful comparison operators are used in the final mashup. As a means to organize the vocabulary the user can specify relationships between different tags, if she wants. For this purpose we have identified three types of meaningful relationships thus far, called tag rules:

1. Equality: Two tags refer to the same concept (indicated by ≡). If the tags refer to similar concepts but are given with respect to different reference systems, e.g. different currencies, then the user can register conversion functions to assure compatibility,

2. Compound tag: Two or more tags that are concatenated (indicated by +) are equal to another concept,

3. Subtag: One tag is more specific than another tag (indicated by ⊑).

Table 1 illustrates the abovementioned tag rules with an example. The system assures that the user can only specify rules that are datatype compatible, to encourage meaningful rules.
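A tag vocabulary with datatype-checked rules could be sketched as follows; the class and method names are our own illustration, not an API prescribed by the paper.

```python
# Sketch of a user vocabulary with datatype-checked tag rules
# (class and method names are hypothetical).

class Vocabulary:
    def __init__(self):
        self.tags = {}    # label -> XML datatype
        self.equal = []   # pairs of equal tags (tag1 == tag2)
        self.subtag = []  # (more specific, more general) pairs

    def add_tag(self, label, datatype):
        self.tags[label] = datatype

    def _check(self, a, b):
        # Only datatype-compatible rules may be registered.
        if self.tags[a] != self.tags[b]:
            raise ValueError("tags are not datatype compatible")

    def add_equality(self, a, b):
        self._check(a, b)
        self.equal.append((a, b))

    def add_subtag(self, specific, general):
        self._check(specific, general)
        self.subtag.append((specific, general))

vocab = Vocabulary()
vocab.add_tag("film", "xsd:string")
vocab.add_tag("movie", "xsd:string")
vocab.add_tag("posterPicture", "xsd:anyURI")
vocab.add_tag("picture", "xsd:anyURI")
vocab.add_equality("film", "movie")           # rule 1 of Table 1
vocab.add_subtag("posterPicture", "picture")  # rule 4 of Table 1
```

Registering a rule between, say, an xsd:string tag and an xsd:anyURI tag would raise an error, mirroring the compatibility check described above.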

2.2 Deep Web Query Interfaces

To enable easy and accurate labeling of input arguments, the relevant tags can be dragged onto the elements of the selected form. When the user submits the form, her label assignments and form submission related information, i.e. the POST or GET string, are captured. As a result the system can now interact with the Web source and delegate values for the input arguments appropriately. In the next step, the result page is analyzed and converted into a tabular representation. The analysis of the result page is done by the fully automatic data extraction system ViPER (Simon and Lausen, 2005). ViPER suggests identified structured data regions with decreasing importance to the user

WEBIST 2008 - International Conference on Web Information Systems and Technologies


Table 1: Supported tag rules.

#  Relationship            Example
1  Equality                film ≡ movie
2  Equality and Function   price-eur ≡ convert(price-usd, price-eur)
3  Compound tag            firstName + familyName ≡ name
4  Subtag                  posterPicture ⊑ picture

Figure 2: User-defined vocabulary, depicted as tag cloud.

based on visual information. In general the foremost recommendation meets the content of interest and thus is suggested by default. On the other hand it is also possible to opt for a different structured region if desired. Regardless of the selection, the extraction system always tries to convert the structured data into a tabular representation. This rearrangement makes it possible to automatically clean the data and serves as a comfortable representation for labeling. Again, the user can label the data by dragging tags from her personal vocabulary as depicted in Figure 24; this time onto the columns of the resulting table consisting of extracted instances. The picture at the top shows an excerpt of the original Web page and the tabular representation at the bottom shows the relevant part of the ViPER rendition of this page. The arrows indicate the tags the user has dragged to the column headers. Having in mind that the data originates from a backend database, the extraction and cleaning process can be seen as reverse engineering on the basis of materialized database views published in HTML. For the data cleaning process the rule-based approach presented in (Simon et al., 2006), which allows fine-tuning the extraction and cleaning of data from structured Web pages, is used. For the basic datatypes, such as integer and double, a heuristics-based standard parser is provided. The user additionally has the chance to implement and register her own parsers, e.g. a parser specifically for addresses, which can be arbitrarily sophisticated. The system then decides at runtime which parser is used based on the annotated datatype.

4 The vocabulary is depicted as tag cloud (Hassan-Montero and Herrero-Solana, 2006) based on the frequency of the used tags.

The assigned labels constitute the functional characteristics of a QI. We additionally consider two non-functional properties for each QI: the average response time, denoted tavg, and the maximum number of possible parallel invocations of the QI, denoted k. We can determine tavg by measuring the response times of a QI over the last N uses and then taking the arithmetic mean, i.e. tavg := (1/N) · Σ_{i=1}^{N} t_i. Here, t1 is the response time of the last invocation, t2 the response time of the invocation before that, and so on, and N denotes the size of the considered execution history, e.g. ten. We initially set tavg := ∞. Since each QI is essentially a relational DB with a form-based query interface, which is designed to handle requests in parallel, we can invoke a QI multiple times with different input parameters in parallel. Obviously, the number of parallel executions needs to be limited by an upper bound, because otherwise we would perform a Denial of Service attack on the Web source. To assure this, we set k := 3 for all QIs for now, but we are working on a heuristic to update k to reflect the capabilities of the underlying QI more faithfully. In the remainder of the paper the following more formal definition of a QI is used:

Definition 1 (Query Interface (QI)). A query interface Qi is the quintuple (I, O, k, tavg, d), where I is the set of tags that have been dragged to input arguments (the input signature), O is the set of tags that have been dragged to columns in the tabular representation of the result page (the output signature), k is the maximum number of parallel invocations, tavg is the average response time, and d is a short description of the QI.
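The quintuple of Definition 1 and the rolling average of response times can be sketched as follows; the field and method names are our own, and only the values stated in the text (k := 3, tavg initially ∞, history size N = 10) are taken from the paper.

```python
# Sketch of the QI quintuple (I, O, k, tavg, d) with a rolling
# average response time over the last N invocations.
from dataclasses import dataclass, field

N = 10  # size of the considered execution history

@dataclass
class QueryInterface:
    I: set                         # input signature (tags on form fields)
    O: set                         # output signature (tags on result columns)
    k: int = 3                     # max parallel invocations
    tavg: float = float("inf")     # initially unknown, so set to infinity
    d: str = ""                    # short description
    history: list = field(default_factory=list)

    def record(self, t):
        """Record a response time; tavg := (1/N) * sum of last N times."""
        self.history = ([t] + self.history)[:N]
        self.tavg = sum(self.history) / len(self.history)

qi = QueryInterface(I={"product-name"}, O={"store-name", "price-eur"})
qi.record(2.0)
qi.record(4.0)
```

After the two recorded invocations, qi.tavg is the arithmetic mean 3.0, while qi.k stays at the conservative default of 3 parallel requests.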

The provided input arguments of a QI are used to fill out the form, which is then submitted. The data records contained in the result page are then extracted, cleaned and returned as RDF triples (Manola and Miller, 2004). RDF is used as data format because data can be represented in an object-centered fashion and missing or multiple attributes are supported. The integration of data from different sources, explained in Section 4, is another place where the properties of


RDF are convenient. Generally, an RDF graph is defined as a set of triples oidi := (si, pi, oi), where the subject si stands in relation pi with object oi. In the following we use ID(si) = {oid1, . . . , oidn} to denote the set of triples where the subject is identical to si, akin to the definition of an individual proposed in (Wang et al., 2005). Similarly, π2(ID(si)) := {pi | (si, pi, oi) ∈ ID(si)} is the set of all RDF properties of an individual. In the remainder of the paper, we use the notation πi(R) to project to the i-th position of an n-ary relation R, where 1 ≤ i ≤ n, and the terms tag and property will be used interchangeably, either to identify an item in the user's vocabulary or to denote its counterpart in the RDF result graph.
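The operators ID(s) and π2 can be sketched over a plain set of (subject, property, object) tuples; the triple values below are illustrative.

```python
# Sketch of ID(s) and the property projection π2 over an RDF-style
# triple set, using plain tuples (subject, property, object).

triples = {
    ("s1", "product-name", "Sharp Aquos"),
    ("s1", "product-rating", 80.0),
    ("s2", "product-name", "Generic TV"),
}

def ID(s, graph):
    """All triples whose subject is s: the 'individual' of s."""
    return {t for t in graph if t[0] == s}

def pi2(individual):
    """pi2(ID(s)): the set of all properties of an individual."""
    return {p for (_, p, _) in individual}
```

For the sample graph, pi2(ID("s1", triples)) yields the property set {"product-name", "product-rating"}, which is exactly the view the recommendation function in Section 3 operates on.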

3 ASSISTED MASHUP GENERATION

The previous section has illustrated how new DW sources can be integrated and how QIs are formally defined. We can use this information to assist users in generating data mashup graphs without writing a single line of code. The basic idea is that a subset T of the set of all used tags T can be interpreted as properties of π2(ID(si)), analogously to the Universal Relation paradigm presented in (Maier et al., 1984). In fact, (Davulcu et al., 1999) used the same approach to describe a user query, with the difference that the attributes were defined by the system, whereas in our scenario each user can utilize her own vocabulary. This alleviates the unique role assumption5 in our case to some extent, because the user has a clear conception of the meaning of the tag. Additionally, since we assist the user in incrementally building a mashup graph, she can further decide on the intended meaning of the query by selecting appropriate sources. Figure 3 illustrates the mashup generation process. The user first drags the desired tags into the goal tags pane (shown at the bottom). Here she can additionally specify the desired constraints, e.g. that the price should be smaller than 450 EUR. Based on these goal tags, the system assists her in choosing QIs that can contribute to the desired results. By clicking on each start tag the system recommends relevant input QIs. She can select the ones she likes and then iteratively refine the mashup by invoking the recommendation function for each input QI again. When she is satisfied with the generated data mashup graph, the system collects all missing input arguments. These constitute the global input for the data mashup and are shown

5 The name of a tag unambiguously determines a concept in the real world.

Figure 3: Excerpt of a result specification for an electronic device.

Figure 4: Ranked list of proposed QIs, if the recommendation system is invoked for Q5.

in the box above the mashup graph in Figure 1. After finishing the mashup generation, the user can store the generated mashup graph for future use as well. Each time the user invokes the recommendation system, a ranked list of the best matching QIs is computed and shown, based on the coverage of input arguments and the utilization frequency in the past (cf. Figure 4). More formally, let Qselected denote the selected QI. Thereby, the rank of each QI with respect to Qselected can be computed as follows:

Definition 2 (QI Rank). R(Qi) := w1 · coverage(Qi) + w2 · utilization(Qi), where:

• coverage(Qi) := |ext(π2(Qi)) ∩ π1(Qselected)| / |π1(Qselected)|; ext(T) expands the original set of tags T with all tags that are either equal or more specific with respect to the user's tag rules,

• utilization(Qi) := f(Qi) / ftotal; f(Qi) is the number of times Qi has been used in data mashups before, and ftotal is the total number of used QIs. If ftotal = 0, i.e. the first mashup graph is generated, f(Qi)/ftotal := 1,

• Σ_{i=1}^{n} wi = 1, where n is the number of available QIs. Intuitively, the weights wi balance the influence of coverage vs. utilization frequency of a QI,

• Since coverage(Qi) ∈ [0,1] and utilization(Qi) ∈ [0,1], given the constraints on the weights above, we can assure that R(Qi) ∈ [0,1], where 1.0 indicates a perfect match and 0.0 a total mismatch.


We additionally distinguish between two special cases: if the rank function is computed for a start tag Ti, we define π1(Qselected) := {Ti}, and if two QIs have the same rank, they are re-ordered with respect to their average response time tavg.

The coverage criterion favors QIs that can provide a large portion of the needed input arguments. The utilization criterion captures the likelihood of a QI to be relevant without regard for the signature. Therefore it is intended to provide an additional hint when choosing between different QIs that match the input signature to some extent. Thus, w1 should be chosen such that w1 ≫ w2, e.g. w1 = 0.8 and w2 = 0.2. Figure 4 shows a use of the recommendation function. The user has already identified two QIs she wants to use and now wants a recommendation for an input to Q5. The system then presents her with a ranked list of matching QIs and connects the QIs appropriately after she has made her choice. To assure a meaningful data flow between the QIs, we require that the final mashup graph is weakly connected, i.e. replacing all directed edges with undirected edges results in a connected (undirected) graph.
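The rank of Definition 2 can be sketched as a small function; the weights follow the example values above (w1 = 0.8, w2 = 0.2), the tag expansion ext() is stubbed by the identity function, and all tag sets and usage counts are hypothetical.

```python
# Sketch of the QI rank R(Qi) = w1*coverage(Qi) + w2*utilization(Qi)
# from Definition 2 (weights and tag sets here are example values).

W1, W2 = 0.8, 0.2  # coverage should dominate: w1 >> w2, w1 + w2 = 1

def coverage(qi_output_tags, selected_input_tags, ext=lambda T: T):
    # ext() would expand tags via the user's equality/subtag rules;
    # the identity function stands in for it in this sketch.
    if not selected_input_tags:
        return 0.0
    hit = ext(qi_output_tags) & selected_input_tags
    return len(hit) / len(selected_input_tags)

def utilization(f_qi, f_total):
    # By definition, utilization is 1 when no QI has been used yet.
    return 1.0 if f_total == 0 else f_qi / f_total

def rank(qi_output_tags, selected_input_tags, f_qi, f_total):
    return (W1 * coverage(qi_output_tags, selected_input_tags)
            + W2 * utilization(f_qi, f_total))

# A QI whose outputs cover the single needed input tag, queried
# before any mashup has been built (f_total = 0):
r = rank({"store-name", "price-eur"}, {"store-name"}, f_qi=0, f_total=0)
```

With full coverage and the first-mashup utilization default, r evaluates to 1.0, the perfect-match end of the [0, 1] range.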

4 MASHUP EXECUTION

The mashup graph determines the order in which the QIs are accessed. All QIs whose input vertices have either already been accessed or that do not have an input vertex can be queried in parallel. The access to a QI consists of two phases: first the valid input combinations need to be determined based on the input vertices, and afterwards the output values of the QI have to be integrated.
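The level-by-level order just described can be sketched as a simple topological grouping; the edge set below corresponds to the running example of Figure 1, and the function name is our own.

```python
# Sketch: group QIs into levels that can be queried in parallel.
# Edges point from producer to consumer (graph of Figure 1).

edges = {
    "Q1": ["Q5"],
    "Q5": ["Q2", "Q3"],
    "Q3": ["Q4"],
}

def execution_levels(edges):
    nodes = set(edges) | {c for cs in edges.values() for c in cs}
    preds = {n: set() for n in nodes}
    for producer, consumers in edges.items():
        for c in consumers:
            preds[c].add(producer)
    done, levels = set(), []
    while len(done) < len(nodes):
        # A QI is ready once all of its input vertices have been accessed.
        ready = sorted(n for n in nodes - done if preds[n] <= done)
        levels.append(ready)
        done |= set(ready)
    return levels

levels = execution_levels(edges)
```

For the example graph this yields the levels [Q1], [Q5], [Q2, Q3], [Q4]: Q2 and Q3 are queried in parallel, matching the execution described in Section 1.3.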

4.1 Computing Query Combinations

As mentioned above, it is not advisable to query all possible value combinations. Therefore we need a way to identify the top m combinations which are most likely to yield relevant results. (Raghavan and Garcia-Molina, 2001) presented three different ranking value functions for improving submission efficiency in a Deep Web crawler scenario. They incrementally collected possible value assignments for form elements and assigned weights to them. A weight is increased each time a value assignment is used successfully in a form submission and decreased otherwise. DW sources usually return a ranked list of results,

Input QIs. Results are identified via their rankings:

• Q1 = {1, 2, 3, 4}
• Q2 = {1, 2, 3}
• Q3 = {1, 2}

Resulting combinations:

#  Roverall  R1  R2  R3
1      3      1   1   1
2      4      1   1   2
3      4      1   2   1
4      4      2   1   1

Figure 5: Query combination computation for a QI with three different input QIs where m = 4.

which we interpret as an assigned weight6. Therefore we can similarly compute a rank for each value combination based on those weights. Thus R_overall := Σ_{i=1}^{n} R_i is the overall ranking, where R_i is the ranking of a value from the i-th DW source. Since the ranks of the input QIs usually depend on their predecessors as well, the notion of accumulated ranks is introduced in the next section, which considers the access history of each QI. Figure 5 illustrates the computation for three QIs which each return a different number of results7. In this example m is four and therefore we would query the first three combinations in parallel, since we decided to have k = 3 in Section 2.2, and afterwards the fourth combination.

4.2 Data Integration

The final goal of the integration process is to align the triples that were generated during query time in such a way that each set π2(ID(si)) contains all required goal tags that were initially specified by the user. Hence, if multiple Web QIs contribute properties to the overall result, as in this scenario, different sets ID(si) need to be merged, split or aggregated in the data reconciliation process. Tables 2 and 3 are used to further illustrate the data reconciliation process. Table 2 shows an excerpt of the state of the global triple store after the first source of the mashup (Q1) depicted in Figure 1 has been accessed and all results have been inserted. Since this was the first QI which has been accessed, no aggregation, splitting or merging operations are necessary.

Table 3 shows the global store after the next QI (Q5) has been accessed. The original triple set ID(s1)

6 If no ranking information is provided, we assign to each value the rank 1.

7 Note that in a real-life scenario the accumulated ranks are usually not integers.


Table 2: Triple Store before accessing Q5.

#  S   P               O
1  s1  product-name    "Sharp Aquos"
2  s1  product-rating  80.0
3  s1  source          Q1
4  s1  rank            1
...

Table 3: Triple Store after accessing Q5.

#   S   P               O
1   s1  product-name    Sharp Aquos
2   s1  product-rating  80.0
3   s1  source          Q1
4   s1  rank            1
5   s1  store-name      Amazon
6   s1  price-eur       1120.00
7   s1  rank            1
8   s1  source          Q5
9   s3  product-name    Sharp Aquos
10  s3  product-rating  80.0
11  s3  source          Q1
12  s3  rank            1
13  s3  store-name      Dell
14  s3  price-eur       1420.00
15  s3  source          Q5
16  s3  rank            2
...

has been split in two branches: one with the information about the price for the device at Dell (the triple set ID(s3)) and the other with information about Amazon (the triple set ID(s1)), each aggregated with the related pricing data. This split is important: if all data had been aggregated around s1, the connection between store and price would have been lost. Another aspect shown in Table 3 is that we accumulate the source and ranking information. The source information is used to fill the upcoming QIs with the correct data, and the ranking information is used to estimate the accumulated ranking of a subject, which is used as input for the query combination computation described in Section 4.1. The accumulated rank is the arithmetic mean over all ranks: R_accumulate := (1/n) · Σ_{i=1}^{n} R_i, where n is the number of rankings.

After all branches have finished executing, all output values originating from different QIs in a branch are automatically associated with each other depending on the query history, i.e. the respective input values. Hence in our running example the branches (Q1,Q5,Q2) and (Q1,Q5,Q3,Q4) are already integrated. To correlate the information between these branches we merge the respective sets ID(si) based

on their common properties. Note that this is alwayspossible, because as described in Section 3 we onlyallow weakly connected, directed mashup graphs andtherefore each set ID(si) in a different branch sharesat least on property.Finally the results are selected with an automaticallygenerated SPARQL (Prud’hommeaux and Seaborne,2007) query, where all goal tags occur in the SELECTclause and the user-specified constraints are checkedin the WHERE clause. The integrated results are thenpresented in a tabular representation, as if the user hadjust queried a given RDF store, which already con-tained all relevant information.
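The assembly of such a query from goal tags and user constraints can be sketched as follows; note that the urn:ex: predicate namespace, the variable naming scheme and the helper name are assumptions made purely for illustration:

```python
# Sketch: build a SPARQL SELECT query from goal tags and user constraints.
# The urn:ex: namespace and the variable naming scheme are illustrative only.

def build_sparql(goal_tags, constraints):
    def var(tag):
        # Derive a SPARQL variable name from a goal tag.
        return "?" + tag.replace("-", "_")

    select = " ".join(var(t) for t in goal_tags)
    # One triple pattern per goal tag, all bound to the same subject.
    patterns = "\n  ".join(f"?s <urn:ex:{t}> {var(t)} ." for t in goal_tags)
    # One FILTER expression per user-specified constraint.
    filters = "\n  ".join(f"FILTER ({c})" for c in constraints)
    return f"SELECT {select}\nWHERE {{\n  {patterns}\n  {filters}\n}}"

query = build_sparql(["product-name", "price-eur"], ["?price_eur < 1300"])
```

Every goal tag contributes both a variable in the SELECT clause and a triple pattern in the WHERE clause, while each user constraint becomes a FILTER expression over the corresponding variable.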

5 RELATED WORK

Even nowadays, many frameworks for querying heterogeneous sources present the end user with a predefined query interface or a fixed query language she has to learn in order to pose more complex queries to the system. One of the main drivers behind the idea of Web 2.0 is that users can control their own data. That might be one reason why data-centric mashup tools are such a strong trend at the moment. Here the user is a first-class citizen and can arrange her sources in an intuitive ad hoc fashion. Representative frameworks to build such mashups based on distributed services or structured data are, for instance, YahooPipes8, IBM's QEDwiki9 and MashMaker (Ennals and Garofalakis, 2007). YahooPipes provides an interactive feed aggregator and manipulator to quickly generate one's own mashups based on available XML-related services. Additionally, YahooPipes realizes data aggregation and manipulation by assembling data pipes from structured XML feeds by hand. QEDwiki is similar in spirit and is implemented as a Wiki framework, which offers a rich set of commands with a stronger focus on business users. MashMaker supports the notion of finding information by exploring rather than by writing queries. It allows the user to enrich the content of the currently visited page with data from other Web sites. Our system differs in that we focus on unstructured resources and aim to provide a user with strong automatic mechanisms for data aggregation. As a consequence, we concentrate on the problems arising from Web resource integration, such as form field interaction, automatic content extraction, and data cleaning. Furthermore, we assist the user during the assembly of the mashup with our recommendation service, whereas in the above

8 http://pipes.yahoo.com/pipes/
9 http://services.alphaworks.ibm.com/qedwiki/

WEBIST 2008 - International Conference on Web Information Systems and Technologies


examples the sources have to be manually wired together. In contrast to MashMaker, our idea is not to enrich the content of one Web page but to provide an integrated view of multiple sources to aggregate information about a specific topic.

Since our approach is focused on DW sources, we need to solve the problem of extracting and labeling Web data in a robust manner. (Laender et al., 2002) have shown in a survey the most prominent approaches to Web data extraction. More recent frameworks for this task are Dapper10, which is focused on the extraction of content from Web sites to generate structured feeds. It provides a fully visual and interactive web-based wrapper generation framework which works best on collections of pages. Thresher (Hogue and Karger, 2005), the extraction component of the standalone information management tool Haystack (Karger et al., 2005), likewise provides a simple but still semi-supervised and labor-intensive wrapper11 generation component. Another, more complex supervised wrapper generation tool, where the generation of the wrapper is also done visually and interactively by the user, is Lixto (Baumgartner et al., 2001). Our proposal does not rely on an interactive wrapper generation process, which is feasible only at a small scale where sources can be manually wrapped. Instead, a user only has to label form elements and automatically extracted content with concepts of her own vocabulary. Thus, the integration of new Web resources is totally automated except for the annotation process. This addresses the problem of large-scale information integration, where many tasks have to be automated and wrapper maintenance is an issue.

Another related field are Web search engines, which usually offer a keyword-based or canned interface to DW sources. Recently proposed complete frameworks for querying the Web, and especially the deep Web, are for instance (He et al., 2005) and (Chang et al., 2005). They cover tasks like deep Web resource discovery, schema matching, record linkage and semantic integration. Although there are many similarities between these approaches and our proposed system, they operate in a bottom-up fashion, starting with the Web sources, whereas we model the world from a user-oriented perspective. Additionally, we support an ad hoc combination of heterogeneous data sources, which is not supported by these frameworks.

Because our underlying data model is an adapted version of the Universal Relation assumption, we

10 http://www.dapper.net/
11 A wrapper in this context is only concerned with the extraction of data from Web pages and does not provide any transformation or query capabilities.

share some elements with the architecture proposed in (Davulcu et al., 1999). In particular, the idea to bootstrap the mashup generation process with a set of goal tags is similar to the way queries can be specified in their framework. However, in our approach the user can organize his vocabulary without any dependency on domain experts, and the way queries are built on top of the initial specification is different as well.

Finally, the Semantic Web application Piggy Bank (Huynh et al., 2007) is also focused on the conversion of Web information into semantically enriched content, but likewise requires skilled users to write screen scraper programs as information extraction components. The main idea is that users can share their collected, semantically enriched information. We share with Piggy Bank the idea to semantically enrich the Web and the idea of an omnipresent browser extension accompanying a user during his daily surf experience. In contrast, we do not store the information; instead we are interested in querying fresh information on-the-fly with fast and robust extraction and annotation components.

6 FUTURE WORK

To integrate more sophisticated DW sources which make extensive use of JavaScript, we are currently investigating the use of navigation maps as proposed in (Davulcu et al., 1999). Moreover, this would allow us to incorporate sources with multi-form interactions, where the final result pages are only shown after several interactive user feedback loops.

As a side-effect of our implementation efforts, we were surprised that there seems to be no support for shared property values between different subjects in existing RDF stores. As this can happen easily in our approach when a split occurs, as described in Section 4.2, we are currently implementing an RDF store which is optimized for the needs of our framework.

Finally, the user-centric nature of our approach makes it an excellent candidate for integrating collaborative intelligence or human computation (von Ahn and Dabbish, 2004), i.e. to ease the large-scale annotation of DW sources, which is similar to the way tagging is done in social networks such as del.icio.us12. Currently we are researching the use of machine learning approaches to recommend tags for new DW sources based on the user's labeling behavior in the past.

12 http://del.icio.us/


7 CONCLUSIONS

Many real-life queries can only be answered by combining data from different Web sources. Especially Deep Web (DW) sources offer a vast amount of high-quality and focused information and are therefore good candidates for automatic processing. In this paper we presented a framework which allows to convert these sources into machine-accessible query interfaces (QIs) by tagging the relevant input arguments and output values. Since the framework is geared towards non-expert users, the whole acquisition can be done without writing a single line of code.

In the next step, users can iteratively build a mashup graph in a bottom-up fashion, starting with the desired goal tags. Each vertex in the resulting mashup graph represents a QI, and edges organize the data flow between the QIs. To reduce the modeling time and increase the likelihood of meaningful combinations, the user can invoke a recommender for each vertex, which returns a ranked list of possible new input QIs.

The execution strategy for the thus generated mashup graphs is based on the idea to process the most likely value combinations for each QI in parallel, while avoiding to access a particular DW source too often in a given time frame. Although a modular architecture based on an analysis of the main challenges has been proposed, more experimental work needs to be done to evaluate the practicability of our framework and to fine-tune the QI recommender.

REFERENCES

Baumgartner, R., Flesca, S., and Gottlob, G. (2001). Visual Web Information Extraction with Lixto. In VLDB, pages 119–128.

Biron, P. V. and Malhotra, A. (2004). XML Schema Part 2: Datatypes Second Edition. http://www.w3.org/TR/xmlschema-2/.

Chang, K. C.-C., He, B., and Zhang, Z. (2005). Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. In CIDR, pages 44–55.

Davulcu, H., Freire, J., Kifer, M., and Ramakrishnan, I. V. (1999). A Layered Architecture for Querying Dynamic Web Content. In SIGMOD Conference, pages 491–502.

Ennals, R. and Garofalakis, M. N. (2007). MashMaker: Mashups for the Masses. In SIGMOD Conference, pages 1116–1118.

Hassan-Montero, Y. and Herrero-Solana, V. (2006). Improving Tag-Clouds as Visual Information Retrieval Interfaces. In InScit2006.

He, B., Patel, M., Zhang, Z., and Chang, K. C.-C. (2007). Accessing the Deep Web. Commun. ACM, 50(5):94–101.

He, H., Meng, W., Yu, C. T., and Wu, Z. (2005). WISE-Integrator: A System for Extracting and Integrating Complex Web Search Interfaces of the Deep Web. In VLDB, pages 1314–1317.

Hogue, A. and Karger, D. R. (2005). Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web. In WWW, pages 86–95.

Huynh, D., Mazzocchi, S., and Karger, D. R. (2007). Piggy Bank: Experience the Semantic Web Inside Your Web Browser. J. Web Sem., 5(1):16–27.

Karger, D. R., Bakshi, K., Huynh, D., Quan, D., and Sinha, V. (2005). Haystack: A General-Purpose Information Management Tool for End Users Based on Semistructured Data. In CIDR, pages 13–26.

Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., and Teixeira, J. S. (2002). A Brief Survey of Web Data Extraction Tools. SIGMOD Record, 31(2):84–93.

Maier, D., Ullman, J. D., and Vardi, M. Y. (1984). On the Foundations of the Universal Relation Model. ACM Trans. Database Syst., 9(2):283–308.

Manola, F. and Miller, E. (2004). RDF Primer. http://www.w3.org/TR/rdf-primer/.

Prud'hommeaux, E. and Seaborne, A. (2007). SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query/.

Raghavan, S. and Garcia-Molina, H. (2001). Crawling the Hidden Web. In VLDB, pages 129–138.

Simon, K., Hornung, T., and Lausen, G. (2006). Learning Rules to Pre-process Web Data for Automatic Integration. In RuleML, pages 107–116.

Simon, K. and Lausen, G. (2005). ViPER: Augmenting Automatic Information Extraction with Visual Perceptions. In CIKM, pages 381–388.

von Ahn, L. and Dabbish, L. (2004). Labeling Images With a Computer Game. In CHI, pages 319–326.

Wang, S.-Y., Guo, Y., Qasem, A., and Heflin, J. (2005). Rapid Benchmarking for Semantic Web Knowledge Base Systems. In ISWC, pages 758–772.
