senbazuru: a prototype spreadsheet database management system

4
Senbazuru: A Prototype Spreadsheet Database Management System Zhe Chen Michael Cafarella Jun Chen Daniel Prevo Junfeng Zhuang University of Michigan Ann Arbor, MI 48109-2121 {chenzhe, michjc, chjun, drpre, jimmyzhu}@umich.edu ABSTRACT Spreadsheets have become a critical data management tool, but they lack explicit relational metadata, making it difficult to join or integrate data across multiple spreadsheets. Be- cause spreadsheet data are widely available on a huge range of topics, a tool that allows easy spreadsheet integration would be hugely beneficial for a variety of users. We demonstrate that Senbazuru, a prototype spreadsheet database management system (SSDBMS), is able to extract relational information from spreadsheets. By doing so, it opens up opportunities for integration among spreadsheets and with other relational sources. Senbazuru allows users to search for relevant spreadsheets in a large corpus, prob- abilistically constructs a relational version of the data, and offers several relational operations over the resulting ex- tracted data (including joins to other spreadsheet data). Our demonstration is available on two clients: a JavaScript- rich Web site and a touch interface on the iPad. During the demo, Senbazuru will allow VLDB participants to search spreadsheets, extract relational data from them, and apply relational operators such as select and join. 1. INTRODUCTION Spreadsheets are an extremely popular data management tool, allowing users to complete a range of data tasks com- monly associated with relational systems: projection, sort- ing, aggregation, and simple ETL (Extract, Transform and Load) jobs. The sheer number and diversity of spread- sheets is remarkable; our Web crawl using the URLs in the ClueWeb09 corpus [4] found 410,554 spreadsheet files scat- tered across 51,252 domains. Moreover, the data is often the result of intense human focus and effort. These spread- sheets should be prime targets for ad-hoc data integration and analysis. For example, Policy expert Fred wants to see whether the strength of the connection between smoking and lung cancer is consistent across all U.S. states. Fred does not have the relevant data at hand, but assumes it is “out there” on the Web somewhere. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th - 30th 2013, Riva del Garda, Trento, Italy. Proceedings of the VLDB Endowment, Vol. 6, No. 12 Copyright 2013 VLDB Endowment 2150-8097/13/10... $ 10.00.. In one sense, the user is fortunate: different branches of the government have collected the data relevant to his task and made them available online, likely via two separate down- loadable spreadsheets, one for the smoking statistics and one for the lung cancer statistics. Unfortunately, finding such data via current search engines is quite tedious. In our case, Fred would need to issue a text query, then re- view all the returned documents before finding the relevant spreadsheets. Moreover, because spreadsheets do not have explicit relational schema, the user cannot benefit from soci- ety’s huge investment in data integration tools that work on relational databases. Instead, Fred will likely have to write custom code to combine the two spreadsheets. Extracting relational data from spreadsheets would enable traditional database management tools, such as data inte- gration systems, to be applied to spreadsheet data. There is a body of recent work that performs database-style op- erations inside a spreadsheet interface [6, 8, 11], but these approaches are not useful for spreadsheet datasets that al- ready exist. Other studies have attempted to transform spreadsheet data into the relational model, making further integration among spreadsheets possible. Some extraction systems require explicit sheet-specific user-provided rules [2, 7], which might yield good results for a single spreadsheet but they are not practical in our setting: the corpus is very large and it is impractical for users to manually transform all of them. Abraham and Erwig[1], and Cunha et al. [5] auto- matically infer some spreadsheet structure, but they cannot process the hierarchical spreadsheets that constitute much of the most interesting data. Figure 1 shows a hierarchi- cal spreadsheet downloaded from the U.S. Census Bureau. 1 This type of spreadsheets is commonplace in the Web: we have found about 32.5% spreadsheets in the Web are hierar- chical, and more than 60% in popular Internet domains [3]. Therefore, to build a useful end-to-end spreadsheet man- agement system, we should be able to extract relational data from a large number of existing spreadsheets, including the hierarchical ones. The main technical challenges are: Extraction – The spreadsheet in Figure 1 shows the smoking rate among different U.S. demographic groups. Each row describes a different smoking rate configura- tion. For example, 13.7 in the value region represents smoking people that are Male, White, and 65 years and over in the attribute region, and it yields an annotat- ing relational tuple at bottom. To construct such a re- lational tuple is not straightforward: the spreadsheet does not explicitly indicate which cells are attributes, 1 The U.S. Census Bureau: http://www.census.gov/

Upload: others

Post on 03-Feb-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Senbazuru: A Prototype Spreadsheet DatabaseManagement System

Zhe Chen Michael Cafarella Jun Chen Daniel Prevo Junfeng ZhuangUniversity of Michigan

Ann Arbor, MI 48109-2121{chenzhe, michjc, chjun, drpre, jimmyzhu}@umich.edu

ABSTRACTSpreadsheets have become a critical data management tool,but they lack explicit relational metadata, making it difficultto join or integrate data across multiple spreadsheets. Be-cause spreadsheet data are widely available on a huge rangeof topics, a tool that allows easy spreadsheet integrationwould be hugely beneficial for a variety of users.

We demonstrate that Senbazuru, a prototype spreadsheetdatabase management system (SSDBMS), is able to extractrelational information from spreadsheets. By doing so, itopens up opportunities for integration among spreadsheetsand with other relational sources. Senbazuru allows usersto search for relevant spreadsheets in a large corpus, prob-abilistically constructs a relational version of the data, andoffers several relational operations over the resulting ex-tracted data (including joins to other spreadsheet data).Our demonstration is available on two clients: a JavaScript-rich Web site and a touch interface on the iPad. During thedemo, Senbazuru will allow VLDB participants to searchspreadsheets, extract relational data from them, and applyrelational operators such as select and join.

1. INTRODUCTIONSpreadsheets are an extremely popular data management

tool, allowing users to complete a range of data tasks com-monly associated with relational systems: projection, sort-ing, aggregation, and simple ETL (Extract, Transform andLoad) jobs. The sheer number and diversity of spread-sheets is remarkable; our Web crawl using the URLs in theClueWeb09 corpus [4] found 410,554 spreadsheet files scat-tered across 51,252 domains. Moreover, the data is oftenthe result of intense human focus and effort. These spread-sheets should be prime targets for ad-hoc data integrationand analysis. For example,

Policy expert Fred wants to see whether the strength of theconnection between smoking and lung cancer is consistentacross all U.S. states. Fred does not have the relevant data athand, but assumes it is “out there” on the Web somewhere.

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee. Articles from this volume were invited to presenttheir results at The 39th International Conference on Very Large Data Bases,August 26th - 30th 2013, Riva del Garda, Trento, Italy.Proceedings of the VLDB Endowment, Vol. 6, No. 12Copyright 2013 VLDB Endowment 2150-8097/13/10... $ 10.00..

In one sense, the user is fortunate: different branches of thegovernment have collected the data relevant to his task andmade them available online, likely via two separate down-loadable spreadsheets, one for the smoking statistics andone for the lung cancer statistics. Unfortunately, findingsuch data via current search engines is quite tedious. Inour case, Fred would need to issue a text query, then re-view all the returned documents before finding the relevantspreadsheets. Moreover, because spreadsheets do not haveexplicit relational schema, the user cannot benefit from soci-ety’s huge investment in data integration tools that work onrelational databases. Instead, Fred will likely have to writecustom code to combine the two spreadsheets.

Extracting relational data from spreadsheets would enabletraditional database management tools, such as data inte-gration systems, to be applied to spreadsheet data. Thereis a body of recent work that performs database-style op-erations inside a spreadsheet interface [6, 8, 11], but theseapproaches are not useful for spreadsheet datasets that al-ready exist. Other studies have attempted to transformspreadsheet data into the relational model, making furtherintegration among spreadsheets possible. Some extractionsystems require explicit sheet-specific user-provided rules [2,7], which might yield good results for a single spreadsheetbut they are not practical in our setting: the corpus is verylarge and it is impractical for users to manually transform allof them. Abraham and Erwig[1], and Cunha et al. [5] auto-matically infer some spreadsheet structure, but they cannotprocess the hierarchical spreadsheets that constitute muchof the most interesting data. Figure 1 shows a hierarchi-cal spreadsheet downloaded from the U.S. Census Bureau.1

This type of spreadsheets is commonplace in the Web: wehave found about 32.5% spreadsheets in the Web are hierar-chical, and more than 60% in popular Internet domains [3].

Therefore, to build a useful end-to-end spreadsheet man-agement system, we should be able to extract relational datafrom a large number of existing spreadsheets, including thehierarchical ones. The main technical challenges are:

• Extraction – The spreadsheet in Figure 1 shows thesmoking rate among different U.S. demographic groups.Each row describes a different smoking rate configura-tion. For example, 13.7 in the value region representssmoking people that are Male, White, and 65 years andover in the attribute region, and it yields an annotat-ing relational tuple at bottom. To construct such a re-lational tuple is not straightforward: the spreadsheetdoes not explicitly indicate which cells are attributes,

1The U.S. Census Bureau: http://www.census.gov/

which cells are values, or which attributes describewhich values. Also, the spreadsheet does not clearlystate how to assemble the attributes into coherent sets.For example, the left-hand attributes in rows 26 and32 belong to the same set (race), but row 25 does not.This grouping information is critical when construct-ing any relational format of the data. In summary,Figure 1 shows a clean, high-quality spreadsheet, re-covering relational data from it requires us to: (1) de-tect attributes and values; (2) identify the hierarchicalstructure of left and top attributes if any exists; (3)generate relational tuples; and (4) assemble the tuplesinto high-quality relational tables.• Repair – Extracting metadata is brittle: even a single

extraction error can yield wrong tuples that will poi-son any downstream applications. For example, if theextractor fails to recognize that row 20 annotates row26, the extracted relational tuples from row 27 to 31will all be incorrect, as none contains any mention ofthe attribute Male. Thus, it should always be possibleto manually repair any automatic extraction errors.

We demonstrate that Senbazuru, a prototype spreadsheetdatabase management system (SSDBMS), is able to extractrelational information from a large number of spreadsheets;doing so opens up opportunities for data integration amongspreadsheets and with other relational sources. Senbazuruconsists of three functional components we view as criticalfor a useful SSDBMS:

• Search – Using a textual search-and-rank interface,the search component allows a user to quickly locaterelevant datasets in a huge Web spreadsheet corpus.

• Extract – The extract component is composed of abackground extraction pipeline that automaticallyobtains relational data from spreadsheets, and a re-pair interface that allows users to manually repairextraction errors. Moreover, this component automat-ically exploits commonalities among errors to proba-bilistically reapply one user fix to other similar mis-takes, thereby minimizing explicit manual interven-tion.

• Query – The query component supports basic rela-tional operators, especially selection and join, whichthe user can apply to spreadsheet-derived relations.

We have a working prototype of Senbazuru, available asa desktop Web application and also as an iPad application.Recall our policy expert Fred, now newly equipped withan SSDBMS. Fred can use Senbazuru’s search componentto rank datasets by relevance to his text query, smoking bystate. He then picks the best hit. The extract component au-tomatically generates a relational table for the spreadsheet,allowing Fred to fix an extraction error with a single drag-and-drop. Fred can also use the query component to selecta portion of the data about smoking by states, and join withanother dataset about cancer by states. Overall, Senbazururequires only a few clicks and two text queries, and yields anew dataset that is exactly what Fred wants.

In the rest of this paper, we will give an overview of thebackend framework of Senbazuru (Section 2), describe howthe end-user can interact with Senbazuru’s interface duringthe demonstration with a walk-through example (Section 3),and conclude with a brief summary of technical problems thesystem addresses (Section 4).

Left

Attr

ibut

es

Top Attributes

Values

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

Male, total White, total 65 years and over

Total smokers 13.71990Relational

Tuple:

Figure 1: A screenshot of a spreadsheet with im-plicit hierarchical attributes in its left-hand column.At bottom, the relational tuple corresponds to thehighlighted value 13.7.

2. SYSTEM FRAMEWORKIn this section, we describe the backend software frame-

work that allows Senbazuru to offer the standard SSDBMSservices: search, extract, and query, as shown in Figure 2.

Search – The search component helps the user to retrieverelevant datasets in a potentially enormous corpus via Websearch-style relevance ranking. Currently, Senbazuru hasindexed more than 1,800 spreadsheets collected from theU.S. Census Bureau. For each spreadsheet in the corpus, theindexer uses the Python xlrd package to extract text fromeach cell, then uses Apache Lucene to index the text. Whena query arrives, the searcher uses the inverted index andTFIDF-style ranking to sort the datasets by relevance. Aswith a standard Web search, the results of search comprisea list of potentially useful objects for the user to examine.

Extract – The extract component consists of a sequenceof pipeline components that convert the data in each spread-sheet into the relational model. Most of the extract takesplace when the corpus of spreadsheets is first added to thesystem, before any query arrives. A small portion of the ex-tract is deferred until the user provides manual error repairs.

1. Frame Finder – First, each raw spreadsheet is sentto the frame finder. This module identifies data frame struc-tures in each spreadsheet. A characteristic data frame layoutis shown in Figure 1: a rectangular value region of numericvalues (outlined with the dashed rectangle), with additionalrectangular attribute regions above (labeled top attribute andoutlined with a solid rectangle) and to the left (labeled leftattribute, similarly outlined). We use a conditional randomfield (CRF) [9] to assign a semantic label to each non-emptyrow in the spreadsheet, thus finding the data frame. Themodule passes the data frame to the next step and discardseverything else in the raw spreadsheet.

2. Hierarchy Extractor – The next stage is the hier-archy extractor, which recovers the attribute hierarchies forthe data frame’s attribute regions. In each attribute region,we want to know which attributes describe which other at-tributes. For example, in Figure 1, the attribute at row 31

SpreadsheetIndex

Searcher Relation Constructor

Tuple Builder

Hierarchy ExtractorIndexer Frame

Finder

Query

SpreadsheetsSearch Extract

Join

Select

Query

Mobile Client

Web Client

Figure 2: The software framework for Senbazuru.

is annotated by attributes at rows 26, 20, and 19. These an-notations are critical in correctly extracting the relationaltuples. Our solution is based on a conditional random field(CRF), in which each random variable represents whetheran attribute pair is parent-child. For example, consider thehierarchy on the left of Figure 1: a variable representing anattribute pair (20, 26) is true while variables for (20, 31)and (24, 25) are false. We define a set of node and edge po-tentials to capture human intuitions about which attributepairs are likely to be parent-child, and also global potentialsor constraints to ensure that the resulting variable assign-ment yields a strict tree structure. Therefore, the output ofthe CRF extraction is a high-quality attribute hierarchy.

The hierarchy extractor includes a small but importantstep that is deferred until query time when a user interactswith the repair interface. When a user observes the hier-archy extractor’s initial output, she has the option to repairthe resulting hierarchy by dragging and dropping attributes,as shown in Figure 3. We translate each repair operationto a set of variables with observed assignments. We alsoadd repair potentials to the CRF to ensure the observedassignments can be probabilistically applied to other simi-lar variables. Every single user repair operation will triggerSenbazuru to re-run the CRF. As a result, the hierarchyextractor makes a single explicit repair action by the userfix many similar extraction errors at once, yielding refinedhierarchies with little user effort.

3. Tuple Builder – The tuple builder is algorithmicallystraightforward, and it generates a relational tuple for eachvalue in the value region, annotated with relevant attributesfrom attribute regions. For example, Figure 1 shows the fullsix-field tuple we want to recover for the outlined value 13.7.

4. Relation Constructor – Finally, the relation con-structor assembles these relational tuples into relational ta-bles. It clusters attributes in different tuples into consistentcolumns and recovers a label for each column. For example,this final step recognizes that attributes Male, total and Fe-male, total should go into the same column, and that the col-umn should be called gender. This clustering model exploitsinformation from the recovered hierarchies as well as exist-ing collections of schema such as Freebase and YAGO [10].Figure 4 (a) shows an example of a recovered relation.

Query – After a user searches for a relevant dataset andrepairs extraction errors, the query component allows her toapply relational operators on the spreadsheet-derived data.In particular, we have implemented select and join. Wewill cover the two relational operators in detail as part ofour interface discussion in Section 3.1.

3. DEMONSTRATIONWe plan to show an end-to-end demonstration of Sen-

bazuru to search, extract, and query spreadsheets. Userscan choose to interact with Senbazuru using either the Webbrowser client or the iPad tablet client. In this section, wefirst describe how to use Senbazuru’s interface, then show awalk-through example.

3.1 User InterfaceThe two client interfaces for Senbazuru are shown in Fig-

ure 4. They allow users to search for data, to view and editthe extract results, and to use query operations.

Search – As with other search-and-rank tools, a usertypes keywords in the search box, and obtains a list of rel-evant spreadsheets. She can then browse results and selectthe most relevant one. For example, in Figure 4 (a), a userenters “smoking” as the search query. She can examine thetop hit’s raw spreadsheet by clicking “Spreadsheet” or checkother relevant spreadsheets by clicking “Next.”

Extract – After selecting the most relevant spreadsheet,a user can use extract to transform the spreadsheet data intoa relational table. She can review the extracted hierarchi-cal metadata by clicking “Data Tree.” Figure 3 shows theinterface for viewing the extracted hierarchies. To repairany extraction errors, she can drag and move a node of thetree from one place to another. After observing the user’srepair action, Senbazuru will automatically re-run extractand display a new tree. For example, as shown on the leftof Figure 3, a user performs a repair by clicking and drag-ging White, total so that it becomes a child of Male, total.Meanwhile, the extract component automatically discoversthat Black, total should also be moved. Thus, the user’s onesingle repair action can trigger multiple fixes, yielding thehierarchy shown on the right. After repairing extraction er-rors, the user can review the generated relations by clicking“Relational Table,” as shown in Figure 4(a).

Query – Users can perform query operations on the ex-tracted relational table. They are not required to write SQLstatements and can apply select and join via the interface:1. Select – The select feature, also called filter, is simi-

lar to executing a selection query on the recovered relation.Clicking “Filter,” a user can use a faceted-search interface tospecify the filter conditions. This interface is automaticallycomposed by Senbazuru. For example, Figure 4 (a) showsthat the user limits the displayed data to smoking statisticsfor people who are Female, Black, and 18 to 24 years old.

2. Join – The join feature allows users to integrate twoarbitrary spreadsheet-derived relations. The user starts by

Before Repair: ''White, total'' After Repair: ''White, total''

Figure 3: Senbazuru recovers the attribute hierar-chy seen in Figure 1, and makes the recovered struc-ture available for manual user repairs.

Search box

"Data Tree": show attribute hierarchies

"Raw Tuples": show relational tuples

"Relational Table": show relational tables

"Next": view the next spreadsheet

"Spreadsheet": show the raw spreadsheet

"Join": integrate with another spreadsheet

"Filter":

(a) (b)

Figure 4: Screenshots of two Senbazuru clients, as a desktop application (a) and an iPad application (b).

applying the search and extract features as described above.Once the user has found a good result, she clicks “Join”and enters a second text query. The query yields a secondranked result list. She chooses a relevant join target fromthis ranked list and obtains a correctly extracted relation.She indicates which columns from each dataset should beused as an equijoin key.

For example, Figure 4 (b) shows a screenshot of the joinprocess on our iPad client. When the user touches a column,Senbazuru highlights it in bright yellow, as shown on the leftof the figure. Meanwhile, Senbazuru faintly highlights a col-umn on the right-hand table to indicate a possible join. Theuser can drag the column, highlighted in bright yellow, fromthe left-hand relation to the right and release it. This actionindicates the join key; Senbazuru executes the appropriatejoin and shows the user the resulting brand-new relation.

3.2 A Walk-through ExampleWe will demonstrate Senbazuru’s workflow through Fred’s

example. Recall that our policy expert Fred wants to findthe relationship between smoking and lung cancer across theU.S. states. Using Senbazuru, Fred can obtain the requesteddata through the following steps:

1. Search for “smoking by state.”

2. Browse the returned spreadsheets and select the mostrelevant one.

3. Click “Data Tree” and check the correctness of theextracted attribute hierarchies.

4. Repair the hierarchies if any extraction errors exist.

5. Click “Data Table” to review the extracted relation.

6. Repeat the process from step 1 to get the relationaltable for the most relevant lung cancer spreadsheet byquerying “lung cancer by state.”

7. Specify join columns to create a new table.

8. Review the result, using the faceted select interface.

In addition to Fred’s example, VLDB attendees are wel-come to try many other interesting queries, such as “em-ployment statistics 2010” and “Michigan GDP.”

4. CONCLUSIONSenbazuru is a prototype spreadsheet management sys-

tem. It searches a large number of Web crawl spreadsheets,and it automatically transforms spreadsheets into relationswhile allowing users to fix the extraction errors effectivelyand efficiently. Finally, it supports selection queries on theresulting relations and join queries to integrate arbitraryspreadsheets. We believe a demo of Senbazuru will be ofgreat interest to the VLDB attendees and the database com-munity as a whole.

5. ACKNOWLEDGMENTSThis project is supported by National Science Foundation

grants IIS-1054913 and IIS-1064606, as well as gifts fromDow Chemical, Yahoo!, and Google.

6. REFERENCES[1] R. Abraham and M. Erwig. Ucheck: A spreadsheet type

checker for end users. J. Vis. Lang. Comput., 18(1):71–95,2007.

[2] Y. Ahmad, T. Antoniu, S. Goldwater, andS. Krishnamurthi. A type system for statically detectingspreadsheet errors. In ASE, pages 174–183, 2003.

[3] Z. Chen and M. Cafarella. Automatic web spreadsheet dataextraction. In VLDB Workshop on Semantic Search overthe Web, Trento, Italy, 2013. ACM.

[4] 2009. ClueWeb09, http://lemurproject.org/clueweb09.php/.

[5] J. Cunha, J. Saraiva, and J. Visser. From spreadsheets torelational databases and back. In PEPM, pages 179–188,2009.

[6] P. J. Guo, S. Kandel, J. M. Hellerstein, and J. Heer.Proactive wrangling: Mixed-initiative end-userprogramming of data transformation scripts. In UIST,pages 65–74, 2011.

[7] V. Hung, B. Benatallah, and R. Saint-Paul.Spreadsheet-based complex data transformation. In CIKM,pages 1749–1754, 2011.

[8] B. Liu and H. Jagadish. A spreadsheet algebra for a directdata manipulation query interface. In ICDE, pages417–428, 2009.

[9] D. Pinto, A. McCallum, X. Wei, and W. B. Croft. Tableextraction using conditional random fields. In SIGIR, pages235–242, 2003.

[10] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a coreof semantic knowledge. In WWW, pages 697–706, 2007.

[11] J. Tyszkiewic. Spreadsheet as a relational database engine.In SIGMOD, pages 195–206, 2010.