tableview: a visual interface for generating preview...

4
TableView: A Visual Interface for Generating Preview Tables of Entity Graphs Sona Hasani University of Texas at Arlington [email protected] Ning Yan * Huawei U.S. R&D Center [email protected] Chengkai Li University of Texas at Arlington [email protected] Abstract—This paper introduces TableView, a tool that helps generate preview tables of massive heterogeneous entity graphs. Given the large number of datasets from many different sources, it is challenging to select entity graphs for a particular need. TableView produces preview tables to offer a compact presen- tation of important entity types and relationships in an entity graph. Users can quickly browse the preview in a limited space instead of spending precious time or resources to fetch and explore the complete dataset that turns out to be inappropriate for their needs. In this paper we present: 1) a detailed description of TableView’s graphical user interface and its various features, 2) a brief description of TableView’s preview scoring measures and algorithms, and 3) the demonstration scenarios we intend to present to the audience. Keywords-entity graph; knowledge graph; usability; summa- rization I. I NTRODUCTION There is a revolutionary increase of gigantic, heterogeneous entity graphs that represent entities and their relationships in various domains. An entity graph is a directed multi- graph where each vertex represents an entity and each edge represents a directed relationship from its source entity to its destination entity. Entity graphs are often represented as Resource Description Framework (RDF) triples. Some of the most salient real-world entity graphs are knowledge bases such as DBpedia [1], YAGO [2], Probase [3], Freebase [4], and Google’s Knowledge Vault [5]. Numerous applications use entity graphs in areas such as search, recommendation systems, business intelligence, and health informatics. An example of a very small entity graph is displayed in Fig. 1. Due to the increasing number of entity graphs from many sources and oftentimes inadequate information available about them, it is extremely challenging to select the one that satisfies a particular need. A data worker may need to tackle many challenges before she can start any real work on an entity graph. While some descriptions may be provided by the data sources, the data worker is not able to get a direct look at an entity graph before fetching it. Moreover, downloading a dataset and loading it into a database can be a daunting task. Although a schema graph is much smaller than the cor- responding entity graph, it is not small enough for easy presentation and quick preview. For instance, there are 190K vertices and 1.6M edges in a snapshot of the film domain of Freebase, and the corresponding schema graph consists of *The work was performed while the author was with UT-Arlington. 50 entity types (vertices) and 136 relationship types (edges). Several studies have suggested to generate a summary of the schema graph, by schema summarization techniques. Some of these methods work on relational and semi-structured data instead of graph data [6], [7], [8], while others produce trees or graphs as output instead of flat tables [8], [9], [10]. The most closely related work to our approach clusters the tables in a database but does not reduce the complexity of database schema [6], [7]. Yan et al. [11] introduced preview tables to assist data workers in obtaining a quick preview of a given entity graph before they decide to spend more time and resources to fetch and investigate the complete graph. They show a few randomly selected sample tuples in each preview table to facilitate a better understanding of the data graph. In order to select the best preview, they proposed a scoring system, including coverage-based and random-walk based scoring methods for entity types and coverage-based and entropy-based scoring methods for relationships. In this paper we present TableView, a tool based on [11] that automatically produces preview tables for entity graphs. Fig. 2 is a possible preview of the entity graph in Fig. 1. It consists of two preview tables. The upper table includes attributes Film, Director and Genres, and the lower table has attributes Film Actor and Award Winners. Film and Film Actor are the key attributes of the generated preview tables, marked by the underlines beneath them. Attributes Director and Genres in the upper table are considered highly related to Film entities. The two tables contain four and two sample tuples, respectively. II. USER I NTERFACE In this section we describe various features implemented in TableView. Fig. 3 presents its graphical user interface which consists of four main segments: configuration panel, preview panel, information panel, and manually generated preview tables. Below we describe these components. Configuration Panel: This panel is designed for setting the parameters of the preview. The user can customize the follow- ing fields. Domain: The user can select among several domains provided by the dataset or pre-processed for the dataset. Key attributes: The user will specify the desired number of key at- tributes which determines the number of tables in the preview. Non-key attributes: The user will specify the number of non-

Upload: others

Post on 08-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: TableView: A Visual Interface for Generating Preview ...ranger.uta.edu/~cli/pubs/2018/tableview-icde18demo-hasani.pdf · coverage-based and random-walk based scoring methods for entity

TableView: A Visual Interface for GeneratingPreview Tables of Entity Graphs

Sona HasaniUniversity of Texas at Arlington

[email protected]

Ning Yan∗

Huawei U.S. R&D [email protected]

Chengkai LiUniversity of Texas at Arlington

[email protected]

Abstract—This paper introduces TableView, a tool that helpsgenerate preview tables of massive heterogeneous entity graphs.Given the large number of datasets from many different sources,it is challenging to select entity graphs for a particular need.TableView produces preview tables to offer a compact presen-tation of important entity types and relationships in an entitygraph. Users can quickly browse the preview in a limited spaceinstead of spending precious time or resources to fetch andexplore the complete dataset that turns out to be inappropriatefor their needs. In this paper we present: 1) a detailed descriptionof TableView’s graphical user interface and its various features,2) a brief description of TableView’s preview scoring measuresand algorithms, and 3) the demonstration scenarios we intend topresent to the audience.

Keywords-entity graph; knowledge graph; usability; summa-rization

I. INTRODUCTION

There is a revolutionary increase of gigantic, heterogeneousentity graphs that represent entities and their relationshipsin various domains. An entity graph is a directed multi-graph where each vertex represents an entity and each edgerepresents a directed relationship from its source entity toits destination entity. Entity graphs are often represented asResource Description Framework (RDF) triples. Some of themost salient real-world entity graphs are knowledge basessuch as DBpedia [1], YAGO [2], Probase [3], Freebase [4],and Google’s Knowledge Vault [5]. Numerous applicationsuse entity graphs in areas such as search, recommendationsystems, business intelligence, and health informatics. Anexample of a very small entity graph is displayed in Fig. 1.

Due to the increasing number of entity graphs from manysources and oftentimes inadequate information available aboutthem, it is extremely challenging to select the one that satisfiesa particular need. A data worker may need to tackle manychallenges before she can start any real work on an entitygraph. While some descriptions may be provided by the datasources, the data worker is not able to get a direct look atan entity graph before fetching it. Moreover, downloading adataset and loading it into a database can be a daunting task.

Although a schema graph is much smaller than the cor-responding entity graph, it is not small enough for easypresentation and quick preview. For instance, there are 190Kvertices and 1.6M edges in a snapshot of the film domainof Freebase, and the corresponding schema graph consists of

*The work was performed while the author was with UT-Arlington.

50 entity types (vertices) and 136 relationship types (edges).Several studies have suggested to generate a summary of theschema graph, by schema summarization techniques. Someof these methods work on relational and semi-structured datainstead of graph data [6], [7], [8], while others produce treesor graphs as output instead of flat tables [8], [9], [10]. Themost closely related work to our approach clusters the tablesin a database but does not reduce the complexity of databaseschema [6], [7].

Yan et al. [11] introduced preview tables to assist dataworkers in obtaining a quick preview of a given entity graphbefore they decide to spend more time and resources to fetchand investigate the complete graph. They show a few randomlyselected sample tuples in each preview table to facilitate abetter understanding of the data graph. In order to selectthe best preview, they proposed a scoring system, includingcoverage-based and random-walk based scoring methods forentity types and coverage-based and entropy-based scoringmethods for relationships. In this paper we present TableView,a tool based on [11] that automatically produces preview tablesfor entity graphs.

Fig. 2 is a possible preview of the entity graph in Fig. 1.It consists of two preview tables. The upper table includesattributes Film, Director and Genres, and the lower table hasattributes Film Actor and Award Winners. Film and FilmActor are the key attributes of the generated preview tables,marked by the underlines beneath them. Attributes Directorand Genres in the upper table are considered highly relatedto Film entities. The two tables contain four and two sampletuples, respectively.

II. USER INTERFACE

In this section we describe various features implemented inTableView. Fig. 3 presents its graphical user interface whichconsists of four main segments: configuration panel, previewpanel, information panel, and manually generated previewtables. Below we describe these components.

Configuration Panel: This panel is designed for setting theparameters of the preview. The user can customize the follow-ing fields. Domain: The user can select among several domainsprovided by the dataset or pre-processed for the dataset. Keyattributes: The user will specify the desired number of key at-tributes which determines the number of tables in the preview.Non-key attributes: The user will specify the number of non-

Page 2: TableView: A Visual Interface for Generating Preview ...ranger.uta.edu/~cli/pubs/2018/tableview-icde18demo-hasani.pdf · coverage-based and random-walk based scoring methods for entity

Fig. 1. An excerpt of an entity graph.

Film Director Genrest1 Men in Black Barry Sonnenfeld {Action Film, Science Fiction}t2 Men in Black II Barry Sonnenfeld {Action Film, Science Fiction}t3 Hancock Peter Berg -t4 I, Robot Alex Proyas {Action Film}

Film Actor Award Winnerst5 Will Smith Saturn Awardt6 Tommy Lee Jones Academy Award

Fig. 2. A preview of the entity graph in Fig. 1.

key attributes in the preview which indicates the total numberof non-key attribute columns of the preview tables. Previewtype: The user can choose their desired type of preview froma drop-down list of three options. A concise preview is theoptimal preview with the specified number of key and non-key attributes. A tight preview is a concise preview where thedistance between every two key attributes of the preview is atmost equal to some specific value specified by the user. Onthe other hand, a diverse preview is a concise preview wherethe distance between every two key attributes in the preview isat least equal to a distance specified by the user. Distance: Ifthe user selects tight or diverse preview as the type, this fieldwill show up and the user can specify the desired distancebetween key attributes in the preview. Number of the records:The user can specify how many sample records she wouldlike to see in each preview table. Attribute scoring: For keyattribute scoring the user can choose between coverage-basedscoring and random-walk based scoring, while for non-keyattribute scoring she can select coverage-based scoring andentropy-based scoring.

Preview Panel: When the configuration parameters are set,the user needs to click the Show Preview button to see theautomatically generated preview in the preview panel. Thispanel has three segments, including preview tables, schemagraph, and summarized schema graph. The user can chooseto see the preview tables and the schema graph side by side.If the user clicks on any column’s header, the correspondingnode or edge in the graph is highlighted and the user can

have a better understanding of the surrounding entities of thatparticular item. The user can hide, display or delete the samplerecords in each preview table. Moreover, each column canbe renamed and non-key attribute columns can be reordered.The user can also rearrange the preview tables by draggingeach table to a desired location. In this panel, the user canchoose to see the complete schema graph, or the summarizedschema graph which is the sub graph of the schema graphcorresponding to the generated preview.Information Panel: When the user clicks on any item in thepreview panel, additional information for that item is displayedin the information panel. For example, when the user clickson a key attribute in a preview table, or a node in a schemagraph, such additional information includes entity type, ID,and number of the entities in the dataset that belong to themarked entity type. Similarly, if the user clicks on a non-keyattribute of a preview table or an edge of a schema graph,additional information such as relationship type, ID, sourceand destination entity types is presented.Manually Generated Preview Tables: TableView enablesdata experts to handcraft their own preview tables and cus-tomize them. Fig. 4 shows the graphical user interface forhandcrafting preview tables. After selecting the dataset, do-main, number of key attributes, and number of non-keyattributes, the user can manually design her own preview.The schema graph of the selected domain and the URL tothe dataset are provided to the user. Based on the selecteddataset and domain, a list of key attributes under the selecteddomain are shown to the user. When the user chooses a keyattribute, the list of relationships to or from that key attributeare presented to her. Each preview table must contain onekey attribute and at least one non-key attribute. Once the userselects a non-key attribute for the chosen key attribute, a newtable is added to the preview with the selected key and non-key attributes. The user can add more non-key attributes to anyof generated preview tables as long as the total number of thenon-key attributed is not over the specified limit. The numbersof remaining key and non-key attributes are updated at eachstep. The user is able to delete the key and non-key attributesfrom the partially generated preview. When the preview isready, the user can export the preview in PDF format.

III. SYSTEM OVERVIEW

A preview is a set of preview tables, where each table hasa key attribute, corresponding to an entity type, and a setof non-key attributes, each corresponding to a relationshipbetween the key attribute and another entity in the entitygraph. There is thus a large space of possible previews for agiven schema graph. The preview is designed to help the usersattain a quick understanding of the entity graph, therefore,it must fit into a limited display space. This desideratum iscaptured by a constraint on the size of the preview, in termsof the numbers of key and non-key attributes. Furthermore, inorder to control how tight or diverse the preview can be anadditional constraint on the pairwise distance between preview

Page 3: TableView: A Visual Interface for Generating Preview ...ranger.uta.edu/~cli/pubs/2018/tableview-icde18demo-hasani.pdf · coverage-based and random-walk based scoring methods for entity

ConfigurationPanel PreviewPanel InformationPanel

PreviewType:

Fig. 3. User interface of TableView.

Fig. 4. User interface for handcrafted preview tables.

tables is considered. We formulate the problem as finding apreview with the highest score among those possible previewssatisfying the size and distance constraints. In the followingwe briefly explain the scoring measures used in TableView.

A. Scoring Measures and Algorithms

The score of a preview is simply aggregated from individualpreview tables’ scores, by summation. The score of each pre-view table is calculated from the product of its key attribute’sscore and the summation of its non-key attributes’ scores.

For key attribute scoring we implemented coverage-basedand random-walk based methods. Given an entity graphGd(Vd, Ed) and its corresponding schema graph Gs(Vs, Es),the key attribute τ of a candidate preview table T correspondsto an entity type, i.e., τ ∈ Vs. The coverage-based scoringmeasure defines the score of τ as the number of entities thatbelong to that type:

Scov(τ) = |{v|v ∈ Vd ∧ v has type τ}|

In order to calculate the random-walk based score of a keyattribute, we consider a random-walk process over a graph Gconverted from the schema graph Gs(Vs, Es), inspired by thePageRank algorithm [12]. The edge between τi and τj in Gis weighted by the number of relationships in the entity graphbetween entities of types τi and τj .

For non-key attributes scoring we implemented coverage-based and entropy-based methods. The coverage-based scoringmeasure defines the score of a non-key attribute γ as thenumber of relationships that belong to that type:

Sτcov(γ) = |{e|e ∈ Ed ∧ e has type γ}|

To calculate the entropy-based score of a non-key attribute,we measure the value of a non-key attribute γ(τ, τ ′) (orγ(τ ′, τ))—the edge between entity types τ and τ ′ in theschema graph—by the entropy of γ (H(γ)) to estimate theamount of information it provides to T .

Sτent(γ) = H(γ) =∑j=1

nj|t.γ|

log(|t.γ|nj

),

where nj is the number of tuples in T that have the same jthattribute value u on non-key attribute γ(τ, τ ′) (or γ(τ ′, τ)).|t.γ| is the number of tuples in T with non-empty values onγ(τ, τ ′) (or γ(τ ′, τ)).

We developed a dynamic programming algorithm for find-ing the optimal concise preview and an Apriori-style algorithmfor finding optimal tight/diverse previews based on thesescoring measures. Interested readers can find more details inYan et al. [11].

B. System Implementation

The algorithms in TableView are implemented in C++ andPython 2.7. The system is hosted on a Dell T100 server with

Page 4: TableView: A Visual Interface for Generating Preview ...ranger.uta.edu/~cli/pubs/2018/tableview-icde18demo-hasani.pdf · coverage-based and random-walk based scoring methods for entity

TABLE ISIZES OF ENTITY/SCHEMA GRAPHS.

Domain # of vertices # of edgesbook 6M / 91 15M / 201film 2M / 63 18M / 136music 27M / 69 187M / 176TV 2M / 59 17M / 177people 3M / 45 17M / 78

Dual Core Xeon E3120 processor, 6MB cache, 4GB RAM,and two 250GB RAID1 SATA hard drives running Ubuntu8.10. We used Django for implementing the web interface andJavascript, HTML, and CSS for parsing the results on the clientside. For graph visualization, we used D3.js.

The entity graph used in our demonstration is a dump ofFreebase on September 28, 2012. The dataset is importedinto a MySQL database. We can use the same approach forgenerating the preview tables for any entity graph that providesthe type information for entities and relationship. In Freebase,the entire entity graph is partitioned into several domains. Wepre-processed five of its domains, including film, book, music,TV, and people. Table I presents the size of each domain.

IV. DEMONSTRATION PLAN

In this section we describe a few scenarios that we intendto present to the audience.

A: Automatically generating preview tablesUsing the following steps, the user will be able to automati-cally generate a preview. A1) Select Freebase from the datasetdrop-down list. A2) Select one of the domains from the domaindrop-down list. A3) Select the number of the key-attribute. A4)Select the number of the non-key attributes. A5) Select thekey-attribute scoring method. A6) Select the non-key attributescoring method. A7) Click the Show Preview button to see theautomatically generated preview in the preview panel.

B: Interacting with the generated previewThe following steps will show the user how to interact withthe generated preview in scenario A. B1) Delete a previewtable. B2) Delete one of the non-key attributes of a previewtable. B3) Refresh the preview to replace the deleted key andnon-key attributes. B4) Delete a sample record from one of thetables in the preview. B5) Hide and show the sample records ina table. B6) Rearrange the tables. B7) Rearrange the non-keyattributes of the tables. B8) Rename non-key attributes.

C: Additional informationThe following steps will show the user how to find thepreview tables in the schema graph and look up the additionalinformation about key attributes and non-key attributes in theschema. C1) Click on the key attribute or non-key attributes ofa preview table and check the highlighted nodes and/or edgesin the schema graph. C2) Click on any column on the previewtables and check the additional information in the informationpanel. C3) Click the Export button to generate a PDF file ofthe preview.

D: Generating a preview manuallyThe following steps will guide the user towards handcraftinga preview for a given schema graph. D1) Select one of thedomains in the domain drop-down list. D2) Select the numbersof key and non-key attributes. D3) Click the Manual Previewbutton. D4) Select a key attribute from the key attribute list.D5) Select non-key attributes from the non-key attribute list.D6) Check the schema graph and the numbers of remainingkey attributes and non-key attributes. D7) Delete one ofthe non-key attributes from the table. D8) Delete the keyattribute of a table. D9) Design and customize the completepreview by selecting the key attributes and correspondingnon-key attributes. D10) Click on each table and check thecorresponding sub-graph in the schema graph. D11) Exportthe preview in PDF format.

In addition to TableView, we implemented the summariza-tion algorithm from Yang et al. [6]. We will demonstrate theschema summary of Freebase generated by their algorithm andcompare it with the preview tables produced by TableView.

ACKNOWLEDGMENTS

The authors have been partially supported by NSF grants1408928, 1719054 and NSF-China grant 61370019. Any opin-ions, findings, and conclusions in this publication are thoseof the authors and do not necessarily reflect the views ofthe funding agencies. We also thank Rudresh Ajgaonkar andAaditya Kulkarni for their contributions.

REFERENCES

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, , and Z. Ives,“DBpedia: A nucleus for a Web of open data,” in ISWC, 2007, pp. 722–735.

[2] F. M. Suchanek, G. Kasneci, and G. Weikum, “YAGO: a core ofsemantic knowledge unifying WordNet and Wikipedia,” in WWW, 2007,pp. 697–706.

[3] W. Wu, H. Li, H. Wang, and K. Q. Zhu, “Probase: a probabilistictaxonomy for text understanding,” in SIGMOD, 2012, pp. 481–492.

[4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase:a collaboratively created graph database for structuring human knowl-edge,” in SIGMOD, 2008, pp. 1247–1250.

[5] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy,T. Strohmann, S. Sun, and W. Zhang, “Knowledge vault: A web-scaleapproach to probabilistic knowledge fusion,” in KDD, 2014, pp. 601–610.

[6] X. Yang, C. M. Procopiuc, and D. Srivastava, “Summarizing relationaldatabases,” PVLDB, vol. 2, no. 1, pp. 634–645, 2009.

[7] X. Yang, C. M. Procopiuc, and D. Srivastava, “Summary graphs forrelational database schemas,” PVLDB, vol. 4, no. 11, pp. 899–910, 2011.

[8] C. Yu and H. V. Jagadish, “Schema summarization,” in VLDB, 2006,pp. 319–330.

[9] Y. Tian, R. A. Hankins, and J. M. Patel, “Efficient aggregation for graphsummarization,” in SIGMOD, 2008, pp. 567–580.

[10] N. Zhang, Y. Tian, and J. M. Patel, “Discovery-driven graph summa-rization,” in ICDE, 2010, pp. 880–891.

[11] N. Yan, S. Hasani, A. Asudeh, and C. Li, “Generating preview tablesfor entity graphs,” in SIGMOD, 2016, pp. 1797–1811.

[12] S. Brin and L. Page, “The anatomy of a large-scale hypertextual websearch engine,” in WWW, 1998, pp. 107–117.