DataMockups: design tool for content-powered mockups Master’s Thesis Sybil Ehrensberger Prof. Dr. Moira C. Norrie Alfonso Murolo Global Information Systems Group Institute of Information Systems Department of Computer Science ETH Zurich 16th September 2015


  • DataMockups: design tool for content-powered mockups

    Master’s Thesis

    Sybil Ehrensberger

    Prof. Dr. Moira C. Norrie
    Alfonso Murolo

    Global Information Systems Group
    Institute of Information Systems

    Department of Computer Science
    ETH Zurich

    16th September 2015

  • Copyright © 2015 Global Information Systems Group.

  • Abstract

    The focus of the web engineering research community has mainly been on the model-driven approach, creating methods and tools to support it. However, web developers in industry today use an interface-driven approach by first creating mockups of the web page they are designing. The DataMockups design tool developed in the present project aims to assist practitioners in their processes. The tool offers developers the possibility to design high-fidelity digital mockups of a website through a WYSIWYG-style editor. A previously developed tool called DeepDesign was modified to allow the import of content from existing websites into the editor. Once the pages have been designed and real content has been entered, the tool semi-automatically recognizes similar elements on the pages and groups them to suggest a complete schema with entities, attributes and relationships. If the users are satisfied with the schema proposed, the tool creates a new database and populates it with the contents of the designed pages through the use of a database generation service. A user study and runtime evaluation of the tool showed that the tool was effective and that the generated results corresponded to the participants’ expectations. The study also compared the approaches that DataMockups and DeepDesign use to recognize similar elements on web pages. Participants generally found the DataMockups tool easy to use and preferred it to the DeepDesign tool. However, some suggestions for improvements emerged, and most of the feedback has been addressed in the final version of the tool. Various recommendations are made for future development of the DataMockups design tool.


  • Contents

    1 Introduction

    2 Background
      2.1 Model-driven development
      2.2 Interface-driven development
      2.3 Combining model-driven and interface-driven approaches
      2.4 Similarity measures of websites
      2.5 Extracting data from websites

    3 Design of the DataMockups Tool
      3.1 Target audience
      3.2 Design editor
        3.2.1 Goals of the design editor
        3.2.2 General components of the design editor
        3.2.3 Drag and drop component
        3.2.4 WYSIWYG component
        3.2.5 Content import component
      3.3 Element detection
        3.3.1 Goals of element detection
        3.3.2 Use of the element detection component
      3.4 Schema formation
        3.4.1 Goals of schema formation
        3.4.2 Use of the schema formation component
      3.5 Integration with a database generation service
        3.5.1 Goals of integration
        3.5.2 Use of the integration component

    4 Architecture and Implementation
      4.1 Choice of technologies
      4.2 Overall architecture
      4.3 Design editor
        4.3.1 Drag and drop positioning
        4.3.2 WYSIWYG
        4.3.3 Integration with the DeepDesign Chrome extension
      4.4 Element detection
        4.4.1 Selecting elements
        4.4.2 Hierarchical clustering and pq-gram edit distance
        4.4.3 Cluster naming and selection
      4.5 Schema formation
        4.5.1 Cluster classification
        4.5.2 Internal interface
        4.5.3 Data detection
      4.6 Integration with the database generation service
        4.6.1 Overview of the integration with the DB-API-Generator
        4.6.2 Schema specification and generation
        4.6.3 Data insertion

    5 Evaluation
      5.1 User study
        5.1.1 Participants
        5.1.2 Design editor
        5.1.3 DataMockups vs. DeepDesign
        5.1.4 Database generation
      5.2 Element detection evaluation
        5.2.1 Setup and tests
        5.2.2 Results
      5.3 Discussion

    6 Conclusion

    List of Figures

    List of Tables

    List of Code Snippets

    List of Algorithms

    A XSD-Definition for the Schema

    B User Study Questions

    C User Study Task Descriptions

    D Excluded User Study Results

    E Suggested Improvements to the Tool

  • 1 Introduction

    Website development practices have evolved significantly since the first websites were created as simple lists of links. Although the web engineering discipline has its origins in software development and still has many similarities to the traditional software development process, new methodological approaches are needed to keep up with current practices. Most of the web engineering research community still advocates the use of a model-driven approach, although surveys of industry trends show that this approach is not being used by many web developers in the industry today [1].

    For the developers, the website development process usually starts with the design of the user interface. This can either be a pen and paper sketch, an image designed with tools such as Adobe Photoshop or a mockup already written in HTML and CSS. From these mockups the HTML and CSS pages are either automatically generated or manually created, and then the front-end functionality is added. Finally, the server-side functionality is implemented. This process can be done without making use of any formal models at all.

    Instead of convincing today’s web developers to change their process completely to make use of tools that can aid them by generating code for them, the aim of this project is to develop a tool to help them create data-intensive websites without changing the current practices of starting by creating mockups. The design tool will support the developers in creating realistic mockups with real content that they can then present to their clients as well as allowing the possibility of testing various aspects of the design under authentic conditions in order to prevent the need for costly post-launch modifications. The mockups are written directly in HTML and CSS, so there is no need to migrate from the mockup to the final website once the design is done.

    As Bill Gates already said in an essay in 1996¹: “Content is King”. This also applies to web design practices. The most important part of a website is not how visually pleasing it is or whether it follows the newest design trends, but the content in it. The design of the website should make it easier to find and read the content and make it more appealing to look at, but ultimately, the visitors usually do not go to websites to look at the design, but instead to consume the content (read the article, watch the movie, download the software, buy an item, etc.). That is why, when developing websites, it is important to keep that in mind and to incorporate real data into the mockups at an early stage.

    ¹ http://web.archive.org/web/20010126005200/http://www.microsoft.com/billgates/columns/1996essay/essay960103.asp, last accessed on September 9, 2015

    For cases where the content (or similar content) of the page being designed already exists on some other website, the tool also provides the functionality of importing the data from the other website. The imported data is added with all of the styling information from the original page left intact. The users can then modify the content and the styling so that it corresponds to the page they have in mind.

    To ease the process of going from a user interface design to a fully functioning website, the tool will detect the relevant content from the mockup semi-automatically and suggest a database schema including the attributes of individual entities and the relationships between the entities. Through the use of a dedicated database generation service, it will be possible to create a database directly and fill it with the content provided in the mockup.

    The aim of the design tool is to help developers in the creation of the database schema as well as to speed up the whole process through a tool that can take over the tedious and error-prone parts of the development process. Additionally, it will allow developers who have very limited or no knowledge at all of database design to create a functioning website with reduced effort and with no additional tools. This is especially useful since the survey mentioned above found that 42% of web developers and designers that work in the industry have no formal training in computer science or design.

    Contributions

    The DataMockups design tool developed in this project aids developers and designers in creating HTML mockups with real content. The mockups can be edited and styled directly within the tool. While many tools already exist that help developers create mockups, the novelty of the approach used in the new tool lies in the element detection and schema generation that follows the design of the mockup. Because the tool users add real content to the designed pages, it can then suggest possible data entities from the designed page and put them into relationships in a semi-automatic manner. These data entities can be modified and deleted in case the users do not agree with the automatic suggestions. Finally, the tool can automatically generate a working database through a dedicated database generation service developed in previous work and add the data from the mockup directly into the database.

    Before describing the tool itself, Chapter 2 presents a brief review of the literature. Chapter 3 introduces the main goals of the DataMockups design tool and its functionality. The architecture and implementation details of the tool are described in Chapter 4. Chapter 5 presents a user study that was conducted to test the design tool and the proposed schema detection approach. Finally, Chapter 6 summarizes the main results, the limitations of the tool and possible future work.

  • 2 Background

    The first three sections of this chapter explain the motivation and background for our specific hybrid approach to web development. First, the idea behind model-driven development and the research that has been done in this area is described. The next section about interface-driven development explains how many practitioners in industry today design websites. Section 2.3 discusses how previous researchers have integrated the two approaches into a unified process and included the use of mockups.

    The final two sections provide a general overview of the implementation of tree-edit distances as well as of varying approaches of data extraction from websites.

    2.1 Model-driven development

    Web development started out as a branch of software development, and as a consequence many practices from software development, such as modelling, have been taken over. This is why model-driven approaches to web development are common within the web engineering research community, and there are many well-established practices and tools to support those approaches.

    Aragón et al. conducted an analysis of various model-driven web development methods [2]. Some of the more frequently discussed methods are OOHDMDA [3], WebML [4] and UWE (UML Web Engineering) [5]. For example, the design of data-intensive web applications with one of these methods has been described in detail in [6]. The authors propose first creating a data model using ER diagrams, then a hypertext model using WebML [4] and finally a content management model. Based on those models, the necessary code for the website can be automatically generated. However, all of these methodologies are based on a model-driven architecture (MDA). This means that in each case models must first be created before the website can be generated.


    As concluded by Lang and Fitzgerald [7], although over fifty methods and approaches to model-driven web development have been proposed in the literature, very few of them are being used in practice. They suggest that a better integration between design tools and design methods would be useful. Since little research on current web development practices in industry had been done, Norrie et al. conducted a study to find out how web developers and web designers created websites [1]. It showed that a large portion of today’s web developers have no formal education in computer science. This means that they were not trained as web or software developers and are thus most likely unfamiliar with the modelling practices of traditional software development. Additionally, the model-driven approach leaves the user interface design to the very end of the process. This means that it is very hard for the developer’s clients, that is the people who commissioned the website, to be able to understand what the final product will look like.

    2.2 Interface-driven development

    The need to create expressive mockups in the web development process has been increasing with greater importance of presentation in the development of websites. Expressive and realistic mockups help the designer or developer to communicate with their client about the prospective website and to get early feedback. Additionally, if a website is poorly designed and not visually pleasing, then the end-users will be less likely to stay on the page. Therefore, the mockup, which represents the final design of the website, is a crucial element in the development process.

    The results of the survey [1] showed that sketching mockups was a widespread practice and that more than half of the participants created digital mockups in their development process. A variety of tools is available to create mockups. There are purely design-based tools, such as Adobe Photoshop, but also online sites that allow developers and designers to create wireframes (e.g. MockingBird¹, Balsamiq Mockups²) or even valid HTML and CSS (e.g. templatr.cc³, Dragsponsive⁴, Pingendo⁵).

    In addition to creating mockups, it is important to not just fill the mockups with filler text, such as the typical Latin Lorem ipsum, but to add real content instead. There are several reasons for this. The first is to avoid accidentally publishing sites that contain the filler text instead of the real content. Another reason is that the real content can be substantially different in structure and length from the filler text and thus the mockup would not accurately depict the final design, as mentioned in [8]. Finally, having mockups with real content leads to better feedback from the client since it is easier for them to comment on not only the look and feel of the page, but also on how the content is displayed.

    While interface-driven approaches are great at creating a user interface and refining the visual representation with the client, there is a lack of tools and approaches on how to proceed after the mockup is created. Most designers and developers then manually convert the mockups to HTML and CSS (if the mockup was not already in that format) and then gradually add front-end and server-side functionality. DENIM [9] is a tool for designers that supports them in the early design stage by incorporating sketches of different levels into the tool and allowing them to automatically create HTML and CSS pages from the sketches. However, the tool does not offer any support after the HTML pages are created.

    ¹ https://gomockingbird.com/home
    ² https://balsamiq.com/
    ³ http://templatr.cc/
    ⁴ http://dragsponsive.com/
    ⁵ http://pingendo.com/

    2.3 Combining model-driven and interface-driven approaches

    The web engineering research community has recently started to investigate how the two approaches can be combined. Since the model-driven approaches described in Section 2.1 do not allow the client to give feedback at early stages of the development process, hybrid approaches such as those described below have been created.

    MockupDD

    In [10], the authors combine model-driven development with agile methodologies. Instead of starting with data models as is usual in model-driven approaches, they start with user interface mockups. These mockups allow them to go through an iterative process with the client as well, refining the mockups as needed. The whole process, MockupDD (Mockup-Driven Development) [11], is based on the Structural User Interface (SUI) model which is derived from the mockups. The SUI model can be enriched with tags that describe the navigation and the content of the pages. The enriched SUI model can be used for automatic code generation to get a working demo of the website and it can also be used to generate other model-driven web engineering models (currently WebML and UWE). This approach successfully integrates mockups into the development process, but it still is very much focused on the model-driven approach. For developers with no prior knowledge of the modelling tools used, this approach, which requires extensive tagging of multiple different mockups, may seem overly complicated.

    MockAPI

    MockAPI [12] uses a similar approach to MockupDD to create API-centric services. It also starts by gathering client requirements and creating mockups. The mockup is annotated with the desired features. Possible annotations include the functionalities related to content such as the CRUD (create, read, update and delete) operations as well as navigation and custom functions. These annotations are used to create initial API implementations. This approach is interesting as no knowledge of model-driven web engineering models is needed and the annotations are very straightforward and easy to understand. However, this approach is limited to the generation of APIs and cannot be directly applied to more general web development scenarios.


    2.4 Similarity measures of websites

    Every web page can be represented as a specific type of hierarchical data, namely a labelled ordered tree. The <html> tag can be considered the root of the tree and every element a child of it. The label of each element is usually the node name. The problem of comparing two websites or any two HTML elements in a web page can thus be reduced to comparing two trees, a measure referred to as the tree-edit distance. Being able to compare two HTML elements to each other is an essential part of the new tool. It is needed in the element detection process in order to be able to determine whether the two elements represent the same type of data. This section outlines different ways of computing similarity measures between two trees.
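    As a rough illustration of this representation, the following Python sketch turns HTML markup into a labelled ordered tree using only the standard library. The names `Node`, `TreeBuilder` and `html_to_tree` are hypothetical and are not taken from the tool; the sketch also assumes reasonably well-formed markup with a single root element and ignores text and attributes.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must never stay on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A tree node whose label is the element's tag name."""

    def __init__(self, label):
        self.label = label
        self.children = []  # order of children follows document order

    def __repr__(self):
        if not self.children:
            return self.label
        return f"{self.label}({', '.join(map(repr, self.children))})"


class TreeBuilder(HTMLParser):
    """Builds a labelled ordered tree from start and end tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        if self.stack:
            self.stack[-1].children.append(node)
        elif self.root is None:
            self.root = node
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def html_to_tree(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


tree = html_to_tree("<html><body><ul><li>a</li><li>b<br></li></ul></body></html>")
print(tree)
```

    Any two subtrees of such a tree can then be passed to a tree similarity measure such as those discussed in the remainder of this section.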

    Tree-edit distance

    Selkow [13] suggests a recursive algorithm that calculates the similarity of trees by counting the number of operations needed to transform the first tree into the second (or vice-versa). Selkow defines three different edit operations: a) a label change operation; b) an insert operation; and c) a delete operation. Each of these operations has a non-negative cost. The sum of the number of operations for each type multiplied by their cost then results in the tree-edit distance. The algorithm proposed by Selkow is a top-down algorithm and has a runtime of at least O(n^2) where n is the number of tree nodes.

    In order to efficiently calculate the tree-edit distance, Tai defines a mapping from one tree to another and then uses this mapping in a dynamic programming algorithm that takes polynomial time to solve the tree-edit distance problem [14].
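    The top-down idea can be sketched in Python as follows. This is a minimal illustration and not the algorithm as published: it assumes unit costs for all three operations, assumes that inserting or deleting a node implies inserting or deleting its whole subtree (the usual reading of the top-down restriction), and represents a tree as a nested tuple `(label, children)`. The function names are hypothetical.

```python
from functools import lru_cache


def size(tree):
    """Number of nodes in a tree given as (label, (child, child, ...))."""
    label, children = tree
    return 1 + sum(size(child) for child in children)


@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Top-down tree-edit distance with unit costs (illustrative sketch)."""
    label_a, kids_a = a
    label_b, kids_b = b
    relabel = 0 if label_a == label_b else 1

    m, n = len(kids_a), len(kids_b)
    # d[i][j]: cost of turning the first i children of a into the first j children of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])      # delete whole subtrees
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])      # insert whole subtrees
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids_a[i - 1]),       # delete child subtree i
                d[i][j - 1] + size(kids_b[j - 1]),       # insert child subtree j
                d[i - 1][j - 1] + top_down_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return relabel + d[m][n]


short_list = ("ul", (("li", ()), ("li", ())))
long_list = ("ul", (("li", ()), ("li", (("a", ()),)), ("li", ())))
print(top_down_distance(short_list, long_list))
```

    In this example the two lists differ by one extra list item and one extra link, so two unit-cost insertions are enough to transform one into the other.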

    XML similarity

    Nierman and Jagadish [15] introduce a similarity measure for XML documents. Their approach is particularly relevant to the new tool since XML and HTML pages are structurally very similar and some HTML pages are even valid XML. Since XML elements have not only child elements but also attributes, the element node is modified slightly so that attributes are added as child nodes to the element node. Similarly to [13], they define different edit operations, but instead of only having three operations, they add two more: the insertion or deletion of a tree. The use of these two operations, however, is restricted to subtrees that are present in both of the trees that are being compared. Otherwise one could simply delete one of the trees and replace it completely by the other. The minimum tree-edit distance is then computed using a dynamic programming approach. The time complexity of this approach is O(|A||B|) where A and B are the two trees being compared.

    pq-grams

    Since computing the tree-edit distance is computationally expensive and leads to a long run time, Augsten et al. [16] suggest an efficient approximation of the tree-edit distance. The pq-gram edit distance differs from other tree-edit distance algorithms in that it places a higher importance on modifications to the structure of the tree (i.e. insertions and deletions of non-leaf nodes) compared to the other operations. The tree is first extended with null nodes by inserting p − 1 ancestors to the root node and q − 1 children before the first and after the last child of each non-leaf node and by adding q children to each leaf node. A pq-gram is then defined to be a subtree of this extended tree that is isomorphic to a tree that consists of an anchor node with p − 1 ancestors and q children. The pq-grams of a tree can generally be described as all of the subtrees of a specific shape. The pq-gram distance is based on the number of pq-grams the two trees have in common. It can be calculated in O(n log n) time, where n is the number of tree nodes, and is significantly faster for large trees than algorithms that compute the tree-edit distance.
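    The construction can be sketched in Python as follows. The sketch collects the label tuples of all pq-grams in a bag (the pq-gram profile) and compares two profiles. The normalised formula used here, 1 − 2·|P1 ∩ P2| / (|P1| + |P2|) with bag intersection, is stated as an assumption about the definition in [16]; the tuple-based tree representation and the function names are hypothetical, and the defaults p = 2 and q = 3 are chosen only for illustration.

```python
from collections import Counter

NULL = "*"  # label of the null nodes used to extend the tree


def pq_profile(tree, p=2, q=3):
    """Bag of pq-gram label tuples for a tree given as (label, (children...))."""
    profile = Counter()

    def visit(node, stem):
        label, children = node
        stem = stem[1:] + (label,)        # anchor node plus its p - 1 ancestors
        base = (NULL,) * q                # sliding window over q children
        if not children:
            profile[stem + base] += 1     # a leaf gets q null children
            return
        for child in children:
            base = base[1:] + (child[0],)
            profile[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):            # q - 1 null children after the last child
            base = base[1:] + (NULL,)
            profile[stem + base] += 1

    visit(tree, (NULL,) * p)
    return profile


def pq_distance(t1, t2, p=2, q=3):
    """Normalised pq-gram distance in [0, 1] (assumed form, see lead-in)."""
    p1, p2 = pq_profile(t1, p, q), pq_profile(t2, p, q)
    shared = sum((p1 & p2).values())      # bag intersection
    total = sum(p1.values()) + sum(p2.values())
    return 1 - 2 * shared / total


record = ("li", (("h2", ()), ("p", ())))
variant = ("li", (("h2", ()), ("p", ()), ("a", ())))
print(pq_distance(record, record), pq_distance(record, variant))
```

    Identical trees have distance 0 and trees without any common pq-gram have distance 1, so a threshold between these values can be used to decide whether two subtrees are similar enough to be treated as the same kind of element.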

    2.5 Extracting data from websites

    In order to support the semi-automatic generation of a database schema, the tool needs to be able to recognize the relevant elements on the page and the relationships between them. This work is closely related to the problem of content extraction and information retrieval. Data extraction from HTML pages is often done through the creation of so-called wrappers. The wrappers describe the elements of the page that are part of the structure of the page and not the relevant data itself, so that the interesting content can be identified and then extracted. Researchers from the Federal University of Minas Gerais have conducted a survey of available data extraction tools [17]. The 15 tools they examine in detail use different approaches and have various degrees of automation. The early approaches involve manually specifying where in the page the relevant data is positioned or using interactive systems where users can specify which data should be extracted [18, 19, 20]. The following sections describe the most relevant related work pertaining to the mostly automatic extraction of content from websites and evaluate it with respect to the requirements of the new design tool.

    RoadRunner

    In contrast to the earlier work described above, RoadRunner [21] does not require any additional information (e.g. labelling, known schema). It can be run completely automatically. RoadRunner takes two similar HTML pages that have the same structure but different content (e.g. two search results web pages generated by two different search terms) as input. Using the two input pages, it creates a regular expression wrapper for the pages. The wrapper contains all of the common elements between the two pages and allows for differences in places where the data is. To construct the wrapper, the algorithm first assigns the whole first page as the wrapper, then using the algorithm match, the wrapper is modified to remove or change the elements that do not match in the second page.

    The RoadRunner approach is interesting as it was one of the first fully-automatic wrapper induction solutions. However, it is not useful for the new design tool since, in order to work, it requires two similar pages that differ only in the data that is to be extracted. The proposed design tool will only have access to one mockup page.


    IEPAD

    Yet another approach to the data extraction problem is the one proposed in [22]. The IEPAD system consists of three main components: an extraction rule generator; a pattern viewer; and an extractor module. The extraction rule generator includes a translator that receives the HTML page, a PAT tree constructor, a pattern discoverer and validator and finally an extraction rule composer. The translator takes the input page and identifies the HTML tags and the text and then encodes them in binary format. This is then fed to the PAT tree (Patricia tree) constructor. The latter constructs the PAT tree, which is in turn used by the pattern discoverer and validator to discover repeated paths. The validator uses two measures (compactness and regularity) to discard some of the paths. The rule composer then creates rules from the paths that also allow for inexact matches through a dynamic programming approach of alignment. The rules are displayed in the pattern viewer for the developers to choose which ones the extractor module should use to extract the data.

    In contrast to RoadRunner, IEPAD only needs one page to generate the rules and does not require any human interaction before the rules are selected in the pattern viewer. Research has shown that IEPAD can achieve high accuracy in experimental results. Nonetheless, the approach is not very well-suited for the new design tool as it can only deal with flat records (no nested entities) and requires the users of the IEPAD system to be able to understand the extraction rules that are presented to them.

    ExAlg

    Arasu and Garcia-Molina [23] describe an approach that does not rely on any human input or other learning examples, similarly to RoadRunner and IEPAD. Just like RoadRunner, the ExAlg algorithm takes as input a set of similar web pages, but in contrast to both IEPAD and RoadRunner, they do not assume that all HTML tags are part of the template, but instead consider that they could be part of the data. The algorithm is based on the assumption that all of the pages were created by a template that substitutes the relevant parts with the real data. Their aim is to deduce the template of the pages and extract the values. The algorithm they present consists of two modules: the equivalence class generation module and the analysis module. They define the idea of equivalence classes as a set of tokens that have the same frequency on each page. The equivalence class generation module finds all of the equivalence classes that are large and occur on many pages (LFEQs - Large and Frequently occurring EQuivalence classes). These LFEQs are then used in the analysis module to reconstruct a template and extract the data.
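    The notion of an equivalence class can be illustrated with a short Python sketch that groups tokens by their occurrence vector, i.e. how often they appear on each input page. This is a strong simplification of ExAlg: it treats every occurrence of a token alike, whereas the published algorithm is considerably more involved, and the thresholds `min_size` and `min_support` as well as the function names are hypothetical.

```python
from collections import Counter, defaultdict


def equivalence_classes(pages):
    """Group tokens that have the same frequency on each page.

    pages: list of token lists, one list per input page.
    Returns a dict mapping an occurrence vector to the sorted tokens sharing it.
    """
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*counts)
    classes = defaultdict(list)
    for token in sorted(vocabulary):
        vector = tuple(count[token] for count in counts)
        classes[vector].append(token)
    return dict(classes)


def large_and_frequent(classes, min_size=2, min_support=2):
    """Keep classes with enough tokens that occur on enough pages (assumed thresholds)."""
    return {
        vector: tokens
        for vector, tokens in classes.items()
        if len(tokens) >= min_size
        and sum(1 for n in vector if n > 0) >= min_support
    }


pages = [
    ["<html>", "<b>", "Title", "</b>", "x", "</html>"],
    ["<html>", "<b>", "Other", "</b>", "<b>", "y", "</b>", "</html>"],
]
classes = equivalence_classes(pages)
print(large_and_frequent(classes))
```

    In this toy input the template tokens end up in classes that occur on both pages, while the data values ("Title", "x", "Other", "y") each occur on a single page only and are filtered out.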

    This algorithm is similar to the approach used in RoadRunner in that it requires a set of similar input pages and then deduces a common template from them. This means that it is not suitable for the proposed tool either, as the tool cannot provide more than one input page.

    Domain-oriented approach

    A different approach was taken in [24]. In this case, to help the data extraction, the researchers used knowledge of the domain of the websites, specifically news sites, to improve the automatic extraction of data. Their approach uses an algorithm to create a restricted top-down mapping (RTDM). Tree-edit distances and mappings that this approach relies on were presented in Section 2.4. To extract the data, they suggest using hierarchical clustering with the RTDM algorithm as the distance measure to first group pages from a website into clusters. The algorithm then goes through the pages to find a node extraction pattern (ne-pattern), a kind of regular expression for trees. These ne-patterns include wildcards that correspond to the data that should be extracted and are constructed by composing the trees of the pages until only one is left. Using the ne-patterns to extract the data is a straightforward process. The researchers subsequently exploit knowledge of the domain (i.e. that a news item has a title and a body) to label the data.

    This approach works well for news sites, as the extracted data always has a similar simple structure and domain knowledge can be used to label the data. However, the proposed tool does not know what types of sites are to be designed and thus cannot rely on other information. Additionally, it will not have access to multiple pages created by one template, but only one designed page.

    DeepDesign

    The approach taken by Murolo and Norrie [25] requires the users to annotate some of the relevant data in the web page before automatically searching for similar elements in the page. Once the users have labelled parts of a sample data record, their algorithm first searches for the least common ancestor (LCA) of those annotations. Using the subtree rooted at the LCA, the algorithm searches for similar records by computing the pq-gram distance [16] (see Section 2.4) between the subtree and all other possible subtrees. If the pq-gram distance is below a certain threshold, the subtrees are considered possible matches. Labels are used to exclude subtrees that are similar, but do not contain the same data. Using XPath-based relative paths, the algorithm tries to find the labels given by the user in the sample data record in each of the subtrees. If it is unable to find all of the labels, the subtree is discarded as a false positive. Finally, a step called local label propagation is done. In this step, the algorithm uses agglomerative hierarchical clustering with a custom distance function to find elements in the recognized subtrees that could correspond to previously labelled elements. This handles the case in which a labelled element occurs more than once in the data record.
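    The first step, finding the LCA of the annotated elements, can be sketched in Python as follows. The sketch assumes that each annotated node is identified by its path of child indices from the root, which is an illustrative choice rather than the representation used in [25], and that a tree is a nested tuple `(label, children)`. Both function names are hypothetical.

```python
def least_common_ancestor(paths):
    """Longest common prefix of the root-to-node paths of the annotated nodes.

    paths: non-empty list of tuples of child indices, one per annotated node.
    Returns the path of the deepest node that contains all annotations.
    """
    if not paths:
        raise ValueError("at least one annotated node is required")
    prefix = []
    for steps in zip(*paths):          # walk all paths in lockstep
        if len(set(steps)) != 1:       # the paths diverge at this depth
            break
        prefix.append(steps[0])
    return tuple(prefix)


def subtree_at(tree, path):
    """Follow a path of child indices down a (label, (children...)) tree."""
    node = tree
    for index in path:
        node = node[1][index]
    return node


item = ("li", (("h2", ()), ("p", ())))
page = ("body", (("ul", (item, item)),))

# The user labels the heading and the paragraph of the first list item.
annotations = [(0, 0, 0), (0, 0, 1)]
lca = least_common_ancestor(annotations)
print(lca, subtree_at(page, lca)[0])
```

    The subtree returned for the LCA plays the role of the sample record; in the approach described above it would then be compared against the other subtrees of the page using the pq-gram distance.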

    The approach described above relies on human input (i.e. the labelling) to find similar subtrees, and can therefore not be applied directly to the tool proposed in the present paper. Additionally, the approach currently does not support more complex schemas that involve relationships or aggregations.

    The tool proposed in this project has to be able to inspect the contents of a single page and suggest possible entities and relationships to the users. To accomplish this, the tool needs to be able to recognize similar repeated elements in the page that could be entities or attributes. None of the previously described approaches satisfy all of the requirements of the tool. The early work requires too much user input, and RoadRunner and ExAlg both require more than one input page. IEPAD and DeepDesign are not able to recognize complex entities and relationships, and the domain-oriented approach cannot be used since the tool will not have any knowledge of the domain of the pages to be designed. The proposed tool therefore uses a novel approach that is based on various concepts introduced in this section.


  • 3 Design of the DataMockups Tool

    The idea behind the design tool is to have a completely self-contained web application that helps web designers and web developers to create data-intensive websites through the semi-automatic suggestion of a database schema and generation of a populated database for the created mockup. This chapter lists the goals of the various components of the tool and describes how they can be used.

    3.1 Target audience

    The tool is aimed at designers and developers who have some basic knowledge of HTML and CSS. It is expected that they can understand and create their own CSS rules, i.e. that they know what margins and paddings are, can set the font family and font sizes, and know what hexadecimal colour codes are.

    It is not necessary for them to be able to write their own HTML, but they should have a basic knowledge of the most common HTML elements and what they represent. Of course, the design tool can also be used by more experienced developers, although they might actually be faster and have more flexibility if they wrote the HTML and CSS directly.

    It is not assumed that the users have any knowledge of database design or optimization. However, the tool does require that the users enter the database address, username and password for the database that should be created.

    If the database is not accessible from the tool’s server, the users need to be able to enter two commands on the command line to execute the scripts which create the database and add the data. In this case, they need to have PHP installed and know what a terminal is.


    3.2 Design editor

    As described in Section 2.2, creating mockups with realistic content is an important part of the development process. The design editor is the tool in this project that enables the designers and developers to achieve this. It is where the designers or developers design the general look and feel of their website and add the content. The following sections describe the goals and requirements for the design editor and then describe how the tool fulfils these.

    3.2.1 Goals of the design editor

    The design editor should:

    • enable users to create realistic mockups with real content without writing any code

    • allow users to interactively change the design

    • support responsive designs

    • create valid HTML5 and CSS code

    • allow users to integrate content from other websites

    • export the designed mockups to use as real websites

    3.2.2 General components of the design editor

    The general components of the design editor facilitate the creation of responsive designs as well as valid HTML and CSS files that can be exported for use in real websites (see Figure 3.1).

    Figure 3.1: Main page of the design editor

    Directly below the fixed navigation menu is the options bar. It includes the save and discard buttons as well as the selection of the page that is currently being edited in the design window.


    It is also possible to create a new page and to export all created HTML and CSS files so that they can be used in the actual website later. Below the options bar are the drag and drop elements (discussed further in Section 3.2.3).

    The General pane is to the left of the design window. In this pane, the user can enter page information such as the page title, page author and page description. This information is stored in <meta> tags on the page. The included stylesheets are listed directly below the page information. The pane also includes the option of removing existing stylesheets or adding new ones (internal or external). Under the page info section, the users can choose the window size either by selecting a pre-defined device and its screen size or by entering a custom width and height. This allows the users to see how their design looks on different screen sizes and to test responsive designs.

    The next two sections describe the drag and drop and the WYSIWYG component in more detail and the third one explains how to import content from another website into the design editor.

    3.2.3 Drag and drop component

    As its name suggests, this component allows users to drag and drop various elements such as tables, images, paragraphs and headers into their design and then edit the contents directly inside the web page. To add new content to a page, the users can drag and drop HTML elements from a selection at the top of the page (see Figure 3.2) to the appropriate place in the page. The element is positioned in the vicinity of where it is dropped, and not necessarily exactly in that location, due to the different sizes of the elements and in order to support a responsive design.

    Figure 3.2: Drag and drop elements

    As illustrated in Figures 3.3 and 3.4, when users drag an element over the page, in addition to a slightly lighter version of the dragged element, a purple box appears wherever the users can drop the element. The options generally are above or below an existing element or to the left or right of an existing element. For some elements, the option is also available to drop the element inside an existing element (depending on the type of the existing and the new element).

    Figure 3.3: Purple overlay top

    Figure 3.4: Purple overlay left


    3.2.4 WYSIWYG component

    With this component, users can interactively change the design of the website through a What You See Is What You Get (WYSIWYG) style editor. They also always have the option of writing their own custom CSS rules, which are then added to the design, or of editing the HTML directly.

    Once the elements are placed on the page, the users can change the elements’ content directly on the page. By double clicking on the desired element, the content becomes editable. The users can either enter the text directly or copy and paste it from a different source.

    It is also possible to change the styling of parts of the element, the whole element or all elements of that type (e.g. paragraph, list item) with two different controls. One of the controls is a pop-up (YAWE - Yet Another WYSIWYG Editor) that shows up as soon as the users double-click on an element (see Figure 3.5) and the other is the Editing & Styling pane to the left of the page, which is pictured in Figure 3.6.

    The pop-up has undo and redo buttons in the top left corner. In the top right corner the users can select the scope of the changes. The options for the YAWE pop-up are the selected text (selection option), the whole element (element option), all elements of the class (class option) or all elements of this type on the page (whole page option). In this pop-up, the users can apply different styles such as bold, italic, underline or strike-through to the selected scope. Additionally, the font style and the font size as well as the alignment can be changed.

    For the selected text option only, it is also possible to create or remove a link as well as to insert images, horizontal lines, headings, paragraphs, ordered and unordered lists. In addition, the selected text can be shifted to the right or left with the increase or decrease indent buttons, respectively.

    Figure 3.5: YAWE - Yet Another WYSIWYG Editor

    The other way of changing the styling is by using the Editing & Styling pane (shown in Figure 3.6) to the left of the design window. It shows options to change the element ID, the element class, the widths and heights (minimum, maximum and actual) of the element, the padding and margin as well as the text and background colour. As with the YAWE pop-up, the users can change the scope of the styling. The options here are: this specific element, all elements with the same class and all elements of the same type.

    If there are not enough options offered by the YAWE pop-up and the styling pane, the users also have the possibility of creating their own CSS rules (see Figure 3.7). In the Editing & Styling pane, below all of the other options, is a list of applied CSS rules. The users have the option to delete or modify existing rules or to create their own CSS rules. The new rules are applied to the page after the Apply rules button is clicked. Below these rules, there is a button to completely remove the whole element.

    For convenience, the HTML code of the currently selected element can be viewed and edited by clicking the Show HTML code button at the very top of the pane. Once clicked, it opens a modal window that displays the HTML code.

    Figure 3.6: Editing & Styling pane

    Figure 3.7: CSS rules

    3.2.5 Content import component

    The content import component gives users the possibility to import content and the associated styling information from existing websites and integrate that content into their own page. To do this, the DeepDesign Chrome extension [25] is used. Once elements are selected and similar elements are found, the users need to click the Save to storage button. After that is done, the users can navigate back to the design editor and open the extension on this page. There they click the button Send to application. Once the button is clicked, the additional item depicted in Figure 3.8 appears in the list of drag and drop elements.

    Figure 3.8: DeepDesign drag element

    The element can then be dragged and dropped like any of the other elements (see Section 3.2.3). Once the element is dropped, the contents of the other website will be inserted at that position. The styling information from the other website is kept and thus the content will show up in the users’ design as it was on the other website. Of course, the users can change the content and styling just like any other content of the page.

    3.3 Element detection

    Using some of the ideas presented in Sections 2.4 and 2.5, the element detection component aims to detect similar elements on the designed page that can then be used to create a database schema with as little human input as possible.

    3.3.1 Goals of element detection

    The goals for the element detection are the following:

    • require as little user input as possible

    • automatically detect all possible elements in the mockup

    • allow users to discard unwanted elements

    • provide an easy naming mechanism for the users

    • be applicable to both list views (many elements) and detail views (a single element)

    3.3.2 Use of the element detection component

    The element detection component is responsible for detecting important elements on the page and then suggesting them to the users for naming or discarding. To start the element detection, the users navigate to the Detection pane and click the Choose area button. Then they can select an area of their designed web page. The chosen area is indicated by a purple overlay and all of the individual elements within the area have a purple border. Once they are satisfied with the selected area, they click the Done choosing area button. The tool then clusters all elements, displays a list of results in the Detection pane and adds a coloured border around the elements in the page. A sample of a completed detection is shown in Figure 3.9. The users can name the clusters they want to keep and discard the ones that are irrelevant. If the users give different clusters the same name, the tool automatically recognizes that and merges all elements into the same cluster. To better identify which elements are part of a certain cluster, the number of elements is displayed next to the cluster number and, when the users hover over the entry in the list, the elements are highlighted with the same colour as the outline. Once the users are finished naming and discarding clusters, they can click on Finished naming/discarding clusters to complete the process.
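The merging of clusters that were given the same name can be sketched as follows. This is a hedged illustration only: the cluster shape (`{ id, name, discarded, elements }`) and the function name `mergeClustersByName` are assumptions made for this example and are not taken from the tool's source code.

```javascript
// Hypothetical cluster shape: { id, name, discarded, elements }.
// `name` stays null until the user names the cluster.
function mergeClustersByName(clusters) {
  const byName = new Map(); // cluster name -> merged list of elements
  for (const cluster of clusters) {
    const key = (cluster.name || "").trim();
    if (cluster.discarded || key === "") continue; // skip discarded or unnamed clusters
    if (!byName.has(key)) byName.set(key, []);
    byName.get(key).push(...cluster.elements);
  }
  return [...byName].map(([name, elements]) => ({ name, elements }));
}

// Clusters 1 and 3 were both named "title" and therefore end up in one cluster.
const merged = mergeClustersByName([
  { id: 1, name: "title", discarded: false, elements: ["h2#a", "h2#b"] },
  { id: 2, name: null, discarded: true, elements: ["div.spacer"] },
  { id: 3, name: "title", discarded: false, elements: ["h3#c"] },
  { id: 4, name: "price", discarded: false, elements: ["span#p1"] },
]);
console.log(merged);
```

Keying the merge on the trimmed name is a design choice of the sketch; whether the tool normalizes names in this way is not stated in the text.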

    The whole process can be repeated on another page of the same project that may contain different elements. However, the page needs to be saved first, so that the previous cluster assignments are not lost. Then the users can select a different page and complete the process. This allows users to assign properties to elements that were not present in the first page. Of course, this process can be repeated for as many pages as necessary.


    Figure 3.9: Completed clustering with element highlighting

    At any point in time, it is possible to discard all clustering information by clicking the button Remove/Discard all clustering information.

    3.4 Schema formation

    In the schema formation component, the detected elements from the previous section are classified as entities and attributes and the relationships between them are inferred from the positioning within the page. A database schema is created and the contents of the page are detected. Both are stored in an internal interface so that they can be used later in the database generation step.

    3.4.1 Goals of schema formation

    The goals for the schema formation are the following:

    • automatically determine the type for each cluster (entity, attribute, relationship)

    • automatically establish the relationships between detected elements

    • be able to combine elements from different pages

    • display the schema and relationships as well as the detected content to the users

    • allow the users to modify the computed schema and relationships to their wishes

    • store the schema and relationships in a well-defined interface


    3.4.2 Use of the schema formation component

    Once the users are done clustering, they are encouraged to look at the results in the Schema & Data pane (see Figure 3.10). There all of the detected entities with their attributes and the relationships are listed. In this pane, the users have the possibility to review and modify the properties of the entities, attributes and relationships (as shown in Figure 3.11). The name of the entities, attributes or relationships and the type of the attributes can be changed. It is also possible to discard the entity, attribute or relationship entirely. Additionally, the cardinalities of the entities in the relationships can be viewed and changed. Each entity in a relationship can have a cardinality of either 1 or N. This allows the users to change the overall type of the relationship (e.g. 1-N, N-N) in case they do not agree with what the tool detected automatically.
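How the two per-entity cardinalities combine into an overall relationship type can be illustrated with a short sketch. The record layout of `writes` and the function name `relationshipType` are hypothetical; the tool's internal representation is not shown in the text.

```javascript
// Each side of a relationship has a cardinality of either "1" or "N".
const CARDINALITIES = ["1", "N"];

// Hypothetical relationship record, for illustration only.
const writes = {
  name: "writes",
  from: { entity: "Author", cardinality: "1" },
  to: { entity: "Article", cardinality: "N" },
};

// Derives the overall type (e.g. "1-N") from the two cardinalities.
function relationshipType(relationship) {
  const sides = [relationship.from.cardinality, relationship.to.cardinality];
  for (const side of sides) {
    if (!CARDINALITIES.includes(side)) {
      throw new RangeError(`invalid cardinality: ${side}`);
    }
  }
  return sides.join("-");
}

console.log(relationshipType(writes)); // "1-N"

// Changing one side, as a user might do in the pane, changes the overall type.
writes.from.cardinality = "N";
console.log(relationshipType(writes)); // "N-N"
```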

    Figure 3.10: Schema and detected data

    (a) For attributes

    (b) For relationships

    Figure 3.11: Editing options

    To help users understand the detected schema and relationships, the associated data from the design page is displayed below the schema in the Detected Data section. It lists all of the entities that were detected with all of their attributes and relationships. The users can review the data before it is filled into the database, as the displayed data is exactly what will be inserted into the database at a later point in time.

    In cases where the users have previously saved the clustering information, they have the possibility to view the schema and relationships as well as the detected data at a later point in time. To do this, they click on the Schema & Data pane and then on the Add button. Add checks the current page for saved cluster elements, adds them to the existing schema and updates the relationships. The new data from the page is then also automatically added to the Detected Data section. The Clear button removes all of the schema, relationships and detected data completely and the Reload button is a shortcut for a Clear and Add sequence.

    With these buttons, it is also possible to add elements from many different pages into the same schema and properly detect the relationships between them. This can be achieved by navigating to each page in turn and clicking the Add button on each page. The schema then represents the schema over all pages and the content from all of the pages is included in the Detected Data section.
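The semantics of the Add, Clear and Reload buttons can be sketched with a small in-memory store. This is an assumption-laden illustration: the page shape (`{ name, clusters }`), the store API and the omission of relationships are simplifications made for this example and do not reflect the tool's internal interface.

```javascript
// Hypothetical page shape: { name, clusters: [{ entity, attributes, records }] }.
function createSchemaStore() {
  let entities = new Map(); // entity name -> Set of attribute names
  let detectedData = [];    // entries of the form { entity, page, values }

  // Add: merge the saved clusters of one page into the existing schema and data.
  function add(page) {
    for (const { entity, attributes, records } of page.clusters) {
      if (!entities.has(entity)) entities.set(entity, new Set());
      attributes.forEach((attribute) => entities.get(entity).add(attribute));
      records.forEach((values) => detectedData.push({ entity, page: page.name, values }));
    }
  }

  // Clear: drop the whole schema and all detected data.
  function clear() {
    entities = new Map();
    detectedData = [];
  }

  // Reload: shortcut for Clear followed by Add on the current page.
  function reload(page) {
    clear();
    add(page);
  }

  function schema() {
    return [...entities].map(([entity, attrs]) => ({ entity, attributes: [...attrs] }));
  }

  return { add, clear, reload, schema, data: () => detectedData };
}

const listPage = {
  name: "list.html",
  clusters: [{ entity: "article", attributes: ["title", "teaser"], records: [{ title: "A" }, { title: "B" }] }],
};
const detailPage = {
  name: "detail.html",
  clusters: [{ entity: "article", attributes: ["title", "body"], records: [{ title: "A", body: "Text" }] }],
};

const store = createSchemaStore();
store.add(listPage);
store.add(detailPage); // the schema now spans both pages
console.log(store.schema());
```

Adding the detail page after the list page extends the `article` entity with the `body` attribute, which mirrors how a detail view can contribute properties that were not present in the list view.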

    3.5 Integration with a database generation service

    In order to use the computed schema to create an actual database, a previously developed database generation service is used. The service, called DB-API-Generator, was developed specifically to be used in the present project [26]. To be able to communicate with the service, the internal representations of the schema and relationships are converted into a specific XML schema definition format specified by the DB-API-Generator. The service returns different scripts that are then executed to create the database and to insert the data from the sample pages into the newly-created database.

    3.5.1 Goals of integration

    The integration with the database generation service should not be noticeable to the users. The goal is to use the service to create the database and fill the created tables with the content supplied in the mockups as seamlessly as possible.

    The format to transmit the information to the database generation service should be platform-independent, so that the same interface can be used for many different languages and platforms.

    The users can specify which database should be used. If desired and possible (i.e. the database server needs to be accessible from the design tool), the database is created directly by the tool and the data is inserted into the database. Otherwise the tool provides users with scripts and instructions on how to create and fill the database themselves.

    3.5.2 Use of the integration component

    Once the clustering is done and a schema is generated, the next step is to ask the users for some general information about their database, such as the database address, username and password. Figure 3.12 shows the form that is displayed after the button Create database in the Schema & Data pane is clicked (refer to Figure 3.10).


    Figure 3.12: Form for database information

    There is also an option called Result format. The two options, Execute directly and Export script, allow the users to specify whether the tool should create the database and fill in the data for the users, or just give them the necessary scripts so that they can execute the scripts themselves. It is only possible to choose Execute directly if the database server is accessible from the design tool. For both result formats, a link is generated where the users can download a .zip archive with all the necessary PHP scripts so that the database with all the data can be recreated at a later point in time. The commands needed to recreate the database and fill it are:

    php createDBandTablesScript.php
    php scriptSkeletonAUTO.php

  • 4 Architecture and Implementation

    This chapter describes the overall architecture and implementation details of the DataMockups design tool. First, the choice of technologies is discussed. The overall architecture is presented in Section 4.2. Section 4.3 specifies the details of the design editor implementation. Section 4.4 describes how the element detection works, and Section 4.5 outlines how the schema is formed from the detected elements. Section 4.6 describes the integration with the database generation service.

    4.1 Choice of technologies

    Since DataMockups is a design tool that helps developers and/or designers create data-intensive websites, it was decided to leverage the rendering capabilities of browsers by making DataMockups a web application. As such, the tool makes use of several different web technologies. The server side is implemented in PHP using the CodeIgniter Web Framework¹. The tool runs on an Apache Web Server and makes use of a MySQL database to store the project information.

    The front-end is implemented entirely in HTML5, CSS and JavaScript. It relies heavily on various JavaScript libraries and frameworks, most importantly on Bootstrap² and jQuery³ for the overall design and DOM manipulations.

    Other libraries used are:

    • Font Awesome⁴ to display font icons

    • lodash⁵ as a utility library for some basic JavaScript operations

    • spectrum⁶ to choose colours

    • highlight.js⁷ to handle syntax highlighting

    • interact.js⁸ for the resizing of elements

    • clusterfck⁹ to do hierarchical clustering

    • jqgram¹⁰ to compute the pq-gram tree-edit distance, although it was slightly modified to allow for synchronous calls

    ¹ http://www.codeigniter.com/
    ² http://getbootstrap.com/
    ³ https://jquery.com/
    ⁴ http://fontawesome.github.io/Font-Awesome/

    4.2 Overall architecture

    The overall architecture is a typical client-server architecture and is presented in Figure 4.1. However, most of the functionality is implemented purely on the client-side. The server is connected to a MySQL database that stores all of the project data (i.e. which projects exist, general information about the projects and where their files are located). The server is responsible for: a) creating and storing projects; b) saving the designed HTML pages with their CSS; c) creating new HTML pages to be designed; d) exporting all HTML and CSS files; and e) handling all communication between the client and the database generation service as well as creating and populating the database (more details are provided in Section 4.6).

    Figure 4.1: Overall architecture (components shown: Clients, DataMockups Server, DataMockups DB and DB-API-Generator)

    ⁵ https://lodash.com/
    ⁶ https://bgrins.github.io/spectrum/
    ⁷ https://highlightjs.org/
    ⁸ http://interactjs.io/
    ⁹ http://harthur.github.io/clusterfck/
    ¹⁰ https://github.com/hoonto/jqgram


    The client is responsible for: a) designing the page in the design editor (Section 4.3); b) element detection through clustering (Section 4.4); and c) building the schema and inferring the relationships (Section 4.5).

    4.3 Design editor

    The design editor supports many different operations, many of which were non-trivial to implement. The next three sections describe some of the more challenging and interesting implementation details of the design editor. However, it is not an exhaustive list of all implementation steps needed for the design editor to work.

    4.3.1 Drag and drop positioning

    The drag and drop (DnD) component is implemented using the new native HTML5 DnD support. Unfortunately, this also means that it is not yet supported by all browsers (namely not supported by all mobile browsers and only partially by Internet Explorer). Since it is not feasible to design a website on a mobile phone or tablet, it does not matter that the mobile browsers do not support it. All of the features needed for this project are supported by Internet Explorer version 10 or higher, so the partial support is considered sufficient.

    The interesting part about the positioning is not which elements should be draggable or what happens when the users start dragging them, but instead deciding where elements can be positioned, how to show this to the users and determining how exactly the dragged element will be inserted into the HTML structure.

    As noted in Section 3.2.1, one of the goals of the design editor is to create valid HTML5 code. This means that the elements cannot be positioned inside any arbitrary element (e.g. a <div> element cannot be contained within a <p> element). Based on the W3C Working Draft of HTML 5.1¹¹, an ElementValidator library was created. This library has a few public functions:

    • isValidCombination(child, parent): Checks whether the child element can be contained within the parent element. Both child and parent have to be HTML elements.

    • isPhrasingContent(el): Checks whether the element is one of the phrasing content elements. Phrasing content elements are elements that mark up text (e.g. <em>, <strong>, <span>). el can be a node name or an HTML element.

    • isHeading(el): Checks whether the element is a heading (one of <h1> - <h6>). el can be a node name or an HTML element.

    The only function used in the drag and drop component is the isValidCombination function. However, the other two functions (isPhrasingContent and isHeading) are used in different parts of the application.
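To make the role of these three functions more concrete, the following sketch shows what such a validator could look like. It is not the thesis' ElementValidator: the content model is a heavily simplified subset of the HTML rules chosen for this example (lists, phrasing-only parents and a few void elements), and, unlike the original, isValidCombination here also accepts plain node names so that the sketch runs without a browser.

```javascript
// Simplified, hypothetical content model; the real HTML rules are far more detailed.
const HEADINGS = new Set(["h1", "h2", "h3", "h4", "h5", "h6"]);
const PHRASING = new Set(["a", "b", "br", "code", "em", "i", "img", "small", "span", "strong"]);
const VOID = new Set(["br", "hr", "img", "input"]); // elements that cannot have children
const LIST_PARENTS = new Set(["ul", "ol"]);
// Parents that accept only phrasing content in this simplified model.
const PHRASING_ONLY_PARENTS = new Set(["p", "b", "code", "em", "i", "small", "span", "strong", ...HEADINGS]);

// Accepts a node name ("P") or an element-like object ({ nodeName: "P" }).
function nodeNameOf(el) {
  return (typeof el === "string" ? el : el.nodeName).toLowerCase();
}

function isHeading(el) {
  return HEADINGS.has(nodeNameOf(el));
}

function isPhrasingContent(el) {
  return PHRASING.has(nodeNameOf(el));
}

function isValidCombination(child, parent) {
  const c = nodeNameOf(child);
  const p = nodeNameOf(parent);
  if (VOID.has(p)) return false;
  // List items and list containers only combine with each other.
  if (c === "li" || LIST_PARENTS.has(p)) return c === "li" && LIST_PARENTS.has(p);
  if (PHRASING_ONLY_PARENTS.has(p)) return PHRASING.has(c);
  return true; // everything else is allowed in this sketch
}

console.log(isValidCombination("div", "p")); // false: a <div> may not sit inside a <p>
console.log(isValidCombination({ nodeName: "EM" }, { nodeName: "P" })); // true
```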

    ¹¹ http://www.w3.org/TR/html51/, last accessed September 9, 2015

    When users drag one of the available elements (drag element) over an element in the design window (target element), a purple overlay appears (see Figure 4.2). This overlay can be positioned over different parts of the target element: left, right, top, bottom and in some cases inner left, outer left, inner top and outer top.

    Figure 4.2: Purple overlay when dragging

    The position of the overlay is determined by two factors. One factor is the current position of the mouse with respect to the target element. The other is whether the drag and target element are valid HTML combinations. This is computed with two calls to the isValidCombination function of the ElementValidator library: one with the drag element and the target element and the other with the drag element and the target element’s parent as arguments. Table 4.1 shows how valid combinations and mouse position combine to achieve each of the overlay positions.

    Valid combinations   Mouse position (x, y)                     Overlay position
    false                x < 25% w                                 left
    false                x > 75% w                                 right
    false                25% w < x < 75% w, y < 50% h              top
    false                25% w < x < 75% w, y > 50% h              bottom
    true                 x < 10% w                                 outer left
    true                 10% w < x < 25% w                         inner left
    true                 75% w < x < 90% w                         inner right
    true                 x > 90% w                                 outer right
    true                 25% w < x < 75% w, y < 20% h              outer top
    true                 25% w < x < 75% w, 20% h < y < 50% h      inner top
    true                 25% w < x < 75% w, 50% h < y < 80% h      inner bottom
    true                 25% w < x < 75% w, y > 80% h              outer bottom

    Table 4.1: Position of the overlay. (x, y) are the coordinates of the mouse with respect to the target element; w is the width of the target element and h is the height of the target element.
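The mapping in Table 4.1 can be expressed as a small pure function. The sketch below is an illustration, not the tool's code: the function name `overlayPosition` and the `bothValid` flag are assumptions, the inner right and inner bottom rows are read as 75% w < x < 90% w and 50% h < y < 80% h, and values that fall exactly on a boundary (which the table leaves open) are assigned to the neighbouring region further towards the centre or the bottom.

```javascript
// x, y: mouse coordinates relative to the target element; w, h: its width and height.
// bothValid: hypothetical flag for the "Valid combinations" column of Table 4.1.
function overlayPosition(x, y, w, h, bothValid) {
  const rx = x / w; // horizontal position as a fraction of the width
  const ry = y / h; // vertical position as a fraction of the height

  if (!bothValid) {
    if (rx < 0.25) return "left";
    if (rx > 0.75) return "right";
    return ry < 0.5 ? "top" : "bottom";
  }

  if (rx < 0.1) return "outer left";
  if (rx < 0.25) return "inner left";
  if (rx > 0.9) return "outer right";
  if (rx > 0.75) return "inner right";
  if (ry < 0.2) return "outer top";
  if (ry < 0.5) return "inner top";
  if (ry > 0.8) return "outer bottom";
  return "inner bottom";
}

console.log(overlayPosition(5, 50, 100, 100, true));  // "outer left"
console.log(overlayPosition(50, 60, 100, 100, true)); // "inner bottom"
console.log(overlayPosition(50, 10, 100, 100, false)); // "top"
```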

    The users can drop the drag element whenever an overlay is visible. If they try to drop an element when no overlay is visible, it is invalid and nothing happens. If they drop the element on an overlay, the positioning algorithm described in Algorithm 4.1 determines how the element is inserted into the HTML document.


    Algorithm 4.1 Positioning algorithm

    function POSITIONELEMENT(dragElement, targetElement, side)
        parentValidCombo ← isValidCombo(dragElement, targetElement.parent)
        insideFlex ← targetElement.parent has display set to flex
        switch side do
            case top-outerOnly or outerTop
                if parentValidCombo and not insideFlex then
                    targetElement.before(dragElement)
                else
                    POSITIONELEMENT(dragElement, targetElement.parent, side)
                end if
            case bottom-outerOnly or outerBottom
                if parentValidCombo and not insideFlex then
                    targetElement.after(dragElement)
                else
                    POSITIONELEMENT(dragElement, targetElement.parent, side)
                end if
            case top-innerOnly or innerTop
                targetElement.prepend(dragElement)
            case bottom-innerOnly or innerBottom
                targetElement.append(dragElement)
            case left-outerOnly or outerLeft
                if not insideFlex then
                    wrap targetElement in flex-container
                end if
                targetElement.before(dragElement)
            case right-outerOnly or outerRight
                if not insideFlex then
                    wrap targetElement in flex-container
                end if
                targetElement.after(dragElement)
            case left-innerOnly or innerLeft
                if targetElement is not flex-container then
                    wrap targetElement in element of same type with class flex-container
                end if
                targetElement.prepend(dragElement)
            case right-innerOnly or innerRight
                if targetElement is not flex-container then
                    wrap targetElement in element of same type with class flex-container
                end if
                targetElement.append(dragElement)
        end switch
    end function


    In Algorithm 4.1, side values are the values as determined by the position of the overlay, except for elements where valid combinations are false. For those elements, the overlay position is appended with -innerOnly or -outerOnly, depending on whether the drag element and the target element form a valid combination (for the former) or whether the drag element and the target element’s parent form a valid combination (for the latter). A flex-container is a <div class="flex-container"> element where the flex-container class means that the display is set to flex. isValidCombo is the ElementValidator.isValidCombination function described previously. The insertion of the elements is done with the jQuery before and after or prepend and append functions. The first two add the drag element before or after the target element, respectively, and the last two add the drag element as the first or last child element of the target element.

    An example of the positioning is illustrated in Code Snippets 4.1 and 4.2. For clarity and readability, only the content of the <body> is displayed and the formatting has been modified. The user starts with a very plain page (as shown in Code Snippet 4.1) and wants to add a menu to the left of the paragraph. To do this, the user selects the unordered list from the list of drag elements and drags it on to the page. When the left overlay appears over the paragraph, the user drops the element. Based on the valid combinations and the mouse position, the side value is left-outerOnly. The positioning algorithm (see Algorithm 4.1) first checks whether the target element (in this case the <p> element) is already inside a flex-container. Since the element is not in a flex container (its parent is the <body> element), the target element is wrapped in a new flex-container. Then the drag element (the <ul> element) is inserted before the <p> element inside the flex-container <div>.

    <h1>Title</h1>
    <p>This is a sample page and here is a general description of the page.</p>

    Code Snippet 4.1: Sample page

    <h1>Title</h1>
    <div class="flex-container">
      <ul>
        <li>An item</li>
        <li>Another item</li>
        <li>Yet another thing</li>
      </ul>
      <p>This is a sample page and here is a general description of the page.</p>
    </div>

    Code Snippet 4.2: Sample page after insertion
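The example above can be replayed with a small, browser-independent sketch of the positioning algorithm. Several things here are assumptions made for illustration: `TreeNode` is a stand-in for DOM elements (the tool itself works on real elements via jQuery), its `flex` flag plays the role of the flex-container class, the validity check is passed in as a callback, the target is assumed to have a parent, and only the top, bottom and outer left/right cases are covered.

```javascript
// Minimal stand-in for DOM elements; `flex` models display: flex on a container.
class TreeNode {
  constructor(tag, flex = false) {
    this.tag = tag;
    this.flex = flex;
    this.parent = null;
    this.children = [];
  }
  insertAt(index, child) {
    child.parent = this;
    this.children.splice(index, 0, child);
  }
  prepend(child) { this.insertAt(0, child); }
  append(child) { this.insertAt(this.children.length, child); }
  before(node) { this.parent.insertAt(this.parent.children.indexOf(this), node); }
  after(node) { this.parent.insertAt(this.parent.children.indexOf(this) + 1, node); }
  // Replaces this node by a new flex container that holds it.
  wrapInFlexContainer() {
    const wrapper = new TreeNode("div", true);
    this.before(wrapper);
    this.parent.children.splice(this.parent.children.indexOf(this), 1); // detach this node
    wrapper.append(this);
    return wrapper;
  }
}

function positionElement(drag, target, side, isValidCombo) {
  const parentValid = isValidCombo(drag, target.parent);
  const insideFlex = target.parent.flex;
  switch (side) {
    case "top-outerOnly":
    case "outerTop":
      if (parentValid && !insideFlex) target.before(drag);
      else positionElement(drag, target.parent, side, isValidCombo);
      break;
    case "bottom-outerOnly":
    case "outerBottom":
      if (parentValid && !insideFlex) target.after(drag);
      else positionElement(drag, target.parent, side, isValidCombo);
      break;
    case "top-innerOnly":
    case "innerTop":
      target.prepend(drag);
      break;
    case "bottom-innerOnly":
    case "innerBottom":
      target.append(drag);
      break;
    case "left-outerOnly":
    case "outerLeft":
      if (!insideFlex) target.wrapInFlexContainer();
      target.before(drag);
      break;
    case "right-outerOnly":
    case "outerRight":
      if (!insideFlex) target.wrapInFlexContainer();
      target.after(drag);
      break;
    default:
      throw new RangeError(`side not covered by this sketch: ${side}`);
  }
}

// Compact textual view of a tree, e.g. "body(h1,p)".
function outline(node) {
  const label = node.flex ? `${node.tag}.flex` : node.tag;
  return node.children.length ? `${label}(${node.children.map(outline).join(",")})` : label;
}

// Replay of the example: drop a <ul> on the left overlay of the paragraph.
const page = new TreeNode("body");
const paragraph = new TreeNode("p");
page.append(new TreeNode("h1"));
page.append(paragraph);
positionElement(new TreeNode("ul"), paragraph, "left-outerOnly", () => true);
console.log(outline(page)); // body(h1,div.flex(ul,p))
```

Dropping a further element with side outerTop on the paragraph would now recurse to the flex container, because the paragraph's parent has display set to flex, and place the element before the container instead.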

    4.3.2 WYSIWYG

    Creating a WYSIWYG editor is a problem that has been solved many times, both commercially and with open-source products. The aim of this project’s editor is to have a simple, light-weight solution that has styling options that can be applied to any type of HTML element in the document and not only on selected elements. A solution that might seem very well-suited for this task is the Froala editor¹², but it is only available with a commercial license. Other open-source editors, such as NicEdit¹³ and CKEditor¹⁴, were evaluated for this project, but rejected for various reasons (e.g. lack of visual appeal, too heavy).

    The editor should also support different scopes (see Section 3.2.4), such as being able to change the style for the HTML element (element), all elements with the same class (class) or all of the elements of that type (whole page). None of the editors that were evaluated supported that functionality. The WYSIWYG functionality for the design tool was thus implemented using an existing library (JSS¹⁵), which had to be modified in many respects, and a new library created specifically for this tool (i.e. YAWE, described below).

    JSS

    JSS is a simple JavaScript library to retrieve, set and delete CSS stylesheet rules, both for inline style tags and the relevant external stylesheets. The original library supports the following features:

    • setting properties based on a selector
    • getting individual rules by selector (not necessarily set via JSS)
    • retrieving all rules that were set via JSS
    • removing individual rules or all rules that were set via JSS

    In order to support the needed functionality for this project, the library was modified and extended to support the following features:

    • applying all functions to a different document (e.g. iframe)
    • setting, getting and removing individual properties by name
    • removing rules that were not added by JSS
    • providing a function to export all of the rules, added by JSS and otherwise, so that they can be saved in a *.css file
    • adding support for retrieving shorthand rules (e.g. border, margin)

    Unfortunately, the JSS library does not support media queries at this time. This means that it is not possible to set or retrieve any @media CSS rules. Originally the intention was to extend the library to support them, but due to time restrictions this was not possible.

    The modified JSS library was used for the YAWE library described below, for the other styling options in the Editing & Styling pane as well as for the custom CSS rules. Additionally, the export feature was used to save the designed mockups with all of their styling in HTML and CSS files.

    12 https://www.froala.com/wysiwyg-editor
    13 http://nicedit.com/index.php
    14 http://ckeditor.com/
    15 https://github.com/Box9/jss


    YAWE - Yet Another WYSIWYG Editor

    The YAWE library provides a pop-up with some basic editing options (see Figure 3.5). Depending on the scope of the styling (i.e. selection, element, class and whole page), the library implements the styling in different ways and different functionalities are available.

    For the selection scope, the library uses document.execCommand(commandName, defaultUI, value)¹⁶ to style the selected text or to insert lists, images and links. In order to make use of the command, the designMode of the HTML document has to be set to on. This is done every time the users are in the Editing & Styling pane and click on an element. Normal CSS rules are used for the other scopes, either inline for element or in stylesheets for class and whole page. These rules are then added to the stylesheets by the JSS library.

    Functionality             document.execCommand      CSS Rule
                              (commandName, value)
    Bold                      bold                      font-weight: bold
    Italic                    italic                    font-style: italic
    Underline                 underline                 text-decoration: underline
    Subscript                 subscript                 vertical-align: sub
    Superscript               superscript               vertical-align: super
    Line-through              strikeThrough             text-decoration: line-through
    Font                      fontName,                 font-family:
    Font size                 fontSize,                 font-size:
    Alignment: left           justifyLeft               text-align: left
    Alignment: justified      justifyFull               text-align: justify
    Alignment: right          justifyRight              text-align: right
    Alignment: center         justifyCenter             text-align: center
    Add link                  createLink,               n/a
    Remove link               unlink                    n/a
    Indent                    indent                    n/a
    Outdent                   outdent                   n/a
    Undo                      undo                      n/a
    Redo                      redo                      n/a
    Insert image              insertImage,              n/a
    Insert horizontal line    insertHorizontalRule      n/a
    Make heading              heading,                  n/a
    Insert paragraph          insertParagraph           n/a
    Insert ordered list       insertOrderedList         n/a
    Insert unordered list     insertUnorderedList       n/a

    Table 4.2: Styling commands provided by YAWE

    16 https://developer.mozilla.org/en-US/docs/Web/API/Document/execCommand


    All of the YAWE functionalities are listed in Table 4.2, along with the arguments for document.execCommand and the corresponding CSS rule. If a particular functionality is only available for the selection scope, the CSS rule is n/a (i.e. not applicable).
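How the four scopes map to different styling mechanisms can be sketched as a small lookup. This is a simplified illustration, not YAWE's code: stylingTarget is a hypothetical function, the scope strings are taken from the text, and the element is modelled as a plain object with a tagName and a single className.

```javascript
// Decide how a style change would be applied for a given scope.
// Assumption: `element` is a plain object { tagName, className } with
// exactly one class name; a real DOM element may carry several.
function stylingTarget(scope, element) {
  switch (scope) {
    case 'selection':
      // Styled through document.execCommand on the current selection
      return { mechanism: 'execCommand' };
    case 'element':
      // Styled inline on this one element
      return { mechanism: 'inline' };
    case 'class':
      // Stylesheet rule for all elements with the same class
      return { mechanism: 'stylesheet', selector: '.' + element.className };
    case 'whole page':
      // Stylesheet rule for all elements of the same type
      return { mechanism: 'stylesheet', selector: element.tagName.toLowerCase() };
    default:
      throw new Error('Unknown scope: ' + scope);
  }
}

var sample = { tagName: 'P', className: 'intro' };
var classTarget = stylingTarget('class', sample);
var pageTarget = stylingTarget('whole page', sample);
```

For the sample element, the class scope yields the selector .intro and the whole page scope yields the selector p; in the tool, such stylesheet rules would then be passed to the modified JSS library.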

    The specification of document.execCommand is currently only a working draft¹⁷, so the implementation across browsers is not completely consistent and some unexpected behaviour occasionally occurs. The created HTML markup is not always clean or easy to read. Creating a perfect WYSIWYG editor is out of scope for the present project but could be the objective for future versions of the design editor.

    4.3.3 Integration with the DeepDesign Chrome extension

    In order to integrate the DeepDesign Chrome extension [25] with the design editor, the Chrome extension had to be modified. The previous version of the extension did not save the HTML markup or the styling of the found elements when Save to storage was clicked. It only stored the extracted text from those elements. Thus, the extension had to be adapted to allow for another field in the internal storage mechanism of the extension to save the matched HTML elements with their complete markup. Before the elements could be saved, they had to be converted into a usable format that was independent of the original site. To accomplish this, a new function called collectMatchedHTML (see Code Snippet 4.3) was created in DeepDesign.

    The function takes as input the root element of a matched data entity. The first step consists of computing the styling for each element and saving it as a data attribute (data-styling), so that users can import the elements with the same styling as in the original site. The DeepDesign-specific CSS rules are removed before the styling is saved and then reapplied with the pre-existing extension functions (Tools.filterToolCSSClasses(this) and Tools.reapplyCSSClasses(this, classes)). The CSS rules whose values correspond to the default values are not stored, because enumerating every possible CSS property and its value for each HTML element would lead to very large elements having to be stored and the code would become practically unreadable. This is accomplished by comparing the computed CSS value to the default value. All of the default values can be found in a list that was previously compiled from the CSS specifications. Additionally, the browser-specific CSS rules (i.e. rules containing -webkit) are not saved either. All of the processing is done only for elements that either have a parent or a child element that is labelled (i.e. has the class Constants.classes.clicked), because those are the only ones that are relevant.
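The filtering of default values and browser-specific rules described above can be illustrated with a small, browser-independent sketch. It is not the extension's code: filterStyling is a hypothetical function, and the computed values and the list of defaults are passed in as plain objects with made-up sample values.

```javascript
// Keep only the properties worth storing: drop browser-specific rules
// (names containing -webkit) and properties whose computed value equals
// the known default value.
function filterStyling(computed, defaults) {
  var result = {};
  Object.keys(computed).forEach(function (prop) {
    if (prop.indexOf('-webkit') !== -1) {
      return; // browser-specific rule, not saved
    }
    var hasDefault = Object.prototype.hasOwnProperty.call(defaults, prop);
    if (hasDefault && defaults[prop] === computed[prop]) {
      return; // same as the default, not saved
    }
    result[prop] = computed[prop];
  });
  return result;
}

// Sample values chosen purely for illustration.
var computedStyle = {
  'color': 'rgb(255, 0, 0)',
  'display': 'block',
  'font-weight': 'normal',
  '-webkit-font-smoothing': 'antialiased'
};
var defaultStyle = {
  'color': 'rgb(0, 0, 0)',
  'display': 'inline',
  'font-weight': 'normal'
};
var stored = filterStyling(computedStyle, defaultStyle);
```

Here only color and display survive: font-weight matches its default and the -webkit property is browser-specific, so neither would be written to the data-styling attribute.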

    Once the styling is saved in the data attribute of the original HTML elements of the page, all of the elements are cloned. It is necessary to clone the HTML elements because otherwise the extension would modify the HTML elements on the original page. Again, only the elements that either have a parent or a child element that is labelled are kept. To reduce the size of the elements, all JavaScript event handlers are removed from the elements, since in most cases they would not work in the other page anyway. The CSS rules that are stored in the data attribute data-styling are extracted and added as inline styling rules. Finally, the labels are added to each element in the data attribute data-cluster. The main element, which does not have a name in the extension, is given the name unknown. As described in Section 4.4.3, this is the same mechanism that the DataMockups tool uses to save the labels. The last step in the function is to save the cloned HTML in the internal storage. The processing by the function collectMatchedHTML is done for every data record that is found on the original page.

    17 https://w3c.github.io/editing/execCommand.html

    function collectMatchedHTML(rootNode) {
        var original = jQuery(rootNode);

        // Save the computed styling of every relevant element, i.e. every
        // element that is inside, or contains, a labelled element
        original.find('*').filter(function(index, element) {
            return jQuery(this).closest('.' + Constants.classes.clicked).length !== 0 ||
                   jQuery(this).find('.' + Constants.classes.clicked).length !== 0;
        }).each(function() {
            var classes = Tools.filterToolCSSClasses(this);
            var cssValues = css(this);
            Tools.reapplyCSSClasses(this, classes);
            var properties = Output.convertToCSSObject(cssValues.keys, cssValues.vals);
            jQuery(this).data('styling', properties);
        });

        // Clone with data so that the original page is left untouched
        var elts = original.clone(true);
        original.find('*').removeData('styling');

        // Remove useless DOM elements
        elts.find('*').filter(function(index, element) {
            return jQuery(this).closest('.' + Constants.classes.clicked).length === 0 &&
                   jQuery(this).find('.' + Constants.classes.clicked).length === 0;
        }).remove();

        // Remove all Javascript handlers
        var events = { 'onclick': null, 'ondblclick': null,
                       'onmousedown': null /* , ... */ };
        elts.prop(events);

        // Apply the saved styling inline and add the cluster labels
        elts.find('*').each(function() {
            Tools.filterToolCSSClasses(this);
            var jThis = jQuery(this);
            jThis.css(jThis.data('styling'));
            jThis.attr('data-cluster', jThis.data(Constants.strings.datalabel));
        });

        elts.attr('data-cluster', 'unknown');
        this.addToMatchedHTMLOutput(elts.get(0));
    }

    Code Snippet 4.3: Some of the modifications to the DeepDesign extension

    Because of the above-mentioned changes, the new version of the Chrome extension can now be used to import content into the design editor. When the user clicks Send to application, the DeepDesign drag element is shown and the user can drag it into the design window. However, when the drag element is dropped, the HTML elements from the extension are inserted instead of the contents of the drag element. Since the styling is saved inline with the elements, the contents appear just as in the original document, with the exception of any styling rules that are not set for the transferred elements. In this case the global styling rules of the design page (e.g. some margins or padding settings) apply.

    4.4 Element detection

    Several steps are needed to recognize possibly relevant elements in a designed mockup. With some help from the user, the element detection component first selects the interesting content area of the designed page, then clusters all of the elements, and finally gives the user the opportunity to name the interesting clusters and discard the useless ones. Since many of the approaches described in Section 2.5 require more than one page as input, they could not be used for this tool. The approach used in this project has many similarities with Murolo and Norrie’s approach [25]. However, instead of starting with the annotation of one data record, the approach for this tool requires less human input for the clustering and labelling. Just like DeepDesign, the tool uses hierarchical clustering and the pq-gram distance to find similar elements, but in a different way. The whole process is described in more detail in the next three subsections.

    4.4.1 Selecting elements

    The tool initially requires users simply to select a general area that contains all of the records. This minimizes user input while still excluding unnecessary elements such as the title of the page or the navigation menu. Making the users choose an area usually results in a decreased number of DOM elements that need to be clustered, which in turn leads to a shorter runtime. All of the elements that are completely contained in the selected area as well as their parent elements are included in the selection set (as illustrated in Figure 4.3). Including the parent elements ensures that all of the desired elements are considered, because it is not always obvious on the web page how high or wide individual elements are.

    All of the elements that are not visible on the page are excluded from the selection set of elements. An element is considered visible if it has a width or height that is greater than zero and it does not have the CSS properties visibility: hidden or opacity: 0. This functionality can be turned off by setting EXCLUDE_INVISIBLE to false in the source code of the application.

    To reduce the number of elements even further, all DOM elements that have exactly one child node and no text outside of that child node are excluded. Since ultimately the content of the web page and not the exact DOM structure is of interest, this exclusion reduces the runtime with no negative impact on accuracy.
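The two exclusion rules can be sketched as simple predicates. This is only an illustration under assumptions: elements are modelled as plain objects with hypothetical fields (width, height, visibility, opacity, children, ownText), whereas the tool itself inspects real DOM elements; only the EXCLUDE_INVISIBLE name comes from the text.

```javascript
var EXCLUDE_INVISIBLE = true;

// Visible: width or height greater than zero, and neither
// visibility: hidden nor opacity: 0.
function isVisible(el) {
  var hasSize = el.width > 0 || el.height > 0;
  var hidden = el.visibility === 'hidden' || Number(el.opacity) === 0;
  return hasSize && !hidden;
}

// Pure wrapper: exactly one child and no text of its own outside it.
function isPureWrapper(el) {
  return el.children.length === 1 && el.ownText.trim() === '';
}

function keepForClustering(el) {
  if (EXCLUDE_INVISIBLE && !isVisible(el)) {
    return false;
  }
  return !isPureWrapper(el);
}

var leaf = { width: 120, height: 20, visibility: 'visible', opacity: '1',
             children: [], ownText: 'Some text' };
var wrapper = { width: 120, height: 20, visibility: 'visible', opacity: '1',
                children: [leaf], ownText: '  ' };
var transparent = { width: 120, height: 20, visibility: 'visible', opacity: '0',
                    children: [], ownText: 'Hidden text' };
var kept = [leaf, wrapper, transparent].filter(keepForClustering);
```

Of the three sample elements, only leaf is kept: wrapper adds nothing beyond its single child and transparent has opacity 0.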


    Figure 4.3: Selected area for element detection

    4.4.2 Hierarchical clustering and pq-gram edit distance

    Once the elements are chosen, a hierarchical clustering algorithm with a custom distance metric is applied. The custom distance metric is the pq-gram edit distance as described in [16] (see Section 2.4). This distance metric was chosen mainly for performance reasons, since in most cases there are a lot of DOM elements that need to be clustered.

    In this project, the data are the DOM elements and the labels are the node names (e.g. div, p, span). The children of each node are the children of the DOM elements. In addition to the child elements, each HTML class attribute of the element is added as an additional child. This allows CSS properties to be taken into account without having all of the CSS rules added individually, thus keeping runtime to a minimum.
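The labelled tree that serves as input to the distance metric can be sketched as follows. This is a simplified model, not the tool's code: toLabelledTree is a hypothetical function, elements are plain objects with tag, classes and children fields, and using the class name itself as the label of the extra child is an assumption.

```javascript
// Convert an element into a labelled tree: the label is the node name,
// the children are the child elements plus one extra leaf per class.
function toLabelledTree(el) {
  var children = el.children.map(toLabelledTree);
  el.classes.forEach(function (cls) {
    children.push({ label: cls, children: [] });
  });
  return { label: el.tag, children: children };
}

var element = {
  tag: 'div',
  classes: ['card'],
  children: [
    { tag: 'p', classes: [], children: [] },
    { tag: 'span', classes: ['price', 'bold'], children: [] }
  ]
};
var tree = toLabelledTree(element);
```

In this example the div node ends up with three children (p, span and the class leaf card), and the span node carries two class leaves, so two elements with the same tags but different classes produce different trees.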

    The existing JavaScript implementation¹⁸ is used to compute the pq-gram edit distance. However, the code has been slightly modified to allow synchronous calls (the existing code only allowed asynchronous calls) so that it can be used with the clusterfck¹⁹ library. The p and q values are left at the default values (p = 2, q = 3).

    18 https://github.com/hoonto/jqgram
    19 http://harthur.github.io/clusterfck/

    The clusterfck library is a JavaScript library for hierarchical clustering. The function for the clustering, clusterfck.hcluster(items, metric, linkage, threshold), takes four arguments. The items are all of the selected DOM elements as explained in Section 4.4.1. The pq-gram distance is used for the metric and the linkage criterion is set to average, meaning the distance between two clusters is the average of the distances between all of the items in each cluster. The threshold is a stopping criterion for the algorithm and is set to 0.8. This means that when all of the clusters are more than threshold apart from each other, the clustering is stopped and the current set of clusters is returned. The library returns a list of JavaScript objects where each item represents a cluster in tree form. Each non-leaf node can have a left and a right subtree. The leaf nodes contain the actual elements.
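To make the roles of the average linkage and the threshold concrete, the following is a minimal agglomerative clustering sketch. It is not the clusterfck implementation: clusterItems and averageLinkage are hypothetical functions, the result is a flat list of arrays rather than trees, and numbers with an absolute-difference metric stand in for DOM elements and the pq-gram distance.

```javascript
// Average linkage: mean of the pairwise distances between two clusters.
function averageLinkage(a, b, metric) {
  var total = 0;
  a.forEach(function (x) {
    b.forEach(function (y) { total += metric(x, y); });
  });
  return total / (a.length * b.length);
}

// Repeatedly merge the two closest clusters; stop once all clusters
// are more than `threshold` apart from each other.
function clusterItems(items, metric, threshold) {
  var clusters = items.map(function (item) { return [item]; });
  while (clusters.length > 1) {
    var best = { distance: Infinity, i: -1, j: -1 };
    for (var i = 0; i < clusters.length; i++) {
      for (var j = i + 1; j < clusters.length; j++) {
        var d = averageLinkage(clusters[i], clusters[j], metric);
        if (d < best.distance) {
          best = { distance: d, i: i, j: j };
        }
      }
    }
    if (best.distance > threshold) {
      break;
    }
    var merged = clusters[best.i].concat(clusters[best.j]);
    clusters.splice(best.j, 1);         // remove the later cluster first
    clusters.splice(best.i, 1, merged); // replace the earlier one
  }
  return clusters;
}

var absDiff = function (a, b) { return Math.abs(a - b); };
var result = clusterItems([0, 0.1, 5, 5.2, 9], absDiff, 0.8);
```

With the threshold of 0.8, the values 0 and 0.1 are merged, as are 5 and 5.2, while 9 stays on its own because its average distance to every other cluster exceeds the threshold.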

    Each of the JavaScript tree objects is then turned into an array through a recursive function, createCluster. The function traverses the tree and adds all of the leaf nodes to the list, since the distances of the elements within the same cluster are not important for the element detection.
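A recursive flattening of this kind could look as follows. The function name is taken from the text, but the signature and the node layout (left and right subtrees, with the element stored under value in leaf nodes) are assumptions made for illustration.

```javascript
// Collect the elements of all leaf nodes of a cluster tree into an array.
// Assumed node layout: inner nodes have `left` and/or `right`,
// leaf nodes hold the element in `value`.
function createCluster(node, items) {
  items = items || [];
  if (node.left || node.right) {
    if (node.left) { createCluster(node.left, items); }
    if (node.right) { createCluster(node.right, items); }
  } else {
    items.push(node.value);
  }
  return items;
}

var clusterTree = {
  left: { value: 'a' },
  right: {
    left: { value: 'b' },
    right: { value: 'c' }
  }
};
var flat = createCluster(clusterTree);
```

The nesting of the tree, which encodes the order in which elements were merged, is discarded; only the membership of the cluster is kept.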

    Once the array of clusters is complete, each cluster is assigned an id (a number), a name, a colour and a status (see Code Snippet 4.4). Additionally, the cluster elements are saved as items and the current background colour of the cluster elements is stored as prevColor. Initially the id of each cluster is its position in the array, the name is Cluster <id>, the colour is assigned arbitrarily and the status for all clusters is detected. The possible values for the cluster status are: detected, chosen, merged, ignored and removed.

    cluster = {
        'id': <id>,
        'name': 'Cluster ' + <id>,
        'color': <color>,
        'prevColor': $(items).css('background-color'),
        'items': items,
        'status': 'detected|chosen|merged|ignored|removed'
    };

    Code Snippet 4.4: Cluster attributes

    4.4.3 Cluster naming and selection

    As soon as all of the clusters are identified, they are displayed to the users. The users can then name the clusters that they want to keep and discard clusters that are irrelevant. The clusters are ordered according to the depth of the elements in the document.

    Every time a cluster is named, a check is first performed as to whether a cluster with that name already exists and, if it does, the two clusters are merged. The status of the cluster is changed from detected to chosen or merged. Afterwards, a data attribute with the cluster name (i.e. the data-cluster attribute) is added to all elements of the cluster. This modification of the DOM elements means that the clustering can also be preserved between sessions as long as the page is saved. This is particularly useful if users want to take a break between clustering and creating a schema and database. As long as the page is saved, the next step can be done at any point later in time and the users can refresh or navigate away from the pages without losing their work.

    When elements that have been clustered previously are clustered again, the former cluster association is not taken into account during the clustering. However, rather than arbitrarily assigning a cluster name, the name from the data attribute is used if it is present in at least one of the cluster items.
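The naming step with its merge check can be sketched on plain cluster objects shaped like Code Snippet 4.4. This is a hypothetical simplification: nameCluster is not taken from the tool, and which of the two clusters receives the merged status and keeps the items is an assumption, since the text does not spell it out.

```javascript
// Name a cluster; if another chosen cluster already carries that name,
// merge the items into it instead (assumed merge behaviour).
function nameCluster(clusters, id, name) {
  var cluster = clusters.filter(function (c) { return c.id === id; })[0];
  var existing = clusters.filter(function (c) {
    return c.id !== id && c.name === name && c.status === 'chosen';
  })[0];
  if (existing) {
    existing.items = existing.items.concat(cluster.items);
    cluster.status = 'merged';
    return existing;
  }
  cluster.name = name;
  cluster.status = 'chosen';
  return cluster;
}

var clusters = [
  { id: 0, name: 'Cluster 0', items: ['a'], status: 'detected' },
  { id: 1, name: 'Cluster 1', items: ['b'], status: 'detected' }
];
nameCluster(clusters, 0, 'Author');               // first use of the name
var target = nameCluster(clusters, 1, 'Author');  // same name, so merge
```

After both calls, cluster 0 is chosen and holds both items, while cluster 1 is marked as merged. In the tool, the chosen name would additionally be written to the data-cluster attribute of every item.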

    If the user decides to discard a cluster, the colouring is removed from the design page and the status of the cluster is changed from detected to discarded. The discarded elements are then ignored in the next step.

    4.5 Schema formation

    After the clustering is done, the database schema (including the relationships) is inferred from the structure of the HTML document and the clusters. This is done in a two-step process for each cluster and happens automatically either after the clustering is done or whenever the Add or Reload buttons in the Schema & Data pane are clicked. First, the cluster is classified as an entity, a relationship or an attribute and then it is stored in the internal interface. This is described in the next two sections. Once the schema is defined, the contents of the elements (i.e. the data) are extracted from the design page and then also stored in the internal interface (Section 4.5.3).

    4.5.1 Cluster classification

    The schema and relationships are generated by retrieving all of the cluster names from the design page. HTML elements that have a name for the data-cluster attribute are called cluster elements. To retrieve the names, the values of the data-cluster attributes of all cluster elements are extracted. For each cluster name, the tool then decides whether it is an entity, an attribute or a relationship. This is determined by four factors: a) whether the element is a child of a cluster element (hasParentCluster); b) whether a cluster element with the same content appears in many different clusters (isSpread); c) whether the element has any siblings with the same cluster name (isRepeated); and d) whether the element has children that are cluster elements (hasChildClusters). Code Snippet 4.5 shows how the four factors are calculated. For all of the functions, name is the name of the cluster and jElements are all of the jQuery elements that belong to that cluster.

    // Check whether the elements with the data-cluster=name have any
    // parent elements that are cluster elements (i.e. have data-cluster
    // attribute)
    function hasParentCluster(name, jElements) {
        var parents = jElements.parents('[data-cluster]:first')
            .not('[data-cluster="' + name + '"]').map(function() {
                return $(this).attr('data-cluster');
            }).get();
        return _.uniq(parents).length !== 0;
    }

    // Check whether the same cluster element content appears more than a
    // certain threshold (0.95) throughout all cluster elements with
    // the same name
    function isSpread(name, jElements) {
        var allContent = jElements.map(function() {
            return $(this).text();
        }).get();
        var uniqueContent = _.uniq(allContent);
        return (uniqueContent.length / allContent.length) < 0.95;
    }

    // Check whether the same cluster element appears more than once
    // in the parent cluster element
    function isRepeated(name, jElements) {
        var siblings = jElements.siblings('[data-cluster="' + name + '"]');
        return siblings.length > 0;
    }

    // Check whether the cluster elements have any child elements that are
    // cluster elements
    function hasChildClusters(name, jElements) {
        return jElements.find('[data-cluster]').length > 0;
    }

    Code Snippet 4.5: Cluster type functions

    The threshold of 0.95 in isSpread was chosen based on a few sample pages. The goal was to allow the same content to appear a couple of times and still be considered an attribute, but if it occurs often enough, it should be considered a relationship and not just an attribute. Initially, a threshold value of 0.7 was chosen. However, in the user study conducted (refer to Section 5.1 for more details), it was discovered that the threshold value was too low. Based on those insights, the threshold was increased to 0.95. This means that content needs to be repeated less often on a page before it is considered a relationship.
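The effect of the two threshold values can be checked with a jQuery-free variant of the ratio computation. This is only an illustration: isSpreadRatio is a hypothetical function that receives the text contents directly and takes the threshold as a parameter, and the sample contents are invented.

```javascript
// Ratio of unique contents to all contents, compared against a threshold.
function isSpreadRatio(contents, threshold) {
  var unique = contents.filter(function (value, index) {
    return contents.indexOf(value) === index;
  });
  return unique.length / contents.length < threshold;
}

// Ten cluster elements, one value occurs twice: 9 unique out of 10 = 0.9
var contents = ['A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'];
var spreadWithOldThreshold = isSpreadRatio(contents, 0.7);
var spreadWithNewThreshold = isSpreadRatio(contents, 0.95);
```

With the initial threshold of 0.7 the single repetition is not enough for the content to count as spread, whereas with 0.95 it is, so the cluster would be treated as a relationship candidate rather than a plain attribute.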

    Figure 4.4 shows how the four factors determine the final classification of the cluster type (entity, attribute or relationship). All of the clusters that are classified as a relationship are automatically entities themselves, but unlike a normal entity, the relationship is recognized as well and then stored separately, as described in the next section.

    Figure 4.4: Classifying clusters (decision tree over hasParentCluster, isSpread, isRepeated and hasChildClusters, leading to the outcomes entity, attribute, 1-1 relationship, 1-N relationship, N-1 relationship and N-N relationship)


    4.5.2 Internal interface

    In order to later be able to integrate the schema and the relationships with the database generation service, they are stored in two separate JavaScript data structures, DataMockups.schema and DataMockups.relationships. The structure of these objects is shown in Code Snippets 4.6 and 4.7. In these snippets, the keys are auto-generated unique identifiers. In a separate look-up object, DataMockups.namesDictionary, the unique identifiers are mapped to the cluster names.

    DataMockups.schema = {
        : {
            "attributes": {
                : ,
                :
            },
            "generated_attributes": {
                : ,
                :
            },
            "relations": {
                : ,
            }
        }
    }

    Code Snippet 4.6:Schema data structure

    DataMockups.relationships = {
        : {
            "m1": {
                "identifier": ,
                "cardinality":
            },
            "m2": {
                "identifier": ,
                "cardinality":
            }
        }
    }

    Code Snippet 4.7:Relationship data structure

    The keys of the schema object (DataMockups.schema) are the entity identifiers. The attributes value comprises the attributes of that entity along with their type. The generated_attributes value comprises the final attributes of that entity along with their type. The reason that there are two values for attrib