
Exploring Knowledge Graphs for Adaptive Learning

Xueqin Cui

A report submitted for the course COMP8755 Individual Computing Project

Supervised by: Dr. Qing Wang, Dr. Zhenchang Xing

The Australian National University

May 2018

© Xueqin Cui 2018

Except where otherwise indicated, this report is my own original work.

Xueqin Cui, 25 May 2018

Acknowledgments

First of all, I would like to thank my supervisors, Dr. Qing Wang and Dr. Zhenchang Xing, for their guidance during the project. I thank them for their continuous support and constructive suggestions, especially when I faced difficulties at the beginning of the second semester. Without their help and assistance, the project could not have been accomplished.

Also, I want to thank the course convener, Prof. Peter Strazdins, for organizing the regular practice meetings and providing helpful advice on presentation and report writing for the Computer Science Individual Project students.

Last but not least, my sincere thanks go to my parents for their love and support all the time. Without their encouragement when I felt upset, I could not have persisted with the project and finally finished it. Also, I want to express my thanks to the new family members, Corgi Toto and Coco. Thank you for providing the funny photos and videos that make me laugh.

Thank you all who helped me with the project and cheered me up.


Abstract

The knowledge graph is a popular way to represent knowledge and, given the current trend of Artificial Intelligence, how to construct a knowledge graph has become a heated topic. Currently, there are two separate lines of work for constructing a knowledge graph: modeling in a formal language and mining from data. In this project, I aim to bring these two lines together and investigate how they could affect each other when constructing a knowledge graph to support adaptive learning.

To model a knowledge graph, I investigated three tools and finally decided to use Neo4j to represent the graph. Starting from the knowledge graph modeled by a human expert, I developed a pipeline to process and model the graph in Neo4j. The knowledge graph modeled by the human expert is highly formalized, but it contains only a limited number of facts and relationships.

I also developed two mining algorithms to construct the knowledge graph, according to different types of data. The first mines the graph from human-selected tags, such as Stack Overflow post tags; tag pairs that are considered frequent are added to the graph. The second uses the headings in a tree-structured webpage (such as Wikipedia) to describe the relationships. Breadth-First Search was used to crawl the website, and a set of anchors was used to filter the information and control the size of the graph. The experiments indicated the correlation between the graph and the data used, as well as the importance of the anchors during the mining process. Compared with the expert-modeled graph, the mined graph was less accurate, but it still included most facts.


Contents

Acknowledgments

Abstract

1 Introduction
    1.1 Background
    1.2 Objectives
    1.3 Contributions
    1.4 Outline

2 Background and Related Work
    2.1 Modeling Tools
    2.2 Mining Algorithms

3 Methodology
    3.1 Modeling and representation
    3.2 Mining from different types of data
        3.2.1 From human-selected tags
        3.2.2 From tree-structure webpages
        3.2.3 Summary

4 Results and Discussion
    4.1 Graph modeled by the domain expert
    4.2 Graph mined from the human-selected tags
    4.3 Graph mined from the tree-structure webpages
        4.3.1 Compare with the expert modeled graph
        4.3.2 Accuracy of the relationship tuples
        4.3.3 Comparison between different anchoring strategy
        4.3.4 Comparison between different sets of anchors
        4.3.5 Use NLP tools for language level understanding

5 Conclusion
    5.1 Future Work

Bibliography

Appendices

Appendix 1: Project Description

Appendix 2: Study Contract

Appendix 3: Artefact Description

Appendix 4: README File

Appendix 5: Expert Modeled Knowledge Graph

List of Figures

1.1 Part of Google’s Knowledge Graph [2].

1.2 Knowledge Graph including the domain knowledge of Entity-Relationship Model.

3.1 Tree structure of the Wikipedia page ’Database’.

4.1 Knowledge graph representation in Neo4j based on the graph provided by Dr. Qing Wang (Appendix 5).

4.2 Knowledge Graph generated from tags of all Stack Overflow posts tagged with ’database’.

4.3 Knowledge Graph generated from tags of all Stack Overflow posts tagged with ’database’ but with dominant node ’database’ ignored.

4.4 Knowledge graph mined from Wikipedia pages with Stack Overflow tags as anchors.

4.5 Knowledge graph mined from Wikipedia pages with facts in the domain expert modeled graph as anchors.

4.6 Part of the knowledge graph mined from Wikipedia pages with facts in the domain expert modeled graph as anchors.

List of Tables

3.1 Example of Neo4j commands to add a node or an edge.

4.1 Example of the relation tuples in the mined graph with Stack Overflow tags as anchors.

4.2 Number of nodes/edges in the crawled graph for each depth.

4.3 Top 10 nodes of the graph mined with Stack Overflow post tags as anchors.

4.4 Top 10 nodes of the graph mined with the facts in the expert-modeled knowledge graph as anchors.

4.5 Example of relation triples retrieved and their labels.

Chapter 1

Introduction

This chapter is an introduction to the project which addresses the background and the objectives of the project.

1.1 Background

A knowledge graph is a knowledge base that stores and represents data in the form of a connected graph. The nodes are the facts, and the edges represent the relationships between the facts. With a knowledge graph, facts are connected more clearly, and it is easier for applications to query them. Given these benefits, knowledge graphs are used by many systems, and a famous example is Google. The knowledge graph was first proposed by Google to improve its search engine. With the knowledge graph, users can quickly look at related data without any further searching [2].

Figure 1.1: Part of Google’s Knowledge Graph [2].

Since the knowledge graph has an advantage in presenting conceptually connected data, it is suitable for use in the area of adaptive learning. Adaptive learning is an educational method in which computers are involved in the learning process to customize the learning experiences of learners. The adaptive learning system adapts the presentation of educational material according to students’ learning needs, as indicated by their responses to questions, tasks and experiences. To model the conceptual ideas, we need some representation of facts and their relationships, and the knowledge graph is what we need. Each fact is represented as a node in the knowledge graph, and the relationships are stored as the edges.

Figure 1.2: Knowledge Graph including the domain knowledge of Entity-Relationship Model.

Currently, there are two ways to construct a knowledge graph. The first is to model it with a formal language; the resulting knowledge graph is formal and precise, but modeling may require a heavy workload from domain experts. The second is to mine the knowledge graph from a large amount of data, such as the massive online resources. This method does not require much work from experts, but since the knowledge graph is generated entirely from the data, the results rely heavily on which data we use, and the graph may be less precise.

Considering that both ways have pros and cons, this project aims to combine the two lines. The knowledge graph could first be mined from the data, with the result then used as a sketch that experts adjust and complement; or, vice versa, we could start from the graph modeled by the domain expert and then extend it by mining the data.

1.2 Objectives

This project aims to explore the potential of knowledge graphs in supporting adaptive learning in the area of databases. The specific objectives are described as follows:

• Develop a modeling technique to represent knowledge graphs based on domain knowledge.

• Develop a data mining technique to identify relevant knowledge and discover relationships between different pieces of knowledge.

• Investigate how knowledge graphs modeled by human experts relate to knowledge graphs discovered from data mining/machine learning techniques.

1.3 Contributions

My work brings the two lines of constructing the knowledge graph together and includes the following contributions:

• Construct the knowledge graph from the human-selected tags.

• Crawl the tree-structure webpages to capture the relations between the facts represented by each page.

• Store and represent the knowledge graph with the modeling and visualization tools.

1.4 Outline

The project report is structured as follows:

• Chapter 1 is an introduction to the project including the background, the objectives and my contributions.

• Chapter 2 talks about the related work of modeling and mining the knowledge graph.

• Chapter 3 lists the methodologies that I used to build and visualize the knowledge graph.

• Chapter 4 presents the knowledge graphs generated and some experiments I did on the graphs.

• Chapter 5 concludes this project and proposes some future work.


Chapter 2

Background and Related Work

This chapter provides a review of the related work on the two lines of constructing a knowledge graph.

2.1 Modeling Tools

There are many modeling tools that I could use for formalizing the knowledge graph, for example, protégé [11], GRAKN.AI [1] and Neo4j [3].

protégé is a free, open-source ontology editor and knowledge management system developed by Stanford University [11]. It provides a formalized way to model the knowledge graph in the scope of ontology.

GRAKN.AI is an open-source, distributed hyper-relational database for knowledge-oriented systems which provides a concept-level schema that fully implements the Entity-Relationship (ER) model [1].

Neo4j is a graph database management system which provides database modeling features in addition to native graph storage and processing [3].

2.2 Mining Algorithms

There are many data mining algorithms used to build a knowledge graph, and [7] and [12] are the two techniques that are most relevant to the project.

TechLand [9] is what Chen, Xing and Han mined from the Stack Overflow post tags. Stack Overflow is an online Q&A forum for questions about computer programming. For each post in Stack Overflow, the author is asked to select at most five tags to describe the question. Considering the reasoning effort people invest when tagging a post, it can be assumed that there is some correlation between the tags attached to the same post. Accordingly, by applying the fast algorithm for mining association rules [4], tag pairs with high frequency are selected. By applying this algorithm, Chen and Xing built the Technology Landscape (TechLand) from Stack Overflow, which gives a big picture of different domains in software engineering. In TechLand, highly related programming concepts are connected to each other and centered on some key concept in that domain.

Zhao et al. [12] also used Stack Overflow as the data source, but this time they used the tag wiki, the high-level description of each tag. They developed a technique called HDSKG [12] to retrieve a domain-specific knowledge graph from the content of web pages. For every tag in Stack Overflow there is a web page, called the tag wiki; for example, https://stackoverflow.com/tags/database/info provides a wiki for the tag ’database’. A tag wiki usually provides a description of that specific tag and sometimes includes some technical details as well. After the web page is pre-processed into text, NLP tools are used to chunk the text and extract the candidate relation triples. Human annotators then label the triples to indicate what kind of triples are needed. The human effort invested in the labeling is a big limitation of this technique, so the final step is to train a semi-supervised SVM classifier. The semi-supervised learning starts with a small amount of labeled data, and unlabeled data are then added to obtain their labels. HDSKG constructs a knowledge graph with more detailed relationships between the nodes, which could help with reasoning on the knowledge graph.

Chapter 3

Methodology

In this chapter, I will present the techniques I developed regarding the two lines to construct a knowledge graph in the domain of database knowledge.

3.1 Modeling and representation

As mentioned in Chapter 2, there are three tools that I could consider for modeling the knowledge graph. Considering the high formalization requirement of the knowledge graph, protégé [11], the ontology editor, was my first thought. However, ontology modeling requires a deep understanding of mathematics, especially in the area of logic. I did not have any experience in the relevant field, and thus other tools were a better choice for me.

GRAKN.AI [1] is a distributed database for knowledge-oriented systems. Within this database, graphs are modeled with two layers: the ontology layer and the data layer. The ontology layer acts as the schema in a database management system, and the data layer contains the instances. Usually, objects are defined by the ontology layer, and the detailed information is stored in the data layer. However, when modeling the conceptual ideas in adaptive learning, we do not have a clear boundary between the schema and the instance. The facts that we are modeling can be considered both as high-level schema and as concrete examples. Under this circumstance, GRAKN.AI may not be that suitable for this project.

Unlike GRAKN.AI with its strict two-layer schema definition, Neo4j [3] is more flexible in terms of presenting the nodes and the edges. Nodes and edges can all be added to the graph without a formal definition of the fact. This flexibility made it more convenient for me to model the graph at the beginning stage, when I did not have a clear understanding of the whole graph. I developed a pipeline to translate the tuple representation of the graph into Neo4j commands that create the knowledge graph.

Table 3.1 shows an example of how to model the tuple (entity, IS, data_structure) in Neo4j. We first use CREATE commands to add the two nodes with the type ’Fact’ and the corresponding name (Commands 1 and 2). After both nodes are added, an edge with the IS relationship is established between the two nodes, also with a CREATE command (Command 3). With similar command patterns, we can build the knowledge graph in Neo4j based on the relationship tuples.

#   Add    Command
1   Node   CREATE (entity:Fact {name: 'entity'})
2   Node   CREATE (data_structure:Fact {name: 'data_structure'})
3   Edge   CREATE (entity)-[:IS]->(data_structure)

Table 3.1: Example of Neo4j commands to add a node or an edge.
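The translation from relationship tuples to commands of the kind shown in Table 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the project: the function names `to_identifier` and `tuples_to_cypher` are hypothetical, fact names are assumed to start with a letter, and the generated clauses are assumed to be sent to Neo4j as one query so that the variables bound in the node clauses are still visible in the edge clauses.

```python
import re


def to_identifier(name):
    """Turn a fact name into a Cypher variable name (hypothetical helper).

    Assumption: the name starts with a letter; every other
    non-alphanumeric character is replaced by an underscore.
    """
    return re.sub(r"\W", "_", name)


def tuples_to_cypher(tuples):
    """Translate (head, relation, tail) tuples into CREATE clauses.

    Each fact gets one node clause the first time it is seen, and each
    tuple gets one edge clause. The clauses are meant to be joined and
    run as a single query, because Cypher variables are scoped to the
    query in which they are bound.
    """
    clauses = []
    seen = set()
    for head, relation, tail in tuples:
        for fact in (head, tail):
            if fact not in seen:
                seen.add(fact)
                escaped = fact.replace("'", "\\'")  # keep the string literal valid
                clauses.append(
                    "CREATE (%s:Fact {name: '%s'})" % (to_identifier(fact), escaped)
                )
        clauses.append(
            "CREATE (%s)-[:%s]->(%s)"
            % (to_identifier(head), relation, to_identifier(tail))
        )
    return clauses


if __name__ == "__main__":
    print("\n".join(tuples_to_cypher([("entity", "IS", "data_structure")])))
```

For the tuple (entity, IS, data_structure), the sketch produces the three commands of Table 3.1 in the same order.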

3.2 Mining from different types of data

According to the different types of data sources, I developed two mining methods to construct the knowledge graph: mining from human-selected tags and mining from tree-structure webpages.

3.2.1 From human-selected tags

Nowadays there are many online Q&A forums which provide a platform for people to ask questions and search for answers. When people post their questions, they are usually required to add tags that describe the question so that the posts can be better categorized. The tags can be considered as the concepts relevant to the questions. Since the tags are selected by the people who ask the question, human reasoning has been involved, and thus the tags for the same question can be thought of as having some relationship. This kind of relationship is what we are looking for to construct the knowledge graph.

Algorithm 1 Mine from human-selected tags

1: procedure frequent_tag_pair({<t1, t2, ..., tn>}, t_sup, t_conf)
2:     num_post ← total number of the posts
3:     tag_counter[ti] ← number of posts with tag ti
4:     pair_counter[ti, tj] ← number of posts with tag pair <ti, tj>
5:     for all tag pairs <ti, tj> do
6:         if pair_counter[ti, tj] / num_post > t_sup then
7:             if pair_counter[ti, tj] / tag_counter[ti] > t_conf then
8:                 add <ti, tj> to the graph
9:             if pair_counter[ti, tj] / tag_counter[tj] > t_conf then
10:                add <tj, ti> to the graph

Algorithm 1 describes how I mine the graph from the human-selected tags. The algorithm is based on the algorithms developed by Agrawal and Srikant in 1994 [4] and inspired by Chen and Xing’s work in 2016 [8]. The basic idea of this algorithm is to pick out all frequent tag pairs and add them to the graph. The input is a list of tag tuples, where each tag tuple represents the tags added to a specific post, together with two preset parameters: t_sup is the lower bound of the frequency of a tag pair among all posts, while t_conf is the lower bound of the frequency of a tag pair among all posts with one of the tags in the pair.

The first part of the algorithm (Lines 2 to 4) counts the number of posts, the number of posts with each specific tag and the number of posts in which each tag pair appears. The second part (Lines 5 to 10) selects the frequent pairs and adds them to the graph. Each tag pair is filtered twice. First, it is examined by its frequency in all posts (Line 6). If this frequency is greater than t_sup, we further check the count of the pair against the count of each of its two tags (Lines 7 and 9). If either ratio is greater than t_conf, the tag pair is added to the graph, with the tag that it was checked against as the first tag in the pair.
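Algorithm 1 can be written as a short Python function. The sketch below follows the pseudocode line by line; the function name `frequent_tag_pairs`, the in-memory list of tag tuples and the threshold values in the usage example are illustrative assumptions rather than the settings used in the experiments.

```python
from collections import Counter
from itertools import combinations


def frequent_tag_pairs(posts, t_sup, t_conf):
    """Return directed tag pairs that pass both thresholds of Algorithm 1.

    posts  -- list of tag tuples, one tuple per post
    t_sup  -- lower bound on pair_count / number of posts
    t_conf -- lower bound on pair_count / count of the first tag
    """
    num_post = len(posts)
    tag_counter = Counter()
    pair_counter = Counter()
    for tags in posts:
        unique = sorted(set(tags))  # count each tag once per post
        tag_counter.update(unique)
        pair_counter.update(combinations(unique, 2))

    edges = []
    for (ti, tj), count in pair_counter.items():
        if count / num_post > t_sup:            # Line 6
            if count / tag_counter[ti] > t_conf:  # Line 7
                edges.append((ti, tj))
            if count / tag_counter[tj] > t_conf:  # Line 9
                edges.append((tj, ti))
    return edges


if __name__ == "__main__":
    # Hypothetical toy input, not data from the project.
    toy_posts = [("sql", "mysql"), ("sql", "mysql"), ("sql", "php"), ("java",)]
    print(frequent_tag_pairs(toy_posts, t_sup=0.3, t_conf=0.7))
```

In the toy input, the pair of 'mysql' and 'sql' appears in half of the posts and in every post tagged 'mysql', but only in two of the three posts tagged 'sql', so only the direction starting at 'mysql' passes the t_conf check.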

3.2.2 From tree-structure webpages

Mining from the human-selected tags can extract the fact that two facts are related, but it cannot describe what that relationship is. I aim to build the knowledge graph with the relationship included. Tree-structured webpages are another source that we could use for mining. Web pages are connected with each other through the links between them, and the tree structure can be used to provide the relationship between the pages.

Figure 3.1 shows an example of a tree-structured webpage (the Wikipedia page ’Database’). From the figure, we can see that under the big concept ’Database’ there are section headings that describe the content of the sections, and there are smaller section headings under each first-level heading. This expansion of the section headings is like leaves growing on a tree. For this tree-like structure, if another webpage is linked from a paragraph in a specific section, the deepest heading of that section is used as the description of the relationship. For example, if some page is linked under the ’Applications’ section, we can assume that the new page is about some application of the database, so ’Applications’ can be the description of the relationship between ’Database’ and the new page.
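Labeling each link with the deepest heading above it can be sketched with Python's standard `html.parser` module. This is a simplified illustration, not the project's implementation: the class name `HeadingLinkParser` is hypothetical, headings are assumed to be the HTML tags h2 to h6, and links that appear before the first heading receive the default label 'highly_related', the label this report uses for the opening paragraph of a wiki page. Because a page is read top to bottom, the most recent heading seen before a link is also the deepest heading of the section that contains it.

```python
from html.parser import HTMLParser


class HeadingLinkParser(HTMLParser):
    """Collect (deepest_heading, href) pairs from an HTML page."""

    HEADING_TAGS = ("h2", "h3", "h4", "h5", "h6")  # assumed section levels

    def __init__(self, default_heading="highly_related"):
        super().__init__()
        self.current_heading = default_heading
        self.links = []
        self._in_heading = False
        self._heading_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._in_heading = True
            self._heading_text = []
        elif tag == "a" and not self._in_heading:
            href = dict(attrs).get("href")
            if href:
                # The most recent heading is the deepest enclosing one.
                self.links.append((self.current_heading, href))

    def handle_data(self, data):
        if self._in_heading:
            self._heading_text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self.current_heading = "".join(self._heading_text).strip()


if __name__ == "__main__":
    # Hypothetical page fragment for illustration only.
    page = (
        "<p><a href='/wiki/Data'>data</a></p>"
        "<h2>Applications</h2><p><a href='/wiki/Bank'>banks</a></p>"
        "<h3>Models</h3><p><a href='/wiki/Relational_model'>relational</a></p>"
    )
    parser = HeadingLinkParser()
    parser.feed(page)
    for heading, href in parser.links:
        print(heading, href)
```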

Algorithm 2 shows how I implemented this webpage mining approach, which is more commonly known as web crawling. Starting from the initial page with a set of anchors (used for filtering out irrelevant information), I apply breadth-first search with a preset limit on the depth of the graph. One key step is to retrieve the tree structure (Line 9), where each paragraph is labeled with its deepest heading. Another key step is to filter new pages by matching them against the set of anchors (Line 11). The set of anchors consists of the crucial information that we want to capture in the knowledge graph, and if a new page does not match any of the anchors, it is not added to the graph. There are two ways of matching: full match and partial match. A full match means that the name of the new page must be the same as the anchor, while a partial match means that part of the name matches part of the anchor. How strict a match we require influences the number of nodes in the graph.

Algorithm 2 Mine from tree-structure webpages with anchors

1: procedure breadth_first_search_crawl(starting_page, anchor, depth)
2:     queue ← {}
3:     queue.enqueue(starting_page)
4:     added ← {}
5:     added[starting_page] ← 0
6:     while queue is not empty do
7:         page ← queue.dequeue()
8:         if added[page] < depth then
9:             structure ← RETRIEVE_TREE_STRUCTURE(page)
10:            for (paragraph, heading) in structure do
11:                if exists new_page linked in paragraph & new_page matches any anchor then
12:                    add <page, heading, new_page> to the graph
13:                    if new_page not in added then
14:                        queue.enqueue(new_page)
15:                        added[new_page] ← added[page] + 1
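The crawl of Algorithm 2 and the two matching strategies can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the project code: the tree structure is supplied by a hypothetical callable `get_structure` that returns (heading, linked_page) pairs for a page, a full match is a case-insensitive comparison of names, and a partial match means that the page name and an anchor share at least one word.

```python
import re
from collections import deque


def words(name):
    """Split a page name or anchor into lowercase words (assumed tokenization)."""
    return {w for w in re.split(r"[^a-z0-9]+", name.lower()) if w}


def full_match(name, anchors):
    """True if the page name equals some anchor, ignoring case."""
    return name.lower() in {a.lower() for a in anchors}


def partial_match(name, anchors):
    """True if the page name shares at least one word with some anchor."""
    name_words = words(name)
    return any(name_words & words(a) for a in anchors)


def crawl(starting_page, anchors, depth, get_structure, matches):
    """Breadth-first crawl that keeps only links matching an anchor."""
    queue = deque([starting_page])
    added = {starting_page: 0}  # page -> depth at which it was reached
    graph = []
    while queue:
        page = queue.popleft()
        if added[page] < depth:
            for heading, new_page in get_structure(page):
                if matches(new_page, anchors):
                    graph.append((page, heading, new_page))
                    if new_page not in added:
                        added[new_page] = added[page] + 1
                        queue.append(new_page)
    return graph


if __name__ == "__main__":
    # Hypothetical link structure standing in for real wiki pages.
    pages = {
        "Database": [("Models", "Relational_model"), ("Applications", "Banking")],
        "Relational_model": [("highly_related", "Primary_key")],
    }
    anchors = {"relational_model", "primary_key", "database"}
    for row in crawl("Database", anchors, 2, lambda p: pages.get(p, []), full_match):
        print(row)
```

Passing `partial_match` instead of `full_match` loosens the filter, which is the knob that controls how many nodes end up in the graph.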

3.2.3 Summary

Both mining algorithms make it possible to mine the knowledge graph from the large amount of data available online and to expand the size of the graph with less human effort. I tried both algorithms on different sources of data with different parameters. Details about the experiments and the results are presented in Chapter 4.


Figure 3.1: Tree structure of the Wikipedia page ’Database’.


Chapter 4

Results and Discussion

In this chapter, I will present the knowledge graphs generated by the two lines (modeling and mining) and compare them with each other to find any similarities or differences.

4.1 Graph modeled by the domain expert

In Chapter 3, I mentioned that I would use Neo4j as the tool to model and represent the graph. Based on the knowledge graph modeled by Dr. Qing Wang (see Appendix 5), I built the representation of the graph in Neo4j (see Figure 4.1). Each fact is modeled as a node, and the relationship edges connect them. There are in total 54 nodes and 78 edges, and the relationships include: IS, HAS, CONTAIN, DEFINE, FOR, SET. The relationships in the expert-modeled graph are described in a formal language, and the top two relationships are ’IS’ and ’HAS’, which describe the belonging relationships between the facts. As I already stated in Chapter 3, the result also shows that the expert-modeled graph describes the relationships precisely, but the size of the graph is limited by the human resources involved.

4.2 Graph mined from the human-selected tags

By using the algorithm we discussed in Section 3.2.1, I constructed the knowledge graph based on the tag tuples of all Stack Overflow posts tagged with ’database’.

With the tag tuples downloaded from the Stack Overflow database, I applied the algorithm and obtained a result in which the nodes represent the tags, which are also the facts we are trying to model, and the edges carry weights that measure how frequent the tag pair is. Following the graph visualization approach suggested by Chen and Xing [8], I imported the nodes and the edges into Gephi [5] and then applied the Louvain method [6] for community detection and the Force Atlas 2 layout [10] for display purposes, which produced the graph in Figure 4.2.

However, from the figure, we can see that the node ’database’ dominates the whole graph, and we cannot see more specific relationships between other nodes. This is possibly because I used all posts tagged with ’database’. To solve this problem, I repeated the experiment described above but ignored the ’database’ tag. Similarly, after I had the nodes, edges and weights, I imported the graph into Gephi [5] and then applied the community detection; the resulting graph is shown in Figure 4.3.

Figure 4.1: Knowledge graph representation in Neo4j based on the graph provided by Dr. Qing Wang (Appendix 5).

We can see that the graph is clearly divided into different communities and each community has a very clear center node. Some center nodes are still too big, and the graph is not clear enough for illustration purposes. It is possible to remove this dominance in the same way I treated ’database’. However, the maximum number of tags on a Stack Overflow post is five, and I have already removed one tag, ’database’. If I removed another tag, at most three tags would remain in some posts, and this may cause the loss of relationships between the nodes.

There is another problem in the graph shown in Figure 4.3. Most center nodes, like ’sql’, ’php’ and ’java’, are not the facts we are trying to model as the domain knowledge. Only the brown nodes centered at ’database-design’ are the facts that we want in the knowledge graph. A possible reason for this is the nature of the data source. Stack Overflow is a forum focusing on practical coding or programming questions, so the graph mined from it will have more nodes describing practical concepts than the conceptual facts we need to learn.

4.3 Graph mined from the tree-structure webpages

In Section 3.2.2, I described an algorithm to mine the graph from tree-structured webpages. An example of this kind of webpage is a Wikipedia (wiki) page. A wiki page usually talks about some concept which could be modeled as a fact in the knowledge graph. An individual wiki page may link to other wiki pages which are related to the current page. For example, on the wiki page "database" there is a hyperlink to the "relational database" wiki page, and these two facts are also modeled as connected in the graph provided by the domain expert. Besides, since Wikipedia has a standard format for hyperlinks to other pages, the links between different pages can be easily retrieved.

Figure 4.2: Knowledge Graph generated from tags of all Stack Overflow posts tagged with ’database’.

4.3.1 Compare with the expert modeled graph

To check the effectiveness of this mining algorithm, I compared the mined tuples with the graph modeled by the domain expert. Some facts in the expert graph, like ’Entity-relationship model’ and ’data integrity’, are mined, but quite a lot of facts do not appear in the tuples. One reason is that a fact may be too general, like ’relationship’, and Wikipedia does not have a good representation of this kind of fact. Another reason is that a fact may not have a separate page, so it is not mined when the hyperlinks are checked. Such a fact appears in some of the webpages, but only as text and not as a link. For example, in the Wikipedia page ’Entity-relationship model’, ’entity type’ appears seven times including the references, but it is not mined since all appearances are in the form of text and not a link. There are other similar cases in which a fact is not significant enough to have an individual wiki page although it already appears in some of the pages.
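The comparison described above amounts to checking which expert facts occur as nodes in the mined tuples. A minimal sketch of that check is given below; the function names and the name normalization (lowercasing and treating hyphens and underscores as spaces) are assumptions made for illustration, and the example tuples are hypothetical rather than actual mining output.

```python
def normalize(name):
    """Lowercase a fact name and treat '-' and '_' as spaces (assumed rule)."""
    return name.lower().replace("-", " ").replace("_", " ").strip()


def fact_coverage(expert_facts, mined_tuples):
    """Split the expert facts into those found among mined nodes and those missing."""
    mined_nodes = set()
    for head, _relation, tail in mined_tuples:
        mined_nodes.add(normalize(head))
        mined_nodes.add(normalize(tail))
    expert = {normalize(fact) for fact in expert_facts}
    found = expert & mined_nodes
    return found, expert - found


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    expert_facts = ["Entity-relationship model", "data integrity", "entity type"]
    mined = [
        ("Database", "Models", "Entity-relationship_model"),
        ("Database", "highly_related", "Data_integrity"),
    ]
    found, missing = fact_coverage(expert_facts, mined)
    print("found:", sorted(found))
    print("missing:", sorted(missing))
```

A fact such as 'entity type', which appears only as plain text on a page, never becomes a node in the tuples and therefore ends up in the missing set.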

4.3.2 Accuracy of the relationship tuples

By applying the algorithm for mining the graph from tree-structure web pages that I introduced in Section 3.2.2, I obtained the list of tuples to be added to the graph. To check whether the relationships are described accurately, I compared the tuples with the expert-modeled graph. To better illustrate the results, I divide the tuples into four categories: precise relationship, imprecise relationship, ambiguous relationship and fact as a relationship.

#   Type                     Example
1   Precise relationship     (Database; Models; Relational_model)
2   Imprecise relationship   (SQL; Procedural_extensions; Perl)
3   Ambiguous relationship   (Primary_key; highly_related; Relational_model)
4   Fact as a relationship   (Data_modeling; Entity_relationship_diagrams; Data_model)

Table 4.1: Example of the relation tuples in the mined graph with Stack Overflow tags as anchors.

Precise relationship

The first category is the precise relationship, which means that the mined relationship precisely describes the relationship between the facts. Tuple #1 in Table 4.1, (Database; Models; Relational_model), says that the relationship between ’Database’ and ’Relational_model’ is ’Models’, and indeed the relational model is a model of the database.

Imprecise relationship

Another category is the imprecise relationship, like tuple #2 in Table 4.1. The tuple indicates that Perl is one of the procedural extensions of SQL. However, this is not true. If we look at the wiki page ’SQL’, we can see that ’Perl’ appears in a paragraph under the heading ’Procedural extensions’. However, it does not appear in the major paragraph that talks about the extensions, but in a shorter paragraph at the end of the section that talks about PostgreSQL (which is the real procedural extension). If we not only focus on the high-level structure information but also study the paragraphs or even the sentences, we will be able to extract this relationship differently.


Ambiguous relationship

Tuple #3 (Primary_key; highly_related; Relational_model) is categorized as an ambiguous relationship. The relationship ’highly related’ does not come from the data as a heading; it is my own definition. When retrieving the tree structure of the wiki pages, one special case is the first paragraph of each wiki page, for which no heading is available. Since the first paragraph usually includes the information that is most relevant to the page, I use ’highly related’ to describe this kind of relationship. This relationship only indicates that the two nodes are related and does not provide any detailed information, so we cannot say it is precise. We cannot say it is wrong either, since the two nodes really are highly related. I therefore describe it as an ambiguous relationship, which is neither precise nor imprecise.

Fact as a relationship

The last category, fact as a relationship, is a special one since it is not about the accuracy but about the relationship itself. ’Entity relationship diagrams’ appears as the relationship in tuple #4 in Table 4.1. However, the entity relationship diagram is more like a fact that we want to put into the graph than a relationship between two nodes. In my approach, headings are all treated as relationships and not as nodes, but the headings are sometimes facts as well. The reason is that Wikipedia is open to everyone to edit, and people have different understandings of how to structure a piece of knowledge, which makes the structure well-organized but at the same time leaves some freedom to change it. To better mine the facts from the headings, we may need to remodel the structure of the web pages or use more formalized texts.

The different categories of the relationships mined indicate that mining results havelower accuracy of the relationship tuples compared with expert modeled graph andmore work need to be done to improve the accuracy, for example, a lower-levelinterpretation of the web pages.

4.3.3 Comparison between different anchoring strategies

I mentioned in Chapter 3.2.2 that a set of anchors could help to filter the information and reduce the number of nodes put in the graph. To show how anchors help to control the size of the graph, I conducted some experiments. The set of Stack Overflow post tags is used as the anchors. In every run, the crawl started from the wiki page ’Outline of Database’, which is like a dictionary of database concepts, and the experiments were repeated with depths 1, 2 and 3. As discussed in Chapter 3, there are two ways to check whether a new page matches any of the anchors, full match and partial match, and both are included in the experiments to show the difference.
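A depth-limited crawl with an anchor check can be sketched as below. This is a minimal illustration under stated assumptions, not the artefact's implementation: the `links` dictionary is a hypothetical stand-in for fetching and parsing real wiki pages, the toy page names are invented, and I assume that depth 1 means following only the links of the starting page.

```python
from collections import deque


def crawl(start, links, depth, keep=lambda page: True):
    """Breadth-first crawl up to `depth` levels from `start`.

    `links` maps a page to a list of (heading, linked_page) pairs.
    `keep` is the anchor check; pages that fail it are neither added
    to the graph nor expanded further.
    """
    tuples = []
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        page, level = queue.popleft()
        if level == depth:
            continue  # depth limit reached, do not expand this page
        for heading, target in links.get(page, []):
            if not keep(target):
                continue
            tuples.append((page, heading, target))
            if target not in visited:
                visited.add(target)
                queue.append((target, level + 1))
    return tuples


# Toy link structure (invented for illustration only).
links = {
    "Outline_of_databases": [("Types", "Relational_database"), ("Types", "Google")],
    "Relational_database": [("History", "Relational_model")],
    "Google": [("Products", "Gmail")],
}
anchors = {"Relational_database", "Relational_model"}

unfiltered = crawl("Outline_of_databases", links, depth=2)
filtered = crawl("Outline_of_databases", links, depth=2, keep=lambda p: p in anchors)
```

On the toy data the unfiltered crawl collects four tuples and the filtered crawl two, which mirrors on a small scale how anchors limit the growth of the graph with depth.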


Data set                        Depth   Number of Tuples
Without anchors                 1       142
                                2       5343
                                3       167368
With anchors (full match)       1       15
                                2       215
                                3       669
With anchors (partial match)    1       126
                                2       5866
                                3       84310

Table 4.2: Number of nodes/edges in the crawled graph for each depth.

Table 4.1 shows the results of the experiments. The reduction in the number of tuples from no anchors to fully matched anchors is enormous. Already at depth 1, the number of tuples with fully matched anchors is much lower than that without anchors. However, even at depth 3, the number of tuples is still below 1000, which is too small for a knowledge graph in this project since I aim to have several thousand nodes. Mining without anchors yields too much information and full matching yields too few nodes, so I used partial matching as the strategy to control the size. For depths 1 and 2, the number of tuples is similar to the number without anchors, but at depth 3 the number of tuples generated without anchors is around twice the number generated with anchors. This means that the set of anchors effectively filtered out irrelevant information while keeping the size of the graph at a reasonable level for the goals of this project.

From the comparison between the different anchoring strategies, we can see that a set of anchors helps to decide which information to keep or discard, and that the anchors can also be used to control the size of the graph.
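The size reduction at depth 3 can be checked with a short calculation. The three counts are copied from the table in this section; the variable names are only illustrative.

```python
# Tuple counts at depth 3, copied from the table in this section.
without_anchors = 167368
full_match = 669
partial_match = 84310

# How many times larger the unanchored graph is than each anchored graph.
ratio_partial = without_anchors / partial_match  # roughly 2
ratio_full = without_anchors / full_match        # roughly 250
```

The partial match therefore roughly halves the graph at depth 3, while the full match shrinks it by a factor of about 250.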

4.3.4 Comparison between different sets of anchors

To mine the graph with anchors, an important question is how to select appropriate anchors. As an indication of what information we need, the anchors guide the construction of the knowledge graph. To examine how the set of anchors influences the graph, I tried the mining with two different sets of anchors: Stack Overflow post tags and the facts (nodes) in the expert-modeled graph.

Stack Overflow post tags as anchors

Table 4.3 lists the top 10 nodes (in descending order of node degree) from the graph generated with the Stack Overflow post tags as the anchors. From the list, we can see that only 2 out of 10 nodes are facts that we are looking for: ’Database’ and ’Database design’. None of the other nodes concerns database domain knowledge. As mentioned in Chapter 4.2, this may result from how Stack Overflow is used. Most questions there ask for advice on programming, so the graph generated with these anchors is also oriented in a practical direction.

#     Node                         Degree
1     Digital_object_identifier    1578
2     Database                     692
3     Microsoft_Windows            646
4     Information_design           558
5     Participatory_design         545
6     Google                       543
7     Hardware_interface_design    528
8     Software_design              527
9     Database_design              511
10    Design                       507

Table 4.3: Top 10 nodes of the graph mined with Stack Overflow post tags as anchors.
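A ranking like the one in Table 4.3 can be derived from the mined tuples as sketched below. The ranking in this report was obtained with Gephi; the function here is only an assumed equivalent that treats every tuple as one undirected edge, and the toy tuples are invented for illustration.

```python
from collections import Counter


def top_nodes_by_degree(tuples, k=10):
    """Return the k nodes with the highest degree.

    Each (head, relationship, tail) tuple is treated as one undirected edge,
    so it adds one to the degree of both endpoints.
    """
    degree = Counter()
    for head, _relation, tail in tuples:
        degree[head] += 1
        degree[tail] += 1
    return degree.most_common(k)


# Toy tuples (invented for illustration only).
toy_tuples = [
    ("Database", "Types", "Relational_database"),
    ("Database", "Design", "Database_design"),
    ("Database_design", "Models", "Relational_database"),
    ("Database", "Languages", "SQL"),
]
top = top_nodes_by_degree(toy_tuples, k=1)
```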

As with the graph obtained from mining the human-selected tags, I imported the nodes and the edges into Gephi [5] and then applied the Louvain method [6] for community detection and the Force Atlas 2 layout [10] for display. The community detection result shown in Figure 4.4 supports the same observation as the table of top nodes. The boundaries of the communities are not clear, and there are no distinct centers in the graph.

Facts in expert-modeled graph as anchors

To make sure that more relevant information is added to the graph, I changed the set of anchors to the facts (nodes) in the knowledge graph modeled by Dr. Qing Wang (see Appendix 5) and repeated the same experiment. Table 4.4 lists the top 10 nodes from the graph generated this time. In contrast to Table 4.3, 9 out of 10 nodes are what we are looking for; the only exception is the node ’European Union’. This is almost the reverse of the proportion in Table 4.3. The unexpected node ’European Union’ is caused by the partial matching strategy. Under partial matching, if any token in the name of the new page matches any token of the anchors, the new page is considered a valid fact and added to the graph. The anchor set contains a fact called ’union type’, which matches ’European Union’ under the current strategy even though the two uses of ’union’ refer to entirely different concepts.
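The difference between the two matching strategies, and the reason ’European Union’ slips through, can be illustrated with a minimal sketch. The normalization (lowercasing and treating underscores as spaces) and the function names are my assumptions for the example and may differ from the artefact's code.

```python
def normalize(name):
    """Lowercase a name and treat underscores as spaces."""
    return " ".join(name.lower().replace("_", " ").split())


def full_match(page, anchors):
    """True if the whole page name equals some anchor."""
    return normalize(page) in {normalize(a) for a in anchors}


def partial_match(page, anchors):
    """True if any token of the page name equals any token of any anchor."""
    page_tokens = set(normalize(page).split())
    return any(page_tokens & set(normalize(a).split()) for a in anchors)


anchors = ["union type", "database"]

eu_full = full_match("European_Union", anchors)        # no anchor is 'european union'
eu_partial = partial_match("European_Union", anchors)  # shares the token 'union'
```

Under these assumptions the full match rejects ’European_Union’ while the partial match accepts it, because a single shared token is enough regardless of its meaning.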


#     Node                      Degree
1     Database                  433
2     Database_model            225
3     European_Union            223
4     Relational_model          185
5     Relational_database       160
6     Data_mining               159
7     Data_warehouse            156
8     Database_normalization    148
9     Data_model                139
10    Database_design           137

Table 4.4: Top 10 nodes of the graph mined with the facts in the expert-modeled knowledge graph as anchors.

The community detection result also confirms this. I used the same process to visualize the graph: I imported the nodes and the edges into Gephi [5] and applied the Louvain method [6] for community detection and the Force Atlas 2 layout [10] for display. Figure 4.5 shows the communities in the graph with the facts in the domain-expert-modeled graph as anchors. Unlike Figure 4.4, there are clear clusters of nodes. The cluster of green nodes far away from the main body of the graph contains the nodes related to ’European Union’, and this is what I expected: the green nodes are not what I want, so they lie at a great distance from the main body of the graph.

A closer look at the graph (Figure 4.6) shows what the different communities represent. For example, the blue nodes are mainly about database concepts while the pink nodes are techniques related to databases.

From the comparison between different sets of anchors, we see the importance of anchors in mining a graph. Anchors can be used not only to control the size of the graph but also to generate a graph that contains more of the information we want.

4.3.5 Use NLP tools for language level understanding

The graph generated by crawling the Wikipedia pages can still be improved. Currently I only use the heading to model the relationship; however, this kind of description is too brief. Even under the same heading, different paragraphs could talk about different topics. Therefore, we should try to retrieve the relationships between the concept nodes at a lower level, for example, the language level.

Zhao et al. [12] developed a technique called HDSKG that harvests the relationships between nodes at the language level by using NLP tools and a semi-supervised SVM classifier. I also tried the technique on the ’database’ Wikipedia page. However, I encountered difficulty when labeling the triples. The triples I obtained were mostly of low quality and could not be used for training the classifier.

#    Candidate relation triples
1    (DBMS; provides; various_functions)
2    (performance; have_grown_in; orders)
3    (conceptual_data_model; involves; analysis)
4    (popular_NoSQL_systems; include; MongoDB)

Table 4.5: Example of relation triples retrieved and their labels.

Table 4.5 shows some examples of the relation triples I retrieved with the HDSKG methodology. From the table, we can see that triples 1 to 3 all state trivial facts. Only triple 4 provides some knowledge, namely that MongoDB is one of the NoSQL systems. Many trivial triples like triples 1 to 3 are generated, and they offer no useful information for the knowledge graph.

Besides, this technique also poses a challenge regarding data size. I only obtained triples from one Wikipedia page, and compared with the total number of triples used for training in Zhao et al.’s paper [12], this is too few for training a decent classifier. I did not explore NLP techniques further due to the time limit of the project, but this could be part of future work. Mining algorithms could be combined with NLP to generate trustworthy triples, and besides HDSKG, other NLP tools such as OpenIE could also be considered.

NLP tools could be of great help in understanding web content, as previous work has shown. However, to suit the context of this project, the implementation and the model need more detailed investigation and further adjustment, which is left as future work.


Figure 4.3: Knowledge graph generated from tags of all Stack Overflow posts tagged with ’database’ but with dominant node ’database’ ignored.


Figure 4.4: Knowledge graph mined from Wikipedia pages with Stack Overflow tags as anchors.


Figure 4.5: Knowledge graph mined from Wikipedia pages with facts in the domain expert modeled graph as anchors.


Figure 4.6: Part of the knowledge graph mined from Wikipedia pages with facts in the domain expert modeled graph as anchors.

Chapter 5

Conclusion

From the results discussed in Chapter 4, we can see that the two lines of constructing a knowledge graph can be brought together to benefit each other.

The expert-modeled knowledge graph is highly abstract and formalized, but it only contains a limited number of facts. However, the expert-modeled knowledge graph can provide a sketch for the mining techniques, for example, by using its fact nodes as the anchors for the mining algorithms.

For the knowledge graph mined from the data, we can see that the result mainly depends on the data source, both in the information included and in the accuracy of the relationships. Compared with the expert-modeled knowledge graph, most of the facts can be included, but the description of the relationships can be imprecise. Also, the set of anchors plays a vital role in the graph construction. The anchors can be used to filter out irrelevant facts and to control the number of nodes and edges in the graph. At the same time, they are critical because they orient the graph in the direction that the anchors focus on.

5.1 Future Work

There are several directions for future work, including:

• retrieve the lower-level structure from the web pages to provide more detailed information;

• further investigate the use of NLP tools for web content analysis;

• try a different type of data source, for example, a more formalized source like a textbook on the domain knowledge.

Also, more study could be conducted on applying the knowledge graph construction to better support adaptive learning, and the work may provide a theoretical basis for applications such as intelligent knowledge search, reasoning or recommender systems.


Bibliography

[1] GRAKN.AI - The Knowledge Graph. https://grakn.ai/.

[2] Knowledge - Inside Search - Google. https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html.

[3] The Neo4j Graph Platform - The #1 Platform for Connected Data. https://neo4j.com/.

[4] Rakesh Agrawal, Ramakrishnan Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB, volume 1215, pages 487–499, 1994.

[5] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: An open source software for exploring and manipulating networks, 2009.

[6] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.

[7] Chunyang Chen, Sa Gao, and Zhenchang Xing. Mining analogical libraries in q&a discussions–incorporating relational and categorical knowledge into word embedding. In Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, volume 1, pages 338–348. IEEE, 2016.

[8] Chunyang Chen and Zhenchang Xing. Mining technology landscape from stack overflow. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, page 14. ACM, 2016.

[9] Chunyang Chen, Zhenchang Xing, and Lei Han. Techland: Assisting technology landscape inquiries with insights from stack overflow. In Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on, pages 356–366. IEEE, 2016.

[10] Mathieu Jacomy, Tommaso Venturini, Sebastien Heymann, and Mathieu Bastian. Forceatlas2, a continuous graph layout algorithm for handy network visualization designed for the gephi software. PloS one, 9(6):e98679, 2014.

[11] Mark A Musen. The protégé project: a look back and a look forward. AI matters, 1(4):4–12, 2015.


[12] Xuejiao Zhao, Zhenchang Xing, Muhammad Ashad Kabir, Naoya Sawada, Jing Li, and Shang-Wei Lin. Hdskg: Harvesting domain specific knowledge graph from content of webpages. In Software Analysis, Evolution and Reengineering (SANER), 2017 IEEE 24th International Conference on, pages 56–67. IEEE, 2017.

[13] Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. Statsnowball: a statistical approach to extracting entity relationships. In Proceedings of the 18th international conference on World wide web, pages 101–110. ACM, 2009.

Appendices


Appendix

Appendix 1: Project Description


1 Project Title

Exploring Knowledge Graphs for Adaptive Learning

2 Project Description

This project aims to explore the potential of knowledge graphs in supporting adaptive learning in the area of databases. In general, there are two separate lines of research about constructing knowledge graphs. One line is to learn or mine knowledge graphs from a large amount of data, such as Web pages or Q&A forums. The other line is to manually model knowledge graphs using a formal language, such as the Semantic Web formalisms RDF and OWL. This project aims to bring these two separate lines of research together in the context of building knowledge graphs that capture knowledge about relational databases.

The specific tasks include the following:

1. Conduct a literature review on modelling and mining knowledge graphs;

2. Develop a modelling technique to represent knowledge graphs based on domain knowledge;

3. Develop a data mining/machine learning technique to identify relevant knowledge and discover relationships of different pieces of knowledge;

4. Investigate how knowledge graphs modelled by human experts relate to knowledge graphs discovered from data mining/machine learning techniques;

5. Write up a project report.

3 Learning Objectives

On completion of the project, the following learning objectives are expected to be achieved:

• Have a good understanding for the literature of knowledge graphs;

• Have good skills for modeling and mining knowledge graphs;

• Be able to test the developed techniques and evaluate their effectiveness and efficiency;

• Be able to structure a project report and write convincingly of project outcomes.


Appendix 2: Study Contract


Appendix 3: Artefact Description


Artefact Description

This artefact is developed to bring two lines of constructing a knowledge graph (mining and modeling) together.

List of files

├── README.md
├── README.pdf
├── description.md
├── description.pdf
├── modeling
│   ├── create_neo4j_graph.py
│   ├── input
│   │   ├── knowledge-graph.txt
│   ├── output
│   │   ├── neo4j-commands.txt
├── mining
│   ├── human-selected-tags
│   │   ├── mine_from_tags.py
│   │   ├── input
│   │   │   ├── tags.csv
│   │   ├── output
│   │   │   ├── nodes.csv
│   │   │   ├── edges.csv
│   │   │   ├── nodes_without_dominant.csv
│   │   │   ├── edges_without_dominant.csv
│   ├── tree-structure-webpages
│   │   ├── mine_from_webpages.py
│   │   ├── input
│   │   │   ├── stackoverflow-anchor.txt
│   │   │   ├── expert-graph-anchor.txt
│   │   ├── output
│   │   │   ├── result_stackoverflow-anchor_1.txt
│   │   │   ├── result_stackoverflow-anchor_2.txt
│   │   │   ├── result_stackoverflow-anchor_3.txt
│   │   │   ├── result_expert-graph-anchor_1.txt
│   │   │   ├── result_expert-graph-anchor_2.txt
│   │   │   ├── result_expert-graph-anchor_3.txt
│   │   │   ├── result_expert-graph-anchor_4.txt
│   ├── graph-visualization
│   │   ├── stackoverflow.gephi
│   │   ├── wikipedia.gephi

All files included in this artefact were developed by myself, with the following documented exceptions:

• modeling/input/knowledge-graph.txt: tuple representation extracted from the knowledge graph provided by Dr. Qing Wang (see Appendix 5)
• mining/human-selected-tags/input/tags.csv: post tags of all 'database' tagged posts downloaded from Stack Overflow on Sep 4, 2017
• mining/tree-structure-webpages/input/stackoverflow-anchor.txt: extracted from the tags in tags.csv downloaded from Stack Overflow
• mining/tree-structure-webpages/input/expert-graph-anchor.txt: extracted from the nodes in the knowledge graph provided by Dr. Qing Wang (see Appendix 5)

Testing

All code was tested by running it on real-world data, and the data sets used are included in the input/ folders.

Experiments

Experiments are conducted for two mining techniques:

• From human-selected tags
• From tree-structure webpages

Mining from human-selected tags

The experiment was conducted to study the effect of removing the dominant node ('database') in the graph.

Hardware used

MacBook Pro (13-inch, 2017)

Dataset used

tags.csv: Stack Overflow post tags of all posts tagged with 'database' (downloaded on Sep 4, 2017)

Results

• nodes.csv and edges.csv: nodes and edges added to the graph with all post tags
• nodes_without_dominant.csv and edges_without_dominant.csv: nodes and edges with the dominant node ignored

Mining from tree-structure webpages

An experiment was conducted to study the use of anchors when mining the graph from the tree-structure webpages.

Hardware used

MacBook Pro (13-inch, 2017)

Dataset used

Wikipedia pages and two different sets of anchors:

• stackoverflow-anchor.txt: Stack Overflow post tags as anchors
• expert-graph-anchor.txt: facts in the expert modeled graph as anchors

Results

• result_stackoverflow-anchor_depth.txt: tuples added to the graph with Stack Overflow post tags as anchors
• result_expert-graph-anchor_depth.txt: tuples added to the graph with facts in the expert modeled graph as anchors


Appendix 4: README File


README

This artefact includes two folders corresponding to the two lines of constructing a knowledge graph, which are mining and modeling.

For each technique, there is an individual folder including:

• source code
• input folder
• output folder

Environment requirement

The artefact was developed and tested with Python 3.6.4 under macOS High Sierra. Python 3 is recommended for the running environment.

Modeling

The folder includes the pipeline to create the commands for constructing a knowledge graph in Neo4j.

Prerequisite

Python 3 with sys library installed.

Input

A text file in which each line is a tuple in the knowledge graph, like: entity_set is data_structure

How to run

python create_neo4j_graph.py input_file

Example: python create_neo4j_graph.py input/knowledge-graph.txt

Output

neo4j-commands.txt: Neo4j commands that could be run on the Neo4j server
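The conversion from a tuple line to a Neo4j command can be sketched as below. This is not the code of create_neo4j_graph.py: the node label `Fact`, the `name` property, the whitespace-separated input format and the function name are all assumptions made for the example. The relationship type is wrapped in backticks so that words such as `is` are not read as Cypher keywords.

```python
def cypher_for_tuple(line):
    """Build one Cypher statement for a 'head relation tail' line.

    Assumes exactly three whitespace-separated fields and names without
    quote characters; the label 'Fact' is hypothetical.
    """
    head, relation, tail = line.split()
    return (
        f"MERGE (a:Fact {{name: '{head}'}}) "
        f"MERGE (b:Fact {{name: '{tail}'}}) "
        f"MERGE (a)-[:`{relation}`]->(b);"
    )


command = cypher_for_tuple("entity_set is data_structure")
```

MERGE is used instead of CREATE in this sketch so that a fact appearing in several tuples is stored as a single node.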

Notes

To run the Neo4j commands, please have Neo4j server installed. Please use the link https://neo4j.com/download/ to download and install Neo4j.

Mining

There are two techniques developed under mining:

• From the human-selected tags
• From tree-structure webpages

From Human-Selected Tags

The folder human-selected-tags includes the algorithm to build the knowledge graph based on the human-selected tags.

Prerequisite

Python 3 with sys and string library installed.

Input

A csv file including the tags of all posts tagged with 'database', downloaded from Stack Overflow

How to run

python mine_from_tags.py input_file t_sup t_conf

Example: python mine_from_tags.py input/tags.csv 0.0007 0.15

Output

nodes.csv: csv file with all the nodes with their ids and labels

edges.csv: csv file with all the edges with their weights
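The README does not spell out the two thresholds. Assuming that t_sup and t_conf are the minimum support and minimum confidence in the sense of association rule mining [4], edge selection between tags could be sketched as follows. The function name and the toy posts are hypothetical, and mine_from_tags.py may compute the values differently.

```python
from collections import Counter
from itertools import permutations


def mine_edges(posts, t_sup, t_conf):
    """Return directed edges (a, b, confidence) between co-occurring tags.

    support(a, b)      = posts containing both a and b / all posts
    confidence(a -> b) = posts containing both a and b / posts containing a
    """
    n = len(posts)
    tag_count = Counter()
    pair_count = Counter()
    for tags in posts:
        unique = set(tags)
        tag_count.update(unique)
        # Ordered pairs, so (a, b) and (b, a) are counted separately.
        pair_count.update(permutations(sorted(unique), 2))
    edges = []
    for (a, b), both in pair_count.items():
        support = both / n
        confidence = both / tag_count[a]
        if support >= t_sup and confidence >= t_conf:
            edges.append((a, b, confidence))
    return edges


# Toy posts (invented for illustration only).
posts = [
    ["database", "mysql"],
    ["database", "mysql"],
    ["database", "sql"],
    ["database"],
]
edges = mine_edges(posts, t_sup=0.5, t_conf=0.6)
```

On the toy posts only the edge from 'mysql' to 'database' survives: the pair has support 0.5, and every post tagged 'mysql' is also tagged 'database', whereas only half of the 'database' posts carry 'mysql'.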

From Tree-Structure Webpages

The folder tree-structure-webpages includes the algorithm to construct the knowledge graph based on crawling the tree-structure webpages.

Prerequisite

Python 3 with sys, re, urllib.request, collections library installed.

Input

A txt file with a list of anchors. Each line in the file should represent an anchor, like 0,database

How to run

python mine_from_webpages.py starting_page anchor_file depth

Example: python mine_from_webpages.py Outline_of_databases input/expert-graph-anchor.txt 1

Output

result_some-anchor_depth.txt: tuples to be added to the knowledge graph

Graph Visualization

There is a third folder under mining which includes the graph visualization results shown in Gephi.

Prerequisite

Gephi installed: please use this link https://gephi.org/ to download and install Gephi.

Input

Stackoverflow.gephi: graph mined from the Stack Overflow post tags, including two workspaces:

• all posts with database tag
• database tag ignored

Wikipedia.gephi: graph mined from the Wikipedia pages, including two workspaces:

• Stack Overflow post tags as anchors
• facts in expert modeled graph as anchors

How to run

1. After opening Gephi, select File->Open...
2. From the list of files shown, select the gephi file (ending with .gephi) and then click Open.
3. The graph visualization result is then shown on the screen.


Appendix 5: Expert Modeled Knowledge Graph

Appendix 5 is the knowledge graph modeled by the domain expert, Dr. Qing Wang, which includes the facts related to ’database’ domain knowledge.
