Mining Domain Landscape

from Social Q&A Websites

Tonghui Yuan

Supervisor: Dr. Zhenchang Xing

A project report submitted in partial fulfilment of the degree of Bachelor of Advanced Computing (Honours)

in Department of Computer Science College of Engineering and Computer Science

The Australian National University

May 2017


Tonghui Yuan - u5833138

Acknowledgements

I would like to convey my highest respect and deepest gratitude to everyone who has helped me to complete this individual project and this report.

Above all, I would like to express my sincere thanks to Dr. Zhenchang Xing, my project supervisor, for his patient guidance and valuable suggestions. Thank you for guiding me through the entire project over the past several months.

Special thanks to Chunyang Chen for being my technical supervisor. Thank you for your clear instructions, for encouraging my study and for helping me throughout the project.

I would also like to thank the project coordinator, Dr. Weifa Liang. Thank you for all the effort you put into organising study sessions and for your useful suggestions on giving presentations and writing reports.

Finally, I would like to thank all my great friends. Thanks to Tudor Barbulescu and Jay Hansen for their valuable suggestions and for helping me step in the right direction when I got lost in this project. Thanks to Huade Huang, Patrick McCawley, Yangfan Zhang and Zhuoqi Qiu for proofreading this report and revising its grammar. Thank you very much!


Declaration

This project was conducted between February 2017 and May 2017 in COMP4560 Advanced Computing Project for the degree of Bachelor of Advanced Computing (Honours) in the Department of Computer Science at ANU.

Except where otherwise indicated, this project is entirely my own work.

Tonghui Yuan

u5833138

Australian National University

26 / 05 / 2017


Abstract

Understanding the math landscape is crucial. However, the sheer number of mathematical terms makes it difficult to summarise that landscape. To learn specific mathematical terms, people currently rely on online information such as online journals, forum discussions and blogs. Although useful, this information has limitations: it lacks a comprehensive summary of the math landscape, and it is dispersed across distinct resources. This project utilises the fact that Mathematics StackExchange users tag their questions with mathematical terms that summarise the content of the questions. An algorithm is designed to mine the association rules between each pair of mathematical terms from millions of questions posted on Mathematics StackExchange. The pairs are then visualised as a graphical associative network that represents the landscape of math, and a website is built to display it. Despite some limitations, the project shows that the mined math landscape covers most mathematical terms and captures the complex relationships among them. The website presenting the math landscape can help people find useful information more easily.

Keywords: Association rule mining, Data visualisation, Q&A websites, Math landscape


List of Figures

Figure_1: Highlight Function
Figure_2: Road Map
Figure_3: Mathematic LandScape
Figure_4: Clustering hierarchy
Figure_5: Relational knowledge of “calculus”
Figure_6: Modularity tool in Gephi
Figure_7: One part of math landscape
Figure_8: crowded part


Contents

Acknowledgements
Declaration
Abstract
List of Figures
Contents

Chapter 1 Introduction
    1.1 Motivation and Contributions
    1.2 Related Work
        1.2.1 Association rule mining
        1.2.2 Community Detection
        1.2.3 Structured Knowledge and “TechLand”
    1.3 Road Map

Chapter 2 Algorithm
    2.1 What are Mathematical Terms?
    2.2 Association Rule Mining

Chapter 3 Implementation
    3.1 Data visualisation
    3.2 Web Implementation

Chapter 4 Evaluation
    4.1 The Accuracy of Association Pair Mining
    4.2 Limitation

Chapter 5 Conclusion and Future Work

References
Appendix 1 - Study Contract
Appendix 2 - Files Lists
Appendix 3 - Readme File


Chapter 1

Introduction

Mathematics has many branches, and each branch covers a wide variety of terminology, so the knowledge network of math is complex. The aim of this project is to generate a summary of the landscape of mathematical terminology. This report describes the whole process of the project: collecting the raw data from StackExchange websites, processing the data, aggregating association pairs, visualising the pairs as graphs and developing a website to show the math landscape.

1.1 Motivation and Contributions

Mathematics, referred to as the queen of the sciences by Gauss (En.wikipedia.org, 2017), contains a large number of branches, and each branch has many topics. Thus, even for a specialist in math, it is hard to give a whole picture of the landscape of the mathematical world. However, knowing the landscape of math is helpful for people in the field: by understanding the landscape, they can address the cold start problem more easily and get guidance on the right direction in which to explore the field more deeply.

Two types of information are usually needed to understand the math landscape. The first is what the currently known mathematical terms are, and the second is how one mathematical term correlates with the others. Nowadays, people have basically two approaches to getting these two types of information. One approach is reading documents posted on the internet. Some people share their understanding of a mathematical term by writing an article and posting it online. For example, Elsevier, a major source of scientific and technical information in the world (Elsevier, 2017), has established a journal named “Linear Algebra and its Applications”. This journal publishes many open access articles that contribute new information or new insights to linear algebra (Elsevier, B V, 2017), so people can read this online journal to learn about linear algebra. The other approach is posting questions on Q&A websites. Q&A websites, such as Quora and Mathematics StackExchange (Mathematics.SE), are becoming increasingly popular, and more and more people prefer to discuss their questions on these social Q&A websites (Israelsky, 2011).

However, both approaches have limitations. Firstly, the information in online articles and the discussions on social Q&A websites are largely based on the authors’ personal opinions; thus, the online information often lacks objectivity. Secondly, a journal article or a discussion usually focuses on one specific mathematical term, not on a set of correlated mathematical terms. For instance, when searching for “graph-theory” on Google, the returned results are tutorials, definitions and introductions of “graph-theory” (Google, 2017). In this case, people cannot see how graph theory relates to other mathematical terms. The result is that no consistent summary of the landscape around this mathematical term is available.

Besides the above two limitations of online information, online searching is inherently flawed. When people search for a mathematical term on Google, some overlapping content is listed, and some answers are not even relevant to the field of math. For instance, Google returns similar articles for the query “the use of word ‘integral’ in math”, such as “How do you actually use the word ‘integral’” and “What does integral in math mean?” (Google, 2017). The definitions of “integral” given in those online articles overlap but differ. As a result, beginners in math become confused about the real meaning of the word ‘integral’. If the query is changed to “the use of word ‘integral’”, Google returns results like “Use integral in a sentence”[31], an article about semantics. These examples show that it is difficult to find useful information about the math landscape with a search engine. These limitations are the primary motivation for this project.


To overcome the above limitations and give people an easy way to explore the landscape of math, this project makes several contributions. First of all, more than two million posts have been collected from Mathematics.SE. Based on the fact that Mathematics.SE users tag their questions with the mathematical terms that the questions revolve around, this project has designed an association rule mining algorithm to mine the association rules of the tags. Secondly, community detection has been applied to the mined association pairs, grouping the tags into different clusters. After that, these pairs have been visualised as an associative network using the force-driven layout in D3.js, a JavaScript library for visualising data in web browsers (Bostock, 2017). Finally, a website has been built to display the graphs. To make the website more useful, the tags have been sorted by frequency and the 10 most frequent mathematical terms have been picked out. Then 10 subgraphs have been generated, each displaying the landscape of one of these mathematical terms. In addition, to show each cluster more clearly, a highlight function is implemented. As shown in Figure_1, when the mouse points at the “calculus” node, all the other nodes are hidden except those connected to “calculus”.

Figure_1: Highlight Function

1.2 Related Work

This section reviews the related work from three aspects: association rule mining, community detection, and knowledge structuring.

1.2.1 Association rule mining

Data Mining (DM), as a new information processing technology, is commonly used to discover hidden value in large databases in various fields (Deng, 2012). Association rule mining (ARM) is an important area in DM, and it is used to mine frequently co-occurring patterns from an existing database (Iqbal & ur Rehman, 2016). A lot of research has been done on association rule mining. Zhou and Wang, for example, studied classical association rule mining algorithms and published “Research of Commonly Used Association Rules Mining Algorithm in Data Mining” in 2011 (Zhou & Wang, 2011). In their paper, they explain the basic model of association rule mining by analysing the sales transaction problem. They also compare several association rule mining algorithms, including the Data Set Partitioning Algorithm, Depth-First Algorithm, Breadth-First Algorithm, Sampling Algorithm and Incremental Update Algorithm (Zhou & Wang, 2011). Among all these algorithms, the Apriori algorithm is the most classical one (Zhou & Wang, 2011). Jiao (2013) shows that the emergence of many long patterns degrades the performance of the Apriori algorithm, and proposes an improved version of it (Jiao, 2013). Another paper about ARM worth mentioning is “Fast Algorithms for Mining Association Rules” by Agrawal and Srikant. To discover association rules in a large database, the authors propose two new algorithms and a combination of the two named “AprioriHybrid”. The Associative Math_Pair Mining algorithm designed in this project follows the idea of the Apriori algorithm. As only pairs are needed in this project, the algorithm does not iterate over larger item sets, so it is much simpler than the Apriori algorithm.

1.2.2 Community Detection

In the digital age, people from various fields deal with large amounts of information every day. Usually, this information can be considered as complex networks with an internal hierarchy (Newman, Barabasi, & Watts, 2009). Real-world networks are usually very large, so decomposing them into communities is important. To uncover a priori unknown modules, such as the topics in an information network, nodes that are strongly correlated with each other are grouped into one cluster (Blondel, et al., 2008). Many community detection algorithms have been developed to help people extract communities from a network (Zhao, et al., 2016). To evaluate the accuracy of a community detection algorithm, researchers have conducted many experiments on different types of network. For instance, Lu, et al. have proposed a novel algorithm for community detection and tested it on weighted networks (Lu, Wen, & Cao). Furthermore, Zhao, et al. (2016) have run community detection algorithms on artificial networks and identified the limitations of some of them.

Besides the above research, Blondel, et al. (2008) have proposed a heuristic method based on modularity, called the Louvain Modularity method. Instead of requiring users to specify the number of communities to be detected, the Louvain method unfolds the hierarchical community structure of the network using iterative modularity maximisation (Blondel, et al., 2008). The Louvain method assigns each node in the network to exactly one cluster (Wikipedia, 2017). An edge with both ends in the same cluster contributes positively to the modularity, while edges that extend across clusters have a negative effect on modularity (Chen, Xing, & Han, 2015).
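To make the notion of modularity concrete, the following minimal sketch evaluates the standard Newman modularity, Q = Σ_c [e_c / m − (d_c / 2m)²], for an unweighted, undirected graph under a given partition, where m is the number of edges, e_c the number of edges inside community c and d_c the total degree of its nodes. The toy graph, the node names and the function name `modularity` are illustrative assumptions rather than data or code from this project, and the sketch only scores a partition; it does not perform the Louvain optimisation itself.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of an undirected, unweighted graph.

    edges:     list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1

    degree_sum = defaultdict(int)  # total degree per community
    for node, k in degree.items():
        degree_sum[community[node]] += k

    # Q = sum over communities c of  e_c / m  -  (d_c / 2m)^2
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical toy graph: two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}

q_two = modularity(toy_edges, two_clusters)  # 5/14, about 0.357
q_one = modularity(toy_edges, one_cluster)   # 0.0
```

Splitting the toy graph into its two triangles scores higher than putting every node in one cluster, which is the kind of improvement a modularity-maximising method searches for.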

1.2.3 Structured Knowledge and “TechLand”

Structured knowledge graphs are widely used in the learning process because a high-quality structured knowledge graph contains a wealth of information that can help learners model the knowledge (Zhao & Zhang, 2016). Many studies have found that structured knowledge can emerge from the tagging systems of Web communities, such as Furl, Amazon, and Rojo (Zhao & Zhang, 2016). Those Web communities treat tagging as a classification process, and objects on social web communities, such as photos and questions, are often tagged with keywords by the users. By tagging the objects, sets of categories are derived based on the tags (Zhao & Zhang, 2016), and a potential clustering hierarchy (shown in Figure_4) emerges in a database of those tags.

The information hidden in the tags can be utilised to generate knowledge graphs. Mining knowledge graphs is a vibrant area in DM, and many knowledge-graph based applications are actively researched (Chen, Xing, & Han, 2015), such as query understanding and reformulation (Joyee & Papadimitriou, 2010), entity profiling (Yerva, et al., 2012), exploratory search (Genc, 2014) and serendipitous search (Bordino, Mejova, & Lalmas, 2013). A successful application of structured knowledge is TechLand, an existing tool for exploring the technology landscape. TechLand was developed by Chunyang Chen to help developers with a software engineering background gain a good understanding of the landscape of most programming concepts (Chen, Xing, & Han, 2015). However, the TechLand system is limited to the programming field; people from other fields, such as math, linguistics and games, also need a platform that makes it easy to find the landscape of a particular field. Thus, another motivation for this project is to extend the idea and method of the TechLand system to other fields. This project is based on a math database collected from Mathematics.SE. After all the pairs of domain-specific entities that have association relationships are mined from the database, a visualised knowledge graph based on those pairs is generated.

1.3 Road Map

This report describes the entire process of this project. The road map of the report is shown in Figure_2.

Figure_2: Road Map


Chapter 2

Algorithm

In this project, the expected results of the algorithm are association pairs of mathematical terms, and these pairs are visualised as the associative network shown in Figure_3. The associative network of mathematical terms illustrates the math landscape. Every node in this graph stands for a particular mathematical term, and the link between two nodes represents the association rule of that pair. In addition, to give users a clearer view of each cluster, the 10 most frequent tags are picked out and a corresponding subgraph is generated for each (shown in Figure_4 and Figure_5). The most challenging step in obtaining the landscape is obtaining the association pairs. This chapter introduces the algorithm designed for mining the associations among these tags.

Figure_3: Mathematic LandScape

2.1 What are Mathematical Terms?

Before the algorithm is introduced, it is necessary to clarify the meaning of “Mathematical Terms”. Like most social communities, Mathematics.SE uses a tagging system, and each question posted on Mathematics.SE has up to five tags (What are tags, and how should I use them?, 2017). These tags take different forms: some are single words, such as ‘limits’ and ‘analysis’, and others are phrases, such as ‘graph-theory’, ‘functional-analysis’ and ‘sequences-and-series’. Usually, these tags are mathematical terms that give a brief summary of the topics contained in the questions. For this reason, the tags are referred to as mathematical terms in this project.

Figure_4: Clustering hierarchy

Figure_5: Relational knowledge of “calculus”

2.2 Association Rule Mining

R. Agrawal and R. Srikant (1994) proposed the Apriori algorithm, which is the first algorithm for mining frequent item sets with Boolean association rules (Jiao, 2013). Based on this algorithm, a simpler algorithm is designed in this project to mine all the association pairs and explore the association rules between the two tags of each pair.

The following two parameters play a crucial role in association rule mining:

$$\mathrm{support}(\mathit{tagA}, \mathit{tagB}) = \frac{\mathit{tagCombo}\ \text{containing}\ (\mathit{tagA}\ \&\ \mathit{tagB})}{\mathit{tagCombo}}$$

$$\mathrm{confidence}(\mathit{tagA} \Rightarrow \mathit{tagB}) = \frac{\mathit{tagCombo}\ \text{containing}\ (\mathit{tagA}\ \&\ \mathit{tagB})}{\mathit{tagCombo}\ \text{containing}\ \mathit{tagA}}$$

In the above two equations, $\mathit{tagA}$ and $\mathit{tagB}$ are two different tags, $\mathit{tagCombo}$ is the total count of the questions posted on Mathematics.SE, and $\mathit{tagCombo}\ \text{containing}\ (\mathit{tagA}\ \&\ \mathit{tagB})$ is the count of the tag combinations that contain both $\mathit{tagA}$ and $\mathit{tagB}$. The support value $\mathrm{support}(\mathit{tagA}, \mathit{tagB})$ evaluates how often $\mathit{tagA}$ and $\mathit{tagB}$ occur together across all questions, while the confidence value $\mathrm{confidence}(\mathit{tagA} \Rightarrow \mathit{tagB})$ measures how often $\mathit{tagB}$ appears in questions that are labelled with $\mathit{tagA}$. When the support value and the confidence value of the tag pair $(\mathit{tagA}, \mathit{tagB})$ each reach their respective thresholds, an association rule is considered to exist between $\mathit{tagA}$ and $\mathit{tagB}$.
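As a minimal illustration of the two definitions, the sketch below computes the support and confidence values over a tiny, made-up list of tag combinations. The function names, the variable `tag_combos` and the four sample records are assumptions chosen for illustration; they are not taken from the project's code or data.

```python
def support(tag_combos, tag_a, tag_b):
    """Fraction of all tag combinations that contain both tags."""
    both = sum(1 for combo in tag_combos if tag_a in combo and tag_b in combo)
    return both / len(tag_combos)


def confidence(tag_combos, tag_a, tag_b):
    """Fraction of the combinations containing tag_a that also contain tag_b."""
    with_a = [combo for combo in tag_combos if tag_a in combo]
    if not with_a:
        return 0.0  # tag_a never occurs, so the rule has no evidence
    return sum(1 for combo in with_a if tag_b in combo) / len(with_a)


# Hypothetical records: one list of tags per posted question.
tag_combos = [
    ["calculus", "limits"],
    ["calculus", "integration"],
    ["calculus", "limits", "sequences-and-series"],
    ["graph-theory"],
]

sup = support(tag_combos, "calculus", "limits")             # 2 of 4 records
conf_ab = confidence(tag_combos, "calculus", "limits")      # 2 of 3 records
conf_ba = confidence(tag_combos, "limits", "calculus")      # 2 of 2 records
```

The example also shows that support is symmetric in the two tags while confidence is not: every question tagged 'limits' is also tagged 'calculus', but not the other way round.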


Following the typical sales transaction example used by Agrawal and Srikant (Agrawal & Srikant*), the combination of tags attached to each post can be considered a transaction in this project, and each single tag is an item in the transaction. The algorithm designed in this project can be divided into two main stages: the first obtains all possible undirected pairs, and the second selects the association pairs based on the support value and the confidence value.

Table_1 explains all the notations used in the algorithm and Algorithm_1 shows the core part of the entire algorithm.

Table_1: Notations for Associative Math_Pair Mining Algorithm

| Role | Notation | Description | Structure / Type |
|---|---|---|---|
| Input | Combo_Tag | An ArrayList of String[ ] that stores all the records of tag combinations. Each String[ ] stores the combination of tags attached to one posted question. | [[tagA, tagB], [tagA, tagC, tagD], … […]] |
| Output | OuterMap | A HashMap that stores all pairs of tags appearing in the same question. The key is a String recording a single tag, and the value is another HashMap whose key is another tag and whose value counts the frequency of those two tags. | {tagA, {tagB, frequency of tagA&B}} |
| Output | OneWayPair | A HashMap that stores all the tag pairs and their frequencies. The key is a Set that stores the pair, and the value is the frequency of this pair. | {[tagA, tagB], frequency of tagA&B} |
| Variable | TempMap | A HashMap that acts as a temporary container for the inner map. | {tagB, frequency of tagA&B} |
| Variable | count | The counter for the frequency of tagA&B while scanning all the records of tag combinations. | int |

Note:
1. The difference between OneWayPair and OuterMap is that in OneWayPair, the situations [tagA, tagB] and [tagB, tagA] are merged together.
2. The following pseudo code illustrates the algorithm for mining the association pairs, which is the core of the whole algorithm. Although the final expected output is OneWayPair, the process of transforming OuterMap to OneWayPair is not displayed in the following code.


Algorithm_1: Associative Math_Pair Mining

Algorithm_1 gives the core part of the Associative Math_Pair Mining algorithm, which mines all the pairs that occur in the same question. At this stage, $(\mathit{tagA}, \mathit{tagB})$ and $(\mathit{tagB}, \mathit{tagA})$ are considered two different pairs. However, the final tag correlation graph is undirected, so $\mathit{OuterMap}$ is then transformed into $\mathit{OneWayPairs}$ by merging $(\mathit{tagA}, \mathit{tagB})$ and $(\mathit{tagB}, \mathit{tagA})$ into $(\mathit{tagA}, \mathit{tagB})$.

After all the undirected pairs are obtained, an initial support value and confidence value are chosen arbitrarily to produce a first generation of pairs, which is then visualised as a graph. The outcome of each generation is inspected and compared, and the support value and confidence value are adjusted manually until the resulting graph is visually clear. The support value and confidence value that produce this graph are recorded as the thresholds. If the confidence value of a pair is greater than the confidence threshold, the association rule $tA \rightarrow tB$ is mined.
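The two stages described above can be sketched as follows. This is a hedged Python approximation built from the structures in Table_1, not the project's own code: the names `mine_outer_map`, `to_one_way_pairs` and `filter_pairs` are hypothetical, the sample records and thresholds are made up, and keeping an undirected pair when either direction meets the confidence threshold is an assumption about how the directed rule maps onto an undirected edge.

```python
from collections import defaultdict
from itertools import permutations


def mine_outer_map(combo_tag):
    """Stage 1a: count ordered co-occurrences, outer_map[tagA][tagB] = frequency."""
    outer_map = defaultdict(lambda: defaultdict(int))
    for combo in combo_tag:
        # set() guards against a tag being listed twice on one question
        for tag_a, tag_b in permutations(set(combo), 2):
            outer_map[tag_a][tag_b] += 1
    return outer_map


def to_one_way_pairs(outer_map):
    """Stage 1b: merge (tagA, tagB) and (tagB, tagA) into one undirected pair."""
    one_way = {}
    for tag_a, inner in outer_map.items():
        for tag_b, freq in inner.items():
            # Both directions carry the same count, so either may be kept.
            one_way[frozenset((tag_a, tag_b))] = freq
    return one_way


def filter_pairs(one_way, combo_tag, min_support, min_confidence):
    """Stage 2: keep pairs whose support and confidence reach the thresholds."""
    total = len(combo_tag)
    tag_count = defaultdict(int)
    for combo in combo_tag:
        for tag in set(combo):
            tag_count[tag] += 1
    kept = {}
    for pair, freq in one_way.items():
        tag_a, tag_b = tuple(pair)
        sup = freq / total
        # Assumption: the edge is kept if either direction of the rule qualifies.
        conf = max(freq / tag_count[tag_a], freq / tag_count[tag_b])
        if sup >= min_support and conf >= min_confidence:
            kept[pair] = freq
    return kept


# Hypothetical records and thresholds, for illustration only.
combo_tag = [
    ["calculus", "limits"],
    ["calculus", "integration"],
    ["calculus", "limits", "sequences-and-series"],
    ["graph-theory"],
]
outer_map = mine_outer_map(combo_tag)
one_way_pairs = to_one_way_pairs(outer_map)
kept = filter_pairs(one_way_pairs, combo_tag, min_support=0.5, min_confidence=0.6)
```

Because only pairs are counted, a single pass over the records is enough; there is no candidate-generation loop over larger item sets as in Apriori.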


Chapter 3

Implementation

This part of the project has two main stages: data visualisation and web implementation.

3.1 Data visualisation

To visualise the mined pairs, this project first groups the tags into clusters by community detection. As mentioned in Section 1.2.2, the Louvain method detects communities in an associative network automatically. The Modularity function in Gephi uses the Louvain method to detect strongly correlated tags (Ferrante, 1987), so Gephi was chosen for community detection. As shown in Figure_6, 14 modularity classes are returned, which means that the entire math landscape is divided into 14 clusters. The red points on the graph show the number of nodes in each cluster.

Figure_6: Modularity tool in Gephi

Based on the modularity classes, the landscape of mathematical terms is finally represented as a graphical associative network. In this project, Force Driven Layout has been used to visualise the mined association rules as the association network. Figure_7 illustrates a part of the math landscape visualised by Force Driven Layout; the network is an undirected graph G(V,E) with several clusters. The node set V of the graph is a set of tags (i.e., the mathematical terms), and the size of a node represents the activity metric of the tag. The edge set E contains undirected edges $\langle \mathit{tagA}, \mathit{tagB} \rangle$ (i.e., associations between the tag pairs). As Figure_7 shows, Force Driven Layout is a good choice for inspecting clustering results in a network.

To make the landscape easier to read, some extra features have been implemented on the graph. Firstly, as can be seen in Figure_7, all the nodes are labelled with their tag names, so that the reader can easily recognise what each node stands for. Secondly, the edges are coloured according to the nodes they connect. If the two nodes have the same colour, the link between them has that colour too, while if the nodes have different colours, the link is randomly given one of the two colours. In this way, the community of each mathematical term is clearer, and people can easily tell how many clusters there are and how closely they are connected with each other. Another feature of this graph is cluster highlighting, discussed in Chapter 1: only the neighbours of a node are highlighted when the mouse moves over it. This makes it convenient to find the correlations between one node and its neighbours in a complex network.

Additionally, this project generates 10 subgraphs for the 10 most frequent tags. The process of obtaining the 10 subgraphs has two parts: collecting the 10 most frequent tags, and obtaining the subset of tagCombo for each tag. Given a tag $\mathit{tagA}$, to mine its subgraph, the questions that are tagged with $\mathit{tagA}$ are first picked out from all posts.

When the relational knowledge of a given tag $\mathit{tagA}$ is mined, $\mathit{tagA}$ itself is removed from the transactions. The Associative Math_Pair Mining algorithm is then executed on the remaining data to find the association rules related to $\mathit{tagA}$. These results are used to build subgraphs like the one shown in Figure_5.
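The preparation of the data for one subgraph can be sketched as below. The helper names `most_frequent_tags` and `sub_transactions`, and the sample records, are hypothetical; the sketch only covers selecting the frequent tags and building the reduced transactions, on which the pair mining algorithm would then be run.

```python
from collections import Counter


def most_frequent_tags(combo_tag, n=10):
    """Return the n tags that occur on the largest number of questions."""
    counts = Counter(tag for combo in combo_tag for tag in set(combo))
    return [tag for tag, _ in counts.most_common(n)]


def sub_transactions(combo_tag, focus_tag):
    """Keep only the questions tagged with focus_tag, then drop focus_tag itself."""
    return [
        [tag for tag in combo if tag != focus_tag]
        for combo in combo_tag
        if focus_tag in combo
    ]


# Hypothetical records, for illustration only.
records = [
    ["calculus", "limits"],
    ["calculus", "integration"],
    ["calculus", "limits", "sequences-and-series"],
    ["graph-theory"],
]
top_two = most_frequent_tags(records, n=2)
calculus_subset = sub_transactions(records, "calculus")
```

Removing the focus tag before mining matters because it occurs in every selected record and would otherwise pair with every other tag, hiding the relationships among its neighbours.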

Figure 7: One part of the math landscape

3.2 Web Implementation

Implementing the graphs in a website is the final step of this project. The website is built with basic web technologies: HTML, CSS, and JavaScript. In particular, a JavaScript library named D3.js is used to generate the final math landscape. D3 produces dynamic, interactive data visualisations in web browsers (D3.js, 2017). To prepare the data for D3, the mined pairs are first transformed into a .json file. The .json file contains two arrays: the nodes array and the links array. As displayed in Figure 8, people can use the navigation bar to go to the subGraph page, or they can search for a mathematical term in the search bar.
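The transformation of mined pairs into a file with a nodes array and a links array could look like the sketch below. It is written in Python for illustration only; the field names `name`, `size`, `source`, and `target`, the use of array indices for link endpoints, and the function name `build_graph_json` are assumptions, since the report does not state the exact schema.

```python
import json

def build_graph_json(pairs, activity):
    """Convert mined tag pairs into a JSON string with a nodes array and a links array."""
    # Collect every tag that appears in at least one pair, in a stable order.
    names = sorted({tag for pair in pairs for tag in pair})
    index = {name: i for i, name in enumerate(names)}
    # Node size stands for the activity metric of the tag (default 1 if unknown).
    nodes = [{"name": name, "size": activity.get(name, 1)} for name in names]
    # Each link refers to its end nodes by their position in the nodes array.
    links = [{"source": index[a], "target": index[b]} for a, b in pairs]
    return json.dumps({"nodes": nodes, "links": links}, indent=2)

# Hypothetical mined pairs and activity values, for illustration only.
pairs = [("calculus", "limits"), ("calculus", "integration")]
activity = {"calculus": 5, "limits": 2}

graph_json = build_graph_json(pairs, activity)
```

Keeping nodes and links in separate arrays means a tag shared by many pairs is stored once and referenced by each of its links.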


Figure 8: The header of the website

Figure 9: The footer of the website

Chapter 4

Evaluation

This chapter has two sections. The first section evaluates the mined association pairs and the generated associative network. The second section examines the limitations of this project that lead to poor performance.

4.1 The Accuracy of Association Pair Mining

To test the performance of the Associative Math_Pair Mining algorithm, two experiments were conducted in this project. In the first experiment, 20 records of tag combinations were selected randomly from 2053397 records as a sample dataset, and the algorithm was run on the sample dataset without setting a support value or a confidence value. The purpose of this experiment was to test the performance of the algorithm in mining all possible association pairs. The algorithm collected 54 pairs in total from the 20 records, and this result was verified by checking the pairs manually.

The purpose of the second experiment was to find appropriate support and confidence values for mining high-quality association pairs. To make the evaluation more efficient, this experiment sampled only 100 tags from the 1482 tags and then ran the algorithm on the whole set of tag combinations with different support and confidence values. The resulting associative pairs were then checked manually by looking up each tag in the corresponding TagWiki [32]. It appeared that if the support value was too high, too few tags were picked out, while if it was too low, there were too many nodes in the graph, which made the network muddled. After some attempts, the support value was set to 0.007, which generates an appropriate number of nodes. To get highly correlated pairs, different confidence values were then tried with that support value, and 0.085 was selected as the final confidence value for generating the association rules. According to this experiment, a large confidence value results in dispersion, in which the nodes are divided into more clusters than expected, while with a small confidence value most of the nodes are correlated.
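The role of the two thresholds can be illustrated with a small pair mining sketch. This is not the Associative Math_Pair Mining algorithm itself, whose details are given elsewhere in the report. It assumes the standard definitions: support is the fraction of questions containing both tags, and confidence is the co-occurrence count divided by the count of one tag. Because the pairs are undirected, the sketch takes the larger of the two directional confidences, which is also an assumption; the function name `mine_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def mine_pairs(tag_combos, min_support=0.0, min_confidence=0.0):
    """Return {(tagA, tagB): (support, confidence)} for pairs passing both thresholds."""
    n = len(tag_combos)
    tag_count = Counter()
    pair_count = Counter()
    for combo in tag_combos:
        tags = sorted(set(combo))  # sort so each pair has one canonical order
        tag_count.update(tags)
        pair_count.update(combinations(tags, 2))
    result = {}
    for (a, b), count in pair_count.items():
        support = count / n
        # Assumption: use the stronger of the two directional confidences.
        confidence = max(count / tag_count[a], count / tag_count[b])
        if support >= min_support and confidence >= min_confidence:
            result[(a, b)] = (support, confidence)
    return result

# Hypothetical tag combinations, one list of tags per question.
combos = [
    ["calculus", "limits"],
    ["calculus", "integration"],
    ["calculus", "limits", "integration"],
    ["probability"],
]

all_pairs = mine_pairs(combos)                     # no thresholds, as in the first experiment
strong_pairs = mine_pairs(combos, min_support=0.5)
```

With the values chosen in the second experiment, the call would be `mine_pairs(tag_combos, 0.007, 0.085)`. Raising `min_support` removes rare pairs and therefore nodes, while raising `min_confidence` removes weak links between the nodes that remain.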

Figure 8: Crowded part

4.2 Limitation

Although this project has mined all the association pairs from the math database and visualised them as a mostly accurate summary graph of the math landscape, the generated associative network has some limitations. Firstly, the approach used in this project does not work very well for tags that occur very often. As shown in Figure 7, communities do exist among the blue and green nodes, but that region is extremely crowded; as a result, it is hard to make out the relationships among different clusters. The second limitation of the associative network is that it is static. If the database on Mathematics.SE changes, the graph cannot change automatically; yet Mathematics.SE is dynamic, with different questions being asked all the time, which can easily make the associative network outdated. Another drawback is that the static graph cannot show how the association rules between these mathematical terms have changed since the website was launched.


Chapter 5

Conclusion and Future Work

Nowadays, people rely on online information to learn the math landscape; however, such information can lack objectivity. This project was proposed to overcome the limitations of online information. The most significant outcome is the Associative Math_Pair Mining algorithm, a simple association rule mining algorithm for mining association pairs from tags collected from Mathematics.SE. Moreover, this report also presents a localhost website named MathLand. This website visualises the correlations between pairs of mathematical terms and provides a summary of mathematical terms and the association rules between them, which helps to meet the information needs in math landscape inquiries.

Although this project has generated an associative network of mathematical terms and provides an easy way for people to explore the math landscape, the website could perform better with some further work. On the one hand, the total number of nodes in one cluster can be limited. For example, the node “calculus” has more than ten neighbours, so it would be much clearer if only the ten most frequent nodes were shown in one cluster. Two options are available to achieve this: one is to cut the nodes during association rule mining, but this method may lead to loss of data, because a tag cut from one community could be in the top ten of another community.
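To make the idea of limiting a node to its ten most frequent neighbours concrete, the sketch below selects the top neighbours of one node at display time and leaves the mined pairs untouched. This is only an illustration under assumptions: the report does not specify this approach, and the function name `top_neighbours`, the frequency table, and the sample tags are hypothetical.

```python
def top_neighbours(node, links, frequency, k=10):
    """Return up to k neighbours of `node`, most frequent first."""
    # Collect the other end of every link that touches `node`.
    adjacent = {b if a == node else a for a, b in links if node in (a, b)}
    # Sort by descending frequency; break ties alphabetically for a stable result.
    return sorted(adjacent, key=lambda tag: (-frequency.get(tag, 0), tag))[:k]

# Hypothetical links and tag frequencies, for illustration only.
links = [
    ("calculus", "limits"),
    ("calculus", "integration"),
    ("derivatives", "calculus"),
    ("probability", "statistics"),
]
frequency = {"limits": 30, "integration": 50, "derivatives": 40}

shown = top_neighbours("calculus", links, frequency, k=2)
```

Because the selection is made per node, a tag hidden around one node can still be shown around another node where it ranks in the top k.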

Additionally, some further work could make the project better, but it could not be done because time was limited. Firstly, an associative graph can be generated for every month from 2009 (the year Math.SE started) until now, which would allow the resulting association networks to be visualised as a trend map for comparing trends over that period. The trend map could also be used to predict how those mathematical terms will be used in the future, which can solve the first problem to some degree. However, making the graph change dynamically with the questions posted on Mathematics.SE needs further research.
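The first step towards such a trend map would be to split the tag combinations by month, so that one associative graph can be mined per month. The following Python sketch shows only that grouping step; it assumes each question is available as a (creation date, tags) tuple, and the function name `group_by_month` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def group_by_month(questions):
    """Group tag combinations by the (year, month) in which each question was posted."""
    monthly = defaultdict(list)
    for created, tags in questions:
        monthly[(created.year, created.month)].append(tags)
    return dict(monthly)

# Hypothetical questions, for illustration only.
questions = [
    (datetime(2010, 7, 20), ["calculus", "limits"]),
    (datetime(2010, 7, 25), ["probability"]),
    (datetime(2010, 8, 2), ["calculus", "integration"]),
]

monthly_combos = group_by_month(questions)
```

Each monthly list could then be mined separately, and the resulting pair sets compared from month to month.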


References

(2017, May 10). Retrieved May 24, 2017, from Google: https://www.google.com.au/search?client=safari&rls=en&q=use+of+integral+in+math&ie=UTF-8&oe=UTF-8&gfe_rd=cr&ei=sCkeWdOUGOvDXsyYgZgC

(2017, May 10). Retrieved May 24, 2017, from Google: https://www.google.com.au/search?client=safari&rls=en&q=graph+theory&ie=UTF-8&oe=UTF-8&gfe_rd=cr&ei=eCkeWZatJ-vDXsyYgZgC

Agrawal, R., & Srikant, R. (n.d.). Fast Algorithms for Mining Association Rules. IBM Almaden Research Center, CA.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008, October 9). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008.

Bordino, I., Mejova, Y., & Lalmas, M. (2013). Penguins in sweaters, or serendipitous entity search on user-generated content. Proceedings of the 22nd ACM international conference on information and knowledge management. ACM, 109-118.

Bostock, M. (2017). D3.js. Retrieved May 24, 2017, from D3: Data-Driven-Documents: https://d3js.org

Chen, C., & Xing, Z. (2015). Mining Technology Landscape from Stack Overflow. Singapore.

Chen, C., Xing, Z., & Han, L. (2015). TechLand: Assisting Technology Landscape Inquiries with Insights from Stack Overflow. Singapore.

Cherney, D., Denton, T., Thomas, R., & Waldron, A. (2013). Linear Algebra. California.

Delugach, H. S. (1992, November). Specifying multiple-viewed software requirements with conceptual graphs. Journal of Systems and Software, 19(3), 207-224.

Deng, W. (2012). Future Control and Automation: Proceedings of the 2nd International Conference on Future Control and Automation (ICFCA 2012), Volume 2. Springer Science & Business Media.

Elsevier. (2017, May 21). Retrieved May 25, 2017, from Wikipedia: https://en.wikipedia.org/wiki/Elsevier

Elsevier, BV. (2017). Linear Algebra and its Applications. Retrieved May 24, 2017, from ELSEVIER: https://www.journals.elsevier.com/linear-algebra-and-its-applications

Ferrante, J. (1987, July). The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems, 9(3), 319-349.

Genc, Y. (2014). Exploratory search with semantic transformations using collaborative knowledge bases. WSDM. ACM, 661-666.

Iqbal, M., & ur Rehman, S. (2016, Dec). Association Rule Mining Using Computational Intelligence Technique. International Journal of Computer Science and Information Security, 416-424.

Israelsky, P. (2011). 6 Reasons Why Q&A Sites Can Boost Your SEO in 2011 (Despite Google's Farmer Update). Retrieved May 25, 2017, from MOZ: https://moz.com/blog/6-reasons-why-qa-sites-can-boost-your-seo-in-2011-despite-googles-farmer-update-12160

Jiao, Y. (2013, January 1). The research of improved association rules mining Apriori algorithm. The research of improved association rules mining Apriori algorithm, 2(1).

Joyee, R. R., & Papadimitriou, P. (2010). Structured Annotations of Queries. International Conference on Management of data.


Lu, Z., Wen, Y., & Cao, G. (n.d.). Community Detection in Weighted Networks: Algorithms and Applications. Nanyang Technological University.

Mathematics. (2017, May 10). Retrieved May 24, 2017, from Quora: https://www.quora.com/topic/Mathematics

Mathematics. (2017, May 24). Retrieved May 24, 2017, from StackExchange: https://math.stackexchange.com

Newman, M., Barabasi, A.-L., & Watts, D. J. (2009). The Structure and Dynamics of Networks. Princeton University Press.

Wang, H., & Liu, X. (2011, September 15). The research of improved association rules mining Apriori algorithm. Fuzzy Systems and Knowledge Discovery (FSKD).

What are tags, and how should I use them? (2017). Retrieved May 24, 2017, from StackExchange: https://math.stackexchange.com/help/tagging

Wikipedia. (2017, April 18). Retrieved May 24, 2017, from Louvain Modularity: https://en.wikipedia.org/wiki/Louvain_Modularity#Comparison_to_Other_Methods

Wu, S. (2015, July 10). Understanding the Force. Retrieved May 24, 2017, from Medium: https://medium.com/@sxywu/understanding-the-force-ef1237017d5

Yerva, S. R., Grosan, F., Miklos, Z., Tandrau, A., & Aberer, K. (2012). TweetSpector: Entity-based retrieval of Tweets. SIGIR, 1016-1016.

Zhao, C., & Zhang, L. W. (2016, 14). Structured Knowledge Learning Process Design Based on Ontology. Structured Knowledge Learning Process Design Based on Ontology, 2017(May), 1.

Zhao, X., Xing, Z., Kabir, M. A., Sawada, N., Li, J., & Lin, S. (2017, March 23). HDSKG: Harvesting domain specific knowledge graph from content of webpages. Software Analysis, Evolution and Reengineering (SANER), 2017 IEEE 24th International Conference on.

Zhao, Y., Rene, A., & Claudio, T. J. (2016). A Comparative Analysis of Community Detection Algorithms on Artificial Networks. Scientific.

Zhong, R., & Wang, H. (2011, November 1). Research of Commonly Used Association Rules Mining Algorithm in Data Mining. Internet Computing & Information Services (ICICIS).


Appendix 1 - Study Contract


Appendix 2 - File Lists

File_List_1 is the file list of my association rule mining algorithm. All the txt files and the csv files are the mined results. All the files in this list are my own work.

File_List_2 is the file list of my web development files and data visualisation files in D3.js. All the files in this list are my own work.

File_List_3 is the file list of my community detection files and all the files in this list are my own work.

File_List_1 File_List_2 File_List_3

Appendix 3 - Readme File
