1 ios press a decade of semantic web research …2 s. kirrane et al. / a decade of semantic web...

27
Undefined 0 (2018) 1–0 1 IOS Press A decade of Semantic Web research through the lenses of a mixed methods approach Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University, Country Open review(s): Name Surname, University, Country Sabrina Kirrane a , Marta Sabou b , Javier D. Fernández a , Francesco Osborne c , Cécile Robin d , Paul Buitelaar d , Enrico Motta c , and Axel Polleres a a Vienna University of Economics and Business, Austria E-mail: fi[email protected] b Vienna University of Technology, Austria E-mail:fi[email protected] c Knowledge Media institute (KMi), The Open University, UK E-mail:fi[email protected] d Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland E-mail: fi[email protected] Abstract. The identification of research topics and trends is an important scientometric activity. In the Semantic Web area, initially topic and trend detection was primarily performed through qualitative, top-down style approaches, that rely on expert knowledge. More recently, data-driven, bottom-up approaches have been proposed which can offer a quantitative analysis of the research field’s evolution. In this paper, we aim to provide a broader and more complete picture of Semantic Web topics and trends by adopting a mixed methods methodology, which allows a combined use of both qualitative and quantitative approaches. Concretely, we build on a qualitative analysis of the main seminal papers, which have adopted a top-down approach, and on quantitative results derived with three bottom-up data-driven approaches (Rexplore, Saffron, PoolParty) on a corpus of Semantic Web papers published in the last decade. In this process, we both use the latter for “fact-checking” on the former and also derive key findings in relation to the strengths and weaknesses of top-down and bottom-up approaches to research topic identification. Overall, we provide a reflectional study on the past decade of Semantic Web research, however the findings and the methodology are relevant not only for our community but beyond the area of the Semantic Web to other research fields as well. Keywords: Research Topics, Research Trends, Linked Data, Semantic Web, Scientometrics 1. Introduction The term scientometrics is an all encompassing term used for an emerging field of research that analyses and measures science, technology research and innovation [21]. Although the term scientometrics is a broad term, in this paper, we focus on one particular sub field of scientometrics that uses topic analysis to identify trends in a scientific domain over time [18]. Understanding topics and subsequently predicting trends in research domains are important tasks for re- searchers and represent vital functions in the life of a research community. Overviews of present and past topics and trends provide important lessons of how the research interests evolve and allow the research com- munity to better plan its future work. While, visions of future topics can inspire and channel the work of a research community. As with other research domains, topics and trends analysis is vital for the Semantic Web research community as it helps to identify both under- represented and emerging research topics. 0000-0000/18/$00.00 c 2018 – IOS Press and the authors. All rights reserved

Upload: others

Post on 25-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

Undefined 0 (2018) 1–0 1IOS Press

A decade of Semantic Web research throughthe lenses of a mixed methods approachEditor(s): Name Surname, University, CountrySolicited review(s): Name Surname, University, CountryOpen review(s): Name Surname, University, Country

Sabrina Kirrane a, Marta Sabou b, Javier D. Fernández a, Francesco Osborne c, Cécile Robin d,Paul Buitelaar d, Enrico Motta c, and Axel Polleres a

a Vienna University of Economics and Business, AustriaE-mail: [email protected] Vienna University of Technology, AustriaE-mail:[email protected] Knowledge Media institute (KMi), The Open University, UKE-mail:[email protected] Insight Centre for Data Analytics, National University of Ireland, Galway, IrelandE-mail: [email protected]

Abstract. The identification of research topics and trends is an important scientometric activity. In the Semantic Web area,initially topic and trend detection was primarily performed through qualitative, top-down style approaches, that rely on expertknowledge. More recently, data-driven, bottom-up approaches have been proposed which can offer a quantitative analysis of theresearch field’s evolution. In this paper, we aim to provide a broader and more complete picture of Semantic Web topics andtrends by adopting a mixed methods methodology, which allows a combined use of both qualitative and quantitative approaches.Concretely, we build on a qualitative analysis of the main seminal papers, which have adopted a top-down approach, and onquantitative results derived with three bottom-up data-driven approaches (Rexplore, Saffron, PoolParty) on a corpus of SemanticWeb papers published in the last decade. In this process, we both use the latter for “fact-checking” on the former and also derivekey findings in relation to the strengths and weaknesses of top-down and bottom-up approaches to research topic identification.Overall, we provide a reflectional study on the past decade of Semantic Web research, however the findings and the methodologyare relevant not only for our community but beyond the area of the Semantic Web to other research fields as well.

Keywords: Research Topics, Research Trends, Linked Data, Semantic Web, Scientometrics

1. Introduction

The term scientometrics is an all encompassing termused for an emerging field of research that analyses andmeasures science, technology research and innovation[21]. Although the term scientometrics is a broad term,in this paper, we focus on one particular sub fieldof scientometrics that uses topic analysis to identifytrends in a scientific domain over time [18].

Understanding topics and subsequently predictingtrends in research domains are important tasks for re-

searchers and represent vital functions in the life ofa research community. Overviews of present and pasttopics and trends provide important lessons of how theresearch interests evolve and allow the research com-munity to better plan its future work. While, visionsof future topics can inspire and channel the work of aresearch community. As with other research domains,topics and trends analysis is vital for the Semantic Webresearch community as it helps to identify both under-represented and emerging research topics.

0000-0000/18/$00.00 c© 2018 – IOS Press and the authors. All rights reserved

Page 2: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

Semantic Web technologies have been an area ofintense research in the academic community for al-most two decades. During this timeframe, several pa-pers have been published by researchers from the Se-mantic Web community that endeavor to predict Se-mantic Web research topics and trends [2,3], or as theresearch advanced over the years, to analyze these top-ics and trends [16,19]. In parallel, several researchersfrom the Semantic Web community [8,22,26,29,30,33]have been actively working on tools and techniquesthat can be used to automatically uncover research top-ics and trends, from scientific publications.

Most of the trend prediction/analysis papers in theSemantic Web area [2,3,16,19] adopt a top-down ap-proach that primarily relies on the knowledge, intu-ition and insights of experts in the field. While un-doubtedly these are very valuable assets, trend-papersthat purely follow this approach risk focusing on majortrends alone while overlooking under-represented oremerging trends. Also, we suspect that a detailed anal-ysis of the scientific publications in the field over timemight reveal the degree to which the expert predictionshave come true, or even falsify their predictions, andat the least allow us to quantifiably assess them. Theseshortcomings could be well-addressed by (semi-) auto-matic, data-driven approaches, which identify researchtrends in a bottom-up fashion from large corpora.

The primary goal of this paper is to provide abroader and more complete picture of Semantic Webtopics and trends in the last decade by relying on bothtop-down and bottom-up approaches. A secondarygoal is to better understand the strengths and weak-nesses of these two families of approaches in termsof topic identification (i.e., expert-based versus data-driven). To this end, we adopt a mixed methods re-search methodology [25], which involves the combina-tion of quantitative and qualitative research methods,in order to gain better insights with respect to researchconducted within the Semantic Web community.

Concretely, our approach has three main compo-nents. Firstly, in a qualitative study we converge thefindings of four top-down style seminal papers [2,3,16,19] at different points in time, into a unified Re-search Landscape. Secondly, we employ three diversedata-driven quantitative approaches to uncover Seman-tic Web topics and trends from a corpus of research pa-pers in a bottom-up fashion. The corpus analyzed withthese tools covers the academic literature that emergedfrom five popular international publishing venues forSemantic Web researchers: the International SemanticWeb Conference (ISWC), the Extended Semantic Web

Conference (ESWC), the SEMANTiCS conferences,the Semantic Web Journal (SWJ) and the Journal ofWeb Semantics (JWS), over a 10 year period from2006 to 2015 inclusive. Each of the the data-drivenapproaches employ different techniques for topic ex-traction and analysis, for instance a handcrafted taxon-omy in PoolParty1 and an automated taxonomy in Rex-plore2 and Saffron3. Thirdly, we compare and contrastthe topics derived from both the quantitative and qual-itative approaches, in order to provide: (i) a broaderpicture of Semantic Web topics and, in that process (ii)a better understanding of strengths and weaknesses ofthe various approaches involved.

Based on our analysis we were able to classify re-search topics into three different groups that indicate:(i) topics that are deemed important by experts and fre-quently occur in papers published in popular SemanticWeb conferences and journals; (ii) topics that expertsconsider important, however there was not sufficientevidence in the papers to confirm this; and (iii) topicsthat only some experts highlight as important, howeverthey were strongly represented in the research papers.However, although it was relatively easy to align thetopics in the seminar papers to topics identified usingthe data drive approaches, the trend analysis was notso straightforward. In essence, we discovered that nei-ther trends derived from broad foundational topics norspecific multi-word topics provide enough evidence toconfirm expert visions outlined in the seminal papers.

The remainder of the paper is structured as follows:Section 2 provides an overview of existing work onautomatic topic and trend analysis in the SemanticWeb community. Section 3 describes the overarchingmethodology that guided our analysis. Section 4 pro-vides a snapshot of the Semantic Web research com-munity based on the observations of several Seman-tic Web researches [2,3,16,19]. This is followed by thepresentation of the topic analysis of papers publishedin the core Semantic Web venues over a 10 year pe-riod from 2006 to 2015 with PoolParty, Rexplore, andSaffron in Sections 5-7. A discussion on the findingsof our analysis is presented in Section 8. Finally, Sec-tion 9 concludes the paper and presents directions forfuture work.

Page 3: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 3

/

Fig. 1. Overview of the mixed methods-based methodology.

2. Related Work

Detecting topics from a collection of documents isan important task that has attracted considerable atten-tion in recent years leading to a variety of relevant ap-proaches from different media sources, such as newsarticles [13], social networks [10], blogs [28], emails[27], to name but a few. In this section, we discussapproaches for the more focused task of detecting re-search topics from scientific literature, and in particu-lar methods employed by the Semantic Web commu-nity.

A classical way to model the topics of a documentis to extract a list of significant terms [9] (e.g., usingtf-idf) and cluster them [34]. Another common solu-tion is the adoption of probabilistic topic models, suchas Latent Dirichlet Analysis (LDA) [4] or Probabilis-tic Latent Semantic Analysis (pLSA) [20]. However,these generic approaches suffer from a number of lim-itations that often hinder their application for the taskof detecting scientific topics. Firstly, they produce un-labeled bags of words that are often difficult to asso-ciate with distinct research areas. Secondly, the num-ber of topics to be extracted needs to be known a pri-ori. Finally, using such methods it is not possible todistinguish research areas from other kinds of topicscontained in a document.

Therefore, several approaches were proposed tospecifically address the problem of detecting researchtopics. For instance, Morinaga et al [27] present amethod that exploits a Finite Mixture Model to de-

1PoolParty, https://www.poolparty.biz/system-architecture/2Rexplore, https://technologies.kmi.open.ac.uk/Rexplore/3Saffron, http://saffron.insight-centre.org/

tect research topics and to track the emergence of newtopics. Derek et al [14] developed an approach thatmatches scientific articles with a manually curated tax-onomy of topics that is used to analyze topics acrossdifferent timescales. Chavalarias et al [11] propose atool known as CorText that can be used to extract alist of n-grams from scientific literature and to performclustering analysis in order to discover patterns in theevolution of scientific knowledge. Other approachesexploit the citation graph. For example, Chen et al [12]designed a tool called CiteSpace which combines co-citation analysis and burst detection [24] to identifynew emerging trends. While, Jo et al [23] detect topicsby combining distributions of terms with the citationgraph related to publications containing these terms.

Public tools for the exploration of research datausually identify research areas by using keywords asproxies (e.g., DBLP++ [15], Scival4), adopting prob-abilistic topic models (e.g., aMiner [35]) or exploit-ing handcrafted classifications (ACM5, Microsoft Aca-demic Search6). However, these solutions suffer fromsome limitations. For example, keywords are unstruc-tured and usually noisy, since they include terms thatare not research topics. In addition, the quality of key-words assigned to a paper varies a lot according to theauthors and the venues. Probabilistic topic models pro-duce bags of words that are often not easy to map tocommonly known research areas within the commu-nity. Finally, handcrafted classifications are expensiveto build, requiring multiple expertise, and tend to age

4http://www.elsevier.com/solutions/scival5https://www.acm.org/publications/class-20126http://academic.research.microsoft.com/

Page 4: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

4 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

very quickly, especially in a rapidly evolving field suchas Computer Science.

The Semantic Web Community has also produced anumber of tools and techniques that use semantic tech-nologies for detecting and analyzing research topics.For instance, Bordea and Buitelaar [8] focus on topicextraction and modeling, demonstrating how domainterminology and document keywords can be clusteredin order to form semantic concepts that can be used forexpert finding. In a related work, Monaghan et al. [26]present their expertise finding platform Saffron anddemonstrate how it can be used to link expertise topics,researchers and publications, based on their analysis ofthe Semantic Web Dog Food (SWDF) corpus. The datais further enhanced with URIs and expertise topic de-scriptions from DBpedia and related information fromthe Linked Open Data (LOD) cloud. An alternative ap-proach is adopted by the Rexplore system [30], an en-vironment for exploring and making sense of schol-arly data that integrates statistical analysis, semantictechnologies, and visual analytics. Rexplore builds onKlink-2 [29], an algorithm which combines semantictechnologies, machine learning and knowledge fromexternal sources (e.g., the LOD cloud, web pages, callsfor papers) to automatically generate large-scale on-tologies of research areas. The resulting ontology isused to semantically enhance a variety of data miningand information extraction techniques, and to improvesearch and visual analytics. Hu et al. [22] demonstratehow Semantic Web technologies can be used in orderto support scientometrics over articles and data submit-ted to the Semantic Web Journal as part of their openreview process. Towards this end the authors provideexternal access to their semantified dataset, which isalso linked to external datasets such as DBpedia andthe Semantic Web Dog Food corpus. On top of thisdata they provide several interactive visualizations thatcan be used to explore the data, ranging from generalstatistics to depicting collaborative networks. While,Parinov and Kogalovsky [33] describe the Socionet re-search information system that focuses on linking re-search objects in general and research outputs in par-ticular. The authors argue that information inferredfrom the semantic linkage of research objects and ac-tors can be used to derive new scientometric metrics.

Although data-driven approaches have been evalu-ated on their own, to date there is a lack of worksthat compare and contrast existing approaches, or in-deed evaluate them with respect to expert-driven ap-proaches. This paper fills this gap by adopting a holis-tic approach to topic and tend analysis, by comparing

and contrasting the results of three data-driven and fourexpert-based topic-detection approaches in the contextof Semantic Web research.

3. Methodology

The primary objective of this paper is to gain a bet-ter understanding of the topics and the trends in theSemantic Web community over a ten year period from2006-2015. Towards this end, we use a mixed meth-ods approach to topic analysis. The thesis being that acombination of different research methods will allowus to gain a more comprehensive view of the topics andtrends within the community. In this section, we de-scribe our overall mixed methods methodology (Sec-tion 3.1) and then explain its individual quantitativeand qualitative stages in the rest of the subsections.

3.1. A mixed methods approach to topic analysis

According to Leech and Onwuegbuzie [25], themixed methods research methodology involves thecombination of quantitative and qualitative researchmethods in order to gain knowledge about some phe-nomenon under investigation. In our work we adoptedthe mixed methods approach that is illustrated in Fig-ure 1 and described next.

Qualitative research was employed to manually ex-tract key research topics from four seminal papers thatdiscuss the past, present and future of the Seman-tic Web technology [2,3,16,19] (Section 3.2 and Sec-tion 4). The extraction of keywords was performedindividually by three authors of this paper (Step 1)and their findings were converged during a consensusworkshop (Step 2). The result of this workshop is ahigh level overview of the Semantic Web research upto 2016, which we call the Research Landscape (cf.Table 2).

Quantitative research consisted of the use of threedifferent topic analysis tools to automatically extractand rank topics in order of importance, from our Se-mantic Web papers corpus from 2006 to 2015 inclu-sive (Section 3.3 and Sections 5-7). The quantitativeanalysis was performed with three different tools (i.e.,PoolParty, Rexplore, and Saffron) that enable users togain insights on the various research topics that appearin research papers published in popular Semantic Webpublishing outlets.

Finally, we combine the results obtained both withqualitative and quantitative methods in order to bet-

Page 5: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 5

/

Fig. 2. Conceptual overview of topics detection approaches: main steps and data sources

ter understand the topics that are frequently discussedwithin the community (Section 3.4). The alignment ofthe Research Landscape topics to those extracted byquantitative tools was performed individually by threeauthors of this paper (Step 4) and then their findingswere cross-correlated (cf. Tables 3- 6) to derive themain findings of this paper (Step 5, Sections 8 and 9).

3.2. Qualitative study of seminal papers

In terms of qualitative analysis, the following ap-proach was used to identify the topics mentioned inthe seminal papers [2,3,16,19]. First (Step 1 in Fig-ure 1), each paper was read by three of this paper’s au-thors who identified the main technical keywords andtopic descriptions mentioned in the paper (e.g., ontol-ogy, OWL). To keep the analysis as objective as pos-sible, the authors extracted the exact wording used inthe papers instead of using synonyms more familiar tothem. Second, the extracted keywords were groupedinto broader topic areas by each author (e.g., knowl-edge structures and modeling). Third, the results of theseparate analysis were discussed and aligned during aconsensus workshop (Step 2 in Figure 1). The primaryoutcome of the qualitative analysis was the develop-ment of a unified Research Landscape (shown in Ta-ble 2) based on the alignment of the topics mentionedin the four seminal papers.

3.3. Quantitative analysis of research papers

For the quantitative analysis, rather than using asingle tool we leveraged three different tools (i.e.,PoolParty, Rexplore, and Saffron) that employ dif-ferent approaches to topic extraction (Step 3 in Fig-ure 1). Before describing the different quantitative ap-proaches employed by the aforementioned tools, wefirst describe the conceptual steps that are typically

followed when it comes to data-driven topic analy-sis. The workflow for topic extraction and analysis (asdepicted in Figure 2) typically follows three sequen-tial steps, namely taxonomy creation, corpusannotation, and analytics.

Taxonomy creation involves the creation of a topictaxonomy that guides the analysis process. Inpractice, this step can be achieved manually bydomain experts, or automatically with the taxon-omy being learned either from the documentcorpus of interest or from a larger externaldocument corpus.

Corpus Annotation concerns the annotation of thedocument corpus in terms of the taxonomytopics. Different annotation approaches rangefrom manually assigning each paper in a corpusto the most representative topics, annotating thedocument abstracts with the relevant topics, or an-notating the entire text of the paper based on atopic list or hierarchy.

Analytics refers to various analytical activitiesthat can be conducted over the annotateddocument corpus. For instance, trend detec-tion analytics, expert profiling and recommenda-tions.

A tabular overview of the approaches adopted byPoolParty, Rexplore, and Saffron with respect to thesemain steps is presented in Table 1, while a highleveloverview of each of the tools is presented below. Thecorpus analyzed with these tools covers the academicliterature that emerged from five popular internationalpublishing venues for Semantic Web researchers: theInternational Semantic Web Conference (ISWC), theExtended Semantic Web Conference (ESWC), theSEMANTiCS conferences, the Semantic Web Journal(SWJ) and the Journal of Web Semantics (JWS), over

Page 6: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

6 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

Table 1Comparison of the methods and data sets used by various topic and trend analysis tools.

Tool Taxonomy Creation Topic Taxonomy Document Corpus Corpus Annotation Topic Analysis Other AnalyticsPoolParty Manual Fairly broad/deep SW Venues

2006-2015Automatic(full-text)

Topic frequency intext

Taxonomyextension

Rexplore Automaticfrom broaderexternal corpus

17K topics in CS,96 topic in SW,9 levels deep

(MV) SW Venues2006-2015(FSW) Scopus2006-2015

Automatic(abstracts, titles,keywords)

Number of papersand citationsassociated with atopic

Taxonomy learning,expert profiling

Saffron Automatic fromdocument corpusthrough clusteringof co-occurrences

Flat list of terms SW Venues2006-2015

Automatic(full-text)

Topic frequency intext

Taxonomy learning,expert profiling

a 10 year period from 2006 to 2015 inclusive. In thecase of ISWC, ESWC and the JWS the papers spannedthe entire timeframe, however it is worth noting thatthe SEMANTICs and SWJ papers were only availablefrom 2009 and 2010 respectively.

PoolParty is a semantic technology suite that supportsthe analysis of documents guided by a taxonomyof the domain of interest. In the case of the analy-sis described in this paper the taxonomy was man-ually created from conference and journal meta-data (i.e., call for papers, sessions, tracks, spe-cial issues etc.). Our hypothesis was that a handcrafted topic hierarchy based on the informationextracted from said sources should broadly speak-ing reflect the topics that the community are in-terested in and also what existing researchers areactively working on. The PoolParty analysis wasconducted over the full text of the research ar-ticles, from ISWC, ESWC, Semantics, the JWSand the SWJ. Although the PoolParty suite is ca-pable of extracting topics automatically, follow-ing our purely manual approach we simply usedthe tool to annotate the documents based on themanually generated taxonomy.

Rexplore is an interactive environment for explor-ing scholarly data that leverages data mining, se-mantic technologies and visual analytics tech-niques [30]. In the context of this paper, weused Rexplore technologies for tagging researchpapers in two datasets with relevant researchtopics from the Computer Science Ontology(CSO), an automatically generated ontology ofresearch areas, and produce a number of analyt-ics. The approach for tagging the publicationstook into consideration their title, keywords andabstract and is a slight variation of the methodadopted by Springer Nature for characterizingsemi-automatically their Computer Science pro-ceedings [31].

Saffron is a term and taxonomy extraction tool, whichis primarily used for expert finding [5]. Saffronfacilitates the extraction of knowledge from textin a fully automatic manner, by leveraging bothtext mining and Linked Data principles. Its algo-rithms include: key-phrase extraction, entity link-ing, taxonomy extraction, expertise mining, anddata visualization. The Saffron analysis, whichwas conducted over the full text of the papers(as per the PoolParty analysis) is based on ataxonomy that is automatically generated duringthe analysis of the Semantic Web corpora (com-posed of papers from ISWC, ESWC, Semantics,the JWS and the SWJ) that was specifically con-structed for our analysis.

3.4. Cross correlation of results

The final stage of our analysis involved the align-ment of the topics identified by Rexplore, Pool-Party and Saffron with the Research Landscape topicsemerging from the analysis of the seminal papers. Theoutput of each of the three data-driven approaches wasmapped by one of the authors to the topics of the Re-search Landscape (Step 4 in Figure 1). The principlesused to guide the mapping process, which involved acombination of syntactic and semantic matching, canbe summarized as follows:

Exact syntactic match: is the most straightforwardcase as topics that have exactly the same label(e.g., Linked Data) are already aligned.

Partial syntactic match: refers to cases where twotopics have similar but not exactly matching la-bels, however clearly refer to the same body ofresearch. For instance, Description Logics is asubtopic of Logic and Reasoning.

Semantic match: denotes topics that have syntacti-cally completely disjoint labels but they are se-mantically related. Links between syntactically

Page 7: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 7

different labels are often recorded in our extendedResearch Landscape document, where severalkeywords were assigned to a larger overlappingtopic. For example, we assigned keywords suchas SPARQL to the Query Languages topic.

No match: is used to represent topics identified by thedata-driven approaches that are completely newand cannot be related to any of the topics of theResearch Landscape.

Individual topic alignments were cross-checked bythe two additional authors to reduce any bias and fur-ther discussed during an analysis and cross-correlationworkshop (Step 5 in Figure 1). The results of thisworkshop are depicted in Tables 3- 6.

4. Qualitative Study: Research TopicIdentification from Seminal Papers

In the Semantic Web area, a handful of well-knownpapers identify research topics and discuss trendswithin the community [2,3,16,19]. Some of these pa-pers predict future topics [2,3], while others reflecton research topics in the past years or in the present[3,16,19]. In this section, we briefly introduce the sem-inal papers before presenting the topics mentioned inthese papers in a format that allows them to be usedas a basis for comparison with topics identified via thedata-driven methods.

4.1. The seminal papers

At the turn of the millennium (2001), Berners-Leeet al. [2] coined the term "Semantic Web" and set a re-search agenda for a young and multi-disciplinary re-search field around a handful of topics. Six years later,Feigenbaum et al. [16] analyzed the uptake of Seman-tic Web technologies in various domains as of 2007. Indoing so, they provided a picture of the technologiesavailable at that time as well as the main challengesthat these technologies could solve. The authors tooka reflective rather than predictive stance in their work.Later in 2016, two important papers were published by[3,19], coinciding with the 15-year anniversary of theSemantic Web community. Bernstein et al. [3] providetheir vision of Semantic Web research beyond 2016by grounding their predictions in an overview of pastand present research. Therefore, their paper is both re-flective of past/present work and predictive in termsof future research. Glimm and Stuckenschmidt [19] in

turn, look back at the last 15 years of Semantic Webresearch through the lens of papers published at ISWCconferences from 2002 to 2014. The authors adopt anempirical approach to better understand the topics andtrends within the Semantic Web community, in whichthey identify 12 key topics that describe Semantic Webresearch and then manually classify papers publishedin ISWC conference proceedings according to thesetopics. This work can also be categorized as a data-centric analysis of research topics and trends, whichwas performed completely manually.

Each of the vision papers mentioned above are pri-marily based on the expert knowledge of the authorsand reflect their views, without aiming to be complete.Our objective is to use the topics identified in theseseminal papers as a baseline for a comparison with theoutput of the three quantitative, data-centric topic iden-tification methods discussed in this paper. Note that,unlike in information retrieval research, the proposedResearch Landscape (cf. Table 2) is by no means an ab-solute gold-standard that should be achieved, but ratheracts as an intuitive comparison basis for understandingthe strengths and weaknesses of expert-driven versusdata-driven topic identification methods.

4.2. Core topics from the seminal papers

After manually annotating research topics discussedin each of the seminal papers, we aligned the identi-fied topics across papers, and observed eleven core re-search topics that are mentioned by three or four of theseminal papers (cf. Table 2). All four papers agree onthe following eight core research topics:

Knowledge representation languages and stan-dards, such as XML, RDF and a so-called Seman-tic Web language, were considered crucial to enablingthe vision of intelligent software agents by Berners-Lee et al. [2]. Work on the development of web-basedknowledge representation languages (now also includ-ing OWL) continued over the next 7 years [16]. By2016 this was seen as a core line of research extendingalso to the standardization of representation languagesfor services [3,19]. As for the future, Bernstein et al.[3] predict that knowledge representation research willfocus on representing lightweight semantics, dealingwith diverse knowledge representation formats and de-veloping knowledge languages and architectures for anincreasingly mobile and app-based Web.

Knowledge structures and modeling. Berners-Leeet al. [2] consider knowledge structures such as ontolo-gies, taxonomies and vocabularies as essential compo-

Page 8: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

8 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

Table 2Research Landscape: Core and Marginal topics discussed in the seminal papers. Topics in () were only intuitively mentioned.

Berners-Lee et al. [2]Future

Feigenbaum et al. [16]Past (2000-2007)

Glimm and Stuckenschmidt [19]Past (2000-2016)

Bernstein et al. [3]Past (2000-2016)

Bernstein et al. [3]Future from 2016

knowledge representationlanguages and standards

knowledge representationlanguages and standards

knowledge representationlanguages and standards

knowledge representationlanguages and standards

representing lightweightsemantics

ontologies and modeling,taxonomies, vocabularies

ontologies and modeling,taxonomies, vocabularies

ontologies and modeling,knowledge graphs

ontologies and modeling,(PR) knowledge graphs

-

logic and reasonings logic and reasonings logic and reasonings logic and reasoning -search and questionanswering

(ranking) search, retrieval and ranking (PR) question answeringsystems

-

(data integration) (ontology matching) data integration & matching andintegration

(PR) needs-based,lightweight data integration

integration of heterogeneousdata

proof & trust privacy, trust,access control

security, trust, provenance personal information,privacy

trust & data provenance(representation, assessment)

databases semantic web databases - database managementsystems

-

decentralization (decentralization) distributed data storage andfederated query processing

vastly distributedheterogeneous data

(decentralization)

- query language (SPARQL) query processing (SPARQL,federated query processing)

developing efficient querymechanisms

-

- (linked data, DBpedia) linked data (PR) linked data(open government data),(social data)

-

Cor

eto

pics

(machine learning,prediction, analysis,automatic report)

knowledge extraction anddiscovery

automatic knowledgeacquisition

latent semantics,knowledge acquisition,ontology learning

-

intelligent software agents - - multilingual intelligentagents

-

(semantic web services) - semantic web services - -(Internet of Things) - - - high volume and velocity of

data, e.g., streaming &sensor data

- visualization user interfaces and annotation - -- (scalability, efficiency, ro-

bust semantic approaches)- - scale changes drastically

- change management andpropagation

- - -

- (social semantic web,FOAF)

- -

Mar

gina

ltop

ics

- - - - data quality, e.g.,representation, assessment

nents of the Semantic Web. Follow up papers confirmactive research on the creation of ontologies [3,16]entailing research topics such as modeling patterns,large-scale modeling efforts, and knowledge acquisi-tion [19]. Bernstein et al. [3] introduce knowledgegraphs as novel knowledge representation structures.Based on their trend analysis in terms of number ofpublications in each topic published at ISWC confer-ences, Glimm and Stuckenschmidt [19] conclude thatresearch on both language standards and ontologieshas diminished in impact over time.

Logic and Reasoning. Berners-Lee et al. [2] con-jured that inference rules and expressive rule lan-guages will enable logic-based automated reasoningon the Semantic Web. Their prediction was abun-dantly confirmed in follow-up papers: Feigenbaumet al. [16] reporting work on the development of in-ference engines for reasoning by 2007; Glimm and

Stuckenschmidt [19] discussing several sub-topics inthis area such as scalable and efficient reasoning, non-standard reasoning or the combination of logics withnon-logical reasoning paradigms; and Bernstein et al.[3] confirming work on developing tractable and effi-cient reasoning mechanisms. The quantitative analysisof ISWC papers reported in [19] suggests that the num-ber of papers focusing on reasoning was relatively sta-ble from 2002 to 2012, however experienced a slightdecline between 2012 and 2014.

Search, retrieval, ranking, question answering. Be-sides intelligent agents, Berners-Lee et al. [2] pre-dicted that search and question answering programswould also benefit from the Semantic Web. In2007 Feigenbaum et al. [16] indirectly refer to thistopic in the context of ranking, however this researchtopic becomes increasingly important according to pa-pers published in 2016: Glimm and Stuckenschmidt

Page 9: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 9

[19] identify sub-topics such as search algorithms, do-main specific search engines and natural language ac-cess to linked data. While, Bernstein et al. [3] describework on question answering systems based on seman-tic markup and linked data from the Web (e.g., IBMWatson).

Matching and Data Integration. Ontology match-ing and data integration were already intuitively men-tioned, but not concretely named, by Berners-Leeet al. [2]. Data integration played an important rolein many commercial applications developed up un-til 2007 and opened up the need for change manage-ment and change propagation across integrated datasets [16]. By 2016, the community explored sub-topics such as data and knowledge integration, ontol-ogy matching (schema matching) and entity matching(record linkage) [19] with a new trend towards needs-based, lightweight data integration [3]. Work on thistopic regularly appeared at ISWC conferences during2002-2014 [19]. For the future, Bernstein et al. [3] dis-cuss the need to integrate heterogeneous data as part ofthe broader topic of data management.

Privacy, Trust, Security, Provenance. Berners-Leeet al. [2] envision proofs and digital signatures as keyaspects of the Semantic Web in order to enable moretrustworthy data exchange. While the topic of privacywas only vaguely mentioned in 2007 [16], it becamewell established by 2016 and covered topics such assecurity, trust and provenance [19]. According to Bern-stein et al. [3] future work should focus on the repre-sentation and assessment of provenance information,as part of the broader topic of data management.

Semantic Web Databases. Similarly to Berners-Lee et al. [2], Feigenbaum et al. [16] discuss re-search topics around the development of SemanticWeb tools as instrumental for commercial uptake, es-pecially ontology editors (e.g., Protégé) and SemanticWeb databases (e.g., triple stores). According to Bern-stein et al. [3] many of these tools evolved into com-mercial tools by 2016.

Distribution, decentralization, federation. Berners-Lee et al. [2] envisioned that the Semantic Web wouldbe as decentralized as possible, bringing new inter-esting possibilities at the cost of losing consistency.Feigenbaum et al. [16] exemplified one of these novelscenarios by mentioning FOAF as an example of a de-centralized social-networking system. In turn, Glimmand Stuckenschmidt [19] identified distributed datastorage and federated query processing as a primarygoal of Linked Data. Finally, Bernstein et al. [3] com-mented on this topic briefly, confirming that mod-

ern semantic approaches already integrate distributedsources in a lightweight fashion, even if the ontologiesare contradictory.

Besides the aforementioned core topics, three im-portant topics were not predicted by Berners-Lee et al.[2], but were mentioned by the other three papers.These are:

Query Languages and Mechanisms. By 2007, re-search also focused on the development of query lan-guages, most notably SPARQL [16] and its evolu-tion towards topics such as federated query processingtechniques [19] and developing efficient query mecha-nisms [3]. Work on query processing has been increas-ingly published at ISWC over time [19].

Linked Data. By mentioning DBpedia, Feigenbaumet al. [16] intuitively pointed to the future researchtopic of Linked Data. This topic became well estab-lished by 2016, with key sub-topics including publish-ing Linked Data (tools and guidelines) as well as ac-counts of concrete data publication projects [19]. Anew wave of structured data available on the web (e.g.,open government data, social data) further extendedresearch on the Linked (open) Data topic [3].

Knowledge extraction, discovery and acquisition. In2007, Feigenbaum et al. [16] hint at this topic withterms such as machine learning, prediction and anal-ysis. By 2016, knowledge extraction and discoveryemerged as a field of its own focusing on knowl-edge acquisition, information extraction from text andgeneral purpose knowledge bases and exploring tech-niques from data mining and machine learning amongothers [19]. Automatic knowledge acquisition wasboosted by more powerful statistical and machinelearning approaches as well as improved computa-tional resources [3]. For the future, Bernstein et al. [3]identify a need for new techniques to extract latent,evidence-based models (ontology learning), to approx-imate correctness and to reason over automatically ex-tracted ontologies/knowledge structures. An increas-ing importance is given to using crowdsourcing forcapturing collective wisdom and complementing tradi-tional knowledge extraction techniques.

4.3. Marginal topics from the seminal papers

Our analysis also identified several marginal topics,mentioned by individual papers, or by a maximum oftwo of the seminal papers. These topics, which are alsopresented in Table 2 are discussed here:

Intelligent software agents. The underpinning themeof Berners-Lee et al. [2]’s vision paper was intelligent

Page 10: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

10 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

software agents that would provide advanced function-ality to users by being able to access the meaning ofSemantic Web data. Interestingly, this topic has notbeen mentioned until recently, when Bernstein et al.[3] discuss work on training conversational intelligentagents based on multilingual textual data on the web.

Semantic Web Services. Berners-Lee et al. [2] alsoenvisioned the applicability of Semantic Web tech-nologies for advertising and discovering web-services.This intuition was the precursor of the SemanticWeb Services research field established a few yearslater, which focused on semantic service description,search, matchmaking, automatic composition and exe-cution [19].

Internet Of Things. The application of SemanticWeb to physical objects within the context of the futureInternet Of Things (IoT) was intuitively mentioned byBerners-Lee et al. [2]. This topic was not mentionedby any of the follow-up papers, even thought it is con-sidered to play an important role in the future. Indeed,Bernstein et al. [3] predict that dealing with high vol-ume and velocity data will be necessary due to the in-creased number of streaming data sources from sen-sors and the IoT. They envision techniques for the se-lection of streaming data (data triage), for decision-making on streaming sensor data as well as the inte-gration of streaming sensor data with high quality se-mantic data.

Human-Computer Interaction. Feigenbaum et al.[16] mention visualization as features of user-centricapplications, while Glimm and Stuckenschmidt [19]identify an emerging user interfaces and annotationtopic, which investigates topics such as humans in theloop, making use of human input, user interfaces forthe Semantic Web, and involving users in annotationtasks.

Scalability, efficiency and robustness. Feigenbaumet al. [16] position scalability, efficiency and robustsemantic approaches as key factors needed to ad-dress semantic web challenges, in particular integra-tion, knowledge management and decision support. Inturn, Bernstein et al. [3] recognize that new research isneeded given that the scale changes drastically.

Change management and propagation. Feigenbaumet al. [16] mention or hint that change managementand change propagation across integrated data sets isneeded to accompany data integration research.

Social semantic web. Although predicting futuretrends was not their explicit goal, by mentioning FOAFFeigenbaum et al. [16] intuitively pointed to the futureresearch topic on the Social Semantic Web.

Data quality Under the heading of data manage-ment, Bernstein et al. [3] group work on data integra-tion, data provenance and new technologies that shouldallow representing and assessing data quality, such astask-focused quality evaluation (e.g., is a resource ofsufficient quality for a task?).

4.4. Trends

Although the seminal paper focus primarily of re-search topic identification, they also offer some hintson the way these topics evolve over time (i.e., trends).We discuss this aspect of the seminal papers in thissection.

In 2001, Berners-Lee et al. [2], used a fictitious sce-nario to describe the authors’ vision of a web of datathat can be exploited by intelligent software agents thatcarry out data centric tasks on behalf of humans. Ad-ditionally the paper identifies the infrastructure neces-sary to realize this vision focusing on four broad ar-eas of research, namely: expressing meaning, knowl-edge representation, ontologies and intelligent soft-ware agents.

In 2007, Feigenbaum et al. [16] reflected on theideas presented in [2] and highlighted that although theoriginal autonomous agent vision was far from beingrealised, the technologies themselves were proving tobe highly effective in terms of tacking data integra-tion challenges in enterprises especially in the life sci-ences and health care domains. Furthermore, the au-thors highlighted that consumers were starting to adoptFOAF profiles and to embrace decentralized social-networking. However, they also point to new privacyconcerns brought about by the ability to link disparatedata sources.

In 2016, Glimm and Stuckenschmidt [19] observeda gradual decline in research on knowledge represen-tation standards and languages, knowledge structuresand modeling, and Semantic Web services. Newer top-ics, such as search, retrieval and ranking, privacy,trust, security, provenance or user interfaces and an-notation are identified as less prominent but never-theless maintain a constant presence over the years.Whereas, they highlight that query languages andmechanisms, Linked Data, and knowledge extraction,discovery and acquisition have gained in importanceover the years.

Also in 2016, discussing present research topics,Bernstein et al. [3] noted a large spectrum between twoopposite research lines on expressivity and reasoningon the Web on the one hand and ecosystems of Linked

Page 11: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 11

Data on the other. Particularly notable is the adoptionof Semantic Web technologies in several large, moreapplied systems centered around knowledge graphs,which use Semantic Web representations yet ensurethe functionality of applied systems which resulted inless formal and precise representations than expectedat the earlier stages of Semantic Web research. Basedon these considerations, the authors predict movingfrom logic-based to evidence-based approaches in aneffort to build truly intelligent applications using vast,heterogeneous, multi-lingual data.

In the next sections we continue with describing theresults of topic and trend analysis by employing data-driven tools.

5. PoolParty Quantitative Analysis

Although PoolParty is a semantic technology suiteprimarily targeted toward knowledge organization andmanagement in a business context, in this paper we usePoolParty to manually construct a taxonomy of topicsthat are of interest to the Semantic Web community,and subsequently use this taxonomy to analyze the top-ics that are mentioned in papers that are published inprominent Semantic Web venues.

Taxonomy Creation. Even though there are severalNatural Language Processing (NLP) tools that can beused to automatically extract topics from text, consid-ering that accuracy was a key factor for this study, weelected to manually generate and curate a list of topics,based on the venue metadata that are relevant for thecommunity. First, we manually extracted the metadata(i.e., call for paper line items and the track, session andinvited talk titles, etc..) from the ISWC, ESWC and theSEMANTiCS conference series and the SWJ and theJWS journal websites, from 2006 to 2015 inclusive.Where the metadata was not available we consulted theInternet Archive Wayback Machine. Once all of themetadata was gathered we created a dataset composedof the aforementioned venue metadata. We started bymerging the text from the titles of the invited talks, theline items from the call for papers, the various tracknames and the session titles, per year per venue. Wesubsequently went through each line item and notedall of the topics (words and phrases that are relevantfor the community) that appeared. It is worth notingthat more often than not there were multiple topics perline item. Following on from this we merged all of thetopics into a single topic list and removed the dupli-cates, resulting in a dictionary containing 3,421 unique

topics. In the spirit of open science all of the metadatagathered was fed into the Scholarly Data 7 initiative,and was used to strengthen the existing ontology 8.

In parallel, we derived a set of foundational tech-nologies (based on existing taxonomies and wellknown Semantic Web research sub-communities) thatreflect key research areas within the Semantic Webcommunity. The complete list of foundational tech-nologies can be seen in Figure 3. Finally, we man-ually created a coarse grained taxonomy from the3,421 unique dictionary topics, by assigning each topicto one or more foundational technologies worked onby the community. It is worth noting that althoughmany of the dictionary topics could be associatedwith foundational topics, we noticed that there werea significant number of topics that did not fit intothe technological foundations that we had identified.Generally speaking, topics could be grouped into thefollowing seven high level metadata categories: ap-plied (e.g. applications, tools, terms relating to us-age), business (e.g. profit, business value, impact), do-main (e.g. healthcare, media, education), evaluation/-metrics/studies (e.g. measure, ranking, negative re-sults), foundational technologies (e.g. knowledge rep-resentation, data management, security and privacy),methods (e.g. finding, planning, transformation), andstandards (e.g. SPARQL, standardization, W3C). It isworth noting that the results presented in this paper arebased solely on the topics that could be aligned withthe foundational technologies. A wider study on appli-cations and domains is left to future work.

Corpus Annotation. The full text of the papers wasextracted from conference proceedings and journalsand uploaded to the Semantic Web Company’s Pool-Party platform. We subsequently used PoolParty to an-notate the documents based on the topics appearingin the taxonomy and to count the occurrences of eachtopic. It is worth noting that, although the PoolPartysuite also includes automatic entity extraction tech-niques, these were not considered during our analysisas the objective was to use a purely hand crafted tax-onomy in order to extract key research topics. Addi-tionally, in order to get an indication of the coverage ofthe technical foundations, across the five venues underanalysis, we aggregated the number of occurrences foreach of the topics within a given foundation.

Analysis. The chart presented in Figure 3 providesdetails on the % coverage for each of the eighteen

7http://www.scholarlydata.org/#resources8http://www.scholarlydata.org/ontology/doc/

Page 12: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

12 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

Fig. 3. PoolParty: % coverage per foundational technology across the 5 venues for the 10-year timeframe

Fig. 4. PoolParty: Growth/Decline of foundational technologies across the 5 venues for the 10 year timeframe

Page 13: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 13

(a) top 10 multi-word topics (b) top 11-20 multi-word topics

Fig. 5. PoolParty: Growth/Decline of the (a) top 10 and (b) top 11-20 multi-word topics across the 5 venues for the 10 year timeframe.

foundations, across the five venues for the 10-yeartimeframe under examination. As expected, knowl-edge representation and data creation/publishing/shar-ing is the top foundation, with almost 23% of thetotal occurrences in all documents. Note that thisfoundation includes several topics that are fundamen-tal to the Semantic Web community (i.e., the abilityto represent semantic data and to publish and sharesuch data). Next in order of importance, the man-agement of such knowledge (data management) andthe construction of feasible systems (system engineer-ing), constitute almost 16% and 11% of the occur-rences, respectively. Important functional areas such assearching/browsing/exploration, data integration andontology/thesaurus/taxonomy management also figurestrongly in comparison to the other foundations (allof them with more than 7.5% occurrences). In con-trast, very specific topics, such as logic and reasoningand concept tagging and annotation represent a modest4.4% and 2.6% respectively, and cross-topics, such ashuman computer interaction, machine learning, com-putational linguistics and NLP, security and privacy,recommendations and analytics are only marginallyrepresented. It is also worth noting that topics that re-late to quality, dynamic data and scalability are alsounder-represented (at around 2%).

In order to gain some insights into the researchtrends within the Semantic Web community over thelast decade, Figure 4 depicts the growth/decline ofeach of the foundations over the 10-year timeframe.Although the general trend for all topics shows yearon year increases, it is worth mentioning that robust-ness, scalability, optimization and performance, dy-namic data/streaming, searching/browsing/exploration

and machine learning have increased by more than200% since 2005. While in contrast, security and pri-vacy and ontology/thesaurus/taxonomy managementhave had marginal growth of only 30% for the sameperiod.

While, Figure 5 focuses on the growth/decline of thetop 20 multi-word topics. Interestingly, results show asharp increase of Linked Data at the expense of Se-mantic Web. Note also that natural language is in thetop-10 multi-word topics, even though this is a crosstopic which may be more represented in a differentcommunity. Finally, the decrease in the occurrence ofweb services can also be seen here.

6. Rexplore Quantitative Analysis

Rexplore [30] is a system that implements novel so-lutions in large-scale data mining, semantic technolo-gies and visual analytics, to provide an innovative en-vironment for exploring and making sense of scholarlydata.

Taxonomy Creation. Rexplore uses the Klink-2 algorithm [29] for automatically generating andregularly updating the Computer Science Ontology(CSO)9, a very large ontology of research areas. Klink-2 processes networks of research entities, such as pub-lications, authors, venues, and technologies, filters outterms that are not research topics, and automaticallyinfers semantic relationships between topics. Somebranches of the ontology, regarding in particular Se-mantic Web and Software Engineering, were also re-

9http://cso.kmi.open.ac.uk

Page 14: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

14 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

Fig. 6. Rexplore: Frequent topics (excluding subtopic) in MV (blue) and FSW (red).

fined by domain experts as a result of previous eval-uations [29,32]. The CSO model10 is an extension ofthe BIBO ontology11 which in turn builds on SKOS12.It includes three semantic relations: relatedEquiva-lent, which indicates that two topics can be treatedas equivalent for the purpose of exploring researchdata (e.g., Ontology Matching, Ontology Mapping),skos:broaderGeneric, which indicates that a topic isa sub-area of another one (e.g., Linked Data, Seman-tic Web), and contributesTo, which indicates that theresearch outputs of one topic contributes to another(e.g., Ontology, Semantic Web). The last version ofthe CSO13 was generated from 16 million publicationsin the Rexplore dataset and includes about 17k topicslinked by 70k semantic relationships. The main root isComputer Science, however the ontology also includes

10http://technologies.kmi.open.ac.uk/rexplore/ontologies/BiboExtension.owl11http://purl.org/ontology/bibo/12https://www.w3.org/2004/02/skos/13http://cso.kmi.open.ac.uk/download/CSO.nt

a few secondary roots, such as Linguistics, Geometry,Semantics, and so on.

CSO presents two main advantages over manu-ally crafted categorizations used in Computer Science(e.g., 2012 ACM Classification, Microsoft AcademicSearch Classification). First, it can characterize higher-level research areas by means of hundreds of sub-topics and related terms, which allows Rexplore tomap very specific terms to higher-level research areas.Secondly, it can be easily updated by running Klink-2 on a set of new publications. A comprehensive dis-cussion of the advantages of adopting an automati-cally generated ontology in the scholarly domain canbe found in [29].

Since we intend to investigate the trend of SemanticWeb topics both within high tier domain conferencesand the literature, we generated two different datasets:one focusing on the main Semantic Web venues (MV)and the second on all literature containing reference tothe Semantic Web (Full Semantic Web, FSW). MV in-cludes 4,734 publications and it was generated by re-trieving the set of articles associated with several core

Page 15: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 15

Fig. 7. Rexplore: Number of publications associated with eight Semantic Web subtopics in MV.

Fig. 8. Rexplore: Number of publications associated with eight Semantic Web subtopics in FSW.

Semantic Web venues: ESWC, ISWC, SEMANTICS,Semantic Web Journal (SWJ), and Journal of WebSemantics (JWS) from the Scopus dataset. FSW in-cludes 32,431 publications and it was produced by se-lecting all articles associated with the topic SemanticWeb or with its 96 associated subtopics in CSO (e.g.,Linked Data, RDF, Semantic Web Services) in bothMV and in the dump containing all Scopus papers inComputer Science for the interval 2000-2015. We la-bel this dataset the Full Semantic Web (FSW). We an-alyzed both datasets in the 2006-2015 period. In MV

the number of publications grows steadily, from lessthan 50 publications a year to more than 650. While,the number of publications in FSW decreases in thelast four years of the period under analysis, because theScopus Dump adopted in this analysis was completeonly until 2013.

Corpus Annotation. The analytics presented in thissection were produced by associating to each topicin CSO (e.g., Human Computer Interaction) all pub-lications that contain in the title, in the abstract, orin the keyword field either 1) the topic label, 2) the

Page 16: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

16 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

Fig. 9. Rexplore: Frequent Semantic Web subtopics in MV (blue) and FSW (red).

alternative labels (e.g., HCI, Human-Computer Inter-action), and 3) its subtopics (e.g., Graphical User In-terfaces, Tangible User Interfaces, User Centered De-sign). For example, we will associate with the topicHuman Computer Interaction all publications taggedwith one of the 293 terms associated with its labels andits subtopics in CSO. This same strategy was imple-mented in Springer Nature semantic pipeline to anno-tate proceeding books [31] and was previously evalu-ated on the software engineering domain, proving thatits performance in classifying papers was not statisti-cally significantly different from the ones of six seniorresearchers [32]. We then counted the number of pa-pers associated with each topics in each year and pro-duced relevant analytics.

Analysis. In this section, we will first analyse themain topics appearing in each dataset and then zoomon Semantic Web subtopics.

Figure 6 shows the main research fields addressedby the Semantic Web papers in both MV and FSW,ranked by the percentage of their publications in thefield of Semantic Web. We excluded from this viewany super and sub areas of Semantic Web that will bediscussed later in detail. Unsurprisingly, the topic On-tology appears in about 61.2% of the papers (55.3%for FSW), followed by Artificial Intelligence (35.1%,27.2%), Information Retrieval (32.7%, 25.2%), QueryLanguages (26.5%, 17.1%) and Knowledge Base Sys-tem (17.5%, 12.7%). Interestingly, these five core re-search areas appear more often in the main venues(+7.1% in average), but they are also very importantareas for the FSW dataset. Other research areas appearmore prominently in one of the datasets. The QueryLanguage area is much more frequent in the MV, prob-ably due to the fact that the main venues tradition-ally are focused on Semantic Web query languages,

such as SPARQL. Formal Logic has a similar behavior(10.9%, 6.6%), suggesting a stronger focus of the mainvenues on this topic. Conversely, other research fieldsappear more often in the FSW dataset. This is the caseof Natural Language Processing (17.5%, 18.9%), Hu-man Computer Interaction (9.8%, 15.5%), Web Ser-vices (5.6%, 13.4%), Electronic Commerce (3.2%,4.3%) and Ubiquitous Computing (3.1%, 5.3%). Thisseems to suggest that there is a good amount of re-search in the intersection of these topics and SemanticWeb that is not fully represented in the main venues.

The Semantic Web field subsumes several heteroge-neous research areas dealing with different aspects ofits vision. Figure 9 shows the popularity of the mainSemantic Web direct subtopics in the two datasets. Weinclude in this view also the area of Ontology Engi-neering, which is not formally a sub-topic of Seman-tic Web, since a very large portion of its outcomesare published in the main Semantic Web venues. It isagain interesting to consider the difference between thedatasets. The topics Linked Data (23.4, 8.8%), Ontol-ogy Matching (9.5%, 2.1%), OWL (9.4%,6.7%), andSPARQL (3.5%,1.9%) are more frequent in the mainvenues. Conversely publications addressing Seman-tic Search (4.0%, 10.6%) and Semantic Web Services(1.7%, 3.9%) are more popular outside these venues.

Figure 7 and Figure 8 show the popularity of themain sub-topics over the years. The two main dynam-ics, evident in both datasets, are the fading of Seman-tic Web Services and the rapid grow of Linked Dataand to a lesser extent of SPARQL. Indeed, SemanticWeb Services is one of the main areas in 2004, andan integral part of the initial Semantic Web vision [2].However, the number of papers about these topics con-sistently decreases and from 2013 there are almost nopublications about them in the MV and very few in

Page 17: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 17

FSW. The second trend is the steady growth of LinkedData from 2007. In 2015 about half of Semantic Webpapers in the main venues refer to this topics. Inter-estingly, both trends are first anticipated by the mainvenues, and only later evident also in the FSW dataset.It thus seems that the tendencies of the main venuesinfluence in time all the Semantic Web research.

7. Saffron Analysis

Saffron is a tool for extracting knowledge struc-ture from text, by gathering and summarizing exper-tise information. Although its principal aim is exper-tise topic extraction, its functions are multiple andits applications various. In terms of functions, it usesNLP and Linked Data techniques to construct domain-specific topical hierarchies, catering for: topic extrac-tion, domain-independent term extraction, topic link-ing, expert finding and linking. Possible applicationsinclude expert search in a research domain or an en-terprise environment [6], exploring the triggers of a fi-nancial crisis [7], forecasting emerging trends from ascientific area [1], and expert profiling.

Taxonomy Creation. Saffron follows a domain-independent method, which is one of its biggest ad-vantages compared to most systems in the area, in thatit does not require external domain-specific classifica-tions. Such information is often not readily availableespecially in niche domains, and creating a classifica-tion is very costly in terms of time, human expertiseneeds, and maintenance. Saffron bypasses this barrierby automatically building a domain model from the in-put corpus. The idea behind the domain model is tocapture the expertise knowledge of the corpus by iso-lating the most generic concepts (i.e., the basic levelcategories of the specific domain). The domain modelis represented as a vector of words, collected basedon linguistic features, distribution properties, and co-herence evaluation. In a second stage, a multi-wordterm extraction is used to collect intermediate levelkeyphrases (known as candidate terms). The domainmodel is subsequently used as a basis to calculate se-mantic relatedness between the candidate terms andthe domain of expertise. We then connect and organizethe resulting topics in an understandable and usablemanner by means of a hierarchical taxonomy. This isachieved by connecting related concepts and directingthe edges from broader to more specific relations usinga global generality measure. This eventually yields ataxonomy of topics, where the root is the most generic

term of the domain, and the leaves the most specificterms. The construction of topic hierarchies from cor-pora is discussed in detail in [5].

Corpus Annotation. In order to prepare the data tobe ingested by Saffron, all collected proceedings weresplit into single paper documents. The papers, all avail-able as pdfs, were subsequently converted to raw textand cleansed of any special or bad characters as well asbadly converted images and figures, which could resultin noise for the analysis or lead to a failure of the sys-tem. Furthermore, the reference section of every paperwas removed in order to reduce potential noise causedby the list of authors’ names, conferences, dates, etc...After term extraction and ranking, the resulting top-ics are connected together by directed edges to forma hierarchical taxonomy, represented as a graph. Wevisualize the results of the Saffron analysis with Cy-toscapes, an open source software tool for complexnetworks graph visualization14. It allows us to performa network analysis on the output provided by Saffron,and a customization of the layout. In our case, the sizeand the color of the nodes are proportional to the num-ber of neighbors each topic is connected to. Figure 10shows the general picture of the graph displaying theinterconnected topics from the results of the analysis.

Analysis. The size of the nodes in the graph is re-lated to the number of edges connected to them. Wecan see in Figure 10 topics that stand out from the anal-ysis. Not surprisingly, the first and predominant one(i.e., the root of the taxonomy) over the ten years pe-riod is the Semantic Web topic itself. Around it, weobserve several main clusters with major keyphrases.The main (i.e., the most connected) keywords areRDF Data and Linked Data, followed by Natural Lan-guage, Data Source and Reasoning Task. A strong fo-cus is also put on Machine Learning, Ontology Engi-neering, Query Execution, and the mark-up languageOWL-S. By concentrating on the clusters, we iden-tify the importance of data in terms of its representa-tion (RDF Data, RDF Graph, Linked Data), its acces-sibility (Open Data), and its query (Query Execution,Query Processing). Some other main interests in thedomain are visible, represented by a cluster made upof Natural Language and topics related to the queryof information such as Semantic and Keyword Search,Keyword Query, Semantic Similarity or InformationRetrieval. Natural Language is also connected to an-other dominating subject that is the one of Machine

14see http://www.cytoscape.org/

Page 18: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

18 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

Fig. 10. Saffron: Taxonomy of Semantic Web topics.

(a) top 10 topics (b) top 11-20 topics

Fig. 11. Saffron: Topic evolution over time for (a) top 10 topics and (b) top 11-20 topics.

Learning, associated to Ontology Matching and Map-ping. One of the branches originating from the Seman-tic Web topic brings together concepts related to thestructure and representation of the ontology (Knowl-edge Base, Knowledge Representation, Ontology Lan-guage, OWL Ontology), while a sub-branch leads tologic and reasoning-related topics (Description Logic,Reasoning task, Reasoning Algorithm). Around the

Ontology Engineering node, User Interface seems tobe an important related matter, as well as tasks likeOntology Development and Ontology Editing.

Given that Saffron establishes and calculates rela-tions between the topics, it thus puts the focus on theterms that share the most connections with other do-main specific words. The most connected terms are

Page 19: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 19

considered the most influential ones as they are used inthe context of many other domain-related topics.

Following the processing of the corpus of ten yearsof Semantic Web publications, we selected the 20 firstmain topics extracted by Saffron (i.e., the most con-nected ones), with the aim of observing the distribu-tion of their use in the documents across the years. Thetwo charts in Figure 11 show the percentage of docu-ments mentioning the aforementioned topics (i.e., thenumber of documents where they appear at least once),per year.

The first thing that catches the eye when looking atthe figure is the fact that Semantic Web as a topic is onthe decline. When speculating as to what this means,on the surface clearly Semantic Web as a topic is men-tioned less in more recent papers. However the reasonscould be manifold, for example the term/field may beso established that the community has established con-fidence to "not have to name it" in their papers, or thatthe community is trying to re-brand their research interms of new terms such as Linked Data. Both wouldbe possible explanations, a more in depth understand-ing of which would certainly require more analysis,but the speculation about it alone might already bringabout a debate concerning our self-understanding tothe community. Emerging or new terms and topics,such as said "Linked Data" are easier to assess andjudge.

On the graph showing the first 10 topics (orderedby the number of connections) extracted from Saffron(Figure 11), the biggest and the most significant pro-gression is striking in the use of the term Linked Data.While it was completely missing in 2006, it experi-enced a very rapid growth in particular between 2008and 2010 where its rise was multiplied by 9, to even-tually reach 64% of the distribution in the documentsby 2015. Another dramatic increase among the top 20terms is the Open Data topic, remarkably developingfrom about 1% in 2006/2007 to 45% in 2015. This isparticularly meaningful, and demonstrates the emer-gence and widening use of Linked Data and Open Datathrough the years. We can notice from the graphs an-other emerging topic which is slowly but graduallygaining importance, namely Query Execution, appear-ing in 2015 in 15% of the publications. RDF Data andData Source are topics whose popularity have also in-creased quite significantly, multiplying their presenceby more than twice in the documents since 2006. Thisshows that the Semantic Web community has placed astronger focus on those concepts through time. Othertopics whose popularity increased with time by at least

twice their initial proportion include RDF Graph (withtwo peaks in 2008 and 2014), Machine Learning (witha peak in 2012) and Query Processing (with a smallpeak in 2009 then a quite steady line).

On the contrary, the very hot topic of Semantic Webitself is less cited in publications, with a decrease of20% between the beginning and the end of the studiedtime frame. Its high degree of genericity may accountfor why it is progressively less used in the publications:specific facets of the domain can be mentioned withoutpointing to the higher level topics. It is still the mostused keyphrase nonetheless, reaching 91% of distribu-tion in the documents in 2006 and lowering down to70% in 2015.

Among the topics experiencing strong variationsthrough time, Web Service is a declining one. After ex-periencing a peak of use in 2008 with a 40% distribu-tion in the documents, it then dropped to less than 20%in 2015. Other main terms decreased with time, al-though in a less pronounced way, like Semantic Searchwhich experiences two small peaks in 2008 and 2011,and slight drops in 2010 and 2012 to a more steadycurve thereafter. Some topics appear to be consistentover the years, such as Ontology Engineering, whilesome others are more volatile. The Natural Languagetopic, despite being equally cited in 2006 and in 2015,gradually dropped in the first half of the period ex-amined, to gain in popularity again after 2011. Key-word Search shows quite a varied pattern, with dropsin 2007, 2010 and 2012, and peaks in 2009 and 2011.As for Service Description, it increased slowly up to13% by 2009, but gradually declined towards its initialvalue by 2015.

8. Topic Alignment and Findings

In this section, we compare and contrast the core andmarginal topics mentioned in the seminal SemanticWeb papers (discussed in Section 4), with primary top-ics identified by the data-driven approaches (presentedin Sections 5-7). For this analysis, in order to cover awider spectrum of topics, we consider the top 40 multi-word topics from PoolParty, Rexplore and Saffron (seeTable 7 in Appendix A). Initially we conducted themapping exercise with the top 20 topics, however af-ter seeing that there were no mappings for several coretopics we elected to use 40 topics instead. The analy-sis described herein is summarized in Table 3, whichdepicts the core research topics identified in the sem-inal papers and their coverage by the data-driven ap-

Page 20: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

20 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

proaches, and Table 4, which presents the marginal re-search topics. Finally, the research topics covered bythe data-driven approaches that were not identified bythe seminal papers are presented in Table 5, while thevisionary research topics from the seminal papers andtheir coverage by the data-driven approaches are de-tailed in Table 6.

8.1. Core and marginal topic analysis

The results presented below are based on a compari-son between the core and marginal topics mentioned inthe seminal Semantic Web papers and the major topicsuncovered by PoolParty, Rexplore and Saffron.

PoolParty Analysis: It is not surprising that it waseasy to align the PoolParty foundational topics withthe core topics identified by the seminal papers (cf.Table 3), as said topics represent eighteen high levelfoundational topics from the Semantic Web commu-nity identified by analyzing existing taxonomies andwell known Semantic Web research sub-communities.One notable omission is distribution, decentralization,and federation, which is an emerging topic in the Se-mantic Web community. Comparing the PoolParty out-put to the marginal topics presented in Table 4, weagain observe good coverage with the exception of thetopics relating to multilingual intelligent agents andchange management and propagation. However, sev-eral topics that didn’t figure in the seminal papers wereuncovered by the PoolParty analysis, such as recom-mendations, use cases, case studies, open data, infor-mation systems, web data, semantic technology, andstructured data (cf. Table 5). Given that these topicsare very general in nature they could not be easilymapped to the primary topics appearing in the seminalpapers.

Rexplore Analysis: As shown in Table 3, Rexploreextracted topics that can be mapped to the majorityof core topics with the exception of two topics: se-mantic web databases and distribution, decentraliza-tion, federation. Like PoolParty, when it came to themarginal topics (cf. Table 4), multilingual intelligentagents and change management and propagation werenot among the core topics extracted by Rexplore, addi-tionally evidence of research on scalability, efficiency,robust semantic approaches was not present in the top40 topics. However, Rexplore also identified severalapplication or use case oriented keywords that werenot mentioned in the seminal papers (cf. Table 5),such as computational linguistics, recommender sys-

tems, mobile devices, cloud computing, e-learning sys-tem, robotics, electronic commerce systems, and deci-sion support systems.

Saffron Analysis: The mapping of the seminal papertopics to the results of the Saffron topic analysis af-firms that all core topics that appear in the seminalpapers appear in the Semantic Web corpus. Interest-ingly, Saffron is the only approach that uncovered evi-dence of distribution, decentralization and federation,however in contrast privacy, trust, security, and prove-nance did not figure in the top 40 topics uncovered bySaffron. Although Saffron has good coverage of thecore topics, it was less successful at identifying key-words that could be aligned to the marginal topics (cf.Table 4). Like PoolParty and Rexplore, topics relatingto multilingual intelligent agents and change manage-ment and propagation were not present in the top 40,however nor were scalability, efficiency, robust seman-tic approaches and semantic web services. Like Pool-Party, Saffron also identified general topics that couldnot be aligned with topics from the seminal papers,namely, open data, web data, web technology.

8.2. Evidence of Future Topics

Besides using the data-driven approaches to look forevidence of the topics that the community have beenactively working on, we also investigated if the data-driven approaches could also find evidence of futuretrends predicted in the seminal papers, in particularthose mentioned by Bernstein et al. [3]. According toour mapping presented in Table 6, evidence of each ofthe four main lines of future research topics was un-covered by at least one of the data-driven approaches.Interestingly, all approaches found topics which indi-cate that research relating to the Internet of Things,streaming and sensor data is becoming sufficientlywide-spread to indicate its rise in importance withinthe Semantic Web community. However, at the sametime, the other three topics that relate to scale, intel-ligent software agents and quality were only weaklyidentified by the seminal papers.

8.3. Evidence of Trends

In the following we summarize the analysis of thetrends identified by PoolParty (cf. Figure 4- 5), Rex-plore (cf. Figure 7- 8) and Saffron (cf. Figure 11). Thefoundational topic and trend analysis conducted viaPoolParty did not yield any useful results, as gener-

Page 21: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 21

Table 3Core research topics identified in the seminal papers and their coverage by the data-driven approaches.

Coverage Matched topicsCore topic PoolParty Rexplore Saffron PoolParty Rexplore Saffron

knowledgerepresentation languagesand standards

knowledge representation, knowledge based systems,knowledge representation,Resource DescriptionFramework (RDF), WebOntology Language (OWL)

rdf data, owl s, blank node,object property

Knowledge structuresand modeling

ontology/thesaurus/taxonomymanagement, websemantics, ontology engi-neering, ontology language,data models, ontologymatching

ontology, ontology engineer-ing

owl ontology, ontologyengineering, rdf graph,data model, ontology lan-guage, ontology editing,web semantics, ontologydevelopment, ontologymatching

logic and reasoning description logic, formallogic/ formal languages/de-scription logics, logicprogramming

formal logic, descriptionlogic, Web Ontology Lan-guage (OWL)

reasoning task, descriptionlogic

search, retrieval,ranking, questionanswering

search engines, semanticsearch, web search, naturallanguage, searching/ brows-ing/ exploration, computerlinguistics & NLP systems,information retrieval

information retrieval, se-mantic search/similarity,computer linguistics

keyword search, semanticsearch, natural language, in-formation retrieval

matching and dataintegration

ontology matching, ontologyalignment, similarity mea-sures, data integration

ontology matching, data inte-gration

ontology matching, seman-tic similarity

privacy, trust, security,provenance

- security & privacy security of data, data privacy -

semantic web databases - data sets, knowledge base,data source, knowledge man-agement, data management

knowledge base systems data source, relationaldatabase, knowledge base

distribution,decentralization,federation

- - - - federated query, federatedquery processing

query languages andmechanisms

query languages, query an-swering, query processing

query languages, SPARQL,SPARQL queries

query execution, keywordquery, query processing,query language

linked data linked data, linked open data,semantic web, web of data,data integration, data cre-ation/publishing/sharing

linked data, semantic web,linked open data, data inte-gration

linked data, semantic web

knowledge extraction,discovery andacquisition

information retrieval, ma-chine learning, extraction,data mining, text mining,entity, extraction, analytics,machine learning

information retrieval, natu-ral language processing, datamining, machine learning,natural language processingsystems

machine learning, informa-tion retrieval

ally speaking work on each of the foundational top-ics appear to be increasing year on year. A cross cor-relation of the trends highlighted by PoolParty, Rex-plore and Saffron provides evidence that topics such aslinked data, open data and data sources have an up-ward trend, whiles topics such as semantic web, webservice, service description and ontology matching ap-pear to be on a downward trend. When it comes totrend analysis using the data-driven approaches, it isclear that neither foundational topic analysis nor topicspecific analysis, provides us with enough evidenceto confirm the visions outlined in the seminal papers.For this there is a need for a more focused analysis

that maps visions to relevant research topics and usesyear on year aggregate counts to depict trends. Al-though, Fernandez Garcia et al. [17] made some ini-tial attempts at mapping the trends identified by Pool-Party to the visions from the seminal paper, unfortu-nately such a mapping is not very straightforward evenfor manual mappings and as such is left to future work.

8.4. Mixed Method Observations

The comparative analysis of the research topicsidentified with the qualitative and quantitative meth-ods, discussed in the previous sections, reveals several

Page 22: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

22 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

Table 4Marginal research topics identified in the seminal papers and their coverage by the data-driven approaches.

Coverage Matched topicsMarginal topic PoolParty Rexplore Saffron PoolParty Rexplore Saffronmultilingual intelligentagents

- - - - - -

semantic web services web service, semantic webservice

web services, semantic webservices

web service, service descrip-tion

visualization, userinterfaces andannotation

user interfaces, semantic an-notation, human computerinteraction & visualization,annotation, concept tagging

human computer interaction,visualization

user interface

(scalability, efficiency,robust semanticapproaches)

- - robustness, scalability, opti-mization and performance

- -

change management andpropagation

- - - - - -

(social semantic web,FOAF)

social network social networks social medium

Table 5Research topics covered by the data-driven approaches that were not identified by the seminal papers.

PoolParty Rexplore Saffronrecommendations, use cases, case studies, opendata, information systems, web data, semantictechnology, structured data

computational linguistics, recommender systems,mobile devices, cloud computing, e-learning sys-tem, robotics, electronic commerce systems, deci-sion support systems

open data, web data, web technology

Table 6Visionary research topics from the seminal papers and their coverage by the data-driven approaches.

Coverage Matched topicsFuture topic PoolParty Rexplore Saffron PoolParty Rexplore Saffron

scale changes drastically - robustness, scalability, op-timization and performance

- -

intelligent software agents - - - artificial intelligence -

(Internet of Things), highvolume and velocity of data,e.g., streaming & sensor data

dynamic data / streaming Internet of Things stream processing

data quality, e.g,representation, assessment

- - quality - -

interesting observations on the benefits and drawbacksof these approaches, as discussed next.

Qualitative vs. Quantitative approaches. Compar-ing the quality of topic detection using data-drivenmethods with that of expert-driven methods (cf. Ta-ble 3), we observe that data-driven approaches hada high recall when it comes to detecting core top-ics identified by experts in the seminal papers. While,data-driven methods failed to cover multidisciplinarytopics, (i.e., topics that cross boundaries between ar-eas), such as distribution, decentralization, federation,or privacy, trust, security, provenance, or semantic webdatabases. These weakly covered topics are particu-larly interesting, as they indicate research areas that,although considered important by experts, have not yetattracted a critical mass of research to be reliably iden-

tified with quantitative methods. These topics, couldfor example be especially encouraged in calls for pa-pers of future conferences or via workshops or journalspecial issues.

Analyzing the coverage of marginal topics (cf. Ta-ble 4), we find an opposite phenomenon of researchtopics for which there is marginal agreement amongexperts, but strong data-driven evidence of work onthose topics. Indeed, data-driven approaches con-firm some of the marginal topics such as social se-mantic web and human computer interaction. Theseare topics on which a sufficient volume of work isperformed to allow identification by data-driven ap-proaches, but for which a core community has not yetbeen formed. These are obviously interesting topicsfor the community and work on them should be fos-

Page 23: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 23

tered with community building efforts such as organiz-ing dedicated workshops or benchmarking activities.

As expected, the coverage of visionary topics ( Ta-ble 6) was lower. Although these periphery topics aresomehow addressed by the Semantic Web community,the data-driven analysis failed to represent them withthe required fine-grained details. As per the other cate-gories, work on these topics should be encouraged viaworkshops, call for papers, and special issues. How-ever, further work on trend detection and analysis isneeded in order to better detect emerging topics and tounderstand the research gaps with respect to the vision.

A major benefit of data-driven methods is that theyare capable of providing evidence of the popularity ofresearch areas and topics over time and consequentlycan be used to derive research trends (although theseare somewhat sensitive to the available data and canbe less accurate when data is missing, for instance to-wards the end of the analysis period).

Comparison of Quantitative Methods. For thequantitative analysis of our work, we employed data-driven methods that differed, among others, in the waythe topic taxonomy was created. In the case of Pool-Party a manually built topic taxonomy was employedwhich closely reflected the topics on which the com-munity are looking for in call for papers or in con-ference programs. Rexplore made use of the CSO on-tology, a large-scale ontology of computer science ex-tracted from a very large corpora and key research ar-eas as well as associated research topics. Finally, Saf-fron extracted its taxonomy of topics entirely from thecorpus under analysis and used clustering to identifytopics that belong to a research area (without actuallyderiving research area names). Obviously, these ap-proaches of procuring the topic taxonomy are decreas-ing in terms of cost as per the time of expert involve-ment.

In terms of overall performance, (cf. Tables 3, 4, 6),PoolParty identified 17/21 core, marginal and futuretopics (10/11 core topics; 4/6 marginal topics; 3/4 fu-ture topics). Together with Saffron, PoolParty identi-fied the most core topics, while achieving the high-est recall for the other two topic categories too (i.e.,marginal and future topics). Closely after PoolParty,Rexplore identified 14 of the 21 topics of the ResearchLandscape (9/11 core topics; 3/6 marginal topics; 2/4future topics), identifying in each category just onetopic less than PoolParty. Finally, Saffron is overallvery close in its coverage to that of the other two toolsby identifying 13 out of 21 topics (10/11 core topics;

2/6 marginal topics; 1/4 future topics). While having avery good coverage of the core topics, Saffron’s per-formance was remarkably inferior to the other tools forthe other topic categories, where it primarily identifiedthose topics which were already identified by the othertools. From the above, we conclude that the use ofa-priory built taxonomies of research areas, whilemore expensive, it leads to a better coverage of re-search topics, especially in the analysis of marginalor emerging research topics. Moreover, we attributethe high success of PoolParty to covering research top-ics to the fact that it relied on a high-quality, manu-ally built topic taxonomy that was well aligned to thedomain.

While the most cost-effective, Saffron identified abag of topics that was less straightforward to align toresearch areas than the output of the other two ap-proaches that relied on taxonomies of research areas(and associated topics). The alignment and interpreta-tion of Saffron keywords required expert knowledgeand therefore Saffron should mostly be used in settingswhere such expert knowledge is available.

While PoolParty had the best performance in con-firming research topics from the qualitative analysis,Rexplore provided the most additional topics (cf. Ta-ble 5), clearly identifying research topics at the in-tersection of the Semantic Web and other researchcommunities (e.g., computational linguistics and cloudcomputing), thus providing invaluable support in po-sitioning the work of our community in a broader re-search context.

9. Conclusion

The analysis of research topics and trends is an im-portant aspect of scientometrics which is expandingfrom qualitative expert-driven approaches to also in-clude data-driven methods. The Semantic Web com-munity is no different, with several seminal papers re-flecting on and predicting the work of the commu-nity and data-driven methods (based on Semantic Webtechnologies) trying to achieve similar topic and trenddetection activities (semi-)automatically.

In this paper, we aimed to go beyond the variousviews on our community’s Research Landscape scat-tered in several papers and obtained with differentmethods. To that end, we proposed the use of a mixedmethods approach that can converge, unify but alsocritically compare conclusions reached with both ex-pert or data-driven approaches. The main conclusions

Page 24: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

24 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

of our work refer not only to the important topics forSemantic Web research but also refer to strengths andweaknesses of the various topic extraction and analysismethods.

An overarching Semantic Web Research Land-scape. A key benefit and novelty of our work is thatwe identified and aligned core research topics men-tioned in the seminal papers and then verified theseusing data-driven methods. After extracting, groupingand aligning the topics from the seminal papers, weconcluded on eleven core Semantic Web topics (cf. Ta-ble 3), out of which eight were confirmed by all thedata-driven approaches, while the remaining three in-dicate topics that are important but not sufficiently rep-resented in papers at the key Semantic Web venues.Besides these core topics, we capture six marginal top-ics (cf. Table 4) out of which two are very stronglysupported by evidence from data-driven methods.

From a trends perspective it was clearly visible thattopics such as linked data, open data and data sourceshave increased in importance over the year. While, atthe same time topics such as semantic web, web ser-vice, service description and ontology matching seemto appear less and less. Although we could speculateas to why this is the case (e.g., a push by the commu-nity towards using semantic technology to open up andlink data may have caused a decline work in relation toservice based machine-to-machine interaction), how-ever a more in depth analysis, involving sources otherthan over research papers, would be needed in order toconform our suspicions.

Looking into the future, we identify four future top-ics (cf. Table 6), from which the topics on IoT, sen-sor and streaming data has ample evidence in theanalyzed research corpus. Finally, the Rexplore data-driven method provided insights into the interactionsof our fields with other research areas, highlightingits cross-disciplinary nature. Considering the growinginterest in scientomentrics within the Semantic Webcommunity, our findings could be used as a base-line for benchmarking other topic and trend detectionmethods for the same time period, or extended to caterfor more recent work by the community.

Strengths and weaknesses of methods. Qualita-tive, expert-driven methods benefit from insights byexperts who reflect on past or present research topicsand trends and predict future directions. As such, theyremain valuable assets in the scientometrics tool-box.Data-driven methods challenge expert-analysis by pro-viding a surprisingly high recall, especially for core

research topics, and naturally less for marginal andemerging topics. However, a major benefit of data-driven methods is that their findings are backed-up byquantitative data which can be used to perform a rangeof other analytics such as research trend detection oridentifying connections between research topics.

A key element of the data-driven approaches consid-ered here is the use of a topic taxonomy which can bederived with costly, manual effort, semi-automaticallyor fully-automatically. Not-entirely surprisingly, well-curated taxonomies lead to the best performance, butthese naturally age very quickly and their mainte-nance is not sustainable. Therefore, semi-automatic orfully-automatic taxonomy construction methods offera cheaper and more sustainable alternative with only aslight loss of recall.

In this paper, we proposed and demonstrated theuse of a mixed methods approach, which combinesboth qualitative and quantitative methods in an attemptto overcome their respective weaknesses. This mixedmethods approach has several strengths. Firstly, it al-lowed us to synchronize the results of several qual-itative studies and propose a unified Research Land-scape of the area. Secondly, by comparing and con-trasting the Research Landscape with the results of thedata-driven methods, we could: (1) confirm those top-ics that are both seen as important by experts and forwhich quantitative evidence can be gathered - theseare clearly core topics in the community; (2) identifytopics that experts consider important but for whichdata-driven methods do not (unanimously) find suffi-cient evidence in the paper corpus - these are topicsthat the community should encourage; (3) identify top-ics on which not all experts agree (which is naturalgiven some bias inadvertently brought in by experts)but which are strongly represented in the research data- these topics could benefit from community build-ing efforts. To summarize, mixed methods allows fordrawing interesting conclusions in areas where quan-titative and qualitative methods agree or disagree. Aweak-point of the presented method is the use of man-ual extraction and alignment of topics which couldhave introduced bias. We tried to minimize this by per-forming each of these steps with multiple experts andthen reaching agreement where their opinions differed.

In this paper we have mainly focused on two ap-proaches to analyze and reflect about the past and tosome extent the future development of our researchcommunity, using expert opinions and applying ourown data-driven methods. However, we could go a

Page 25: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 25

third way, adopting either (as mentioned in the end ofSection 4.2) emerging methods such as crowdsourcingfor a similar reflectional exercise. That is, based on thefindings and topics presented here, let the communityitself on a larger scale than relying on the insights ofa few of its established experts, assess the importanceand future of topics for the community. Such an anal-ysis should probably counteract biases in terms of en-suring that researchers do not assess/favor the (future)importance of their own field of research, but we wouldexpect this to be an interesting future direction.

Additional avenues for further study include: a morefocused analysis that maps visions to relevant researchtopics and generates the corresponding trends; thedeepening of the work to better understand the typeof coverage offered in each of the identified researchtopics; and a broadening of the work to consider notonly the research topics but also the application areasand domains where these technologies are routinelyapplied.

Also, it would be interesting to test this method inother communities (e.g., Software Engineering) and tofurther improve the topic-alignment methods to furtherreduce bias.

References

[1] Kartik Asooja, Georgeta Bordea, Gabriela Vulcu, and PaulBuitelaar. Forecasting emerging trends from scientific lit-erature. In Nicoletta Calzolari (Conference Chair), KhalidChoukri, Thierry Declerck, Sara Goggi, Marko Grobelnik,Bente Maegaard, Joseph Mariani, Helene Mazo, AsuncionMoreno, Jan Odijk, and Stelios Piperidis, editors, Proceed-ings of the Tenth International Conference on Language Re-sources and Evaluation (LREC 2016), Paris, France, 2016. Eu-ropean Language Resources Association (ELRA). ISBN 978-2-9517408-9-1.

[2] Tim Berners-Lee, James Hendler, Ora Lassila, et al. The se-mantic web. Scientific American, 284(5):28–37, 2001.

[3] Abraham Bernstein, James Hendler, and Natalya Noy. A newlook at the semantic web. Communications of the ACM, 59(9):35–37, 2016.

[4] David M Blei, Andrew Y Ng, and Michael I Jordan. Latentdirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.

[5] Georgeta Bordea. Domain adaptive extraction of top-ical hierarchies for Expertise Mining. Thesis, Na-tional University of Ireland, Galway, Ireland, 2013. uri:http://hdl.handle.net/10379/4484.

[6] Georgeta Bordea, Sabrina Kirrane, Paul Buitelaar, andBianca O Pereira. Expertise mining for enterprise content man-agement. In Proceedings of the 8th International Conferenceon Language Resources and Evaluation (LREC 2012), pages3495–3498, 2012.

[7] Georgeta Bordea, Kartik Asooja, Paul Buitelaar, and LeonaO’Brien. Gaining insights into the global financial crisis usingsaffron. NLP Unshared Task in PoliInformatics, 2014.

[8] Georgetas Bordea and Paul Buitelaar. Expertise mining. InProceedings of the 21st National Conference on Artificial In-telligence and Cognitive Science, Galway, Ireland, 2010.

[9] Michel Callon, Jean-Pierre Courtial, William A Turner, andSerge Bauin. From translations to problematic networks: Anintroduction to co-word analysis. Information (InternationalSocial Science Council), 22(2):191–235, 1983.

[10] Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. Emerg-ing topic detection on twitter based on temporal and socialterms evaluation. In Proceedings of the tenth internationalworkshop on multimedia data mining, page 4. ACM, 2010.

[11] David Chavalarias and Jean-Philippe Cointet. Phylomemeticpatterns in science evolutionâATthe rise and fall of scientificfields. PloS one, 8(2):e54847, 2013.

[12] Chaomei Chen. Citespace ii: Detecting and visualizing emerg-ing trends and transient patterns in scientific literature. Journalof the Association for Information Science and Technology, 57(3):359–377, 2006.

[13] Xiang-Ying Dai, Qing-Cai Chen, Xiao-Long Wang, and JunXu. Online topic detection and tracking of financial news basedon hierarchical clustering. In Machine Learning and Cyber-netics (ICMLC), 2010 International Conference on, volume 6,pages 3341–3346. IEEE, 2010.

[14] Sheron Levar Decker. Detection of bursty and emerging trendstowards identification of researchers at the early stage oftrends. PhD thesis, uga, 2007.

[15] Jörg Diederich, Wolf-Tilo Balke, and Uwe Thaden. Demon-strating the semantic growbag: automatically creating topicfacets for faceteddblp. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, pages 505–505. ACM,2007.

[16] Lee Feigenbaum, Ivan Herman, Tonya Hongsermeier, EricNeumann, and Susie Stephens. The semantic web in action.Scientific American, 297(6):90–97, 2007.

[17] Javier David Fernandez Garcia, Elmar Kiesling, Sab-rina Kirrane, Julia Neuschmid, Nika Mizerski, AxelPolleres, Marta Sabou, Thomas Thurner, and PeterWetz. Propelling the potential of enterprise linkeddata in austria. roadmap and report., 2016. URLhttps://www.linked-data.at/wp-content/uploads/2016/12/propel_book_web.pdf.

[18] Volker Frehe, Vilius Rugaitis, and Frank Teuteberg. Sciento-metrics: How to perform a big data trend analysis with sci-enceminer. In GI-Jahrestagung, 2014.

[19] Birte Glimm and Heiner Stuckenschmidt. 15 years of semanticweb: An incomplete survey. KI-Künstliche Intelligenz, 30(2):117–130, 2016.

[20] Thomas Hofmann. Probabilistic latent semantic indexing. InProceedings of the 22nd annual international ACM SIGIR con-ference on Research and development in information retrieval,pages 50–57. ACM, 1999.

[21] William Hood and Concepción Wilson. The literature of bib-liometrics, scientometrics, and informetrics. Scientometrics,52(2):291–314, 2001.

[22] Yingjie Hu, Krzysztof Janowicz, Grant McKenzie, KunalSengupta, and Pascal Hitzler. A linked-data-driven andsemantically-enabled journal portal for scientometrics. InInternational Semantic Web Conference, pages 114–129.

Page 26: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

26 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach

Springer, 2013.[23] Yookyung Jo, Carl Lagoze, and C Lee Giles. Detecting re-

search topics via the correlation between graphs and texts. InProceedings of the 13th ACM SIGKDD international confer-ence on Knowledge discovery and data mining, pages 370–379. ACM, 2007.

[24] Jon Kleinberg. Bursty and hierarchical structure in streams.Data Mining and Knowledge Discovery, 7(4):373–397, 2003.

[25] Nancy L Leech and Anthony J Onwuegbuzie. A typology ofmixed methods research designs. Quality & quantity, 43(2):265–275, 2009.

[26] Fergal Monaghan, Georgeta Bordea, Krystian Samp, and PaulBuitelaar. Exploring your research: Sprinkling some saffron onsemantic web dog food. In Semantic Web Challenge at the In-ternational Semantic Web Conference, volume 117, pages 420–435, 2010.

[27] Satoshi Morinaga and Kenji Yamanishi. Tracking dynamics oftopic trends using a finite mixture model. In Proceedings of thetenth ACM SIGKDD international conference on Knowledgediscovery and data mining, pages 811–816. ACM, 2004.

[28] Mizuki Oka, Hirotake Abe, and Kazuhiko Kato. Extractingtopics from weblogs through frequency segments. In Proc.of the Workshop on the Weblogging Ecosystem: Aggregation,Analysis and Dynamics, 2006.

[29] Francesco Osborne and Enrico Motta. Klink-2: integratingmultiple web sources to generate semantic topic networks.In International Semantic Web Conference, pages 408–424.Springer, 2015.

[30] Francesco Osborne, Enrico Motta, and Paul Mulholland. Ex-ploring scholarly data with rexplore. In International semantic

web conference, pages 460–477. Springer, 2013.[31] Francesco Osborne, Angelo Salatino, Aliaksandr Birukou, and

Enrico Motta. Automatic classification of springer nature pro-ceedings with smart topic miner. In International SemanticWeb Conference, pages 383–399. Springer, 2016.

[32] Francesco Osborne, Patricia Lago, Muccini Henry, and MottaEnrico. Reducing the effort for systematic reviews in softwareengineering. In preparation, 2018.

[33] Sergey Parinov and Mikhail Kogalovsky. Semantic linkages inresearch information systems as a new data source for sciento-metric studies. Scientometrics, 98(2):927–943, 2014.

[34] J Michael Schultz and Mark Liberman. Topic detection andtracking using idf-weighted cosine coefficient. In Proceedingsof the DARPA broadcast news workshop, pages 189–192. SanFrancisco: Morgan Kaufmann, 1999.

[35] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, andZhong Su. Arnetminer: extraction and mining of academic so-cial networks. In Proceedings of the 14th ACM SIGKDD inter-national conference on Knowledge discovery and data mining,pages 990–998. ACM, 2008.

Appendix

A. Additional results

Page 27: 1 IOS Press A decade of Semantic Web research …2 S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach Semantic Web technologies have

S. Kirrane et al. / A decade of Semantic Web research through the lenses of a mixed methods approach 27

Table 7Extended topics: Top-40 multiwords in Poolparty and top-40 topics in Rexplore (MV) and Saffron

Poolparty Rexplore Saffron1 semantic web semantic web semantic web2 linked data ontology rdf data3 knowledge base artificial intelligence linked data4 web service information retrieval natural language5 web semantics query languages data source6 data source linked data reasoning task7 data sets knowledge based systems machine learning8 description logic natural language processing systems query execution9 on the web Computational Linguistics owl S

10 natural language formal logic ontology engineering11 use cases data mining rdf Graph12 social network knowledge representation User Interface13 query languages human computer interaction service description14 search engines ontology matching open data15 query answering web ontology language (OWL) semantic search16 user interfaces description logic query processing17 semantic annotation linked open data (LOD) keyword search18 information retrieval data integration keyword query19 web of data web services owl ontology20 open data resource description framework (RDF) web service21 data models security of data query language22 semantic search ontology engineering data model23 ontology matching semantic search/similarity ontology matching24 information systems social networks web data25 query processing SPARQL federated query26 machine learning data privacy stream processing27 ontology language recommender systems relational database28 semantic web service electronic commerce blank node29 linked open data sensors information retrieval30 logic programming ubiquitous computing ontology language31 knowledge management semantic information description logic32 data integration SPARQL queries federated query processing33 ontology engineering pattern recognition semantic similarity34 semantic technology data visualization object property35 ontology alignment knowledge acquisition ontology editing36 web search information technology social medium37 web data mobile devices knowledge base38 structured data wikipedia web technology39 case studies machine learning web semantics40 similarity measures DBpedia ontology development