Department of Computer Science at
Friedrich-Schiller-University Jena
An approach for semantic enrichment of social media resources for context dependent processing
Diploma Thesis submitted for the degree of
Diplom-Informatiker
submitted by
Oliver Schimratzki
supervised by
Birgitta König-Ries
Fedor Bakalov
January 26, 2010
Abstract
This diploma thesis provides the functional basis for information filtering in the domain
of complexity. It contributes to the domain-specific, adaptive portal CompleXys, which
filters blog entries and similar social media resources according to their relevance to a
specific context.
The first of the two modules developed in this work is a semantic enrichment module.
Its purpose is to extract and provide semantic data for each input document. This
semantic data should be suitable both for deciding a document's relevance to the domain
of complexity and for further use in the filter module. The module employs various
approaches to perform a multi-label text classification against a fixed complexity
thesaurus.
The second implemented module is a content filter module. It provides a dynamic
system of filters that forms an access interface to the document store. It uses the
previously extracted annotation and classification data to enable complex, semantically
based filter queries.
Although the performance of the complete system can only be tested once it is fully
implemented, this thesis also conducts a first proof-of-concept evaluation of the two
created modules. It investigates the classification quality of the semantic enrichment
module as well as the response time behavior of the content filter module.
Acknowledgements
This thesis is the result of my research and implementation work in a project of the
Heinz-Nixdorf Endowed Chair of Practical Computer Science at the Friedrich-Schiller
University of Jena. I have been very fortunate to be able to finish my studies
within such a pleasant and interesting environment. For this opportunity I would like to
give special thanks to my two supervisors, Birgitta König-Ries and Fedor Bakalov.
Without them I would never have been able to create this work.
Furthermore, I would like to thank Adrian Knoth and, again, Fedor Bakalov, who
contributed a lot to the basic project architecture upon which I built my work and who
implemented the basic functions I relied on. Adrian was also kind enough to provide a
database server for my work.
Additionally, I am indebted to my thesis reviewers Fedor Bakalov, Birgitta König-Ries
and Gerald Albe. They all helped to improve the text with their various comments and
suggestions.
Yet another important source for this thesis were the developers of the tools I worked
with and the authors of the papers and books I cited. Among them, special mention
should be given to the makers of GATE and KEA++, who were the most important
external supporters of my work.
Of course, this page also has a place reserved solely for my parents, who set
my personal long-time record of twenty-six years of nonstop support. I can hardly express
how grateful I am. Just... thank you!
Last but not least, I would like to give a huge thank-you to my beloved fiancée Monika Heyer
for her steady patience, encouragement and love. You are great. =o)
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Scope Of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4.1 Complex Systems Portal . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.2 Semantic Content Enrichment . . . . . . . . . . . . . . . . . . . . 5
1.4.3 Complexity Domain Model . . . . . . . . . . . . . . . . . . . . . . 5
1.4.4 Semantic Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 CompleXys 9
2.1 Requirement Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Performance Requirements . . . . . . . . . . . . . . . . . . . . . . 17
2.1.4 Design Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Essentials 23
3.1 Notation of Semantic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 Semantic Annotations . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.3 Microformats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Lexical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Morphological Analysis . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.4 Syntactic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.5 Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.6 Pragmatic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.7 Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Tools and Standards 31
4.1 SIOC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 GATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.1 Corpus Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.2 ANNIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 SKOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 KEA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Candidate Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.2 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.3 KEA++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5 Calais . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.1 OpenCalais WebService . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.2 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Related Work 45
6 Semantic Content Annotator 51
6.1 CompleXys Domain Ontology . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.1.1 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.1.2 CompleXys Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2 Semantic Content Annotator Pipeline . . . . . . . . . . . . . . . . . . . . 54
6.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2.2 Crawled Content Reader . . . . . . . . . . . . . . . . . . . . . . . 57
6.2.3 Onto Gazetteer Annotator . . . . . . . . . . . . . . . . . . . . . . . 58
6.2.4 Kea Annotator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2.5 Open Calais Annotator . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2.6 Content Writer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7 Semantic Filter 61
7.1 Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.1.1 Filter Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.1.2 Abstract Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.1.3 Logic Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.1.4 Basic Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.2 Output Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.2.1 RSS Converter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.2.2 Sesame Triplestore Converter . . . . . . . . . . . . . . . . . . . . . 66
8 Evaluation 69
8.1 Classification Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.1.1 Document Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
8.1.2 Test Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.1.3 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.2 Response Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.2.1 Test Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.2.2 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
9 Summary and Future Work 79
References 82
List of Figures 89
List of Tables 91
CHAPTER 1
Introduction
This chapter introduces the subject of this thesis, describes the task and clarifies the
further working procedure. First, it introduces and motivates the general topic in
Sections 1.1 and 1.2. Then it clarifies the objectives in Section 1.3 and sets the thesis
scope in Section 1.4. Finally, it outlines the further chapters in Section 1.5.
1.1 Background
The world wide web is by far the greatest data repository mankind has created. But a
majority of the information stored therein is incomprehensible when one lacks the
semantic context it is embedded in. Most people are able to manually reconstruct this
context from a text, but searching the web for complex information is often an incredibly
time-consuming and hard task, even in times of elaborate search engines. Enhanced
automation capabilities would therefore be a great achievement in the evolution of
the www. Unfortunately, machines so far perform far worse at text understanding
than people do. One possibility to overcome this problem is to change
the structure of the data itself and to explicitly provide the additional semantics that
people normally add implicitly in their minds.
Tim Berners-Lee, who is credited with the invention of the world wide web, proposed
a corresponding concept back in 1989 [4]. At the time he suggested not mere
hyperlinks, but typed ones: "the web of relationships amongst named objects". This
idea resulted in the first HTML version [60], which contains the type element as well as
the rel and rev attributes. Type is used to define the kind of relationship the source
document has towards another resource. The rel attribute is applicable to other HTML
elements and can be used to describe the appropriate type of semantic relationship
towards a second resource. Rev is the reverse: a passive counterpart to the active
rel attribute. Indeed, type became popular for defining structural references, such as
linking the related stylesheet of a website, an alternate printing version or an RSS feed,
but it never became widely established as a semantic informant, while rel and rev
remained mostly unused. The semantic HTML elements were misused for presentational
purposes for a long time, until the W3C CSS Level 1 Recommendation [38] of 1996
started the slowly progressing counterrevolution of strictly separating presentation and
content. This development finally led to a rediscovery of semantic HTML in today's
microformats movement, which will be further described in Subsection 3.1.3.
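As a small, hypothetical HTML fragment (the file names and attribute values are invented for illustration, not taken from the thesis), the two directions of typed linking look as follows:

```html
<!-- rel: the forward relationship of this document towards the target -->
<link rel="stylesheet" href="style.css">
<link rel="alternate" type="application/rss+xml" href="feed.rss">

<!-- rev: the reverse, passive direction of the relationship -->
<a href="review.html" rev="made">A review made about this document</a>
```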
Figure 1.1: The Semantic Web layers1
However, these initial problems did not stop Tim Berners-Lee from pursuing his vision
further and from publishing a Semantic Web Road Map in 1998 [59], which marks the
starting point of the W3C's Semantic Web activities towards the machine-understandable
web. The Semantic Web layer diagram in Figure 1.1 shows the components that should
finally achieve what the first attempt had not. It is easily perceptible that the whole
approach is based on the traditional URIs2 and the Resource Description Framework
RDF3, which are used to reference and describe resources in a standardized way.
Furthermore, the ontology component is of special interest for the topic of this thesis,
because ontologies can be used to describe a domain in a machine-understandable way.
1 Accessed on January 20, 2010: http://www.w3.org/2007/03/layerCake.png
2 http://tools.ietf.org/html/rfc3986
3 http://www.w3.org/RDF/
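As a minimal, hypothetical sketch of this mechanism: a URI identifies a resource, and RDF triples describe it in a standardized, machine-readable way. The example below uses the common Dublin Core vocabulary; the URI and the literal values are invented for illustration.

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://example.org/blog/post-42>
    dc:title   "Emergence in complex networks" ;
    dc:subject "complexity" .
```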
While the development and establishment of the Semantic Web is still the focus of many
researchers and organizations, the Web 2.0, another fundamental change to the usage
of the internet, seemingly overtook these efforts throughout the preceding decade. It
is often characterized as the step from a read-only (Web 1.0) environment to a read-
write ecology [56]. With blogs, social networks and wikis, today's web is no longer
mostly a consumption medium, but an infrastructure on which everyone can collaborate
and publish. Among other things, this leads to interesting usage approaches such as
crowdsourcing [28], which harnesses the collective knowledge and creativity of network
users to produce outcomes that are competitive with those of task experts, but often
notably less expensive.
Another important development in web-related systems is context awareness. Web
applications are nowadays often conscious of the habits and interests of the accessing
user and tailor the content and structure of the user interface individually to his special
needs. For example, this approach is successfully applied by the Amazon4 and Sevenload5
recommendations to improve their respective portals.
1.2 Motivation
Given these characteristics, it is a reasonable step to combine Web 2.0, the Semantic Web
and context awareness in order to achieve an even more useful web environment.
While the emergence of the Web 2.0 provides an enormous and steadily growing amount of
new information, the Semantic Web and context awareness are powerful approaches to
efficiently utilize all this data for individual users without them getting lost in a state of
information overload. A contribution to such efforts can be made by exploring the
possibilities of using Semantic Web components for crowdsourcing and context-aware
content systems. Accordingly, this thesis investigates the question of how semantic data
and Semantic Web technologies can be applied to the task of utilizing resources that are
freely published in scientific blogs or news sites as content for a fixed-domain, adaptive
portal.
This could help to improve information portals in two ways. It assists in the process of
automatically picking potentially relevant social media resources out of a multitude of
distributed sources. Furthermore, the automatic extraction and annotation of semantic
metadata can be used to estimate the resources' usefulness for the users. This
information can then be used to provide content recommendations and to dynamically
4 http://www.amazon.com
5 http://sevenload.com
adapt the portal in order to display the optimal content for each individual.
1.3 Objectives
The goal of this thesis is to investigate the applicability of semantic data that is auto-
matically extracted from heterogeneous social media resources for various tasks in
the environment of a domain-specific, adaptive information portal. The first task is the
binary decision whether a particular resource should be regarded as relevant for the field
of a certain domain and hence be further processed. The second task is the categorization
of the relevant resources into several main domain categories, which can, among other
things, be used to organize the contents for intuitive browsing across several subpages.
The third task is the assignment of resources to a domain set of finer-grained topical
terms that can be used to outline their subject.
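A crude sketch can make the interplay of the three tasks concrete. The following Python fragment is purely illustrative: the function name, the naive substring lookup and the term-to-category map are assumptions for this example, not the method actually used in the thesis (which relies on dedicated NLP tooling such as GATE and KEA++).

```python
def enrich(document_text, thesaurus, categories, relevance_threshold=2):
    """Illustrative sketch of the three tasks; not the real pipeline."""
    text = document_text.lower()
    # Task 3: assign finer-grained topical terms (naive substring lookup)
    topics = [term for term in thesaurus if term.lower() in text]
    # Task 1: binary relevance decision for the complexity domain
    relevant = len(topics) >= relevance_threshold
    # Task 2: map the matched terms onto the main domain categories
    main_categories = sorted({categories[t] for t in topics if t in categories})
    return relevant, main_categories, topics

thesaurus = ["chaos theory", "emergence", "self-organization"]
categories = {"chaos theory": "Dynamics", "emergence": "Systems"}
print(enrich("A post on chaos theory and emergence.", thesaurus, categories))
# -> (True, ['Dynamics', 'Systems'], ['chaos theory', 'emergence'])
```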
The result of the third task can also be used to match user interest information against the
available set of resources in order to identify suitable content recommendations. To do
so, however, the user interest has to be recorded in a way that is comparable or equal to
the same topical term set. Assuming that this is the case, the second goal of this thesis is
to explore the possibilities of efficiently picking out resources that match certain, potentially
complex conditions concerning their previously annotated semantic attributes.
The two described goals cannot be successfully accomplished without an underlying
set of domain-relevant terms. It is therefore necessarily a third goal of this thesis to provide
a sufficient domain model, in order to perform a proof of concept and enable a proper
evaluation of the previous goals.
1.4 Scope Of Thesis
This section clarifies the scope of this thesis and thereby states what should be achieved
and what not. Subsection 1.4.1 sets the thesis' work into the context of the CompleXys
project, of which it is a component. The succeeding three subsections describe the scopes
of the implementation units that emerge from the particular goals identified in Section
1.3. Subsection 1.4.2 provides the scope of the Semantic Content Enrichment module,
Subsection 1.4.3 that of the CompleXys Domain Model, and Subsection 1.4.4 is dedicated
to the scope of the Semantic Filter module.
1.4.1 Complex Systems Portal
This work is part of the CompleXys project, which intends to provide a domain-specific,
adaptive portal for complexity. This portal should be able to provide complexity-related
social media resources chosen by context. To achieve this, it needs to collect the
resources from the internet, enrich them with semantic data, match them against a domain
model for classification, filter them based on the raised data as well as on the context,
and finally display them to the user. This thesis contributes to the system by providing a
module for the semantic enrichment and classification of the collected social resources,
a complexity domain model as the necessary basis for these tasks, and a module for the
content filtering itself. The scope of the single elements will be detailed further in the
succeeding subsections.
1.4.2 Semantic Content Enrichment
The first module to be provided aims to enrich social resources with semantic content
and to classify them on that basis. To achieve this, ways to analyze the resources and
extract semantic data from them must first be found and applied. This task involves
complex subfields of natural language processing that have received extensive research
over many decades, so it is reasonable to assume that a single part of this thesis is
unlikely to outperform the existing solutions. Thus the focus is set on identifying and
utilizing fitting state-of-the-art tools for the special requirements of this module.
Furthermore, the module needs to be able to use the semantic data for several
classification tasks and to persist the extracted data in a usefully accessible way.
1.4.3 Complexity Domain Model
To provide a sufficient domain model, a set of complexity-specific terms has to be
collected and usefully structured. The model will be used as a basis for the classification
and filter processes, so it has to be extensive and specific enough to successfully identify
many texts from the broad, interdisciplinary area of complexity. However, the creation
of a comprehensive model is a very time-consuming task and beyond the scope of this
thesis. A good prototype will therefore suffice, as long as the access interface to it is
flexible and abstract enough to improve the model subsequently without problems.
Accordingly, a suitable data structure for the representation of the model has to be
found.
1.4.4 Semantic Filter
The second module should provide a method for content filtering, so that social
resources can be displayed according to a set of predefined filter criteria. The filter should
provide a means to express complex queries regarding the semantic attributes of
the resources and to efficiently access the subsets of resources that match these queries.
The intention of this thesis is thereby not to provide an exhaustive set of imaginable
filters, but a flexible, freely extendable system as well as a basic set of useful filters for
example and presentation purposes.
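Such a freely extendable filter system is commonly realized as a composite of small filter objects. The sketch below shows one possible shape in Python; the class and method names are hypothetical stand-ins for the abstract, logic and basic filter concepts treated later in Chapter 7, not the actual implementation.

```python
from abc import ABC, abstractmethod

class AbstractFilter(ABC):
    """A filter decides whether an annotated resource passes."""
    @abstractmethod
    def matches(self, resource: dict) -> bool: ...

class AndFilter(AbstractFilter):
    """Logic filter: passes only if all child filters pass."""
    def __init__(self, *filters):
        self.filters = filters
    def matches(self, resource):
        return all(f.matches(resource) for f in self.filters)

class OrFilter(AbstractFilter):
    """Logic filter: passes if at least one child filter passes."""
    def __init__(self, *filters):
        self.filters = filters
    def matches(self, resource):
        return any(f.matches(resource) for f in self.filters)

class TopicFilter(AbstractFilter):
    """Basic filter: passes resources annotated with a given topical term."""
    def __init__(self, term):
        self.term = term
    def matches(self, resource):
        return self.term in resource.get("topics", [])

# A complex, semantically based query built from simple parts:
query = AndFilter(TopicFilter("chaos theory"),
                  OrFilter(TopicFilter("networks"), TopicFilter("emergence")))
docs = [{"title": "A", "topics": ["chaos theory", "networks"]},
        {"title": "B", "topics": ["cooking"]}]
print([d["title"] for d in docs if query.matches(d)])  # -> ['A']
```

New filters are added by subclassing the abstract base class, which keeps the system open for extension without touching the existing query logic.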
1.5 Thesis Outline
This introductory chapter gave an insight into the general topic as well as the particular
research problem. It clarified the motivation, the objectives and the thesis scope.
Chapter 2 presents the CompleXys project and provides information about general
considerations, needs and design decisions. Section 2.1 identifies and formulates
the requirements of the system. Section 2.2 introduces the CompleXys architecture and
its several working steps.
Chapter 3 provides the background knowledge for the remainder of this work. To this
end, Section 3.1 treats options for notating semantic data. Section 3.2 gives an overview
of natural language processing as a fertile research field for semantic data extraction.
Chapter 4 presents tools and standards that proved useful within the practical part of
this work. Section 4.1 introduces SIOC as a standard for the metadata of social media
resources. Section 4.2 describes GATE, which is basically an architecture and framework
for language processing systems. Section 4.3 deals with the taxonomy standard SKOS.
Section 4.4 covers the elaborate keyphrase extraction package KEA. Finally, Section 4.5
gives an insight into the OpenCalais toolkit and web service, which is capable of
enriching content with semantic data upon request.
Chapter 5 provides an overview of previous research in the field of information filtering.
Various approaches to the task are presented and set in relation to this thesis.
Chapter 6 discusses the Semantic Content Annotator module. Section 6.1 is devoted
to the CompleXys domain model and Section 6.2 explains the concept of the Semantic
Content Annotator pipeline and gives a detailed description of its implementations.
Chapter 7 provides an insight into the semantic filter module. Section 7.1 treats the
AbstractFilter concept and its implementations. Section 7.2 presents the output variants
for the filtered data that have been implemented for presentation purposes.
Chapter 8 evaluates the two introduced modules. Section 8.1 examines the classification
quality of the Semantic Content Annotator module. Section 8.2 tests the runtime
performance of the Semantic Filter module. The evaluation results are discussed in
Section 8.3.
Finally, Chapter 9 provides a summary of the thesis and considers possible future work
based on the obtained results.
CHAPTER 2
CompleXys
The CompleXys project is the environment in which this thesis' work is embedded.
This chapter introduces the system. To this end, a requirement analysis is performed in
Section 2.1 and a general architectural overview is given in Section 2.2.
2.1 Requirement Analysis
This section is dedicated to the requirements of CompleXys. To obtain them, it is
helpful to first identify the relevant actors and their respective use cases and only then
deduce the actual requirements. Accordingly, Subsection 2.1.1 introduces the identified
actors. Subsection 2.1.2 is dedicated to the use cases linking the actors to the system.
Subsection 2.1.3 specifies the performance requirements of the system and Subsection
2.1.4 the design constraints. This section is loosely oriented on standard requirement
analyses but, due to the much smaller scope of this chapter, is considerably shortened
and abstracted in comparison to a full-sized software requirements analysis.
2.1.1 Actors
Actors are parties outside the system that interact with it [10]. These parties can be
users or other systems, and they can be divided into consuming entities, which use the
functionality of the system, and assisting entities, which help the system to achieve
its purpose. Five different actors could be identified for CompleXys:
• Information Consumers
• Information Providers
• Assisting Systems
• Administrators
• Developers
Information Consumers are the central clients of the system. They are the ones getting
value out of the system, whose main purpose is indeed to be a mediating software
layer between large resource sets and these same information consumers. They are
characterized by an interest in complexity-related topics. Furthermore, every
information consumer is supposed to have personal preferences and special fields of
interest within this domain, so it is sensible to treat each of them as an individual whose
interests may change over time. Information consumers are expected to be average world
wide web users with at least basic web browsing skills. The usage frequency can
vary from one-time use to many times a day, depending on the individual information
needs, time and access possibilities. The number of information consumers could in
the short term vary from very few to hundreds and may become notably higher in the
long term.
Information Providers are the second most important actor class, because they provide
the resources that will be displayed to the information consumers. Potentially everyone
on the internet could be an information provider, as long as they publish contents
and allow agents to crawl their site. They do not necessarily care, or even know, that
their resources are processed within CompleXys. Thus there is no implicit control over
topic, quality, publishing frequency, size, form, language, media type or subsequent
modification of resources. Likely examples of information providers for CompleXys
are scientific bloggers or researchers who publish their papers freely on the internet.
Assisting Systems are all those external systems that are utilized within CompleXys. They may serve various purposes that are beyond the scope of CompleXys. Up to now, this is only the OpenCalais web service, because the majority of the reused software is applied internally, but this might change in the future. However, depending on external systems always carries the risk of externally caused outages or errors, so new systems must be integrated with care.
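One way to contain the risk of externally caused outages is to wrap every call to an assisting service in a guard that retries a few times and finally falls back to a neutral result, so that a failing service does not break the whole pipeline. The following is only an illustrative sketch under that assumption; the names `ServiceGuard` and `callWithFallback` are not part of CompleXys:

```java
import java.util.concurrent.Callable;

// Guards a call to an external assisting service: retry a few times,
// then fall back to a neutral default instead of failing the pipeline.
public class ServiceGuard {

    public static <T> T callWithFallback(Callable<T> service, T fallback, int retries) {
        for (int attempt = 0; attempt <= retries; attempt++) {
            try {
                return service.call();   // "Send Request To Service" ... "Receive Response"
            } catch (Exception e) {
                // externally caused outage or error: try again
            }
        }
        return fallback;                 // keep the system running without the service
    }

    public static void main(String[] args) {
        String result = callWithFallback(
                () -> { throw new RuntimeException("service down"); },
                "no annotations", 2);
        System.out.println(result);      // prints "no annotations" after 3 failed attempts
    }
}
```

A fallback of this kind is only acceptable where the service result is an enrichment rather than a precondition; a missing OpenCalais annotation, for instance, degrades quality but not availability.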
Administrators are the entities that assist and support the running system. They are responsible for general system maintenance. Furthermore, they can manually add information sources and resources to the resource set. The expertise level of the administrators is naturally very high within their respective working domain, but for reasons of specialization it cannot be assumed that every administrator is able to maintain every part of the system. They should be promptly available whenever problems with the system occur. Their number depends on the size of the system and the specialization level of the administrators. Dedicated data administrators may manage the resources that should be harvested, while further administrators may be entrusted with database, system, and network management.
Developers are the only actor class that is not occupied with the running system but with the code itself. They are responsible for evolving and extending CompleXys beyond its initial version. Possible goals for these actors are the elimination of flaws, new functions, or performance enhancements. They need to be skilled programmers.
2.1.2 Use Cases
Use cases are descriptions of how an actor uses a system to achieve a goal and what the system does for the actor to achieve that goal. A use case tells the story of how the system and its actors collaborate to deliver something of value for at least one of the actors [6]. Use cases are strongly related to the functional requirements of a software requirements analysis and are quite convenient for identifying the external interfaces by regarding their relationship to the actors. Due to the coarse-grained perspective of this thesis on the requirements, and the fact that functional requirements tend to be much more detailed than the corresponding use cases, this subsection will act as an abstract surrogate for the functional requirements and external interface requirements subsections.
The following eleven use cases could be identified:
• get information recommendations
• search
• modify user interest manually
• get digest
• gather resources
• use assisting service
• manage source or resource list manually
• maintain system
• add feature
• identify users
• record user interest
Figure 2.1 visualizes how the individual use cases relate to the previously identified actors and their roles towards the system.
Figure 2.1: The relationships between actors and use cases
"Get Information Recommendations" is one of the most essential features of the sys-
tem and a typical use case of the information consumer. More precisely the use case
involves the selection and dynamic display of resources depending on their estimated
level of interest to a particular information consumer. In order to perform successfully,
the use case assumes, that several conditions are fulfilled. First it is dependent on the
use case "Identify Users", because a user had to be recognized, before a system can make
useful personal assumptions about him. Furthermore it is dependent of "Record User
Interest", because the system needs a possibility to store user interest representations,
for afterwards matching them to the available resources. And finally it is dependent
on the use case "Provide Information", because the displayed resources must obviously
be obtained in the first place. Additionally the resources have to actually match the
user interest. Avoiding front-end error messages, the system had to behave sensible,
whenever these assumptions are not given. There should be a possibility to display
resources in an unpersonalized way, if a user can not be identified, if no user inter-
est data has been raised yet or if there are simply no matching resources available for
the particular user. The use case involves the sequential steps "Load Stored User In-
terest", "Load Resources", "Match Resources to User Interest" and "Display Matching
Resources". Important demands are an acceptable response time, high recall and high
precision. These do reappear in the Subsection 2.1.3 and are discussed closer at this
point of the text. Furthermore the use case must be intuitively accessible for users with
the assumed average internet expertise level of the information consumer actors. This
involves the need of dynamically reflecting the probability of a resource to be inter-
![Page 23: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed8b83d6714ca7f476871fe/html5/thumbnails/23.jpg)
2.1 Requirement Analysis 13
esting in display attributes like size, position and highlighting. To efficiently achieve
this the implicit relevance rating done in the step "Match Resources To User Interest"
should be expressed in relative probability values instead of binary decisions or an in-
teger sorted order. Because of its core importance to the system and its meaning for
information consumers as the central actor class "Get Information Recommendations"
can be rated with highest priority1.
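The demand for relative probability values instead of binary decisions can be illustrated by a small scoring sketch that maps a document's topic annotations and a user's interest weights to a value in [0, 1]. All names, topics, and weights here are illustrative assumptions, not the actual CompleXys matching algorithm:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Matches a resource's topic annotations against a user interest model and
// returns a relative relevance value in [0,1] instead of a binary keep/drop
// decision, usable for display attributes like size, position, highlighting.
public class RelevanceScorer {

    // interest: topic -> weight in [0,1]; annotations: topic -> confidence in [0,1]
    public static double score(Map<String, Double> interest, Map<String, Double> annotations) {
        double sum = 0.0, norm = 0.0;
        for (Map.Entry<String, Double> a : annotations.entrySet()) {
            sum  += a.getValue() * interest.getOrDefault(a.getKey(), 0.0);
            norm += a.getValue();
        }
        return norm == 0.0 ? 0.0 : sum / norm;
    }

    public static void main(String[] args) {
        Map<String, Double> interest = new LinkedHashMap<>();
        interest.put("ComplexNetworks", 0.9);
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("ComplexNetworks", 1.0);
        doc.put("Economics", 1.0);
        System.out.println(score(interest, doc));   // prints 0.45: partially relevant
    }
}
```

A graded score like this lets the front end scale prominence continuously, whereas a binary filter or an integer sort order would lose the distance between "barely relevant" and "highly relevant".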
"Search" is another important use case, that is related to information consumer. While
"Get Information Recommendations" is characterized by a passive information con-
sumption of the user, "Search" is the active querying of needed data. It is basically in-
dependent of other use cases than "Gather Resources", but may be used as information
source by the "Record User Interest" use case, when the user is additionally identified by
the "Identify User" use case. It involves the sequential steps "User Send Search Query",
"Search For Matching Resources" and "Display Search Results". Important demands are
good response times, that do not seriously interrupt the users’ browsing flow and an
intuitively understandable and controllable user interface. The use case is rated with
high priority, because albeit it is not actually a core feature of CompleXys, the internet
user is highly accustomed to this function and is likely to insist on needing it.
"Manually Modify User Interest" is another use case related to the information con-
sumer. Its goal is to visualize the recorded interest model to the described user and
let him alter it as he wishes. This assists to improve system transparency and possibly
also the system’s value to the user, because it is capable to establish a very up-to-date
and correct user interest model. This helps to smooth away three common flaws of in-
formation filtering systems. Firstly it helps to rapidly adapt the system to new interest
emphases of the user, secondly it helps to remove expired interests instantly and thirdly
it provides the possibility to correct erroneously added interest entries. Traditional sys-
tems may require quite a long time to autonomically adapt to the cases one and two,
because they usually require a certain amount of related behavior data. The third case
is worse, because the system may repeatedly draw the wrong conclusion and the un-
wanted topic may not even lose importance over time, when autonomic adaption is the
only option to change user models. This use case is dependent on the "Identify Users"
use case, because the actual user must be recognized by the system in order to find and
visualize his user model for him and to persistently store changes for future usage. Fur-
thermore, the use case is dependent on the assumption, that users benefit from a more
accurate interest profile. This is true as long as use cases like "Get Information Recom-
mendation" apply the profiles to produce value for the user. The use case involves the
¹ Priority levels: Highest = 5/5, High = 4/5, Middle = 3/5, Low = 2/5, Lowest = 1/5
sequential steps "Load User Interest Data", "Display User Interest Data" and optionally
"User Alters User Interest Data" as well as "Store Altered User Interest Data". Important
demands are good response times, that do not seriously interrupt the users’ browsing
flow and an intuitively understandable and controllable user interface, that orientates
on the assumed average internet expertise level of information consumers. The use
cases’ priority is middle, because it provides a useful, but not an essential feature for
the system.
The use case "Get Digest" is related to the information consumer too. It provides the
users with the possibility to subscribe to a digest, that regularly delivers email sum-
maries of recent resources, that may be of personal interest to them. Beside the obvi-
ous "Gather Resources" the use case is also dependent on the "Identify User" and the
"Record User Interest" use cases, because in order to make assumptions about the rel-
evancy of resources to him the user needs to be assigned to his interest records. The
use case involves the sequential steps "User Subscribes To Digest" and regularly "Load
Stored User Interest", "Load Recent Resources", "Match Recent Resources to User In-
terest", "Summarize Matching Resources" and "Send Summary To User". Furthermore
there had to be an optional step "User Unsubscribes" to end the digest subscription.
The use cases’ priority is low, because it is a nice feature, but it is not essential.
"Gather Resources" is a use case, that is related to the information provider actors. Its
purpose is to collect resources, that may be of interest to the information consumers and
can be displayed later on. It assumes, that it has access to a set of source addresses and
that it is able to access the corresponding sources. Furthermore it assumes, that these
sources frequently provide new resources, that are potentially relevant to the domain
of complexity and interesting for the users of CompleXys. It involves the sequential
steps "Get Source Address", "Load Source", "Crawl Source For New Resources" and in
case of successfully finding one or more new resources "Store New Resources". The
size and number of processed and stored resources may range from few and little ones
to millions with possible sizes up to whole books or long-term discussion archives.
Complexity related scientific resources are usually not very time-critical, so a gathering
frequency of one to three times a day should be sufficient. Furthermore the system
must give attention to the needs of the information providers. This involves respect
of privacy and thereby to the crawler access policy described in the sources’ robots.txt
file [51]. It also includes respecting copyrights and do only crawl and display resources
from those sources, which allow to do so. This use cases’ priority is highest, because an
information filtering system without information would be worthless.
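The robots.txt check mentioned above could, in a strongly simplified form, look as follows. The sketch only honors the Disallow lines of the wildcard group and ignores agent-specific groups, Allow lines, and crawl delays, so a real crawler would need a complete parser:

```java
import java.util.ArrayList;
import java.util.List;

// Very simplified robots.txt handling: collect the Disallow prefixes of the
// "User-agent: *" group and test paths against them before crawling.
public class RobotsPolicy {
    private final List<String> disallowed = new ArrayList<>();

    public RobotsPolicy(String robotsTxt) {
        boolean inWildcardGroup = false;
        for (String line : robotsTxt.split("\\R")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                inWildcardGroup = line.substring(11).trim().equals("*");
            } else if (inWildcardGroup && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path);
            }
        }
    }

    public boolean mayCrawl(String path) {
        for (String prefix : disallowed)
            if (path.startsWith(prefix)) return false;
        return true;
    }

    public static void main(String[] args) {
        RobotsPolicy p = new RobotsPolicy("User-agent: *\nDisallow: /private/");
        System.out.println(p.mayCrawl("/blog/entry1"));   // prints true
        System.out.println(p.mayCrawl("/private/notes")); // prints false
    }
}
```

The "Load Source" step would consult such a policy object before every fetch, skipping disallowed paths entirely rather than fetching and discarding them.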
"Use Assisting Service" is a use case, that is related to the actor class of assisting systems.
Its purpose is to outsource tasks that are beyond the scope of CompleXys to external services that are capable of mastering them. It assumes that the external service works correctly, reliably, and sufficiently fast. Furthermore, it assumes that the connection to the external service is stable and also fast enough. A useful service utilization depends on the system's capability to provide the data the respective service needs and to understand the service's response. A successful use case is composed of the sequential steps "Send Request To Service", "Service Receives Call", "Service Processes Request", "Service Sends Response", and "Receive Response". The priority depends on the importance of the respective problem that is solved by the service and can therefore not be specified in general.
"Manually Manage Source Or Resource List" is a use case, which is managed by the
actor class of the administrators. The sources of the initial source list and the resources,
which are autonomically found on them, provide by nature a very limited set of in-
formation, that may moreover become outdated very quickly inside its fast moving
internet environment. This problem can be treated by instructing one or more adminis-
trators to manually add, modify and remove sources or solitary resources and thereby
keep the resource list up-to-date. In order to achieve a successful use case termination,
the administrator needs to have access rights and a possibility to modify the source
and resource list. Additionally administrators are assumed to be capable of indepen-
dently finding new relevant sources and resources, identifying outdated or erroneously
added resources and reacting in a suitable way to these cases. The use case involves the
sequential steps "User Modifies Source Or Resource List", "Store Modified Source Or
Resource List" and optionally for modifying and deleting tasks "Display Source or Re-
source List". These steps can all either refer to the source list or the resource list within
a single step sequence and are just summarized for clarity reasons. The use case is gen-
erally not time-critic, but involves human-machine interaction and should therefore
provide acceptable response times. Not every data administrator can be assumed to be
a computer expert, so the interface should also be intuitively accessible.The priority is
rated high, because inaccurate and outdated resources could strongly disturb the user
experience, but then again it is not a core feature of the system
"Maintain System" is a very general use case, that involves all works, that help to keep
the running system in a useful state or to bring it back to one, which involves cor-
rect functionality, accessibility and sufficient performance. The use case is mainly per-
formed by administrators, handling the network, software and hardware environment,
in which CompleXys is embedded, but it also involves developers, who can optimize
the code for better performance or eliminate bugs. Of course the administrators and
![Page 26: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed8b83d6714ca7f476871fe/html5/thumbnails/26.jpg)
16 CompleXys
developers need high expertise in and access to their respective responsibility layer
in order to accomplish their maintenance duties. To grant the needed access rights,
involved actors need to be identified first, so this use case is dependent of "Identify
Users". Due to the versatile characteristics of the various maintenance tasks involved
actors are often quite heterogeneously skilled. Thus the system needs to be partitioned
carefully in order to clearly divide between the different layers and to allow a main-
tainer to do his job without detailed knowledge about the job of another maintainer.
The priority is the highest, because an unmaintained system is not likely to be usable
at all.
The initial system is probably not the final version of CompleXys. Software systems are usually subject to evolutionary progress and should therefore be prepared to evolve. Accordingly, the "Add Feature" use case represents the developer task of adding a new feature to CompleXys. This assumes that the developer possesses high programming expertise, which is clearly reasonable. Furthermore, it demands an easily expandable system structure, characterized by low coupling, high cohesion, and standards compliance. The use case's priority is rated middle, because it is not essential for the running system, but steadily helps to improve it.
The use case "Identify User" aims to recognize a certain user and relate him to his role,
stored profile data and access rights. It requires the user or a third party to make an in-
put, that identifies him. The identification procedure should happen very rapidly and
require no expertise above average internet skills. If a user is needed to enter identifi-
cation data it should be in a form, that he can remember it by heart, so he do not need
to write it down. The use case is likely to divide in an initial step "User Registrates"
and repeating "User logs in" steps at the beginning of each new session. A "User logs
out" step at the end of each session is optionally and can alternatively be automatically
accomplished by session expiration. On the other hand the identification procedure
should be secure. No other entity than the specific user should be able to successfully
claim, that he is the user. The two highest priority use cases "Get Information Recom-
mendations" and "Manually Modify User Interest" are dependent of this use case so it
is also rated as highest priority.
The "Record User Interest" use case is related to the information consumer. Its purpose
is to record the interests of the actual user, in order to use the gathered information
to improve the quality of the recommended resources. It assumes, that the user ex-
plicitly or implicitly provides information about his interests to the system and that
these are digitally record- and classifiable. The use case involves the sequential steps
"User Provides User Interest Clues", "Classify User Interest" and "Recalculate And Store
![Page 27: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed8b83d6714ca7f476871fe/html5/thumbnails/27.jpg)
2.1 Requirement Analysis 17
User Model". The former step may alternatively be "Third Party Provides User Interest
Clues", when the information is gathered from external sources like social networks or
other information filtering systems. However, user data from external sources should
for privacy reasons never be collected without the permission of the respective user.
Furthermore, "Recalculate And Store User Interest" involves a recalculation, because
new interests may replace old ones and all relative values in the system may change.
The process of implicit user interest recording should usually run in background and
therefore not slow the system down in a noticeable way. The highest priority use case
"Get Information Recommendations" is dependent of this use case so it is also rated as
highest priority.
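The recalculation step can be sketched as a user model that renormalizes all interest weights after every new clue, so that the relative values stay mutually consistent. Class and method names are illustrative, not the actual CompleXys user model:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the "Recalculate And Store User Model" step: adding a new
// interest clue changes all relative weights, so the model is renormalized
// to sum to 1 after every update.
public class UserModel {
    private final Map<String, Double> weights = new LinkedHashMap<>();

    public void addClue(String topic, double strength) {
        weights.merge(topic, strength, Double::sum);
        double total = weights.values().stream().mapToDouble(Double::doubleValue).sum();
        weights.replaceAll((t, w) -> w / total);    // recalculation: all values shift
    }

    public double weight(String topic) {
        return weights.getOrDefault(topic, 0.0);
    }

    public static void main(String[] args) {
        UserModel m = new UserModel();
        m.addClue("ChaosTheory", 1.0);
        m.addClue("Networks", 1.0);
        System.out.println(m.weight("ChaosTheory"));  // prints 0.5 after the second clue
    }
}
```

Renormalizing after each clue is what makes weights relative: a new strong interest automatically dilutes older ones without requiring explicit decay logic.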
2.1.3 Performance Requirements
According to the IEEE Recommended Practice for Software Requirements Specifica-
tions [30] performance requirements should specify both the static and the dynamic
numerical requirements imposed on the software. Static numerical requirements may
include the number of simultaneous users to be supported or the amount and type
of information to be handled. Dynamic numerical requirements may include, for ex-
ample, the numbers of transactions and tasks and the amount of data to be processed
within certain time periods for both normal and peak workload conditions.
CompleXys is a web application and thereby potentially accessible to an extremely large audience of internet users. Admittedly, the restricted interest domain of complexity scales the number of likely users down again, but due to the adaptability of the system to other domains, it is nevertheless reasonable to design CompleXys so that it scales to thousands of users, which strongly suggests at most linear performance and memory complexity in every user-relevant system aspect.
Likewise, given the World Wide Web environment and its steady flood of information, the system is expected to face an enormous number of documents over time. The size of single documents is quite hard to estimate and may range from the restricted number of symbols in a microblogging post up to whole books and long-term discussion archives. This suggests linear performance and memory complexity in every resource-relevant system aspect as well. In particular, effective ways of storing, accessing, and filtering large amounts of differently sized documents have to be implemented.
CompleXys is an information filtering system that heavily relies on user interaction. Active information consumption is simultaneously the main value provided to its users and a way of further improving the system performance by implicitly shaping correct user interest models. However, this assumes that the user actually wishes to use the system in everyday life. To attract him to do so, the system has to provide a comfortable and undisturbed usage experience. One relevant performance topic in this respect is the response time, regarded as the time passing between a given input to a system or system component and the corresponding response. This is essential, because exceeding particular tolerance values in waiting time is likely to provoke user discontent and a loss of attention to the system. Andrew B. King states in "Website Optimization" [35] that this long-known effect is even enhanced by the familiarization of broadband users with rapid access. Acceptable response times for entry-level DSL users are estimated to lie between two and three seconds. These shall be achieved in ninety-five percent of all transactions at normal workload and in sixty percent of all transactions at peak workload. Furthermore, a transaction should preferably never take longer than twenty seconds.
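The stated targets can be verified against measured transaction times with a simple percentile check, sketched here using the nearest-rank method; the sample values are invented:

```java
import java.util.Arrays;

// Checks the stated targets against a sample of measured response times:
// 95% of transactions within 3 s at normal workload, none above 20 s.
public class ResponseTimeCheck {

    public static double percentile(double[] seconds, double p) {
        double[] sorted = seconds.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p * sorted.length) - 1;   // nearest-rank method
        return sorted[Math.max(idx, 0)];
    }

    public static boolean meetsNormalLoadTarget(double[] seconds) {
        return percentile(seconds, 0.95) <= 3.0
            && percentile(seconds, 1.0) <= 20.0;
    }

    public static void main(String[] args) {
        double[] sample = {0.8, 1.2, 2.9, 1.5, 0.6, 2.2, 1.1, 0.9, 2.8, 1.7};
        System.out.println(meetsNormalLoadTarget(sample));   // prints true
    }
}
```

The peak-workload target (sixty percent within three seconds) would be checked the same way with `percentile(seconds, 0.60)`.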
The performance of an information filtering system obviously also depends on the provided filtering quality. This quality is measured by the two metrics recall and precision. Precision is thereby a measure of the usefulness, and recall a measure of the completeness, of the calculated document ranking [40]. More formally, recall is defined as:

    recall = (# relevant hits in hitlist) / (# relevant documents in the collection)

It measures how well the system identifies relevant documents. The recall is optimal when every relevant document is rated above the display threshold. Of course, this can easily be achieved by simply rating every document with a high value, which is why precision is needed as a second metric. Precision is defined as:

    precision = (# relevant hits in hitlist) / (# hits in hitlist)

It measures how well the system filters out non-relevant documents, rating them low in CompleXys' approach. The precision is optimal when every document returned to the user is relevant to the query. Both metrics have a fixed, relative range between 0.0 and 1.0. A useful information filtering system must have a high recall, so that most of the relevant documents are displayed to the user. But to be comfortable, and to stand out in comparison to simple search engines, it must also achieve a high precision value, so that the user does not have to troublesomely search for relevant documents like needles in a haystack. Using the evaluation results of [36] and [7] as reference points, the average recall should be higher than 0.5 and the average precision higher than 0.7.
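Both metrics follow directly from the two set definitions above and can be computed from the set of documents shown to the user (the hitlist) and the set of relevant documents:

```java
import java.util.HashSet;
import java.util.Set;

// Recall and precision exactly as defined above, computed from the hitlist
// and the set of relevant documents in the collection.
public class FilterMetrics {

    public static double recall(Set<String> hitlist, Set<String> relevant) {
        return (double) intersect(hitlist, relevant).size() / relevant.size();
    }

    public static double precision(Set<String> hitlist, Set<String> relevant) {
        return (double) intersect(hitlist, relevant).size() / hitlist.size();
    }

    private static Set<String> intersect(Set<String> a, Set<String> b) {
        Set<String> s = new HashSet<>(a);
        s.retainAll(b);
        return s;
    }

    public static void main(String[] args) {
        Set<String> hits = Set.of("d1", "d2", "d3", "d4");
        Set<String> relevant = Set.of("d1", "d2", "d5");
        System.out.println(recall(hits, relevant));    // 0.666...: 2 of 3 relevant found
        System.out.println(precision(hits, relevant)); // prints 0.5: 2 of 4 hits relevant
    }
}
```

The example also shows the tension described above: returning more documents can only raise recall, but tends to lower precision.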
2.1.4 Design Constraints
This subsection specifies design constraints that can be imposed by standards, hardware limitations, and other factors [30].
According to the needs of the developers that were identified in Subsection 2.1.2, the system has to be expandable. This includes a properly modularized system structure, characterized by low coupling and high cohesion of the individual components. Next to this horizontal division into functionality modules, the system needs to be clearly partitioned into vertical abstraction layers that are connected by well-designed interfaces and allow the replacement of one layer without affecting another. Furthermore, it should allow administrators to maintain their layer without possessing any knowledge of adjacent layers apart from what the interface looks like.
CompleXys utilizes semantic data to provide a semantic topic classification and is thereby a strong candidate to become a native part of the emerging Semantic Web itself. To achieve this, and to further benefit from improving linked data clouds, semantic data reasoners, and the like, CompleXys needs to apply the common Semantic Web standard languages and protocols whenever it is sensible.
2.2 Overview
CompleXys is a word construct derived from "complex sys(tem)", referring to its purpose of providing a user-context-sensitive, adaptive portal for the domain of complexity. Fedor Bakalov and Adrian Knoth identified five basic modules and two sorts of resources as relevant for the CompleXys project. Thereupon, they developed a system architecture that reflects the requirements of the previous section. It is visualized in Figure 2.2.
The first module is the Harvester. Its purpose is to proactively collect resources, and it is composed of three components. The crawler searches for new, potentially interesting resources and stores their access data in the resource list; some special data, such as that about university resources, is entered manually by the administrator instead of being crawled. The resource list is a data storage containing the URLs as well as optional invocation methods and authentication information for accessing the resources
Figure 2.2: The CompleXys overview schema
that should be harvested. The fetcher collects those resources that are specified in the resource list and fetches their content. The outcome of this module is a growing collection of raw resources.
The second module is the Content Type Indexer. It performs an analysis of the content's format and sources to extract metadata like the source type, the content type, or simply the title. The metadata is restructured into RDF, using Dublin Core¹ as a reference for common metadata and SIOC² for online-community-related type data, like the affiliation with a particular blog or forum (see Section 3.1 for more information on SIOC). The outcome of this module are documents with an amount of metadata that was extracted from the source or predefined as markup in the content.
The third module is the Semantic Content Annotator. Its implementation is a main emphasis of this diploma thesis. The Semantic Content Annotator extracts machine-readable semantics from the received content and annotates it. To achieve this, natural language preprocessing, keyphrase extraction, and annotation web services can be utilized. The preprocessing and some basic semantic matching are done using the GATE³ framework and some of its plugins. It is also used for the persistent storage of the processed documents and semantics, employing a prepared Hibernate⁴ implementation (see Section 4.2 for GATE). Keyphrase extraction, as a possibility for extracting topic semantics from the text, is implemented by means of KEA⁵ with a modified version of
¹ http://dublincore.org/
² http://rdfs.org/sioc/spec/
³ http://gate.ac.uk/
⁴ https://www.hibernate.org/
⁵ http://www.nzdl.org/Kea/index.html
the encapsulating GATE plugin (see Section 4.4 for KEA). Finally, the OpenCalais web service⁶ is called by another encapsulating GATE plugin, returning semantic entities and facts (see Section 4.5 for OpenCalais). The extracted semantic data is mapped to the domain ontology concepts, thereby providing the probability for the text to be a member of the particular category. The structure and implementation of the Semantic Content Annotator module are further described in Chapter 6.
The fourth module is the Semantic Filter. It is also the second module that will be realized within the scope of this thesis. Its purpose is to apply the collected semantics for information filtering. It is meant to provide a dynamically configurable interface for access to those documents that fit certain filter conditions. An abstract Java filter bundled with a FilterIterator data structure is utilized to achieve this. The filter is implemented as a set of logic filters and proposition filters, forming a filter system that is similar to propositional logic. The implementation of the Semantic Filter will be detailed in Chapter 7.
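The propositional composition of proposition filters and logic filters can be sketched as follows; the interface and class names here are illustrative, not the actual design, which is detailed in Chapter 7:

```java
import java.util.List;
import java.util.Map;

// Sketch of a propositional filter system: proposition filters test a single
// semantic annotation, logic filters combine other filters.
interface Filter {
    boolean accepts(Map<String, Double> annotations);   // document as topic -> confidence
}

class TopicFilter implements Filter {                   // proposition: topic present above threshold
    private final String topic;
    private final double threshold;
    TopicFilter(String topic, double threshold) { this.topic = topic; this.threshold = threshold; }
    public boolean accepts(Map<String, Double> a) {
        return a.getOrDefault(topic, 0.0) >= threshold;
    }
}

class AndFilter implements Filter {                     // logic filter: conjunction
    private final List<Filter> parts;
    AndFilter(List<Filter> parts) { this.parts = parts; }
    public boolean accepts(Map<String, Double> a) {
        return parts.stream().allMatch(f -> f.accepts(a));
    }
}

class NotFilter implements Filter {                     // logic filter: negation
    private final Filter inner;
    NotFilter(Filter inner) { this.inner = inner; }
    public boolean accepts(Map<String, Double> a) { return !inner.accepts(a); }
}

public class FilterDemo {
    public static void main(String[] args) {
        Filter f = new AndFilter(List.of(
                new TopicFilter("Complexity", 0.5),
                new NotFilter(new TopicFilter("Spam", 0.5))));
        System.out.println(f.accepts(Map.of("Complexity", 0.8)));   // prints true
    }
}
```

Because every logic filter is itself a Filter, arbitrary propositional formulas over the semantic annotations can be composed and evaluated per document.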
The fifth and last module is the CompleXys Portal. It utilizes the collected and semantically annotated resources, as well as a user model, to deliver to the user those resources he most probably is interested in. The resources are accessed by users through a number of navigation topologies. A general topology will be available to every user, and additionally there will be personalized recommendations tailored to the interests of the individual user. The user interests are defined in a user model, which is updated automatically based on the user's browsing history. Furthermore, the user himself will be able to update his model manually.
⁶ http://www.opencalais.com/
![Page 32: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed8b83d6714ca7f476871fe/html5/thumbnails/32.jpg)
22 CompleXys
CHAPTER 3
Essentials
This chapter provides the background knowledge for the remainder of this thesis. A main task of this work is the semantic enrichment of text within the Semantic Content Annotator module. Therefore, Section 3.1 introduces existing options for notating semantic data. But semantic enrichment obviously cannot be done without collecting the semantic data in the first place. For that reason, Section 3.2 gives an overview of natural language processing as a fertile research field for semantic data extraction.
3.1 Notation of Semantic Data
Semantic data can be represented and stored in various ways, depending on the quality and quantity of the data as well as on the kind of intended reuse. The three following subsections introduce important notation possibilities: Subsection 3.1.1 deals with ontologies, Subsection 3.1.2 with annotations and Subsection 3.1.3 with microformats.
3.1.1 Ontologies
An ontology is a formal, explicit specification of a shared conceptualization [21]. It provides the syntax and semantics needed to describe relevant aspects of a domain in a way that others, and especially machines, can understand. This is achieved by determining concepts and the relations between them. A tiny example ontology might be a concept dogOwner, a concept dog and a relation owns that can connect both. Special properties may add more information to a concept; for example, dog may need to have a dogTag property. Furthermore, axioms are defined to assign semantic information to those concepts and relations. Axioms are sets of logical terms that can be used to describe facts like: every dogOwner needs to have at least one owns relation towards a dog.
Frequently used ontology modeling languages today are the Web Ontology Language OWL1, the Web Service Modeling Language WSML2 and the Simple Knowledge Organization System SKOS3. The latter will be described in detail in Section 4.3, because it fits the requirements of this thesis best.
3.1.2 Semantic Annotations
The knowledge structure represented in ontologies is an important step towards a working semantic web, but up to this point the data is still abstract and not yet connected to the actual world wide web. Therefore, today's websites need additional metadata that describes their semantic meaning in a machine-understandable way. The process of adding this metadata to a document is called semantic annotation.
There are basically three ways to link semantic data to a document [52]: embedded, internally referenced and externally referenced annotations. Embedded annotations are written directly into the HTML document. This can be done either by using an object or script element or by writing them into an HTML comment. Either way, they are not displayed by common browsers, but can be parsed and used by any semantically based application. The advantage of this option is that the semantics are always present and do not need to be fetched in a second loading step. The disadvantage is that large amounts of semantic data may result in confusing source code, and annotations in elements like script may violate the code's validation rules.
Internally referenced annotations reference an external annotation storage from within their code. This can be done in a link element with the rel attribute set to 'meta' and the type attribute set, for instance, to 'application/rdf+xml' in the case of RDF-based metadata notation. References starting from object elements or anchors are also possible.
As a third option, the external metadata document can reference the annotated one. To address specific parts of the website, XPointer or simple offset values may be used. While the other two options require direct write access to the source document, this one can be done by external parties and can thereby be applied to a wide range of scenarios like personalized annotations or social meta-tagging systems.
Besides the question of how annotations are linked to a document, it is also interesting who actually creates them. Manual annotation is of course a valuable option. But it is probably
1 http://www.w3.org/TR/owl-features/
2 http://www.wsmo.org/wsml/
3 http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/
not sufficient: even if incentives combined with crowdsourcing principles and useful annotation tools like SMORE [50], CREAM [22] and Annotea [33] may accomplish a lot, the sheer mass of documents on the web is still likely to exceed what can be achieved this way. Fortunately, the field of natural language processing provides promising approaches for automatic annotation. These will be discussed in Section 3.2.
3.1.3 Microformats
The web itself was originally conceptualized for managing semantic information, as we already stated in Section 1.1. The microformats idea is about utilizing this fact and producing machine-understandable semantics just by providing special-purpose notation standards based on POSH, a recently coined abbreviation for 'Plain Old Semantic HTML'. Beside predefined semantic elements like address for contact information or blockquote for quotes, the class attribute is applicable to every element and can be used to assign further semantic descriptors. But to be useful for machine processing, these descriptors need to follow a common convention that can be parsed. Therefore, the microformats community defines modular, open data format standards. The schema in Figure 3.1 reflects the principles of coherence and reusability that are central to the microformats idea. Fine-grained elemental microformats are always reused to build up the more complex, compound microformats like hCalendar4 or hCard5.
Figure 3.1: The basic microformats schema6
4 http://microformats.org/wiki/hcalendar
5 http://microformats.org/wiki/hcard
6 Accessed on January 12, 2010: http://microformats.org/media/2008/micro-diagram.gif
An example for microformats may be the following hCard, which describes identity and contact information of the FSU Jena:
<div class="vcard">
  <a class="fn org url" href="http://www.uni-jena.de/">
    Friedrich-Schiller-University Jena
  </a>
  <div class="adr">
    <span class="type">Work</span>:
    <div class="street-address">Fürstengraben 1</div>
    <span class="locality">Jena</span>,
    <abbr class="region" title="Thuringia">TH</abbr>
    <span class="postal-code">07737</span>
    <div class="country-name">Germany</div>
  </div>
</div>
It is obvious how naturally anyone with HTML skills can adopt this style. It is extremely simple, lightweight and pragmatic. It concentrates on modular, specific topics and is quite human-readable. Furthermore, it is self-contained, because it is based on embedded annotations, and it avoids language redundancy by reusing existing and well-known HTML elements.
On the other hand, microformats do not support URI identification of entities, which leads to problems when trying to interoperate with the Semantic Web concepts around the RDF-based W3C initiative. Additionally, microformats have a flat vocabulary structure without namespaces, which may become problematic when different microformats with identical class attributes are supposed to be combined on a single page. And finally, microformats are controlled by a small, closed community that standardizes existing and common formats. This approach makes it unlikely that dozens of domain-specific formats will ever be provided; therefore "the long tail" [1] of the social web will probably always stay excluded from this kind of semantics.
3.2 Natural Language Processing
Natural language processing (NLP) is an interdisciplinary research field that resides between linguistics and computer science and strongly interrelates with artificial intelligence. It is concerned with the processing of natural language by computers. NLP originally emerged from machine translation research in the middle of the twentieth century [46]. Today's applications involve useful tasks like spell checkers, machine translation, speech recognition and information extraction. In this particular thesis, the subproblem of text classification is a central task of the Semantic Content Annotator module. Therefore, it will be treated separately in Subsection 3.2.7.
There are various basic approaches to handling the problems of natural language processing, which will be discussed in Subsection 3.2.1. The essential subfields of text analysis are mostly derived from the linguistic language description layers, namely lexical, morphological, syntactic, semantic and pragmatic analysis [37]. These are thematized in Subsections 3.2.2 to 3.2.6.
3.2.1 Approaches
The basic approaches to natural language processing can be broadly divided into symbolic, statistical, connectionist and hybrid approaches [37]. The symbolic approach rests on the usage of explicit knowledge representations like logic propositions, rule sets or semantic networks for language analysis. It is based on the assumption that an exhaustive formal representation of words, grammar rules, possible syntactic and semantic word relations and other linguistic data provides a machine with all the information necessary to process text. A given text is thereby analyzed and transformed stepwise, until it is directly represented in the intended machine-understandable format.
The statistical approaches are based, to a widely varying degree, on mathematical statistics and are often strongly related to machine learning techniques. The corresponding methods make use of large sets of already processed, machine-understandable text data. These data sets can, for example, be used to train naive Bayes classifiers, which thereby build up a statistical model. This model can afterwards be used to transform unprocessed texts in the same way that was demonstrated in the training data.
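As a rough illustration of this training principle, the following minimal multinomial naive Bayes classifier builds word counts from labeled training data and scores unseen texts with add-one smoothing. All names, and the simplification to uniform class priors, are assumptions made for this sketch, not part of any cited system.

```java
import java.util.*;

// Minimal multinomial naive Bayes text classifier (illustrative sketch).
class NaiveBayes {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // per-class word counts
    private final Map<String, Integer> totals = new HashMap<>();              // per-class token totals
    private final Set<String> vocab = new HashSet<>();

    void train(String label, String... words) {
        Map<String, Integer> c = counts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : words) {
            c.merge(w, 1, Integer::sum);
            totals.merge(label, 1, Integer::sum);
            vocab.add(w);
        }
    }

    // Returns the class with the highest log-probability for the given words.
    String classify(String... words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : counts.keySet()) {
            double score = 0; // uniform class priors, for simplicity
            for (String w : words) {
                int c = counts.get(label).getOrDefault(w, 0);
                // add-one (Laplace) smoothing avoids zero probabilities
                score += Math.log((c + 1.0) / (totals.get(label) + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
}
```

Unseen words contribute equally to all classes through the smoothing term, so the decision is driven by words actually observed in training.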
Connectionist approaches are based on the neural network idea that intelligence emerges from the parallel interaction of many single neuron units. These approaches combine symbolic knowledge representation with statistical methods. The knowledge is stored in the weights of the neural connections, and the network is trained, like the statistical approaches, until it is capable of solving unprocessed cases itself.
Finally, hybrid approaches pay attention to the fact that all three preceding approaches have strengths and weaknesses and may be used optimally in combination, by applying each of them to those NLP subtasks whose individual requirements it fits best.
3.2.2 Lexical Analysis
Lexical analysis deals with text segmentation tasks. The central program of this analysis is called a tokenizer; it divides the text into known token units like words, punctuation and numbers. The sentence splitter is responsible for segmenting the text into separate sentences. Another related tool is the part-of-speech tagger, which matches sentence parts to word classes like noun, verb and adjective. This is necessary to resolve ambiguities for the tokenizer.
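A minimal, illustrative sketch of the first two tools might look as follows. The regular expressions are deliberately naive (a real tokenizer handles abbreviations, clitics and much more), and all names are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive lexical analysis: a regex tokenizer emitting words, numbers and
// punctuation, plus a sentence splitter cutting at sentence-final marks.
class LexicalAnalysis {
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        // split after ., ! or ? followed by whitespace (lookbehind keeps the mark)
        for (String s : text.split("(?<=[.!?])\\s+"))
            if (!s.isEmpty()) sentences.add(s);
        return sentences;
    }
}
```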
3.2.3 Morphological Analysis
Morphological analysis is concerned with word structure and the morphological processes from which it results. The goal is to normalize a word into a morphology-independent form. This is important simply to reduce the size and complexity of the underlying lexicons: it is easier to store morphology-independent forms and a set of rules expressing how any word can be reduced to them, than to store every possible morphologically transformed form and perhaps even add word heritage relations just to gain comparable expressiveness.
Morphology-independent forms can be stems or lexemes. Stems are the remaining part of a word when all suffixes are cut off. For example, the words 'category', 'categorical' and 'categories' share the same stem 'categ'. Lexemes, on the other hand, are basic words like those one can find in a lexicon. For example, the lexeme of words like 'took', 'taken' and 'taking' would be 'take'. The latter is harder to implement but also more expressive, because a stemmer would not be able to match 'took' to 'take' or 'better' to 'good', while a program working with lexemes will.
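The difference can be illustrated with a deliberately naive suffix-stripping stemmer. The suffix list below is an arbitrary assumption for this sketch; real stemmers such as Porter's use ordered rule sets with additional measure conditions.

```java
// Deliberately simple suffix-stripping stemmer (illustrative sketch).
class SimpleStemmer {
    // Longest suffixes first, so 'ories' wins over 'ies' and 's'.
    private static final String[] SUFFIXES = { "ories", "orical", "ory", "ies", "ing", "ed", "s" };

    static String stem(String word) {
        for (String suffix : SUFFIXES)
            // require a remaining stem of at least three characters
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2)
                return word.substring(0, word.length() - suffix.length());
        return word;
    }
}
```

This maps the whole 'category' family to 'categ', but, as stated above, it leaves irregular forms like 'took' untouched, which only a lexeme-based approach could resolve.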
Part-of-speech taggers, which were already introduced in Subsection 3.2.2, are of importance for this analysis layer too, because the way morphological processing is done relies strongly on the affected part-of-speech type. Therefore, it is reasonable to use morphological data for part-of-speech tagging and vice versa.
3.2.4 Syntactic Analysis
Syntactic analysis works with the syntax of sentences. It deals with word order and phrase structure. Phrases are thereby word groups with a collective function in a particular sentence. The word order within those phrases and the distribution of phrases in the sentence follow language-inherent rules and carry information about grammatical states. Syntactic analysis thus contributes to the extraction of central linguistic concepts like sentence type, tense and morphological case. Another application at this analysis level is text parsing, i.e. the verification of a sentence's syntactic well-formedness.
3.2.5 Semantic Analysis
Semantic analysis aims to perceive the meaning of text. It is generally divisible into lexical and compositional semantics. Lexical semantics deals with the semantics of single words or phrases. This may involve classification into a relation network of similar synonyms, hierarchically related higher-classed hypernyms or lower-classed hyponyms, contrary antonyms and others. An important application of this analysis step is word sense disambiguation.
Compositional semantics deals with those semantics emerging from the composition of words and phrases into bigger clusters like sentences or whole texts. An instance of this is the semantic deduction drawn from a pronoun referring to the noun of a preceding sentence. The application of semantic analysis to the interrelation between the sentences of a text is called discourse analysis.
3.2.6 Pragmatic Analysis
Pragmatic analysis is responsible for the highest level of text understanding. The semantic meaning is considered in relation to a wide-ranging set of context, background knowledge and conventions in order to extract hidden inherent information like action implications, speaker motivation, irony or citation. The ability to master this analysis level is probably a main obstacle for a machine to reliably pass the Turing test and may, according to Alan Turing [62], therefore count as equivalent to human intelligence. However, 'highest' does in no way suggest that the other levels are of lesser importance: pragmatic understanding cannot be achieved without profound preparatory work at the preceding levels.
3.2.7 Text Classification
Text classification is a subfield of natural language processing. It determines whether a text is a member of certain categories. Such categories may, for instance, refer to the text genre or to topic domains. The latter categorization task can be labeled separately as term assignment, but in this thesis we will include it under the term text classification for reasons of clarity and simplicity. Generally, classification is useful for supporting effective access to large amounts of information. Hence, it is of especially great interest with regard to the rapidly growing world wide web.
Text classification is based on a controlled vocabulary, which lists all permitted classification terms. The opposite is text clustering, which freely arranges document sets according to shared words, phrases or even just shared relations to words. On the one hand, this free indexing strategy has the advantage of being domain-independent and more flexible towards unexpected inputs. Controlled indexing, on the other hand, provides better performance in its special domain and yields predictable output that is easier to work with on the application side. Furthermore, it can be used semantically more easily, by preparing a specialized semantic net for the anticipated outputs, and its consistency with human classification is higher.
Text classifiers usually consist of a knowledge extractor and a filter. The knowledge extractor creates class models containing sets of weighted features. These are mostly represented as word or letter n-grams and carry extracted text data like frequency counts, entropy and correlations. Each module can either work in a static way, which is usually symbolic and rule-based, or be self-learning, which involves training data and statistical or connectionist methods.
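As a small sketch of such feature computation, the following helper counts character n-gram frequencies, one of the feature types mentioned above. Class and method names are illustrative assumptions, not part of any particular classifier.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative feature computation for a knowledge extractor:
// frequency counts over character n-grams of a text.
class NGramFeatures {
    static Map<String, Integer> charNGrams(String text, int n) {
        Map<String, Integer> freq = new HashMap<>();
        // slide a window of length n over the text and count each substring
        for (int i = 0; i + n <= text.length(); i++)
            freq.merge(text.substring(i, i + n), 1, Integer::sum);
        return freq;
    }
}
```

In a classifier, such counts per class would be normalized and weighted to form the class models the filter compares against.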
CHAPTER 4
Tools and Standards
This chapter describes important tools and standards that were used during the thesis work. Section 4.1 explains the purpose and features of the SIOC ontologies, which are used by CompleXys' Content Type Indexer to express social media specific metadata. Section 4.2 surveys the GATE project, which is utilized as the basic framework for the Semantic Content Annotator module and provides one of the implemented methods for semantic extraction and annotation. Section 4.3 treats the taxonomy description language SKOS, which serves as ontology description language for the CompleXys domain taxonomy. Section 4.4 discusses the keyphrase extraction package KEA and its follow-ups, on which another approach for semantic extraction is based. Finally, Section 4.5 introduces the Calais initiative and its semantic annotation web service OpenCalais, the third semantic data source utilized within the Semantic Content Annotator.
4.1 SIOC
The abbreviation SIOC [9] is short for Semantically-Interlinked Online Communities. The initiative aims to bridge the gap between the social web and semantic web technologies. To achieve this, it provides a series of ontologies defining a description standard for the domain of online communities.
The ontology structure specifies different abstraction levels that relate to each other. For example, Figure 4.1 presents the semantic net of the SIOC main classes. Abstract items relate to a superordinate container, which in turn belongs to a certain space. In case more details are known, the item may be described more precisely as a post and the container as a forum, both located on a concrete site. A post may have replies, tags, categories and a creator. The creator may be a member of a usergroup, have a function in the forum and may be further related to special person description ontologies like FOAF [11].
Generally speaking, these concepts enable people to describe and consolidate their identity across the social web and possibly merge all the multiple accounts of today's web life into one coherent web identity. This is coupled with rapid access to and processing capacity for community-related data, and thereby with many interesting application options. On the other hand, it may also lead to an increased potential for abusive data storage, hence increasing the necessity for public awareness of data parsimony.
Figure 4.1: The SIOC main classes in relation1
4.2 GATE
The abbreviation GATE [16] is short for General Architecture for Text Engineering. It is an infrastructure for language processing software development. The contained software architecture defines a fundamental organization schema for NLP software based on loosely coupled GATE layers. These can also be utilized externally by accessing the corresponding open source API set of the GATE Embedded framework, whose components are visualized in Figure 4.2.
1 Accessed on January 25, 2010: http://wiki.sioc-project.org/images_sioc/f/f2/Sioc_spec_5_small.png
Figure 4.2: The APIs, which form the GATE architecture2
It is easily perceivable that GATE has a meticulous focus on clean level separation, dividing its APIs into an IDE GUI, Application, Processing, Language Resource, Corpus, Document Format, DataStore and Index layer as well as Web Services. The internal resources are structured in three categories. Basic data and language documents like lexicons, ontologies and corpora are termed Language Resources (LR). Algorithmic components like part-of-speech taggers, tokenizers and parsers are called Processing Resources (PR). Visualization and GUI related components are denoted as Visual Resources. This division obviously mirrors the model-view-controller architectural pattern. The combined set of these three resource types is collectively known as CREOLE, which is short for Collection of REusable Objects for Language Engineering.
Furthermore, GATE contains a graphical IDE, a ready-to-use data model for corpora and documents, discussed in Subsection 4.2.1, and an elaborate information extraction system called ANNIE, which will be discussed in detail in Subsection 4.2.2.
2 Accessed on January, 2010: http://gate.ac.uk/sale/talks/gate-apis.png
4.2.1 Corpus Data Model
The corpus data model is used as document and annotation format for the Semantic
Content Annotator and Semantic Filter modules of CompleXys. It can be described by
the six essential data objects, whose relation network is visualized in Figure 4.3.
Figure 4.3: A data model diagram for GATE’s corpus layer
A corpus object is per definition a large, structured set of texts and therefore contains an arbitrarily large set of documents. Additionally, it is identified by a name and may contain a FeatureMap, which lists descriptive features of an object as key-value pairs. The documents possess the actual document content, a name, a source URL, a FeatureMap and AnnotationSets. An AnnotationSet contains any number of annotations and an identifying set name.
An Annotation has an id and a type, potentially connecting it to an ontology concept. Further information can also be noted in an attached FeatureMap. The Annotation objects are implementations of the externally referenced semantic annotation approach that is discussed in Subsection 3.1.2. This means that the annotations are neither embedded in the content itself nor even referred to from within the content, but point to the respective text interval by simply describing a start node and an end node offset externally. The format, a modified form of TIPSTER [20], is useful on the one hand because it cleanly separates content and semantic description and preserves the original text. On the other hand, even slight modifications of the text will result in reference inconsistencies between annotation and content. Thus, a more flexible referencing approach would be desirable.
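This stand-off referencing principle, including its brittleness, may be illustrated by the following simplified sketch. It is not the actual GATE API; all names are assumptions of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-off annotation: the annotation never stores text,
// only start/end offsets into the document content, a type and features.
class Annotation {
    final int start, end;
    final String type;
    final Map<String, String> features = new HashMap<>();

    Annotation(int start, int end, String type) {
        this.start = start; this.end = end; this.type = type;
    }

    // The covered text is recovered from the document content on demand.
    String coveredText(String documentContent) {
        return documentContent.substring(start, end);
    }
}
```

If the document text is edited after annotation, the stored offsets silently point at a different interval, which is exactly the reference inconsistency criticized above.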
4.2.2 ANNIE
GATE is delivered bundled with a ready set of Processing Resources for information extraction named A Nearly-New Information Extraction system, or ANNIE for short. Its central processing resources are the tokenizer, sentence splitter, part-of-speech tagger, gazetteers and, for our purposes, the semantic tagger.
The tokenizer is responsible for dividing the text into known token units like words, punctuation and numbers, while the sentence splitter has to identify the sentences and split the text into them. The part-of-speech tagger categorizes tokens and token clusters as parts of speech like noun, verb or adjective. All three were already introduced as lexical analysis related tools in Subsection 3.2.2.
A gazetteer is, by word heritage, a geographical directory listing information about places. However, in the domain of NLP the term's meaning has changed and now generally implies a set of word lists, each referring to a certain category, for example lists of persons, cities or companies. The listed words need not exclusively be entity names, but can also be mere indicators, as 'Ltd.' is for a company. Furthermore, a GATE gazetteer module provides lookup functionality to match text parts to words occurring in the respective list and to annotate them with the respective list category. This functionality can be implemented by finite state machines or hash tables.
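A hash-table-based gazetteer lookup may be sketched as follows. This is a simplified illustration with assumed names, not the actual GATE gazetteer implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative hash-table gazetteer: each word list maps its entries to
// a category; lookup returns the category of the list a token occurs in.
class Gazetteer {
    private final Map<String, String> entries = new HashMap<>();

    void addList(String category, List<String> words) {
        for (String w : words) entries.put(w.toLowerCase(), category);
    }

    // Returns the category for a token, or null if it is not listed.
    String lookup(String token) {
        return entries.get(token.toLowerCase());
    }
}
```

A tagging component would then attach the returned category as an annotation to the matched text span.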
The semantic tagger builds upon the gazetteer principle, using JAPE rules to further describe matching patterns and the resulting annotations. JAPE [17] is short for Java Annotations Pattern Engine; it provides finite state transduction over annotations based on regular expressions. Hereby, it is possible to define and process rules like: "When a text part has already been tagged by the gazetteer with the name x, then add a feature ontology referring to the corresponding ontology concept y." In this way, gazetteers can be used to automatically assign semantic annotations to text. This is one of the semantic extraction approaches that are used in the Semantic Content Annotator. Its application is explained in detail in Subsection 6.2.3.
4.3 SKOS
SKOS [31] is a W3C standard ontology description language and a particular implementation language for the ontology concept described in Subsection 3.1.1. The abbreviation SKOS is short for Simple Knowledge Organization System. It is based on RDF and thereby natively integrated into the semantic web environment. Furthermore, it is a lightweight modeling language specialized in hierarchical data structures like thesauri, taxonomies and classification schemes.
SKOS concepts can possess three kinds of labels: prefLabel, altLabel and hiddenLabel. The prefLabel property defines the preferred label of a concept, and altLabel defines alternative labels, which is useful to assign synonyms, acronyms and abbreviations. A hiddenLabel is a label that can be used internally for tasks like search operations and text-based indexing, but should never be visibly displayed; a practical example are common misspellings of actual labels. Every kind of label can optionally have a language tag, which restricts the scope of the label to this particular language and thereby enables an application to preferably display the label in the native language of the calling instance.
Figure 4.4: An exemplary SKOS taxonomy
SKOS allows three relation types: broader, narrower and related. Broader and narrower can be used to build up a concept hierarchy, as demonstrated in the example in Figure 4.4. A broader relation from a concept tinyBooks towards a concept books would express that tinyBooks is a subconcept of books, so that every instance of tinyBooks is also an instance of books. A narrower relation from the concept books towards tinyBooks implicitly expresses the same. The relation related can be used to express a non-hierarchical connection between two concepts. For example, dog and dogOwner are in no way subconcepts of each other, but they are naturally related.
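A simplified sketch of such a concept hierarchy, with broader relations followed transitively to decide subconcept membership, might look as follows. The names are illustrative; this is not a SKOS API, and the sketch assumes an acyclic hierarchy.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative SKOS-like concept: a preferred label, alternative labels,
// and broader relations pointing towards more general concepts.
class Concept {
    final String prefLabel;
    final Set<String> altLabels = new HashSet<>();
    final Set<Concept> broader = new HashSet<>();

    Concept(String prefLabel) { this.prefLabel = prefLabel; }

    // Follows broader transitively (assumes the hierarchy has no cycles).
    boolean isSubconceptOf(Concept other) {
        if (this == other) return true;
        for (Concept b : broader)
            if (b.isSubconceptOf(other)) return true;
        return false;
    }
}
```

The inverse narrower relation need not be stored explicitly here, since it is implied by broader, just as described above.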
4.4 KEA
The abbreviation KEA [63] is short for Keyphrase Extraction Algorithm. This piece of Java-based open source software analyzes text documents to extract a set of keywords or keyphrases, which are multi-word units. Keyphrases are widely used in corpora to briefly describe the content of single documents and to provide a basic sort of semantic metadata that can be reused by other processing tasks.
The task of assigning keyphrases to a document is called keyphrase indexing. Traditionally, authors or special indexing experts have done this task manually, but with an increasing amount of text in digital libraries and the whole world wide web, this approach is no longer sufficient. KEA provides a software-driven, free indexing approach to automate this task.
Figure 4.5: The KEA algorithm diagram together with KEA++³
The diagram in Figure 4.5 visualizes the overall KEA process. It can be divided into
two basic subtasks. The first step, candidate extraction, is accordingly termed extract
candidates within the schema; it is further described in Subsection 4.4.1. The second step
is the filter process that extracts those keyphrases that are most likely to be useful. This
3 Accessed January 6, 2010: http://www.nzdl.org/Kea/img/kea_diagram.gif
subtask essentially involves the schema entities compute features and compute entities
during actual operation, but also includes compute model while in training mode.
It is detailed in Subsection 4.4.2. Finally, Subsection 4.4.3 introduces the KEA
advancement KEA++.
4.4.1 Candidate Extraction
Candidate extraction is responsible for extracting candidate phrases out of a plain text
using lexical methods (see Subsection 3.2.3 for general information about lexical anal-
ysis). This subtask is in turn divisible into the three basic steps input cleaning, candidate
identification and normalization of phrase candidates.
Input cleaning normalizes the raw input text into a standardized format. To this end,
the text is divided into tokens, using spaces and punctuation as splitting clues. The
outcome is modified by separating out single or framing symbols like punctuation marks,
brackets, numbers and apostrophes, as well as non-token characters and tokens that do
not contain any letters. Furthermore, hyphenated words are split into their parts.
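A rough sketch of this cleaning step might look as follows; this is a simplification for illustration, not KEA's actual tokenizer:

```python
import re

def clean_input(text):
    """Sketch of the input-cleaning step described above: split on
    whitespace and punctuation, break hyphenated words apart, and drop
    tokens that do not contain any letters."""
    # Split hyphenated words into their parts first.
    text = text.replace("-", " ")
    # Use spaces and punctuation as splitting clues.
    tokens = re.split(r"[\s.,;:!?()\[\]\"']+", text)
    # Keep only tokens that contain at least one letter.
    return [t for t in tokens if any(c.isalpha() for c in t)]

print(clean_input("Self-organization (see ch. 3) involves many agents!"))
```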
Phrase identification is the task of considering all token sequences as possible phrases
and finding the suitable candidates among them. KEA uses three conditions to determine
suitability. The first condition is that a candidate phrase is composed of a limited
number of tokens; three words has proven a good choice for the length limit.
The second condition is that proper names cannot be candidate phrases, and the
third that phrases cannot begin or end with a stopword. Stopwords form a word list
containing types such as conjunctions, articles, particles, prepositions, pronouns,
anomalous verbs, adjectives and adverbs that are unlikely to begin or end a useful
phrase.
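These identification conditions can be sketched like this; proper-name detection is omitted and the stopword list is abbreviated:

```python
def candidate_phrases(tokens, stopwords, max_len=3):
    """Sketch of candidate identification: enumerate all token
    subsequences up to max_len tokens and reject phrases that begin
    or end with a stopword. Proper-name filtering is left out here."""
    candidates = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            phrase = tokens[i:j]
            # Phrases may not begin or end with a stopword.
            if phrase[0] in stopwords or phrase[-1] in stopwords:
                continue
            candidates.append(" ".join(phrase))
    return candidates

stop = {"the", "of", "a"}
print(candidate_phrases(["the", "theory", "of", "complex", "systems"], stop))
```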
The third step normalizes the identified phrase candidates by stemming and case-
folding. Stemming is achieved by iteratively cutting off suffixes of a candidate until
just the stem is left. Case-folding is simply done by a general lower-case conversion.
Additionally, multi-word phrases are re-ordered, so that for instance Technical Supervisor
and supervising technician both result in the normalized form supertech. This extracted
form is called a pseudo-phrase. In addition, the most frequent original phrase of every
pseudo-phrase is recorded to be presented as the phrase label to human users.
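The normalization can be sketched as below, with a deliberately toy stemmer; KEA itself uses an iterated Lovins stemmer, so the suffix list here is an illustration only:

```python
def pseudo_phrase(phrase, stem):
    """Sketch of phrase normalization: case-fold, stem each token,
    then re-order the tokens so that word order no longer matters."""
    stems = sorted(stem(tok.lower()) for tok in phrase.split())
    return " ".join(stems)

def toy_stem(word):
    """Toy stemmer: strip one common suffix (illustrative assumption)."""
    for suffix in ("ing", "ed", "or", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Different surface forms collapse to the same pseudo-phrase:
print(pseudo_phrase("Indexing Documents", toy_stem))
print(pseudo_phrase("document indexing", toy_stem))
```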
4.4.2 Filtering
The filtering task is responsible for choosing the most suitable keyphrases out of a given
set of keyphrase candidates. To achieve this, the candidates have to be measured in a
way that makes them comparable. Based on these measurements, it must then be decided
which candidates will be chosen as keyphrases. The three metrics applied for free
indexing are TFxIDF, first occurrence and phrase length.
TFxIDF is a frequency metric that relates a phrase's occurrence in a particular docu-
ment to its occurrence frequency in all preceding documents. The idea behind this
is that a phrase is more likely to be a keyphrase when it occurs often inside the respec-
tive document while being generally rare in the corpus on average. Rareness
relates in this case to unpredictability and thereby to a higher information
gain. For example, the fact that a document is related to chemistry is not very surprising
inside a purely chemical corpus, but may be a useful descriptor when the document
belongs to a computer science corpus. The formula for TFxIDF is:
TFxIDF = \frac{freq(P, D)}{size(D)} \times \left( -\log_2 \frac{df(P)}{N} \right), where

1. TF is the term frequency in the actual document,
2. IDF is the inverse document frequency, which measures the probability of the term
occurring in a document of the corpus,
3. freq(P, D) is the number of times term P occurs in document D,
4. size(D) is the number of words in document D,
5. df(P) is the number of documents containing the term P in the global corpus,
6. N is the size of the global corpus.
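Plugged into code, the formula reads as follows; this is a direct transcription of the definition above, not KEA's implementation:

```python
import math

def tfxidf(freq_pd, size_d, df_p, n_docs):
    """TFxIDF as defined above: term frequency in the document times
    the negative log2 of the term's global document frequency."""
    tf = freq_pd / size_d                 # freq(P, D) / size(D)
    idf = -math.log2(df_p / n_docs)       # -log2(df(P) / N)
    return tf * idf

# A phrase occurring 5 times in a 1000-word document,
# found in 10 of 1000 corpus documents:
print(tfxidf(5, 1000, 10, 1000))
```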
The second feature metric is the relative position of the first occurrence. It is calculated
with the following formula:

FO = \frac{prec(P, D)}{size(D)}, where

1. FO is the relative position of the first occurrence of the term,
2. prec(P, D) is the number of words in document D preceding the first occurrence of the
term P,
3. size(D) is the total number of words in document D.
The third feature is the phrase length. This metric reflects the observation that
human indexing experts tend to choose two-word phrases over one- or three-word
phrases. Therefore, it may be reasonable to weight such candidates higher.
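The three free-indexing features for one candidate could be assembled as in this illustrative sketch; the function and field names are hypothetical:

```python
def features(tokens, phrase_tokens, tf_idf_value):
    """Sketch combining the three features named above: TFxIDF,
    relative first occurrence, and phrase length. Assumes the phrase
    actually occurs in the token list."""
    n = len(phrase_tokens)
    # Find the first occurrence of the phrase; prec(P, D) = index i.
    first = next(i for i in range(len(tokens) - n + 1)
                 if tokens[i:i + n] == phrase_tokens)
    return {
        "tfxidf": tf_idf_value,
        "first_occurrence": first / len(tokens),  # prec(P, D) / size(D)
        "length": n,  # two-word phrases tend to be preferred
    }

doc = ["complex", "systems", "show", "emergent", "behavior"]
print(features(doc, ["complex", "systems"], 0.033))
```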
After the features have been derived, the selection itself must be performed. For this,
KEA applies a machine-learning algorithm based on WEKA [64] that learns how to rate
a phrase candidate and afterwards does so autonomously. Within the approach
categorization introduced in Subsection 3.2.2, KEA uses a statistical approach,
working with a naive Bayes classifier to build the prediction model. Hence KEA
initially needs a set of already annotated documents to train the model to
distinguish useful keyphrases among the candidates. Once the model is sufficiently
trained, KEA is able to tell useful from useless keyphrase candidates in unknown
documents quite well.
4.4.3 KEA++
KEA++ [43] is an advancement of KEA that adds the capability of controlled
indexing. Since version 4 it is also included in the main KEA distribution. Controlled
indexing, in contrast to free indexing, restricts the set of possible keyphrases to a
fixed set of predetermined phrases. The advantages and disadvantages of these two ap-
proaches were already discussed in Subsection 3.2.7. Summarizing, one may say that
controlled indexing is very useful in fixed domains and in cases where predictable
keyphrases are an important requirement. The most fundamental resource of con-
trolled indexing is the thesaurus. KEA++ is hereby designed to use SKOS tax-
onomies (see Section 4.3 for information on SKOS).
One effect of the advancement on the candidate extraction is that a phrase must be
successfully matched with a thesaurus entry before it is considered a keyphrase can-
didate. The matching is done by normalizing the thesaurus entries in the same
way as the candidates and comparing the pseudo-phrases instead of the originals, in
order to avoid complex morphology handling.
An additional metric in the feature extraction process of the controlled indexing approach
is the node degree, which uses the number of direct semantic relations from one can-
didate to the others as a clue to its representativeness, thus modifying its weight. This
feature has the interesting effect that even phrases that do not actually appear in the
text might become keyphrase candidates, just because they are well connected to the
other candidates. For instance, a text that mentions astronomy, biology, physics, chem-
istry and earth science might be well described by the common related term natural
science, although it does not appear in the text.
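A minimal sketch of such a degree computation, assuming the thesaurus relations are available as a simple adjacency mapping; the data structure and names are illustrative, not KEA++'s internals:

```python
def node_degree(candidate, all_candidates, related):
    """Count direct thesaurus relations from one candidate to the other
    candidates. `related` maps a concept to the set of concepts it is
    directly linked to (broader, narrower or related); lookups here are
    one-directional for simplicity."""
    return sum(1 for other in all_candidates
               if other != candidate
               and other in related.get(candidate, set()))

relations = {
    "natural science": {"astronomy", "biology", "physics", "chemistry"},
}
cands = ["astronomy", "biology", "physics", "chemistry", "natural science"]
print(node_degree("natural science", cands, relations))
```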
Another new metric, directly resulting from the preceding one, is the actual appearance
of a phrase in the text. Although a particular node degree might lead to the conclusion
that a thesaurus term not appearing in the text may be a good candidate, appearance
is still a strong indicator that should be considered in the selection process.
4.5 Calais
Calais [12] is a strategic initiative at Thomson Reuters that aims to improve the inter-
operability of content. To this end, it utilizes state-of-the-art natural language processing
techniques to "turn static text into Smart Media that is enriched with open data and con-
nected to a dynamic Linked Content Economy" [61]. More precisely, Calais provides free
metatagging services, developer tools and an open standard for the generation of se-
mantic content. The key component of these efforts is the OpenCalais web service, which
is detailed in Subsection 4.5.1. The underlying data format is treated separately
in Subsection 4.5.2.
Figure 4.6: Input and output data of the OpenCalais web service⁴
4.5.1 OpenCalais WebService
OpenCalais is the web service at the core of the Calais initiative. It is an API that takes
unstructured plain text as input, processes it with natural language processing and
machine learning methods, and returns a semantically annotated version of the text to
the user. Access to the web service is basically free, even for commercial use, but requires
registration for an API key, which is mandatory for every request. Furthermore, the re-
quest frequency for a single API key is currently limited to fifty thousand transactions
per day and four transactions per second.
4 Accessed on January 8, 2010: http://enioaragon.files.wordpress.com/2009/12/12-03-calais.jpg?w=450&h=325
Method invocation can be done by sending either SOAP or REST requests. Calais takes
the submitted content, which must not be larger than one hundred thousand characters,
identifies the entities occurring therein and tags them with metadata. Relevant entity
classes are categories, named entities, facts and events, as shown in Figure 4.6. The exact
data model is described further in Subsection 4.5.2. Web service responses return the
enriched content with all the assigned tags, document IDs and URIs as RDF, Microfor-
mats, JSON [15] or Calais' hybrid Simple Format.
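A REST request against the service might be constructed roughly as below. The endpoint URL, header names and content-type values are assumptions based on the service's documentation at the time of writing and may have changed, so they should be checked against the current docs:

```python
import urllib.request

def build_calais_request(text, api_key):
    """Builds (but does not send) a POST request to the OpenCalais REST
    endpoint. Endpoint and header names are assumptions for illustration."""
    return urllib.request.Request(
        "http://api.opencalais.com/tag/rs/enrich",
        data=text.encode("utf-8"),          # plain text, max 100,000 chars
        headers={
            "x-calais-licenseID": api_key,  # the mandatory API key
            "Content-Type": "text/raw; charset=UTF-8",
            "Accept": "application/json",   # RDF or Microformats also possible
        },
    )

req = build_calais_request("Thomson Reuters launched Calais.", "YOUR-API-KEY")
print(req.get_method(), req.get_full_url())
```

Sending the request with urllib.request.urlopen(req) would then return the annotated document in the requested output format.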
4.5.2 Data Model
OpenCalais' data model is strongly oriented towards the linked data design principle
propagated by W3C director Tim Berners-Lee [5]. This principle can be described by four
simple rules:
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards
(RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.
According to rules one and two, OpenCalais identifies every relevant object with
an HTTP URI. Common URIs relate to types, type instances, documents, text instances
or resolution nodes. Types are the predefined entity categories that Calais provides.
Their URIs are statically formed, like http://s.opencalais.com/1/type/em/e/Company.
Type instances are concrete individuals of a certain type. For example, ClearForest
Ltd. would be a type instance of Company. Their URIs are composed of a type-related
prefix and an instance-specific hash token. The URI of ClearForest Ltd. would be
http://d.opencalais.com/comphash-1/899a2db3-ce69-3926-ba4f-6dea099c3fc9. If the
relevance feature is turned on, the RDF also includes a score that estimates the im-
portance of the entity for the document.
Document URIs refer to the actual text that is sent within the request and are com-
posed like type instance URIs, but with a document-related prefix. An example URI is
http://d.opencalais.com/dochash-1/00b00ecd-7e8b-3773-b30f-2169abd75efe.
Type instances like ClearForest Ltd. in turn possess instances in the docu-
ment, each describing an actual occurrence in the text. These text instances are referred
to by an ID composed of their container document's URI and a document-internal in-
stance counter as suffix. For example, the sixth instance of a document may be named
http://d.opencalais.com/dochash-1/00b00ecd-7e8b-3773-b30f-2169abd75efe/Instance/6.
A special problem are ambiguous type instances. These cases are resolved by noting a
generic ambiguous type entity followed by a resolution node that represents the most
likely disambiguated instance. For example, the city Paris would be described by an am-
biguous type instance http://d.opencalais.com/genericHasher-1/56fc901f-59a3-3278-addc-b0fc69b283e7
and the additional resolution node http://d.opencalais.com/er/geo/city/ralg-geo1/797c999a-d455-520d-e5cf-04ca7fb255c1,
describing the specific individual Paris, France, and rating the certainty of the
disambiguation by a score attribute.
Figure 4.7: An example application of linked data⁵
As rules three and four suggest, the URIs can be looked up again to gain additional,
RDF-based information, including new related URIs for further exploration. In doing so,
OpenCalais helps to semantically link the data, creating an expanding semantic net-
work. Furthermore, it links to external open data sources like Wikipedia⁶, IMDB⁷,
5 Accessed on January 8, 2010: http://www.slideshare.net/KristaThomas/simple-opencalais-whitepaper
6 http://wikipedia.org/
7 http://www.imdb.com/
Geonames⁸, Citeseer⁹, IEEE¹⁰, Project Gutenberg¹¹ and many more to further increase
the usefulness of the provided metadata.
In some cases the resulting linked data can be used directly to identify relationships
between people, businesses and other entities, thereby potentially easing investigative
processes considerably and enabling many other automation possibilities. Traditionally,
a human information search would take considerable time to link the IBM Corporation
with Warren Buffett, whereas software with an extracted semantic net like that of
Figure 4.7 may draw the conclusion quite rapidly.
8 http://www.geonames.org/
9 http://citeseer.ist.psu.edu/
10 http://www.ieee.org/portal/site
11 http://www.gutenberg.org/wiki/Main_Page
CHAPTER 5
Related Work
This chapter provides an overview of the related work that has been done in the field
of information filtering, the research area most relevant to the CompleXys
project. The focus lies on recent papers, theses and projects that are representa-
tive for extensive subfields. They were chosen for their similarities to and differ-
ences from the characteristics of CompleXys. Each of them helps to situate the
accomplished work within the context of germane, international research.
Various definitions have been given for information filtering, especially con-
cerning its relationship to information retrieval. A frequently cited paper in this
respect is [3], which states that "Information filtering is a name used
to describe a variety of processes involving the delivery of information to people who
need it." Belkin and Croft conclude that, at such an abstract level, information retrieval
is just another side of the same coin and even a superordinate discipline to filtering,
with filtering as its specialization. This view influences research up to today, but has
also received criticism for constricting the perspective and thereby suppressing attention
to the specific attributes of information filtering [47]. Oard phrased another popular
definition in 1996, stating "Text Filtering is an information seeking process in which
documents are selected from a dynamic text stream to satisfy a relatively stable and
specific information need." and thereby classified it among related fields as shown in
Table 5.1 [48]. Others draw the line between access styles, distinguishing information
filtering as passive push from information retrieval as active pull research [34]. And yet
others see information retrieval as equivalent to one-time querying and information
filtering as equivalent to continuous querying or selective dissemination of information [66].
One early idea of information filtering was presented as early as 1982 in [18]. The au-
thors proposed an approach for automatically ordering incoming emails according to
their priority. This work marked the beginning of email classification as a very
Process                  Information Need        Information Sources
Information Filtering    Stable and Specific     Dynamic and Unstructured
Information Retrieval    Dynamic and Specific    Stable and Unstructured
Database Access          Dynamic and Specific    Stable and Structured
Information Extraction   Specific                Unstructured
Alerting                 Stable and Specific     Dynamic
Browsing                 Broad                   Unspecific
Entertainment            Unspecific              Unspecific

Table 5.1: Examples of information seeking processes [48]
intensive application field of information filtering. Nowadays nearly all email providers
use some kind of spam filter, making spam filters probably the most widespread
information filtering applications.
Information filtering generally needs some kind of source metadata to match against cer-
tain filter conditions and thereby decide whether to filter out or display a single resource.
By the source of this data, it can be divided into content-based filtering and col-
laborative filtering. In content-based filtering this data is extracted from
the content itself, while the competing collaborative filtering uses averaged com-
munity behavior in comparable contexts as its model. The latter has received much
attention recently, due to several very successful applications like the Amazon¹ and
Last.fm² portals, which leverage the approach for product and music recommendations.
Furthermore, the rise of the social web makes this a fortunate time for building and
leveraging web communities. Ongoing research on collaborative filtering involves projects
like the social web page aggregator MakeMyPage [29]. This thesis, however, focuses on
content-based filtering, because an adequately large community cannot always be assumed
to be present. Content-based filtering has not yet been utilized as successfully as its
competing approach [47], albeit the great potential of personalized access to informa-
tion networks beyond community borders is obvious and often recognized.
The content data itself is of no use for filtering without the information what the
user actually wants to get. There are various approaches to selecting and presenting
information tailored to the user's interest. The two most important possibilities are
personalized models and queries. Queries are the most common expression of information
need in today's information systems. They are traditionally collections of metadata
phrases that describe a specific one-time resource request. They are used on nearly every
website, from search engines over weblogs to e-stores. However, these queries are normally
restricted to relatively simple tasks and lack any support for long-term research. Instead,
one has to manually combine many of those queries over time, providing all necessary
context metadata over and over again. This is only natural with respect to their heritage
in information retrieval, but makes them unhandy for information filtering. One idea
to overcome these shortcomings are continuous [65] or retroactive [14] [24] queries. They
are based on the idea that some queries cannot be sufficiently answered in the first
research step and ongoing investigation might be worthwhile. Reasons for this can be
that new information becomes available later and the researcher wants to stay up to
date, or that some information's value is simply coupled to certain events that have not
happened yet. Retroactive queries are an intersecting approach between queries and
personalized models, because they are structured like queries, but persistently retained
like personalization data.
1 http://www.amazon.com
2 http://www.last.fm
CompleXys itself uses a user-model-driven, personalized filtering approach.
User models can be short-term, long-term or intermediate models. Short-term
models concentrate on the recent behavior of the user, while long-term models focus
on permanent user interests. Task models [32] are an intermediate approach between
both: they can spread over many sessions and time units, but are restricted to a certain
task scope. Their purpose is similar to that of retroactive queries, trying to satisfy the spe-
cial needs of complex, multi-session tasks. This thesis' work is essentially combinable
with all of them, although the ontology annotation implicitly suggests a correspond-
ing ontology-based profile. The benefit of ontologies in user modeling has been intensively
investigated in recent years [23] [55] [53] and is likely to become quite common
with the emergence of a Semantic Web, both of which underline the implication.
Information filtering is often seen as a binary text classification problem [49] [36]. This
thesis' work differs from most text-classification-based filters insofar as it does not
derive one fixed category, but lists category candidates ordered by their proba-
bilities, thereby combining multi-value classification with fuzzy data. This is useful
because a text can be relevant in many aspects: for example, one may want to read
this thesis because of its relation to information filtering, because it applies GATE, or
because it is part of an adaptive portal project. However, traditional text classification
would be inappropriate to achieve this, because it narrows the perspective to a sin-
gle aspect and discards the less probable data. Furthermore, a simple binary decision
for each classification candidate ignores certainty levels and provides no possibility to
order documents by their category relevance. Using the probability-order ap-
proach, prioritization can be a natural part of the information filtering process, among
other things enabling smart, dynamic display styles in information portals.
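The contrast with single-label classification can be sketched as follows; the threshold and the score values are hypothetical:

```python
def rank_categories(scores, threshold=0.1):
    """Sketch of the probability-ordered, multi-value output described
    above: instead of picking only the single most likely category, keep
    all candidates above a (hypothetical) threshold, ordered by
    probability."""
    return sorted(
        ((cat, p) for cat, p in scores.items() if p >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )

doc_scores = {"information filtering": 0.7, "GATE": 0.4,
              "adaptive portals": 0.3, "biology": 0.02}
print(rank_categories(doc_scores))
```

The ordered list retains all relevant aspects of a document, which a single argmax decision would discard.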
A fundamental design decision of every information filtering system is its domain de-
pendency. A domain-specific approach can consider the unique circumstances of a
certain domain and harness them to raise performance. Such approaches were for in-
stance investigated for high energy physics in [54] and for the agriculture domain in
[43]. Both, however, used domain ontologies for this purpose, so their basic systems are
theoretically applicable to any other domain by simply exchanging the underlying do-
main model. This is also the approach chosen for the CompleXys project,
because it focuses on the complexity science domain, and competing domain-indepen-
dent approaches like the Wikipedia-based [58] do not yet, albeit making appreciable
progress, achieve a similar classification performance.
Recent research also indicates that the consideration of background knowledge can
raise classification performance. Domain ontologies are an obvious source for such
approaches, even when they are created automatically [7]. But other thesauri like
WordNet have also been utilized successfully [19] [41] [25]. Within this thesis'
work, the keyphrase extractor KEA [44] is employed to harness the relation data
between the domain-specific keyphrase concepts.
A project comparable to CompleXys was h-TechSight [42], which in 2004 implemented
an information filtering system on the basis of controlled-vocabulary text classification.
Besides their goal of better performance, they also utilize ontology technologies, thereby
being natively compatible with the Semantic Web. This opens up various possibilities,
like achieving performance gains by utilizing external linked data and elaborate Se-
mantic Web reasoners.
A more recent related project is the PIRATES framework [2], whose name is an abbrevi-
ation of Personalized Intelligent Recommender and Annotator TEStbed. It provides
a promising architecture for text-based content retrieval and categorization with a focus
on constructive interoperability with social bookmarking tools, its authors describing the
project as "a first step towards the generation and sharing of personal information spaces
described in [13]". There are some obvious intersections with the current approach of the
CompleXys project, including a module that utilizes GATE for information extraction
on the basis of part-of-speech tagging, and one module that uses a variation of the
KEA algorithm for keyphrase extraction. Albeit the main goals are quite different, a
properly configured PIRATES implementation, including a fitting domain model, could
possibly provide a suitable alternative to the CompleXys system. But although a detailed
comparison would have been interesting, the focus of this thesis did not allow for one.
Recent research is apparently often influenced by the ongoing emergence of the social
web. Besides the PIRATES project, there has been research on the multi-value classifi-
cation of very short texts like comments or microblogging entries in [26] and on the
categorization of blogger interests based on short blog post snippets in [39]. Up to now,
CompleXys mostly uses whole websites and reformats them into the SIOC format. There
is no differentiation by content size yet, but due to the positive results of the mentioned
papers, this may change in the future.
CHAPTER 6
Semantic Content Annotator
The Semantic Content Annotator module, the first of the two practical
development goals, is the focus of this chapter. It extracts and annotates semantic data
from input text and classifies the text according to the concepts of a CompleXys domain
ontology, which was required to perform the text classification in this module. Section
6.1 describes the design and implementation of the central domain ontology. Section
6.2 introduces the principle of CompleXys Tasks and describes their implementation
instances, thereby explaining how the module's goals are achieved.
6.1 CompleXys Domain Ontology
This section is dedicated to the design of the CompleXys domain ontology. First,
Subsection 6.1.1 outlines the focus and adjacent topics of complexity. Afterwards,
Subsection 6.1.2 describes the CompleXys taxonomy that is derived from the formerly
identified issues.
6.1.1 Complexity
Complexity is basically an interdisciplinary approach to science and society. It tries to
describe, understand and apply a wide range of chaotic, dissipative, adaptive, nonlin-
ear and complex systems and phenomena [8]. It cannot be strictly defined, but only
situated between order and disorder. A complex system is usually modeled as a set of
interacting agents, which represent diverse components like people, cells or molecules.
Because of the non-linearity of the interactions, the overall system evolution is to an im-
portant degree unpredictable and uncontrollable. Still, the system tends to self-organize,
in the sense that local interactions eventually produce global coordination and synergy.
The resulting structure can often be modeled as a network, with stabilized interactions
functioning as links connecting the agents. Such complex, self-organized networks typ-
ically exhibit the properties of clustering, being scale-free and forming a small world
[27]. Complex patterns can be found in many traditional research fields, so that com-
plexity is nearly as universally applicable as information science and mathematics and
has attracted increasing attention throughout the last decades. However, this universality
is hard to capture and to describe in a formal way like a comprehensive domain model.
6.1.2 CompleXys Taxonomy
The first considerations regarding the domain model led, among others, to the con-
clusion that complexity is a topic with strong relations to many scientific areas. These
scientific areas are in turn classified into disciplines like biology, mathematics and
computer science, each again involving many subfields. So it is natural to reuse this hi-
erarchical classification in the scaled-down area of complexity-related terms. Moreover,
this style of hierarchical topic division is well known, easy to visualize and intuitively
understandable and usable by the users of the system. These thoughts also led to the
decision that a hierarchical taxonomy would be sufficient for CompleXys and that us-
ing the taxonomy modeling language SKOS (see Section 4.3) is therefore a sensible
choice.
Building an exhaustive domain model is beyond the scope of this thesis. So, for a proof-of-concept implementation, the domain model was restricted to a semiautomatically created version that may be revised as part of future work. First, a set of representative, broadly scoped text data had to be found. A good place to look for such data is a specialized encyclopedia, so the first chosen source was the topical table of contents¹ of the Encyclopedia of Complexity and Systems Science [45]. The second identified source were the titles of talks given at several complexity conferences. Three conferences were chosen for this purpose:
• Complexity in Science and Society: International Conference & Summer School, Greece, 14-26 July 2004²
• Conference on Nonlinear Science and Complexity, China, 7-12 August 2006³
¹ 11.01.2010: http://www.springer.com/cda/content/document/cda_downloaddocument/Topical+Table+of+Contents.pdf?SGWID=0-0-45-783798-p173779107
² 11.01.2010: http://www.math.upatras.gr/ crans/complexity/
³ 11.01.2010: http://www.siue.edu/ENGINEER/ME/NSC2006/
• Conference on Nonlinear Science and Complexity, Portugal, 28-31 July 2008⁴
In the next step these text documents had to be analyzed in order to extract important terms. A suitable tool for this is the keyphrase extractor Maui [44]. Maui is based on KEA, which was already introduced in Subsection 3.4, but is designed to natively accomplish a broader range of tasks. It works on a machine learning basis, so a model has to be trained before the actual keyphrase extraction can be done. To train Maui well for our requirements, the training set needs to cover an extensive range of general topics, because, as stated in Subsection 6.1.1, complexity is distributed over many traditional research fields. Such a training set is the CiteULike-180 data set⁵ [44], which has been automatically extracted from the huge collection of tags assigned on the bookmarking platform CiteULike⁶.
After being trained, Maui is used to extract a large number of tag candidates from the four source text files by calling the MauiTopicExtractor with the n parameter set to one thousand. The resulting .key files contain the extracted tag candidates, but also include many that are too general to serve as terms of a domain model. So the files were superficially scanned by hand and obviously unfitting terms like "group" or "number" were sorted out. The remaining terms were then manually clustered into main categories, derived from the major scientific disciplines the terms are most closely related to. Additionally, some other terms resulting from a preceding investigation were added manually. The final taxonomy includes 297 terms divided into ten main categories. The terms are shallowly organized in two hierarchical levels, main categories and appendant terms; they may be classified into a deeper hierarchy in the future to improve expressiveness. Furthermore, some terms are interconnected by the relation type related, which expresses either topical closeness between two terms or an ambiguity of belonging, when a term could be assigned to more than one main classification. Figure 6.1 shows an excerpt of the model as a taxonomy circle. The ten main categories are displayed in the inner circle, while the outer circle contains examples of appendant terms. The connections between some of the terms exemplify the use of the related relationships.
⁴ 11.01.2010: http://www.gecad.isep.ipp.pt/NSC08/
⁵ 11.01.2010: http://maui-indexer.googlecode.com/files/citeulike180.tar.gz
⁶ http://www.citeulike.org/
Figure 6.1: An excerpt of the CompleXys taxonomy
6.2 Semantic Content Annotator Pipeline
This section introduces the Semantic Content Annotator pipeline. The pipeline is responsible for extracting semantic data from the incoming documents and for annotating this data back to the resources. Furthermore, it is meant to decide whether a resource is relevant for complexity and to which main classification it should be assigned. The succeeding subsections describe how these problems are solved by explaining the principle of the CompleXys Tasks and their implementation instances. Subsection 6.2.1 gives an overview of the structure and purpose of the pipeline, and Subsections 6.2.2 to 6.2.6 describe the components Crawled Content Reader, Onto Gazetteer Annotator, KEA Annotator, Open Calais Annotator and Content Writer.
6.2.1 Introduction
The Semantic Content Annotator module is meant to take a potentially high number of documents as input, to analyze them in order to extract semantic data, to decide whether they are relevant for complexity, to fuzzily classify them into the topics of the domain model and finally to output them again. Obviously this involves several sequential steps, in each of which every document has to be processed. That makes this module a perfect candidate for a parallel processing pipeline structure. The main advantage of such an approach is the effective exploitation of distributed processing and, above all, of multi-core processor systems. It is thereby an effective way to raise processing performance and scalability.
Figure 6.2: The CompleXysTask principle
The Semantic Content Annotator utilizes the Java package java.util.concurrent to implement such a pipeline. The basic principle is visualized in Figure 6.2. Every coherent component is implemented as a runnable task object and submitted to a thread pool. ConcurrentLinkedQueues handle the communication between the tasks. Each queue has a sender task and a receiver task. Whenever a sender task has finished its function on a certain document, it sends it to its output queue. On the other side of the queue, the receiver task takes every document in first-in-first-out order and starts processing it. Every task possesses a unique name and a set of features that can be used to transmit any kind of special information a task may need. For example, the only standard feature used so far is debug; it enables a centralized control of debugging output in the Semantic Content Annotator main class. Generally there are three kinds of tasks in this module, differentiated by the number and usage type of their queues, by their termination dependency and by their basic duties: the initiating task, CompleXys Tasks and the finishing task. Furthermore, every task is linked to a Future object, which is basically a flag that describes the termination
state of the thread it runs in. Every task except the first listens to the preceding task of the pipeline and terminates exactly when the preceding task has terminated and no document is left in its input queue.
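The task/queue principle described above can be sketched with java.util.concurrent as follows. This is a minimal illustration, not the thesis' actual code: the class and method names are made up, and the "processing" is reduced to counting the documents.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of the sender/receiver principle with a
// ConcurrentLinkedQueue between two pooled tasks.
public class PipelineSketch {

    /** Runs a two-stage pipeline over the given documents and returns
     *  how many of them reached the second stage. */
    static int runPipeline(String[] docs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Queue<String> queue = new ConcurrentLinkedQueue<>();

        // Initiating task: fills its output queue, then terminates.
        Future<?> sender = pool.submit(() -> {
            for (String doc : docs) queue.add(doc);
        });

        // Receiver task: terminates exactly when the preceding task has
        // terminated AND no document is left in its input queue.
        Future<Integer> receiver = pool.submit(() -> {
            int processed = 0;
            while (!(sender.isDone() && queue.isEmpty())) {
                String doc = queue.poll();    // documents arrive in FIFO order
                if (doc != null) processed++; // "process" the document
            }
            return processed;
        });

        int result = receiver.get();
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runPipeline(new String[]{"doc1", "doc2", "doc3"}));
    }
}
```

The receiver's termination condition mirrors the rule stated above: predecessor done and input queue empty.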
The initiating task is the first task in the pipeline. Accordingly, it possesses an output queue, but no input queue. Instead, it is responsible for collecting the necessary resources and their already included metadata itself, which is its main purpose as well. It terminates when no documents are left to collect. The implemented initiating task of CompleXys is the Crawled Content Reader, which is described in detail in Subsection 6.2.2.
Figure 6.3: The Semantic Content Annotator Pipeline
CompleXys Tasks are implementations of the abstract class CompleXysTask. They are characterized by possessing both an input and an output queue. Their purpose is the actual analysis, semantic data extraction and classification of the incoming documents. Three CompleXysTask instances were implemented in the course of this thesis. The Onto Gazetteer Annotator is further described in Subsection 6.2.3, the KEA Annotator in Subsection 6.2.4 and the Open Calais Annotator in Subsection 6.2.5.
The finishing task is the last task in the pipeline. Accordingly, it possesses an input queue, but no output queue. Instead, it is responsible for outputting the documents itself. The whole pipeline, and therewith the Semantic Content Annotator, terminates when the finishing task is done. In CompleXys the finishing task is called Content Writer. It will be further described in Subsection 6.2.6.
Figure 6.3 provides an overview of the implemented pipeline and its connected data
stores.
6.2.2 Crawled Content Reader
The Crawled Content Reader is the first component of the pipeline. Its main purpose is to gather the documents from the input data store, to decide whether they should be processed, to wrap them into the GATE data format (see Subsection 4.2.2) and to send them into the output queue for further processing.
First it establishes connections to both the input data store, where the unprocessed documents are stored, and the output data store, where the processed documents are stored. The former will be referred to as Harvester DB, because the Harvester module is responsible for steadily filling it with documents; the latter will be referred to as Semantic DB, because it is filled by the Semantic Content Annotator module and the stored data additionally includes the semantic information. The connection to the Semantic DB is built using an intermediate persistence layer consisting of a set of Data Access Objects⁷ (DAOs), factories and GATE's Hibernate Persistence Layer, resulting in a strict layer division and high data store exchangeability.
It fetches all documents stored in the Harvester DB and checks whether each document is already stored in the Semantic DB and has thus already been processed once. This must be done because the Harvester DB needs to keep the old documents to check for subsequent modification and to decide whether a resource is new. This step should become obsolete in the future, since a modified time stamp or an unprocessed flag could help to access only new and modified documents. If the document is found in the Semantic DB, both versions are compared by a hash value of their content to find out whether the text was modified in the meantime. If the hash values are equal, nothing was modified and the document is ignored, because everything is still up to date. But if something was changed, the correctness of all existing annotations, and potentially even of the classification, is uncertain. This dilemma is currently solved by simply deleting the document from the Semantic DB and treating it as if it were a new one.
If the document needs further processing, it is wrapped as a GATE document object, thereby committing it to the GATE persistency management of the Semantic DB. After this is accomplished, the document is sent to the output queue and the next document is handled. The Crawled Content Reader terminates when no unprocessed or modified document is left in the Harvester DB.
⁷ 13.01.2010: http://java.sun.com/blueprints/corej2eepatterns/Patterns/DataAccessObject.html
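The new/modified/unchanged decision described above can be sketched as follows. The two maps standing in for the Harvester DB and the Semantic DB, and the choice of SHA-256 as hash function, are assumptions for illustration; the thesis does not name the exact hash function or storage layout.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the Crawled Content Reader's decision logic. The Semantic DB
// is simplified to a map from document URL to stored content hash.
public class ReaderDecisionSketch {

    /** Hashes a document's content (SHA-256, hex-encoded). */
    static String hash(String content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    /** Returns "new", "modified" or "unchanged" for one harvested document. */
    static String decide(Map<String, String> semanticDb,
                         String url, String content) throws Exception {
        String stored = semanticDb.get(url);
        if (stored == null) return "new";                      // never processed
        if (stored.equals(hash(content))) return "unchanged";  // hashes equal: skip
        semanticDb.remove(url);                                // modified: delete, re-process
        return "modified";
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> semanticDb = new HashMap<>();
        System.out.println(decide(semanticDb, "http://example.org/post", "text"));
        semanticDb.put("http://example.org/post", hash("text"));
        System.out.println(decide(semanticDb, "http://example.org/post", "text"));
        System.out.println(decide(semanticDb, "http://example.org/post", "edited"));
    }
}
```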
6.2.3 Onto Gazetteer Annotator
The Onto Gazetteer Annotator searches the text for keywords that are listed in the gazetteers and annotates found terms with the corresponding concept. The frequency of occurring concept annotations can be used as a simple indicator for the classification. The implemented version does not make use of preprocessing NLP methods like stemming or part-of-speech tagging and is therefore likely to provide worse recall and precision than the KEA Annotator, which is described in Subsection 6.2.4.
The central element of this component is the OntoGazetteer, or semantic tagger, that is included in the information extraction system ANNIE (see Subsection 4.2.3). It is not directly applicable to the SKOS CompleXys taxonomy, but can make use of a derived, rule-based version. Therefore, every main category of the domain model gets its own .lst gazetteer file, wherein all subordinate terms are listed one per line. The terms are noted in both singular and plural form to prevent at least the worst effects of the missing stemmer. A file mappings.def defines the mapping rules from the .lst files to SKOS concepts in the form "COMPLEXITY.lst:SkosTaxonomy.rdf:Complexity", where COMPLEXITY is the name of the .lst file, SkosTaxonomy the path of the SKOS taxonomy and Complexity the name of the concept. However, the expressiveness of the gazetteer data is very limited, so the relationships cannot be transformed, and an extension of the taxonomy to more than two hierarchical layers would complicate the mapping enormously.
At the beginning, the Onto Gazetteer Annotator component initializes the OntoGazetteer object. In doing so, it disables case sensitivity, because case sensitivity would unnecessarily filter out words without compensating for this disadvantage by any perceptible benefit. GATE's default gazetteer is used as processing basis, and the resulting annotations are written into a dedicated "OntoGazetteerAnnotator" annotation set, so they can be accessed selectively later on. When a document is received through the input queue, it is simply passed as a parameter to the OntoGazetteer, where it is tagged. After taking the document back, the component forwards it to the output queue. The Onto Gazetteer Annotator terminates when the Crawled Content Reader has terminated and no documents are left in the input queue.
6.2.4 Kea Annotator
The KEA Annotator classifies a document into the concepts of the CompleXys domain model. To achieve this, it utilizes automatically extracted KEA keyphrases (see Section 4.4) as indicators of term relevancy to a text. To ensure that the keyphrases are matchable to the domain model, it simply uses the CompleXys taxonomy itself as controlled vocabulary for the extraction process. Additionally, it utilizes the related relationships as weight boosting functions. Using elaborate relevance indicators, semantic background knowledge and a pre-filtering of unlikely candidates is likely to raise the classification precision, so the KEA Annotator is expected to outperform the Onto Gazetteer Annotator.
A pre-implemented GATE plugin for KEA, which converts native KEA keyphrases to GATE annotations, already existed. However, it had not yet been adapted to controlled indexing and the general KEA++ functionality that was required for the fixed-terms classification task. So the missing intermediary functions were implemented as part of this thesis and the plugin was adapted to KEA++. As classification model, CompleXys uses a pre-trained model, again based on the CiteULike-180 data set⁸ that was already applied in the ontology extraction process (see Section 6.1).
The KEA Annotator component itself is quite simple. It just fetches every new document from the input queue, passes it as a parameter to the Kea class of the GATE plugin, which extracts the keyphrases and annotates them into a dedicated "KEAAnnotator" annotation set, and finally writes the document back into the output queue. The KEA Annotator terminates when the Onto Gazetteer Annotator has terminated and no documents are left in the input queue.
6.2.5 Open Calais Annotator
The Open Calais Annotator utilizes the OpenCalais web service (see Subsection 4.5.2) to semantically annotate the text's entities. The data obtained this way is not yet used for the domain classification, but links the documents to the wide external knowledge base stored by Calais. Exploiting these relations has great potential for further improving the classification, but also for other features like enriching the displayed resources in the front end with additional data. However, actually making use of this data has to be part of future work on CompleXys.
CompleXys uses the existing OpenCalais GATE plugin to access the web service and to convert its responses to GATE annotations. The plugin has to be initialized with the web service URL and an OpenCalais API key, which can be requested on the Calais website⁹. Then every incoming document can be given to the plugin, which processes it and
⁸ 11.01.2010: http://maui-indexer.googlecode.com/files/citeulike180.tar.gz
⁹ http://www.opencalais.com/
writes the annotations into a dedicated "OpenCalaisAnnotator" annotation set, so that they can be accessed selectively later on. The processed document is forwarded to the output queue. The Open Calais Annotator terminates when the KEA Annotator has terminated and no documents are left in the input queue.
6.2.6 Content Writer
The Content Writer ensures that every document is correctly stored in the Semantic DB before the pipeline terminates. Optionally, it can calculate comparative or summarizing values from the former components' results, because its finishing position guarantees that every value has reached its final version when it arrives at the Content Writer.
First it establishes a connection to the Semantic DB and waits for the first documents to arrive at the end of the pipeline. After receiving a document from the input queue, it calculates one or more main categories according to the results of the Onto Gazetteer Annotator and the KEA Annotator in the corresponding annotation sets and writes them as "mainCategory" feature into the feature set of the document. Then it stores the current version of the document in the Semantic DB and proceeds with the next document. The Content Writer terminates when the Open Calais Annotator has terminated and no documents are left in the input queue. When it has terminated, the whole pipeline terminates too.
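One plausible way to derive the "mainCategory" feature from the annotation sets is to count the concept annotations per main category and keep the most frequent ones. The exact formula is not fixed by the text above, so the following is an assumption, sketched over a flat list of annotated category names:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a majority-vote "mainCategory" calculation.
public class MainCategorySketch {

    /** Returns the most frequently annotated main categories (alphabetical on ties). */
    static List<String> mainCategories(List<String> annotatedCategories) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : annotatedCategories)
            counts.merge(c, 1, Integer::sum);      // count annotations per category
        int max = 0;
        for (int n : counts.values()) max = Math.max(max, n);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() == max) result.add(e.getKey());
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mainCategories(
                java.util.Arrays.asList("Biology", "Mathematics", "Biology")));
    }
}
```

Returning all categories tied for the maximum also naturally covers the "one or more main categories" case mentioned above.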
CHAPTER 7
Semantic Filter
This chapter focuses on the Semantic Filter module, which is the second practical development goal of this thesis. It provides a filter mechanism that extracts resource subsets by their compliance with semantic conditions. Section 7.1 introduces the principle of the applied filters and their implementation. In addition, the matching resources must be further converted into a standardized, useful output format. Section 7.2 describes the converters for the two formats RSS and Sesame triples that have been chosen for this purpose.
7.1 Filter
This section introduces a filter system for CompleXys. Subsection 7.1.1 discusses certain filter approaches that were considered. Subsection 7.1.2 describes the central AbstractFilter class and its internal iterating data structure. Subsection 7.1.3 explains how these filters can be used to implement a framework for propositional logic. Subsection 7.1.4 introduces the basic filters that actually evaluate the semantic data of the documents.
7.1.1 Filter Approaches
The first step in designing a useful filter system for CompleXys was to identify a suitable filter approach. One requirement was that it has to be able to handle the large amount of resources that is likely to accumulate in the long-term use of an information filtering system on the World Wide Web. Furthermore, the approach has to be expressive enough to describe composed filter queries, because the dynamic user interests are unlikely to be reducible to a single standard condition. And finally, the filter has to provide good runtime performance, because it tends to be directly involved in the front end's output composition, thereby directly influencing the response time for the users of CompleXys.
A possible approach is an abstract filter layer on top of the underlying data store itself, which would try to directly convert user interest queries into data store queries. Advantages of such an approach are the high expressiveness of potentially usable query languages like SQL, the native capability to handle big amounts of resources, and good performance, because data stores are usually optimized for rapid access. The main disadvantages are the complexity of matching complex user interest queries to data store queries and, above all, the high coupling of the filter component to the underlying data store implementation and structure.
Another approach is the direct conversion of the resources into a triple store format that could be used with Sesame¹ (see Subsection 7.2.2). Filter queries could be formulated in a suitably powerful query language such as SPARQL², and the queries themselves would be optimized for good performance too. However, the main disadvantage of this approach is that each document has to be converted into the triple store format in the first place. This leads to unacceptably high performance penalties and inevitable problems with huge amounts of documents. A possibility to overcome this problem would have been to exchange the persistence layer of the Semantic Content Annotator and to replace the GATE solution with a triple store, so that no conversion would be necessary. However, this possibility was discarded because of the high expense of creating a new persistence layer between the GATE document model and Sesame, and because another suitable alternative was available.
The third approach, and the one that was finally chosen, is based on the FilterIterator pattern [57]. Iterators are data structures that are designed for iteratively accessing the elements contained in them. Usually every element is handed out, but contrary to this common behavior the FilterIterator inspects the elements beforehand. In doing so, it checks whether the elements meet a certain filter condition and simply skips those that do not. This approach apparently performs worse than direct data store access, because in order to evaluate whether the documents meet the filter criteria, it has to indiscriminately fetch all of them in the first place. On the other hand, it is supposed to outperform the triple store conversion approach, because it discards unnecessary documents before investing any further processing in them. In addition, the performance of logically composed filter conditions can be further improved by short circuits, as discussed in Subsection
¹ http://www.openrdf.org/
² http://www.w3.org/TR/rdf-sparql-query/
7.1.3. The main advantage of this approach is the flexibility of the system. The filters can be arbitrarily composed into complex logical expressions, and the system can be extended by simply writing new condition methods. The actual implementation of the FilterIterator pattern and of the particular filters is described in the succeeding subsections.
7.1.2 Abstract Filter
The central element of the Semantic Filter module is the AbstractFilter, which is modeled on the FilterIterator pattern [57]. It is an abstract Java class that contains three elements. The first element is the private data structure FilterIterator. It implements Java's Iterator interface, but differs from the standard implementations in two important points. First, its constructor takes another iterator as parameter and wraps it. This makes it possible to recursively wrap FilterIterators in one another, thereby creating complex, composed FilterIterators. This possibility will be leveraged by the logic filters that are described in Subsection 7.1.3. The other difference is the toNext() method, which is responsible for shifting the position pointer of the iterator one step forward. However, this particular implementation does not take one step, but as many steps as are required to find an element that meets the filter condition.
The filter condition that is checked in the toNext() method is defined by the abstract method passes(). This method has to be implemented by every instance of the AbstractFilter and characterizes the behavior of the particular filter. It takes an element as parameter, checks whether it meets the filter condition and simply returns the truth value.
The third and last element of the AbstractFilter is the filter() method. It is merely an intermediary between any external caller and the private FilterIterator data structure. It takes an iterator as input parameter, uses it to instantiate a FilterIterator and returns the result to the calling instance.
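The three elements described above can be sketched as follows. The skeleton follows the description, but every detail beyond it (field names, the demo helper, the even-number filter) is an illustrative assumption:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch of the AbstractFilter with its private FilterIterator.
public class FilterSketch {

    static abstract class AbstractFilter<E> {

        /** The filter condition; implemented by every concrete filter. */
        protected abstract boolean passes(E element);

        /** Wraps the given iterator into a FilterIterator and returns it. */
        public Iterator<E> filter(Iterator<E> source) {
            return new FilterIterator(source);
        }

        private class FilterIterator implements Iterator<E> {
            private final Iterator<E> wrapped;  // the wrapped iterator
            private E next;
            private boolean hasNext;

            FilterIterator(Iterator<E> wrapped) {
                this.wrapped = wrapped;
                toNext();
            }

            /** Takes as many steps as needed to find a passing element. */
            private void toNext() {
                hasNext = false;
                while (wrapped.hasNext()) {
                    E candidate = wrapped.next();
                    if (passes(candidate)) {
                        next = candidate;
                        hasNext = true;
                        return;
                    }
                }
            }

            public boolean hasNext() { return hasNext; }

            public E next() {
                if (!hasNext) throw new NoSuchElementException();
                E result = next;
                toNext();
                return result;
            }
        }
    }

    /** Collects everything a filter lets through, for demonstration. */
    static <E> List<E> apply(AbstractFilter<E> f, List<E> input) {
        List<E> out = new ArrayList<>();
        Iterator<E> it = f.filter(input.iterator());
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    public static void main(String[] args) {
        AbstractFilter<Integer> even = new AbstractFilter<Integer>() {
            protected boolean passes(Integer e) { return e % 2 == 0; }
        };
        System.out.println(apply(even, Arrays.asList(1, 2, 3, 4, 5, 6)));
    }
}
```

Because filter() takes any Iterator, the output of one filter can be fed into another, which is exactly the recursive wrapping the logic filters rely on.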
7.1.3 Logic Filters
Single filter methods can take parameters in order to provide a basic level of flexibility, but it is unnecessarily time-consuming to provide a new filter class for every possible condition of a semantic query. A way to partially overcome this is the use of logic filters. These are filters that accept an arbitrarily large number of other filters as parameters and link them in a logical way.
Three such filters have been implemented in the Semantic Filter module: the AndFilter, the OrFilter and the NotFilter. These filters apply the corresponding operators and, or and not to the filters that have been passed to them. Since the missing operators implication and equivalence can be simulated by compositions of the three implemented ones, these logic filters form a complete framework for propositional logic.
The AndFilter takes an arbitrarily large number of AbstractFilters and stores them in an iterable array. Its passes() method iterates over the stored filters and checks each subordinate passes() condition for fulfillment. Since all propositions of an and operation have to be true for the operation itself to be true, the operator is short-circuited to false whenever a proposition is found to be false. This behavior can potentially improve the performance of more complex queries. AndFilters are useful for precise searches, which makes them a prominent choice for queries of the "Search" use case defined in Subsection 5.2.2. The OrFilter is structured like the AndFilter, but differs in that just one proposition has to be true to prove the truth of the whole operator. Analogously, a short circuit to a true result is taken whenever one proposition is found to be true. OrFilters are useful in personalized queries that try to express a general "give me all resources that might be interesting for the user". The NotFilter does nothing else than negate the boolean result of the passed filter's passes() method. In contrast to the other logic filters, it accepts just one AbstractFilter as input.
Figure 7.1 provides an example of the application of propositional logic in filtering queries. The visualized case is equivalent to the verbalized condition "Give me all documents that are not related to ComplexMathematics, but related to GraphTheory and DataMining, as well as to at least one of the topics SystemsGenetics and ComplexComputerScience."
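The three logic filters and this verbalized condition can be sketched as follows. Only the passes() logic is shown: the Filter base class is a minimal stand-in for the AbstractFilter of Subsection 7.1.2, the documents are simplified to their sets of related topics, and all example data is made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of AndFilter, OrFilter and NotFilter with short circuits.
public class LogicFilterSketch {

    static abstract class Filter<E> {           // stand-in for AbstractFilter
        abstract boolean passes(E element);
    }

    static class AndFilter<E> extends Filter<E> {
        private final Filter<E>[] operands;
        AndFilter(Filter<E>... operands) { this.operands = operands; }
        boolean passes(E e) {
            for (Filter<E> f : operands)
                if (!f.passes(e)) return false; // short circuit on first false
            return true;
        }
    }

    static class OrFilter<E> extends Filter<E> {
        private final Filter<E>[] operands;
        OrFilter(Filter<E>... operands) { this.operands = operands; }
        boolean passes(E e) {
            for (Filter<E> f : operands)
                if (f.passes(e)) return true;   // short circuit on first true
            return false;
        }
    }

    static class NotFilter<E> extends Filter<E> {
        private final Filter<E> operand;        // accepts exactly one filter
        NotFilter(Filter<E> operand) { this.operand = operand; }
        boolean passes(E e) { return !operand.passes(e); }
    }

    /** A document is modeled as its set of related topics. */
    static Filter<Set<String>> relatedTo(String topic) {
        return new Filter<Set<String>>() {
            boolean passes(Set<String> topics) { return topics.contains(topic); }
        };
    }

    /** The condition of Figure 7.1: not ComplexMathematics, but GraphTheory
     *  and DataMining, and at least one of SystemsGenetics and
     *  ComplexComputerScience. */
    static Filter<Set<String>> exampleQuery() {
        return new AndFilter<>(
                new NotFilter<>(relatedTo("ComplexMathematics")),
                relatedTo("GraphTheory"),
                relatedTo("DataMining"),
                new OrFilter<>(relatedTo("SystemsGenetics"),
                               relatedTo("ComplexComputerScience")));
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList(
                "GraphTheory", "DataMining", "SystemsGenetics"));
        System.out.println(exampleQuery().passes(doc));
    }
}
```

An implication a → b, mentioned above as simulatable, would simply be `new OrFilter<>(new NotFilter<>(a), b)` in this sketch.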
7.1.4 Basic Filters
If the logic filters described in Subsection 7.1.3 build up a framework for propositional
logic, the basic filters can be seen as its propositional elements. They filter documents
according to certain concrete criterions. Up to now two basic filters have been im-
plemented in CompleXys, the GazetteerAnnotationFilter and the KeaAnnotationFilter.
Both utilize the document’s semantic annotations, that were extracted within the Se-
mantic Content Annotator module, to decide if a document is sufficiently related to a
certain term. Therefore, they take the particular term string and the float numbered
Figure 7.1: An example for the application of propositional logic in filtering queries
approval threshold between zero and one as input parameters. In the passes() method
they do look up the semantic annotations, that were stored in the respective annotation
set. In case of the GazetteerAnnotationFilter this is the "OntoGazetteerAnnotator" set
and in case of the KeaAnnotationFilter the "KeaAnnotator" set. Then the annotations
terms are iteratively compared to the criterion term. If both match, they further check
if the classification probability of the annotation is equal to or greater than the approval
threshold. If it is the criterion is matched and a true value is returned. Else the passes()
method returns a false value.
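The passes() logic of the basic filters might be sketched as follows; the Annotation record and the map of named annotation sets are simplified stand-ins for the GATE document model, not the real CompleXys data structures.

```java
import java.util.List;
import java.util.Map;

public class AnnotationFilterSketch {

    record Annotation(String term, double probability) {}

    // Look up the configured annotation set (e.g. "KeaAnnotator"), compare each
    // annotation's term to the criterion term, and accept once a match reaches
    // the approval threshold.
    static boolean passes(Map<String, List<Annotation>> annotationSets,
                          String setName, String criterionTerm, double threshold) {
        for (Annotation a : annotationSets.getOrDefault(setName, List.of())) {
            if (a.term().equals(criterionTerm) && a.probability() >= threshold) {
                return true; // criterion matched with sufficient probability
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Annotation>> sets = Map.of("KeaAnnotator",
            List.of(new Annotation("ChaosTheory", 0.83),
                    new Annotation("DataMining", 0.41)));
        System.out.println(passes(sets, "KeaAnnotator", "ChaosTheory", 0.7)); // true
        System.out.println(passes(sets, "KeaAnnotator", "DataMining", 0.7));  // false
    }
}
```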
7.2 Output Variants
Up to this point the output of the filter process is still a non-standardized Java object, which is forwarded to the front-end module that converts it into rendered user output. Two converter classes were developed to remedy this flaw, to support the requirements of the front-end module, and for demonstration purposes. The RSS converter is described in Subsection 7.2.1 and the Sesame Triplestore Converter in Subsection 7.2.2.
7.2.1 RSS Converter
RSS³ is a content syndication format based on XML. In version 2.0 its name is defined as an abbreviation for Really Simple Syndication. Alongside its competitor Atom⁴, it is the de facto standard for news feeds on the internet, which makes it an interesting and highly reusable output choice.
To convert a document into the RSS format, CompleXys uses JDOM to dynamically build up the corresponding XML tree. The XML structure of an RSS feed always contains an rss element with a subordinate channel element. This channel element contains any number of item elements, which usually relate to entities like the articles of a news site, the entries of a blog, or the posts of a forum. In the case of CompleXys these items represent the filtered documents. The document's metadata is stored in subordinate attribute elements. The document title is written to the title as well as to the description element, the document URL to the link element, and the categories, sorted by probability and separated by commas, to the category element. A converted item looks as follows:
<item>
<title>This is the document title</title>
<link>http://www.source.com/original_resource_url.html</link>
<description>This is the document title</description>
<category>cx:example,cx:complexity,cx:thesis</category>
</item>
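The conversion itself can be sketched as follows. CompleXys uses JDOM for this step; to keep the sketch dependency-free, it builds the same rss/channel/item structure with the JDK's built-in org.w3c.dom API instead, so the method names differ from the actual implementation.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RssItemSketch {

    static String toRss(String title, String url, String categories) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element rss = doc.createElement("rss");
            rss.setAttribute("version", "2.0");
            doc.appendChild(rss);
            Element channel = doc.createElement("channel");
            rss.appendChild(channel);
            Element item = doc.createElement("item");
            channel.appendChild(item);
            append(doc, item, "title", title);
            append(doc, item, "link", url);
            append(doc, item, "description", title); // title doubles as description
            append(doc, item, "category", categories);
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static void append(Document doc, Element parent, String name, String text) {
        Element child = doc.createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    public static void main(String[] args) {
        System.out.println(toRss("This is the document title",
            "http://www.source.com/original_resource_url.html",
            "cx:example,cx:complexity,cx:thesis"));
    }
}
```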
7.2.2 Sesame Triplestore Converter
Sesame⁵ is a storage and querying middleware for RDF and is thus based on the triple store technique. It additionally supports SPARQL queries, and in this combination it is a powerful candidate to become a common persistence solution for the Semantic Web. It was also considered for the filter itself. This possibility was dropped mainly because of the high implementation expense of a new persistence layer between the GATE document model and Sesame in the Semantic Content Annotator module, as well as the need for an abstraction layer between Semantic Filter queries
³ http://www.rssboard.org/rss-specification
⁴ http://www.atompub.org/rfc4287.html
⁵ http://www.openrdf.org/
and SPARQL, which together would have exceeded the scope of this thesis. However, Sesame triple stores remain an interesting candidate for Semantic Web interoperability and further processing, so the Sesame Triplestore Converter was implemented to provide a corresponding output option.
Triple stores are so named because every stored basic fact is composed of a triple of elements: a subject, a predicate, and an object. Each of these elements is identified by a URL, so a main task of the mapping was to find unambiguous, readable URLs for each individual document and annotation to be stored. This was solved by identifying them with a URL prefix that clarifies their type and a hash id as suffix, derived by hashing a combination of their metadata and content. The relations between objects are stored as fixed URL terms with a common predicate prefix. The concepts of the CompleXys domain model do not need to be converted, because they already possess a taxonomy URL. To convert the document into the XML format, CompleXys again uses JDOM to dynamically build up the corresponding DOM tree. Finally, an exemplary fact, expressing that a specific document is annotated with the term "ChaosTheory", looks as follows:
<http://complexys.de/datamodel/annotationhash-1/a3cca2b2aa1e3b>
<http://complexys.de/datamodel/pred/annotates>
<http://complexys.de/datamodel/dochash-1/5b3b5aad99a8529074>
<http://complexys.de/datamodel/annotationhash-1/a3cca2b2aa1e3b>
<http://complexys.de/datamodel/pred/hasCategory>
<cx:ChaosTheory>
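The identifier scheme might be sketched as follows. The thesis does not specify the hash function or the suffix length, so SHA-1 truncated to 18 hex characters is an assumption chosen here merely to mirror the shape of the example URLs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class TripleIdSketch {

    // Hash a combination of metadata and content into a short, stable id.
    static String hashId(String metadataAndContent) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(metadataAndContent.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest).substring(0, 18);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Type-clarifying prefix plus hash id, as in the example triples above.
    static String documentUri(String title, String sourceUrl, String content) {
        return "http://complexys.de/datamodel/dochash-1/"
            + hashId(title + sourceUrl + content);
    }

    public static void main(String[] args) {
        System.out.println(documentUri("This is the document title",
            "http://www.source.com/original_resource_url.html", "document body ..."));
    }
}
```

Because the id is a pure function of metadata and content, the same document always maps to the same URL, which keeps the triple store free of duplicates.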
CHAPTER 8
Evaluation
This thesis proposes a solution for the two CompleXys modules Semantic Content Annotator and Semantic Filter. It accomplishes semantic enrichment of complexity-related content, text classification, and the filtering of documents depending on their semantic data. This chapter evaluates this solution to decide whether the requirements defined in Section 2.2 are met and how the system generally performs. Due to the limited time frame of the thesis, only the most critical evaluations have been done. Section 8.1 discusses the achieved quality of the important text classification task using the common metrics precision and recall. This is only done for the Semantic Content Annotator module, because the Semantic Filter performs no additional classification, but merely utilizes the existing ones. Section 8.2 analyses the response time behavior of the Semantic Filter solution, because this module can directly influence the response time for the end user and is therefore time-critical. In contrast, the response time of the Semantic Content Annotator is not as critical, because it usually runs concurrently in the system background and does not directly influence the end user's response time. Finally, Section 8.3 summarizes the evaluation results and draws conclusions about the strengths and weaknesses of the system.
8.1 Classification Quality
This section is dedicated to the evaluation of the text classification quality achieved by CompleXys. The quality is measured using the popular classification quality metrics precision and recall (see also Subsection 2.2.3). Subsection 8.1.1 introduces the utilized set of test documents and explains how it was created. Subsection 8.1.2 explains the applied test strategy and, finally, Subsection 8.1.3 discusses the test results.
8.1.1 Document Set
Text classification systems are usually evaluated by performing a classification on huge corpora like the popular Reuters-21578¹ collection or its successor RCV1². The classification results of the system are then compared to the already existing human-made classifications, which are taken as the gold standard. However, the problem of a domain-specific system like CompleXys is that it must be tested in its own domain to gain a useful measure of its quality. In fact, it cannot even assign most of the categories that are used in the general data sets, so these cannot reasonably be used. Comparable domain-specific text classification systems also tend to have huge domain corpora with existing classifications at hand [54], but apparently there is none for the domain of complexity. The only suitable alternative was to create a new complexity data set, but doing this manually was hardly feasible within the scope of this thesis. Therefore, a test document set was created automatically.
The documents of the set had to fulfill five conditions. They had to be written in English, be scientific, and be nested in markup, because CompleXys is supposed to collect its resources mostly from English scientific blogs and news sites. They had to be clearly dividable into those that are related to complexity and those that are not, because it must be automatically decidable whether a certain decision of CompleXys to keep and store a document is correct. Finally, they had to be numerous enough to be statistically relevant, because small samples can too easily be unrepresentative.
In order to comply with these conditions, the final document set was derived from resources that are tagged on Citeulike³. Ten documents were chosen for each of the ten main classifications in the CompleXys taxonomy. These hundred documents were complemented by another two hundred that were not tagged with any term of the CompleXys taxonomy, so that wrong choices become possible too. All of these documents are excerpts from English scientific websites, so the first three conditions are naturally met. Furthermore, by referring to their tags, they are dividable into a complexity-related part and one that is not. This is not an exact classification, but an appropriate rule of thumb that approximates a precise classification to an acceptable degree. In total, the test set contains three hundred documents, which should be enough to avoid unrepresentative behavior.
¹ http://www.daviddlewis.com/resources/testcollections/reuters21578/
² http://trec.nist.gov/data/reuters/reuters.html
³ http://www.citeulike.org/
8.1.2 Test Strategy
The evaluation of the Semantic Content Annotator essentially measures text classification quality. To this end, the test data set introduced in the preceding subsection is processed by the Semantic Content Annotator pipeline. This execution is done for a series of comparable configurations. In the first test, the Onto Gazetteer Annotator is the only component that performs the classification. In the following ten test configurations, the Kea Annotator processes the documents. These configurations differ in the occurrence threshold that a term has to exceed before it counts as relevant for the text. This variable is expected to significantly influence the performance of the classification, because complexity terms like "complexity" or "chaos" also frequently occur in texts that are not pertinent to complexity as such. But those irrelevant words are likely to occur significantly less often than relevant words, so a well-chosen threshold can help to sort the wheat from the chaff.
The obtained binary classification data is used to calculate the standard metrics precision and recall, which can be compared between the configuration cases, but also to the performance of other text classification systems. In addition to the relevance decision, the test stores the URL, the main category, and the weight of the main category. The main category weight is the relative share of terms from a certain main category within the set of all term annotations of a resource. It is used to compare main categories in order to identify the most important ones for the text. This data is used to take random samples to empirically evaluate the quality of the categorization into main categories and to analyze the correlation of correctness and weight.
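The computed quantities can be summarized in a short sketch; the example numbers below are chosen to reproduce the target values from Subsection 2.2.3, not actual test results.

```java
public class MetricsSketch {

    // Share of kept documents that were actually complexity-related.
    static double precision(int truePositives, int falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    // Share of complexity-related documents that were actually kept.
    static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    // Relative share of a category's terms among all term annotations.
    static double mainCategoryWeight(int termsOfCategory, int allTermAnnotations) {
        return (double) termsOfCategory / allTermAnnotations;
    }

    public static void main(String[] args) {
        System.out.println(precision(70, 30));         // 0.7 -> the precision target
        System.out.println(recall(70, 70));            // 0.5 -> the recall target
        System.out.println(mainCategoryWeight(3, 12)); // 0.25
    }
}
```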
8.1.3 Test Results
The results of the classification quality tests have to be considered in the trade-off between recall and precision. Accordingly, Figure 8.1 presents the measured values across these two dimensions. The individual points represent the test runs. The label gaz refers to the Onto Gazetteer Annotator test and the kea labels to the Kea Annotator tests, with the number standing for the different minimum term occurrence values.

Figure 8.1: The distribution of the quality test runs across the dimensions precision and recall

It can be seen that the gaz test achieves a very high recall value, but in doing so clearly fails to meet the precision target of 0.7. The kea2 test performs even worse, but the higher the occurrence threshold is set, the better the precision values become. This tendency continues up to kea80, which misses the target value by just 0.06 while still complying with the recall requirement. However, kea90 significantly declines in both recall and precision, so it can be assumed that kea80 forms a local optimum that cannot simply be further improved by increasing the minimum occurrence threshold.
The main category values that were annotated to the resources are evaluated by taking random classification samples. These were manually compared to the actual content to get an empirical clue to how this classification task performs. The samples generally achieve a success rate of approximately 50%. Considering the negative effect that a one-out-of-two error rate has on the user experience, this is obviously not a very good result. However, this flaw seems to be partially a consequence of the previous false complexity classifications. This impression is underlined by the fact that the success rate of those sampled resources that were correctly classified as complexity is approximately three out of four, which is significantly higher. Furthermore, the main category classifications are often understandable and vaguely right, but rated as false, because one or two other categories would definitely fit better. Due to the interdisciplinary nature of complexity, this is a frequently occurring case. So this key characteristic is likely to be utilizable for improving the quality of main category classifications. Further improvement proposals are made in the future work considerations in Chapter 9.
Additionally, the analysis of the main category classifications and their weights uncovers another interesting relation. Table 8.1 shows that the resources that were classified to their main category with a total weight of 1.0 have a significantly lower complexity classification precision than the others.

weight range    document number    average precision
0.0 - 0.4       11                 0.72
0.4 - 0.5       15                 0.66
0.5 - 0.6       17                 0.65
0.6 - 1.0       23                 0.78
1.0             21                 0.38

Table 8.1: Average precision and number of documents within the top test kea80, clustered by main category weight ranges with at least 10 documents

The rapidly falling trend line in Figure 8.2 visualizes
this effect. Based on this knowledge, the result of the top-performing test kea80 was re-evaluated by simply discarding all resources with the 1.0 value. This virtual test run is referred to as kea80+ and is listed among the other tests in Figure 8.1. On the one hand, the KeaAnnotator finally exceeds the precision requirement in this configuration; on the other hand, it slightly misses the target recall, by 0.02. However, it is still the best result achieved throughout the tests, and if it can be improved a little further, it will fully meet the requirements. Suggestions for what these improvements might look like are discussed in Chapter 9.
Figure 8.2: The correlation of main category weight and precision within the top test kea80
8.2 Response Time
This section is dedicated to the performance evaluation of the Semantic Filter. The performance is measured in terms of response time (see also Subsection 2.2.3). Subsection 8.2.1 explains the applied test strategy and Subsection 8.2.2 discusses the test results.
8.2.1 Test Strategy
The Semantic Filter is evaluated by its response time behavior, because it is time-critical insofar as it can directly influence the time a user has to wait for the system response. To evaluate this response time, a test should be able to simulate various influencing variables. Thus the tests vary in four basic dimensions. The number of documents to be filtered scales in the steps 10, 100 and 1000. The number of considered terms scales from 1 to 251 in steps of fifty. The usage of logical filters is varied by using either an AndFilter or an OrFilter, each with all BasicFilters inverted by NotFilters, without any inversion, or with both randomly mixed. Finally, a complex nested filter is simulated by randomly chunked BasicFilters that are nested in randomly chosen logical filters, which can in turn be nested within other filters, and so on. This mix is supposed to simulate complexity in the structure of filter systems and measure its effects on the performance. All filter combinations are visualized in Table 8.2. The tested documents are only required to possess a certain number of random semantic annotations, so they can be created instantly and automatically.
Test              and   or   not   random not
and plain          x
and not            x          x
and random not     x                   x
or plain                 x
or not                   x    x
or random not            x             x
mixed              x     x             x

Table 8.2: The characteristics of the performed test series
The tests were performed on a MacBook with a two-gigahertz Intel Core 2 Duo processor, two gigabytes of DDR2 SDRAM, the operating system Mac OS X 10.4.11, and Java 5. This is not a representative server system, but should be sufficient to reveal the basic runtime
behavior and possible scalability problems.
8.2.2 Test Results
The performance evaluation depends on many variables, so the results are displayed from two perspectives. The first averages the values over the respective term numbers in order to examine the relation between document number and response time. Its results are visualized in Figure 8.3.
Figure 8.3: Average response times over several test series and numbers of handled documents
The results reveal that the OrFilter test runs are not significantly influenced by the number of documents. The response time of the AndFilter test runs, on the other hand, steadily increases over the three document scales. Furthermore, those AndFilters that contain additional NotFilters are slower than those that do not, and the response time of the mixed filters also increases constantly. This behavior can be explained by the combination of two facts. Firstly, it was more likely for a proposition to be false than true, with approximately 120 successes occurring in 1000 documents. Secondly, NotFilters are an additional filter layer that always costs extra time. The first fact leads to a better performance of and plain and the inverted OrFilters, because they can frequently short-circuit their decisions. In contrast, or plain and the inverted AndFilters have to check more BasicFilters before they can return their results. However,
the increased number of iterations alone apparently does not cause any noteworthy problems. These only emerge when the iterations are multiplied by the additional execution time of a NotFilter.
Speed requirements alone can often be satisfied by simply using better server hardware, but this approach quickly stops being feasible if the response times grow exponentially. The software attribute that measures this behavior is scalability. To observe it for the number of documents, the response times are normalized to a relative response time per document. The results of this procedure are presented in Figure 8.4. They reveal that none of the test series grows faster than the linearly increasing number of documents, so it can be concluded that the system is scalable within this dimension. More than that, the relative response time decreases, which is likely caused by a relatively high initial loading time for the code, followed by efficient processing iterations.
Figure 8.4: Normalized average response times over several test series and numbers of handled documents
The second perspective of the evaluation is the number of terms used. In order to evaluate this dimension, the average values for the test runs with different numbers of documents are calculated and presented in Figure 8.5.

Figure 8.5: Average response times over several test series and numbers of terms

It can be perceived that the OrFilters are again equally fast in every test. The and plain test run rises to a saturation level and stagnates. Only those filters that include both AndFilter and NotFilter depend steadily on the number of terms. But, as the normalized presentation in Figure 8.6 shows, this growth is not exponential and hence not critical. The behavior of the mixed test is too random to provide useful clues in this dimension.

Figure 8.6: Normalized average response times over several test series and numbers of terms
8.3 Discussion
The evaluation of the complexity relevance decision reveals that the unadjusted KEA Annotator as well as the Onto Gazetteer Annotator fail to achieve the precision requirements stated in Subsection 2.2.3. However, further configuration of the minimal term occurrence variable and the additional discarding of a special class of documents, whose main category classification was performed with a weight of 1.0, lead to a test run that fulfills the precision requirement and only slightly misses the target recall value. So after adjustment, the module already performs this task almost satisfactorily. Approaches for a further improvement of its performance are suggested in Chapter 9.

According to the random samples, the main category classification performs with an approximate error rate of fifty percent. This state is unusable, so it is necessary to investigate the causes of this flaw and further improve the solution of this task.
The response time of the Semantic Filter module never grows exponentially, so it can be considered scalable. The tests revealed a clear performance difference between the runtimes of OrFilters and those of AndFilters with and without nested NotFilters; and combined with not is apparently a slow combination. Several theories for the causes of certain performance patterns were constructed, but have not yet been verified. Generally, the performance requirements of the Semantic Filter should be achievable if some further code optimization is performed and the system runs on more powerful server hardware.
CHAPTER 9
Summary and Future Work
This thesis investigated the applicability of semantic metadata for the task of utilizing social media resources in topic-specific and context-aware systems. It is embedded in the CompleXys project, which develops an adaptive information portal for the field of complexity. More precisely, this thesis implemented the two modules Semantic Content Annotator and Semantic Filter. The former uses GATE, KEA and OpenCalais to extract semantic data from incoming documents. Based on this data, it decides whether the resource is considered relevant for the topic of complexity and into which domain category it should be classified. A newly created complexity taxonomy was used as a controlled vocabulary for this process. The Semantic Filter applies the combined concept of filter iterators and propositional logic to provide a flexible access interface to the semantically indexed documents.
The evaluation of this work is split into a quality evaluation of the classification process in the Semantic Content Annotator and a performance evaluation of the time-critical Semantic Filter. The quality requirements stated in Subsection 2.2.3 demand a precision value of at least 0.7 and a recall value of at least 0.5 for the complexity classification. While several tested configurations were able to meet one of these two goals, none was able to meet both at the same time. However, the kea80+ test run exceeds the precision threshold and misses the required recall by just 0.02. It can therefore be regarded as a good starting point for further quality improvement. This top configuration is based on a surprisingly high minimal term occurrence value of eighty. It can be assumed that the success of this value was caused by the huge average text size of the scientific documents in the test set. But not all documents of the set are as large as this average, and a broader range of sources is likely to cause an even bigger variance of text sizes. Unfortunately, high occurrence values are nearly a guarantee that smaller-than-average documents will be discarded without distinction. One way to handle this is to implement the minimal term occurrence not as a constant, but as a value relative to the
size of the document. Another effect, harnessed by the kea80+ test run, was that a main classification with a one-sided weight of 1.0 towards a single category is likely to be a false success and can be rejected to increase precision. However, it has to be clear that doing so probably also decreases the recall value, because it skips all documents that contain words from only one main category. Additionally, it is possible that this approach fails for small documents, because short texts are far more likely to contain terms from just one category.
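The suggested relative minimum occurrence could be sketched as follows; the calibration rate of 80 occurrences per 10,000 words is purely illustrative and not a value from the thesis.

```java
public class RelativeThresholdSketch {

    // Scale the minimum term occurrence with the document size instead of
    // using a fixed constant such as 80.
    static int minOccurrences(int documentWords, double occurrencesPerWord) {
        // never drop below one required occurrence
        return Math.max(1, (int) Math.round(documentWords * occurrencesPerWord));
    }

    public static void main(String[] args) {
        double rate = 80.0 / 10_000;                      // assumed calibration
        System.out.println(minOccurrences(10_000, rate)); // 80, as in kea80
        System.out.println(minOccurrences(2_500, rate));  // 20 for a shorter text
    }
}
```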
A further empirical evaluation of the main category classifications revealed an error rate of approximately fifty percent. This must therefore be regarded as a current weakness of the system and should be improved. A first approach is to increase the precision of the complexity classification, because documents that are not relevant for complexity can hardly be correctly classified into a complexity category. Furthermore, many classifications are not strictly wrong, but merely choose a category that would not be the first choice of a human classifier. The interdisciplinary nature of complexity even boosts this effect, because most of the texts could plausibly be classified into more than one main category. Therefore, it is worth considering whether a multi-value classification would be a better choice than the current one.
General improvements to the quality of the Semantic Content Annotator can also be made by utilizing the document structure, for instance by increasing the term candidate weight of words with emphasizing markup like bold or headline elements. Yet another possibility is the use of additional data sources. For example, the links in a text could be loaded and analyzed too, the already annotated OpenCalais data could be applied, and the title could be searched on sites like Google, Citeulike, Technorati or Delicious to extract additional context and collaborative classification suggestions. User tags in CompleXys itself can also help to improve the classification. They cannot just subsequently refine the classification quality, but also provide feedback on certain classification decisions, which can be used for continuous training of the classifier.
The performance evaluation of the Semantic Filter revealed sufficient scalability of the filter systems. Minor flaws are likely to be compensated by powerful hardware and additionally reduced by further code optimization. Performance improvements can generally be made by decoupling the text annotations from the filter process. Up to now the basic filters iterate over all semantic annotations of a certain text to find fitting terms. This step can be sped up by separately storing the occurring terms and their occurrence numbers, which would limit the maximum number of accesses to the number of terms in the taxonomy. If this data is additionally sorted, the filters can also apply advanced search techniques to accelerate the processing. Apart from the performance,
an improvement of usefulness for information filtering purposes can be achieved by
implementing fuzzy filters, that pick documents not according to discrete criteria, but
just return the top matching candidates sorted by their additive interest probability.
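The two proposals above, a precomputed sorted term index and a fuzzy top-k filter, can be sketched as follows. The data layout, function names, and interest weights are illustrative assumptions, not the CompleXys implementation.

```python
import bisect
from collections import Counter

def build_term_index(annotations):
    """Precompute term -> occurrence count once per document and store the
    pairs sorted by term, so a filter no longer iterates over every annotation."""
    counts = Counter(a.lower() for a in annotations)
    return sorted(counts.items())  # [(term, count), ...] sorted by term

def occurrences(index, term):
    """Binary search over the sorted index instead of a linear scan."""
    i = bisect.bisect_left(index, (term,))
    if i < len(index) and index[i][0] == term:
        return index[i][1]
    return 0

def fuzzy_filter(documents, interests, k=3):
    """Fuzzy variant: no discrete accept/reject criterion; documents are
    ranked by their additive interest score and only the top k are returned."""
    def score(doc):
        index = build_term_index(doc["annotations"])
        return sum(w * occurrences(index, t) for t, w in interests.items())
    return sorted(documents, key=score, reverse=True)[:k]

docs = [
    {"id": 1, "annotations": ["complexity", "system", "system"]},
    {"id": 2, "annotations": ["network", "complexity", "complexity"]},
    {"id": 3, "annotations": ["weather"]},
]
interests = {"complexity": 0.8, "system": 0.5}  # hypothetical interest weights
print([d["id"] for d in fuzzy_filter(docs, interests, k=2)])  # → [1, 2]
```

With the index in place, each filter query touches at most one entry per taxonomy term, matching the access bound argued for above, and the top-k ranking replaces the discrete filter criterion.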
References
[1] C. Anderson. The Long Tail. Random House Business, 2006.
[2] A. Baruzzo, A. Dattolo, N. Pudota, and C. Tasso. A general framework for personalized text classification and annotation. In Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, 2009.
[3] N.J. Belkin and W.B. Croft. Information filtering and information retrieval: two
sides of the same coin? Commun. ACM, 1992.
[4] T. Berners-Lee. Information Management: A Proposal, 1989. URL http://www.w3.org/History/1989/proposal.html.
[5] T. Berners-Lee. Linked Data - Design Issues, 2006. URL http://www.w3.org/DesignIssues/LinkedData.html.
[6] K. Bittner. Use Case Modeling. Addison-Wesley Longman Publishing Co., Inc.,
2002.
[7] S. Bloehdorn, P. Cimiano, and A. Hotho. Learning ontologies to improve text clustering and classification. In From Data and Information Analysis to Knowledge Engineering: Proceedings of the 29th Annual Conference of the German Classification Society (GfKl'05), 2005.
[8] J. Bogg and R. Geyer. Complexity, science and society. Radcliffe Medical Press, Ox-
ford, 2008.
[9] U. Bojars, J.G. Breslin, et al. SIOC Core Ontology Specification, 2009. URL http://rdfs.org/sioc/spec/. Revision 1.33.
[10] G. Booch, I. Jacobson, and J. Rumbaugh. The Unified Modeling Language User Guide.
Addison-Wesley, 1999.
[11] D. Brickley and L. Miller. FOAF Vocabulary Specification 0.96, 2009. URL http://xmlns.com/foaf/spec/.
[12] Open Calais Documentation. Calais, 2009. URL http://www.opencalais.com/documentation/opencalais-documentation.
[13] P. Casoto, A. Dattolo, F. Ferrara, N. Pudota, P. Omero, and C. Tasso. Generating and sharing personal information spaces. In Proc. of the Workshop on Adaptation for the Social Web, 5th ACM Int. Conf. on Adaptive Hypermedia and Adaptive Web-Based Systems, 2008.
[14] J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In Proc. of SIGMOD, 2000.
[15] D. Crockford. The application/json Media Type for JavaScript Object Notation (JSON), 2006. URL http://tools.ietf.org/html/rfc4627.
[16] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework
and graphical development environment for robust NLP tools and applications.
In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
[17] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. The GATE User Guide,
2002. URL http://gate.ac.uk/.
[18] P.J. Denning. Electronic junk. Commun. ACM, 1982.
[19] S.J. Green. Building hypertext links by computing semantic similarity. In IEEE Transactions on Knowledge and Data Engineering, 11, 1999.
[20] R. Grishman. TIPSTER Architecture Design Document Version 2.3., 1997. URL http://www.itl.nist.gov/div894/894.02/relatedprojects/-tipster/.
[21] T.R. Gruber. A translation approach to portable ontology specifications. Knowl. Acquis., 1993.
[22] S. Handschuh and S. Staab. CREAM - creating metadata for the semantic web. Computer Networks, 2003.
[23] D. Heckmann, E. Schwarzkopf, J. Mori, D. Dengler, and A. Kröner. The user model and context ontology GUMO revisited for future Web 2.0 extensions. In C&O:RR, 2007.
[24] J. Heinz. Implementation of an approximate information filtering approach (MAPS). Bachelor thesis, Universität des Saarlandes, 2008.
[25] P. Herron. Automatic text classification of consumer health web sites using WordNet. Technical report, The University of North Carolina at Chapel Hill, 2005.
[26] A. Heß, P. Dopichaj, and C. Maaß. Multi-value classification of very short texts. In KI '08: Proceedings of the 31st Annual German Conference on Advances in Artificial Intelligence, 2008.
[27] F. Heylighen. Encyclopedia of Library and Information Sciences, chapter Complexity
and Self-organization. Marcel Dekker, 2008.
[28] J. Howe. Crowdsourcing: A definition. Crowdsourcing: Tracking the Rise of the Amateur (weblog, 2 June), 2006. URL http://crowdsourcing.typepad.com/cs/2006/06/crowdsourcing_a.html. (accessed on Jan 10, 2010).
[29] F. Iacobelli, K. Hammond, and L. Birnbaum. Makemypage: Social media meets
automatic content generation. In Proc. of ICWSM 2009, 2009.
[30] IEEE. IEEE Recommended Practice for Software Requirements Specifications, 1998. URL
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=720574.
[31] A. Isaac and E. Summers. SKOS Simple Knowledge Organization System Primer, 2009.
URL http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/.
[32] J. Ahn, P. Brusilovsky, D. He, J. Grady, and Q. Li. Personalized web exploration with task models. In WWW 2008 / Refereed Track: Browsers and User Interfaces, 2008.
[33] J. Kahan and M.R. Koivunen. Annotea: An Open RDF Infrastructure for Shared Web Annotations. In Proc. of the International World Wide Web Conference (WWW), 2001.
[34] J. Kim, D. Oard, and K. Romanik. Using implicit feedback for user modeling in
internet and intranet searching. Technical report, University of Maryland CLIS,
2000.
[35] A.B. King. Website optimization. O’Reilly, 2008.
[36] C. Lanquillon. Enhancing Text Classification to Improve Information Filtering. PhD
thesis, Fakultät für Informatik, Otto-von-Guericke-Universität Magdeburg, 2001.
[37] E. D. Liddy. Encyclopedia of Library and Information Science, chapter Natural Lan-
guage Processing. Marcel Dekker, 2003.
[38] H.W. Lie and B. Bos. Cascading Style Sheets, level 1, 1996. URL http://www.w3.org/TR/REC-CSS1/.
[39] J. Liu, L. Birnbaum, and B. Pardo. Categorizing blogger's interests based on short snippets of blog posts. In CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008.
[40] K. Mahesh. Text retrieval quality: A primer, 1999. URL http://www.oracle.com/technology/products/text/htdocs/imt_quality.htm.
[41] D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, M. Theobald, and G. Weikum. Word sense disambiguation for exploiting hierarchical thesauri in text classification. In Knowledge Discovery in Databases: PKDD 2005: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2005.
[42] D. Maynard, M. Yankova, N. Aswani, and H. Cunningham. Automatic creation
and monitoring of semantic metadata in a dynamic knowledge portal. In AIMSA,
2004.
[43] O. Medelyan. Automatic Keyphrase Indexing with a Domain-Specific Thesaurus. Master's thesis, Philologische, Philosophische Fakultät | Wirtschafts- und Verhaltenswissenschaftliche Fakultät, Albert-Ludwigs-Universität Freiburg i. Br., 2005.
[44] O. Medelyan. Human-competitive automatic topic indexing. PhD thesis, Department
of Computer Science, University of Waikato, Hamilton, New Zealand, 2009.
[45] R.A. Meyers, editor. Encyclopedia of Complexity and Systems Science. Springer, 2009.
[46] R. Mitkov. The Oxford handbook of computational linguistics. Oxford University Press,
2003.
[47] N. Nanas, A. Roeck, and M. Vavalis. What happened to content-based information filtering? In ICTIR '09: Proceedings of the 2nd International Conference on Theory of Information Retrieval, 2009.
[48] D. W. Oard and G. Marchionini. A Conceptual Framework for Text Filtering, 1996.
[49] D.W. Oard. The state of the art in text filtering. User Modeling and User-Adapted Interaction 7(3). Kluwer Academic Publishers, 1997.
[50] B. Parsia, A. Kalyanpur, and J. Golbeck. SMORE - semantic markup, ontology, and RDF editor, 2005. URL http://www.mindswap.org/papers/SMORE.pdf.
[51] I. Peacock. Showing robots the door: What is Robots Exclusion Protocol? Ariadne, 1998.
[52] T. Pellegrini and A. Blumauer. Semantic Web: Wege zur vernetzten Wissensgesellschaft. X.media.press, 2006.
[53] C. Da Costa Pereira and A. Tettamanzi. An evolutionary approach to ontology-based user model acquisition. In WILF, volume 2955 of Lecture Notes in Computer Science, 2003.
[54] A. Montejo Raez, L.A. Urena-Lopez, and R. Steinberger. Automatic Text Categorization of Documents in the High Energy Physics Domain. PhD thesis, Granada Univ., 2006.
[55] L. Razmerita, S. Antipolis, G. Gouardères, E. Conté, and M. Saber. Ontology based
user modeling for personalization of grid learning services. In ELeGI Conference,
2005.
[56] D. Rosen and C. Nelson. Web 2.0: A new generation of learners and education. Computers in the Schools, 2008.
[57] S. Schmidt. PHP Design Patterns. O’Reilly, 2006.
[58] A.V. Smirnov and A.A. Krizhanovsky. Information filtering based on wiki index
database. CoRR, 2007.
[59] T. Berners-Lee. Semantic Web Road map, 1998. URL http://www.w3.org/DesignIssues/Semantic.html.
[60] T. Berners-Lee and D. Connolly. Hypertext Markup Language (HTML) - A Representation of Textual Information and MetaInformation for Retrieval and Interchange, 1993. URL http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt.
[61] K. Thomas. OpenCalais whitepaper, 2009. URL http://www.slideshare.net/KristaThomas/simple-opencalais-whitepaper.
[62] A.M. Turing. Computing machinery and intelligence. Mind, 1950.
[63] I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, and C.G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In Proceedings of the 4th ACM Conference on Digital Libraries, 1999.
[64] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann, 2005.
[65] B. Yang and G. Jeh. Retroactive answering of search queries. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, 2006.
[66] C. Zimmer. Approximate Information Filtering in Structured Peer-to-Peer Networks.
PhD thesis, Universität des Saarlandes, 2008.
List of Figures
1.1 The Semantic Web layers
2.1 The relationships between actors and use cases
2.2 The CompleXys overview schema
3.1 The basic microformats schema
4.1 The SIOC main classes in relation
4.2 The APIs which form the GATE architecture
4.3 A data model diagram for GATE's corpus layer
4.4 An exemplary SKOS taxonomy
4.5 The KEA algorithm diagram together with KEA++
4.6 Input and output data of the OpenCalais web service
4.7 An example application of linked data
6.1 An excerpt of the CompleXys taxonomy
6.2 The CompleXysTask principle
6.3 The Semantic Content Annotator Pipeline
7.1 An example for the application of propositional logic in filtering queries
8.1 The distribution of the quality test runs across the dimensions precision and recall
8.2 The correlation of main category weight and precision within the top test kea80
8.3 Average response times over several test series and numbers of handled documents
8.4 Normalized average response times over several test series and numbers of handled documents
8.5 Average response times over several test series and numbers of terms
8.6 Normalized average response times over several test series and numbers of terms
![Page 101: An approach for semantic enrichment of social media ... · mantic enrichment module. Its purpose is to extract and provide semantic data for each input document. This semantic data](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed8b83d6714ca7f476871fe/html5/thumbnails/101.jpg)
List of Tables
5.1 Examples of information seeking processes [48]
8.1 Average precision and number of documents within the top test kea80, clustered by main category weight ranges with at least 10 documents
8.2 The characteristics of the performed test series
Author’s Statement
I hereby certify that I have prepared this diploma thesis independently, and that only those sources, aids, and advisors that are duly noted herein have been used and/or consulted.
January 26, 2010
Oliver Schimratzki