© EEXCESS consortium: all rights reserved
EEXCESS
Enhancing Europe’s eXchange in Cultural Educational
and Scientific reSources
Deliverable D3.4
Final Federated Recommender Prototype
Identifier: EEXCESS-D3.4-Final-Federated-Recommender-Prototype
Deliverable number: D3.4
Author(s) and company: Hermann Ziak (Know-Center)
Heimo Gursch (Know-Center)
Roman Kern (Know-Center)
Davide Magatti (Mendeley)
Internal reviewers: Christin Seifert (Uni-Passau)
Michael Granitzer (Uni-Passau)
Work package / task: WP3, D3.4
Document status: Final
Confidentiality: Public
Version 2016-05-31
History
Version Date Reason of change
1 2016-05-15 First draft of document created
2 2016-05-18 Initial proof reading
3 2016-05-20 Internal Review
4 2016-05-30 Incorporate changes from internal review
5 2016-05-31 Finalised Document
Impressum
Full project title: Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources
Grant Agreement No: 600601
Workpackage Leader: Know-Center
Project Co-ordinator: Silvia Russegger, Jr-DIG
Scientific Project Leader: Michael Granitzer, Uni-Passau
Acknowledgement: The research leading to these results has received funding from the European Union's
Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 600601.
Disclaimer: This document does not represent the opinion of the European Community, and the European
Community is not responsible for any use that might be made of its content.
This document contains material, which is the copyright of certain EEXCESS consortium parties, and may not be
reproduced or copied without permission. All EEXCESS consortium parties have agreed to full publication of
this document. The commercial use of any information contained in this document may require a license from
the proprietor of that information.
Neither the EEXCESS consortium as a whole, nor a certain party of the EEXCESS consortium warrant that the
information contained in this document is capable of use, nor that use of the information is free from risk, and
does not accept any liability for loss or damage suffered by any person using this information.
Table of Contents
1 Executive Summary
2 Introduction
2.1 Purpose of this Document
2.2 Scope of this Document
2.3 Status of this Document
2.4 Related Documents
2.5 Relation to the Research Roadmap
2.6 Structure of this Document
3 Federated Recommender
3.1 Final System Architecture
3.2 Modules of the Federated Recommender
3.2.1 Source Selection
3.2.2 Query Processing
3.2.3 Partner Recommender
3.2.4 Result Processing
3.2.5 Result Ranking
3.3 Final Federated Recommender API
3.3.1 API Service Calls
3.3.2 Recent Changes in the API Format
3.3.3 Secure User Profile
4 Published Evaluation Results
4.1 International Conference on Theory and Practice of Digital Libraries (TPDL 2016) – Accepted as Poster
4.2 International Conference on Information and Knowledge Management (CIKM 2016) – Under Submission
4.3 Conference and Labs of the Evaluation Forum (CLEF 2016) – Under Submission
4.4 Social Book Search Lab – Accepted
4.5 International Workshop on Text-based Information Retrieval (TIR 2016) – Under Submission
4.6 European Conference on Knowledge Management (ECKM 2016) – Accepted
5 PartnerWizard
5.1 Query Configuration
5.2 Query Generator Testing
5.3 Deployment
6 System testing and performance evaluation
6.1 Scenario 1
6.2 Scenario 2
6.3 Scenario 3
6.4 Scenario 4
7 Conclusions on the Federated Recommender
8 Narrative Path
8.1 Experiment 1: Mining narrative paths from survey papers
8.2 Ground truth dataset construction and testing
8.2.1 Noise filtering
8.2.2 What can be done next on the evaluation dataset?
8.3 Experiment 2: Mining Narrative Paths from Mendeley reading logs
8.3.1 Generating the Markov chain
8.3.2 Developing the client application
8.3.3 Limitations
8.3.4 Evaluation
8.3.5 How is this different from the previous narrative paths bookmarklet?
8.3.6 Possible improvements & Future work
8.4 Conclusions
9 References
10 Glossary
1 Executive Summary
This document describes the technical details and evaluation results of the final prototype of the EEXCESS
Federated Recommender. In particular, it gives an overview of the recent changes within the Federated
Recommender implementation, hosted on GitHub1, and the ongoing research within work package 3 to go
beyond the state of the art. This deliverable covers the Federated Recommender as well as the work on
Narrative Paths.
The main task of the Federated Recommender is to provide an infrastructure to distribute requests from the
front ends to the partners and return the aggregated results. The initial request is incorporated in the Secure
User Profile, together with additional contextual information about the users and their preferences. This
contextual information and these preferences can be extracted and provided automatically by the front end or
changed and entered by the users themselves (e.g. spoken language, age, media type preferences). Since every
partner has its own way of processing queries and returning results, the architecture is designed so that the
main tasks of processing and distributing the requests happen within the Federated Recommender, while the
partner-specific processing takes place within the so-called Partner Recommender.
These Partner Recommenders are individually tailored to each partner to i) translate the Secure User Profile
into a query the partner is able to process, additionally supporting individual features (e.g. filter queries),
ii) work around individual constraints and finally iii) translate the results into the internal format of the
system with the corresponding metadata, including enrichment (e.g. for partners returning not a list of
documents but a list of document IDs, or imposing access restrictions). The Partner Recommenders are
responsible for registering themselves with the Federated Recommender component to become part of the
EEXCESS system. For this, they also need to provide additional information about their partner system (e.g.
location preferences, the appropriate age of the target user group for the partner's content). To ensure that
the Federated Recommender knows about their presence even after restarts or system failures, these
registration calls are repeated periodically. As a consequence, this architecture allows the system to be
distributed over several servers while at the same time enabling the content providers to remain in full control
of their content.
In recent months the Federated Recommender has proven its stability, reliability and universal applicability in
three different deployments: the demo and testing server of KNOW, the development server of JR-DIG for
unified testing of all components, and the stable server of JR-DIG for the release candidates. While KNOW uses
Linux as the server operating system, JR-DIG uses Windows. The robustness and versatility of the EEXCESS
architecture was demonstrated by assigning different quality-of-service levels to the partners. These levels
indicate whether a partner is an additional resource in the EEXCESS eco-system or a specialised partner system
for testing purposes. Furthermore, tests showed that the remaining performance issues of the system are
mainly caused by the response latency of the partner systems.
1 https://github.com/EEXCESS/
In the following the tasks of the Federated Recommender are summarised:
● Parsing and analysing of the received Secure User Profile, e.g. loading of aggregation algorithms,
language detection
● Source selection based on several parameters and textual features to restrict the set of partners to a
well-matching subset, e.g. via the language either provided or extracted from the query, the user's age,
special fields, or the covered domains
● Query pre-processing, e.g. grouping of the query, introduction of diversity and serendipity
● Partner Recommender management and query distribution, e.g. partner registration, partner access
restriction
● Result filtering, e.g. de-duplication, language detection
● Result aggregation, e.g. preference-dependent aggregation, textual-feature-based aggregation
The tasks of the Partner Recommender component are:
● Analyse the Secure User Profile as generated by the Federated Recommender
● Selection of the best matching query reformulation strategy for the partner system
● Creating and issuing the call to the partner system, handling all necessary networking operations
● Parsing and processing the response as sent by the partner system, involving the components
developed by WP4 to translate the response and the contained meta-data into the EEXCESS format
● Reporting back the partners result to the Federated Recommender component
The tasks associated with the Narrative Path component are:
● Given a single resource, find matching resources which, together with the original resource, represent
a sequence for consumption
● Allow different functions for the sequence building, to enable research on which parts of the
associated information can be exploited to form narrative paths
● Conduct studies on the usefulness of narrative paths for users
2 Introduction
2.1 Purpose of this Document
This document is the final deliverable of work package 3 (WP3) of the EEXCESS project. In this document all
generated results and achievements are described. Many aspects mentioned in this document have already
been reported in previous deliverables. To give a comprehensive overview of all WP3 proceedings,
achievements already reported earlier are briefly mentioned.
The EEXCESS framework is an open and extensible Federated Recommender system focused on cultural and
scientific resources from the Web. Although the initial set of content sources is pre-set, additional partner
systems can easily be added to the EEXCESS framework as new content sources. EEXCESS offers a unified entry
point for all connected partner systems to support the automatic recommendation of resources to users. The
recommendations are based on the short- and long-term user profile as established by the EEXCESS
recommender system. Figure 1 gives an overview of the main components of the final prototype. To achieve a
high degree of scalability, the EEXCESS architecture is designed in a distributed manner. Hence, the different
parts of the EEXCESS system shown in Figure 1 are designed to be distributed across multiple machines and
are connected via the Internet.
Figure 1: A typical configuration of the framework includes a client, a Federated Recommender, and several partner
systems.
2.2 Scope of this Document
The core part of this document is the description of the final prototype developed in the EEXCESS project,
focusing on the so-called Federated Recommender. Additionally, parts of the partner system connectors, called
Partner Recommenders, have also been part of WP3, for example the query formulation, which uses specialised
partner API features (e.g. a detail search call), and the automated registration of a Partner Recommender at
the Federated Recommender. Other aspects of the partner connections were developed by JR-DIG and reported
in the corresponding deliverable. The PartnerWizard has been developed jointly by KNOW and JR-DIG,
representing the two work packages. With the PartnerWizard, potential partners can join the EEXCESS
ecosystem without the need to develop a Partner Recommender on their own, i.e. no programming skills are
necessary. Instead, they are guided through the configuration of a Partner Recommender by a Web GUI.
2.3 Status of this Document
This is the final version of D3.4.
2.4 Related Documents
This document represents the final deliverable of Work Package 3. Some of the covered content is related to
the preceding documents (D3.1, D3.2 and D3.3). Some parts of the PartnerWizard are documented in
deliverable D4.3. The demonstrator of the Federated Recommender is also described in deliverable D7.6.
2.5 Relation to the Research Roadmap
In the preceding deliverables a number of development issues and a research roadmap were presented. The
following steps were taken:
● The integration of the user's interests into the query processing has been implemented and evaluated.
● A block ranking approach has been implemented and made available via a dedicated demonstrator; an
evaluation strategy via crowdsourcing has been designed and internally validated; the actual
evaluation was conducted on a crowdsourcing platform and the results are under submission.
● The separate components of the PartnerWizard (joint work of KNOW and JR-DIG within WP3 and
WP4) were merged and finalised. In the course of this work, further modifications of the
recommender were necessary.
● The final prototype of the PartnerWizard was presented to the public at the International Science 2.0
Conference and the EEXCESS Final Conference.
● In regard to query splitting, we developed the planned approach and evaluated its performance. The
results of this evaluation are currently under submission.
● Work on the deeper integration and analysis of the partner systems was continued and the methods
for query formulation were adapted accordingly.
● In terms of personalisation, a new result aggregation approach was implemented, making it possible
to boost results of certain kinds (e.g. media types, nature of the license) towards the top of the
recommendation list. The according parameters can either be learned from the user's behaviour by
the front end or defined by the users themselves. To support this approach, format changes needed to
be undertaken.
● In regard to source selection, apart from the already implemented methods like selection based on
age, language or special features, the existing algorithms for domain-based source selection were
further refined. Numerous aspects of our source selection approach were thoroughly evaluated. The
results of this evaluation are currently under submission.
● We introduced a Wikipedia-for-Schools-based domain detection approach for content-based source
selection, complementing the WordNet Domains approach.
● To support the newly integrated features in the Federated Recommender framework, the dedicated
demonstrator of KNOW was updated.
● The FedWeb dataset of the TREC conference was introduced into the EEXCESS system for evaluation
purposes.
● The general system performance was evaluated and further improvements to the Federated
Recommender were undertaken.
2.6 Structure of this Document
In the first chapter of the document the developed Federated Recommender, its architecture, and its interfaces
are described. The organisation of this section closely follows the papers (conference and workshop
contributions) presented in the subsequent chapter. Next, the PartnerWizard is described, followed by a
detailed analysis of the performance evaluations profiling the Federated Recommender. In the final section the
project results are
recapped. This section elaborates on lessons learned, current limitations and future use and exploitation
strategies. The second chapter summarises the work done on Narrative Paths.
3 Federated Recommender
3.1 Final System Architecture
In this section the complete system architecture and API of the Federated Recommender are presented. The
Federated Recommender consists of a number of individual modules that closely interact with each other.
Figure 2 shows the modules of the Federated Recommender and the information flow between them. The user
context, encoded into a Secure User Profile, serves as input for each recommendation process. It can therefore
be interpreted as a query modelling the information need of a (human) user. Each recommendation process
starts with the source selection and the query processing. While the source selection decides on the ideal
content providers for answering a query, the query processing transforms the incoming query into a form ideal
for generating recommendations. The processed query is then sent to the subset of sources determined by the
source selection process. This step is conducted by the individual Partner Recommender components. Each
queried source responds with a result list. Due to possible downtimes or high traffic, the Partner
Recommenders must account for the fact that a source might not respond immediately. Therefore, the Partner
Recommenders only wait for a configurable timeout (default: five seconds) for the partner systems to
respond. All non-empty responses arriving within this time frame are then included in the final result list. The
result processing component extracts information from the result lists for further processing. This information
is then exploited to steer the result re-ranking. The result list undergoes a de-duplication process to avoid
situations where a user is confronted with a list of seemingly identical results. The outcome of this processing
is a personalised result list, which is the answer to the initial query and can be presented to the user. Each of
the mentioned modules will be explained in further detail in dedicated sections. Moreover, the API to
communicate with the Federated Recommender will be described in the remainder of this chapter.
Figure 2: Architecture of the EEXCESS Federated Recommender. The arrows denote the information flow starting with
the incoming query at the top and ending with the results at the bottom of the diagram. The user context encoded into a
Secure User Profile serves as an input for each recommendation process.
3.2 Modules of the Federated Recommender
3.2.1 Source Selection
Source selection refers to a method to restrict the sources being used within a federated setting to a certain
subset. Different criteria may steer the decision which sources are kept. Depending on the query and the
available sources, one, several or all sources might prove to be eligible. The benefit of selecting only a subset of
all available sources is twofold. Firstly, the perceived quality of the returned results may be better since
non-fitting material is excluded. Secondly, the fewer sources are queried, the more responsive the system will be,
helping to achieve an overall higher usability and acceptance of the system. All source selection strategies rely
on the Secure User Profile and the description of the sources as provided by the Partner Recommenders.
Different aspects of the Secure User Profile are evaluated consecutively, resulting in a filtering pipeline applying
the source selection filters one after another. Our implementation also allows filters to be applied multiple
times based on different aspects. The implemented source selection strategies can be grouped into
the following categories:
● Language: Each source might contain content in a different language; some sources might even
provide content for multiple languages. Users cannot be expected to understand all languages present
in the sources. If the preferred language is contained in the Secure User Profile, this information will
be used to select content in the specified language. If the user has not set any preferred language, an
automatic language detection of the query is done. The detected query language is then used to select
content in the same language. The language of the content is part of its metadata. Depending on the
source, this is specified for all documents in the source or for each document individually.
● Age: Content from various partners might be suitable for different age groups. For instance, scientific
papers and reports might not be suitable for children and teenagers. Each partner can describe the
target age group of the provided content. Users who also provide their age as part of the Secure User
Profile will then get recommendations according to matching age groups. Users are not required to
specify their age, disclosing this information remains completely voluntary.
● Category: For each query the topical categories it covers are detected. Similarly, the categories
provided by a content source are detected. For each query, the content sources with the best match in
categories are selected. The categorisation of sources based on the domains they cover is done
automatically: a large number of queries is sent to each source, the response of the source is
analysed for each query, and based on this analysis the categories are assigned. This is a complex and
computationally intensive task. Therefore, it is only repeated in adjustable intervals, when the sources
might have changed. The same algorithm used to judge the responses of the partners is
also used to categorise each query when it is issued. The categories of a query are then matched to
the categories of the connected and categorised content partners to decide to which partners the query
is sent.
● Time Range: Some partners allow a filtering based on a creation or modification date of content. A
query can also include a desired time range. For all sources supporting a time-based filtering this
information will be passed on from the query to the source to select the appropriate content.
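The consecutive evaluation of profile aspects described above can be illustrated as a filter pipeline. The following Python sketch is purely illustrative; the filter and field names (e.g. `language_filter`, `min_age`) are assumptions for this example and do not mirror the project's actual code:

```python
# Illustrative sketch of the source-selection filter pipeline:
# each filter inspects one aspect of the Secure User Profile and
# narrows down the set of eligible partner systems.

def language_filter(profile, partners):
    lang = profile.get("language")
    if lang is None:
        return partners  # no preference: keep all partners
    return [p for p in partners if lang in p["languages"]]

def age_filter(profile, partners):
    age = profile.get("age")
    if age is None:
        return partners  # disclosing the age remains voluntary
    return [p for p in partners if p["min_age"] <= age]

def select_sources(profile, partners, filters=(language_filter, age_filter)):
    """Apply the filters one after another; each filter works on the
    output of the previous one (a filtering pipeline)."""
    for f in filters:
        partners = f(profile, partners)
    return partners

partners = [
    {"name": "A", "languages": {"en", "de"}, "min_age": 0},
    {"name": "B", "languages": {"fr"}, "min_age": 0},
    {"name": "C", "languages": {"en"}, "min_age": 16},
]
selected = select_sources({"language": "en", "age": 12}, partners)
# keeps only partner "A": "B" fails the language filter, "C" the age filter
```

A filter can also appear more than once in the `filters` tuple, matching the ability mentioned above to apply a filter multiple times based on different aspects.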
3.2.2 Query Processing
Partners in the EEXCESS ecosystem can have very different implementations and search algorithms running to
look through their content. Some of these algorithms work well with long queries (i.e. many keywords), some
work well with short queries (i.e. few keywords), and some can work with queries of any length. Depending on
the partner system characteristics and the issued query, the query might need to be extended or split up into
sub queries.
● Query expansion: To expand and diversify queries, a pseudo-relevance feedback approach is
employed. The initial query is used to select results from an index filled with individual paragraphs from
Wikipedia articles. In the next step, keywords are extracted from the retrieved Wikipedia
paragraphs. These keywords are appended to the original query by a disjunction, i.e. a logical OR.
● Query splitting: Long queries are split up into sub-queries for partners not handling long queries well.
Each keyword is represented as a vector based on the Word2Vec database [Mikolov, 2013]. To split up
the query, the dot product of the vectors is used as similarity measure to re-group the keywords into
shorter queries.
● Query diversification: Additional keywords representing the user’s interest are added to the query as
a conjunction, i.e. a logical AND. The user interests are taken from the interests specified in the Secure
User Profile.
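The query splitting step can be illustrated with toy vectors. The sketch below replaces the Word2Vec vectors [Mikolov, 2013] with hand-made two-dimensional ones and uses a simple greedy grouping; both are assumptions made for this illustration, not the project's actual algorithm:

```python
# Toy sketch of query splitting: keywords whose vectors are similar
# (dot product above a threshold) are grouped into the same sub-query.
# The 2-d vectors below stand in for real Word2Vec embeddings.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def split_query(keywords, vectors, threshold=0.5):
    groups = []  # each group becomes one shorter sub-query
    for kw in keywords:
        for group in groups:
            # join the first group containing a sufficiently similar keyword
            if any(dot(vectors[kw], vectors[g]) >= threshold for g in group):
                group.append(kw)
                break
        else:
            groups.append([kw])  # no similar group found: start a new one
    return groups

vectors = {
    "piano": (1.0, 0.0), "violin": (0.9, 0.1),    # music-related
    "newton": (0.0, 1.0), "gravity": (0.1, 0.9),  # physics-related
}
subqueries = split_query(["piano", "newton", "violin", "gravity"], vectors)
# → [['piano', 'violin'], ['newton', 'gravity']]
```

Each resulting group is then issued as a separate short query against partners that do not handle long queries well.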
3.2.3 Partner Recommender
The Partner Recommender handles the communication with each partner system. For each partner a dedicated
Partner Recommender exists, representing the partner's content in the EEXCESS ecosystem. The Partner
Recommender can be seen as a translator between the EEXCESS query and result list formats and the formats
used by the partner system. The effort to develop a new Partner Recommender mainly depends on the API
provided by the partner system. For partners providing an HTTP GET call to issue a query and returning their
results in JSON or XML format, a new Partner Recommender can be created with the PartnerWizard without
the need for dedicated programming. If the partner system offers an API not compliant with the
aforementioned query and response formats, a new Partner Recommender can be created by re-implementing
selected parts of the Partner Recommender reference implementation. This might be needed if a partner
system does not deliver the results in a self-contained result list but in a result list only containing result
identifiers. To get the individual results from such an API, a separate call for each result is needed.
The Partner Recommenders are designed to run on any Apache Tomcat server. This allows them to be
run on the same machine as the Federated Recommender or on other dedicated machines. The decision for
one of these possible setups is usually influenced by the expected number of users and therefore the number
of concurrent user requests the system should be able to process. When a query from a user is processed, all
Partner Recommenders are queried in parallel. Hence, the waiting period for the user is determined by the
slowest response time of all sources and not by the accumulated response times.
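The parallel querying with a timeout, as described in Section 3.1, can be sketched with Python's `concurrent.futures`. The partner names, latencies and the shortened timeout below are simulated values for this illustration; the real system issues HTTP calls and uses a five-second default timeout:

```python
# Sketch of parallel partner querying with a timeout. All Partner
# Recommenders are contacted at once; responses that do not arrive
# within the timeout are left out of the result list.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout

TIMEOUT_S = 0.5  # shortened here; the real default is five seconds

def query_partner(name, latency):
    time.sleep(latency)  # stands in for the network round trip
    return [f"{name}-result-1", f"{name}-result-2"]

def query_all(partners):
    results = []
    pool = ThreadPoolExecutor(max_workers=len(partners))
    futures = [pool.submit(query_partner, n, lat) for n, lat in partners]
    try:
        for fut in as_completed(futures, timeout=TIMEOUT_S):
            results.extend(fut.result())
    except FuturesTimeout:
        pass  # slow partners are simply dropped
    pool.shutdown(wait=False)  # do not wait for stragglers
    return results

results = query_all([("Europeana", 0.05), ("Mendeley", 0.1), ("SlowPartner", 2.0)])
# the waiting period equals the slowest answering partner, capped by the timeout
```

Because all partners are contacted concurrently, the total waiting time is bounded by the timeout rather than by the sum of the individual response times.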
3.2.4 Result Processing
The result processing serves as a pre-processing step for the following result ranking and result list
combination. The result processing extracts various features from the result lists. Which features are used
for the result ranking depends on the result ranking implementation used to answer the query. One of these
features is the query-result distance, calculated for the result list of every partner source. This distance is
calculated as the number of keywords common to the query and the results divided by the number of all
keywords in the query and the results.
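The described ratio can be sketched as follows (a minimal illustration; the function name is ours and, strictly speaking, the formula yields a similarity rather than a distance):

```python
def query_result_distance(query_terms, result_terms):
    """Number of keywords common to query and result divided by the
    number of all distinct keywords in query and result."""
    q, r = set(query_terms), set(result_terms)
    if not q | r:
        return 0.0  # no keywords at all on either side
    return len(q & r) / len(q | r)
```

With this reading, identical keyword sets score 1.0 and disjoint sets score 0.0.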
3.2.5 Result Ranking
The result ranking module finally combines all separate result lists into a consolidated result list sent to the
user. Different implementations are available for the result recombination. Each of the implementations first
executes a de-duplication of all the entries in the result lists. A fuzzy hash is calculated over the title and the
textual description of each result. If the fuzzy hashes of two results are similar, the two results are treated as
identical. After the de-duplication, different result aggregation strategies can be selected via settings in the
Secure User Profile. The simplest aggregation strategy is round robin, where one result is taken from each list
one after another. This can also be done in a weighted manner, to favour results from one source over the
others. Another aggregation strategy operates in the vector space, where the query and each result are
represented as vectors. The entries in the vectors represent the keywords and metadata of the query and the
results, respectively. The similarity of the results with the query is calculated as the dot product of the vector
representing the query and the vector of each result. The final result list contains the individual results
ordered by descending similarity with the query.
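The two simplest strategies can be sketched as follows (an illustrative reading of the description above, not the project's actual implementation; results and vectors are represented by plain Python lists):

```python
from itertools import zip_longest

def round_robin_merge(result_lists):
    """Take one result from each partner list in turn (unweighted)."""
    merged = []
    for group in zip_longest(*result_lists):
        merged.extend(item for item in group if item is not None)
    return merged

def rank_by_dot_product(query_vec, results):
    """results: (result_id, vector) pairs whose vector dimensions are
    the keywords/metadata shared with the query vector."""
    def score(vec):
        return sum(q * v for q, v in zip(query_vec, vec))
    return sorted(results, key=lambda r: score(r[1]), reverse=True)
```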
Apart from the simple result list format, a blocked result list format is also available. In this case, the result list is
divided into blocks, each one being optimised according to a different optimisation criterion. Three types of
optimisation criteria are available: the first one is an optimisation for diverse results, the second one is an
optimisation for high serendipity in the results, and the third option marks the standard setting represented by
the unaltered query. For each criterion an optimised query is sent to the sources. This means that the effort to
generate a blocked result list is three times higher compared to a normal result list. The final result list is
divided into blocks: the first block holds results from the un-optimised query, the second block results from the
query optimised for diversity, and the third block results from the query optimised for serendipity.
3.3 Final Federated Recommender API
Within this section the final API of the Federated Recommender is outlined. Readers interested in more
detailed information are invited to read the documentation on GitHub2.
3.3.1 API Service Calls
Within this section we describe the Federated Recommender API calls. These calls provide all the services
available to the frontends or needed by the Partner Recommenders to communicate with the Federated
Recommender. Detailed information about the individual calls can also be found on GitHub3. All these calls are
accessible via the general recommender service path and use JSON as the exchange format:
http://{SERVER}/eexcess-federated-recommender-web-service-1.0-SNAPSHOT/recommender
Recommend
The recommend-call accepts a Secure User Profile as input and returns the according result list. The documents
within the list are collected from the registered or selected partners and are aggregated by the default or by
the selected aggregation algorithm.
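A recommend request could be assembled as in the following sketch; the profile field names (numResults, contextKeywords) are assumptions based on the wiki documentation and may differ in detail:

```python
import json
from urllib import request

# Illustrative Secure User Profile fragment; see the wiki for the
# authoritative field list.
profile = {
    "numResults": 10,
    "contextKeywords": [{"text": "dinosaur"}, {"text": "t-rex"}],
}

def build_recommend_request(server):
    """Build the HTTP request for the recommend call."""
    url = ("http://%s/eexcess-federated-recommender-web-service-1.0-"
           "SNAPSHOT/recommender/recommend" % server)
    body = json.dumps(profile).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})
```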
Get Details
The experience of the last year regarding system performance led us to the decision to split the recommend
call into two sub-calls. The recommend call now returns only the most important fields for the initial query.
More detailed information about the recommended objects can then be retrieved by sending a list of document
IDs.
Get Partner Favicon
The get partner favicon call was introduced to give the frontends the possibility to show the user visual
information about which partner a result was returned from.
2 https://github.com/EEXCESS/eexcess/wiki
3 https://github.com/EEXCESS/eexcess/wiki/Federated-Recommender-Service
The link to this favicon is sent to the Federated Recommender within the partner badge when the partner is
registered. The image is then retrieved by the Federated Recommender and stored within the system. The
frontend has to send the according partner ID as parameter to the Federated Recommender to get the correct
favicon.
Get Preview Image
The get preview image call serves a similar purpose to the get partner favicon call.
On request it serves a preview image of the media type specified by the frontend. The idea behind this call was
to have a uniform presentation of the preview images over all frontends.
Get Recommender Stats
The get recommender stats call returns brief information about the current status of the Federated
Recommender. It shows the average total time the system took to distribute the request and gather the
results over the last ten calls, and also the average time the selected result aggregation algorithm needed to
generate the final result list.
Register
The register call is the interface for the Partner Recommender to register itself at the Federated
Recommender. To register itself, the partner has to provide the so-called Partner Badge with all the needed
information. This call also functions as a heartbeat: in a fixed interval, the partner's registration thread resends
its registration information to assure that the Federated Recommender knows about its status.
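The register-as-heartbeat behaviour could look roughly like the following sketch; the badge field names are illustrative, not the authoritative Partner Badge schema:

```python
import threading

def make_partner_badge(partner_id, endpoint, favicon_url):
    """Minimal illustrative partner badge."""
    return {"systemId": partner_id,
            "endpoint": endpoint,
            "favIconURI": favicon_url}

class RegistrationHeartbeat:
    """Sends the registration once, then resends it in a fixed interval
    so the Federated Recommender can treat a missing heartbeat as an
    unavailable partner."""

    def __init__(self, send_registration, interval_seconds=60.0):
        self._send = send_registration
        self._interval = interval_seconds
        self._timer = None

    def start(self):
        self._send()  # initial (re-)registration
        self._timer = threading.Timer(self._interval, self.start)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):  # call before sending the unregister request
        if self._timer is not None:
            self._timer.cancel()
```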
Unregister
The unregister call is used by the Partner Recommender when it is shut down to tell the Federated
Recommender that it is not available anymore.
Get Registered Partners
The get registered partners call has two purposes. First, it returns all the partners currently registered in the
system together with the according information. Second, it also provides information about the status of each
partner, including response time and the number of failed requests since the initial registration of the partner.
Here the system also distinguishes between timeouts and actual system failures.
3.3.2 Recent Changes in the API Format
The protocols between the Federated Recommender and the frontends were already optimised as part of the
work undertaken on the second Federated Recommender Prototype. Since then, only slight adaptations had to
be made to support new features.
The extensive description of each field within the Secure User Profile, the Result List, the Details Query and
Details Response is documented on Github4.
In the following, the adaptations that had to be applied to the Secure User Profile are described.
4 https://github.com/EEXCESS/eexcess/wiki/Request-and-Response-format
3.3.3 Secure User Profile
One of the key changes regarding the Secure User Profile is the introduction of the user content type
preferences. This feature should further support contextualisation and personalisation. Within this preferences
field it is possible to state weightings for certain types of content. It can be utilised either by the users
themselves, via a context menu within the frontend, to state which type of content (e.g. image, text, openly
licenced) they prefer, or it can be learned automatically by the frontend from the items the user clicked.
Depending on the underlying aggregation algorithm, this information can either be used to boost results that
match the given content type or even to filter out results that do not match the preferences.
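Both modes of using the preferences can be sketched as follows (the field names and threshold semantics are our assumptions for illustration):

```python
def apply_content_preferences(results, weights, filter_threshold=None):
    """Boost each result's score by the weight the user assigned to its
    content type; optionally drop results whose weight falls below a
    threshold (filtering mode)."""
    processed = []
    for result in results:
        weight = weights.get(result["type"], 1.0)  # neutral default
        if filter_threshold is not None and weight < filter_threshold:
            continue  # disfavoured content type is filtered out
        processed.append(dict(result, score=result["score"] * weight))
    return sorted(processed, key=lambda r: r["score"], reverse=True)
```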
In the current implementation the defined fields cover pictures, text, videos, items with an unrestricted licence
and expert-level items. The latter corresponds to a field that can be set within the configuration of the partner.
This field is designated for partners that have highly specialised content, like Mendeley or ZBW, which is only
partially suitable for a broad audience.
The results of these changes can be tested on the dedicated EEXCESS demo website5 of KNOW and are
demonstrated in the following figures.
5 http://eexcess-demo.know-center.tugraz.at/v3/#/system-demo
Figure 3: No user preferences are given; the algorithm uses its natural weighting scheme, based on the occurrence of
the query terms in the textual content and on the position in the original ranking.
Figure 4: The user indicated, or the frontend learned from the user's history, that a high share of pictures in the final
result list is preferable.
Figure 5: The user is an expert in the field and should therefore be recommended items from more specialised sources
like ZBW or Mendeley, if available.
4 Published Evaluation Results
Within this section WP3 presents the latest dissemination activities with regard to the Federated Recommender.
The presented work covers the traditional topics of federated retrieval systems, namely i) resource
representation, ii) resource selection and iii) result aggregation, as well as topics that arise in the real-world
application of such systems [Lu, 2005].
Therefore, the following research questions were examined:
- Can traditional methods of resource representation be improved by preferring ambiguous terms for probing?
- Does the WordNet Domains based resource selection approach yield meaningful results?
- To what extent does a partner have to be probed so that the knowledge-based domain mapping representation approach yields stable results?
- To what degree can small niche sources be integrated into a market-share-based resource selection approach without affecting typical evaluation measures?
- Which kind of approach suits the purpose of integrating diverse and serendipitous results into an aggregated result list, according to the opinion of users?
- Can contextual data of users be helpful in the process of re-ranking through the usage of the documents' metadata?
- Can knowledge-based methods be applied for topic separation of query terms?
- What are the major factors with regard to response time for federated retrieval systems?
In the following sections we present our findings on these questions, organised according to our publications.
4.1 International Conference on Theory and Practice of Digital Libraries (TPDL
2016) – Accepted as Poster
Search engines typically keep their data in an index which is continuously updated by a crawler. This crawler
gathers, analyses and finally saves the results in the index.
From the resulting statistical information, the ranking scheme for potential result documents can be
calculated.
In comparison, in a federated search or recommender setting, where the query is forwarded to a
number of attached search engines, the content of the collections is unknown. This also concerns key
statistics like term frequencies, which are usually used for ranking the documents in algorithms like
Okapi BM25. Literature in the field of federated search refers to this problem as the resource
representation problem in an uncooperative environment. The most prominent proposed solution to this
problem is based on the idea of approximating the statistics of the dataset by retrieving sample documents
through querying the source, hence the name “query-based sampling” (QBS) [Callan, 2001]. Here the stopping
criterion is often defined by the number of queries sent or the number of documents sampled. Although QBS
has been analysed and refined in a wide variety of settings, the question whether ambiguous samples help to
establish good coverage of the sources in question has remained unexplored.
Literature suggests that random words from an English dictionary are not able to create a distinctive enough
representation. Our hypothesis is that ambiguous nouns should require fewer requests to create a
representation which can be exploited later on for source selection.
To conduct this evaluation, the FedWeb Greatest Hits dataset [Demeester, 2015] was used. This dataset is
composed of results of 150 specialized and unspecialized search engines. From these search engines between
20,000 and 80,000 documents were retrieved, giving a total number of over 5 million documents. Our
implemented query selection method used WordNet as a dictionary to select four different kinds of potential
query terms.
- Factotum queries: Within WordNet synsets, terms that do not belong to a specific domain but can appear in almost all domains are called factotum; they can therefore be considered highly ambiguous.
- Ambiguous queries: Terms drawn from WordNet that are included in at least three different domains.
- Random queries: For comparison, random terms out of WordNet were chosen. This is considered the baseline for the evaluation.
- Random queries without retainment: Here the queries were not retained between the consecutive probing attempts.
Additionally, we wanted to evaluate to what extent query-based probing has to be performed to create an
adequate representation of the partner with regard to the WordNet domain mapping. The approach itself takes
an input and maps it onto a set of domains with attached weights. The input could be either the
query or a document retrieved by the QBS. The final result of this approach is a list of domains covering the
query itself and lists of domains representing the sources, which can then be matched.
In Table 1 one can see an example of such a mapping of queries and domains:
Input Terms Output Categories
battle of trafalgar { military, history }
women wage gap { economy }
world cup football { sport }
java 8 features { computer science, food }
dinosaur t-rex { animals }
sentimental tears emotions { psychological features }
kittens for sale { animals, commerce }
department of justice { administration }
Table 1: Example mappings from input query terms to WordNet domain categories.
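A mapping like the one in Table 1 can be sketched as a weighted aggregation over per-term domain assignments; the tiny lexicon below is a stand-in for the actual WordNet Domains resource:

```python
from collections import Counter

# Stand-in lexicon; the real assignments come from WordNet Domains.
TERM_DOMAINS = {
    "battle": ["military", "history"],
    "trafalgar": ["history"],
    "football": ["sport"],
}

def map_to_domains(terms):
    """Map input terms to a weighted list of domains, strongest first."""
    counts = Counter(d for t in terms for d in TERM_DOMAINS.get(t, []))
    if not counts:
        return []
    total = sum(counts.values())
    return [(domain, n / total) for domain, n in counts.most_common()]
```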
Figure 6 shows the results of a first validation attempt. Here, three sets of twenty documents each were
retrieved. Each set consisted of results belonging to one of three domains, namely mathematics, religion and
health. The used queries were chosen in a way where even non expert users could classify them correctly with
ease. Two of the three domains, mathematics and religion, can take the first two places, the last one, health, is
switched with domain medicine on place three. Although this is not totally correct it should not interfere with
the main goal of the system as long as this behaviour is consistent.
Figure 6 The plot shows the domains that are detected by our implementation in relation to their corresponding weights.
All of the three original domains are recovered except that health and medicine are switched due to their similarity.
In Figure 7 we present the results of the probing with 100 up to 2000 queries. With each iteration, the
agreement in the list of domains was measured between the current amount of queries and its predecessor.
The results seem to indicate that, depending on the partner, about 300 to 700 queries are necessary to reach a
stable state. Further, we found that it seems to be possible to distinguish between encyclopaedic partners and
niche-source partners. We define niche-source partners as those where the domain is easily identifiable from
the textual content of the documents, whereas encyclopaedic partners cover all kinds of topics (e.g. CERN
(niche), WordPress (encyclopaedic)).
Figure 7: The x-axis represents the number of probed documents, in steps of hundred. The y-axis represents the results
of the “rank biased overlap”-measure, where numbers close to 1 represent a stable distribution of topics. The upper plot
demonstrates an example graph on a niche source called “CERN Documents”, as indicated by a visible convergence. The
lower one shows a graph for an encyclopaedic source in comparison. All approaches yield more stable results on the
niche source. As expected the highest fluctuation is produced by the “random without replacement” approach.
Finally, in Figure 8 we can answer the question whether ambiguity yields more stable results at an earlier stage.
According to our results, there is only a slight benefit from the factotum and ambiguous approaches
in comparison to the random approach; only the random approach without replacement stands out, with lower
results in general and bigger deviations from the mean.
Figure 8: Comparison of the 4 different query generation methods with each other. Each diagram shows the mean
performance on niche and encyclopaedic sources with corresponding standard deviation. All 4 methods show a similar
mean performance, except the last method exhibits extreme deviations from the mean.
4.2 International Conference on Information and Knowledge Management
(CIKM 2016) – Under Submission
Although the approaches presented by Jin et al. [Jin, 2014] show that a market-share-based source selection
approach yields considerably better results than most state-of-the-art approaches, there is one major
drawback to it. In the setting of EEXCESS, where a mixture of smaller and larger databases can be
expected, a market-share-based approach could lead to the effect that the algorithm always selects the same
partners regardless of the initial query. Thus, small partners covering specialised long-tail content, previously
referred to as niche sources, might be left behind. Here our source selection approaches could help to resolve
such issues.
On the other hand, according to the gold standard included in the FedWeb Greatest Hits dataset, introducing
these niche sources seems to lower measures like precision and NDCG. Hence the ideal solution would be a
system that introduces just as many niche sources into the final result lists as possible while causing only a
minor measurable negative effect. Therefore, we combined two ranking functions which give a weight to each
of the sources. The first ranking function is a reimplementation of the current state of the art, while the second
one is biased towards niche sources. The influence of each of the ranking functions is controlled by a parameter alpha,
ranging between 0 and 1. To gain insight into the optimum value for alpha, at which the performance is not
severely affected, we conducted an evaluation based on the already described FedWeb Greatest Hits dataset.
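Assuming a plain linear mixture (the exact combination in the submitted paper may differ), the per-source score would look like:

```python
def combined_source_score(baseline_score, niche_score, alpha):
    """Mixture of the market-share baseline and the niche-biased ranking
    function; alpha in [0, 1] controls the niche influence."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * baseline_score + alpha * niche_score
```

With alpha = 0 the baseline ranking is reproduced, while larger values trade retrieval performance for niche-source coverage.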
Aside from the WordNet Domains based source selection approach, we had a look into the “Wikipedia for
Schools” corpus. Wikipedia for Schools is a subset of the original Wikipedia corpus matching the UK National
Curriculum, aimed at educational use by children. This corpus has already been used by us to
demonstrate source selection based on age, as it is intended to be consumed by pupils. While
the original Wikipedia category graph has a vastly heterogeneous structure, including cycles, the graph of
Wikipedia for Schools has only about 120 categories. The assignment of categories to documents or queries
is done by measuring the overlap coefficient of their terms with the terms of the documents assigned to a
certain category. Table 2 shows sample results of domains being assigned to queries by both approaches:
Input Terms WordNet Domains Categories Wikipedia for Schools Categories
battle of trafalgar { military, history } { pre 1900 military, military people }
women wage gap { economy } { animal and human rights }
world cup football { sport } { sports teams, religion }
java 8 features { computer science, food } { computer and video games, cartoons }
dinosaur t-rex { animals } { dinosaurs }
sentimental tears emotions { psychological features } { }
kittens for sale { animals, commerce } { mammals }
department of justice { administration } { law }
Table 2: Example category assignments produced by the WordNet Domains and the Wikipedia for Schools approach.
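The overlap-coefficient based assignment can be sketched as follows (the threshold and the toy data are illustrative):

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|) between two term sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def assign_categories(query_terms, category_terms, threshold=0.5):
    """category_terms: category name -> terms of the documents assigned
    to it; returns matching categories, best first."""
    scores = {c: overlap_coefficient(query_terms, terms)
              for c, terms in category_terms.items()}
    return sorted((c for c, s in scores.items() if s >= threshold),
                  key=lambda c: -scores[c])
```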
Table 3 shows the results of our evaluation with regard to the niche sources introduced by the optimisation.
nDCG@10 nDCG@20 nDCG@100 P@1 P@5
Baseline (200P) 0.27 0.32 0.49 0.2 0.26
Baseline+LT (α = 0.10) 0.26 0.31 0.47 0.2 0.26
Baseline+LT (α = 0.25) 0.24 0.29 0.46 0.2 0.21
Baseline+LT (α = 0.50) 0.19 0.24 0.42 0.11 0.15
Baseline+LT (α = 0.75) 0.13 0.17 0.39 0.03 0.12
Table 3: Retrieval performance of the baseline and of its combination with the long-tail (LT) biased ranking for different values of α.
4.3 Conference and Labs of the Evaluation Forum (CLEF 2016) – Under
Submission
In the work submitted to the CLEF 2016 conference we described our approach to conduct a
crowdsourcing-based evaluation of our block ranking approach and, alternatively, of an interleaving-based
approach to integrate diverse and serendipitous results into the final result list. Furthermore, we tried to assess
the impact of our algorithms for diversity and serendipity in result lists.
The actual evaluation was conducted on the crowdsourcing platform CrowdFlower6. Over 300 workers
conducted the evaluation, producing a total of over 1500 judgements. This is follow-up work to the work
submitted to last year's CLEF conference [Ziak, 2015], in which we described the design of our dedicated result
list evaluation framework and conducted a small user study of our query reformulation based diversification
approaches. The findings of that work lay the foundations for the current evaluation.
To gain all the needed information we created a dedicated dataset with the help of query logs of the EEXCESS
system. The finally selected query set contained 52 queries. The evaluation dataset contained the actual user
query, contextual information of the user (e.g. history) and the results created by our blocking algorithms.
Our evaluation setup contained a total of four scenarios:
- In Scenario 1 we wanted to compare our blocking approach against the basic list, where no diverse or serendipitous results were introduced, to gain a basic understanding of the acceptance level of potential users for such a setup. The workers were instructed to get into the mind-set of a potential user and were given additional information about the query and the according context. Each worker had to decide which of the two lists suited this information need best.
- Within Scenario 2 the workers were presented with a shortened result list containing either the diverse or the serendipitous block, compared against the equally shortened basic list. The goal of this evaluation was to rule out that one of the approaches has an adverse effect on the other, since the algorithm to generate serendipitous results had not been evaluated yet.
- Scenario 3 covers the direct comparison of the block ranking and the interleaved approach.
Each task was assessed by six different workers; to reduce the potential risk of a bias towards the list
presented first, the lists were interchanged for 50% of the workers. Figure 9 shows an example of a task a
worker had to perform.
6 http://www.crowdflower.com/
Figure 9: Example of a task a worker had to perform. One of the lists shows diversified results at the bottom, the other
one shows the unmodified list. There is no indicator telling the user which list contains which content.
We report the agreement on item level as the arithmetic mean of the percentage of the largest agreement.
The second reported figure is the percentage of times each algorithm was selected per approach, according
to the preferences of the workers.
Table 4 shows the results for the first evaluation scenario. The workers' agreement is about 0.7 for both the
interleaved and the blocked approach. We further analysed the queries that got the majority of votes for either
the blocked or the interleaved approach.
Item Agreement Decision Percentage
Interleaved 0.692 0.358
Blocked 0.721 0.355
Table 4: Item agreement and decision percentage for the first evaluation scenario.
Here only one query was present in both sets which indicates that for most queries either the blocking or the
interleaving approach is beneficial.
Table 5 represents the second evaluation scenario, with the goal of assuring that the diversity and serendipity
approaches do not have adverse effects upon each other. Here all measures produce similar figures for both
approaches. Therefore, we assume that both approaches work on a similar level.
Item Agreement Decision Percentage
Diverse 0.769 0.307
Serendipitous 0.746 0.31
Table 5: Item agreement and decision percentage for the second evaluation scenario.
The result of the direct comparison of the interleaved and blocking approaches is shown in Table 6. Here both
approaches obtain similar results with a slight tendency towards the block ranking approach.
Item Agreement Decision Percentage
Blocked vs Interleaved 0.647 0.532
Table 6: Item agreement and decision percentage for the direct comparison of the blocked and the interleaved approach.
4.4 Social Book Search Lab - Accepted
The Social Book Search Lab at the CLEF 2016 conference consisted of three tracks: the suggestion, mining and
interactive track. KNOW took part in the suggestion and mining tracks. These challenges were of
interest since both are in the domain of recommender engines. Therefore, the achieved results
could potentially be utilised within the EEXCESS Federated Recommender. Furthermore, similarities can be
found in the usage of data considered to be in the cultural heritage domain, to be more specific, literature
enriched with matching metadata.
The task of the suggestion track was to recommend books matching the context of users within a forum about
books. The supplied dataset used in both tracks was a book catalogue of 2.7 million crawled records from
Amazon.com7 enriched with metadata from LibraryThing8. This dataset with all supplied metadata (e.g.
authors, tags, browse nodes, binding) was indexed using the Lucene framework9. The task specification
supplied a feature-rich dataset of the postings from the LibraryThing platform, matching
metadata of the user's reading catalogue, and in some cases examples of mentioned book titles.
In this lab, an approach was presented which relied on performing a similarity search on the metadata, the
tags, and the books initially given from the user's catalogue and examples. These initially given books carry
so-called browse nodes, which contain categorisation information from the Amazon dataset. The
initially created lists of browse nodes and tags from the catalogue, combined with the examples, were used to
create two queries. These two queries were later employed to generate lists of similar books. For re-ranking
and merging the resulting lists, a Latent Semantic Indexing (LSI) algorithm was implemented. For the LSI
algorithm, query vectors were again created out of the user's postings. An example of the results of the
LSI approach can be seen in Table 7. The insights gained in this challenge were applied in a simplified manner in
the new result aggregation algorithm of the EEXCESS Federated Recommender framework.
Book Titles
Catalogue Entries Data Mining: Practical Machine Learning Tools and Techniques
Statistics, Data Analysis, and Decision Modelling
Software Architecture in Practice
Results Introduction to Algorithms
Software Engineering: A Practitioner’s Approach
Artificial Intelligence: A Modern Approach
Artificial Intelligence (Handbook Of Perception And Cognition)
Machine Learning (Mcgraw-Hill International Edit)
Prolog Programming for Artificial Intelligence
An Introduction to Support Vector Machines and Other Kernel-based Learning
Methods
Table 7: The table shows the inputs and the generated recommendations. The book titles in the "Catalogue Entries" row
are used as inputs; the row "Results" holds the generated recommendations.
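The exact LSI formulation used is not spelled out here; the following numpy sketch only illustrates the general idea of ranking documents against a folded-in query in a k-dimensional latent space (the fold-in convention and the choice of k are illustrative):

```python
import numpy as np

def lsi_rank(term_doc, query_vec, k):
    """Rank documents against a query in a k-dimensional latent space.
    term_doc: terms-by-documents matrix; query_vec: raw term vector."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    docs_k = vt[:k, :].T                       # document coordinates
    query_k = (query_vec @ u[:, :k]) / s[:k]   # fold the query in
    sims = docs_k @ query_k / (
        np.linalg.norm(docs_k, axis=1) * np.linalg.norm(query_k) + 1e-12)
    return np.argsort(-sims)                   # best match first
```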
The mining track was divided into two tasks, the linking and the classification task. Within the linking task, the
goal was to identify entities within sentences; this means that book titles mentioned in the users' postings
should be recognised. The objective of the classification task was the identification of a user's request for a
book recommendation, which can be interpreted as an implicit request for suggestions.
For the linking task, a dataset of LibraryThing postings with the associated ground truth data was provided by
the organisers. A gazetteer-like system based on the book titles out of the previously mentioned Amazon index
was implemented to fulfil the task. Furthermore, experiments were conducted to refine the resulting book title
candidate lists by removing false positives, hence improving precision (e.g. sentence classification or
7 https://www.amazon.com/
8 https://www.librarything.com/
9 https://lucene.apache.org/core/
identification of sentences containing books, author co-occurrence). Although these final post-processing steps
could not be applied to the finally submitted runs, our results are still among the best of this track.
For the classification task about 2000 threads from LibraryThing and also 250 threads from Reddit10 were given.
In particular, content from the subreddits “suggestmeabook” and “books” was included. For this task, our
approach relied on typical features from the area of Natural Language Processing (NLP). These features are
n-grams, the number of terms within the text, and also less common features like the average number of
spelling errors in the text or the associated tags and browse nodes from the Amazon dataset. Three different
classification algorithms were trained with this feature set: a Random Forest classifier, a Naive Bayes classifier,
and a Decision Tree.
The presented approach achieved third place for the LibraryThing testing set with an accuracy of 91 percent.
We were beaten by two baseline runs from the organisers, achieving first and second place. For the
Reddit-based dataset we came second, achieving the same accuracy of 82 percent as the top team in first
place. These two tracks resulted in two publications accepted at the SBS Lab of CLEF 201611.
4.5 International Workshop on Text-based Information Retrieval (TIR 2016) –
Under Submission
One of the main challenges in the field of context-driven query extraction is the identification of the relevant
context. Here the goal is to identify the actual information need of the user, i.e. the topics the user is
interested in. Since this task of identifying the user's main topic of interest is not easy to solve, the
resulting query might contain several unrelated topics. Literature even suggests that it can be beneficial to
cover several of these topics at once in a recommender-related context [Rhodes, 2000]. While such a
procedure will certainly work in many cases, there is a downside to it in a federated setting. Different sources
have the tendency to respond to certain queries inconsistently.
While one source might yield good results with multi-topic queries, another one might not return results at all.
For that reason, it might be beneficial to provide a possibility to topically partition queries. Most approaches for
query splitting appear to rely on two sources of information: either the usage of query logs or initial
probing of the query against the source. Both approaches are not easy to apply in a federated setting. Query
logs of federated recommendation systems are difficult to obtain and additionally might be biased by the
algorithm creating the query. The probing approach bears the problem of introducing additional latency, which
is already a challenge to cope with in a distributed setting.
Instead of making use of directly accessible information, one can resort to external knowledge sources. One such
resource is Word2Vec, which has gained a lot of attention recently. The work submitted to the TIR 2016 workshop
contains the evaluation of two approaches to topically separate queries. The first clusters the query terms of
unrelated, concatenated queries with the well-known K-Means algorithm operating on the Google News Word2Vec
model. It was compared to a very simple baseline where the query is just split into N groups of equal length.
Within the evaluation, N ranges from two to four.
As query dataset, the well-studied Webis-QseC-10 dataset consisting of 5,000 user queries was used. Although
this dataset does not ideally match the setting of auto-generated queries, it should still be sufficient to show
the general validity of the approach. Rand Index and V-Measure were used as measures.
10 https://www.reddit.com/
11 http://social-book-search.humanities.uva.nl/#/mining16
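To make the baseline and the evaluation measure concrete, the following pure-Python sketch implements the equal-length split baseline and the Rand Index; the Word2Vec/K-Means variant additionally requires a trained embedding model and a K-Means implementation, which are omitted here. All names are ours, for illustration only:

```python
from itertools import combinations

def split_baseline(terms, n):
    """Split a list of query terms into n contiguous groups of (near) equal length."""
    size, rem = divmod(len(terms), n)
    groups, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        groups.append(terms[start:end])
        start = end
    return groups

def rand_index(labels_true, labels_pred):
    """Fraction of term pairs on which two clusterings agree (same/different cluster)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
        total += 1
    return agree / total
```

A perfect split of a two-topic query yields a Rand Index of 1.0, while disagreement on every pair pushes the score towards 0, mirroring the scores reported in Tables 8 and 9.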
In the first evaluation setup, N unrelated queries of the dataset are joined. Here we assume that the query
terms sent to the system are in topical sequence. Such a setting could, for example, result from a context-
driven query extraction approach based on paragraphs where two paragraphs are falsely unified. Here, the
vectors created by Word2Vec were extended with the positional information of the term.
The results of this evaluation are presented in Table 8. In this setting the most important information seems to
be the position of the query terms. This might be due to the fact that the largest share of user queries tends
to be in the range of three to five terms.
                               Two Queries   Three Queries   Four Queries
Word2Vec K-Means  Rand Index      0.71           0.64            0.59
Word2Vec K-Means  V-Measure       0.77           0.77            0.76
Split Approach    Rand Index      0.71           0.66            0.63
Split Approach    V-Measure       0.77           0.78            0.78
Table 8: Rand Index and V-Measure for the first evaluation setup (queries joined in topical order).
Within the second evaluation setup we assume a situation where the query terms are not in topically related
order. This might happen in a system that extracts keywords out of several paragraphs and returns a list of
queries weighted by the importance of the topic. Therefore, the already joined queries were randomised.
The results of this setup are shown in Table 9. As one would expect, the simple split approach yields results
that amount to totally random behaviour, whereas the Word2Vec-based approach works measurably better.
                               Two Queries   Three Queries   Four Queries
Word2Vec K-Means  Rand Index     0.088          0.071           0.056
Word2Vec K-Means  V-Measure      0.373          0.341           0.373
Split Approach    Rand Index     0.008          0.003           0
Split Approach    V-Measure      0.278          0.267           0.236
Table 9: Rand Index and V-Measure for the second evaluation setup (randomised query term order).
4.6 European Conference on Knowledge Management (ECKM 2016) -
Accepted
A publication analysing the overall responsiveness of the EEXCESS system was submitted to and accepted by the
Web 2.0 Models, Methods and Tools in Knowledge Management minitrack at the 17th European Conference on
Knowledge Management (ECKM 2016). The paper is due to be published in the third quarter of 2016. It is titled
“Context-Driven Federated Recommendations for Knowledge Workers” and highlights the capabilities of EEXCESS in
supporting knowledge workers by automatically providing suggestions for useful material. For environments where
it is not possible to integrate EEXCESS into the applications the knowledge workers interact with, the usage of
the generic Web interface is suggested. In such a scenario, EEXCESS cannot automatically suggest material, but
serves as a unified access point for all connected partner systems.
Even though some of the EEXCESS functionality cannot be used when working with the Web interface alone, it is
still beneficial for knowledge workers compared to querying all partner systems individually. The paper goes on
to describe the general architecture of the EEXCESS system and the Federated Recommender core in particular. In
three test scenarios the parallel processing capabilities of the EEXCESS system are evaluated. While one of the
test scenarios reconstructs a system as it is likely to be deployed in a real-world setting, the other two are
corner cases achieving the best and worst system performance. The evaluation shows how many requests can be
processed in parallel on a specified processing hardware. To judge the service quality, the recommender's
response times and the number of failed queries are used. A query is counted as failed if the partner system
does not respond within ten seconds.
The evaluation shows that on a machine with four CPU cores and four gigabytes of main memory the EEXCESS
Federated Recommender can process up to 100 requests in parallel with ten different partner sources connected
via the Internet. The overall response time in the test was dominated by the response times of the partner
systems. This highlights the efficiency of the request transformation, result transformation and aggregation,
and re-ranking implemented in the Federated Recommender.
5 PartnerWizard
The PartnerWizard is a software tool to create new Partner Recommenders without requiring any programming.
New partners without programming resources are thus enabled to join the EEXCESS ecosystem. The idea of the
PartnerWizard was first formalised in the half-year management report February-July 2015 for work package 3.
The design and development of the PartnerWizard were already presented in the Deliverables D9.5 and D3.3. KNOW
and JR-DIG have both been working on different parts of the PartnerWizard. Figure 10 depicts the individual
parts of the PartnerWizard, as well as the tasks and the required order in which they need to be completed when
a new Partner Recommender is generated. All interaction with the PartnerWizard is done via a Web-based graphical
user interface (Web GUI). KNOW worked on the query configuration and query generator testing (depicted in
green). JR-DIG worked on the parts providing the initial configuration, the result list mapping, and the final
deployment of the generated Partner Recommender. In this deliverable only the parts developed by KNOW are
described in detail; the parts developed by JR-DIG are reported in deliverable D4.3. Many aspects have already
been reported in deliverable D3.3 and are updated and consolidated for this final deliverable.
Figure 10: The PartnerWizard guides users through the necessary configuration steps. Parts depicted in blue have been
developed by JR-DIG; parts depicted in green by KNOW. In this deliverable only the parts developed by KNOW are
reported.
5.1 Query Configuration
Example queries are needed for the test of the query generation. Therefore, a list of queries covering large
areas of different knowledge domains is pre-loaded by the PartnerWizard for the query generator testing. It is
possible to modify this list of queries just before the query generator testing is started. Keywords can be
added to or deleted from a query, new queries can be added, existing queries can be deleted, and the main topic
of a query can be set. All of this might be desired due to special material covered by the partner, which might
otherwise not be correctly reflected in the test. Each query can consist of one or more keywords. A keyword is
not restricted to a single term, but can consist of one or multiple terms. Similarly, each query can have a
main topic, meaning that one keyword describes the central concept of the complete multi-keyword query.
These query options are available for the entire EEXCESS system, not only for the PartnerWizard. Figure 11
shows the Web GUI to configure the queries.
Figure 11: Web GUI to configure the queries used to test the query generation.
5.2 Query Generator Testing
After the queries are selected, the main test run can start. In this test run different query generator
configurations compete against each other. A query generator configuration consists of a query generator
implementation together with the settings for query splitting and query expansion. For each query generator,
either query splitting, or query expansion, or neither of them can be enabled, so each query generator
implementation has to be tested with three different settings. Each of these configurations is then tested with
each of the queries configured earlier. In total,
numberOfTests = 3 ∙ numberOfQueryGeneratorImplementations ∙ numberOfQueries
test runs need to be executed.
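The enumeration behind this formula can be sketched in a few lines (function and setting names are ours, for illustration only):

```python
def enumerate_test_runs(generator_impls, queries):
    """Every query generator implementation is tested with three settings
    (no modification, query splitting, query expansion) against every query."""
    settings = ("none", "splitting", "expansion")
    return [(gen, setting, query)
            for gen in generator_impls
            for setting in settings
            for query in queries]
```

For two generator implementations and three queries this yields 3 · 2 · 3 = 18 test runs.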
The testing procedure is depicted in Figure 12. At the beginning of the tests, all query generator
implementations are checked as to whether they produce a non-empty result set for at least one of the queries.
Only if this is the case are they used further. In the next step all possible configurations compete against
each other. If two configurations produce result lists which are not equal, the user has to decide which one
fits the query best.
The Web GUI showing the two result lists to the user is depicted in Figure 13. When all configurations have
been tested, the configuration which was chosen most often is stored as the winning configuration. If there is a
draw, the simplest configuration is used. To judge the simplicity of a configuration, the list of query
generator classes internally stored in the PartnerWizard is ordered by ascending complexity of the
implementation. Furthermore, no query modification is considered simpler than query splitting, and query
splitting is considered less complex than query expansion. With this set of rules, a winning configuration can
always be found.
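This selection rule can be sketched as follows (names and data structures are ours, hypothetical):

```python
def pick_winning_configuration(votes, simplicity_rank):
    """votes: {configuration: number of times the user chose it}.
    simplicity_rank: {configuration: int}, where a lower value means a simpler
    configuration (generator class complexity first, then none < splitting < expansion).
    The most-voted configuration wins; draws go to the simplest configuration."""
    top = max(votes.values())
    tied = [config for config, count in votes.items() if count == top]
    return min(tied, key=lambda config: simplicity_rank[config])
```

Because the simplicity ranking is a total order, the minimum over the tied configurations is unique, so a winner always exists.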
Figure 12: Activity Diagram showing all necessary steps to select the query generator configuration. This diagram is a
modified version of Figure 3 from deliverable D3.3.
Figure 13: Web GUI showing two non-identical result lists to the user. The user has to decide which result list matches
the query best. The query is shown above the two result lists, where the main topic of the query is set in a bold typeface.
All interactions between the user, the PartnerWizard and the partner systems are depicted in Figure 14. The
user triggers the process via the Web GUI by completing the query configuration. Then the PartnerWizard
automatically tests all query generator implementations known to it. All query generator implementations
returning a non-empty result list for at least one query are potential candidates for the desired configuration.
For each of these candidates all possible configurations are iterated. Next, two configurations compete against
each other. If they provide the same result list, the PartnerWizard automatically counts this as a draw. If the
result lists are not equal, the user has to vote which result list fits the query best. After all pairs have
been voted on, the winning configuration is determined and stored by the PartnerWizard. The final Partner
Recommender can then be deployed with the winning configuration.
Figure 14: Sequence Diagram depicting the interactions done during the query generator testing between the user, the
PartnerWizard and the partner system. This diagram is a modified version of Figure 4 from deliverable D3.3.
5.3 Deployment
After the query generator testing, all parameters to configure and generate a new Partner Recommender are
complete. To create the new Partner Recommender, the initial configuration, the result list configuration, and
the query generator testing results are combined. With all this information a Java Web Archive holding the
configured Partner Recommender is created. The Partner Recommender is a Java Servlet that can be run on any
Apache Tomcat 8.
6 System testing and performance evaluation
A series of tests was carried out to objectively judge the performance of the Federated Recommender
framework. For all tests two different system environments were used. Test environment #1 consists of a
Linux virtual machine with four CPU cores and four gigabytes of main memory; test environment #2 consists of
a Linux virtual machine with eight CPU cores and eight gigabytes of memory. The tests covered up to ten of the
available Partner Recommenders and the Federated Recommender with its query transformation, result
aggregation and re-ranking.
In four scenarios the overall system performance was evaluated. All scenarios were tested with the same set
of 1,649 queries. One third of the queries originated from the EEXCESS query logs, while the other two thirds
were selected from the AOL query dataset [Pass, 2006]. The set of queries was split into subsets containing 10,
30, 50, 100, 150, and 500 queries. Each query in a subset was sent to the system by a dedicated thread to
simulate parallel requests sent to the Federated Recommender. A ten-second intermission was made between two
subsets of queries to let the partner systems recover to a state of normal operation. This implies that the
bigger the subsets of queries are, the faster a test run completes, since fewer subsets and therefore also
fewer intermissions are needed.
Three measurements are used to judge the performance, namely the number of queries sent in parallel, the
average response time, and the number of failed queries. A query is counted as failed if the Partner
Recommender does not provide a result list within five seconds. All scenarios are considered in vivo tests and
the partners are connected via the Internet. Hence, variations between the individual runs due to network
latencies and load from requests by other users are expected. The Internet uplink of all test systems
offered a bandwidth of 200 Mbps.
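The dispatch-and-measure loop described above can be sketched as follows; the function names are ours, and the real harness sends each query to the running Federated Recommender rather than to a stand-in callable:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_batch(send_query, queries, timeout_s=5.0):
    """Send every query of a batch in its own thread; a query counts as
    failed when its response takes longer than timeout_s seconds.
    Returns the average response time and the number of failed queries."""
    def timed(query):
        start = time.monotonic()
        send_query(query)
        return time.monotonic() - start

    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        elapsed = list(pool.map(timed, queries))
    failed = sum(t > timeout_s for t in elapsed)
    return sum(elapsed) / len(elapsed), failed
```

Between batches, the harness additionally sleeps for the ten-second intermission so the partner systems can recover.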
6.1 Scenario 1
In this scenario the Federated Recommender and the Partner Recommenders run on the same virtual machine. The
virtual machine is equipped with four CPU cores and four gigabytes of main memory. To test the setup under
different system loads, two runs were conducted. In the first run three partners (Europeana, KIMPortal, and
Mendeley) were queried, and in the second run ten (Deutsche Digitale Bibliothek, Deutsche Nationalbibliothek,
Deutsche Zentralbibliothek für Wirtschaftswissenschaften, Digital Public Library of America, Europeana,
KIMPortal, Mendeley, RijksMuseum, Swissbib, Wissen Media). All partners were connected via the Internet.
In Table 10 and Figure 15 the results for the test runs in scenario 1 are shown. As expected, the total response
time decreases with increasing batch size. This is due to the fact that with increasing batch size fewer batches
are needed to send all of the 1,649 queries. Hence, fewer interval pauses between the batches are needed and
the total time consumed decreases. This shows that the total response times are heavily dominated by the
intervals between the batches. When comparing the runs with three and ten sources, a performance deterioration
for ten sources can only be identified in the case of 500 parallel requests. This counts towards the argument
that 100 parallel requests can be processed for ten partners with the Federated Recommender and the Partner
Recommenders running on the same machine.
The situation is different when looking at the average number of timeouts, as they show a clear increase
between the runs with three and ten sources even for the smallest batch. This supports the claim that the
Partner Recommenders are not capable of handling all requests within the required window of five seconds. To
investigate this aspect further, scenario 2 was designed.
                    Total response time incl.         Average number of
                    intermissions (seconds)           timeouts
Parallel Requests   Three Sources   Ten Sources       Three Sources   Ten Sources
10                  1,641           1,643             6               24
30                  543             547               3               338
50                  324             328               0               827
100                 164             168               268             1,225
150                 111             112               700             1,463
500                 107             131               1,406           1,521
Table 10: Accumulated response times and average number of timeouts for three and ten knowledge sources connected
over the Internet.
Figure 15: Accumulated response times and average number of timeouts for three and ten knowledge sources connected
over the Internet.
6.2 Scenario 2
In this test setting the run with the ten partners (Deutsche Digitale Bibliothek, Deutsche Nationalbibliothek,
Deutsche Zentralbibliothek für Wirtschaftswissenschaften, Digital Public Library of America, Europeana,
KIMPortal, Mendeley, RijksMuseum, Swissbib, Wissen Media) from scenario 1, connected via the Internet, was
repeated. To judge the influence of the available computing resources on the average number of timeouts, in
this scenario a virtual machine with eight CPU cores and eight gigabytes of memory was used. Table 11 and
Figure 16 show the results from this run and, for comparison, also the results of scenario 1 with ten sources.
When looking at the total response time, an improvement can only be seen in the case of 500 parallel requests.
For the average number of timeouts, the improvement is biggest for small batches of parallel requests and
declines with increasing batch size.
                    Total response time incl.          Average number of
                    intermissions (seconds)            timeouts
Parallel Requests   4 CPUs,        8 CPUs,            4 CPUs,        8 CPUs,
                    4 GB RAM       8 GB RAM           4 GB RAM       8 GB RAM
10                  1,643          1,643              24             1
30                  547            546                338            32
50                  328            327                827            179
100                 168            167                1,225          790
150                 112            111                1,463          1,098
500                 131            86                 1,521          1,473
Table 11: Accumulated response times and average number of timeouts for ten knowledge sources connected over the
Internet with the EEXCESS system run on two different hardware configurations.
Figure 16: Accumulated response times and average number of timeouts for ten knowledge sources connected over
the Internet with the EEXCESS system run on two different hardware configurations.
6.3 Scenario 3
The setup used in this scenario consists of two virtual machines. Both have four CPU cores and four gigabytes of
main memory. One of the virtual machines hosts the Federated Recommender and the other one hosts three
Partner Recommenders. Europeana, KIMPortal and Mendeley are used as partners and are connected via the
Internet. Table 12 and Figure 17 show the results of the test runs. The results for the single machine run are
taken from scenario 1 with the same three partners. For the setup with two separate machines running the
Federated Recommender and Partner Recommenders, the maximum accumulated CPU load never exceeded
20%. Concerning the total response time, an improvement is noticeable for 500 queries issued in parallel. This
can be interpreted as the Partner Recommenders generating a considerable load, too big for one machine with
four CPU cores to handle. The average number of timeouts does not vary considerably between both setups. This
aspect is investigated further in scenario 4.
                    Total response time incl.             Average number of
                    intermissions (seconds)               timeouts
Parallel Requests   Single      Separate                  Single      Separate
                    machine     machines                  machine     machines
10                  1,641       1,642                     6           0
30                  543         543                       3           0
50                  324         324                       0           0
100                 164         164                       268         218
150                 111         109                       700         747
500                 107         79                        1,406       1,359
Table 12: Accumulated response times and average number of timeouts for the Partner Recommenders and the
Federated Recommender running together on a single machine and on two separate machines.
Figure 17: Accumulated response times and average number of timeouts for the Partner Recommenders and the
Federated Recommender running together on a single machine and on two separate machines.
6.4 Scenario 4
In this scenario only one locally hosted Partner Recommender is used. This scenario was designed as a
benchmark for the preceding runs. Since the Partner Recommender is hosted locally, neither network delays nor
load from requests by other parties on the Internet can occur.
The FedWeb Greatest Hits [Demeester, 2015] collection web dump indexed by Apache Solr was used as
the locally hosted Partner Recommender. The FedWeb dataset consists of search results from 150 different
search engines; from each search engine, 20,000 to 80,000 documents are contained in the dataset. In total,
the dataset contains over five million documents, which were indexed by a local Apache Solr instance. To query
the Apache Solr index, a purpose-built Partner Recommender was implemented, directly accessing the index
on the hard disk.
In Table 13 and Figure 18 the results are shown, again compared with the results from scenario 1 with three
partners connected via the Internet. All measurements shown in Table 13 and Figure 18 were taken on a single
virtual machine with four CPU cores and four gigabytes of main memory. Looking at the total response times, a
difference is only notable for 500 parallel requests. For the average number of timeouts, the situation is more
drastic, as the locally hosted partner does not produce request timeouts for any number of parallel requests. As
the Partner Recommender is running on the same machine in both cases, this gives rise to the conclusion that the
partners connected over the Internet are not able to respond in time for large numbers of more than 50
concurrent queries. This might either be a limitation of their systems or a simple quota (either IP- or API-
based).
                    Total response time incl.             Average number of
                    intermissions (seconds)               timeouts
Parallel Requests   Internet    Local                     Internet    Local
10                  1,641       1,640                     6           0
30                  543         540                       3           0
50                  324         321                       0           0
100                 164         161                       268         0
150                 111         104                       700         0
500                 107         35                        1,406       0
Table 13: Accumulated response times and average number of timeouts for sources connected via the Internet and a
locally hosted Apache Solr search engine.
Figure 18: Accumulated response times and average number of timeouts for sources connected via the Internet and a
locally hosted Apache Solr search engine.
7 Conclusions on the Federated Recommender
The work on performance optimisation and feature engineering within the Second Federated Recommender
Prototype has been continued in the last iteration of the project. Furthermore, efforts have been made to
scientifically evaluate and refine the already proposed or introduced methods. Finally, experimental studies on
the system performance and on potential drawbacks of such federated systems have been undertaken.
The joint effort of WP4 and WP3 regarding exploitation and dissemination, the PartnerWizard, has already
proven its benefits. Several new partners were introduced into the system without any further development
effort. Based on the experience of the last years this was an important step, since the potential lack of
resources of prospective partners had become the greatest obstacle for further uptake. In addition, the
PartnerWizard was presented to the general public at the International Science 2.0 Conference and the EEXCESS
Final Conference, where subsequently new potential data providers indicated their interest in joining the
EEXCESS system.
The exploitation of the already introduced partner sources was driven further by refining the query formulation
process based on the feedback of internal partners as well as external users of the framework. Apart from the
finalisation and refinement of the Federated Recommender Prototype, the Partner Recommender framework
and the PartnerWizard, KNOW focused on the scientific exploitation of newly introduced and core algorithms
of the system, resulting in a series of publications. Topically, these publications cover most of the main
challenges of federated systems, namely source selection, source representation, result aggregation and query
formulation, but also challenges arising from the real-world application of such systems.
Some of the results presented here could already be integrated as new features in the Federated
Recommender framework. The most prominent example is the new result aggregation algorithm. It promotes
personalisation and contextualisation of the result lists in several ways. With the accompanying changes to the
Secure User Profile it is now feasible that (i) users can state their preferences themselves and thereby
influence the ranking of the recommended documents, and (ii) the frontend adapts to the users' behaviour by
automatically learning their preferences. Although this approach has already received positive feedback within
the consortium, we want to extend it with learning-to-rank algorithms.
Although one of the biggest challenges in the real-world application of such a federated system, the
adoption and connection of new partner systems, could be resolved, one challenge is still present: achieving
low latencies within such a system. From our performance-centred evaluation we found that the main problem here
is the response time of the partners. Although we could improve the response time of the whole system
substantially by splitting the requests into a recommendation request and a details request, the response time
of the partners is still the biggest influence factor on the rest of the system. The only potential solution to
this problem is the hosting of the Partner Recommenders directly at the partner's site, which is supported by
the system architecture.
8 Narrative Path
Deliverable 3.3 described a prototype for narrative paths exploiting citations in scientific literature in order
to create a set of paths linking the resources. The algorithm was evaluated against a gold dataset, and the item
ordering was defined based on reader count. The approach was evaluated by surveying a sample of Mendeley users
and showed positive results.
In Deliverable 3.4 the research has focused on whether it is possible to identify consistent patterns of
citation between papers in order to identify paths as information-seeking journeys [Fernando et al. 2013].
In order to test this hypothesis, two distinct experiments were carried out. The objective in both cases was
the induction of a directed graph describing a path between articles that can help researchers navigate a
collection of literature using the idea of paths as sequences of items.
In the first experiment a corpus of literature was mined to extract in-line citations and their relative
positions. If the order in which they appear is consistent enough across the corpus, it is possible to induce a
directed acyclic graph (DAG) where nodes represent articles and edges the relationship “read after.”
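Assuming per-document citation sequences are available, the edge induction can be sketched as follows; the names and the margin parameter are ours, and turning the result into a true DAG would still require cycle handling:

```python
from itertools import combinations
from collections import Counter

def read_after_edges(citation_orders, min_margin=1):
    """citation_orders: one sequence of cited article ids per source document,
    ordered by first occurrence. An edge (a, b) means 'b is read after a';
    it is kept only when the documents agree by at least min_margin net votes."""
    votes = Counter()
    for sequence in citation_orders:
        for a, b in combinations(sequence, 2):  # a is cited before b
            votes[(a, b)] += 1
    return {(a, b) for (a, b), n in votes.items()
            if n - votes[(b, a)] >= min_margin}
```

With two documents citing A first but disagreeing on the order of B and C, only the edges from A survive, illustrating how inconsistent orderings cancel out.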
The second experiment instead exploited Mendeley logs, extracting the users' reading sequences. This crowd-
sourced solution aims to identify frequent reading patterns that induce a Markov chain in which states are
articles and transitions reflect the probability of reading a given paper after the current one.
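Estimating such transition probabilities from reading logs can be sketched in a few lines (names are ours, for illustration):

```python
from collections import Counter, defaultdict

def transition_probabilities(reading_logs):
    """Estimate a Markov chain from users' reading sequences: states are
    articles, transitions the empirical probability of reading article b
    immediately after article a."""
    counts = defaultdict(Counter)
    for sequence in reading_logs:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}
```

For example, if two of three users read B directly after A, the estimated transition probability P(B | A) is 2/3.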
8.1 Experiment 1: Mining narrative paths from survey papers
In this experiment, we asked whether there are any consistent patterns in the citation order within a body of
research papers, i.e. is a citation to paper A consistently located after a citation to paper B?
In order to verify our assumption, a ground truth dataset is needed. Such a ground truth should contain an
absolute ordering for a set of papers in the literature, defining the order in which they appear.
As a first step we wanted to evaluate whether the assumptions on the ordering were valid, by assessing the level
of agreement across different raters, where each rater is represented by a document from which the relationship
pairs are extracted. High agreement would confirm the assumption, while low agreement would mean that no
evidence for it is present.
8.2 Ground truth dataset construction and testing
We assume that “state of the art papers” (i.e. papers containing a survey of the literature for a specific
research area) are the best candidates to induce an ordering over the cited documents from fundamental
paper to more advanced topics in the research area.
The dataset has been built by issuing queries with specific query terms to the Mendeley catalogue. The process
resulted in a corpus composed of 159,254 articles containing in their title the following terms:
Term in the title Documents retrieved
survey 61,221
state of the art 5,559
introduction to 11,831
overview of 14,081
review of 66,625
Table 14: Number of documents retrieved per title term.
The resulting dataset is composed of tuples of the form (<uuid>,<filehash>).
For each document in the dataset, identified by <filehash>, we retrieved the associated PDF document,
extracted the citation contexts and resolved them against the Mendeley API. The citations have been ordered
according to their first occurrence in the source document.
The enriched dataset is then composed of tuples of the form (<filehash>,<uuid>,<order>).
Due to limitations on the number of calls we can issue against the Mendeley API, we sampled over 1M
instances of citation occurrences. We considered this amount sufficient for initial experiments.
The dataset has been pre-processed and we extracted all the existing follower relations, i.e. for any given
research paper, we compiled all pairs (uuid1, uuid2) where the first occurrence of uuid1 precedes the first
occurrence of uuid2. Each tuple in the pre-processed dataset has the following schema:
(<filehash>,<uuid1>,<ord1>,<uuid2>,<ord2>).
In order to test the agreement on the order of pairs observed in the data, we coded the data as a matrix
where each row corresponds to one rater (<filehash>, i.e. the source of information) and each column to a
pair combination <uuid1>,<uuid2>. A cell of the matrix contains:
● 1, if in a given <filehash> the first occurrence of uuid1 precedes the first occurrence of uuid2
● 0, if in a given <filehash> the first occurrence of uuid2 precedes the first occurrence of uuid1
● NA, if in a given <filehash> not both uuid1 and uuid2 were observed.
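One cell of this rater matrix can be computed as follows; the function name is ours, and coding pairs where only one of the two uuids was observed as NA reflects our reading of the scheme above:

```python
def code_pair(first_occurrence, uuid1, uuid2):
    """first_occurrence maps each cited uuid to the rank of its first in-line
    citation within one source document (one rater). Returns 1, 0, or None (NA)
    following the coding scheme; pairs not fully observed are coded NA."""
    if uuid1 in first_occurrence and uuid2 in first_occurrence:
        return 1 if first_occurrence[uuid1] < first_occurrence[uuid2] else 0
    return None
```

Applying this function to every (<filehash>, pair) combination fills the matrix handed to the agreement metrics.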
The agreement has been computed by means of Krippendorff's alpha [Krippendorff, 2013] and, as an alternative
agreement metric, Fleiss' kappa [Fleiss, 1973].
We computed the metrics on a sample of the data because the implementations of Krippendorff's alpha and
Fleiss' kappa were not able to scale to the size of the data.
The values of alpha and kappa we were able to obtain on the data sample are as follows:
Krippendorff's alpha
Subjects 190045
Raters 78
alpha -0.293
Table 15
Fleiss' Kappa for m Raters
Subjects 95695
Raters 78
Kappa -0.0125
z -218
p-value 0
Table 16
Both results show no agreement.
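For reference, Fleiss' kappa [Fleiss, 1973] can be computed from a subjects-by-categories count matrix as sketched below. This is a generic illustration of the metric, not the implementation used in the experiment:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix counts[i][j] = number of raters who
    assigned subject i to category j (equal number of raters per subject)."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    # mean per-subject agreement P-bar
    p_bar = sum((sum(c * c for c in row) - n_raters) /
                (n_raters * (n_raters - 1)) for row in counts) / n_subjects
    # chance agreement P_e from the category marginals
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 indicates perfect agreement, 0 agreement at chance level, and negative values agreement below chance, as observed in Tables 15 and 16.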
8.2.1 Noise filtering
We have tried to filter out noise from the agreement matrix in the following way:
● we aggregated all rows (raters) using a sum function, with each vote coded as +1 or -1 depending on the preferred order
● we dropped all (uuid1, uuid2) pairs where abs(x) ≤ 5. This means that we only kept pairs
where the difference between the number of raters who preferred (uuid1, uuid2) over (uuid2, uuid1)
was greater than 5.
This resulted in a dataset of 28,611 pairs, down from 27,820,109.
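This filtering step can be sketched as follows, assuming the agreement matrix is represented as a mapping from rater (<filehash>) to 1/0 votes per pair; the function name and threshold parameter are illustrative:

```python
from collections import defaultdict

def filter_noisy_pairs(matrix, threshold=5):
    """Sum each pair's 1/0 votes as +1/-1 per rater and keep only pairs
    whose absolute vote margin exceeds the threshold."""
    margins = defaultdict(int)
    for votes in matrix.values():  # one entry per rater (<filehash>)
        for pair, vote in votes.items():
            margins[pair] += 1 if vote == 1 else -1
    return {pair: m for pair, m in margins.items() if abs(m) > threshold}
```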
Figure 19: Pair strength in the filtered dataset.
The filtered dataset has the distribution shown in Figure 19, where pair strength is the margin by which
raters agreed on the citation order.
After the filtering, we observe an agreement of 0.139 (Krippendorff's alpha, nominal metric).
This indicates that a very low agreement can be observed. However, it remains questionable whether this
observed agreement is not merely a result of the selection. This experiment was conducted on a dataset of
513,173 article pair judgements.
8.2.2 What can be done next on the evaluation dataset?
1. Given that no significant agreement can be observed, we might want to try binning citations
originating from the same sentence/paragraph/section. This would mean that the order within such a
unit is irrelevant and pairs should not be generated between its items. This would decrease the
dimensionality, and it may be worth testing whether some agreement can then be seen.
2. The scalability problem is not necessarily a problem of the agreement algorithms themselves, but of
the size of the input matrix, which has a very large number of columns. As the matrix is also very
sparse, one option would be to look for an implementation that uses a sparse matrix representation.
3. Approach the Narrative Paths problem in a different way, for example by deciding that a path is
composed of influential papers impacting the currently visited paper. An approach similar to
Valenzuela et al. [Valenzuela, 2015] or Zhu et al. [Zhu, 2015] could then be followed.
4. Think of another approach for obtaining an evaluation data set.
5. Create a narrative path using Mendeley usage data following the idea of mining what readers usually
read after reading a given paper.
8.3 Experiment 2: Mining Narrative Paths from Mendeley reading logs
The aim of this experiment was to develop an approach for browsing based on the idea of narrative paths
generated from Mendeley usage data. More specifically, our assumption is that we can model the problem as a
Markov chain where the states correspond to academic articles and the transition probabilities, representing
the relationship "read after", are mined from usage data. The transition probability of moving from article A
to article B is given by the probability of a user opening article B within 10 minutes of opening article A in
any of the Mendeley clients, i.e. Mendeley Desktop, Mendeley Web Library and the mobile application.
Our approach can be fundamentally divided into two steps:
1. Generating the Markov chain, i.e. a directed weighted graph where the probability of moving to the
next state depends only on the current state.
2. Developing a client that allows a user to browse the network.
8.3.1 Generating the Markov chain
Our approach for generating the Markov chain is based on the following steps:
Input
● >12 billion live events from the Mendeley logs (720 GB), from which the transition probabilities are to
be determined
● >622 million user articles (70 GB), corresponding to >132 million unique articles, which represent the
states of the Markov chain
Steps
1. Filter events related to opening a PDF using one of the Mendeley clients, namely:
‘OpenPdfInInternalViewer’, ‘OpenPDFInExternalViewer’ and ‘OpenFileInExternalViewer’.
2. Join filtered events with the catalogue (resulted in 27 million events, 32.6 GB)
3. Filter out document events for which there is no DOI or for which Mendeley does not have a full text.
4. Generate edges <doc_id1, doc_id2> for documents where doc_id2 was opened within 10
minutes of opening doc_id1.
5. Calculate the transition probabilities for the Markov chain, producing triples of the form <doc_id1,
doc_id2, p>, where p = #opened(doc_id1, doc_id2) / #opened(doc_id1, X).
6. Group the triples by doc_id1 to collect outgoing edges and transition probabilities for a given node.
Group the triples by doc_id2 to collect incoming edges and transition probabilities for a given node.
Output
~10 million states and >61 million transition edges.
The original data were obtained by unloading the live_events and catalogue tables stored in the
Mendeley Redshift database. The processing then included the use of several Pig scripts, which were executed
on the Mendeley cluster. To select the pairs of articles that were opened within 10 minutes of each other, we
extended the functionality of Pig by streaming the data through a Python script.
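Steps 4 and 5 can be illustrated with the following sketch in plain Python; the actual pipeline ran as Pig scripts streaming through a Python script, and the event layout and function names below are our assumptions:

```python
from collections import Counter

def session_edges(events, window=600):
    """events: (user, doc_id, epoch_seconds) tuples sorted by user and time.
    Emit <doc_id1, doc_id2> whenever the same user opened doc_id2 within
    `window` seconds (10 minutes) of opening doc_id1."""
    edges = []
    for (u1, d1, t1), (u2, d2, t2) in zip(events, events[1:]):
        if u1 == u2 and d1 != d2 and t2 - t1 <= window:
            edges.append((d1, d2))
    return edges

def transition_probabilities(edges):
    """p = #opened(doc_id1, doc_id2) / #opened(doc_id1, X)."""
    pair_counts = Counter(edges)
    out_counts = Counter(d1 for d1, _ in edges)
    return {(d1, d2): c / out_counts[d1] for (d1, d2), c in pair_counts.items()}
```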
A sample of the Markov chain is displayed in Figure 20 below.
Figure 20: A sample from the Markov chain induced from the Mendeley log data showing states as academic articles with
transitional probabilities between them.
8.3.2 Developing the client application
The functionality of the client application is as follows:
Preprocessing:
Import the network into Elasticsearch (or a database/key-value store) so that we can quickly retrieve
all the outgoing and incoming edges and transition probabilities for a given node.
Use case:
1. Retrieve the DOI of a document that the user wants to generate a path for/from by activating a
bookmarklet on this document. For simplicity, this has been implemented to work within the
Mendeley catalog web interface12, but the work could be fairly easily extended to work on any
research paper available on the Web.
2. Query an Elasticsearch instance to retrieve the identifiers of incoming and outgoing edges and
transition probabilities for a given state. Rank the retrieved states/edges in descending order of
their transition probability and select the identifiers of the top N incoming and outgoing edges.
3. Resolve the retrieved identifiers (DOIs) against the Mendeley API to get the article metadata.
4. Display the retrieved documents to the user presenting which articles people tend to open after
reading this document and which articles they read before reading this document.
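The ranking logic of step 2 can be sketched in memory as follows; in the prototype the edge lists are served from Elasticsearch, and the names below are our own illustration:

```python
def top_neighbours(prob_edges, doi, n=5):
    """Return the top-n incoming and outgoing neighbours of a state,
    ranked by descending transition probability.
    prob_edges: {(doc_id1, doc_id2): p} transition probabilities."""
    outgoing = sorted(((d2, p) for (d1, d2), p in prob_edges.items() if d1 == doi),
                      key=lambda e: e[1], reverse=True)[:n]
    incoming = sorted(((d1, p) for (d1, d2), p in prob_edges.items() if d2 == doi),
                      key=lambda e: e[1], reverse=True)[:n]
    return incoming, outgoing
```

The incoming list corresponds to "read before" and the outgoing list to "read after" suggestions.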
12 https://www.mendeley.com/catalog/
The client application was developed using the bookmarklet technology described in the previous
deliverables. To install the bookmarklet, all the user has to do is drag and drop a link to the bookmarklet
into the bookmarks bar in Chrome. The user can then visit any article page with a DOI in the Mendeley web
catalog and activate the bookmarklet on it. The result will look similar to the one in Figure 21, showing
what people tend to read next/before reading this document.
Figure 21: The bookmarklet for what people read next/before reading this document.
8.3.3 Limitations
The decision to model the problem as a Markov chain is a simplification. In theory, the transition
probabilities should depend on all the states the user has visited before. We leave modelling this
dependence, and determining any discounting factors, to future work.
We have observed that in some cases the same article can be recommended both as read after and as
read before. While these loops in the network are supported by the usage data, it might be a good idea
to remove them, as they might suggest that these articles can be read in any order.
8.3.4 Evaluation
To evaluate the quality of the proposed recommendations, we performed a user study over the Mendeley user
base. We selected and e-mailed a sample of Mendeley Core Users, i.e. users with high activity on Mendeley, who
had already used a bookmarklet and visited Mendeley Suggest13. We asked the users to install the
bookmarklet, to try it on some sample documents or to freely navigate the Mendeley Catalogue, and to answer a
short survey14.
The survey was answered by 70 Mendeley Core Users, i.e. users that habitually use Mendeley in their
workflow, drawn from all academic positions except Bachelor students.
How easy was the bookmarklet to install and to use?
Very easy: everything worked fine 54.40%
Somewhat easy: it worked in most cases but there were a couple of small problems 17.60%
Not very easy: it took a few goes to get it to work or it only worked in specific cases 10.30%
I was unable to get it to work at all 17.60%
Table 17
How useful did you find the reading lists generated?
Very useful: the reading lists were relevant 33.80%
Somewhat useful: the reading lists were useful in some cases, but not others 42.60%
Not very useful: the reading lists were largely irrelevant 23.50%
Table 18
How useful is the list of papers that people read BEFORE the current document?
Very useful: the suggested papers are relevant to better understand the current paper. 29.40%
Somewhat useful: some of the suggested paper are relevant others not 44.10%
Not very useful: the suggested papers are not relevant at all 26.50%
Table 19
13 https://www.mendeley.com/suggest/
14 https://www.surveymonkey.com/r/FJ5MYXW
How useful is the list of papers that people read AFTER the current document?
Very useful: the suggested papers are relevant and helpful to discover new papers. 30.90%
Somewhat useful: some of the suggested paper are relevant others not. 48.50%
Not very useful: the suggested papers are not relevant at all. 20.60%
Table 20
Overall, how likely would you be to use this tool in your research workflow?
Very likely: I’ll use this regularly 42.60%
Somewhat likely: I may use this occasionally 36.80%
Unlikely: I won’t use this at all or only very rarely 20.60%
Table 21
Figure 22: Do you have any other comments or specific suggestions?
The results indicate that the bookmarklet technology is a suitable aid for exploring a path. The narrative
paths themselves are judged generally useful, but sometimes a path does not correspond to the user's idea of
a sequence. The before/after ordering seems to agree with the users' perception.
The survey also highlights the need to refine the construction of the Markov chain, in particular tuning the
threshold for the transition probability and identifying loops between before and after.
The users appreciate the narrative paths as an element of their research workflow and advocate the
integration of narrative paths into Mendeley Suggest.
8.3.5 How is this different from the previous narrative paths bookmarklet?
There are a number of differences from the bookmarklet we reported in the previous deliverables. Perhaps the
main difference lies in the idea used to extract the narrative path: while the previous bookmarklet used a
content-based method to extract citation references from the article full text, this method relies on usage
activity logs. This has a number of consequences:
● The method is able to provide recommendations on what to read next/before based on information
about what people usually read next/before, rather than what the author would suggest they read
next/before.
● The usage-based method is not applicable to documents for which we do not track, or do not yet
have, usage activity. These include new documents added to Mendeley for the first time and
documents which are not research articles, such as government reports, blog posts, Wikipedia pages,
etc.
● The usage-based method has the potential to add serendipity to the recommendations.
● The usage-based method faces the challenge of relying on an uneven amount of user activity on each
document. This means that the quality of the produced recommendations can be expected to be
good for documents with many readers, while being poor for documents with fewer readers. The
content-based method, on the other hand, is likely to produce roughly the same quality of
recommendations across documents.
● The content-based method provides little information allowing us to decide on the direction, i.e.
whether a user should read something next or before; a reference by itself does not indicate the
reading order.
Overall, both approaches have pros and cons. To move the prototype solution into practice, we suggest
combining them in the following way:
1. Extract citation references using the content based method
2. Retrieve links people follow using the usage data based method described in this section.
3. If we don’t have usage data, display the results from the content based method.
4. Otherwise, see if there is any overlap between the recommendations provided by the content based
and usage based method. If there is an overlap, it means that an article that is explicitly referenced by
a document is also often read by the readers, i.e. people follow this citation. This means that this
reference is important/useful for understanding this document as shown in Figure 23 below.
Consequently, this activity should be scored higher.
5. If there is no overlap, use the usage based recommendation if we have sufficient activity on the
document, otherwise fall back to the content based solution.
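The combination steps above can be sketched as follows; the activity threshold and all names are hypothetical, not part of an implemented system:

```python
def combine_recommendations(content_refs, usage_recs, activity, min_activity=10):
    """Combine content-based citation references with usage-based
    recommendations: boost the overlap, otherwise prefer usage data
    when there is enough activity, falling back to content otherwise."""
    if not usage_recs:                      # step 3: no usage data at all
        return content_refs
    overlap = [r for r in usage_recs if r in content_refs]
    if overlap:                             # step 4: citations readers actually follow
        return overlap + [r for r in usage_recs if r not in content_refs]
    # step 5: sufficient activity -> usage-based, else content-based
    return usage_recs if activity >= min_activity else content_refs
```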
Figure 23: An example transition from the generated Markov chain which coincides with a citation extracted
from the full text of the citing document. We believe that such cases are indicative of the importance of a
citation.
8.3.6 Possible improvements & Future work
The idea of narrative-path browsing is currently realised with the bookmarklet technology, which allows the
user to see only one step ahead and one step back. However, this is a limitation of the UI rather than of the
approach. An obvious improvement would be to change the UI to support displaying full paths based on the
Markov model. Establishing the length of a path and its end point is an open question that could be explored.
We currently believe that a practical approach would be the following: by activating the bookmarklet on a
given resource, this resource would be treated as the goal resource, and a shortest path over the graph would
be computed connecting it with any of the resources the user already has in their library.
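The shortest-path idea can be made concrete by running Dijkstra's algorithm over edge weights -log(p), so that the shortest path maximises the product of transition probabilities along the path. This is our illustration of the open question, not an implemented feature:

```python
import heapq
from math import log

def most_probable_path(prob_edges, start, goal):
    """Dijkstra over -log(p) weights: the shortest path under these
    weights maximises the product of transition probabilities.
    prob_edges: {(doc_id1, doc_id2): p} transition probabilities."""
    graph = {}
    for (a, b), p in prob_edges.items():
        graph.setdefault(a, []).append((b, -log(p)))
    dist = {start: 0.0}
    heap = [(0.0, start, [start])]
    while heap:
        d, node, path = heapq.heappop(heap)
        if node == goal:
            return path
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt, path + [nxt]))
    return None  # goal unreachable from start
```

Note that a two-step path through high-probability edges can beat a direct low-probability edge, which matches the intuition of a natural reading sequence.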
8.4 Conclusions
We have presented two experiments addressing the automatic generation of the network of resources needed
to solve the Narrative Paths problem. Both followed the idea of extracting directed relations connecting
resources based on the principle of "read B after A". In Experiment 1 we investigated a content-based
approach relying on citation positioning. In Experiment 2 we investigated an approach relying on user
activity data. Our results suggest that the user data can be exploited to induce a Markov chain over reading
sequences and to suggest a reading path to the user. Other systems can utilise this model to provide
recommendation/browsing capabilities across collections using the idea of narrative paths. We suggest
combining this approach with the content-based approach described in previous deliverables, which naturally
complements this idea.
9 References
[Callan, 2001] Callan, J., & Connell, M. (2001). Query-based sampling of text databases. ACM
Transactions on Information Systems (TOIS), 19(2), 97-130.
[Demeester, 2015] Demeester, T., Trieschnigg, D., Zhou, K., Nguyen, D., & Hiemstra, D. (2015). FedWeb
Greatest Hits: Presenting the New Test Collection for Federated Web Search. In 24th
International World Wide Web Conference (WWW 2015).
[Fernando, 2013] Fernando, S., Goodale, P., Clough, P., Stevenson, M., Hall, M., & Agirre, E. (2013).
Generating Paths through Cultural Heritage Collections. LaTeCH 2013.
[Fleiss, 1973] Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass
correlation coefficient as measures of reliability. Educational and psychological
measurement.
[Mikolov, 2013] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781.
[Lu, 2005] Lu, J., & Callan, J. (2005). Federated search of text-based digital libraries in hierarchical
peer-to-peer networks. In Advances in Information Retrieval (pp. 52-66). Springer Berlin
Heidelberg.
[Jin, 2014] Jin, S., & Lan, M. (2014). Simple May Be Best-A Simple and Effective Method for Federated
Web Search via Search Engine Impact Factor Estimation. In TREC.
[Krippendorff, 2013] Krippendorff, K. (2013). Component of Content Analysis. Content Analysis: An Introduction
to its Methodology. 3rd Edition. Los Angeles: SAGE Publication.
[Pass, 2006] Pass, G.; Chowdhury, A. & Torgeson, C. (2006), A Picture of Search, in "Proceedings of the
1st International Conference on Scalable Information Systems", ACM, New York, USA.
[Rhodes, 2000] Rhodes, B. J. (2000). Just-in-time information retrieval (Doctoral dissertation,
Massachusetts Institute of Technology).
[Valenzuela, 2015] Valenzuela, M., Ha, V., & Etzioni, O. (2015). Identifying Meaningful Citations. AAAI
Workshops.
[Zhu, 2015] Zhu, X., Turney, P., Lemire, D., & Vellino, A. (2015). Measuring academic influence: Not all
citations are equal. Journal of the Association for Information Science and Technology,
66(2), 408-427.
[Ziak, 2015] Ziak, H., & Kern, R. (2015). Evaluation of Pseudo Relevance Feedback Techniques for Cross
Vertical Aggregated Search. In Experimental IR Meets Multilinguality, Multimodality, and
Interaction (pp. 91-102). Springer International Publishing.
10 Glossary
Terms used within the EEXCESS project.
Partner Acronyms
JR-DIG JOANNEUM RESEARCH Forschungsgesellschaft mbH, AT
Uni Passau University of Passau, GE
Know Know-Center - Kompetenzzentrum für Wissenschaftsbasierte Anwendungen und Systeme
Forschungs- und Entwicklungs Center GmbH, AT
INSA Institut National des Sciences Appliquées (INSA) de Lyon, FR
ZBW German National Library of Economics, GE
BITM BitMedia, AT
KBL-AMBL Kanton Basel Land, CH
CT Collection Trust, UK
MEN Mendeley Ltd., UK
WM wissenmedia, GE
Abbreviations
API Application Programming Interface
EC European Commission
EEXCESS Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources
HTTP Hyper Text Transfer Protocol
JSON JavaScript Object Notation
LSI Latent Semantic Indexing
NLP Natural Language Processing
XML Extensible Markup Language
Acknowledgement: The research leading to these results has received funding from the European Union's
Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 600601.