
Project Acronym: FIRST

Project Title: Large scale information extraction and integration infrastructure for supporting financial decision making

Project Number: 257928

Instrument: STREP

Thematic Priority: ICT-2009-4.3 Information and Communication Technology

D1.3 Metrics, methods, and plans for evaluation of results

Work Package: WP1 – Requirements Analysis

Due Date: 30/09/2011

Submission Date: 30/09/2011

Start Date of Project: 01/10/2010

Duration of Project: 36 Months

Organisation Responsible for Deliverable: MPS

Version: 1.0

Status: Final

Main authors: Paolo Lombardi (MPS), Giorgio Aprile (MPS), Dominic Ressel (UHOH), Alexandra Winter (IDMS), Stefan Queck (NEXT), Markus Reinhardt (NEXT)

Peer Reviewer(s): Tomas Pariente Lobo (ATOS), Martin Žnidaršič (JSI), Marko Bohanec (JSI)

Nature: R – Report

Dissemination level: PU – Public

Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)

Page 2: D1.3 Metrics, methods, and plans for evaluation of resultsfirst.ijs.si/FirstShowcase/Content/reports/D1.3.pdf · Risk the probability that a particular adverse event occurs during

D1.3

© FIRST consortium Page 2 of 28

Revision history

Version | Date | Modified by | Comments
0.1 | 11/11/2010 | Dominic Ressel (UHOH) | Draft TOC created
0.2 | 15/11/2010 | Dominic Ressel (UHOH) | Draft State of the Art
0.3 | 02/12/2010 | Dominic Ressel (UHOH) | Draft TOC revised
0.4 | 08/12/2010 | Dominic Ressel (UHOH) | Draft TOC revised
0.5 | 19/07/2011 | Paolo Lombardi (MPS) | Draft TOC revised, content added and content production assigned
0.6 | 22/07/2011 | Giorgio Aprile, Paolo Lombardi (MPS) | Inserted feedback from Partners. Added metrics for UC#2
0.7 | 22/07/2011 | Giorgio Aprile (MPS) | Inserted comments in UC#2
0.8 | 28/07/2011 | Paolo Lombardi (MPS) | Integrated comments from all Partners and finalised internal tasks for development of the various sections
0.9 | 02/09/2011 | Dominic Ressel (UHOH), Maria Costante, Giorgio Aprile, Paolo Lombardi (MPS) | Revised sections on annotation; inserted technical content for UC#2 and methodology for GQM
0.10 | 06/09/2011 | Alexandra Winter (IDMS) | Inserted technical content for UC#3
0.11 | 07/09/2011 | Paolo Lombardi, Giorgio Aprile (MPS) | Inserted technical content for UC#2
0.12 | 08/09/2011 | Paolo Lombardi, Giorgio Aprile (MPS), Alexandra Winter (IDMS) | Revised and harmonised metrics tables and Section on evaluation planning
0.13 | 09/09/2011 | Stefan Queck, Markus Reinhardt (NEXT) | Inserted technical content for UC#1
0.14 | 12/09/2011 | Paolo Lombardi (MPS) | Finalised complete draft and circulated for final comments before peer review
0.15 | 15/09/2011 | Paolo Lombardi (MPS) | Comments incorporated and sent to peer reviewers (Tomas Pariente Lobo, ATOS, and Martin Žnidaršič and Marko Bohanec, JSI)
0.16 | 26/09/2011 | Paolo Lombardi (MPS) | Revised according to peer reviewers’ comments. Sent back to ATOS and JSI for final check before final dispatch to the Coordinator
0.17 | 28/09/2011 | Paolo Lombardi (MPS) | Included comments from Achim Klein (UHOH) and final touch-ups
1.0 | 30/09/2011 | Tomás Pariente (ATOS) | Final QA and preparation for submission


Copyright © 2011, FIRST Consortium

The FIRST Consortium (www.project-first.eu) grants third parties the right to use and distribute all or parts of this document, provided that the FIRST project and the document are properly referenced.

THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Executive Summary

The main objective of the present document is to define metrics, methods and plans for the evaluation of the results arising from the three FIRST usecases, specified in deliverables D1.1 (“Definition of market surveillance, risk management and retail brokerage usecases”) and D1.2 (“Usecase requirements specification”).

In particular, the document defines the detailed metrics through which the three FIRST usecase owners will evaluate progress on the usecase prototypes, by means of a commonly accepted methodology (GQM, briefly instantiated for FIRST in this document).

The method of “corpus construction”, whose purpose is to support the development and validation of all the FIRST usecases, is also briefly illustrated, as is an overview of the FIRST ontology.

Moreover, a high-level plan for evaluation of the results is provided, based on the GQM (Goal-Question-Metrics) methodology, as agreed among all FIRST Partners.

Finally, when D1.2 was written, no detailed feasibility analysis could yet be performed; all Partners therefore agreed that, as RTD activities progress, the requirements described there may be prioritised, pruned, altered, or extended in the interests of the Project, always following the contractual provisions of FIRST. Consequently, the metrics established in the present document may likewise be adjusted in due course, pursuant to any modifications of D1.2 agreed at Technical Committee level and validated at General Assembly level.


Table of Contents

Executive Summary
Glossary
Abbreviations and acronyms
1. Introduction
1.1. Purpose of this document
1.2. Research approach
1.3. Intended audience
1.4. Background and related documents
2. Metrics
2.1. General approach
2.2. Metrics for UC#1, “Market Surveillance”
2.2.1 UC#1 – Overall performance metrics
2.2.2 UC#1 – Metrics for functional requirements
2.2.3 UC#1 – Metrics for non-functional requirements
2.3. Metrics for UC#2 “Reputational Risk”
2.3.1 UC#2 – Overall performance metrics
2.3.2 UC#2 – Metrics for functional requirements
2.3.3 UC#2 – Metrics for non-functional requirements
2.4. Metrics for UC#3 “Retail Brokerage”
2.4.1 UC#3 – Overall performance metrics
2.4.2 UC#3 – Metrics for functional requirements
2.4.3 UC#3 – Metrics for non-functional requirements
3. Methods – Corpus construction
3.1. Introduction
3.2. Corpus size and annotators: State of the art
3.3. Annotators and training in FIRST
4. Methods – Ontology implementation and update
5. Methods – Evaluation of results: brief overview of the GQM methodology
6. Plan for evaluation of results
7. Conclusion
7.1. Refinement policy for the metrics
7.2. Next steps
References

Index of Figures

Figure 1: GQM schema


Index of Tables

Table 1. Metrics exemplified in the DoW
Table 2. UC#1 Overall performance metrics
Table 3. UC#1 Metrics for Functional Requirements
Table 4. UC#1 Metrics for Non-Functional Requirements
Table 5. UC#2 Overall performance metrics
Table 6. UC#2 Metrics for Functional Requirements
Table 7. UC#2 Metrics for Non-Functional Requirements
Table 8. UC#3 Overall performance metrics
Table 9. UC#3 Metrics for Functional Requirements
Table 10. UC#3 Metrics for Non-Functional Requirements
Table 11. General information about different corpus design approaches
Table 12. Annotators and training
Table 13. Overview of the FIRST ontology
Table 14. GQM steps planning


Glossary

For the sake of clarity, and as far as FIRST is concerned, the following table describes the meaning attributed to some key terms that are dealt with in the present document.

Actor User that interacts, either directly or indirectly, with the FIRST system.

Counterpart An entity, typically an organisation operating in the financial industry, with which a business relationship is established.

General requirement An end-user requirement that is common to all three FIRST usecases.

Index Weighted average of a number of stock prices; one example is the IBEX 35.

Instrument A (financial) instrument refers to an entity whose value is determined on financial markets. E.g. a stock or an index.

Pillar 3 The common reporting standard defined by the Basel II Accord, concerning the disclosure of risk measurement information.

Reputation Reputation is the opinion (more technically, a social evaluation) of a group of subjects towards an entity on a certain criterion. It is an important factor in the financial industry, especially when referred to a financial institution.

Reputational risk The risk arising from negative perception on the part of customers, counterparties, shareholders, investors or regulators that can adversely affect an organisation’s (viz., a Bank, in the FIRST specific usecase) ability to maintain or develop a certain business practice.

Risk The probability that a particular adverse event occurs during a stated period of time, or results from a particular challenge [The Royal Society, 1992].

Securities Shares representing an investor’s ownership interest (e.g. shares in a company (equity shares), in a fund, or in a bond), or securitised rights, e.g. warrants or options.

Sentiment A sentiment on a sentiment object o is a positive or negative view, attitude, emotion or appraisal on o from a document author or actor.

Sentiment analysis According to Liu [Hu and Liu, 2004], sentiment analysis and opinion mining are synonyms. However, in contrast to Liu, we use the term sentiment in place of opinion in the following definitions.

Sentiment object A sentiment object o is an entity which, in the scope of this deliverable, is typically a financial instrument fi ∈ FI (adapted from Liu).

Sentiment orientation The sentiment orientation so ∈ {positive, negative} of a sentiment on a sentiment object o indicates whether the sentiment is positive or negative (adapted from Liu).

Shareholder A shareholder (or stockholder) is an individual or institution (including a corporation) that legally owns one or more shares of stock in a public or private corporation. Shareholders own the stock, but not the corporation itself.

Specific requirement An end-user requirement that pertains to a single usecase.


Stakeholder A stakeholder is an individual or group that takes active part in an organisation or in an organised process, influencing and/or being influenced by such organisation/process (e.g., suppliers, consumers, employees, shareholders, financial community, government, media).

Structured information Information that is provided in accordance with a data structure, i.e. each data item is assigned to a data field with defined meaning.

Technology providers ATOS, GUF, IDMS, JSI, UGOE, UHOH.

Unstructured information Implicit information that cannot be directly assigned a defined meaning but needs interpretation and context knowledge. The textual content of documents is a typical example of unstructured information.

Usecase providers IDMS, MPS, NEXT.

Volatility Standard deviation of a security’s price (or returns). Commonly used as a risk measure for a security.

Volume Number of shares traded.


Abbreviations and acronyms

DoW Description of Work

DSS Decision Support System

EC European Commission

ICT Information & Communication Technology

GQM Goal, Question, Metric

GUI Graphical User Interface

KRI Key-Risk Indicator

TBD To be defined

UAT User Acceptance Test

UC Usecase

WP Workpackage


1. Introduction

1.1. Purpose of this document

The purpose of this document is to define the metrics, methods and plans necessary to conduct the evaluation of the results arising from the three FIRST usecases (UCs).

In particular, the present document covers the metrics and evaluation plans that will be utilised to assess the end-user prototypes, from an end-user perspective, i.e. directly connecting the elements of evaluation to the requirements detailed in D1.2.

The document has been produced in close interrelation with the technical workpackages, specifically WP2, WP3, WP4, WP5 and WP6.

1.2. Research approach

FIRST validates its main innovations by defining and carrying out the three complementary, business-driven usecases defined in D1.2. Progress of results on such UCs is measured against the metrics specifically described in the present document.

The methodology proposed for results evaluation is GQM1; it will be applied in a harmonised fashion across the three UCs, as described in Sec. 6 of the present document.

1 See for instance http://en.wikipedia.org/wiki/GQM

1.3. Intended audience

The target readers are mainly Work Package and Task Leaders involved in the FIRST developments, and the European Commission.

The document is released as “Public”, in order to maximise the impact of FIRST on society.

1.4. Background and related documents

The present deliverable is part of the second wave of documents produced by the Project; its specific background comprises the following user-driven documents:

Deliverable D1.1, “Definition of market surveillance, risk management and retail brokerage usecases”;

Deliverable D2.1, “Technical requirements and state-of-the-art”.

Moreover, specific reference is hereby made to Section 1.1.5 of the FIRST DoW, titled “Evaluation of project results”, where a number of selected metrics for evaluation of the Project at technical level were already defined, viz.:

Component | Metrics
Data acquisition | Number of acquired data sources and the corresponding data volume in terms of data records per day
Ontology infrastructure (engineering, learning) | Size of semantic resources, measured mostly in the number of defined concepts; fitness-for-purpose (reflected and quantified in the subsequent stages of the process)
Information extraction components and event detection/prediction models | Recall, precision, F-measure, accuracy
Processing pipeline | Throughput (i.e. number of data records processed per second), processing speed (i.e. time required to process a data record)
Market surveillance prototype | Pre-event abnormal returns or trading volumes, risk metrics
Risk management, online retail banking and brokerage prototypes | Volatility, value at risk, Sharpe ratio, Treynor’s measure, Jensen’s alpha

Table 1. Metrics exemplified in the DoW

Some of the aforementioned parameters are already part of the specification document; others have additionally been identified by the UC Owners as relevant and effective for project delivery monitoring and have therefore been inserted in Section 2 of the present document. Finally, others, although neither expressly inserted in D1.2 nor directly specified in the present document, are effectively part of the intended FIRST integrated solution (e.g., the metrics on the ontology infrastructure).

In the following sections of the present document, specific reference is made to establishing a common and measurable path for UC delivery appraisal, an activity scheduled for Year 2 of the FIRST Project and part of WP8, “End-user prototypes and Evaluation”.
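Several of the DoW metrics in Table 1 (recall, precision, F-measure, accuracy) are standard classification measures. The following is a purely illustrative sketch of how they could be computed for an information extraction component against a gold standard corpus; the function and label names are our own, not part of the FIRST specification:

```python
def classification_metrics(gold, predicted, positive="sentiment"):
    """Recall, precision, F-measure and accuracy, as named in Table 1,
    for binary labels (illustrative sketch, not FIRST project code)."""
    assert len(gold) == len(predicted)
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": correct / len(gold)}

# Example in the spirit of MET1.2 (Sec. 2.2.1): at least 50% of the
# sentiment-bearing articles must be classified as containing sentiment,
# which is a recall-style threshold.
gold      = ["sentiment", "sentiment", "none", "sentiment", "none"]
predicted = ["sentiment", "none",      "none", "sentiment", "sentiment"]
m = classification_metrics(gold, predicted)
print(m, "MET1.2-style threshold met:", m["recall"] >= 0.50)
```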


2. Metrics

2.1. General approach

Given that a general acceptance criterion for the requirements specified in D1.2 (i.e., requirement delivered: Y/N) will be adopted by all UC Owners, metrics are additionally adopted for assessing the overall performance of each end-user prototype of the 3 UCs and, in particular, for evaluating the functional and non-functional requirements listed in the aforementioned D1.2.

The following sections specify the overall performance metrics and the specific metrics for individual requirements adopted by the FIRST Project, which are key to assessing progress. These metrics are by no means meant to substitute for correct delivery of the specified requirements (described in D1.2); rather, by means of the evaluation plan described in Sec. 6 of the present document, they allow a correct appraisal of the effectiveness of the designed solution, focusing on the innovations delivered compared to the current state of the art in the industries addressed by the 3 FIRST UCs.

Metrics are indicated as METX.Y, where the prefix “MET” places such metrics in a dimension clearly separated from the usecase requirements, index “X” identifies the usecase the metric refers to, and index “Y” is the progressive number of the metric within that usecase.

A total of 27 parameters have been identified to evaluate FIRST results in the development of the three usecases specified in document D1.2: 11 relate to UC#1, and 8 each to UC#2 and UC#3.

2.2. Metrics for UC#1, “Market Surveillance”

The following sections report the metrics that will be monitored to assess the overall performance of the UC#1 (“Market Surveillance”) end-user prototype and its functional and non-functional requirements.

2.2.1 UC#1 – Overall performance metrics

In the following table, the metrics that will be utilised to assess the overall performance of the UC#1 end-user prototype are reported.

Overall feature | Metrics | Acceptance threshold | Notes
Web resources scanning | MET1.1 – Web articles rate of acquisition | At least 95% of the selected web resources (as per D1.1) need to be scanned hourly | Derived from UC1.R-E1, UC1.R-E2, UC1.R-E3
Annotation capability (“recall”) | MET1.2 – Fraction of identified articles effectively acquired and annotated by the system | At least 50% of the articles that contain sentiment should be classified as containing sentiment | Derived from UC1.R-E1, UC1.R-E2, UC1.R-E3
Financial Object Identification | MET1.3 – Identification and annotation of objects within the articles by the system | At least 60% of the objects in the articles must be identified and annotated. In any case, this threshold should remain below 90% of the upper bound given by human expert consensus rates | Derived from UC1.R-E1, UC1.R-E2
Update of the sentiment | MET1.4 – Update time of the calculated sentiment | The sentiments must be updated at least once a day | Derived from UC1.R-E1, UC1.R-E2

Table 2. UC#1 Overall performance metrics

2.2.2 UC#1 – Metrics for functional requirements

In the following table, the metrics that will be utilised to assess the specific functional aspects delivered by the UC#1 end-user prototype are reported.

Requirement ID / Name | Metrics | Acceptance threshold | Notes
Sentiment objects | MET1.5 – Annotation and resulting calculation of the sentiments done correctly with regard to the direction (positive/negative) | Threshold: 65% | Derived from UC1.R-F1, UC1.R-F5
Sentiment analysis | MET1.6 – Fraction of the features (price development) correctly identified for sentiment objects | Threshold: 60%. In any case, this threshold should remain below 90% of the upper bound given by human expert consensus rates | Derived from UC1.R-F1, UC1.R-F5
Sentiment history | MET1.7 – Complete data history of identified and valuated sentiments, as well as of all alerts generated by the system | System has to keep a history of all identified and valued sentiments, as well as of all generated alerts; the historisation horizon should be 5 years | Derived from UC1.R-F4, UC1.R-F5, UC1.R-F6, UC1.R-F7
Search function | MET1.8 – Availability of search functionality for each alert detected within the system | System has to guarantee search functionality: 100% | UC1.R-F7
Cockpit functions | MET1.9 – Display of alerts in a graphically understandable manner for trend analysis | System has to visualise all alerts: 90% | UC1.R-F5

Table 3. UC#1 Metrics for Functional Requirements


2.2.3 UC#1 – Metrics for non-functional requirements

In the following table, the metrics that will be utilised to assess the specific non-functional aspects delivered by the UC#1 end-user prototype are reported.

Requirement ID / Name | Metrics | Acceptance threshold | Notes
Capacity | MET1.10 – System should be able to handle multiple users working simultaneously | Minimum 10 users | Derived from UC1.R-P1
Response Time | MET1.11 – Request response time of max. 10 seconds, measured from sending a request to presenting the results; the response time need not be measured by the system itself | Tolerance: +/- 10% | Derived from UC1.R-P2

Table 4. UC#1 Metrics for Non-Functional Requirements

2.3. Metrics for UC#2 “Reputational Risk”

The following sections report the metrics that will be monitored to assess the overall performance of the UC#2 (“Reputational Risk”) end-user prototype and its functional and non-functional requirements.

2.3.1 UC#2 – Overall performance metrics

In the following table, the metrics that will be utilised to assess the overall performance of the UC#2 end-user prototype are reported.

Overall feature | Metrics | Acceptance threshold and/or tolerance | Notes
Web resources scanning | MET2.1 – Web articles rate of acquisition | At least 95% of the selected web resources (as per D1.1) need to be scanned hourly | Derived from R-E1.1, UC2.R-E1.1, UC2.R-E1.2, UC2.R-P3
Annotation capability (“recall”) | MET2.2 – Fraction of relevant articles effectively acquired and annotated by the system | At least 50% of the articles that contain sentiment should be classified as containing sentiment | Derived from R-E1.1, UC2.R-E1.1, UC2.R-E1.2
Update of the reputation KRI | MET2.3 – Update time of the KRI vector on counterparts | The KRI must be updated hourly with a tolerance of +/- 15 minutes | UC2.R-P3

Table 5. UC#2 Overall performance metrics
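MET2.3 above requires the KRI vector to be updated hourly within a +/- 15 minute tolerance. A minimal sketch of how such a check might be automated over a log of update timestamps follows; the log representation is an assumption made for illustration, as FIRST may record updates differently:

```python
from datetime import datetime, timedelta

TOLERANCE = timedelta(minutes=15)   # MET2.3: hourly updates, +/- 15 minutes

def hourly_updates_ok(timestamps):
    """Check that consecutive KRI updates are spaced one hour apart,
    within tolerance (illustrative sketch, not FIRST project code)."""
    ts = sorted(timestamps)
    one_hour = timedelta(hours=1)
    return all(abs((b - a) - one_hour) <= TOLERANCE
               for a, b in zip(ts, ts[1:]))

# Hypothetical update log for one counterpart's KRI vector.
log = [datetime(2011, 9, 30, 9, 0),
       datetime(2011, 9, 30, 10, 5),   # 65-minute gap: within tolerance
       datetime(2011, 9, 30, 11, 2)]
print("MET2.3 met:", hourly_updates_ok(log))
```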


2.3.2 UC#2 – Metrics for functional requirements

In the following table, the metrics that will be utilised to assess the specific functional aspects delivered by the UC#2 end-user prototype are reported.

Requirement ID / Name | Metrics | Acceptance threshold and/or tolerance | Notes
UC2.R-F1 (Reputation vocabulary) | MET2.4 – Fraction of the counterparts and keywords effectively handled by the system | Threshold: 70% (with respect to concepts covered in the ontology) | –
UC2.R-F9 (Sentiment analysis) | MET2.5 – Fraction of the features (see D1.2, Sec. 4.3.4) correctly identified for sentiment objects | Threshold: 60% (with respect to concepts covered in the ontology). In any case, this threshold should remain below 90% of the upper bound given by human expert consensus rates | –
UC2.R-F12 (DSS) | MET2.6 – Backtest of the quality of scenario reports produced by the DSS | Threshold: 50% | The scenario report will be further detailed as part of WP6 activities on models, in order to define the parameters against which this threshold is measured

Table 6. UC#2 Metrics for Functional Requirements

2.3.3 UC#2 – Metrics for non-functional requirements

In the following table, the metrics that will be utilised to assess the specific non-functional aspects delivered by the UC#2 end-user prototype are reported.

Requirement ID / Name | Metrics | Acceptance threshold and/or tolerance | Notes
UC2.R-P2 (Latency) | MET2.7 – Fraction of reports delivered within 5 minutes from news arrival | Threshold: 90% | –
UC2.R-U1 (Usability) | MET2.8 – Quality of the interface delivered | Acceptance from at least 75% of the end-users | Ascertained by means of UATs (User Acceptance Tests)

Table 7. UC#2 Metrics for Non-Functional Requirements


2.4. Metrics for UC#3 “Retail Brokerage”

The following sections report the metrics that will be monitored to assess the overall performance of the UC#3 (“Retail Brokerage”) end-user prototype and its functional and non-functional requirements.

2.4.1 UC#3 – Overall performance metrics

In the following table, the metrics that will be utilised to assess the overall performance of the UC#3 end-user prototype are reported.

Overall feature | Metrics | Acceptance threshold | Notes
Web resources scanning | MET3.1 – Web articles rate of acquisition | At least 95% of the selected web resources (as per D1.1) need to be scanned hourly | –
Annotation capability #1 (“precision”) | MET3.2 – Fraction of identified articles effectively acquired and annotated by the system | At least 60% of the documents that are classified negative/positive must actually be negative/positive | Precision. Derived from UC3.R-EU1, UC3.R-EU2, UC3.R-F1.1, UC3.R-F1.2, UC3.R-F1.3, UC3.R-F1.4
Annotation capability #2 (“recall”) | MET3.3 – Fraction of identified articles effectively acquired and annotated by the system | At least 50% of the articles that contain sentiment should be classified as containing sentiment | Recall
Financial Object Identification | MET3.4 – Identification and annotation of relevant objects (e.g. stocks) within the articles by the system | At least 70% of the relevant objects in the articles must be identified and annotated (with respect to concepts covered in the ontology) | Derived from UC3.R-EU1, UC3.R-EU2, UC3.R-F1.1, UC3.R-F1.2, UC3.R-F1.3, UC3.R-F1.4
Update of the sentiment | MET3.5 – Update time of the calculated sentiment | Sentiments must be updated hourly with a tolerance of +/- 15 minutes | Re-calculation is only necessary in case new articles occur
Backtesting | MET3.6 – Quality assurance of the filtering | For at least 80% of all articles that have been identified as relevant, i.e. for which objects have been identified and sentiments have been calculated, the filtering must be correct. In any case, this threshold should remain below 90% of the upper bound given by human expert consensus rates | –

Table 8. UC#3 Overall performance metrics

2.4.2 UC#3 – Metrics for functional requirements

In the following table, the metrics that will be utilised to assess the specific functional aspects delivered by the UC#3 end-user prototype are reported.

Requirement ID / Name | Metrics | Acceptance threshold | Notes
UC3.R-F1 (Sentiment objects) | MET3.7 – Annotation and resulting calculation of the sentiments done correctly with regard to the polarity (positive/negative) | Threshold: accuracy of 65% at document level. In any case, this threshold should remain below 90% of the upper bound given by human expert consensus rates | Accuracy. Derived from UC3.R-F1.1, UC3.R-F1.2, UC3.R-F1.3, UC3.R-F1.4, UC3.R-F2.1

Table 9. UC#3 Metrics for Functional Requirements

2.4.3 UC#3 – Metrics for non-functional requirements

In the following table, the metrics that will be utilised to assess the specific non-functional aspects delivered by the UC#3 end-user prototype are reported.

Requirement ID / Name | Metrics | Acceptance threshold | Notes
UC3.R-U1 (Usability) | MET3.8 – GUI elements should provide information in an aggregated, understandable manner | Acceptance from at least 75% of the end-users | Ascertained by means of UATs (User Acceptance Tests)

Table 10. UC#3 Metrics for Non-Functional Requirements


3. Methods – Corpus construction

In the present section the method of “corpus construction”, whose purpose is to support the development and validation of all the FIRST usecases, is briefly described. More information can be found in D3.1, “Semantic resource and data acquisition”.

3.1. Introduction

The construction of a gold standard corpus is an important task in many scientific fields, especially in automatic information extraction and the classification of semi-structured or unstructured resources. The quality of classifiers based on machine learning techniques, and of rule-based information extraction systems, depends crucially on a well-annotated set of training instances. This corpus is called the gold standard. It contains documents annotated according to the judgment of one or more human experts. In the case of heterogeneous classifications assigned by several experts, the gold standard is formed either by majority vote or by some kind of averaging or aggregation method, as sketched below.
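For instance, a majority vote over heterogeneous expert labels could be implemented along the following lines. This is a sketch under the assumption of categorical labels; ties are left unresolved, since the text above leaves the exact aggregation method open:

```python
from collections import Counter

def majority_label(labels):
    """Return the majority label for one document, or None on a tie,
    in which case another aggregation (or adjudication) is needed."""
    (top, n), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == n:
        return None  # tie: no strict majority
    return top

# Three hypothetical experts annotate two documents with sentiment polarity.
annotations = {"doc1": ["positive", "positive", "negative"],
               "doc2": ["positive", "negative", "neutral"]}
gold = {doc: majority_label(labs) for doc, labs in annotations.items()}
print(gold)  # {'doc1': 'positive', 'doc2': None}
```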

This chapter introduces related work on gold standard corpus construction in the area of sentiment analysis; some examples from text classification and fact extraction are mentioned as well, because the problems that arise when designing the construction of a gold standard are often the same or similar across these areas.

A profound analysis of the state of the art will be given in document D4.2; here the focus lies on the required human experts and some general characteristics of corpora.

3.2. Corpus size and annotators: State of the art

The overview given in Table 11 summarises some general aspects of gold standard corpora created in other scientific works.

Authors | Documents / sentences (total / annotated by multiple raters) | Corpus source | Annotators for whole corpus | Expert status / training of annotators | Domain
Bermingham & Smeaton 2009 | 115 documents, 26,375 sentences | Blogs06 corpus from the TREC Blog Track | 15 | Partially sentiment analysts / No | Various domains
Christensen et al. 2002 | 600 documents | American College of Radiology | 4 | Professional physicians / No | Medical (CT scan reports)
Devitt & Ahmad 2007 | 30 documents | Irish media, international news | 3 | Unknown / No | News about takeover bids
Hu & Liu 2004 | 500 documents (5 products, 100 reviews per product) / – | Amazon.com, CNet.com | 3 | No experts / No | Product customer reviews
Kessler et al. 2010 | 335 documents (blogs), 13,126 sentences, 223,001 tokens / exact number unknown, probably around 30,000 tokens | Manually gathered through web search | 2 | Authors of the article / No | Blogs about the automotive domain
Kim & Hovy 2004 | 462 English adjectives, 502 verbs; 100 sentences from the DUC 2001 corpus | TOEFL test preparation vocabulary for word classification; DUC 2001 corpus for sentences | 7 | Unknown / Yes | Various domains
Klein et al. 2011 | 105 documents / 105 documents | Blogger.com | 3 | Students / No | Financial blogs about the S&P 500
Ku, Lo & Chen 2007 | 843 documents, 11,907 sentences / 843 documents | NTCIR CIRB020 and CIRB040 test collections | 3 for word classification, 2 for sentence classification | Unknown / Unknown | News
Moilanen & Pulman 2009 | 24 documents / unknown | Unknown | 5 | 3 linguistics students, one author, one volunteer / Unknown | Unknown
Mykowiecka et al. 2009 | 20 documents / 20 documents | Clinical data | 2 | Unknown / No | Mammography and diabetes
Nasukawa & Yi 2003 | 175 subject terms within context / – | Random web pages | Unknown | Unknown / Unknown | Various domains
O’Hare et al. 2009 | 6,561 documents / 164 document-topic pairs | Unknown | 7 | No experts / Yes | Financial blogs
Pang et al. 2002 | 1,400 documents / unknown | Internet Movie Database (IMDb) | Unknown | Unknown / Unknown | Movie reviews
Shaikh et al. 2007 | 200 sentences | My Yahoo! | 20 | Anonymous annotators / No | Product & movie reviews, news
Wiebe et al. 1999 | 1,004 sentiment sentences (all) | Articles of the Wall Street Journal Treebank Corpus | 4 | 2 NLP researchers, 1 computer science student, 1 layman / Yes | Financial articles
Wilson et al. 2005 | 425 documents, 8,984 sentences / 10,447 subjective expressions | Multi-Perspective Question Answering (MPQA) Opinion Corpus | Unknown | Unknown / No | Various domains
Tsur et al. 2010 | 40 sentences | Amazon | 15 | Accustomed to Amazon reviews | Product reviews

Table 11: General information about different corpus design approaches

The analysed approaches vary considerably in corpus size: while some annotate only 40 sentences, other projects use around 26,000, and the number of documents lies between 20 and roughly 6,500. Another important figure is the fraction of the corpus annotated by several annotators rather than by a single person. Furthermore, the expert status of the annotators affects the quality of the outcome, depending on the special requirements of the domain. In summary, the effort varies a lot; nevertheless, the significance of the results and of the conclusions drawn from them depends crucially on the underlying data. Thus, every ambitious sentiment analysis project should be prepared to spend the means to acquire a sufficient number of skilled annotators, or at least to train them. Moreover, the corpus should be large enough to allow for statistically significant results. Finally, to detect the subjectivity that naturally lies in the annotation of sentiments, a sample of the corpus, or the corpus as a whole, should be annotated by multiple raters, whose agreement can then be quantified as sketched below.
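Inter-annotator agreement on such multiply-annotated samples is commonly quantified with chance-corrected coefficients such as Cohen’s kappa (for two raters). A minimal sketch, not tied to any specific FIRST tooling:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the marginals."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)
    if p_e == 1:
        return 1.0  # degenerate case: both raters always use one label
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical annotators labelling six articles.
a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # 0.333
```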

3.3. Annotators and training in FIRST

For every usecase in FIRST, partner experts are employed as annotators. The following table shows further details about the three corpus creation approaches:

Usecase | Envisaged documents and sentences (total / annotated by multiple raters) | Corpus source | Annotators for whole corpus | Expert status / training of annotators | Domain
Market Surveillance | 200 document-object pairs | See Annex 2 of D1.2 | 3 | Professional / Yes | Financial news and blogs
Reputational Risk | 700 document-object pairs | See Annex 2 of D1.2 | 6 | Professional / Yes | Financial news and blogs
Retail Brokerage | 500 document-object pairs | See Annex 2 of D1.2 | 3 | Professional / Yes | Financial news and blogs

Table 12: Annotators and training

These expert annotators were additionally trained through two test annotation rounds. Furthermore, an annotation workshop was held at which the annotators were trained in detail in the use of the final annotation tool, the Knowtator1 plugin for Protégé2. The experts also held an alignment session, in which they discussed how to arrive at a more consistent way of annotating.

The number of documents differs across the usecases because of the differing availability of adequate documents. For instance, indicators for the detection of market abuse are found comparatively rarely on the web, which means that the market surveillance usecase will have a smaller gold standard corpus.

1 http://knowtator.sourceforge.net/

2 http://protege.stanford.edu/


4. Methods – Ontology implementation and update

Ontology implementation will follow the “competency questions” methodology. Draft concepts are provided as part of the specification requirements (D1.2) for UC#2 only.

Technical aspects of the methodology related to ontology implementation and update will be described in the report D3.1, “Semantic resources and data acquisition”, co-ordinated by JSI.

The FIRST ontology contains two important aspects of knowledge about financial markets: (i) real-world entities such as companies and stock indices and their interrelations, and (ii) the corresponding lexical knowledge required to identify these entities in texts. The ontology is thus fit for the purpose of information extraction rather than representing a basis for logic-based reasoning.

We distinguish between the static and the dynamic part of the ontology. The static part contains knowledge that does not change frequently (i.e., does not adapt to the stream in real time): knowledge about financial indices, instruments, companies, countries, industrial sectors, sentiment-bearing words, and financial topics. This part of the ontology will scale up in terms of coverage (i.e., how many financial indices, topics, and sentiment-bearing words the ontology covers) and in terms of aspects (i.e., which different types of information are available in the ontology, e.g., industrial sectors, sentiment vocabularies, topic taxonomies).

The dynamic part will include two aspects of knowledge1 that will be constantly updated with respect to the data stream: (i) the topic taxonomy and (ii) the sentiment vocabulary. The dynamic part of the ontology will scale up mostly in terms of the maximum throughput of the topic detection algorithm and the sentiment vocabulary extractor.
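Because the ontology is fit for information extraction, its lexical side can be viewed as a mapping from surface forms to ontology entities. The following is a deliberately naive sketch of such a lookup; the entities and lexical variants are invented for illustration and are not taken from the FIRST ontology:

```python
# Hypothetical fragment of the static part: entities plus the lexical
# variants used to spot them in text (all names invented for illustration).
LEXICON = {
    "Acme Bank": ["acme bank", "acme", "acme bk"],
    "FTSE 100":  ["ftse 100", "ftse100", "footsie"],
}
SURFACE_TO_ENTITY = {v: e for e, vs in LEXICON.items() for v in vs}

def spot_entities(text):
    """Naive gazetteer lookup: return ontology entities whose lexical
    variants occur in the text (real matching would be more robust)."""
    t = text.lower()
    return sorted({e for v, e in SURFACE_TO_ENTITY.items() if v in t})

print(spot_entities("Acme Bank shares dragged the footsie lower today."))
# ['Acme Bank', 'FTSE 100']
```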

Metrics relevant for ontology evaluation:

Dimension | Target
Coverage | Ontology “spawned” from >1000 financial indices
Aspects | Indices, stocks, companies, countries, industrial sectors, sentiment vocabulary, topic taxonomies

Table 13: Overview of the FIRST ontology

1 Note that these two aspects are also included in the static part, where they do not adapt to the stream but rather represent UC-specific knowledge and existing semantic resources (e.g., McDonald’s financial word lists <http://www.nd.edu/~mcdonald/Word_Lists.html>).


5. Methods – Evaluation of results: brief overview of the GQM methodology

Generally speaking, the GQM (“Goal, Question, Metric”) methodology [Basili et al. 1994] provides practical guidance for establishing and using meaningful metrics, well aligned with the goals of the organisational environment, in order to evaluate quality, assess progress, and support improvement initiatives of a given project.

GQM is a top-down approach to establishing a goal-driven measurement system for software development: the team starts from organisational goals, defines measurement goals, poses questions to address those goals, and identifies metrics that provide answers to the questions. Ultimately, the GQM method defines a measurement model based on three logical levels, as depicted in the figure below1:

Figure 1: GQM schema

1 See http://goldpractice.thedacs.com/practices/gqm/

Because of its proven effectiveness and lean execution, the GQM methodology will be applied by FIRST in the framework of the evaluation of UC results: all UC Owners will use it to systematically evaluate the solutions delivered in the UCs.

GQM is described in terms of a six-step process. The first three steps use goals to drive the identification of the right metrics for the 3 usecases (a process conducted as part of the development of Sec. 2 of the present document); the last three steps concern gathering the measurement data and making effective use of the measurement results to drive decision making and improvements based on the solutions delivered in the UCs.

The steps of GQM, adequately contextualised for the FIRST usecases, are the following:

1. GOALS – Develop a set of project goals of paramount importance (across the UCs and related requirements, as defined in D1.2).

2. QUESTIONS – Generate questions that assess goal achievement as completely as possible in a quantifiable way, specify the detailed steps necessary to answer those questions, and track the prototypes’ conformance to the goals, to obtain a significant and objective assessment of the goals.

3. METRICS – Specify the measures that need to be collected to answer those questions and to track prototype conformance to the goals specified above (i.e. the metrics defined in Sec. 2).

4. TOOL SETUP – Develop mechanisms for data collection (for example, a spreadsheet gathering comments from the various end-users on the same goal, if applicable).

5. DATA COLLECTION AND FEEDBACK – Collect, validate and analyze the data (in quasi-real time) to provide feedback to projects for corrective action.

6. FINAL ANALYSIS – Analyze the data in an a-posteriori fashion to assess conformance to the goals and to make recommendations for future improvements.
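To make the goal-question-metric chain concrete, a GQM model can be represented as a simple nested structure, as in the following sketch. The goal and question wording is invented for illustration; only MET2.3 and its threshold are taken from Table 5:

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    ident: str            # e.g. "MET2.3"
    description: str
    threshold: str        # acceptance threshold, kept as text here

@dataclass
class Question:
    text: str
    metrics: list[Metric] = field(default_factory=list)

@dataclass
class Goal:
    text: str
    questions: list[Question] = field(default_factory=list)

# Illustrative instantiation for UC#2 (goal/question wording invented).
goal = Goal(
    "Provide timely reputational-risk signals to the risk manager",
    [Question(
        "Is the reputation KRI kept up to date?",
        [Metric("MET2.3", "Update time of the KRI vector on counterparts",
                "hourly, tolerance +/- 15 minutes")])])

for q in goal.questions:
    for m in q.metrics:
        print(f"{goal.text} -> {q.text} -> {m.ident}: {m.threshold}")
```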


6. Plan for evaluation of results

Evaluation of the three UCs will be performed by each UC Owner, following a common GQM methodology as defined in Sec. 5.

A brief description of the plan, to be carried out by each UC Owner in its respective UC, is provided below. Overall, given the harmonised view on the UCs achieved in Sec. 2 of the present document, the plan for evaluation of results has the same structure for all 3 UCs, with specific instantiations for each of them.

The general time schedule of the GQM, to be adopted for each UC, is defined in the following table, which also assigns the responsibilities lying with the involved partners.

GQM step | When | Who | Notes
Step 1 (Goals), Step 2 (Questions) and Step 3 (Metrics) | Already defined in Sec. 2 of the present document | All UC Owners | –
Step 4 (Tool setup) | To be agreed upon by the 3 UC Owners in M13, before the first release of the UC prototypes (“early prototypes” expected by M13, October 2011) | All UC Owners to agree upon | MPS proposes a draft to be circulated early in M13
Step 5 (Data collection and feedback) | To be performed in M13 (“early prototypes” release) and M24 (“intermediate prototypes”) | All UC Owners | –
Step 6 – Part A (Data collection for final analysis) | To be performed in M36 (“final releases”) | All UC Owners | –
Step 6 – Part B (Integrated analysis of final outcome) | To be performed in M36 (“final releases”) | NEXT | Results to be incorporated in D8.3 “Empirical evaluation over large-scale test beds” and D8.4 “Qualitative evaluation”

Table 14. GQM steps planning


7. Conclusion

7.1. Refinement policy for the metrics

The metrics identified to evaluate progress of work on the three FIRST usecases are tightly connected with the requirements described in D1.2, “Usecase requirements specification”. According to Section 5, “Conclusion”, of D1.2, “all requirements are ‘open’ to allow for incorporation of possible market evolutions, regulatory evolutions, and technological evolutions that might occur in a timeframe that is compatible with the Project’s activity plan”, and such requirements “may be prioritised, pruned, altered, or extended, in the interests of the Project itself and always following the contractual provisions of FIRST”. The metrics therefore also have to be considered flexible; nevertheless, all future refinements or modifications to the present document will need to be discussed and agreed upon at Technical Committee level and validated at General Assembly level.

7.2. Next steps

The present document will serve as input and reference point (notably the GQM methodology described in Sec. 5 above) for the development of the WP8 activities, “End-user prototypes and evaluation”, as well as for the other Project WPs, as per the Project’s Gantt chart.


References

Basili, V. R., Caldiera, G., & Rombach, H. D. (1994). The Goal Question Metric Approach. http://wwwagse.informatik.uni-kl.de/pubs/repository/basili94b/encyclo.gqm.pdf

Basili, V. R., Caldiera, G., & Rombach, H. D. (1994). The Goal Question Metric Paradigm. In J. J. Marciniak (Ed.), Encyclopedia of Software Engineering, Volume 1 (pp. 578-583). John Wiley & Sons.

Bermingham, A., & Smeaton, A. F. (2009). A study of inter-annotator agreement for opinion retrieval. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’09) (p. 784). New York, NY, USA: ACM Press. doi: 10.1145/1571941.1572127.

Christensen, L. M., Haug, P. J., & Fiszman, M. (2002). MPLUS: A probabilistic medical language understanding system. Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain (pp. 29-36).

Devitt, A., & Ahmad, K. (2007). Sentiment polarity identification in financial news: A cohesion-based approach. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 984-991).

Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’04) (p. 168). New York, NY, USA: ACM Press. doi: 10.1145/1014052.1014073.

Kessler, J. S., Eckert, M., Clark, L., & Nicolov, N. (2010). The ICWSM 2010 JDPA sentiment corpus for the automotive domain. Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, Data Challenge Workshop.

Kim, S.-M., & Hovy, E. (2004). Determining the sentiment of opinions. Proceedings of the 20th International Conference on Computational Linguistics (COLING ’04) (p. 1367). Morristown, NJ, USA: Association for Computational Linguistics. doi: 10.3115/1220355.1220555.

Klein, A., Altuntas, O., Häusser, T., & Kessler, W. (2011). Extracting investor sentiment from weblog texts: A knowledge-based approach. To be published in Proceedings of the 13th Conference on Commerce and Enterprise Computing.

Ku, L.-W., Lo, Y.-S., & Chen, H.-H. (2007). Test collection selection and gold standard generation for a multiply-annotated opinion corpus. Proceedings of the 45th Annual Meeting of the ACL, Companion Volume (pp. 89-92).

Moilanen, K., & Pulman, S. (2009). Multi-entity sentiment scoring. Proceedings of RANLP 2009.

Mykowiecka, A., Marciniak, M., & Kupść, A. (2009). Rule-based information extraction from patients’ clinical data. Journal of Biomedical Informatics, 42(5), 923-936. doi: 10.1016/j.jbi.2009.07.007.

Nasukawa, T., & Yi, J. (2003). Sentiment analysis: Capturing favorability using natural language processing. Proceedings of the 2nd International Conference on Knowledge Capture (pp. 70-77).

O’Hare, N., Davy, M., Bermingham, A., Ferguson, P., Sheridan, P., Gurrin, C., et al. (2009). Topic-dependent sentiment analysis of financial blogs. Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion.

Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Shaikh, M. A. M., Prendinger, H., & Ishizuka, M. (2007). An analytical approach to assess sentiment of text. Proceedings of the 10th International Conference on Computer and Information Technology (pp. 1-6). IEEE. doi: 10.1109/ICCITECHN.2007.4579359.

Tsur, O., & Rappoport, A. (2010). ICWSM – A great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews. Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM) (pp. 162-169).

Wiebe, J., Bruce, R., & O’Hara, T. (1999). Development and use of a gold standard data set for subjectivity classifications. Proceedings of the 37th Annual Meeting of the ACL.

Wilson, T., Wiebe, J., & Hoffmann, P. (2005). Recognizing contextual polarity in phrase-level sentiment analysis. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT ’05) (pp. 347-354). Morristown, NJ, USA: Association for Computational Linguistics. doi: 10.3115/1220575.1220619.
