
PRE-CONFERENCE PROCEEDINGS OF

IJCAI Workshop on Semantic Machine Learning 2017

Chairs

Rajaraman Kanagasabai, Institute for Infocomm Research, Singapore

Ahsan Morshed, Swinburne University, Melbourne, Australia

Hemant Purohit, George Mason University, USA

Welcome to #SML17

Learning is an important attribute of an AI system that enables it to adapt to new circumstances and to detect and extrapolate patterns. Machine Learning (ML) has seen tremendous growth during the last few years, due in part to successful commercial deployments. The interest has also been fueled by the recent research breakthroughs brought about by deep learning. ML is, however, not the silver bullet it is sometimes made out to be, and currently has several limitations in complex real-life situations. Some of these limitations include: i) many ML algorithms require large amounts of training data that are often too expensive to obtain in real life, ii) significant effort is often required for feature engineering to achieve high performance, iii) many ML methods are limited in their ability to exploit background knowledge, and iv) there is no seamless way to integrate and use heterogeneous data.

Approaches that formalize data, functional, and domain semantics can tremendously aid in addressing some of these limitations. These so-called semantic approaches have been increasingly investigated by various research communities and applied at different layers of ML, e.g. modeling representational semantics in vector space using deep learning architectures, and modeling domain semantics using ontologies.

The fourth IJCAI workshop on Semantic Machine Learning seeks to bring together researchers and practitioners from all these communities working on different aspects of semantic ML, to share their experiences, exchange new ideas, and identify key emerging topics and future directions. The workshop programme includes i) an invited keynote from Dr. Amy Shi-Nash, Commonwealth Bank, Australia, ii) 4 paper sessions with oral presentations from international research groups, and iii) an invited panel on the Value Add in Incorporating Structured, Semantic Knowledge Bases into Machine Learning Approaches, with renowned research leaders from academia and industry as panelists.

We wish to express our deep appreciation to the programme committee members and the additional reviewers who shared their valuable time and expertise in support of the SML17 review process. Special thanks to our advisory committee members Prof. Amit Sheth, Prof. Fausto Giunchiglia, and Prof. Timos Sellis for their constant encouragement and guidance in the organization. We also wish to express our gratitude to our supporting organizations: the Institute for Infocomm Research (A*STAR), Swinburne University and George Mason University.

We hope you enjoy your stay in Melbourne, and have a fruitful time at the workshop.

Rajaraman Kanagasabai, Ahsan Morshed, Hemant Purohit

Chairs, #SML17

SML17 Organisation

Chairs

Rajaraman Kanagasabai, Institute for Infocomm Research, Singapore

Ahsan Morshed, Swinburne University, Melbourne, Australia

Hemant Purohit, George Mason University, USA

Advisory Committee

Prof. Fausto Giunchiglia, University of Trento, Trento, Italy

Prof. Amit Sheth, Kno.e.sis Center, Wright State University, Dayton, USA

Prof. Timos Sellis, Swinburne University of Technology, Australia

Programme Committee

Kim Jung Jae, Institute for Infocomm Research, Singapore

Prem Jayaraman, Swinburne University of Technology, Australia

Kewen Liao, Swinburne University of Technology, Australia

Heiko Mueller, New York University, USA

Oshani Seneviratne, Oracle, USA

Md. Sumon Shahriar, Department of Health, Australia

Saeedeh Shekarpour, Kno.e.sis, Wright State University, USA

Sanjaya Wijeratne, Kno.e.sis, Wright State University, USA

Programme

Date: 20th August, 2017, Sunday

Time: 8.30 – 18.00

Venue: RMIT University Building 80 (also known as SAB or Swanston Academic Building)

Address: 445 Swanston Street, Melbourne, Victoria, 3000

8.30 – 10.00

Paper Session (3 papers – each 25 min + 5 min Q&A)

1. Evan Dennison Livelo, Andrea Nicole Ver, Jedrick Chua, John Paul Yao and Charibeth Cheng. A Hybrid Agent for Automatically Determining and Extracting the 5Ws of Filipino News Articles

2. Heng Chen, Yongjuan Zhang, Chunhong Lin, Liwen Zhang and Tao Chen. Construction of Viral Hepatitis Bilingual Bibliographic Database, Mining of Viral Hepatitis Related Protein Text and Integrating with Uniprot Protein Database

3. Yang Gao, Linjing Wei, Heyan Huang and Qian Liu. Topical

Sentence Embedding for Query Focused Document Summarization

10.00 – 10:30 COFFEE BREAK

10:30 – 12:30

Paper Session (3 papers – each 25 min + 5 min Q&A)

4. Luis Palacios, Yue Ma, Gaëlle Lortal, Claire Laudy and Chantal Reynaud. Data Driven Concept Refinement to Support Avionics Maintenance

5. Andreea Salinca. Convolutional Neural Networks for Sentiment Classification on Business Reviews

6. Ritesh Ratti, Himanshu Kapoor, Shikhar Sharma and Anshul Solanki. Semantic extraction of Named Entities from Bank Wire text

12.30-14.00

LUNCH BREAK

14.00 – 14.30

Paper Session (1 paper – 25 min + 5 min Q&A)

7. Abdullah Alharbi, Yuefeng Li and Yue Xu. Enhancing Topical Word Semantic for Relevance Feature Selection

14.30-15.30

Keynote

Speaker: Amy Shi-Nash, PhD - Head of Data Science, Commonwealth Bank, Australia

Title: How can Machine Learning/AI help Banks and Customers

15.30-16.10

Panel Discussion "Value Add in Incorporating Structured, Semantic Knowledge Bases into Machine Learning / Deep Learning Approaches"

Amy Shi-Nash, Commonwealth Bank, Australia
Prof. Dimitrios Georgakopoulos, Swinburne University of Technology, Australia
A/Prof. Xiuzhen (Jenny) Zhang, RMIT University, Australia
Dr. Truyen Tran, Lecturer, Deakin University, Australia
Dr. Yuan-Fang Li, Senior Lecturer, Monash University, Australia
Prof. Arkady Zaslavsky, CSIRO

16.10-16.30 COFFEE BREAK

16.40-17.40

Paper Session (2 papers – each 25 min + 5 min Q&A)

8. Yang Shao. Several simple neural networks for evaluating semantic textual similarity

9. Fenglong Ma, Radha Chitta, Saurabh Kataria, Jing Zhou, Palghat Ramesh, Tong Sun and Jing Gao. Long-Term Memory Networks for Question Answering

18.00 End of the day

Keynote

Title: How can Machine Learning/AI help Banks and Customers

Speaker:

Dr. Amy Shi-Nash

Commonwealth Bank

Australia

Biography:

Amy is an executive leader with a proven track record of creating value and competitive advantage through a data-driven culture and innovation. As the Head of Data Science at Commonwealth Bank, she is responsible for driving strategic data science capability, enabling business transformation, and delivering a differentiated customer experience. Prior to CBA, Amy was the founding member and Chief Data Science Officer of DataSpark, Singtel's data analytics spin-off, where she was responsible for driving data-led innovation and creating new revenue streams by combining telco data with advanced analytics and big data technology. Amy is a Science Board Member of i-Com and has been an Industry Track Program Committee member of ACM KDD since 2013. She is a frequent public speaker, and a co-inventor and co-author of multiple patents and publications. Amy holds a Ph.D. in data mining, a Master's in AI, and an MBA.

A Hybrid Agent for Automatically Determining and Extracting the 5Ws of Filipino News Articles

Evan Dennison S. Livelo [email protected]
Andrea Nicole O. Ver [email protected]
Jedrick L. Chua [email protected]
John Paul S. Yao [email protected]
Charibeth K. Cheng [email protected]

De La Salle University - Manila

Abstract

As the number of sources of unstructured data continues to grow exponentially, manually reading through all this data becomes notoriously time consuming. Thus, there is a need for faster understanding and processing of this data. This can be achieved by automating the task through the use of information extraction. In this paper, we present an agent that automatically detects and extracts the 5Ws, namely the who, when, where, what, and why, from Filipino news articles using a hybrid of machine learning and linguistic rules. The agent caters specifically to the Filipino language by working with its unique features such as ambiguous prepositions and markers, focus instead of subject and predicate, dialect influences, and others. In order to be able to maximize machine learning algorithms, techniques such as linguistic tagging and weighted decision trees are used to pre-process and filter the data as well as refine the final results. The results show that the agent achieved an accuracy of 63.33% for who, 71.38% for when, 58.25% for where, 89.20% for what, and 50.00% for why.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Proceedings of the 4th International Workshop on Semantic Machine Learning (SML 2017), 20th August 2017, Melbourne, AUS.

1 Introduction

Information can be found in various types of media and documents such as news [Cheng et al., 2016] and legal documents [De Araujo et al., 2013]. These documents provide different types of data beneficial to people ranging from field-specific professionals to the everyday newspaper readers. Thus, from the seemingly endless sea of unstructured data, it is important to be able to determine the appropriate information needed quickly and efficiently.

The process of automatically identifying and retrieving information from unstructured sources and structuring the information in a usable format is called Information Extraction. This task involves the use of natural language processing in the analysis of unstructured sources to identify relevant data such as named entities and word phrases through operations including tokenization, sentence segmentation, named-entity recognition (NER), part-of-speech (POS) tagging, and word scoring. This system is applied to various fields such as legal documents [De Araujo et al., 2013], work e-mails [Xubu and Guo, 2014], and news articles [Cheng et al., 2016].

Our information extraction agent automatically extracts the who, when, where, what, and why of Filipino news articles. Who pertains to people, groups, or organizations involved in the main event of the news article. When refers to the date and time that the main event of the news article occurred. Where refers to the location where the main event took place. There can be one or more who, when, and where features in an article. On the other hand, what is the main event that took place while why is the reason the main event happened. There can only be one what and why for each article. Moreover, it is possible that there are no who, when, where, what or why features in an article if one does not exist. Figure 1 shows a sample article translated in English with the corresponding 5Ws.

Figure 1: Sample Article Translated to English

However, the grammar of English and Filipino are not the same. Some of the nuances encountered in the latter are the differences in focus-subject order (i.e. verb first before performer) as well as the presence of ambiguous prepositions (i.e. "sa" can be applied to either a location or a date). Moreover, due to this, automatic translation of large data from Filipino to English is not feasible. Thus, our agent was designed to recognize and handle these linguistic features through a combination of machine-learned models and rule-based algorithms.

The results of this research can greatly benefit individuals and organizations reliant on Filipino newspapers such that they will be able to determine and aggregate essential information based on main events (as compared to mere presence) quickly and efficiently. Moreover, the research contributes an advancement in the field of natural language processing and semantic machine learning for the Filipino language.

2 Related Works

Information extraction has been performed in several previous studies dealing with a variety of languages and retrieving different kinds of information.

In a study by [De Araujo et al., 2013], 200 legal documents written in Portuguese concerning cases that transpired in the RS State Superior Court were analyzed in order to determine the events that occurred. The events examined in these documents included formal charges, acquittal, conviction, and questioning. In addition, the study discussed how they put the legal documents through a deep linguistic parser and then represented the tokens in a web ontology language or OWL using a linguistic data model. Moreover, they described how, after running documents through a deep linguistic parser and converting to OWL format, they formulated linguistic rules using morphological, syntactical, and part-of-speech (POS) information and integrated these with domain knowledge in order to produce a generally accurate information extraction system. Likewise, the study of [Xubu and Guo, 2014] described how they extracted information from descriptive text involving enterprise strategies such as e-mail, personal communication, and management documents through manual information extraction rule definitions in order to determine the efficiency of strategic execution.

Our agent also utilizes various rules and grammatical information such as POS and text markers for linguistic tagging. Similarly, [Das et al., 2010] also adopted rule-based information extraction in order to improve the overall accuracy of their information extraction system. However, unlike [De Araujo et al., 2013] and [Xubu and Guo, 2014], they also used machine learning. They applied machine learning to their information extraction system through the use of a gold standard created by the matching answers of two annotators.

In 2012, [Dieb et al., 2012] discussed how they used part-of-speech (POS) tagging as well as regular expressions to parse texts and determine orthogonal features within the considered nanodevice research documents. In addition, they discussed how, after tokenizing and parsing the research papers, they made use of YamCha, a text chunk annotator, for machine learning in order to automatically determine the category or tag (e.g. Source Material, Experiment Parameter) of each piece of parsed data within an annotation. Our agent also learns by example through several machine-learned classification algorithms derived from annotated Filipino news articles.

Furthermore, in the field of Filipino news, the research of [Cheng et al., 2016] extracted the 5Ws from Filipino editorials through a rule-based system that determines the possible candidates for each W and uses weights to choose among the list of candidates. They reported a performance of 6.06% accuracy for who, 84.39% for when, 19.51% for where, 0.00% for what, and 50.00% for why. However, the test corpus is composed of mostly true negatives and thus there are only few examples as a basis for implementation. Moreover, the candidates were subjected to minimal processing and filtering. Therefore, problems such as difficulty identifying correct candidates and low precision are present.

3 Information Extraction Agent Implementation

Figure 2 shows the architecture of the hybrid information extraction agent. A hybrid approach was implemented by means of utilizing a combination of machine learning techniques and rule-based algorithms.

Figure 2: Hybrid Information Extraction Agent Architecture

A file containing a corpus of Filipino news articles acts as the agent's environment. The agent scans through the environment and gets all the Filipino news articles. Each article is then parsed and stored internally as a word table, which contains tokens with the corresponding position, POS and NER tags, and word score. The word table is passed to the candidate selection and feature extraction module to get the final who, when, where, what, and why for each article. The results are passed to the actuator that writes the corresponding annotations to the environment, which in turn saves the file and generates an inverted index file (Figure 3).

Figure 3: Sample Inverted Index File
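To make the data flow concrete, the following is a minimal Python sketch (not the authors' code) of the agent loop just described; the helper callables and the shape of the inverted index are assumptions, since the original figures are not reproduced here.

```python
def run_agent(articles, tag, select_candidates, extract_features, write_annotations):
    """Top-level loop: the callables stand in for the linguistic tagging,
    candidate selection, feature extraction, and actuator modules above."""
    inverted_index = {}
    for doc_id, article in enumerate(articles):
        word_table = tag(article)                   # positions, POS/NER tags, word scores
        candidates = select_candidates(word_table)  # rule-based candidate selection
        five_ws = extract_features(candidates)      # ML models + rules -> final 5Ws
        write_annotations(article, five_ws)         # actuator writes annotations back
        for w_type, values in five_ws.items():
            for value in (values if isinstance(values, list) else [values]):
                inverted_index.setdefault((w_type, value), []).append(doc_id)
    return inverted_index
```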

3.1 Linguistic Tagging

Linguistic tagging is first applied to each news article and the parsed data is stored in a word table. The body of the article is initially segmented into its composite sentences, which are then individually tokenized. Each token is processed in order to determine the following information:

1. Part-of-Speech tag; e.g. proper noun (NNP), preposition (IN), determiner (DT)

2. Named-Entity tag, which includes person (PER), location (LOC), date (DATE), and organization (ORG)

3. Word score or frequency count

In order to assign each token its corresponding part-of-speech tag, a tagger was implemented using a model trained on news-relevant datasets from TPOST, a Tagalog Part-of-Speech Tagger [Rabo, 2004].

For named-entity recognition, each token is evaluated and assigned (if applicable) as a PER, LOC, DATE, or ORG. This process utilizes a Stanford NER model trained on 200 Filipino news articles.

Lastly, under linguistic tagging, word scoring is performed. Word scoring utilizes term frequency and counts how many times a token or word is encountered in an article.
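As an illustration of the word table described in this subsection, here is a minimal sketch; the pos_tag and ner_tag callables are hypothetical stand-ins for the TPOST-based tagger and the Stanford NER model, not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class WordEntry:
    token: str
    sentence_no: int   # sentence the token came from
    position: int      # position of the token within the article
    pos_tag: str       # e.g. NNP, IN, DT
    ner_tag: str       # PER, LOC, DATE, ORG, or "O"
    score: int         # term frequency of the token in the article

def build_word_table(sentences, pos_tag, ner_tag):
    """sentences: list of tokenized sentences; pos_tag / ner_tag: callables
    standing in for the TPOST-based tagger and the Stanford NER model."""
    tokens = [tok for sent in sentences for tok in sent]
    freq = Counter(tok.lower() for tok in tokens)      # word scoring by term frequency
    table, position = [], 0
    for s_no, sent in enumerate(sentences):
        pos_tags, ner_tags = pos_tag(sent), ner_tag(sent)
        for tok, p, n in zip(sent, pos_tags, ner_tags):
            table.append(WordEntry(tok, s_no, position, p, n, freq[tok.lower()]))
            position += 1
    return table
```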

3.2 Candidate Selection

Even though the articles have the named-entity tags assigned to particular words, these are not enough indicators of candidates. This is because named-entity tags do not consider grammatical information and, consequently, common nouns. Moreover, what and why candidates are sentence fragments that are composed of a variety of word tokens with different part-of-speech and named-entity tags, further indicating the need for the agent to perform candidate selection.

We use a rule-based approach to select possible candidates for the final who, when, where, what and why of each article.

A word or phrase is a who, when and where candidate when:

1. It is a noun or noun phrase

2. The word or phrase acts as a subject within the article

3. For proper nouns, it has a PER or ORG named-entity tag for who, DATE or TIME named-entity tag for when and LOC named-entity tag for where.

4. For common nouns, it is encapsulated by neighbouring markers including Filipino determiners, conjunctions, adverbs, and punctuation.

On the other hand, for the what, the agent simply chooses the first two sentences of the article's body as candidates. Lastly, for the why, the agent runs through the first six sentences of the article's body. Sentences where why feature markers are found are considered as the why candidates of the article.
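The selection rules above could be sketched roughly as follows; the marker list and the input representation are illustrative assumptions (the paper does not enumerate the actual Filipino markers), not the authors' code.

```python
WHO_TAGS, WHEN_TAGS, WHERE_TAGS = {"PER", "ORG"}, {"DATE", "TIME"}, {"LOC"}
# Placeholder list; the agent relies on Filipino cause/effect markers that
# the paper does not enumerate.
WHY_MARKERS = {"dahil", "sapagkat"}

def ww_candidates(noun_phrases, tags):
    """noun_phrases: (phrase, ner_tag, is_subject, has_markers) tuples for the
    noun phrases of an article; has_markers covers the common-noun rule."""
    return [phrase for phrase, ner, is_subject, has_markers in noun_phrases
            if is_subject and (ner in tags or has_markers)]

def what_candidates(sentences):
    return sentences[:2]            # the what is sought in the first two sentences

def why_candidates(sentences):
    return [s for s in sentences[:6]
            if any(m in s.lower().split() for m in WHY_MARKERS)]
```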

3.3 Feature Extraction

Feature extraction is then performed to narrow down the candidate pool of the who, when, where, what and why in order to get the final results. A machine-learned model was trained and used for the who, when, where and why while a rule-based algorithm was developed for the what. Among the machine-learning algorithms tested were J48, Naive Bayes, and Support Vector Machines. Variations were also tested using boosting, bagging, and stacking. Moreover, several iterations involving feature engineering and parameter fine tuning were done to get the optimal results for each algorithm based on true positive and accuracy rates, among others.

Each of the who, when, where and why candidates passes through a machine-learned model which determines whether or not it is a final result. The models were generated using a gold standard composed of annotated Filipino news articles. Before being fed to the machine learning algorithm, however, the gold standard articles are pre-processed and filtered into candidates as discussed previously in order to better represent the data in a way such that the model can establish patterns better.

In order to do this, the gold standard articles were put through the same candidate selection module discussed previously and corresponding linguistic features were assigned to each candidate. The list of features that were tested includes the following:

1. The candidate string

2. The number of words in the candidate

3. The sentence which the candidate belongs to

4. The numeric position of the candidate in the article

5. The distance of the candidate from the beginning of the sentence it belongs to

6. The frequency count of the candidate

7. 10 neighbouring word strings before and after the candidate

8. The part-of-speech tags of the aforementioned neighbouring words

In order to determine the class attribute (whether or not it is a final W), the candidate was matched against the annotations found in the gold standard to see if it matches. If it does, the class attribute is set to yes. Otherwise, it is set to no. These candidates, and their corresponding features, were used to train several models using different algorithms for testing. The features to be considered varied among the Ws, since not all of the listed features were proven useful in choosing the who, when and where results.

Furthermore, the algorithm that showed the best true positive and accuracy rate is J48 with boosting for who and J48 with bagging for when and where.

The model evaluates each candidate by assigning it an acceptance probability as well as a rejection probability. If a candidate's acceptance probability is higher than its rejection probability, it is added to the final who, when and where results.
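A rough analogue of this step, assuming scikit-learn (1.2 or later) in place of Weka's J48 (C4.5), might look like the sketch below; it is not the authors' code, only an illustration of the accept-if-acceptance-exceeds-rejection rule with boosted or bagged decision trees.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def train_w_model(X_train, y_train, variant="boosting"):
    """X_train: candidate feature vectors; y_train: 1 if the candidate matches
    the gold annotation, else 0.  A decision tree with boosting (who) or
    bagging (when/where) stands in for Weka's J48 here."""
    base = DecisionTreeClassifier()
    model = (AdaBoostClassifier(estimator=base) if variant == "boosting"
             else BaggingClassifier(estimator=base))
    return model.fit(X_train, y_train)

def select_final(model, candidates, X):
    """Keep candidates whose acceptance probability exceeds the rejection one."""
    probs = model.predict_proba(X)          # columns follow model.classes_: [0, 1]
    return [c for c, (p_reject, p_accept) in zip(candidates, probs)
            if p_accept > p_reject]
```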

Table 1: Final feature sets for who, when and where

Feature                                        Who    When   Where
Candidate String                                X      X      X
No. of Words                                    X      X      X
Sentence No.                                    X      X      X
Position                                        X
Proximity                                       X
Word Score                                      X      X      X
No. of Neighboring Words and their POS Tags    10      3     10

On the other hand, for what, a weighting scheme was implemented in order to determine the best candidate. This was done since we found that determining the what is more straightforward than the other Ws. Thus, feature engineering and fine tuning a machine-learned model for this W is unnecessary and may even cause unnecessary complexities.

The implementation firstly determines the presence of the extracted who, when, and where and adds 0.3, 0.3, and 0.2 respectively to a candidate's score. The weights were chosen after several experimental iterations starting with neutral arbitrary weights of 0.5 for all. The when and where are extracted in a similar way to the who except for a few differences in parameters, values, and implementation. Secondly, the sentence number is considered. The formula for computing the additional weight based on sentence number is given below.

weight = 1 − (0.2 ∗ sentenceNumber) (1)

If the extracted who, when, and where found in the candidate are also present in the title, an additional 0.1 is added to the candidate score.

The candidates are then trimmed based on the presence of a list of markers composed of Filipino adverbs and conjunctions that denote cause and effect. If one of the markers is found within the candidate, the candidate is trimmed. If the marker found is a beginning marker, all words before the marker, including the marker itself, are removed. On the other hand, if the marker is an ending marker, all words after the marker, including the marker itself, are removed.

The candidate with the highest weight is then chosen as the final what result for that article.
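The weighting scheme can be summarized in the following sketch, which uses the stated weights (0.3/0.3/0.2, Equation 1, and the 0.1 title bonus); the marker lists and the 0-based sentence indexing are assumptions, not the authors' implementation.

```python
def score_what_candidate(sentence, sentence_number, title, who, when, where):
    """Weighting scheme for what candidates; sentence_number assumed 0-based."""
    text, score = sentence.lower(), 0.0
    if any(w.lower() in text for w in who):
        score += 0.3
    if any(w.lower() in text for w in when):
        score += 0.3
    if any(w.lower() in text for w in where):
        score += 0.2
    score += 1 - 0.2 * sentence_number                 # Equation (1)
    in_candidate = [w for w in list(who) + list(when) + list(where)
                    if w.lower() in text]
    if any(w.lower() in title.lower() for w in in_candidate):
        score += 0.1                                   # extracted feature also in title
    return score

def trim(tokens, begin_markers, end_markers):
    """Drop everything up to and including a beginning marker, and everything
    from an ending marker onwards; the marker lists are placeholders."""
    toks = list(tokens)
    for i, t in enumerate(toks):
        if t.lower() in begin_markers:
            toks = toks[i + 1:]
            break
    for i, t in enumerate(toks):
        if t.lower() in end_markers:
            toks = toks[:i]
            break
    return toks

def pick_what(candidates, title, who, when, where, begin_markers, end_markers):
    """candidates: the first two sentences of the body, as token lists."""
    best, best_score = None, float("-inf")
    for i, cand in enumerate(candidates):
        s = score_what_candidate(" ".join(cand), i, title, who, when, where)
        if s > best_score:
            best, best_score = trim(cand, begin_markers, end_markers), s
    return best
```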

Lastly, for why, the candidates first undergo trimming and weighting. This is done since the machine-learned models are limited to the data that is fed to them. Thus, they require an associated rule-based algorithm to pre-process the data before it is used for training or classification.

Words that come before starting markers and after ending markers are removed from the candidate. The presence of the extracted what and of the markers was also given additional weight. The final feature set used in feature extraction of the why included the following:

1. The candidate string

2. The number of words in the candidate

3. The sentence which the candidate belongs to

4. The candidate’s weighted score

5. The number of who features in the candidate

6. The number of when features in the candidate

7. The number of where features in the candidate

8. 10 neighbouring word strings before and after the candidate

9. The part-of-speech tags of the aforementioned neighbouring words

Furthermore, the algorithm that showed the best true positive and accuracy rate is J48.
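For illustration, a sketch of assembling the why feature vector listed above might look as follows; the field names and the neighbour-window handling are assumptions, not the authors' code.

```python
def why_feature_vector(candidate, sentence_no, weighted_score, who, when, where,
                       words_before, words_after, pos_before, pos_after):
    """Assemble the why feature set listed above; candidate is a token list
    and who/when/where are the features already extracted for the article."""
    text = " ".join(candidate).lower()
    count_in = lambda feats: sum(1 for f in feats if f.lower() in text)
    return {
        "candidate_string": " ".join(candidate),
        "num_words": len(candidate),
        "sentence_no": sentence_no,
        "weighted_score": weighted_score,          # from the marker/what weighting step
        "num_who": count_in(who),
        "num_when": count_in(when),
        "num_where": count_in(where),
        "neighbour_words": words_before[-10:] + words_after[:10],
        "neighbour_pos": pos_before[-10:] + pos_after[:10],
    }
```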

4 Results and Observations

4.1 Gold Standard

In order to train and evaluate the agent, a gold standard was created. This gold standard is composed of 250 Filipino news articles retrieved from the study of [Regalado et al., 2013]. Each article was manually annotated with the 5Ws by four annotators. For each disagreement where only two or fewer annotators agree, the four annotators deliberated the best annotation. In the case that the decision is split, the annotation is discarded and left blank, denoting ambiguity. The resulting annotated corpus was then qualitatively evaluated by a literary expert.

Table 2: Inter-annotator agreement for the who, when, where, what and why

Feature   Value
Who       59.35%
When      61.25%
Where     71.00%
What      74.40%
Why       70.40%

Based also on inter-annotator agreement, the who and when proved to be more ambiguous than the rest. Since, based on the observations of the annotations, the what can be found in the first two sentences, the annotators found it easier to choose the annotation for this and thus there was more agreement. On the other hand, because there are many possible who and when in an article, the annotators may have had a harder time choosing all the relevant who and when in an article, thus leading to more disagreement. There is also a possibility of finding more than one possible where in an article, but based on the results it was easier for the annotators to identify the where in a given article.

4.2 Evaluation

After implementing the agent, the agent's results were compared against the gold standard comprising 250 articles. For the true positive value, complete matches, under-extracted, and over-extracted annotations were included. The results can be seen in Table 3.

Table 3: Statistics for the who, when, where, what and why

Metric   Who       When      Where     What     Why
CM       63.46%    67.53%    53.82%    40.4%    39.2%
UE       2.41%     4.43%     4.86%     12%      9.6%
OE       0.92%     0.74%     1.39%     36.8%    1.2%
CMM      33.17%    27.31%    39.93%    10.8%    50%
TPCM     59.23%    35.51%    11.11%    40.4%    10.8%
TPPM     3.19%     5.07%     6.06%     48.8%    10.8%
FP       4.78%     5.80%     21.89%    10.8%    10%
TN       0.91%     30.80%    41.08%    0%       28.4%
FN       31.89%    22.83%    19.87%    0%       40%
P        92.88%    87.50%    43.97%    89.2%    68.35%
R        66.18%    64.00%    46.36%    100%     35.06%
A        63.33%    71.38%    58.25%    89.2%    50%
F        77.29%    73.93%    45.13%    94.29%   46.35%

CM - Complete Match Rate; UE - Under-extracted Rate; OE - Over-extracted Rate; CMM - Complete Mismatch Rate; TPCM - True Positive for Complete Match; TPPM - True Positive for Partial Match; FP - False Positive; TN - True Negative; FN - False Negative; P - Precision; R - Recall; A - Accuracy; F - F-Score

Based on the statistics shown, the when was able to obtain the highest complete match rate, while the why has the lowest. This was possibly because the when had only a limited number of frequent candidates that could be seen across the news articles (i.e. seven days in a week, twelve months, holidays, relative days), making it easier to identify the candidates.

For the who and where, both had slightly lower complete match rates compared to that of the when. The candidates produced seemed to be greater in number because of the many different possible who and where across articles. The reason is that people and places of significance can change over time unlike the more constant when candidates. Thus, the candidate selection and feature extraction had a more difficult time in identifying the correct who and where candidates for the article.

On the other hand, the what has less than half complete matches. However, the combined number of complete matches and partial matches still greatly outnumbers the complete mismatches. This is because during the implementation of the agent, it was observed that most of the what can be found in the first two sentences of the article, with 94.00% of the instances in the first sentence and 4.40% in the second. Thus, the primary problem for the what is the trimming of candidates in order to completely match what is needed (and annotated) based on the gold standard. In part, this is because the linguistic structure of Filipino makes it so that sometimes adjectives and other descriptors become so lengthy that some important details may be considered insignificant by the agent and are thus trimmed off. On the other hand, some phrases are not trimmed because of the presence of details that may be unnecessary but are considered linguistically significant by the agent, possibly because of misleading markers.

Moreover, the reason why the recall of the what is 100% is because the agent always extracts a what feature for each article. Since partial matches are also considered as true positives, all the gold standard annotations for what were considered extracted.

Lastly, for the why, it could be observed that it obtained a high number of false negatives, which shows that the agent fails to detect the why even if one is present in the article. The agent also has difficulty in identifying the correct why from the candidates. This could probably be caused by the lack of relations between the why and what candidates. The linguistic structure of some articles proves to be difficult because of the interchangeability of the potential what and why. Thus, the agent could get confused when a supposed what is actually a why which came ahead of a what candidate. Moreover, text markers denoting reason could be misleading the agent into deciding that the phrase that follows the aforementioned text markers is the why, which matches the extracted what when in reality they are only related by proximity.

Furthermore, the who performed well using a machine learning approach for its feature extraction. An experiment supporting this was performed. The experiment involved comparing the final who results of two different evaluation runs wherein the first run utilized the machine-learned model while the second only relied on the candidate selection module. The results of the experiment show that the accuracy was 63.33% for the first run while it was 38.27% for the second run.

We did the same experiment for the when and where. For the when, the agent was able to achieve an accuracy of 63.35% on the first run compared to 16.17% on the second run. For the where, the first run with machine learning achieved an accuracy rate of 58.25% in comparison to the second run with an accuracy of 13.33%.

For the why, experiment results show that the accuracy of the why feature when run with machine learning algorithms went up to 50%, compared to the 47.60% accuracy it got with rule-based feature extraction.

Table 4: Comparison between our hybrid approach and a rule-based approach using the data of the latter

Evaluation Metric   Complete Match   Under-Extracted
Hybrid Who          43.84%           2.46%
RB Who              6.06%            8.08%
Hybrid When         59.1743%         7.7981%
RB When             84.39%           0%
Hybrid Where        56.4593%         1.4354%
RB Where            19.51%           1.22%
Hybrid What         28.0%            31.5%
RB What             0.00%            5.88%
Hybrid Why          11%              7.5%
RB Why              50%              3.13%

Table 4 shows a comparison between the performance of our hybrid extraction agent and an existing rule-based extraction system [Cheng et al., 2016], using the same test data. Based on the results above, our agent proved to be better than the previous system for the who, where and what.

For the who and where, in terms of candidate selection, the rule-based system only uses markers. On the other hand, our agent uses NER and POS tagging in addition to markers. Furthermore, for feature extraction, our agent uses a machine-learned model, as compared to a weighting system, to better filter out candidates.

For the what, instead of immediately constricting candidates in the candidate selection stage using markers (as done in the rule-based system), our agent retrieves entire sentences and trims the markers out during the feature extraction stage. Moreover, our agent utilizes other extracted features including the who, when, where and title presence as additional weights to better determine the final what.

For the when and the why, the results show that the existing rule-based feature extraction performed better than the machine learning. However, if the data used to train the when was increased, it is possible to improve the results of the machine learning feature extraction.

5 Conclusion and Future Work

This paper presents a hybrid information extraction agent for automatically determining the 5Ws of Filipino news articles.

In conclusion, performing machine learning on the who, when, where, and why was beneficial since the agent allows the models to choose which candidates are correct. The performance is also further supported by the associated pre-processing, filtering, and refining rule-based algorithms. Thus, if the model is iterated upon, the results may improve. On the other hand, using a purely rule-based selection for what is beneficial since, based on the structure of most Filipino news articles, the what can be found in the first two sentences and there are common markers that can easily denote the feature.

The framework used in this study can be applied to extracting other information and features from news articles such as perpetrator-victim pairs, crime-ridden areas, and businesses or companies involved in a main event, among others. However, the agent's models and algorithms would need to be modified for the new information. Specifically, rule-based algorithms may have a different set of parameters and values while machine-learned models would have to be re-trained on the domain corpus of the new data. Thus, the linguistic tagging, candidate selection, and feature extraction would need to be tested and modified based on the aforementioned corpus.

Future work for the study includes integrating anaphora resolution in order to maximize the power of pronouns and other referential linguistic information. Moreover, an ontology consisting of known figures, locations, positions, and organizations in the Philippines can be incorporated to possibly improve the extracted information. Lastly, a larger and more diverse corpus of news articles can serve as examples and aid in training better models and in more exhaustive evaluation.

Acknowledgements

The authors gratefully acknowledge the Department of Science and Technology for the support under the Engineering Research and Development for Technology scholarship.

References

[Cheng et al., 2016] Charibeth Cheng, Bernadyn Cagampan, and Christine Diane Lim. Organizing news articles and editorials through information extraction and sentiment analysis. In 20th Pacific Asia Conference on Information Systems, PACIS 2016, Chiayi, Taiwan, June 27 - July 1, 2016, page 258, 2016.

[Das et al., 2010] A. Das, A. Ghosh, and S. Bandyopadhyay. Semantic role labeling for Bengali using 5Ws. In Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on, pages 1-8, Aug 2010.

[De Araujo et al., 2013] D.A. De Araujo, S.J. Rigo, C. Muller, and R. Chishman. Automatic information extraction from texts with inference and linguistic knowledge acquisition rules. In Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on, volume 3, pages 151-154, Nov 2013.

[Dieb et al., 2012] T.M. Dieb, M. Yoshioka, and S. Hara. Automatic information extraction of experiments from nanodevices development papers. In Advanced Applied Informatics (IIAIAAI), 2012 IIAI International Conference on, pages 42-47, Sept 2012.

[Rabo, 2004] V. Rabo. TPOST: A template-based, n-gram part-of-speech tagger for Tagalog. Master's thesis, De La Salle University, 2004.

[Regalado et al., 2013] R.V.J. Regalado, J.L. Chua, J.L. Co, and T.J.Z. Tiam-Lee. Subjectivity classification of Filipino text with features based on term frequency - inverse document frequency. In Asian Language Processing (IALP), 2013 International Conference on, pages 113-116, Aug 2013.

[Xubu and Guo, 2014] M. Xubu and J.E. Guo. Information extraction of strategic activities based on semi-structured text. In Computational Sciences and Optimization (CSO), 2014 Seventh International Joint Conference on, pages 579-583, July 2014.


Construction of Viral Hepatitis Bilingual Bibliographic Database with Protein Text Mining and Information Integration Functions

Heng Chen* Yongjuan Zhang Chunhong Lin Liwen Zhang Tao Chen

Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yueyang Road 320, Shanghai 200031, China

ShangTex Workers' College, Changshou Road 652, Shanghai 200060, China

[email protected] [email protected] [email protected] [email protected] [email protected]

* Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Proceedings of the 4th International Workshop on Semantic Machine Learning (SML 2017), 20th August 2017, Melbourne, VIC, Australia.

Abstract

With the fast development of viral hepatitis research, a large number of research achievements have been generated and scattered across various literatures. Information service providers face the challenge of satisfying readers' needs for more efficient and intelligent retrieval. Data mining and information integration are promising and effective ways to meet this challenge and are becoming more and more important. Our study describes how the viral hepatitis bibliographic database was built, how viral hepatitis related protein information is mined from it, and how that information is integrated with the corresponding entries in the Universal Protein Resource, the Uniprot database from EBI. With the help of a Chinese and English bilingual protein control vocabulary built by ourselves, mining of the viral hepatitis related protein text in the bilingual bibliographic database is realized and integration with the corresponding protein information in the Uniprot database is achieved. In short, our paper describes the integration and mapping between Chinese-English bilingual bibliographic databases and an authoritative factual database (the Uniprot database) through the relevant text mining work. This should be useful for the extension, utilization and mining of Chinese-English bilingual bibliographic resources, as well as for cross-lingual information retrieval, integration, and mining.

1 Introduction

At present, a global flood of information affects all aspects of human life. As one of the most active research fields, life science generates countless achievements and datasets every year that are scattered across various literatures. Within the life sciences, viral hepatitis is a serious infectious disease caused by various hepatitis viruses, and hepatitis viruses are, arguably, among the most intensely studied viruses in the history of biomedical research. With the fast development of viral hepatitis research, a large number of research achievements have been generated and scattered across various literatures. Although most of them are accessible through databases and web sites, it is still a problem for readers to identify what they really need from enormous search results. So mining and information integration are essential to meet readers' needs for more efficient and intelligent retrieval. Different useful information resources can be further integrated after the information is filtered, digitized and mined. The integrated information resources can then be chosen, organized and processed according to the needs of different readers or users so as to yield new information resources and new knowledge. The integration of digital information resources includes data integration, information integration, and knowledge integration; knowledge integration sits at the highest level of the resource integration system and builds on data and information integration once they have reached a certain stage.


Knowledge mining is a complex process of identifying effective, novel, potentially useful information and knowledge from an information database (Feng and Wang, 2008). Information integration allows users to get the most extensive information, while knowledge mining allows users to quickly find the knowledge they want from a vast ocean of information. The application of information integration and knowledge mining technology and the establishment of a linked and integrated database knowledge service system will allow users to quickly and efficiently find the necessary information and knowledge (Zhang et al., 2010). Nowadays, many professional databases have entered the era of data mining and integration and of knowledge mining and discovery, and focus heavily on information integration and knowledge mining so as to link and integrate different types of databases through one-way or two-way modes. This connects the related databases into an interactive organic whole and enriches their extension and expansion capabilities. Some successful works have been carried out, such as GoPubMed, which can automatically recognize concepts from a user's search query to PubMed and display papers containing relevant terms (Doms and Schroeder, 2005), and Entrez, an integrated search system that enables access to multiple National Center for Biotechnology Information (NCBI) databases (Maglott et al., 2011). Similar works are also reported by Alexopoulou et al. (2008), Chen et al. (2013), McGarry et al. (2006), Pasquier (2008), and Sahoo et al. (2007). Different useful information resources can be further integrated after this information is filtered, digitized and mined. The innovation of database design and construction lets users deeply experience the appeal and potential of information integration and knowledge mining.

In summary, with the development of international scientific databases, information integration and knowledge mining have become the mainstream trend in the processing and utilization of digital information resources. The semantic network is the environment for information integration, and ontology is the core and foundation of semantic web construction. Construction of professional domain ontologies based on the integration and mining of digital information resources will become the focus of information integration and knowledge mining research (Yan, 2008). Based on an analysis of the theory and application of database information integration and knowledge mining at home and abroad, and learning from advanced information integration and knowledge mining technology, the authors explore the association and integration of Chinese and English bilingual literature databases on viral hepatitis with related scientific data databases in the innovative construction of the viral hepatitis special literature knowledge database. Moreover, the authors further study the deep processing of the subject classification index of the literature in the knowledge database, driven by users' needs, so as to facilitate readers' use and retrieval. Literature databases and protein science databases are among the most important supporting resources for hepatitis virus researchers. In this paper, we therefore build the viral hepatitis bilingual bibliographic database and perform viral hepatitis related protein text mining and integration with the Uniprot protein database, so as to support information retrieval and knowledge discovery for Chinese and international hepatitis virus researchers.

2 Materials, Methods, Design and Results

2.1 Materials

Data resources: the Medline database from NCBI for the English dataset, the CNKI database from the China National Knowledge Infrastructure for the Chinese dataset, and the Uniprot protein database from EBI (the European Bioinformatics Institute) for the protein dataset.

Methods and procedure:
① Collect, select and process the viral hepatitis and hepatitis virus A, B, and C related datasets (literature data) from the above Chinese and English databases;
② Build the bilingual text mining control vocabulary (dictionary);
③ Perform text mining of viral hepatitis related proteins in the viral hepatitis bilingual literature database;
④ Perform preliminary research on eliminating false positives from the mining results;
⑤ Integrate the viral hepatitis bilingual literature database with the Uniprot protein database on the basis of the mined hepatitis virus A, B and C related proteins.
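As an illustration of step ③, a minimal dictionary-based matching sketch is shown below; the vocabulary entries are only examples, and the real control vocabulary and matching program are far larger and more elaborate than this assumption.

```python
import re

# Illustrative entries only; the real control vocabulary holds about
# 1,470,000 terms mapping Chinese and English protein names to a
# canonical English form used for integration with Uniprot.
CONTROL_VOCAB = {
    "HBeAg": "HBeAg",
    "Capsid protein": "Capsid protein",
    "乙型肝炎 e 抗原": "HBeAg",
    "衣壳蛋白质": "Capsid protein",
}

def mine_proteins(record_text):
    """Return (matched span, canonical protein name) pairs found in one
    bibliographic record, matching longer vocabulary terms first."""
    hits = []
    for term in sorted(CONTROL_VOCAB, key=len, reverse=True):
        for m in re.finditer(re.escape(term), record_text, flags=re.IGNORECASE):
            hits.append((m.group(0), CONTROL_VOCAB[term]))
    return hits
```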

2.2 Design

System design

1. System architecture: 3-tier structure based on the B/S model (separation of web server and database server). See Fig. 1.


Figure 1 System architecture

2. System hardware platform: IBM 4-core servers.
3. System software platform: Operating system: Linux, Ubuntu 9.04; web server: Nginx 0.87; database software: MySQL 5.6.22; development languages: C++ for the information index module and data mining module, and PHP for the web application module.
4. Integration design architecture of the database system platform. See Fig. 2.

Figure 2 Database system platform structure

Figure 2 demonstration: On the one hand, literature records about viral hepatitis A, B and C from the Medline database of the Web of Science platform in English and from the CNKI database of China in Chinese were screened, collected and processed into the viral hepatitis related literature knowledge data warehouse. On the other hand, the control vocabulary of the Uniprot protein database from EBI was also screened, collected, processed and translated into the Chinese and English bilingual viral hepatitis related protein text mining control vocabulary. Then the indexed viral hepatitis subject literature knowledge database was built by the index program, including improved index procedure control and an optimized index algorithm, through application of the protein text mining control vocabulary to the processed viral hepatitis related literature data warehouse. Finally, integration of the indexed viral hepatitis subject literature knowledge database and the Uniprot protein database was realized by mapping rules through protein text or knowledge mining algorithms and machine learning.

5. Viral hepatitis related literature indexing and processing. See Fig. 3.

Figure 3 Literature indexing and processing flow chart


Figure 3 demonstration: The literatures in the viral hepatitis knowledge data warehouse were indexed and processed according to the three stages in the flow chart. Stage 1 is preprocessing before indexing. Stage 2 is control during the indexing procedure. Stage 3 is feedback control after indexing. The aim of all three stages is to protect the protein text mining from false positive indexing and mining results.

6. Database system function module components:

① Information issue/management system
② Literature knowledge database processing/maintaining system
③ Administration system for user rights and IP addresses
④ Information index system
⑤ Knowledge mining system
⑥ Knowledge inquiry system
⑦ Data maintaining system
⑧ Web site visiting and statistics system

Construction of the Chinese-English bilingual control vocabulary dictionary. A partial exemplary diagram of the bilingual control vocabulary is shown in Fig. 4.

Figure 4 Demonstration diagram of part of the exemplary bilingual control vocabulary of viral hepatitis (A, B, C) proteins

Information integrating and hyperlinking regulation and examples for the mined protein text in literature using the Chinese-English bilingual control vocabulary

Using HBV related protein text as an example, the information integrating and hyperlinking regulation for mined English protein text in the literature is as follows:
① HBeAg, http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=HBeAg
② Capsid protein, http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=Capsid%20protein
③ Large envelope protein, http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=Large%20envelope%20protein
④ RNA-directed DNA polymerase, http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=RNA-irected%20DNA%20polymerase

For mined Chinese protein text in the literature, the Chinese protein name is translated into its English protein text in advance; for example, "乙型肝炎 e 抗原" is translated into "HBeAg" and "衣壳蛋白质" is translated into "Capsid protein". Information integrating and hyperlinking is then performed according to the regulation and examples above, as sketched below.
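A minimal sketch of this hyperlinking regulation follows; it simply URL-encodes the (translated) English protein name onto the quickSearch endpoint shown in the examples, and the tiny translation table is illustrative only.

```python
from urllib.parse import quote

BASE = "http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord="

# Illustrative excerpt of the bilingual vocabulary: Chinese name -> English name.
ZH_TO_EN = {"乙型肝炎 e 抗原": "HBeAg", "衣壳蛋白质": "Capsid protein"}

def protein_link(name):
    """Build the integration hyperlink for a mined protein name; Chinese
    names are first translated to English via the control vocabulary."""
    english = ZH_TO_EN.get(name, name)
    return BASE + quote(english)

# protein_link("Capsid protein") ->
# "http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=Capsid%20protein"
```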

Main performance index of the database system:

1. The biggest record number for the literature information: 0.2 billion.

2. Index and data mining time: with the database system currently containing one million four hundred and seventy thousand (1,470,000) control vocabulary entries and about twenty thousand (20,000) literature records, the index and data mining time is about eighteen minutes. After a single literature record is added, the index and data mining time is about five minutes.

3. The average retrieval time: < 0.03 s (seconds).
4. The amount of concurrency (the number of users with simultaneous access): > 50 people.

The viral hepatitis subject literature knowledge database extends three functions through data mining, information integration and hyperlinking:

1. Obtain the protein sequence and annotation information

2. Perform homological analysis of the protein sequences (BLAST)

3. Perform different alignment of the protein sequences and evolutionary tree mapping

2.3 Results

Function realization and result display of the viral hepatitis subject literature knowledge database

Homepage of the viral hepatitis subject literature knowledge database. See Fig. 5.

Figure 5 Homepage of the viral hepatitis subject literature knowledge database

Realization of protein mining for the viral hepatitis literature knowledge database:


The viral hepatitis related proteins are successfully mined by using the bilingual control vocabulary, algorithm and computer program in the viral hepatitis bilingual bibliographic database. Moreover, the viral hepatitis bilingual bibliographic database is integrated with the protein database through the protein mining and information integration. See Figs. 6, 7 and 8.

Figure 6 Page of the hepatitis viral protein mining (1)

Figure 7 Page of the hepatitis viral protein mining (2)


Figure 8 Page of the hepatitis viral protein of the literature database integrating and hyperlinking to the Uniprot protein scientific database

The viral hepatitis subject literature knowledge database extends three functions through data mining, information integration and hyperlinking:
Obtain the hepatitis viral protein sequence and annotation information. See Fig. 9.
Result of homological analysis of the protein sequences (BLAST). See Fig. 10.
Obtain the evolutionary tree mapping. See Fig. 11.

Figure 9 Page of the protein sequence and annotation information of HBcAg


Figure 10 Page of homological analysis result of the HBcAg protein sequences (BLAST)

Figure 11 Page of the evolutionary tree mapping of the HBcAg protein


3 Discussion, Conclusion and Future Work

3.1 Discussion

The viral hepatitis bilingual bibliographic database was successfully built, the protein text was successfully mined, and the two different classes of databases were successfully integrated, but we encountered some problems, especially false positive results in the bilingual protein text mining. Having investigated the false positive issue, we think there are probably three causes of the false positive mining results:

1) Low quality of the original datasets collected;
2) Insufficient accuracy and consistency of specialized word usage in building the bilingual control vocabulary;
3) In data mining and integration, the computer algorithms, the choice of mining mode and route, or the algorithm itself are unreasonable, or the system has defects.

To address these problems, we use manual quality control to handle the collected original datasets; we refer to specialized dictionaries and consult experts to solve the accuracy and consistency question of specialized word usage; and we explore different algorithms, mining modes and routes to solve the accuracy and efficiency questions of data mining and integration.

After the viral hepatitis bilingual bibliographic database was used and demonstrated, we received a lot of feedback from users. Most of them appreciate the convenience of easily searching hepatitis viral protein names, locating highlighted viral protein names in search results, and accessing the UniProt database for detailed protein information through information integration and links. But they also raised some questions and proposed much advice. Overall, however, the feedback is very positive so far. Based on users' suggestions and problems, the following issues are currently being considered, and some of them are actually being undertaken, in order to further enhance the system and make it more efficient and convenient:

1) Add more hepatitis viral protein names and their features to the English-Chinese controlled-vocabulary dictionary. This work is being conducted continuously, and we also plan to add relationships among hepatitis viral proteins and other relevant information so as to finally construct a Chinese hepatitis viral protein ontology. It would then be possible to realize semantic-based text mining and provide users with a knowledge-based information service.

2) Integrate more factual scientific databases, especially factual gene databases. Some users are also interested in other special fields, such as evidence-based medicine, AIDS, etc. If the search results for a special topic from a bibliographic database can be integrated with relevant factual scientific databases, this is certainly very helpful and convenient for users. This is an interesting direction for information integration and knowledge mining.

3.2 Conclusion

With the fast development of viral hepatitis research, satisfying users' information needs is becoming an inevitable challenge. Construction of the viral hepatitis bilingual literature database is therefore important, significant and useful. Integration of two different classes of databases via data mining and linking is innovative and is the trend for database development. Moreover, information integration and data mining are playing a more and more important role in the big data era.

3.3 Future work

In order to solve the problems above, future work must be done as follows:

1) Constantly extend and update datasets in viral hepatitis bilingual literature database;

2) Constantly improve mining and integration quality so as to reduce false positive results as much as possible through algorithm improvement and machine learning;

3) Further improve accuracy and unity of the bilingual control vocabulary;

4) The viral hepatitis bilingual literature database will be linked to more factual scientific databases via data mining and information integration.

Acknowledgements

This work is supported by the Chosen Excellent Program for Introduced Outstanding Talent of the Chinese Academy of Sciences in the Fields of Bibliographical Information and Periodical Publication 2010 (Subject field 100 talent program) and the Chinese National Science and Technology Support Project (No. 2013BAH21B06).

References

Alexopoulou, D., Wächter, T., Pickersgill, L., Eyre, C. and Schroeder, M.: Terminologies for text-mining; an experiment in the lipoprotein metabolism domain. BMC Bioinformatics. 9(Suppl 4), S2, 2008

Chen Heng, Jin Yi, Zhao Yan, Zhang Yongjuan, Chen Chengcai, Sun Jilin, Zhang Shen. Mining and Information Integration Practice for Chinese Bibliographic Database of Life Sciences. In: Advances in Data Mining: Applications and Theoretical Aspects, Vol. 7987, pp. 1-10, 2013. Springer Berlin Heidelberg. 13th Industrial Conference, ICDM 2013, New York, NY, USA, July 16-21, 2013, Proceedings. (DOI: 10.1007/978-3-642-39736-3)

Doms, A. and Schroeder, M.: GoPubMed: exploring PubMed with the gene ontology. Nucleic Acids Research. Vol.33: 783-786, 2005

Feng Xinmin and Wang Jiandong. The concept dilemma of knowledge mining and the broad-sense knowledge mining. Journal of Information, Vol.27 (7): 63-65, 2008

Maglott, D., Ostell, J., Pruitt, K. and Tatusova, T.: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research. Vol.39: 52-57, 2011

McGarry, K., Garfield, S. and Morris, N.: Recent trends in knowledge and data integration for the life sciences. Expert Systems. Vol.23(5): 330-341, 2006

Pasquier, C.: Biological data integration using Semantic Web technologies. Biochimie. Vol.90: 584-594, 2008

Sahoo, S., Bodenreider, O., Zeng, K. and Sheth, A.: An experiment in integrating large biomedical knowledge resources with RDF: Application to associating genotype and phenotype information. In: 16th International World Wide Web Conference (WWW2007) on Health Care and Life Sciences Data Integration for the Semantic Web, pp. 8-12. Banff, Canada (2007)

Yan Zhihong. Research on the integration mode of digital information resources in Chinese university libraries. Master's thesis, Chongqing University, 2008

Zhang Xiaojuan, Zhang Yutao, Zhang Jieli and Wang Juncheng. The central research issues of information resources integration in China. Journal of the China Society for Scientific and Technical Information, Vol.28 (5): 791-800, 2010

Topical Sentence Embedding for Query Focused Document Summarization

Yang Gao (Beijing Institute of Technology (BIT); Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications) [email protected]

Linjing Wei (BIT; Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University) [email protected]

Heyan Huang (BIT; Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications) [email protected]

Qian Liu (BIT; Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University) [email protected]

Abstract

Distributed vector representations for sentences have been utilized in the summarization area, since they simplify semantic cosine calculation between sentences as well as between sentences and documents. Many extensions have been proposed to incorporate latent topics and word embeddings; however, few of them assign sentences explicit topics. Besides, most sentence embedding frameworks follow the same spirit of predicting a word in the sentence, which omits sentence-to-sentence coherence. To address these problems, we propose a novel sentence embedding framework that combines the current sentence representation, its word-based content and its topic assignment to predict the next sentence representation. Experiments on summarization tasks show that our model outperforms state-of-the-art methods.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.

In: A. Editor, B. Coeditor (eds.): Proceedings of the XYZ Workshop, Location, Country, DD-MMM-YYYY, published at http://ceur-ws.org

1 Introduction

Text summarization is an important task in natural language processing: a system is expected to understand the meaning of the documents and then produce a coherent, informative but brief summary of the original documents within a limited length. The main approaches to text summarization fall into two categories: extractive and generative. Most extractive summarization systems extract parts of the document (a few sentences or a few words) that are deemed interesting by some metric (e.g., inverse document frequency) and join them to form a summary. Conventionally, sentence selection relies on feature engineering, extracting surface statistics (e.g., TFIDF cosine similarity) to compare sentences with the query and document representations.

Recently, distributed vector representations for words and sentences have achieved overwhelming success in the summarization area [KMTD14, KNY15, YP15], since they convert high-dimensional and sparse linguistic data into dense semantic vectors of controllable dimension. This makes it more straightforward for generic summarization to compute similarity (or, to some extent, relevance) and facilitates semantic calculation. Inspired by the successful word2vec model [MCCD13, MSC+13], the Paragraph Vector (PV) model [LM14] (where the "paragraph" can be a sentence, paragraph or document) predicts the next word given the sequential word context and the current paragraph representation. It inherits the semantic representation and efficiency of word2vec and further captures word order for sentence representation. Moreover, the sentence vector can benefit summaries since it directly characterises the relevance between queries and candidate sentences.

However, most sentence embedding models [LM14, YP15] are trained on the task of predicting a word in the sentence. In these models, sentences are learnt independently from their local word content, and the coherence relationship between sentences is often omitted. A summarization system cares about more comprehensive attributes of sentences, such as sentence coherence, sentence topic and sentence representation. Using conventional sentence vectors may therefore neglect the coherence between candidate sentences as well as sentence topics. Although models that combine topics with word embeddings, such as TWE [LLCS15], have achieved good results in some NLP tasks, at the sentence level very little work focuses on representing sentences with topics. For example, consider a user query that emphasises possible plans, progress and problems with hydroelectric projects. The query contains several topics, such as "plans", "progress", "problems" and "hydroelectric projects". Standard vector-based models, however, retrieve relevant sentences that cover only one or two aspects of the query; it is difficult for them to capture all aspects of the query.

To tackle these problems, we propose a novel sentence embedding learning framework that enhances sentence representation by incorporating multi-topic semantics for the summarization task, called the Topical Sentence Embedding (TSE) model. Gaussian distributions are used to model the mixture centres of the embedding space, which capture a prior preference of topic for sentence prediction. In addition, instead of training to predict words in the document, our model represents one sentence by predicting the next sentence, jointly training on the words in the current sentence and the topic of the sentence.

The rest of this paper is organized as follows. Section 2 summarizes the basic embedding models and summarization systems. We then introduce the new summarization framework in Section 3; in particular, Section 3.1 proposes the novel TSE model. Section 4 reports the experimental results and the corresponding analysis. Finally, we conclude the paper.

2 Background and Related Work

We first introduce the Word2Vec and PV models to establish the basic framework for training embedding models for words and sentences.

Word2Vec:

The basic assumption behind Word2Vec [MCCD13] is that co-occurring words have similar representations in the semantic space. To this end, a sliding window is moved over the input text stream; the central word is the target word and the others are its contexts. Word2Vec contains two models, CBOW and Skip-gram. CBOW aims at predicting the target word using the context words in the sliding window. The objective of CBOW is to maximize the average log probability

L = \frac{1}{D} \sum_{i=1}^{D} \log \Pr(w_i \mid C; W)    (1)

where w_i is the target word, C is the word context, W is the word matrix and D is the corpus size. Different from CBOW, Skip-gram aims to predict the context words given the target word. We omit the details of this approach here.
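To make the CBOW objective of Eq. (1) concrete, the following minimal NumPy sketch (our own illustration, not the authors' code) computes the log probability of a target word given its context with a full softmax; the matrices and indices are toy placeholders.

```python
import numpy as np

def cbow_log_prob(W_in, W_out, ctx_ids, target_id):
    """Log Pr(target | context) under a CBOW model with a full softmax.

    W_in   : (V, d) input word-embedding matrix
    W_out  : (V, d) output word-embedding matrix
    ctx_ids: indices of the context words in the sliding window
    """
    h = W_in[ctx_ids].mean(axis=0)            # average context embedding
    scores = W_out @ h                        # unnormalized scores for all words
    scores -= scores.max()                    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return log_probs[target_id]

# Toy usage: vocabulary of 10 words, 8-dimensional embeddings.
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
print(cbow_log_prob(W_in, W_out, ctx_ids=[1, 3, 4], target_id=2))
```

Averaging Eq. (1) over all target positions in a corpus gives the quantity that CBOW training maximizes.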

Paragraph Vector (PV):

PV [LM14] is an unsupervised algorithm that learns fixed-length semantic representations of variable-length texts and follows the same prediction task as Word2Vec. The only change is that the input vector is a concatenation constructed from W and S, where S is a sentence matrix, instead of W alone. The PV model is a strong sentence model and is widely applied for learning representations of sequential data.
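As an illustration of how PV-style sentence vectors can be obtained in practice, the sketch below uses gensim's Doc2Vec, one common implementation of the PV idea; the corpus and parameter values are placeholders, not the settings used in this paper.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each sentence becomes a TaggedDocument with a unique tag.
sentences = [
    "hydroelectric projects face funding problems",
    "the dam construction plan shows steady progress",
]
corpus = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(sentences)]

# Train a small PV model; vector_size/window are illustrative choices.
model = Doc2Vec(corpus, vector_size=128, window=5, min_count=1, epochs=50)

# Infer a vector for an unseen sentence (e.g., a user query).
query_vec = model.infer_vector("problems with hydroelectric plans".split())
print(query_vec.shape)
```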

Work on extractive summarization spans a large range of approaches. Most existing systems [Gal06, YGVS07] use a ranking model to select the sentences with the highest scores to form the summary. However, multi-document texts often describe one central topic and several sub-topics, which cannot be covered by a ranking model alone. We therefore focus on how to rank sentences while also accounting for topic coverage.

A variety of features have been defined to measure relevance, including TF-IDF cosine similarity [NVM06, YGVS07], cue words [LH00], topic themes [HL05] and WordNet similarity [OLLL11]. However, these features usually lack a mechanism for deep semantic understanding and fail to meet the query need. Since Mikolov et al. [MCCD13] proposed an efficient word embedding method, there has been a surge of work [LM14, LLCS15] on embedding models that capture linguistic regularities. Embedding models for words and sentences [KMTD14, KNY15, YP15, CLW+15] have also benefited summarization from the perspective of semantic relevance computation, such as DocEmb and CNNLM. However, the aforementioned methods usually reward semantic similarity without considering topic coverage, and so fail to meet the summary need.

Topic-based methods have proved successful for summarization. Parveen et al. [PRS15] proposed an approach based on a weighted graphical representation of documents obtained by topic modeling. [GNJ07] measured topic concentration in a direct manner: a sentence was considered relevant to the query if it contained at least one word from the query. These works, however, assume that documents related to the query only discuss one topic. Tang et al. [TYC09] proposed a unified probabilistic approach to uncover query-oriented topics and four scoring methods to calculate the importance of each sentence in the document collection. Wang et al. [WLZD08] proposed a multi-document summarization framework (SNMF) based on sentence-level semantic analysis and symmetric non-negative matrix factorization; the symmetric matrix factorization has been shown to be equivalent to normalized spectral clustering and is used to group sentences into clusters. Furthermore, several approaches that combine vector representations with topics, such as NTM [CLL+15], TWE [LLCS15] and GMNTM [YCT15], bring together the benefits of semantic representation and explicit topics. This motivates us to investigate such cooperative models for a summarization system.

3 The Framework for Query-focused Summarization

Extracting salient sentences is the main task in this study. At the sentence level, sentence embedding and sentence ranking are used to measure sentence relevance to the user queries and extract salient summaries.

3.1 The Proposed TSE Model

Inheriting the strength of the PV model, which constructs a continuous semantic space, the novel architecture for learning sentence representations, called the TSE model, is shown in Figure 1.

[Figure 1: The structure of the proposed TSE model. The topic assignment T_s (from the GMM over topics T_1 ... T_k), the sentence vector s and the context words w_1 ... w_n are concatenated and fed to a classifier that predicts the next sentence s*.]

Topic Vectorization by GMM

Let K be the number of topics, V the size of the vectors and W the word dictionary. S denotes the sentence collection, in which s is one of the sentences. Let vec(T_s) be the topic vector of sentence s. The vectors of sentences and words are represented as vec(s) ∈ R^V and vec(w) ∈ R^V. π_k ∈ R, μ_k ∈ R^V and Σ_k ∈ R^{V×V}, with \sum_{k=1}^{K} π_k = 1, denote the mixture weights, means and covariance matrices, respectively. The parameters of the GMM are collectively written λ = {π_k, μ_k, Σ_k}, k = 1, ..., K. Given this collection of parameters, we use

P(x \mid \lambda) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k)    (2)

to represent the probability distribution for sampling a vector x from the GMM.

Subsequently, we can infer the posterior probability distribution of topics. For each sentence s, the posterior distribution of its topic is

q(z_s = k) = \frac{\pi_k N(vec(s) \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} N(vec(s) \mid \mu_{k'}, \Sigma_{k'})}    (3)

Based on this distribution, the topic of sentence s can be vectorized as vec(T_s) = [q(z_s = 1), q(z_s = 2), ..., q(z_s = K)].
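A minimal NumPy/SciPy sketch of Eq. (3), turning a sentence vector into its topic vector; the GMM weights, means and covariances below are random placeholders rather than learned parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

def topic_vector(sent_vec, weights, means, covs):
    """Posterior q(z_s = k) for each topic k, as in Eq. (3)."""
    densities = np.array([
        w * multivariate_normal.pdf(sent_vec, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return densities / densities.sum()

# Toy GMM with K=3 topics over V=4 dimensional sentence vectors.
rng = np.random.default_rng(1)
K, V = 3, 4
weights = np.array([0.5, 0.3, 0.2])
means = rng.normal(size=(K, V))
covs = [np.eye(V) for _ in range(K)]

vec_Ts = topic_vector(rng.normal(size=V), weights, means, covs)
print(vec_Ts, vec_Ts.sum())   # a K-dimensional topic vector summing to 1
```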

Generative Sentence Embedding

The assumption behind TSE is that sentences are coherent and associated with their neighbours. Consequently, we model one sentence as a prediction task based on the semantic structure of the previous sentence. The semantics are represented by combining the sentence topic, the sentence representation and its content. The Negative Sampling (NEG) method applied in [MCCD13] is an efficient approximation method, so we adopt a similar estimation scheme in our model.

Definition 1. Label l_s: A label of sentence s is 1 or 0. The label of the positive sample is 1, and the labels of negative samples are 0. For all \tilde{s} ∈ S,

l_s(\tilde{s}) = \begin{cases} 1, & \tilde{s} = s \\ 0, & \tilde{s} \neq s \end{cases}    (4)

Let X_s be the concatenation of the given information of the current sentence s' used to predict the next sentence s:

X_s = vec(T_{s'}) \oplus vec(s') \oplus vec(w_1) \oplus \cdots \oplus vec(w_m)

We use this concatenation as the input, which includes the topic, the sentence embedding and the word content of the current sentence.

Given the collection S, we show how to learn the representations of sentences and topics. In this paper, we concentrate on exploiting the latent relationship between sentences. The target sentence s is therefore predicted purely from the information of the previous sentence, namely X_s. The objective of TSE is to maximize the probability

G = \prod_{s \in S} g(s) = \prod_{s \in S} \prod_{u \in \{s\} \cup s^-} p(u \mid X_s)    (5)

Instead of using a softmax function as the prediction probability, we directly use its negative sampling approximation. The prediction objective function of sentence s is g(s) = \prod_{u \in \{s\} \cup s^-} p(u \mid X_s), and the probability function is

p(u \mid X_s) = \begin{cases} \sigma(X_s^T \theta_u), & l_s(u) = 1 \\ 1 - \sigma(X_s^T \theta_u), & l_s(u) = 0 \end{cases}    (6)

or, written as a whole,

p(u \mid X_s) = [\sigma(X_s^T \theta_u)]^{l_s(u)} \cdot [1 - \sigma(X_s^T \theta_u)]^{1 - l_s(u)}    (7)

where \sigma(x) = 1/(1 + \exp(-x)) and \theta_u \in R^V is the parameter vector associated with u.

Taking the log-likelihood, the objective function is

L = \sum_{s \in S} \Big\{ l_s(u) \log[\sigma(X_s^T \theta_u)] + (1 - l_s(u)) \, n \, \mathbb{E}_{s^* \sim N(S)} \log[1 - \sigma(X_s^T \theta_u)] \Big\}    (8)

where the expectation is over n negative samples s^* drawn from the noise distribution N(S), as in Definition 1, and n is set to 10 empirically. For convenience of estimation, we rewrite the final objective function as

L(s, u) = l_s(u) \cdot \log[\sigma(X_s^T \theta_u)] + [1 - l_s(u)] \cdot \log[1 - \sigma(X_s^T \theta_u)]    (9)
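As an illustration of the negative-sampling objective in Eqs. (6)-(9), here is a small NumPy sketch (our own, with made-up vectors) computing the per-pair term L(s, u) for one positive next sentence and n negative samples:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tse_pair_loss(X_s, theta_u, label):
    """Negative-sampling term of Eq. (9): label=1 for the true next
    sentence, label=0 for a negative sample."""
    p = sigmoid(X_s @ theta_u)
    return label * np.log(p) + (1 - label) * np.log(1 - p)

rng = np.random.default_rng(2)
dim = 16
X_s = rng.normal(size=dim)                 # concatenated topic/sentence/word input
theta_pos = rng.normal(size=dim)           # parameter vector of the true next sentence
theta_negs = rng.normal(size=(10, dim))    # n = 10 negative samples

objective = tse_pair_loss(X_s, theta_pos, 1) + sum(
    tse_pair_loss(X_s, th, 0) for th in theta_negs)
print("objective contribution:", objective)   # maximized (or negated for SGD)
```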

Parameters Estimation

The parameters λ, θ_u and X_s, where λ = {π_k, μ_k, Σ_k}, are estimated jointly by maximizing the likelihood of the objective function. A two-phase iterative process is conducted: given θ_u and X_s, stochastic gradient descent (SGD) is used to update the GMM parameters; given λ, the gradient of θ_u is computed by back-propagation from the objective in Eq. (9).

3.2 Sentence Ranking

Sentence ranking aims to measure relevant sentences with respect to the query information. In this paper, relevance ranking of sentences relies primarily on semantic vector-based cosine similarity [KMTD14], which is a promising measure of relatedness for summarization; statistical features (e.g., the TFIDF score [NVM06]) are also incorporated. In summary, the ranking score is formulated as

Score(s) = \alpha \sum_{t=1}^{n_w} TFIDF(w_t) + \beta \, sim(vec(s), vec(Q)) + \gamma \, sim(vec(T_s), vec(T_Q))    (10)

where Q is the query, sim(·) is the similarity function (cosine similarity in this paper), and α, β and γ are parameters of the summarization system.
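A small sketch of Eq. (10), combining a TFIDF term with cosine similarities of sentence/query vectors and topic vectors; the weights α, β, γ and the vectors below are placeholder values, not the tuned system parameters.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranking_score(tfidf_scores, sent_vec, query_vec, sent_topic, query_topic,
                  alpha=0.3, beta=0.5, gamma=0.2):
    """Eq. (10): weighted sum of a TFIDF term and two cosine similarities."""
    return (alpha * sum(tfidf_scores)
            + beta * cosine(sent_vec, query_vec)
            + gamma * cosine(sent_topic, query_topic))

rng = np.random.default_rng(3)
score = ranking_score(
    tfidf_scores=[0.12, 0.08, 0.30],        # TFIDF(w_t) of the sentence's words
    sent_vec=rng.normal(size=64), query_vec=rng.normal(size=64),
    sent_topic=np.array([0.7, 0.2, 0.1]), query_topic=np.array([0.5, 0.4, 0.1]))
print(score)
```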

4 Experiments

In this section, we present experiments to evaluate the performance of our method on the query-focused multi-document summarization task.

4.1 Dataset and Evaluation Metrics

In this study, we use the standard summarization benchmarks DUC2005 and DUC2006 (http://duc.nist.gov/data.html) for evaluation. DUC2005 contains 50 query-oriented summarization tasks; for each query, a relevant document cluster of 25-50 documents is assumed to be "retrieved". DUC2006 also contains 50 query-oriented summarization tasks, each with 25 documents. The task is thus to generate a summary from the document cluster that answers the query (in DUC, the query is also called the "narrative" or "topic"). The length of a result summary is limited to 250 words.

We evaluate with the ROUGE [LH03] metrics, which assess the quality of a summary by counting the number of overlapping units, such as n-grams, with reference summaries. Basically, ROUGE-N is an n-gram recall measure.
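For clarity, a minimal sketch of ROUGE-N as n-gram recall (a simplified single-reference version, not the official ROUGE toolkit used for the reported numbers):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=2):
    """Simplified ROUGE-N: overlapping n-grams divided by reference n-grams."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n("the dam project shows progress",
              "the hydroelectric dam project shows steady progress", n=2))
```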

4.2 Baseline Models and Settings

We compare the TSE model with several query-focused summarization methods.

• TF-IDF: this model uses TF-IDF [NVM06] for scoring words and sentences.

• Lead: takes the first sentences one by one from the documents in the collection, where documents are ordered randomly. It is often used as an official DUC baseline.

• LDA: this method uses Latent Dirichlet Allocation [BNJ03] to learn the topic model. After the topic model is learned, the maximum score is given to words that share a topic with the query; see [TYC09] for details.

• SNMF: this system [WLZD08] performs topic-biased summarization. It uses symmetric non-negative matrix factorization (SNMF) to cluster sentences and then selects multi-coverage summary sentences from the clusters.

• Word2Vec: word vector representations are learned with Word2Vec [MCCD13, MSC+13]; the sentence representation is the average of all word embeddings in the sentence.

• PV: PV [LM14] learns sentence vectors based on the Word2Vec model. We use the same parameters as in our approach to calculate the sentence scores.

• TWE: TWE [LLCS15] employs LDA to refine the Skip-gram model and learns topical word embeddings based on both words and their topics. The sentence representation is the average of all word vectors in the sentence.


Table 1: Overall ROUGE evaluation (%) of different models for DUC2005 and DUC2006

Method     DUC2005 ROUGE-1   DUC2005 ROUGE-2   DUC2006 ROUGE-1   DUC2006 ROUGE-2
LEAD       29.71             4.69              32.61             5.71
TF-IDF     33.56             5.20              35.93             6.53
Avg-DUC    34.34             6.02              37.95             7.54
SNMF       35.0              6.04              37.14             7.52
Word2Vec   34.59             5.48              36.33             6.34
PV         35.41             6.14              37.52             7.41
DocEmb     30.59             4.69              32.77             5.61
LDA        31.70             5.33              33.07             6.02
TWE        35.05             6.06              37.58             6.52
TSE        36.28             6.53              37.96             7.56
Impr       2.46              6.35              0.03              0.27

Table 2: Influence analysis of each factor for the TSE summarization, evaluated on DUC2005

TF-IDF   sen sim   topic   ROUGE-1   ROUGE-2   ratio 1   ratio 2
×        √         √       35.54     6.37      2.04%     2.45%
√        ×         √       34.88     5.99      3.86%     8.27%
√        √         ×       35.92     6.47      0.99%     0.91%

Note that all baselines are run within the same framework as the proposed unsupervised query-focused summarization system.

The learning rate η is set to 0.05 and gradually reduced to 0.0001 as training converges. The word2vec model is additionally trained on English Gigaword Fifth Edition (https://catalog.ldc.upenn.edu/LDC2011T07) with dimension 256. The dimension of PV is set to 128, and that of TWE to 64, similar to the proposed TSE model.

4.3 Experimental Results and Discussion

In this subsection, we report the experimental results and analysis. Table 1 shows the overall summarization performance of the proposed model and the baseline models. Our approach gives the best summaries in ROUGE metrics compared to every other method over the two benchmark datasets, which demonstrates the strong performance of the proposed summarization model. Impr denotes the relative improvement over the best of the nine baselines; the proposed TSE sentence embedding consistently outperforms the baselines by 0.03% to 6.35%.

The experimental results validate that exploiting sentence similarity and topic information can improve overall performance, but they do not by themselves show the impact of each component of the ranking measure. We therefore keep the framework fixed and remove one feature at a time from the sentence ranking, to investigate the importance of each element, as shown in Table 2. We compute the percentage by which the full TSE is superior to the variant that drops one feature, denoted ratio 1 for ROUGE-1 and ratio 2 for ROUGE-2. With ratio 1 at 3.86% and ratio 2 as high as 8.27% when sentence similarity is removed, sentence similarity computed by the proposed sentence embedding clearly plays the dominant role in the summary quality. By contrast, there is still room for improvement in how topics are used for the summary.

5 Conclusion

This work proposes a novel sentence embedding model that incorporates sentence coherence and topic characteristics in the learning process. It automatically generates distributed representations for sentences and assigns sentences semantically meaningful topics. We conduct extensive experiments on the DUC query-focused summarization datasets; by exploiting the proposed TSE to facilitate sentence ranking, the system achieves competitive performance. A promising future direction is to strengthen topic optimization during sentence learning. With the assistance of semantic topics, we could extract sentence-based salient topic representations as a direct summary.

Acknowledgments

This work is supported by the National Basic Research Program of China (973 Program, Grant No. 2013CB329303), the National Natural Science Foundation of China (Grant No. 61602036), and the Beijing Advanced Innovation Center for Imaging Technology (BAICIT-2016007).

References

[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003.

[CLL+15] Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li, and Heng Ji. A novel neural topic model and its supervised extension. In Proceedings of AAAI'15, pages 2210–2216, 2015.

[CLW+15] Kuan Yu Chen, Shih Hung Liu, Hsin Min Wang, Berlin Chen, and Hsin Hsi Chen. Leveraging word embeddings for spoken document summarization. Computer Science, 2015.

[Gal06] M. Galley. A skip-chain conditional random field for ranking meeting utterances by importance. In Proceedings of EMNLP'06, 2006.

[GNJ07] Surabhi Gupta, Ani Nenkova, and Dan Jurafsky. Measuring importance and query relevance in topic-focused multi-document summarization. 2007.

[HL05] Sanda Harabagiu and Finley Lacatusu. Topic themes for multi-document summarization. In Proceedings of SIGIR'05, pages 202–209, 2005.

[KMTD14] Mikael Kageback, Olof Mogren, Nina Tahmasebi, and Devdatt Dubhashi. Extractive summarization using continuous vector space models. In Proceedings of EACL'14, 2014.

[KNY15] Hayato Kobayashi, Masaki Noguchi, and Taichi Yatsuka. Summarization based on embedding distributions. In Proceedings of EMNLP'15, 2015.

[LH00] Chin Yew Lin and Eduard Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of COLING'00, pages 495–501, 2000.

[LH03] Chin Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of ACL'03, 2003.

[LLCS15] Yang Liu, Zhiyuan Liu, Tat Seng Chua, and Maosong Sun. Topical word embeddings. In Proceedings of AAAI'15, 2015.

[LM14] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. Computer Science, 4:1188–1196, 2014.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. Computer Science, 2013.

[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. 26:3111–3119, 2013.

[NVM06] Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proceedings of SIGIR'06, pages 573–580, 2006.

[OLLL11] You Ouyang, Wenjie Li, Sujian Li, and Qin Lu. Applying regression models to query-focused multi-document summarization. Information Processing & Management: An International Journal, 2011.

[PRS15] Daraksha Parveen, Hans-Martin Ramsl, and Michael Strube. Topical coherence for graph-based extractive summarization. In Proceedings of EMNLP'15, pages 1949–1954, 2015.

[TYC09] Jie Tang, Limin Yao, and Dewei Chen. Multi-topic based query-oriented summarization. In Proceedings of SDM'09, pages 1147–1158, 2009.

[WLZD08] Dingding Wang, Tao Li, Shenghuo Zhu, and Chris Ding. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of SIGIR'08, pages 307–314. ACM, 2008.

[YCT15] Min Yang, Tianyi Cui, and Wenting Tu. Ordering-sensitive and semantic-aware topic modeling. In Proceedings of AAAI'15, 2015.

[YGVS07] Wen Tau Yih, Joshua Goodman, Lucy Vanderwende, and Hisami Suzuki. Multi-document summarization by maximizing informative content-words. In Proceedings of IJCAI'07, pages 1776–1782, 2007.

[YP15] Wenpeng Yin and Yulong Pei. Optimizing sentence modeling and selection for document summarization. In Proceedings of IJCAI'15, 2015.

Data Driven Concept Refinement to Support Avionics Maintenance

Luis Palacios Medinacelli¹,², Yue Ma², Gaelle Lortal¹, Claire Laudy¹, Chantal Reynaud²

¹ LRASC, Thales Research & Technology, Palaiseau, France
² LRI, Univ. Paris-Sud, CNRS, Universite Paris-Saclay, France

gaelle.lortal,[email protected], palacios,ma,[email protected]

Abstract

Description Logic ontologies are one of the most important knowledge representation formalisms today; broadly speaking, they consist of classes of objects and their relations. Given a set of objects as samples and a class expression describing them, we present ongoing work that formalizes which properties of these objects are the most relevant for the given class expression to capture them. Moreover, we provide guidance on how to refine the given expression to better describe the set of objects. The approach is used to characterize test results that lead to a specific corrective maintenance action, and in this paper it is illustrated by defining sub-classes of aviation reports related to specific aircraft equipment.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.

In: Proceedings of the SML Workshop, Melbourne, Australia, 20-08-2017, published at http://ceur-ws.org

1 Introduction

Given a DL ontology, we may find that some concept definitions are too generic, in the sense that they are not rich enough to capture only the intended objects, or that the definition does not describe them properly. In order to have more control over what is expressed, having sub-types of such concepts is useful, since we can then make distinctions between objects that could not be made before. Our motivation comes from the avionics maintenance domain [Palacios et al.2016], where there are several levels of maintenance. One of them involves shop repair, where an equipment found faulty on an aircraft is to be repaired or replaced. In this scenario, several tests need to be run to find out the exact part(s) of the equipment that are faulty and cause failures. Once the tests are run, it is up to maintenance experts to determine the possible components of the equipment involved in the failure, and the repairs/replacements to be done. For a maintenance process it is difficult to establish a priori which actions must be taken to return the equipment to a fully functional state. Establishing the most probable actions is useful to shorten examination and repair time, thus gaining in efficiency and lowering costs. In this paper, we model this problem and propose a first method. The idea is based on the fact that a large amount of historical shop repair reports give us the test results and the repair decisions made accordingly. We can consider the closed incidents as positive samples of a repair action, the incidents that do not require this repair action as negative ones, and use the tests as features of each sample. Identifying important test results then amounts to identifying key features that distinguish positives from negatives. Finally, we use these key features to provide a description of the subset of test results that lead to a specific maintenance action. Our approach can thus be used to obtain formal patterns that capture a set of samples; properties that distinguish positive from negative samples; guidance to refine a concept expression; and enriched patterns that can serve as features for clustering or classifying samples.

2 Positioning

Mining formal patterns from data has been identified as an important task in many different research communities. Depending on the target language used for representing a pattern, existing work can be divided into different categories.

Concept Learning. Similar to our paper, DL-Learner [Lehmann2009], a state-of-the-art tool for concept learning, is based on description logic. It provides a framework for supervised machine learning using several highly parameterizable algorithms. It is based on refinement operators such as CELOE for OWL expressions and ELTL for the EL family of DLs. Depending on the desired properties of the operator and the DL constructors allowed, the operator traverses the space of possible concept expressions in a top-down or bottom-up way. These concept expressions are then evaluated, using heuristics, to find the most suitable ones. Shorter and simpler expressions are preferred by these algorithms.

A useful list of approaches for concept learning and their characteristics can be found in [Tran et al.2012], where the authors position their bisimulation-based approach with respect to other techniques.

In [Divroodi and Nguyen2015], the authors study how to establish whether the signature of an ontology (the concepts, roles and individuals, along with the chosen DL family) is sufficient to distinguish between two objects. Broadly speaking, if two objects belong to the same equivalence class, they are indistinguishable. On the other hand, given a set of samples, if there exists an expression in the underlying language that can capture this set of samples, then it must be the union of some equivalence classes. They use this notion to provide an algorithm that learns a concept expression by partitioning the domain with respect to the equivalence classes. One interesting feature of their approach is that it offers a formal way of defining approximations of a concept expression based on rough sets [Nguyen and Szalas2013], using the similarity classes and the underlying language.

In the above approaches, the goal is to find a concept expression that best describes a set of samples by refining the concept. In a specialization scenario, if a false positive can be left out of the extension of the concept by adding a restriction on the objects that belong to it, we know that the selected property separates the false positive from the rest of the samples. Pointing out these properties and objects is the difference and contribution of this paper with respect to the above approaches.

Graph mining. Graph structures are used in a variety of applications to represent structured information. The issue of mining interesting patterns in these graph structures has thus emerged, and a large amount of research focuses on algorithms enabling graph mining. For our application, interesting patterns are, for instance, recurring or frequent patterns that can be found either in one large graph or in a set of (smaller) graphs. They may also be significant patterns, either semantically or statistically representative for the instance. From a machine learning point of view, interesting patterns are those that discriminate well between positive and negative samples.

One of the main problems addressed by the different works is the subgraph isomorphism issue. Subgraph isomorphism is an NP-complete problem, so mining the subgraphs of a graph means studying an exponential number of subgraphs. In [Yan and Han2003], the authors propose CloseGraph, an algorithm to mine closed frequent graphs. The algorithm uses pruning methods to reduce the exploration space. A closed graph in a set of graphs is a graph for which no proper supergraph has the same support; therefore, closed graph patterns are the largest patterns that can be found in a graph database for a given problem.

[Motoda2007], [Inokuchi et al.2000] and [Yoshida et al.1994] present two approaches for extracting frequent subgraphs: AGM and GBI. AGM relies on representing graphs by adjacency matrices, and GBI relies on chunking adjacent nodes in order to generate subgraphs and rewriting the graphs with the selected subgraphs as new nodes.

Using graph mining, the most relevant structures of a set of graphs can be found. It can be used to find patterns that discriminate positive and negative examples. This is very similar to our problem of learning a concept from examples, as patterns describe the common parts of the examples. However, no direct control over the signature is given, and the possibilities for extending a concept are not taken into account. Graph mining techniques can be used as an initial step for our approach, to find a first substructure that partially discriminates positive and negative examples. Our feature extraction algorithm can then be applied in order not to remain at the structural level but to consider the semantics, focusing on the relevant properties. Furthermore, the associated signature provides theoretical limits on what can be expressed.

The rest of the paper is organized as follows: in Section 3 we introduce the approach and the necessary definitions. Then, in Section 4 we show a use case based on aviation reports, where we refine a concept based on expert knowledge. Finally, in Section 5 the conclusions and further work are presented.

3 Defining the most relevant features

We define ontologies based on [Baader2003], where an ontology is a tuple O = ⟨T, A⟩, with T the T-Box and A the A-Box. The T-Box contains the set of concept and role definitions, while the A-Box contains assertions about the definitions in the T-Box. The signature Σ_O of an ontology O is the set of all concept names, role names and individuals in O; for details refer to [Divroodi and Nguyen2015]. To keep the approach simple, in the following we consider T = ∅ and assume the assertions in the A-Box are of the form A(x) or r(x, y), where A and r are atomic.

Given an ontology O, a set of samples X = Pos ∪ Neg (where Pos is a set of positive samples and Neg a set of negative samples) and a concept C describing X up to a certain degree ψ, we are interested in finding a DL expression C′ with a better degree ψ′ for describing X, if it exists. The value of ψ is defined by the user according to the problem (for example recall, accuracy, F-measure, etc.). The problem is that: a) C captures some unintended objects (false positives), b) it does not capture some intended objects (false negatives), or c) both.

We take the case of false positives to illustrate the approach, where C captures some negative samples. In this scenario, we would like to make C more specific, that is, to add restrictions on the objects that belong to it, in such a way that a) some (or all) false positives are no longer captured and b) all (or most) of the positive samples are preserved. Given a concept C and a positive instance x ∈ Pos, we want to know which properties to consider if we want to specialize C. Intuitively, the process is as follows. Any instance and its relations to other objects can be represented as a directed (acyclic) graph, with nodes representing the objects and edges representing the relations between them, as stated by the A-Box. For example, consider the A-Box

A : {r1(x, y), r2(y, w), r3(y, z)}

whose graph is shown in Figure 1.

[Figure 1: The graph representation of object x, with edges r1(x, y), r2(y, w) and r3(y, z).]

Since a concept expression C is given as part of the input, we can determine which properties of object x are necessary for C to capture it. Assume C ≡ ∃r1.∃r2.⊤; then the assertion r3(y, z) is not relevant for deciding whether C(x) = ⊤ holds, and a simpler representation of x can be obtained (Figure 2).

[Figure 2: The graph representation of object x w.r.t. A \ {r3(y, z)}; only the edges r1(x, y) and r2(y, w) remain.]

This representation allows us to establish up to which point the structure of the object x is relevant for C. From the original A-Box, we know that some of the objects to which x is connected are themselves connected to other objects by relations not considered by C. In our example, object y is connected to object z through relation r3. We are interested in these objects, since they provide all the properties that we can consider to further specialize C and still capture x, and they therefore represent the relevant properties for capturing the set of positive instances. In order to provide a formal definition for these objects, let us first introduce some necessary notions.

Definition 1. Given an ontology O with signature Σ_O, two objects x, y and an A-Box A, we say that there exists a binary relation closure between x and y, denoted r↑(x, y), if x = y or if there exists a path of the form

r_1(x, z_1), r_2(z_1, z_2), ..., r_n(z_{n-1}, z_n) ⊆ A

with n ≥ 1, z_n = y, and r_j ∈ Σ_O (1 ≤ j ≤ n).
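A small Python sketch of the r↑ closure as graph reachability over the role assertions of the A-Box (our own illustration, using the toy A-Box of Figure 1):

```python
from collections import deque

def closure_reachable(abox_roles, x):
    """All objects y with r-up(x, y): x itself plus everything reachable
    through role assertions r(a, b) in the A-Box."""
    reached, frontier = {x}, deque([x])
    while frontier:
        a = frontier.popleft()
        for (_, src, dst) in abox_roles:
            if src == a and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return reached

# Toy A-Box of Figure 1: r1(x, y), r2(y, w), r3(y, z)
abox = [("r1", "x", "y"), ("r2", "y", "w"), ("r3", "y", "z")]
print(closure_reachable(abox, "x"))   # {'x', 'y', 'w', 'z'}
```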

Next, we want to establish which edges of the graph representing the object are necessary for a given concept to capture the object. This is formalized by the following definition:

Definition 2. Given an object x, a concept C, an A-Box A and a binary relation r(y, z) ∈ A, we say that r(y, z) is necessary for C to capture x iff

C(x) = ⊤ w.r.t. A, r↑(x, y), r(y, z) ∈ A, but C(x) = ⊥ w.r.t. A \ {r(y, z)}.

Likewise, we say that r(y, z) is unnecessary if C(x) = ⊤ w.r.t. A \ {r(y, z)} still holds. Additionally, we say that an object o is necessary if o = x or there exists z such that r(z, o) ∈ A and r(z, o) is necessary.

Note that, depending on the concept C and the content of the A-Box, an unnecessary binary relation might become necessary, so several possible answers may exist. For example, take the concept C ≡ ∃r.⊤ and the A-Box A = {r(x, y), r(x, z)}; then we have

C(x) = ⊤ w.r.t. A \ {r(x, y)},

so r(x, y) is not necessary, but only as long as r(x, z) ∈ A (and vice-versa). Given a definition of the properties and objects necessary for an object x to belong to a concept C, we can also obtain those that are not necessary. These (unnecessary) properties are linked to the object x but are not required by C. As such, they can be seen as candidate properties for specializing C while still capturing x. They are the properties of special objects, hereafter called the leafs of x, defined by:

Definition 3. Given an object x, a concept C and an A-Box A, the set of leafs of x w.r.t. C is

Leafs_{x,C} = {y | r(y, z) ∈ A, y is necessary for C to capture x, but r(y, z) is not necessary for C to capture x}.

Intuitively, the set Leafs_{x,C} contains those objects y on the frontier of x w.r.t. C, in the sense that no further edges of the graph representing x are considered by C when deciding whether the object x belongs to it.

Definition 4. Given an object x, a concept C and the set Leafs_{x,C} w.r.t. an A-Box A, the set of extensions Ext_{x,C} for specializing C w.r.t. x is

Ext_{x,C} = {r(y, z) ∈ A | y ∈ Leafs_{x,C} and r(y, z) is not necessary for C to capture x}.

Intuitively, Ext_{x,C} provides all the role assertions through which C can be specialized while still capturing x. The set of leafs and their properties are the answer to our problem (they provide the ways in which we can specialize C, from which we can derive the conflicting properties and objects). Before presenting an algorithm to obtain the necessary properties, let us show that the definitions alone are not sufficient. Consider now C ≡ ∃r.⊤ and A = {r(x, y), r(y, w), r(x, z)}; we have Leafs_{x,C} = {x} and Ext_{x,C} = {r(x, y), r(x, z)}. We can indeed specialize C using these properties, but nothing in this answer tells us that either r(x, y) or r(x, z) is required for C(x) = ⊤ to hold (we only know they are not necessary). To obtain this information, we take the unnecessary relations and remove them one by one from the A-Box until a minimal set of necessary role assertions is reached. Algorithm 1 computes such a minimal set, from which we can extract Leafs_{x,C} and Ext_{x,C}.

Algorithm 1: Minimal set of necessary role assertions
1: input: (x, C(x) = ⊤, A)
2: R_x = {r(y, z) | r↑(x, y), r(y, z) ∈ A}
3: CopyR = R_x
4: CandR = {r(y, z) ∈ R_x | ∄w s.t. r(z, w) ∈ R_x}
5: C_x = {D(z) ∈ A | z = x or ∃y s.t. r(y, z) ∈ R_x}
6: while CandR ≠ ∅ do
7:   for r(y, z) ∈ CandR do
8:     if C(x) = ⊤ w.r.t. (R_x \ {r(y, z)}) ∪ C_x then
9:       remove r(y, z) from R_x
10:    end if
11:    remove r(y, z) from CopyR
12:  end for
13:  CandR = {r(y, z) ∈ CopyR | ∄w s.t. r(z, w) ∈ CopyR}
14: end while
15: return: R_x

In Algorithm 1 we first compute R_x (line 2), the subset of the A-Box A containing all the role assertions on a path from x:

r(x, y) ∈ R_x since r↑(x, x) and r(x, y) ∈ A
r(y, w) ∈ R_x since r↑(x, y) and r(y, w) ∈ A
r(x, z) ∈ R_x since r↑(x, x) and r(x, z) ∈ A

We have R_x = CopyR = {r(x, y), r(y, w), r(x, z)}.

Line 3 makes a copy CopyR of R_x, from which we will sequentially remove the last elements of each path. Line 4 establishes as candidates CandR all those role assertions that have no further outgoing edges (that is, the last elements of a path):

CandR = {r(y, w), r(x, z)}

Line 5 creates the set of all concept assertions about the objects taking part in a relation in R_x; in our example it is empty. We then start the while loop, testing assertions for necessity until no more candidates are found. Line 7 takes one candidate at a time, and line 8 tests whether it is necessary. Unnecessary assertions are removed from R_x (line 9); any assertion already tested is removed from CopyR (line 11) so it is not tested twice. First we test r(y, w):

C(x) = ⊤ w.r.t. (R_x \ {r(y, w)}) ∪ C_x, so remove r(y, w); R_x = {r(x, y), r(x, z)}, CopyR = {r(x, y), r(x, z)}.

Then r(x, z) is tested:

C(x) = ⊤ w.r.t. (R_x \ {r(x, z)}) ∪ C_x, so remove r(x, z); R_x = {r(x, y)}, CopyR = {r(x, y)}.

Once all identified candidates are tested, the set CandR is re-computed (line 13), considering only the assertions remaining in CopyR:

CandR = {r(x, y)}

A second run of the while loop tests r(x, y), yielding

C(x) = ⊥ w.r.t. (R_x \ {r(x, y)}) ∪ C_x, so keep r(x, y); R_x = {r(x, y)}, CopyR = ∅.

Since there are no more candidates to test, the output of the algorithm is the modified set R_x, which is a minimal set of necessary role assertions for C(x) = ⊤ (the proof of this property remains future work):

R_x = {r(x, y)}

From this set we can easily construct Leafs_{x,C} and Ext_{x,C}, following their definitions:

Leafs_{x,C} = {x, y},  Ext_{x,C} = {r(y, w), r(x, z)}

The possible extensions to consider for specializing C are r(y, w) and r(x, z), which is the intended answer.
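To make the procedure concrete, here is a small Python sketch of Algorithm 1 for the restricted case used in this section (concepts built only from nested existential restrictions, e.g. C ≡ ∃r.⊤, so concept assertions C_x are ignored); the encoding of a concept as a list of role names is our own simplification, not the authors' implementation.

```python
def holds(concept_chain, x, roles):
    """C(x) for C = ∃r1.∃r2....⊤ encoded as the role-name chain [r1, r2, ...]."""
    if not concept_chain:
        return True
    r, rest = concept_chain[0], concept_chain[1:]
    return any(holds(rest, z, roles) for (name, a, z) in roles if name == r and a == x)

def minimal_necessary(concept_chain, x, abox_roles):
    """Algorithm 1 (sketch): keep only the role assertions needed for C(x) to hold."""
    # R_x: role assertions reachable from x (the r-up closure of Definition 1).
    reached, R_x = {x}, []
    changed = True
    while changed:
        changed = False
        for t in abox_roles:
            if t[1] in reached and t not in R_x:
                R_x.append(t); reached.add(t[2]); changed = True
    copy_r = list(R_x)
    cand = [t for t in copy_r if not any(u[1] == t[2] for u in copy_r)]
    while cand:
        for t in cand:
            if holds(concept_chain, x, [u for u in R_x if u != t]):
                R_x.remove(t)            # t is unnecessary
            copy_r.remove(t)
        cand = [t for t in copy_r if not any(u[1] == t[2] for u in copy_r)]
    return R_x

# Toy example from the text: C ≡ ∃r.⊤, A = {r(x,y), r(y,w), r(x,z)}
abox = [("r", "x", "y"), ("r", "y", "w"), ("r", "x", "z")]
print(minimal_necessary(["r"], "x", abox))   # [('r', 'x', 'y')]
```

On the toy A-Box this reproduces the hand trace above: only r(x, y) remains necessary.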

4 Use case

Since the data specific to aviation maintenance is restricted for disclosure, we provide an example based on aviation incidents. Just as in aviation maintenance, where we want to find sub-types of failures based on their features, here we want to obtain interesting sub-concepts of aviation issues. We have an ontology O_ASRS representing the reported aviation incidents from the ASRS database (Aviation Safety Report System, https://asrs.arc.nasa.gov) and a set of equipment used in aviation, from which we selected GPS. In this scenario, we are interested in which properties to take into account to obtain the "GPS related aviation issues" and the "aviation issues where the GPS presented a problem". We proceed as follows: on the ASRS website, a classification of some reports made by experts is given; one such set is "GPS related reports". This provides us with the set of positive instances Pos (the samples have been simplified for this paper). As the set of negative instances Neg, we selected reports related to other types of incidents, disjoint with Pos:

Pos = {1336, 1347, 1359},  Neg = {1361}

The A-Box A is composed of the following concepts and roles:

AviationIssue = {1336, 1347, 1359, 1361}
Aircraft = {A1, A2, A3, A4}
NavInUse = {GPS, FMS}
CompProb = {Mf, IO}
involves = {(1336, A1), (1347, A2), (1359, A3), (1361, A4)}
usesNav = {(A1, GPS), (A2, GPS), (A3, GPS), (A3, FMS)}
repProblem = {(A1, Mf), (A2, Mf), (A3, IO), (A4, Mf)}
hasNarrative = {(1336, "..GPS.."), (1347, "..GPS.."), (1359, "..GPS.."), (1361, "..GPS..")}

(where Mf = Malfunctioning, CompProb = ComponentProblem, IO = Improperly operated, repProblem = reportedProblem, NavInUse = Navigation system in use)

Assume as input we are given a concept expression of the form

C ≡ ∃involves.⊤

[Figure 3: The graph representation of object 1336, an instance of the AviationIssue concept. Concepts are shown in bold, values as nodes, and relations as edges: involves(1336, A1), repProblem(A1, Mf), usesNav(A1, GPS), hasNarrative(1336, "..GPS..").]

It is easy to see that the AviationIssue 1336 is an instance of C, but the problem is that we also find C(1361) = ⊤, so C does not classify the objects in the intended way. To establish how to improve C, we use Algorithm 1 to obtain R_x, and from this set of necessary role assertions we obtain its leafs and the possible properties for specializing it:

R_x = {involves(1336, A1)}
Leafs_{x,C} = {1336, A1}
Ext_{x,C} = {hasNarrative(1336, "...GPS..."), usesNav(A1, GPS), repProb(A1, Mf)}

How to construct a concept expression from the identified properties is out of the scope of this paper; nevertheless, we provide some examples to illustrate the approach. If we first consider hasNarrative to specialize C, we can add a restriction in the following way:

C′ ≡ ∃involves.⊤ ⊓ ∃hasNarrative."...GPS..."

The concept C′ expects that any report mentioning GPS in its narrative is indeed a "GPS related report". But even though report 1361 mentions GPS, it is not classified as a "GPS related report" by the experts (set Pos). Thus, we learn that hasNarrative is not the property that allows us to distinguish them. We proceed with the property usesNav and specialize C as

C″ ≡ ∃involves.∃usesNav.⊤

Then C″ correctly classifies Pos and Neg (since C″(1361) = ⊥). We learn that usesNav is the most relevant property for specializing C in a way that makes the desired distinction, and that the specialization should be made at the position of the leaf A1 in the graph. Finally, assume we want to obtain a more interesting concept expression, representing those "aviation issues that involve a problem with GPS devices". The sets of positive and negative samples become:

Pos = {1336, 1347},  Neg = {1359, 1361}

Considering again object 1336 and C″ as part of the input, the graph representation of its necessary properties is shown in Figure 4.

[Figure 4: The graph representation of object 1336, where only the properties necessary for C″ to capture it are shown: involves(1336, A1) and usesNav(A1, GPS).]

Where:

Leafs_{x,C″} = {1336, A1}
Ext_{x,C″} = {hasNarrative, repProb}

If we select repProb, we can construct a concept expression of the form

C‴ ≡ ∃involves.(∃usesNav.⊤ ⊓ ∃repProb.Mf)

We can see that C‴ properly distinguishes between Pos and Neg, and we learn that the most important property for making this distinction w.r.t. C″ is repProb. The most relevant objects (leafs) and their properties thus provide the key to constructing such expressions.

[Figure 5: Specialization of the concept "Aviation Issues" into sub-concepts: "A.I. mention GPS", "A.I. related to GPS" and "A.I. present GPS problems".]

5 Conclusions and further works

We consider the simplest case (without a T-Box) to be the most appropriate way to introduce our work and to show how it can be used to obtain the properties relevant for a concept to capture an object. This information can be used to guide the process of refining or generalizing a concept, since we know exactly up to which point the properties of the object are taken into account by the concept expression given as input. The approach can also be useful for optimizing concept learning techniques, provided the scenario is restricted to our specifications. A constructive way of obtaining the refined concept can be given, which depends on the DL family chosen for the ontology. Finally, we will study the method for generating rich features for action prediction in the avionics maintenance domain.

References

[Baader2003] Franz Baader. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003.

[Divroodi and Nguyen2015] Ali Rezaei Divroodi and Linh Anh Nguyen. On bisimulations for description logics. Information Sciences, 295:465–493, 2015.

[Inokuchi et al.2000] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '00, pages 13–23, London, UK, 2000. Springer-Verlag.

[Lehmann2009] Jens Lehmann. DL-Learner: learning concepts in description logics. Journal of Machine Learning Research, 10(Nov):2639–2642, 2009.

[Motoda2007] Hiroshi Motoda. Pattern Discovery from Graph-Structured Data - A Data Mining Perspective, pages 12–22. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.

[Nguyen and Szalas2013] Linh Anh Nguyen and Andrzej Szalas. Logic-based roughification. Rough Sets and Intelligent Systems - Professor Zdzislaw Pawlak in Memoriam, pages 517–543, 2013.

[Palacios et al.2016] Luis Palacios, Gaelle Lortal, Claire Laudy, Christian Sannino, Ludovic Simon, Giuseppe Fusco, Yue Ma, and Chantal Reynaud. Avionics maintenance ontology building for failure diagnosis support. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016), pages 204–209, 2016.

[Tran et al.2012] Thanh-Luong Tran, Quang-Thuy Ha, Linh Anh Nguyen, Hung Son Nguyen, Andrzej Szalas, et al. Concept learning for description logic-based information systems. In Knowledge and Systems Engineering (KSE), 2012 Fourth International Conference on, pages 65–73. IEEE, 2012.

[Yan and Han2003] Xifeng Yan and Jiawei Han. CloseGraph: mining closed frequent graph patterns. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 286–295, New York, NY, USA, 2003. ACM.

[Yoshida et al.1994] Kenichi Yoshida, Hiroshi Motoda, and Nitin Indurkhya. Graph-based induction as a unified learning framework. Applied Intelligence, 4(3):297–316, 1994.

Convolutional Neural Networks for Sentiment Classification on Business Reviews

Andreea Salinca
Faculty of Mathematics and Computer Science, University of Bucharest
Bucharest, Romania
[email protected]

Abstract

Recently, Convolutional Neural Network (CNN) models have shown remarkable results for text classification and sentiment analysis. In this paper, we present our approach to the task of classifying business reviews using word embeddings on a large-scale dataset provided by Yelp: the Yelp 2017 challenge dataset. We compare word-based CNNs using several pre-trained word embeddings and end-to-end learned vector representations for text review classification. We conduct several experiments to capture the semantic relationships between business reviews and use deep learning techniques whose results are competitive with traditional methods.

Copyright 2017 © by the paper's authors. Copying permitted for private and academic purposes.

In Proceedings of IJCAI Workshop on Semantic Machine Learning (SML 2017), 19-25 August, Melbourne, Australia

1 Introduction

In recent years, researchers have investigated the problem of automatic text categorization and sentiment classification, i.e., determining whether the overall opinion of a user review towards the subject matter is positive or negative. Sentiment classification is useful in the area of recommender systems and business intelligence applications.

The effectiveness of applying machine learning techniques to sentiment classification of product or movie reviews has been demonstrated with traditional approaches, such as representing text reviews with a bag-of-words model and classifiers such as Naive Bayes, maximum entropy classification and Support Vector Machines (SVMs) [PL+08, PLV02, MDP+11]. Convolutional Neural Networks (CNNs) have also achieved remarkable results for sentiment analysis and text classification on large-scale databases [Kim14, ZW15, JZ14].

In this article, we conduct an empirical study of word-based CNNs for sentiment classification using the Yelp 2017 challenge dataset [yel17], which comprises 4.1M user reviews about local businesses with star ratings from 1 to 5. We choose two models for comparison, both word-based CNNs with one or multiple convolution layers built on top of word vectors, using either pre-trained or end-to-end learned word representations with different embedding sizes. Previous work reports several techniques for sentiment classification of text reviews using the Yelp 2015 challenge dataset [ZZL15, TQL15, Sal15].

A series of experiments explores the effect of architecture components on model performance along with hyperparameter tuning, including the filter region size, the number of feature maps, and the regularization parameters of the proposed convolutional neural networks. We discuss the design decisions for sentiment classification on the Yelp 2017 dataset, compare the models, and report the obtained accuracy.

In our work, we aim to identify empirical hyperparameter settings and practical configurations, drawing on the research conducted by [Kim14] on a simple CNN architecture. We also take into account advice from the empirical analysis of CNN architectures and hyperparameter settings for sentence classification described by [ZW15]. We obtain an accuracy of 95.6%, via 3-fold cross validation, on the Yelp 2017 challenge dataset using a word-based CNN along with sentiment-specific word embeddings.

2 Prior Work

Kim et al. present a series of experiments using asimple one layer convolutional neural network built on

top of pre-trained word2vec models obtained from anunsupervised neural language model with little param-eter tuning for sentiment analysis and sentence classi-fication [Kim14]. Zhang et al. offer practical advice byperforming an extensive study on the effect of archi-tecture components of CNNs for sentence classificationon model performance with results that outperformbaseline methods such as SVM or logistic regression[ZW15].

In [JZ14] it is proven the benefit of word orderon topic classification and sentiment classification us-ing CNNs and bag-of-words model in the convolutionlayer.

Other approaches use character-level convolutionalnetworks rather than word-based approaches thatachieve state of art results for text classification andsentiment analysis on large-scale reviews datasets suchas Amazon and Yelp 2015 challenge dataset. For theYelp polarity dataset, by considering stars 1 and 2 neg-ative, 3 and 4 positive and dropping 5 star reviews, theauthors use 560 000 train samples, 38 000 test and 5000 epochs in training [ZZL15].

A comparison between several models using tra-ditional techniques with several feature extractors:Bag-of-words and TFIDF (term-frequency inverse-document-frequency), Bag-of-ngrams and TFIDF,Bag-of-means on word embedding (word2vec) andTFIDF and a linear classifier - multinomial logisticregression and deep learning techniques: Word-basedConvNets (Convolutional Neural Networks) (one large1024 and one small - 256 features sizes having 9 layersdeep with 6 convolutional layers and 3 fully-connectedlayers) and long-short term memory (LSTM) recur-rent neural network model is made. The testing errorsare reported on all models for Yelp sentiment analy-sis: 4.36% is obtained for n-gram traditional approach,word-based CNNs with pre-trained word2vec obtain4.60% for the large-featured architecture and 5.56%for the small-featured architecture. Also, word-basedCNNs lookup tables achieve a score of 4.89% for thelarge-featured architecture and 5.54% for the small-featured architecture. The character-level ConvNetsmodel reports an error of 5.89% for the large-featuredarchitecture and 6.53% for the small-featured architec-ture [ZZL15].

In [TQL15], a convolutional-gated recurrent neural network approach is proposed, which encodes relations between sentences and obtains 67.1% accuracy on the Yelp 2015 dataset (split into training, development and testing sets of 80/10/10). It is compared to a baseline implementation of a convolutional neural network based on Kim's work [Kim14], which reaches an accuracy of 61.5% for sentiment analysis. On the same dataset, an accuracy of 62.4% is achieved using a traditional approach with SVM and bigrams.

In prior work, the authors use traditional approaches for sentiment analysis classification on the Yelp 2015 challenge dataset (split into 80% for training and 20% for testing, with 3-fold cross validation). Linear Support Vector Classification and a Stochastic Gradient Descent classifier report an accuracy of 94.4% using unigrams and preprocessing techniques that extract a set of feature characteristics [Sal15].

3 Convolutional Neural Network Model

We model Yelp text reviews using two convolutional architecture approaches. The first model is a word-based CNN with an embedding layer, in which we tokenize text review sentences into a sentence matrix whose rows are the word vector representations of each token, similar to the approach of Kim [Kim14]. We truncate the reviews to a maximum length of 1000 words and only consider the top 100 000 most commonly occurring words in the business reviews dataset.
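
As a minimal sketch of this preprocessing step (assuming the reviews are available as a list of strings and using the Keras text utilities; all variable names and the toy reviews are illustrative):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 100_000   # keep only the 100 000 most frequent words
MAX_LEN = 1000        # truncate/pad every review to 1000 tokens

reviews = ["great food and friendly staff", "terrible service, will not return"]  # toy examples

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(reviews)
sequences = tokenizer.texts_to_sequences(reviews)   # word indices per review
x = pad_sequences(sequences, maxlen=MAX_LEN)        # sentence matrix fed to the CNN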

We use pre-trained word embeddings: GloVe [KFF15] with 100-dimensional embeddings of 400k words computed on a 2014 dump of English Wikipedia, word2vec [MCCD13] with 300-dimensional embeddings, and fastText [BGJM16] with 300-dimensional embeddings, as well as a vocabulary trained from the reviews dataset using word2vec with 100-dimensional word embeddings. Out-of-vocabulary words are randomly initialized by sampling values uniformly from (-0.25, 0.25) and optimized during training.
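
The following sketch shows how such an embedding matrix could be assembled; embeddings_index (a word-to-vector mapping, e.g. parsed from a GloVe file) is an assumed placeholder, and tokenizer is the fitted tokenizer from the previous snippet:

import numpy as np

EMBEDDING_DIM = 100
embeddings_index = {}   # word -> vector, e.g. parsed from glove.6B.100d.txt (placeholder here)
num_words = min(MAX_WORDS, len(tokenizer.word_index) + 1)

embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    if i >= num_words:
        continue
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector                                          # pre-trained vector
    else:
        embedding_matrix[i] = np.random.uniform(-0.25, 0.25, EMBEDDING_DIM)   # OOV init, tuned during training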

Next, a convolutional layer with filters of a single region size is applied; filter widths are equal to the dimension of the word vectors [ZW15]. We then apply a max-pooling operation on the feature map to compute a fixed-length feature vector, and finally a softmax classifier to predict the outputs. During training, we use the dropout regularization technique, in which network units are randomly dropped [GG16], and we minimize the categorical cross-entropy loss. We use 300 feature maps, a 1D convolution window of length 2, the rectified linear unit (ReLU) activation function, 1-max-pooling of size 2, and a dropout probability (p) of 0.2.
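
A minimal sketch of this first model under the stated hyperparameters (300 feature maps, kernel size 2, ReLU, max-pooling of size 2, dropout 0.2, softmax output, categorical cross-entropy); the exact layer stacking beyond what the text states is an assumption:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dropout, Dense
from tensorflow.keras.initializers import Constant

model = Sequential([
    Embedding(num_words, EMBEDDING_DIM, embeddings_initializer=Constant(embedding_matrix)),
    Conv1D(filters=300, kernel_size=2, activation="relu"),
    MaxPooling1D(pool_size=2),
    GlobalMaxPooling1D(),            # fixed-length feature vector
    Dropout(0.2),
    Dense(2, activation="softmax"),  # positive vs. negative review
])
model.compile(optimizer="nadam", loss="categorical_crossentropy", metrics=["accuracy"])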

The second model approach differs from the first by using multiple filters for the same region size, in order to learn complementary features from the same regions. We propose three filter region sizes, with 128 features per filter region, a 1D convolution window of length 5, a dropout probability (d) of 0.5 and 1-max-pooling of size 35. We compare two different optimizers: Nesterov Adam and RMSprop [SMDH13].
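
A sketch of this second, multi-branch model; the text does not list the three region sizes, so the values 3, 4 and 5 below are illustrative assumptions, while the 128 feature maps per region, dropout of 0.5, pooling size of 35 and the Nesterov Adam (Nadam) optimizer follow the description above:

from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Flatten, Concatenate, Dropout, Dense)
from tensorflow.keras.initializers import Constant

inputs = Input(shape=(MAX_LEN,))
embedded = Embedding(num_words, EMBEDDING_DIM,
                     embeddings_initializer=Constant(embedding_matrix))(inputs)

branches = []
for region_size in (3, 4, 5):                      # three filter region sizes (assumed values)
    conv = Conv1D(128, region_size, activation="relu")(embedded)
    pooled = MaxPooling1D(pool_size=35)(conv)
    branches.append(Flatten()(pooled))

merged = Dropout(0.5)(Concatenate()(branches))
outputs = Dense(2, activation="softmax")(merged)

model2 = Model(inputs, outputs)
model2.compile(optimizer="nadam", loss="categorical_crossentropy", metrics=["accuracy"])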

4 Results And Discussion

4.1 Yelp Challenge Dataset

The Yelp 2017 challenge dataset, introduced in the 9th round of the Yelp Challenge, comprises user reviews about local businesses in 11 cities across 4 countries, with star ratings from 1 to 5. The large-scale dataset contains 4.1M reviews and 947K tips by 1M users for 144K businesses [yel17]. The Yelp 2017 challenge dataset has been updated compared to the datasets of previous rounds, such as the Yelp 2015 and Yelp 2013 challenge datasets.

We conduct our system evaluation on the U.S. cities Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland, totalling 1 942 339 reviews. For the sentiment analysis classification task, we consider 1 and 2 star ratings as negative sentiment and 4 and 5 star ratings as positive sentiment, and we drop the 3 star reviews, as the average Yelp review is 3.7 stars.

Next, we use two subsets of the Yelp 2017 dataset to conduct our experiments, due to computational power constraints.

Our first experiments are done on a smaller subset of the Yelp dataset with 8 200 training samples, 2 000 validation samples and 900 testing samples. We will call this the Small Yelp dataset.

Further, we experiment on 82 000 training samples, 20 000 validation samples and 9 000 testing samples. We will call this the Big Yelp dataset.

In the last experiment, we split the large-scale Yelp US dataset into 80% for training and 20% for testing. We use 3-fold cross validation for evaluating different hyperparameters of the deep neural methods, and accuracy as the evaluation metric, which is a standard metric to measure overall sentiment review classification performance [MS+99].

4.2 Word Embeddings

We use several pre-trained models of word embeddings built with unsupervised learning algorithms for obtaining vector representations of words: GloVe [KFF15], and word2vec with pre-trained vectors trained on part of the Google News dataset (about 100 billion words); the word2vec model contains 300-dimensional vectors for 3 million words and phrases [MCCD13].

We also use fastText pre-trained word vectors for English, which are an extension of word2vec. These 300-dimensional vectors were trained on Wikipedia using the skip-gram model described in [BGJM16] with default parameters.

Moreover, in the embedding layer of both proposed CNNs we also use 100-dimensional word2vec embedding vectors that we trained ourselves on the text reviews in the training dataset.

4.3 Experimental Results

We conduct an empirical exploration of the proposed word-based CNN architectures for sentiment classification on Yelp business reviews.

In the training phase, we use a batch size of 500 and 3 epochs for the first model approach, and a batch size of 128 and 2 epochs for the second model approach.

We obtain the same classification accuracy of 77.88% when using 100-dimensional and 300-dimensional GloVe word embeddings with the first proposed CNN, with 300 feature maps and a convolution window of length 5, on the Small Yelp dataset.

We study the effect of the convolution filter kernel size, when using only one region size, on model accuracy, as shown in Fig. 1. We set the number of feature maps for this region size to 300, consider region sizes of 2, 3 and 5, and compute the means of 3-fold CV for each. We observe that the CNN performs better with a smaller region size, obtaining an accuracy of 79.5% with a window of 2 words, than with a larger region size (window of 5 words), which obtains 22.1%.

Figure 1: CNN accuracy for different kernel sizes when the number of feature maps is 300

The word embeddings used in the embedding layer of our CNNs successfully capture the semantic relations among entities in the unstructured text reviews. For the Big Yelp dataset, using the first CNN model approach with 300 feature maps, a region size of 2, a dropout probability of 0.2 and the Nesterov Adam optimizer, we obtain a score of 89.59% in sentiment classification.

Furthermore, we conduct our study on the second model approach of the word-based CNN, with three filter region sizes, 128 features per filter region, a 1D convolution window of length 5, a dropout probability (d) of 0.5 and 1-max-pooling of size 35, along with the Nesterov Adam optimizer.

In Table 1 we report results achieved using the second model approach along with pre-trained GloVe (100 dimensions), word2vec and fastText word embeddings, and a vocabulary trained from the reviews dataset using word2vec with 100-dimensional embeddings. For both the pre-trained word2vec and fastText embeddings we choose 300-dimensional word vectors.

We find that the choice of vector input representation has an impact on sentiment classification performance. On the Small Yelp dataset we observe a significant difference of 11.52% between the highest score, obtained with pre-trained word2vec embeddings, and the self-built dictionary trained with the word2vec model.

However, on the Big Yelp dataset we observe a difference of only 0.81% between the highest score, obtained with pre-trained fastText embeddings, and pre-trained word2vec vectors. The relative performance of the second CNN model approach is similar on the Big Yelp dataset regardless of the input embeddings (Table 1). We can observe that the scale of the dataset has an impact on overall performance in the sentiment classification task.

Table 1: Accuracy results on Yelp reviews dataset.

Dataset              Embedding model            Embed. dim.   Train acc.   Test acc.
Small Yelp reviews   Pre-trained GloVe          100           89.65%       87.36%
Small Yelp reviews   Pre-trained word2vec       300           91.25%       90.41%
Small Yelp reviews   Pre-trained fastText       300           89.90%       88.77%
Small Yelp reviews   Word2Vec self-dictionary   100           79.45%       78.89%
Big Yelp reviews     Pre-trained GloVe          100           94.46%       94.54%
Big Yelp reviews     Pre-trained word2vec       300           93.80%       93.92%
Big Yelp reviews     Pre-trained fastText       300           94.49%       94.73%
Big Yelp reviews     Word2Vec self-dictionary   100           94.45%       94.60%

Training is done through stochastic gradient descent over shuffled mini-batches with the Nesterov Adam or RMSprop update rule. Nesterov Adam obtains better results than RMSprop [SMDH13] when using the second model approach with the same number of epochs and a dropout of 0.2: the sentiment accuracy on the Big Yelp dataset using RMSprop is 0.16 percentage points lower than the 95.15% obtained using Nesterov Adam.

The CNN model in the second approach performed better in text review classification than the first approach due to the differences in the architecture and the depth of the convolutional network. The filter region size and the regularization strength also have a large effect on classifier performance: imposing a stronger regularization decreases performance, as we obtain 94.54% for a dropout of 0.5 compared to 95.15% for a dropout of 0.2. A similar remark about dropout regularization is reported in [ZW15].

Prior work offers a baseline CNN configuration implementing the architectural decisions and hyperparameters of [Kim14] on the Yelp 2015 challenge dataset for sentiment classification of text reviews [TQL15]. The authors report an accuracy of 61.5%, and propose a new method that represents documents with a convolutional recurrent neural network, which adaptively encodes the semantics of sentences and their relations and achieves 67.6%. Traditional methods, such as SVM with bigrams, report a score of 62.4%.

In [ZZL15] the authors propose character-level CNNs that achieve an accuracy of 94.11% for the large-featured architecture and 93.47% for the small-featured architecture, and compare the results to baseline word-based CNNs with pre-trained word2vec, which obtain 95.40% accuracy for the large-featured architecture and 94.44% for the small-featured one. In their experiments the authors drop 5 star reviews and use 560 000 training samples and 38 000 test samples from the Yelp 2015 challenge dataset, with 5 000 epochs in training. Traditional methods such as an n-gram linear classifier report a score of 95.64% on the subset.

Compared with traditional models such as bag-of-words, n-grams and TFIDF variants, the deep learning models, namely the word-based CNNs with the hyperparameters proposed in this paper, obtain results comparable to the baseline methods [ZZL15, TQL15, Sal15]. On the Big Yelp dataset, we report an accuracy of 94.73% using pre-trained fastText vector embeddings and a CNN with three filter region sizes and 128 feature maps.

Further, we conduct our evaluation on the complete Yelp 2017 challenge dataset. The second CNN model approach proposed in this work yields the best performance on the Yelp 2017 challenge dataset in terms of accuracy: we obtain 95.6% using 3-fold cross validation.

5 Conclusions And Future Work

In the present work, we have described a series of experiments with word-based convolutional neural networks. We introduce two neural network model approaches with different architectural sizes and several word vector representations, and conduct an empirical study of the effect of hyperparameters on overall performance in the sentiment classification task.

In the experimental results, we find that the size of the dataset has an important effect on system performance in training and evaluation: a better accuracy score is obtained using the second CNN model approach on the Big Yelp dataset than on the Small Yelp dataset. Furthermore, when evaluating the second model approach on the large-scale 2017 Yelp dataset, we achieve an accuracy of 95.6% using 3-fold cross validation.

The models proposed in this article show good ability for understanding natural language and predicting users' sentiments. Our results are comparable to, and sometimes exceed, those in the literature for the task of classifying business reviews using the Yelp 2017 challenge dataset [ZZL15, TQL15, Sal15].

In future work, we can explore Bayesian optimization frameworks for hyperparameter ranges rather than a grid search approach. We can also conduct further experiments using recurrent neural networks (RNN) with the Long Short-Term Memory (LSTM) architecture [Gra12] for sentiment categorization of Yelp user text reviews.

References

[BGJM16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

[GG16] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019-1027, 2016.

[Gra12] Alex Graves. Supervised sequence labelling. In Supervised Sequence Labelling with Recurrent Neural Networks, pages 5-13. Springer, 2012.

[JZ14] Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058, 2014.

[KFF15] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.

[Kim14] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[MDP+11] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142-150. Association for Computational Linguistics, 2011.

[MS+99] Christopher D Manning, Hinrich Schutze, et al. Foundations of statistical natural language processing, volume 999. MIT Press, 1999.

[PL+08] Bo Pang, Lillian Lee, et al. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.

[PLV02] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 79-86. Association for Computational Linguistics, 2002.

[Sal15] Andreea Salinca. Business reviews classification using sentiment analysis. In Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2015 17th International Symposium on, pages 247-250. IEEE, 2015.

[SMDH13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139-1147, 2013.

[TQL15] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pages 1422-1432, 2015.

[yel17] Yelp Challenge Dataset, 2017.

[ZW15] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.

[ZZL15] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649-657, 2015.

Semantic extraction of Named Entities from Bank Wire text

Ritesh Ratti, Pitney Bowes Software, Noida ([email protected])
Himanshu Kapoor, Pitney Bowes Software, Noida ([email protected])
Shikhar Sharma, Pitney Bowes Software, Noida ([email protected])
Anshul Solanki, Pitney Bowes Software, Noida ([email protected])
Pankaj Sachdeva, Pitney Bowes Software, Noida ([email protected])

Abstract

Online transactions have increased dramatically over the years due to rapid growth in digital innovation. These transactions are anonymous, so users provide some details for identification. These comments contain information about the entities involved and transfer details, which are later used for log analysis. Log analysis can be used for fraud analytics and to detect money laundering activities. In this paper, we discuss the challenges of entity extraction from such data. We briefly explain what wire text is, what the challenges are and why semantic information is required for entity extraction. We explore why traditional IE approaches are insufficient to solve the problem. We tested the approach against available open source tools for entity extraction and describe how our approach is able to solve the problem of entity identification.

1 Introduction

Named Entity Extraction is the process of extracting entities like Person, Location, Address, Organization etc. from natural language text. However, named entities might also exist in non-natural text such as log data, bank transfer content, transactional data etc.


Hence we require a system that is robust enough to deal with degraded and unstructured text, rather than natural language text with correct spelling, punctuation and grammar. Existing information extraction methods are not able to meet these requirements, as most information extraction tasks work over natural language text. Since the context of language is missing in unstructured text, it is difficult to extract entities from it; the usual features are based on natural language, so semantic processing capabilities are required to understand the hidden meaning of the content using dictionaries, ontologies etc.

Wire text is an example of such text: it is unformatted and non-grammatical in nature. It can contain some letters in upper case and some in lower case, and people generally write the comments in short form using multiple abbreviations. Bank wire text can have the following format:

EVERITT 620122T NAT ABC INDIA LTD
REF ROBERT REASON SHOP RENTAL
REF 112233999 - REASON SPEEDING FINE
GEM SS HEUTIGEM SCHIENDLER
PENSION CH1234 CAB28

There are two major challenges in creating the machine learning model for wire text:

• Non-availability of data set due to confidentiality

• Non-contextual representation of text

To identify the entities in such text, special pre-processing using semantic information about the content is therefore required. In this paper, we discuss a solution to extract entities from such text. We evaluate our approach on bank wire transfer text and make use of the WordNet taxonomy to identify the semantics of each keyword. This paper is arranged as follows. In Section 2 we discuss available methods of entity extraction. In Section 3 we describe the algorithm and the components involved in detail. In Section 4 we show the experimental results and a comparison with open source utilities. Section 5 presents conclusions and future work.

2 Background

Supervised machine learning techniques are the primary solutions to the named entity recognition problem, and they require annotated data. Supervised methods either learn disambiguation rules based on discriminative features or try to learn the parameters of an assumed distribution that maximizes the likelihood of the training data. Conditional Random Fields [SM12] is a discriminative approach to this problem which uses sequence tagging. Other supervised learning models, such as Hidden Markov Models (HMM) [RJ86], Decision Trees, Maximum Entropy Models (ME) and Support Vector Machines (SVM), are also used to solve the classification problem. HMM is the earliest model applied to the NER problem, by Bikel [BSW99] for English; Bikel introduced a system, IdentiFinder, that detects named entities using HMM as a generative model. Curran and Clark [CC03] applied the maximum entropy model to named entity recognition, using a softmax formulation. McNamee and Mayfield [MMP03] treat the problem as a binary decision problem, i.e. whether the word belongs to one of 8 classes (B- beginning and I- inside tags for person, organization, location and misc); thus 8 classifiers are trained for this purpose. Because of the unavailability of wire text, it is difficult to create tagged content, hence supervised approaches are not able to solve the problem.

Various unsupervised schemes have also been proposed to solve the entity recognition problem. Gazetteer-based approaches help in identifying keywords from a list. KNOWITALL, a domain-independent system proposed by Etzioni [ECD+05], extracts information from the web in an unsupervised, open-ended manner; it uses 8 domain-independent extraction patterns to generate candidate facts. Gupta and Manning [GM14] proposed a system that generates seed candidates through local, cross-language edit likelihood and then bootstraps to make broad predictions across two languages, optimizing combined contextual, word-shape and alignment models.

Figure 1: Component Diagram

Semantic approaches also exist for named entity extraction. [MNPT02] used the WordNet specification to identify WordClass and WordInstances lists for each word, based on predefined rules, but that list is limited. [Sie15] uses the word2vec representation of words to define the semantics between words, which enhances classification accuracy; it uses a continuous skip-gram model that requires heavy computation for learning word vectors. [ECD+05] specify gazetteer-based features as external knowledge for good performance. Given these findings, several approaches have been proposed to automatically extract comprehensive gazetteers from the web and from large collections of unlabeled text [ECD+04], with limited impact on NER. Kazama and Torisawa [KT07] successfully constructed high quality and high coverage gazetteers from Wikipedia.

In this paper, we propose the semantic disambiguation of named entities using WordNet and a gazetteer. Our approach is based on pre-processing the text before passing it to a named entity recognizer.

3 Algorithm

3.1 Method

Named Entity Recognition involves multiple features related to the structural representation of entities, hence proper case information plays a valuable role in defining the entity type. For example, person names are generally written in Camel Case in English, while organizations are in capitalized format. Our approach is based on these orthographic properties of entities: it converts the input data using WordNet, after looking up the semantics of each word, and provides the converted output to an existing NER, which makes the named entities more likely to be extracted. We propose an intermediate layer, called the Pre-Processor, as shown in Figure 1. The Pre-Processor contains three major components, WordnetMatcher, GazetteerMatcher and CaseConverter, whose purpose is to match the text efficiently against the given content lists and convert the text to the required case. LowerCaseConverter, CamelCaseConverter and UpperCaseConverter are instances of CaseConverter. The Tokenizer's main job is to convert the sentence into tokens, and the Named Entity Recognizer is used to extract the named entities.

We used WordNet [Mil95], which provides information about synsets. The English version contains 129 505 words organized into 99 642 synsets. In WordNet two kinds of relations are distinguished: semantic relations (IS-A, part-of etc.), which hold among synsets, and lexical relations (synonymy, antonymy), which hold among words. Our gazetteer contains dictionaries of person names, organization names, locations etc. Our approach works according to the following algorithm.

3.2 Approach

Algorithm 1: Semantic NER

Input : Sentence S as a collection of words W, and gazetteers ListNames, ListOrganization, ListLocation, ListIgnore
Output: Set of entities ei ∈ E

for each wi ∈ S do
    wi ← LowerCaseConverter(wi)
    if wi ∉ ListIgnore then
        synsets[] ← WordNetMatcher(wi)
        if synsets[] ≠ Empty then
            if wi ∈ ListNames then
                wi ← CamelCaseConverter(wi)
            else if wi ∈ ListOrganization or wi ∈ ListLocation then
                wi ← UpperCaseConverter(wi)
            else
                wi ← CamelCaseConverter(wi)
            end if
        end if
    end if
end for
(ei) ← NamedEntityRecognizer(S)

Our algorithm works by looking up pre-defined lists in multiple steps. For each word in the input, it first converts the word to lower case, then checks the word against the ignore list containing pronouns, prepositions, conjunctions and determiners; if it is there, the keyword is ignored. Otherwise, the lower-case word is passed to the WordNet API to get its list of synsets. If the synsets are non-empty, the word is likely to have some meaning, so it is checked against the names list first; if found, it is converted to Camel Case (e.g. John Miller, Robert Brown). If not found in the names list, it is checked against the organization and location lists; if a match is found it is converted to upper case, otherwise to Camel Case. The pre-processed text now has a meaningful representation of entities and is passed to the Named Entity Recognizer to extract the entities from the converted text.
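
A minimal sketch of this pre-processing step, using NLTK's WordNet interface; the gazetteer lists below are illustrative placeholders, not the actual dictionaries used by the system:

from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

ignore_list = {"the", "a", "an", "and", "or", "of", "to", "in"}   # pronouns, prepositions, ...
names_list = {"john", "robert", "miller", "brown"}                # person-name gazetteer (example)
organization_list = {"abc", "nat", "ltd"}                         # organization gazetteer (example)
location_list = {"india", "noida"}                                # location gazetteer (example)

def preprocess(sentence):
    converted = []
    for token in sentence.split():
        word = token.lower()
        if word in ignore_list:
            converted.append(word)                     # ignored keywords stay as-is
        elif wordnet.synsets(word):                    # the word carries some meaning
            if word in names_list:
                converted.append(word.capitalize())    # Camel Case, e.g. John
            elif word in organization_list or word in location_list:
                converted.append(word.upper())         # Upper Case, e.g. INDIA
            else:
                converted.append(word.capitalize())
        else:
            converted.append(word)
    return " ".join(converted)

# The converted sentence is then handed to the downstream NER, e.g.:
# entities = named_entity_recognizer(preprocess("ref robert reason shop rental nat abc india ltd"))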

3.3 Model Description

Our Named Entity Recognizer is based on Conditional Random Fields [SM12], a discriminative model. We used the ClearTK library [BOB14] for model generation, which internally uses MALLET for the implementation. Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting sequential data, based on the conditional approach.

Lafferty et al. [LMP+01] define the probability of a particular label sequence y, given an observation sequence x, as a normalized product of potential functions, each of the form

\exp\Big( \sum_j \lambda_j \, t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k \, s_k(y_i, x, i) \Big)

where $t_j(y_{i-1}, y_i, x, i)$ is a transition feature function of the entire observation sequence and the labels at positions $i$ and $i-1$ in the label sequence; $s_k(y_i, x, i)$ is a state feature function of the label at position $i$ and the observation sequence; and $\lambda_j$ and $\mu_k$ are parameters to be estimated from training data.

When defining feature functions, we construct a set of real-valued features $b(x, i)$ of the observation that express some characteristic of the empirical distribution of the training data which should also hold in the model distribution. An example of such a feature: $b(x, i)$ is 1 if the observation at position $i$ is "Person", and 0 otherwise.

Each feature function takes on the value of one of these real-valued observation features $b(x, i)$ if the current state (in the case of a state function) or the previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. For example, consider the following transition function:

t_j(y_{i-1}, y_i, x, i) = b(x, i)

and

F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i)

Table 1: Features used for NER

Entity Type     Features
Person          preceding = 1, succeeding = 2, posTag, characterPattern, middleNamesList
Location        preceding = 3, succeeding = 3, characterPattern, isCapital
Organization    preceding = 3, succeeding = 3, posTag, characterPattern, orgSuffixList

where each $f_j(y_{i-1}, y_i, x, i)$ is either a state function $s_k(y_i, x, i)$ or a transition function $t_j(y_{i-1}, y_i, x, i)$. This allows the probability of a label sequence $y$ given an observation sequence $x$ to be written as

p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_j \lambda_j F_j(y, x) \Big)

where $Z(x)$ is a normalization factor.

3.4 Feature Extraction

We used multiple syntactic and linguistic features specific to each entity type. We also used pre-defined list matches as features for a couple of entities, which improves the accuracy of our model. Our feature selection is summarised in Table 1; the features are explained below, and an illustrative sketch of such a feature extractor follows the list.

• Preceding: number of words before the current word to be considered for feature generation.

• Succeeding: number of words after the current word to be considered for feature generation.

• posTag: part-of-speech tag as a linguistic feature.

• characterPattern: character pattern of the token, such as Camel Case, Numeric, AlphaNumeric etc.

• isCapital: true if all the letters are capitalized.

• xxxList: specific keyword list matched against the current word; true if the word matches. For example, orgSuffixList contains suffixes used in organization names and middleNamesList contains keywords used in middle names.
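
The sketch below shows what such per-token feature extraction could look like; the helper lists, default window sizes and feature names are illustrative assumptions rather than the authors' implementation:

import re

org_suffix_list = {"ltd", "inc", "corp", "llc"}       # example organization suffixes
middle_names_list = {"kumar", "van", "de"}            # example middle-name keywords

def character_pattern(token):
    if token.isupper():
        return "UPPER"
    if token.istitle():
        return "CAMEL"
    if token.isdigit():
        return "NUMERIC"
    if re.search(r"\d", token):
        return "ALPHANUMERIC"
    return "LOWER"

def token_features(tokens, pos_tags, i, preceding=1, succeeding=2):
    feats = {
        "word": tokens[i].lower(),
        "posTag": pos_tags[i],
        "characterPattern": character_pattern(tokens[i]),
        "isCapital": tokens[i].isupper(),
        "inOrgSuffixList": tokens[i].lower() in org_suffix_list,
        "inMiddleNamesList": tokens[i].lower() in middle_names_list,
    }
    for k in range(1, preceding + 1):      # context words before the current token
        if i - k >= 0:
            feats[f"prev-{k}"] = tokens[i - k].lower()
    for k in range(1, succeeding + 1):     # context words after the current token
        if i + k < len(tokens):
            feats[f"next-{k}"] = tokens[i + k].lower()
    return feats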

4 Experimentation Results

4.1 Dataset

We trained our NER model over the MASC (Manually Annotated Sub-Corpus) dataset [PBFI12], which contains 93232 documents with 3232 different entities. We used bank wire transfer text to verify the approach. Due to the non-availability of bank wire text for security reasons, we had to generate a test set based on our client experience and an understanding of multiple user scenarios. We implemented the approach in our product [Pit], which is used by our clients.

Table 2: Comparison Results

Entity Type     Approach        Precision   Recall   Acc.
Person          Our Approach    0.65        0.306    0.27
                Stanford-NER    0.23        0.175    0.12
Location        Our Approach    0.88        0.57     0.53
                Stanford-NER    0.71        0.58     0.51
Organization    Our Approach    0.18        0.32     0.28
                Stanford-NER    0.03        0.018    0.012

4.2 Comparison

Our test dataset contains different types of comments that are non-natural in nature. We compare our approach with existing open source solutions, OpenNLP [Apa14] and Stanford NER [MSB+14], and show that our approach works better due to the semantic conversion of the text. We observed that OpenNLP is not able to detect many entities, while Stanford NER is able to detect some of them. Table 2 gives precision, recall and accuracy for the Person, Location and Organization entities.

5 Conclusion & Future Work

We have proposed an approach for the semantic conversion of bank wire text and the extraction of entities from the converted text. Currently, we have tested our approach for person, organization and location, but it is easily extensible to other entities such as address, contact number, email information etc. The approach uses semantic information from WordNet for preprocessing, which can further be used to extract entities from similar types of data such as weblogs, DB logs, transaction logs etc.

References

[Apa14] Apache Software Foundation. openNLP Natural Language Processing Library, 2014. http://opennlp.apache.org/.

[BOB14] Steven Bethard, Philip Ogren, and Lee Becker. ClearTK 2.0: Design patterns for machine learning in UIMA. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3289-3293, Reykjavik, Iceland, 5 2014. European Language Resources Association (ELRA). (Acceptance rate 61%).

[BSW99] Daniel M Bikel, Richard Schwartz, and Ralph M Weischedel. An algorithm that learns what's in a name. Machine Learning, 34(1-3):211-231, 1999.

[CC03] James R. Curran and Stephen Clark. Language independent NER using a maximum entropy tagger. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 164-167, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[ECD+04] Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S Weld, and Alexander Yates. Methods for domain-independent information extraction from the web: An experimental comparison. In AAAI, pages 391-398, 2004.

[ECD+05] Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91-134, 2005.

[GM14] Sonal Gupta and Christopher D Manning. Improved pattern learning for bootstrapped entity extraction. In CoNLL, pages 98-108, 2014.

[KT07] Junichi Kazama and Kentaro Torisawa. Exploiting Wikipedia as external knowledge for named entity recognition. 2007.

[LMP+01] John Lafferty, Andrew McCallum, Fernando Pereira, et al. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML, volume 1, pages 282-289, 2001.

[Mil95] George A. Miller. WordNet: A lexical database for English. Commun. ACM, 38(11):39-41, November 1995.

[MMP03] James Mayfield, Paul McNamee, and Christine Piatko. Named entity recognition using hundreds of thousands of features. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 184-187, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[MNPT02] Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. A WordNet-based approach to named entities recognition. In Proceedings of the 2002 Workshop on Building and Using Semantic Networks - Volume 11, pages 1-7. Association for Computational Linguistics, 2002.

[MSB+14] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55-60, 2014.

[PBFI12] Rebecca J Passonneau, Collin Baker, Christiane Fellbaum, and Nancy Ide. The MASC word sense sentence corpus. In Proceedings of LREC, 2012.

[Pit] Pitney Bowes Software CIM Suite. http://www.pitneybowes.com/us/customer-information-management.html.

[RJ86] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(2):4-16, Jan 1986.

[Sie15] Scharolta Katharina Siencnik. Adapting word2vec to named entity recognition. In Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania, number 109, pages 239-243. Linkoping University Electronic Press, 2015.

[SM12] Charles Sutton and Andrew McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(1):267-373, 2012.

Enhancing Topical Word Semantic for Relevance Feature Selection

Abdullah Semran Alharbi 1,2 ([email protected])
Yuefeng Li 1 ([email protected])
Yue Xu 1 ([email protected])

1 School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
2 Department of Computer Science, Umm Al-Qura University, Makkah, Saudi Arabia

Abstract

Unsupervised topic models, such as Latent Dirichlet Allocation (LDA), are widely used as automated feature engineering tools for textual data. They model word semantics based on latent topics, on the basis that semantically related words occur in similar documents. However, the word weights assigned by these topic models do not represent the semantic meaning of the words with respect to user information needs. In this paper, we present an innovative and effective extended random sets (ERS) model to enhance the semantics of topical words. The proposed model is used as a word weighting scheme for relevance feature selection (FS). It weights words accurately based on their appearance in the LDA latent topics and in the relevant documents. The experimental results, based on 50 collections of the standard RCV1 dataset and TREC topics for information filtering, show that the proposed model significantly outperforms eight state-of-the-art baseline models on five standard performance measures.


1 Introduction

LDA [BNJ03] is currently the most common probabilistic topic model compared to similar models, such as probabilistic Latent Semantic Analysis (pLSA) [Hof01], with a wide range of applications [Ble12]. LDA statistically discovers hidden topics from documents as features to be used for different tasks in information retrieval (IR) [WC06, WMW07], information filtering (IF) [GXL15] and many other text mining and machine learning applications. LDA represents documents by a set of topics, and each topic is a set of semantically related terms1. Thus, it is capable of clustering related words in a document collection, which can reduce the impact of common problems like polysemy, synonymy and information overload [AZ12].

The core and critical part of any text FS method is the weighting function. It assigns a numerical value (usually a real number) to each feature, which specifies how informative the feature is for the user's information needs [ALA13]. In the context of probabilistic topic modelling in general, and LDA specifically, a term weight is calculated locally at the document level based on two components: the term's local document-topic distributions and the global term-topic assignment. Therefore, in a set of similar documents, a specific term might receive a different weight in each single document even though this term is semantically identical across all these documents. Such an approach does not accurately reflect the semantic meaning and usefulness of this term for the entire user's information needs. It badly influences the performance of LDA for FS, as it is uncertain and difficult to know which weight is more representative and should be assigned to the intended term. Would it be the average weight? The highest? The lowest? The aggregated? Several experiments in various studies confirm that the local-global weighting approach of LDA is ineffective for relevant FS [GXL15].

1 In this paper, terms, words, keywords or unigrams are used interchangeably.

Given a document set that describes user information needs, global statistics, such as document frequency (df), reveal the discriminatory power of terms [LTSL09]. However, in IR, selecting terms based on global weighting schemes did not show better retrieval performance [MO10], because global statistics cannot describe the local importance of terms [MC13]. From the LDA perspective, it is challenging and still unclear how to use LDA's local-global term weighting function in a global context, due to the complex relationships between terms and the many entities that represent the entire collection. A term, for example, might appear in multiple documents and LDA topics, and each topic may also cover many documents or paragraphs that contain the same term. Therefore, the hard question this research tries to answer is: how can the local topic weight (at the document level) be generalised and combined with global topical statistics, such as the term frequency in both topics and relevant documents, to obtain a more discriminative and semantically representative global term weighting scheme?

The aim of this research is to develop an effective topic-based FS model for relevance discovery. The model uses a hierarchical framework based on ERS theory to assign a more representative weight to terms based on their appearance in LDA topics and in all relevant documents. Two major contributions are made in this paper to the fields of text FS and IF: (a) a new theoretical model based on multiple ERS [Mol06] to represent and interpret the complex relationships between long documents, their paragraphs, LDA topics and all terms in the collection, where a function describes each relationship; (b) a new and effective term weighting formula that assigns a more discriminatively accurate weight to topical terms, representing their relevance to the user information needs. The formula generalises LDA's local topic weight to a global one using the proposed ERS theory and then combines it with the frequency ratio of words in both documents and topics, answering the question asked above. To test the effectiveness of our model, we conducted extensive experiments on the RCV1 dataset and the assessors' relevance judgements of the TREC filtering track. The results show that our model significantly outperforms all baseline FS models used for IF, regardless of the type of text features they use (terms, phrases, patterns, topics or different combinations of them).

2 Related Works

In the literature, there is a significant amount of work that extends and improves LDA to suit different needs, including text FS [ZPH08, TG09]. However, our model is intended for IF and, to the best of our knowledge, it is the first attempt to extend random sets [Mol06] to functionally describe and interpret complex relationships involving topical terms and other entities in a document collection, in order to enhance the semantics of topical words for relevance FS. Relevance is a fundamental concept in both IR and IF. IR is mainly concerned with a document's relevance to a query on a specific subject, whereas IF addresses the document's relevance to user information needs [LAZ10]. In relevance discovery, FS is a method that selects a subset of features relevant to the user's needs, removing those that are irrelevant, redundant or noisy. Existing methods adopt different types of text features such as terms [LTSL09], phrases (n-grams) [ALA13], patterns (a pattern is a set of associated terms) [LAA+15], topics [DDF+90, Hof01, BNJ03] or a combination of them for better performance [WMW07, LAZ10, GXL15].

The most efficient FS methods for relevance are the ones developed around a weighting function, which is the core and critical part of the selection algorithm [LAA+15]. Using the LDA word weighting function for relevance is still limited and does not show encouraging results [GXL15]; the same holds for similar topic-based models such as pLSA [Hof01]. For better performance, Gao et al. (2015) [GXL15] integrate pattern mining techniques into topic models to discover discriminative features. Such work is expensive, susceptible to the feature-loss problem, and may also be affected by the uncertainty of the probabilistic topic model. ERS is proven to be effective in describing complex relations between different entities and interpreting them as a function (a weighting function) [Li03]. Thus, ERS-based models can be used to weight closed sequential patterns more accurately and facilitate the discovery of specific ones, as shown in [ALX14]. However, selecting the most useful patterns is challenging due to the large number of patterns generated from relevant documents using various minimum supports (min_sup), and may also lead to feature loss.

3 Background Overview

For a given corpus C, the relevant long-document set D ⊆ C represents the user's information needs, which might cover multiple subjects. The proposed model uses D for training, where each document dx ∈ D has a set of paragraphs PS and each paragraph has a set of terms T. Θ is the set of all paragraphs in D and PS ⊆ Θ. The set of terms Ω is the set of all unique words in D.

3.1 Latent Dirichlet Allocation

The proposed model uses LDA to reduce the dimensionality of D to a set of manageable topics Z, where V is the number of topics. LDA assumes that each document has multiple latent topics [GXL15], and defines each topic $z_j \in Z$ as a multinomial probability distribution over all words in $\Omega$, $p(w_i|z_j)$, in which $w_i \in \Omega$ and $1 \leq j \leq V$, such that $\sum_{i=1}^{|\Omega|} p(w_i|z_j) = 1$. LDA also represents a document d as a probabilistic mixture of topics, $p(z_j|d)$. As a result, and based on the number of latent topics, the probability (local weight) of word $w_i$ in document d can be calculated as

p(w_i|d) = \sum_{j=1}^{V} p(w_i|z_j) \times p(z_j|d).

Finally, all hidden variables, $p(w_i|z_j)$ and $p(z_j|d)$, are statistically estimated by the Gibbs sampling algorithm [SG07].
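
A small sketch of this local weight computation, assuming phi holds the topic-word distributions p(w|z_j) (one row per topic) and theta_d the document's topic mixture p(z_j|d), e.g. as estimated by Gibbs sampling; the sizes below are arbitrary example values:

import numpy as np

V, vocab_size = 10, 5000
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(vocab_size), size=V)   # each row sums to 1: p(w|z_j)
theta_d = rng.dirichlet(np.ones(V))                # p(z_j|d) for one document d

p_w_given_d = theta_d @ phi                        # p(w_i|d) = sum_j p(w_i|z_j) * p(z_j|d)
assert np.isclose(p_w_given_d.sum(), 1.0)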

3.2 Random Set

A random set is a random object whose values are subsets taken from some space [Mol06]. It works as an effective measure of uncertainty in imprecise data for decision analysis [Ngu08]. For example, let Z and $\Omega$ be finite sets that represent topics and words respectively. Let $\Gamma$ be a set-valued mapping from Z (the evidence space) onto $\Omega$, written $\Gamma: Z \to 2^{\Omega}$, and let P be a probability function defined on Z; the pair $(P, \Gamma)$ is then called a random set [KSH12]. $\Gamma$ can be extended to $\xi: Z \to 2^{\Omega \times [0,1]}$ (also called an extended set-valued mapping), which satisfies $\sum_{(w,p) \in \xi(z)} p = 1$ for each $z \in Z$. Let P be a probability function on Z such that $\sum_{z \in Z} P(z) = 1$. We call $(\xi, P)$ an extended random set.

4 The Proposed Model

The proposed model (Figure 1) deals with the local weight problem of terms assigned by the LDA probability function (described in Section 3.1) by exploring all possible relationships between the different entities that influence the weighting process. The target entities in our model are documents, paragraphs, topics, and terms. The possible relationships between these entities are complex (a set of one-to-many relationships). For example, a document can have many paragraphs and terms; a paragraph can have multiple topics; a topic can have many terms. Inversely, a topic can cover many paragraphs, and a term can appear in many documents and topics.

In this model, we propose three ERSs to describe these complex relationships, where each ERS can be interpreted as a function by which we can determine the importance of the main entity in the relationship. The proposed ERS theory is then used to develop a new weighting scheme that accurately weights topical words by generalising the topic's local weight and combining it with the frequency ratio of words in both documents and topics.

Figure 1: Our proposed model

4.1 Extended Random Sets

Let us assume we have a set of topics $Z = \{z_1, z_2, z_3, \ldots, z_V\}$ in $\Theta$, and let $D = \{d_1, d_2, d_3, \ldots, d_N\}$ be a set of N relevant long documents. Each document $d_x$ consists of M paragraphs, $d_x = \{p_1, p_2, p_3, \ldots, p_M\}$. A paragraph $p_y$ consists of a set of L words, for example $p_y = \{w_1, w_2, w_3, \ldots, w_L\}$. A word w is a keyword or unigram, and the function words(p) returns the set of words appearing in paragraph p. A topic z can be defined as a probability distribution over the set of words $\Omega$, where $words(p) \subseteq \Omega$ for every paragraph $p \in \Theta$.

For each $z_i \in Z$, let $f_i(w, z_i)$ be a frequency function on $\Omega$ such that $\Gamma(z_i) = \{w \mid w \in \Omega, f_i(w, z_i) \geq 0\}$, with the inverse mapping $\Gamma^{-1}: \Omega \to 2^{Z}$; $\Gamma^{-1}(w) = \{z \in Z \mid w \in \Gamma(z)\}$. Also, for each $d_j \in D$, let $f_j(w, d_j)$ be a frequency function on $\Omega$ such that $\Gamma(d_j) = \{w \mid w \in \Omega, f_j(w, d_j) > 0\}$, with the inverse mapping $\Gamma^{-1}: \Omega \to 2^{D}$; $\Gamma^{-1}(w) = \{d \in D \mid w \in \Gamma(d)\}$. These extended set-valued mappings can decide a weighting function on $\Omega$, $sr: \Omega \to [0, +\infty)$, such that

sr(w) = \sum_{d_j \in \Gamma^{-1}(w)} \Big[ \frac{1}{f_j(w, d_j)} \cdot \Big( \sum_{z_i \in \Gamma^{-1}(w)} P_z(z_i) \times f_i(w, z_i) \Big) \Big]    (1)

where sr(w) is the combined weight of topical word w at the collection level.

The extended random set $\Gamma_1$ is proposed to describe the relationships between paragraphs and topics using the conditional probability function $P_{xy}(z \mid d_x p_y)$, as $\Gamma_1: \Theta \to 2^{Z \times [0,1]}$; $\Gamma_1(d_x p_y) = \{(z_1, P_{xy}(z_1 \mid d_x p_y)), \ldots\}$.

Similarly, $\Gamma_2$ is proposed to describe the relationship between topics and terms using the frequency function $f_i(w, z_i)$, as $\Gamma_2: Z \to 2^{\Omega \times [0,+\infty)}$; $\Gamma_2(z_i) = \{(w_1, P_i(w_1 \mid z_i)), \ldots\}$.

Lastly, $\Gamma_3$ is proposed to describe the relationship between documents and terms using the frequency function $f_j(w, d_j)$, as $\Gamma_3: D \to 2^{\Omega \times [0,+\infty)}$; $\Gamma_3(d_j) = \{(w_1, f_j(w_1, d_j)), \ldots\}$.

Based on the inverse mappings described above, we have $\Gamma_1^{-1}$, $\Gamma_2^{-1}$ and $\Gamma_3^{-1}$. $\Gamma_1^{-1}$ describes the inverse relationship between topics and paragraphs, using the probability function $P_z(z_i)$, such that $\Gamma_1^{-1}(z) = \{d_x p_y \mid z \in \Gamma_1(d_x p_y)\}$. $\Gamma_2^{-1}$, on the other hand, describes the inverse relationship between terms and topics, using the $f_i(w, z_i)$ function, such that $\Gamma_2^{-1}(w) = \{z \mid w \in \Gamma_2(z)\}$. $\Gamma_3^{-1}$ describes the inverse relationship between terms and documents, using the $f_j(w, d_j)$ function, such that $\Gamma_3^{-1}(w) = \{d \mid w \in \Gamma_3(d)\}$.

4.2 Generalised Topic Weight

To estimate the generalised topic weight in D, we need to calculate the probability $P_z(z_i)$ of each topic in each paragraph of document d, and similarly for all documents in D, based on $\Gamma_1^{-1}$, in which we assume $P_{\Theta}(d_x p_y) = \frac{1}{N}$, where N is the total number of paragraphs:

P_z(z_i) = \sum_{d_x p_y \in \Gamma_1^{-1}(z_i)} \big( P_{\Theta}(d_x p_y) \times P_{xy}(z_i \mid d_x p_y) \big) = \frac{1}{N} \sum_{d_x p_y \in \Gamma_1^{-1}(z_i)} P_{xy}(z_i \mid d_x p_y)    (2)

where $P_{xy}(z_i \mid d_x p_y)$ is estimated by LDA, $d_x p_y$ refers to paragraph y in document x, and $\Gamma_1^{-1}$ is the mapping function defined previously.
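
As a small sketch of Equation (2), assuming paragraph_topic is an (N x V) array whose row for paragraph d_x p_y holds P_xy(z_i | d_x p_y), e.g. obtained by running LDA over all paragraphs of the relevant documents (the sizes are example values):

import numpy as np

N, V = 200, 10                                    # N paragraphs, V topics (example sizes)
rng = np.random.default_rng(1)
paragraph_topic = rng.dirichlet(np.ones(V), size=N)

P_z = paragraph_topic.sum(axis=0) / N             # P_z(z_i) = (1/N) * sum over paragraphs of P_xy(z_i | d_x p_y)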

4.3 Topical Word Weighting Scheme

To calculate the topical word weight at the collection level, we simply substitute $P_z(z_i)$ in Equation (1) by its value from Equation (2), which gives

sr(w) = \frac{1}{N} \sum_{d_j \in \Gamma_3^{-1}(w)} \Big[ \frac{1}{f_j(w, d_j)} \times \Big( \sum_{z_i \in \Gamma_2^{-1}(w)} \Big( f_i(w, z_i) \times \Big( \sum_{d_x p_y \in \Gamma_1^{-1}(z_i)} P_{xy}(z_i \mid d_x p_y) \Big) \Big) \Big) \Big]    (3)
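
A toy sketch of Equations (1)-(3), assuming the term frequencies per document (f_dw) and per topic (f_zw) and the generalised topic weights P_z have already been computed; the values below are purely illustrative:

f_dw = {"d1": {"service": 3, "food": 1}, "d2": {"service": 2}}   # f_j(w, d_j)
f_zw = {"z1": {"service": 5, "food": 2}, "z2": {"service": 1}}   # f_i(w, z_i)
P_z = {"z1": 0.6, "z2": 0.4}                                     # generalised topic weights

def sr(word):
    docs = [d for d, freqs in f_dw.items() if freqs.get(word, 0) > 0]      # Gamma_3 inverse of word
    topics = [z for z, freqs in f_zw.items() if freqs.get(word, 0) > 0]    # Gamma_2 inverse of word
    topic_part = sum(P_z[z] * f_zw[z][word] for z in topics)
    return sum((1.0 / f_dw[d][word]) * topic_part for d in docs)

print(sr("service"))   # combined collection-level weight of the word "service"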

5 Evaluation

To verify the proposed model, we designed two hypotheses. First, our ERS model can effectively generalise the topic's local weight estimated from all document paragraphs; the generalisation leads to a more accurate term weighting scheme, especially when combined with the term frequency ratio in both documents and topics. Second, our model is, overall, more effective in selecting relevant features than most state-of-the-art term-based, pattern-based, topic-based or mixed-feature FS models. To support these two hypotheses, we conducted experiments and evaluated their performance.

5.1 Dataset

The first 50 collections of the standard Reuters Corpus Volume 1 (RCV1) dataset are used in this research because they were assessed by domain experts at NIST [SR03] for TREC2 in the filtering track. This number of collections is sufficient and stable for reliable experiments [BV00]. RCV1 consists of collections of documents where each document is a news story in English published by Reuters.

5.2 Baseline models

We compared the performance of our model to eight different baseline models. These models are categorised into five groups based on the type of feature they use. The proposed model is trained only on relevant documents and does not consider irrelevant ones. Therefore, for a fair comparison, we can only select baseline models that are either unsupervised or do not require the use of irrelevant documents.

We selected Okapi BM25 [RZ09], one of the best term-based ranking algorithms. The phrase-based n-Grams model is also selected; it represents the user's information needs as a set of phrases, with n = 3, the best value reported by Gao et al. (2015) [GXL15]. Pattern Deploying based on Support (PDS) [ZLW12] is one of the pattern-based models; it can overcome the limitations of pattern frequency and usage. We selected Latent Dirichlet Allocation (LDA) [BNJ03] as the most widely used topic modelling algorithm and, from the same group, probabilistic Latent Semantic Analysis (pLSA) [Hof01], which is similar to LDA and can deal with the problem of polysemy. Three models were selected from the mix-based category. First, the Pattern-Based Topic Model (PBTM-FP) [GXL15], which incorporates topics and frequent patterns (FP) to obtain a semantically rich and discriminative representation for IF. Second, PBTM-FCP [GXL15], which is similar to PBTM-FP except that it uses frequent closed patterns (FCP) instead. Lastly, we selected Topical N-Grams (TNG) [WMW07], which integrates the topic model with phrases (n-grams) to discover topical phrases that are more discriminative and interpretable.

2 http://trec.nist.gov/


5.3 Evaluation Measures

The effectiveness of our model is measured, based on relevance judgements, by five metrics that are well established and commonly used in the IR and IF communities: the average precision of the top-20 ranked documents (top-20), the break-even point (b/p), mean average precision (MAP), the F-score (F1), and the 11-point interpolated average precision (IAP). For more details about these measures, the reader can refer to Manning et al. (2008) [MRS08]. For further analysis of the experimental results, the Wilcoxon signed-rank test (Wilcoxon T-test) [Wil45] was used. The Wilcoxon T-test is a non-parametric statistical hypothesis test used to assess whether the ranked means of two related samples differ. It is a good alternative to the Student's t-test, especially when no normal distribution is assumed.

5.4 Experimental Design

For each collection, we train our model on all paragraphs of the relevant documents D in the training part of the collection. We use LDA to extract ten topics, as this is the best number for each collection, as reported in [GXL13, GXL14, GXL15]. The proposed model then scores the documents' terms, ranks them and uses the top-k features as a query to an IF system. The IF system uses unknown documents (from the testing part of the same collection) to decide their relevance to the user's information needs (relevant or irrelevant); the value of k is determined experimentally. The same process is applied separately to all baseline models. If the results of the IF system returned by the five metrics are better than the baseline results, then we can claim that our model is significant and outperforms the baseline model.

The IF testing system uses the following equation to rank the testing document set:

weight(d) = \sum_{t \in Q} x, \quad x = \begin{cases} weight(t) & \text{if } t \in d \\ 0 & \text{if } t \notin d \end{cases}    (4)

where weight(d) is the weight of document d and Q is the set of query terms.
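
A minimal sketch of this ranking step; query_weights (the top-k terms and their weights produced by the model) and the toy document are illustrative assumptions:

query_weights = {"service": 2.4, "food": 1.1, "friendly": 0.7}   # top-k query terms with their weights

def document_weight(doc_terms, query_weights):
    # a testing document's weight is the sum of the weights of the query terms it contains
    return sum(w for t, w in query_weights.items() if t in doc_terms)

print(document_weight({"great", "food", "service"}, query_weights))   # 3.5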

5.5 Experimental Settings

In our experiments, we use the MALLET toolkit [McC02] to implement all LDA-based models, except for the pLSA model, for which we used the Lemur toolkit3 instead. All topic-based models require some parameters to be set. For the LDA-based models, we set the number of Gibbs sampling iterations to 1000 and the hyper-parameters to β = 0.01 and α = 50/V, as justified in [SG07]. We configured the number of iterations for pLSA to be 1000 (the default setting). For the experimental parameters of BM25, we set b = 0.75 and k1 = 1.2, as recommended by Manning et al. (2008) [MRS08].

3 https://www.lemurproject.org/

5.6 Experimental Results

Table 1 and Figure 2 show the evaluation results of our model and the baselines. These results are averaged over the 50 collections of RCV1. The results in Table 1 are categorised based on the type of feature used by the baseline model, and improvement% represents the percentage change in our model's performance compared to the best result of the baseline models in the category (marked in bold if the category has more than one baseline model). We consider any improvement greater than 5% to be significant.

Table 1 shows that our model outperformed all baseline models for information filtering on all five measures. Regardless of the type of feature used by the baseline model, our model is significantly better on average, with a minimum improvement of 8.0% and a maximum of 39.7%. Moreover, the 11-point results in Figure 2 illustrate the superiority of the proposed model and confirm the significant improvements shown in Table 1.

Table 1: Evaluation results of our model in comparison with the baselines (grouped by the type of feature used by the model) for all measures, averaged over the first 50 document collections of the RCV1 dataset.

Model          Top-20    b/p       MAP       Fβ=1      IAP

our model      0.560     0.471     0.502     0.475     0.526

LDA            0.492     0.414     0.442     0.437     0.468
pLSA           0.423     0.386     0.379     0.392     0.404
improvement%   +13.9%    +13.8%    +13.7%    +8.5%     +12.3%

PDS            0.496     0.430     0.444     0.439     0.464
improvement%   +12.9%    +9.5%     +13.2%    +8.0%     +13.4%

n-Gram         0.401     0.342     0.361     0.386     0.384
improvement%   +39.7%    +37.8%    +39.1%    +22.9%    +37.1%

BM25           0.445     0.407     0.407     0.414     0.428
improvement%   +25.8%    +15.6%    +23.5%    +14.6%    +22.9%

PBTM-FCP       0.489     0.420     0.423     0.422     0.447
PBTM-FP        0.470     0.402     0.427     0.423     0.449
TNG            0.447     0.360     0.372     0.386     0.394
improvement%   +14.5%    +12.1%    +17.7%    +12.2%    +17.1%

Figure 2: 11-point results of our model in comparison with the baselines, averaged over the first 50 document collections of the RCV1 dataset.

The Wilcoxon T-test results (Table 2) present the p-values of our model's results compared to all baseline models on all performance measures. A model's result is considered significantly different from another model's if the p-value is less than 0.05 [Wil45]. Clearly, the p-values for all metrics are far below 0.05, confirming that our model's performance is significantly different from all baselines. This shows that our model gains a substantial improvement compared to the baseline models used.

Table 2: Wilcoxon T-test p-values of the baseline models in comparison with our model.

Model Top-20 b/p MAP Fβ=1 IAP

LDA 0.004165 0.000179 7.00 × 10^-6 8.96 × 10^-6 6.71 × 10^-6
pLSA 1.48 × 10^-4 1.49 × 10^-4 6.65 × 10^-7 5.86 × 10^-7 1.72 × 10^-7
PDS 0.008575 0.003034 0.000194 0.000140 4.53 × 10^-5
n-Gram 7.46 × 10^-8 1.05 × 10^-7 1.71 × 10^-9 1.86 × 10^-9 1.23 × 10^-9
BM25 0.000353 0.008264 0.000279 0.000117 5.68 × 10^-5
TNG 0.010360 0.000607 0.000180 0.000137 3.76 × 10^-5
PBTM-FP 0.003442 7.19 × 10^-4 0.000382 0.000235 5.81 × 10^-5
PBTM-FCP 0.048010 0.033410 0.000306 0.000289 0.000180

Based on the results presented earlier, we are confident in claiming that our extended random sets model can effectively generalise the local topic weight at the document level in the LDA term scoring function and, thus, provide a more globally representative term weight when it combines the term frequency in documents and topics. Also, our model is more effective in selecting relevant features to acquire a user's information needs represented by a set of long documents.

6 Conclusion

This paper presents an innovative and effective topic-based feature ranking model that enhances the semantics of topical words in order to acquire user needs. The model extends random sets to generalise the LDA topic weight at the document level. Then, a term weighting scheme is developed to accurately rank topical terms based on their frequent appearance in the LDA topic distributions and all relevant documents. The newly calculated weight effectively reflects the relevance of a term to the user's information needs and maintains the same semantic meaning of terms across all relevant documents. The proposed model is tested for IF on the standard RCV1 dataset with TREC topics, five different performance measurement metrics and eight state-of-the-art baseline models. The experimental results show that our model achieved significant performance improvements compared to all other baseline models.

References

[ALA13] Mubarak Albathan, Yuefeng Li, and Abdulmohsen Algarni. Enhanced N-Gram Extraction Using Relevance Feature Discovery, pages 453–465. Springer International Publishing, Cham, 2013.
[ALX14] Mubarak Albathan, Yuefeng Li, and Yue Xu. Using extended random set to find specific patterns. In WI'14, volume 2, pages 30–37. IEEE, 2014.
[AZ12] Charu C Aggarwal and ChengXiang Zhai. A survey of text clustering algorithms. In Mining Text Data, pages 77–128. Springer, 2012.
[Ble12] David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[BV00] Chris Buckley and Ellen M Voorhees. Evaluating evaluation measure stability. In SIGIR'00, pages 33–40. ACM, 2000.
[DDF+90] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990.
[GXL13] Yang Gao, Yue Xu, and Yuefeng Li. Pattern-based topic models for information filtering. In ICDM'13, pages 921–928. IEEE, 2013.
[GXL14] Yang Gao, Yue Xu, and Yuefeng Li. Topical pattern based document modelling and relevance ranking. In WISE'14, pages 186–201. Springer, 2014.
[GXL15] Yang Gao, Yue Xu, and Yuefeng Li. Pattern-based topics for document modelling in information filtering. IEEE TKDE, 27(6):1629–1642, 2015.
[Hof01] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, 2001.
[KSH12] Rudolf Kruse, Erhard Schwecke, and Jochen Heinsohn. Uncertainty and Vagueness in Knowledge Based Systems: Numerical Methods. Springer Science & Business Media, 2012.
[LAA+15] Yuefeng Li, Abdulmohsen Algarni, Mubarak Albathan, Yan Shen, and Moch Arif Bijaksana. Relevance feature discovery for text mining. IEEE TKDE, 27(6):1656–1669, 2015.
[LAZ10] Yuefeng Li, Abdulmohsen Algarni, and Ning Zhong. Mining positive and negative patterns for relevance feature discovery. In KDD'10, pages 753–762. ACM, 2010.
[Li03] Yuefeng Li. Extended random sets for knowledge discovery in information systems. In RSFDGrC'03, pages 524–532. Springer, 2003.
[LTSL09] Man Lan, Chew Lim Tan, Jian Su, and Yue Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI, 31(4):721–735, 2009.
[MC13] K Tamsin Maxwell and W Bruce Croft. Compact query term selection using topically related text. In SIGIR'13, pages 583–592. ACM, 2013.
[McC02] Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. 2002.
[MO10] Craig Macdonald and Iadh Ounis. Global statistics in proximity weighting models. In Web N-gram Workshop, page 30. Citeseer, 2010.
[Mol06] Ilya Molchanov. Theory of Random Sets. Springer Science & Business Media, 2006.
[MRS08] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[Ngu08] Hung T Nguyen. Random sets. Scholarpedia, 3(7):3383, 2008.
[RZ09] Stephen Robertson and Hugo Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers Inc, 2009.
[SG07] Mark Steyvers and Tom Griffiths. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440, 2007.
[SR03] Ian Soboroff and Stephen Robertson. Building a filtering test collection for TREC 2002. In SIGIR'03, pages 243–250. ACM, 2003.
[TG09] Serafettin Tasci and Tunga Gungor. LDA-based keyword selection in text categorization. In ISCIS'09, pages 230–235. IEEE, 2009.
[WC06] Xing Wei and W Bruce Croft. LDA-based document models for ad-hoc retrieval. In SIGIR'06, pages 178–185. ACM, 2006.
[Wil45] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
[WMW07] Xuerui Wang, Andrew McCallum, and Xing Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In ICDM'07, pages 697–702. IEEE, 2007.
[ZLW12] Ning Zhong, Yuefeng Li, and Sheng-Tang Wu. Effective pattern discovery for text mining. IEEE TKDE, 24(1):30–44, 2012.
[ZPH08] Zhiwei Zhang, Xuan-Hieu Phan, and Susumu Horiguchi. An efficient feature selection using hidden topic in text categorization. In AINAW'08, pages 1223–1228. IEEE, 2008.

A simple neural network for evaluating semantic textual similarity

Yang SHAO
Hitachi, Ltd.

Higashi koigakubo 1-280, Tokyo, [email protected]

Abstract

This paper describes a simple neural network system for the Semantic Textual Similarity (STS) task. The basic version of the system took part in the STS task of SemEval 2017 and ranked 3rd in the primary track. More variant neural network structures and experiments are explored in this paper. In our system, the semantic similarity score between two sentences is calculated by comparing their semantic vectors. The semantic vector of every sentence is generated by max pooling over every dimension of its word vectors. There are mainly two trick points in our system. One is that we trained a convolutional neural network (CNN) to transfer GloVe word vectors to a more proper form for the STS task before pooling. The other is that we trained a fully-connected neural network (FCNN) to transfer the difference of two semantic vectors to a probability distribution over similarity scores. In spite of the simplicity of our neural network system, the best variant neural network achieved a Pearson correlation coefficient of 0.7930 on the STS benchmark test dataset and ranked 3rd1.

1 Introduction

Semantic Textual Similarity (STS) is the task of deciding a score that estimates the degree of semantic similarity between two sentences. The STS task is a building block of many Natural Language Processing (NLP) applications. Therefore, it has received a lot of attention in recent years. STS tasks in SemEval have been held from 2012 to 2017 [Cer et al., 2017].


1 As of May 26, 2017

In order to provide a standard benchmark to compare meaning representation systems in future years, the organizers of the STS tasks created a benchmark dataset in 2017. The STS Benchmark2 comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017 [Agirre et al., 2012; 2013; 2014; 2015; 2016; Cer et al., 2017]. The selection of datasets includes text from image captions, news headlines and user forums. Estimating the degree of semantic similarity of two sentences requires a very deep understanding of both sentences. Therefore, methods developed for STS tasks could also be used for many other natural language understanding tasks, such as paraphrasing, entailment, answer sentence selection and hypothesis evidencing tasks.

Measuring sentence similarity is challenging mainly for two reasons. One is the variability of linguistic expression and the other is the limited amount of annotated training data. Therefore, conventional NLP approaches, such as sparse, hand-crafted features, are difficult to use. However, neural network systems [He et al., 2015a; He and Lin, 2016] can alleviate data sparseness with pre-training and distributed representations. We propose a simple neural network system with 5 components (a minimal sketch of the pipeline follows the list):

1) Enhance GloVe word vectors in every sentence by adding hand-crafted features.
2) Transfer the enhanced word vectors to a more proper form by a convolutional neural network (CNN).
3) Max pooling over every dimension of all word vectors to generate the semantic vector.
4) Generate the semantic difference vector by concatenating the element-wise absolute difference and the element-wise multiplication of the two semantic vectors.
5) Transfer the semantic difference vector to a probability distribution over similarity scores by a fully-connected neural network (FCNN).
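The following is a minimal Keras sketch of these five components under the hyper parameters listed later in Table 1 (300-d GloVe vectors plus a 1-d overlap flag, sentences padded to 30 tokens, six similarity labels). It is an illustrative reconstruction under those assumptions, not the released system.

import tensorflow as tf
from tensorflow.keras import layers, models

SENT_LEN, DIM = 30, 301          # 300-d GloVe vector + 1 hand-crafted flag (component 1)
N_FILTERS, N_LABELS = 300, 6

def sentence_encoder():
    inp = layers.Input(shape=(SENT_LEN, DIM))
    # component 2: per-word transfer of the enhanced vectors by a 1-D CNN
    conv = layers.Conv1D(N_FILTERS, 1, activation='tanh',
                         kernel_initializer='he_uniform')(inp)
    # component 3: max pooling over every dimension -> semantic vector
    sem = layers.GlobalMaxPooling1D()(conv)
    return models.Model(inp, sem)

encoder = sentence_encoder()     # shared weights for both sentences
s1 = layers.Input(shape=(SENT_LEN, DIM))
s2 = layers.Input(shape=(SENT_LEN, DIM))
v1, v2 = encoder(s1), encoder(s2)

# component 4: semantic difference vector = (|SV1 - SV2|, SV1 * SV2)
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([v1, v2])
prod = layers.Multiply()([v1, v2])
sdv = layers.Concatenate()([diff, prod])

# component 5: FCNN mapping the difference vector to a distribution over scores
hidden = layers.Dense(300, activation='tanh', kernel_initializer='he_uniform')(sdv)
out = layers.Dense(N_LABELS, activation='softmax')(hidden)
model = models.Model([s1, s2], out)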

2 System Description

Figure 1 provides an overview of our system. The two sentences to be semantically compared are first pre-processed as described in subsection 2.1. Then the CNN described in subsection 2.2 transfers the word vectors to a more proper form for each sentence. After that, the process introduced in subsection 2.3 is adopted to calculate the semantic vector and the semantic difference vector from the transferred word vectors. Then, an FCNN described in subsection 2.4 transfers the semantic difference vector to a probability distribution over similarity scores. We implemented our neural network system using Keras3 [Chollet, 2015] and TensorFlow4 [Abadi et al., 2016].

2 http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Figure 1: Overview of system

2.1 Pre-process

Several text preprocessing operations were performed before feature engineering:

1) All punctuation is removed.
2) All words are lower-cased.
3) All sentences are tokenized by the Natural Language Toolkit (NLTK) [Bird et al., 2009].
4) All words are replaced by pre-trained GloVe word vectors (Common Crawl, 840B tokens) [Pennington et al., 2014]. Words that do not exist in the pre-trained word vectors are set to the zero vector.
5) All sentences are padded to a static length l = 30 with zero vectors [He et al., 2015a].

One hand-crafted feature is added to enhance the GloVe word vectors:

1) If a word appears in both sentences, add a TRUE flag to the word vector; otherwise, add a FALSE flag.

A sketch of these steps is given below.
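A hedged sketch of the preprocessing above; the glove lookup table is assumed to be loaded elsewhere, and the helper names are illustrative rather than taken from the released code.

import string
import numpy as np
from nltk.tokenize import word_tokenize

PAD_LEN, DIM = 30, 300

def preprocess(sentence, other_sentence, glove):
    table = str.maketrans('', '', string.punctuation)                 # 1) remove punctuation
    tokens = word_tokenize(sentence.translate(table).lower())         # 2)-3) lower-case, tokenize
    other = set(word_tokenize(other_sentence.translate(table).lower()))
    rows = []
    for w in tokens[:PAD_LEN]:
        vec = glove.get(w, np.zeros(DIM))                             # 4) GloVe vector or zero vector
        flag = 1.0 if w in other else 0.0                             # hand-crafted overlap flag
        rows.append(np.append(vec, flag))
    while len(rows) < PAD_LEN:                                        # 5) pad to length 30 with zeros
        rows.append(np.zeros(DIM + 1))
    return np.stack(rows)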

2.2 Convolutional neural network (CNN)

The number of our CNN layers is l. Every layer consists of kl one-dimensional filters. The length of the filters is set to be the same as the dimension of the enhanced word vectors. The activation function of the convolutional layers is tanh. We did not use any regularization or dropout. Early stopping triggered by model performance on validation data was used to avoid overfitting. We used the same model weights to transfer each of the words in a sentence.

3 http://github.com/fchollet/keras
4 http://github.com/tensorflow/tensorflow

Table 1: Hyper parameters

Sentence pad length: 30
Dimension of GloVe vectors: 300
Number of CNN layers l: 1
Number of CNN filters in layer1 k1: 300
Activation function of CNN: tanh
Initial function of CNN: he_uniform
Number of FCNN layers n: 2
Dimension of input layer: 600
Dimension of layer1 m1: 300
Dimension of output layer: 6
Activation of layers except output: tanh
Activation of output layer: softmax
Initial function of layers: he_uniform
Optimizer: ADAM
Batch size: 1500
Max epoch: 25
Run times: 8

2.3 Comparison of semantic vectors

The semantic vector of a sentence is calculated by max pooling [Scherer et al., 2010] over every dimension of the CNN-transferred word vectors. To calculate the semantic similarity score of two sentences, we generate a semantic difference vector by concatenating the element-wise absolute difference and the element-wise multiplication of the two semantic vectors. The calculation equation is

\vec{SDV} = (|\vec{SV1} - \vec{SV2}|, \; \vec{SV1} \circ \vec{SV2})   (1)

Here, \vec{SDV} is the semantic difference vector between the two sentences, \vec{SV1} and \vec{SV2} are the semantic vectors of the two sentences, and ∘ is the Hadamard product, which generates the element-wise multiplication of the two semantic vectors.

2.4 Fully-connected neural network (FCNN)

An FCNN is used to transfer the semantic difference vector to a probability distribution over the six similarity labels used by STS. The number of layers is n. The dimension of every layer is mn. The activation function of every layer except the last one is tanh. The activation function of the last layer is softmax. We train without using regularization or dropout.

3 Experiments and Results

The basic version of our neural network system took part in the STS task of SemEval 2017 and ranked 3rd in the primary track [Shao, 2017]. The hyper parameters used in the basic system were decided empirically for the STS task and are shown in Table 1. Our objective function is the Pearson correlation coefficient. ADAM [P.Kingma and Ba, 2015] was used as the gradient descent optimization method. All parameters of the optimizer were set following the original paper: the learning rate is 0.001,

Table 2: Increasing of dimensions of FCNN

Dimensions of FCNN layer1 m1    Pearson correlation coefficient
300     0.778679 ± 0.003508
600     0.776741 ± 0.002711
900     0.778596 ± 0.001876
1200    0.779059 ± 0.003414
1500    0.778852 ± 0.003400
1800    0.779247 ± 0.002261

Table 3: Increasing of filters of CNN

Number of CNN filters in layer1 k1    Pearson correlation coefficient
300     0.780586 ± 0.001843
600     0.785420 ± 0.002587
900     0.790137 ± 0.002325
1200    0.791042 ± 0.002557
1500    0.792357 ± 0.002256
1800    0.792580 ± 0.002613

β1 is 0.9, β2 is 0.999, and ε is 1e-08. he_uniform [He et al., 2015c] was used as the initializer of all layers. The basic model achieved a Pearson correlation coefficient of 0.778679 ± 0.003508 and ranked 4th on the STS benchmark5. We explore more variant neural network structures and experiments in this section.

3.1 Increasing of dimensions of FCNN

We run the experiment using more FCNN dimensions in this subsection. The hyper parameters used in this subsection are the same as in the basic system in Table 1, except for the dimension of FCNN layer1 m1. The dimensions of FCNN layer1 m1 and the Pearson correlation coefficient results are shown in Table 2. Figure 2 shows the average results in every epoch with standard deviation error bars. The highest curve is the Pearson correlation coefficient results on the training data. The curve in the middle is the results on the validation data. The lowest curve is the results on the test data.

3.2 Increasing of filters of CNN

We run the experiment using more CNN filters in this subsection. The hyper parameters used in this subsection are the same as in the basic system in Table 1, except for the number of CNN filters in layer1 k1. The number of CNN filters in layer1 k1 and the Pearson correlation coefficient results are shown in Table 3. Figure 3 shows the average results in every epoch with standard deviation error bars.

3.3 Increasing of layers of FCNN

We run the experiment using more FCNN layers in this subsection. The hyper parameters used in this subsection are the same as in the basic system in Table 1, except for the number of FCNN layers n and the dimensions of the FCNN layers mn.

5 As of May 26, 2017

The number of FCNN layers n is set to 3 in this subsection. The dimensions of FCNN layer n mn and the Pearson correlation coefficient results are shown in Table 4. The number of filters in CNN layer1 k1 is set to 1800 based on the previous experiments. Figure 4 shows the average results in every epoch with standard deviation error bars.

3.4 Increasing of layers of CNN

We run the experiment using more CNN layers in this subsection. The hyper parameters used in this subsection are the same as in the basic system in Table 1, except for the number of CNN layers l and the number of filters in the CNN layers kl. The number of CNN layers l is set to 2 in this subsection. The number of filters in CNN layer l kl and the Pearson correlation coefficient results are shown in Table 5. The dimension of FCNN layer1 m1 is set to 1800 based on the previous experiments. Figure 5 shows the average results in every epoch with standard deviation error bars.

3.5 2 CNN layers with shortcut

We run the experiment using 2 CNN layers in this subsection. We add a shortcut [He et al., 2015b] between the input layer and the second layer. The hyper parameters used in this subsection are the same as in the basic system in Table 1, except for the number of CNN layers l and the number of CNN filters in the layers kl. The number of CNN layers l is set to 2. The number of CNN filters in layer2 k2 is set to 301, the same as the dimension of the expanded GloVe word vectors. The number of filters in CNN layer1 k1 and the Pearson correlation coefficient results are shown in Table 6. The dimension of FCNN layer1 m1 is set to 1800 based on the previous experiments. Figure 6 shows the average results in every epoch with standard deviation error bars.

3.6 3 CNN layers with shortcut

We run the experiment using 3 CNN layers in this subsection. We add a shortcut [He et al., 2015b] between the first layer and the third layer. The hyper parameters used in this subsection are the same as in the basic system in Table 1, except for the number of CNN layers l and the number of CNN filters in the layers kl. The number of CNN layers l is set to 3. The number of filters in CNN layer3 k3 is set to be the same as the number of filters in CNN layer1 k1. The number of CNN filters in the layers kl and the Pearson correlation coefficient results are shown in Table 7. The dimension of FCNN layer1 m1 is set to 1800 based on the previous experiments. Figure 7 shows the average results in every epoch with standard deviation error bars. For this experiment, we also tried the model without the hand-crafted feature.

Table 4: Increasing of layers of FCNN

Dimensions of FCNN layer1    Dimensions of FCNN layer2    Pearson correlation coefficient
300     300     0.788331 ± 0.004569
600     600     0.785838 ± 0.003565
900     900     0.789736 ± 0.002546
1200    1200    0.786109 ± 0.003820
1500    1500    0.789013 ± 0.001524
1800    1800    0.782995 ± 0.003396

Table 5: Increasing of layers of CNN

Number of filters in layer1    Number of filters in layer2    Pearson correlation coefficient
300     301     0.762369 ± 0.002277
600     301     0.765034 ± 0.002445
900     301     0.765966 ± 0.003641
1200    301     0.761183 ± 0.004322
1500    301     0.764604 ± 0.004969
1800    301     0.766178 ± 0.004455

The purely sentence-representation system (without the hand-crafted feature) achieved a Pearson correlation coefficient of 0.788154 ± 0.003412.

4 Discussion

From the results of experiment 1, we find that increasing the dimensions of the FCNN does not have a remarkable effect on the accuracy of the evaluations. The curves in Figure 2 are almost coincident. From the results of experiment 2, we find that increasing the number of filters in the CNN layer can improve the accuracy of the evaluations. Although the size of the training data (5749 records) is not very large, abstracting more features still benefits the evaluation results. By increasing the number of filters in the CNN layer, we achieved a Pearson correlation coefficient of 0.792580 ± 0.002613 and improved the rank from 4th to 3rd.

From the results of experiment 3, we find that increasing the number of FCNN layers is harmful to the evaluation results, while increasing the dimensions of the FCNN layers has little effect on the accuracy of the evaluations. From the results of experiment 4, we find that increasing the number of CNN layers can significantly pull down the accuracy of the evaluations. However, changing the number of filters in the CNN layers only changes the learning speed and has little effect on the final accuracy. The structure with a smaller number of filters can learn faster.

From the results of experiment 5, we find that adding a shortcut between the input layer and the second CNN layer can slightly improve the accuracy of the evaluations. From the results of experiment 6, we find that adding a shortcut between the first CNN layer and the third CNN layer achieves a result close to that of the model with only one CNN layer. A smaller number of filters in the second CNN layer achieves better accuracy.

Table 6: 2 CNN layers with shortcut

Number of filters in layer1    Number of filters in layer2    Pearson correlation coefficient
300     301     0.762030 ± 0.008716
600     301     0.768793 ± 0.003466
900     301     0.767369 ± 0.004021
1200    301     0.768415 ± 0.005799
1500    301     0.769528 ± 0.002299
1800    301     0.770214 ± 0.006707

Table 7: 3 CNN layers with shortcut

Number of filters in layer1    Number of filters in layer2    Pearson correlation coefficient
1800    300     0.793013 ± 0.002325
1800    600     0.791661 ± 0.002444
1800    900     0.787749 ± 0.003798
1800    1200    0.785493 ± 0.002761
1800    1500    0.785675 ± 0.003413
1800    1800    0.783370 ± 0.004499

Compared with the structure that has only one CNN layer, the structure with 3 CNN layers and a shortcut can learn faster. The 3-CNN-layer structure with shortcut achieved a Pearson correlation coefficient of 0.793013 ± 0.002325, which is the best result among all of the variant neural networks.

5 Conclusion

We investigated a simple neural network system for the STS task. All variant models used a convolutional neural network to transfer hand-crafted-feature-enhanced GloVe word vectors to a proper form. Then, the models calculated the semantic vectors of sentences by max pooling over every dimension of their transferred word vectors. After that, the semantic difference vector between two sentences was generated by concatenating the element-wise absolute difference and element-wise multiplication of their semantic vectors. At last, a fully-connected neural network was used to transfer the semantic difference vector to a probability distribution over similarity scores.

In spite of the simplicity of our neural network system, the basic version ranked 3rd in the primary track of the STS task of SemEval 2017. On the STS benchmark test dataset, the basic model achieved a Pearson correlation coefficient of 0.778679 ± 0.003508 and ranked 4th. By investigating several variant neural networks in this research, we found that the structure with 3 CNN layers and a shortcut between the first layer and the third layer achieved the best result, 0.793013 ± 0.002325, which improved our rank from 4th to 3rd. We also tried a purely sentence-representation system for this model, and its result of 0.788154 ± 0.003412 also ranked 3rd.

Figure 2: Increasing of dimensions of FCNN

Figure 3: Increasing of filters of CNN

Figure 4: Increasing of layers of FCNN

Figure 5: Increasing of layers of CNN

Figure 6: 2 CNN layers with shortcut

Figure 7: 3 CNN layers with shortcut

References

[Abadi et al., 2016] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.
[Agirre et al., 2012] Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pages 385–393, 2012.
[Agirre et al., 2013] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *SEM 2013 shared task: Semantic textual similarity. In Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43, 2013.
[Agirre et al., 2014] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 81–91, 2014.
[Agirre et al., 2015] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation, pages 252–263, 2015.
[Agirre et al., 2016] Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation, pages 497–511, San Diego, California, June 2016. Association for Computational Linguistics.
[Bird et al., 2009] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009.
[Cer et al., 2017] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada, August 2017. Association for Computational Linguistics.
[Chollet, 2015] Francois Chollet. Keras. https://github.com/fchollet/keras, 2015.
[He and Lin, 2016] Hua He and Jimmy Lin. Pairwise word interaction modelling with deep neural networks for semantic similarity measurement. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
[He et al., 2015a] Hua He, Kevin Gimpel, and Jimmy Lin. Multi-perspective sentence similarity modelling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1576–1586, 2015.
[He et al., 2015b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[He et al., 2015c] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
[P.Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
[Scherer et al., 2010] Dominik Scherer, Andreas C. Muller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the 20th International Conference on Artificial Neural Networks (ICANN), pages 92–101, 2010.
[Shao, 2017] Yang Shao. HCTI at SemEval-2017 task 1: Use convolutional neural network to evaluate semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 130–133, Vancouver, Canada, August 2017. Association for Computational Linguistics.

Long-Term Memory Networks for Question Answering

Fenglong Ma1∗, Radha Chitta2∗, Saurabh Kataria3∗

Jing Zhou2∗, Palghat Ramesh4, Tong Sun5∗, Jing Gao1

1SUNY Buffalo, 2Conduent Labs US, 3LinkedIn, 4PARC, 5United Technologies Research Center

fenglong, [email protected], radha.chitta, [email protected]@gmail.com, [email protected], [email protected]

Abstract

Question answering is an important and difficult task in the natural language processing domain, because many basic natural language processing tasks can be cast into a question answering task. Several deep neural network architectures have been developed recently, which employ memory and inference components to memorize and reason over text information, and generate answers to questions. However, a major drawback of many such models is that they are capable of only generating single-word answers. In addition, they require a large amount of training data to generate accurate answers. In this paper, we introduce the Long-Term Memory Network (LTMN), which incorporates both an external memory module and a Long Short-Term Memory (LSTM) module to comprehend the input data and generate multi-word answers. The LTMN model can be trained end-to-end using back-propagation and requires minimal supervision. We test our model on two synthetic data sets (based on Facebook's bAbI data set) and the real-world Stanford question answering data set, and show that it can achieve state-of-the-art performance.

1 Introduction

Question answering (QA), a challenging problem which requires an ability to understand and analyze the given unstructured text, is one of the core tasks in natural language understanding and processing.

∗Work carried out while at PARC, a Xerox Company.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.

In: Proceedings of IJCAI Workshop on Semantic Machine Learning (SML 2017), Aug 19-25 2017, Melbourne, Australia.

Many problems in natural language processing, such as reading comprehension, machine translation, entity recognition, sentiment analysis, and dialogue generation, can be cast as question answering problems.

Traditional question answering approaches can be categorized as: (i) IR-based question answering [Pas03], where the question is formulated as a search query, and a short text segment is found on the Web or a similar corpus as the answer; and (ii) knowledge-based question answering [GJWCL61, BCFL13], which aims to answer a natural language question by mapping it to a semantic query over a database.

The traditional approaches are simple query-based techniques. Using these traditional question-answering systems, it is difficult to establish the relationships between the sentences in the input text and derive a meaningful representation of the information within the text.

Figure 1 shows an example of a question answering task. The sentences in black are facts that may be relevant to the questions, questions are in blue, and the correct answers are in red. In order to correctly answer the question "What did Steve Jobs offer Xerox to visit and see their latest technology?", the model should have the ability to recognize that the sentence "After hearing of the pioneering GUI technology being developed at Xerox PARC, Jobs had negotiated a visit to see the Xerox Alto computer and its Smalltalk development tools in exchange for Apple stock options." is a supporting fact, and to extract the relevant portion of the supporting fact to form the answer. In addition, the model should have the ability to memorize all the facts that have been presented to it until the current time, and deduce the answer.

The authors of [WCB15] proposed a new class of learning models named Memory Networks (MemNN), which use a long-term memory component to store information and an inference component for reasoning.

1: Burrel's innovative design, which combined the low production cost of an Apple II with the computing power of Lisa's CPU, the Motorola 68K, received the attention of Steve Jobs, co-founder of Apple.
2: Realizing that the Macintosh was more marketable than the Lisa, he began to focus his attention on the project.
3: Raskin left the team in 1981 over a personality conflict with Jobs.
4: Why did Raskin leave the Apple team in 1981? over a personality conflict with Jobs
5: Team member Andy Hertzfeld said that the final Macintosh design is closer to Jobs' ideas than Raskin's.
6: According to Andy Hertzfeld, whose idea is the final Mac design closer to? Jobs
7: After hearing of the pioneering GUI technology being developed at Xerox PARC, Jobs had negotiated a visit to see the Xerox Alto computer and its Smalltalk development tools in exchange for Apple stock options.
8: What did Steve Jobs offer Xerox to visit and see their latest technology? Apple stock options

Figure 1: Example of a question answering task.

[KIO+16] proposed the Dynamic Memory Network (DMN) for general question answering tasks, which processes input sentences and questions, forms episodic memories, and generates answers. These two approaches are strongly supervised, i.e., only the supporting facts (factoids) are fed to the model as inputs for training the model for each type of question. For example, when training the model with the question in the fourth line of Figure 1, strongly supervised methods only use the sentence in line 3 as input. Thus, these methods require a large amount of training data.

To tackle this issue, [SWF+15] introduced a weakly supervised approach called End-to-End Memory Network (MemN2N), which uses all the sentences that have appeared before the question. For the above example, the inputs are the sentences from line 1 to line 3 when training for the question in the fourth line. MemN2N is trained end-to-end and uses an attention mechanism to calculate the matching probabilities between the input sentences and questions. The sentences which match the question with high probability are used as the factoids for answering the question.

However, this model is capable of generating only single-word answers. For example, the answer to the question "According to Andy Hertzfeld, whose idea is the final Mac design closer to?" in Figure 1 is only one word, "Jobs". Since the answers of many questions contain multiple words (for instance, the question labeled 4 in Figure 1), this model cannot be directly applied to general question answering tasks.

Recurrent neural networks comprising Long Short-Term Memory units have been employed to generate multi-word text in the literature [Gra13, SVL14]. However, simple LSTM-based recurrent neural networks do not perform well on the question-answering task due to the lack of an external memory component which can memorize and contextualize the facts. We present a more sophisticated recurrent neural network architecture, named Long-Term Memory Network (LTMN), which combines the best aspects of end-to-end memory networks and LSTM-based recurrent neural networks to address the challenges faced by the currently available neural network architectures for question answering. Specifically, it first embeds the input sentences (initially encoded using a distributed representation learning mechanism such as paragraph vectors [LM14]) in a continuous space, and stores them in memory. It then matches the sentences with the questions, also embedded into the same space, by performing multiple passes through the memory, to obtain the factoids which are relevant to each question. These factoids are then employed to generate the first word of the answer, which is then input to an LSTM unit. The LSTM unit is used to generate the subsequent words of the answer. The proposed LTMN model can be trained end-to-end, requires minimal supervision during training (i.e., it is weakly supervised), and generates multi-word answers. Experimental results on two synthetic datasets and one real-world dataset show that the proposed model outperforms the state-of-the-art approaches.

In summary, the contributions of this paper are as follows:

• We propose an effective neural network architecture for general question answering, i.e. for generating multi-word answers to questions. Our architecture combines the best aspects of MemN2N and LSTM and can be trained end-to-end.

• The proposed architecture employs distributed representation learning techniques (e.g. paragraph2vec) to learn vector representations for sentences or factoids, questions and words, as well as their relationships. The learned embeddings contribute to the accuracy of the answers generated by the proposed architecture.

• We generate a new synthetic dataset with multi-word answers based on Facebook's bAbI dataset [WBC+16]. We call this the multi-word answer bAbI dataset.

• We test the proposed architecture on two synthetic datasets (the single-word answer bAbI dataset and the multi-word answer bAbI dataset), and the real-world Stanford question answering dataset [RZLL16]. The results clearly demonstrate the advantages of the proposed architecture for question answering.

2 Related Work

In this section, we review literature closely related to question answering, particularly focusing on models using memory networks to generate answers.

2.1 Question Answering

Traditional question answering approaches mainly fall into two categories: IR-based [Pas03] and knowledge-based question answering [GJWCL61, BCFL13]. IR-based question answering systems use information retrieval techniques to extract information (i.e., answers) from documents. These methods first process questions, i.e., detect named entities in questions, and then predict answer types, such as city names or person names. After recognizing answer types, these approaches generate queries and extract answers from the web using the generated queries. These approaches are simple, but they ignore the semantics between questions and answers.

Knowledge-based question answering systems [ZC05, BL14, ZHLZ16] consider the semantics and use existing knowledge bases, such as Freebase [BEP+08] and DBpedia [BLK+09]. They cast the question answering task as that of finding one of the missing arguments in a triple. Most knowledge-based question answering approaches use neural networks, dependency trees and knowledge bases [BGWB12] or sentences [IBGC+14].

Using traditional question answering approaches, it is difficult to establish the relationships between sentences in the input text, and thereby identify the relevance of the different sentences to the question. Of late, several neural network architectures with memories have been proposed to solve this challenging problem.

2.2 Memory Networks

Several deep neural network models use memory architectures [SWF+15, KIO+16, WCB15, GWD14, JM15, MD93] and attention mechanisms for image captioning [YJW+16], machine comprehension [WGL+16] and healthcare data mining [MCZ+17, SMC+17]. We focus on the models using memory networks for natural language question answering.

Memory Networks (MemNN), proposed in [WCB15], first introduced the concept of an external memory component for natural language question answering. They are strongly supervised, i.e., they are trained with only the supporting facts for each question. The supporting input sentences are embedded in memory, and the response is generated from these facts by scoring all the words in the vocabulary in correlation with the facts. This scoring function is learnt during the training process and employed during the testing phase. Due to this response generation mechanism, MemNN are capable of producing only single-word answers. In addition, MemNN cannot be trained end-to-end.

The authors of [KIO+16] improve over MemNN by introducing an end-to-end trainable network called Dynamic Memory Networks (DMN). DMN have four modules: an input module, a question module, an episodic memory module and an answer module. The input module encodes raw text inputs into distributed vector representations using a gated recurrent network (GRU) [CVMBB14]. The question module similarly encodes the question using a recurrent neural network. The sentence and question representations are fed to the episodic memory module, which chooses the sentences to focus on using the attention mechanism. It iteratively produces a memory vector, representing all the relevant information, which is then used by the answer module to generate the answer using a GRU. However, DMN are also strongly supervised like MemNN, thereby requiring a large amount of training data.

End-to-End Memory Networks (MemN2N) [SWF+15] first encode sentences into continuous vector representations, then use a soft attention mechanism to calculate matching probabilities between sentences and questions and find the most relevant facts, and finally generate responses using the vocabulary from these facts. Unlike the MemNN and DMN architectures, MemN2N can be trained end-to-end and are weakly supervised. However, the drawback of MemN2N is that it only generates one-word answers. The proposed LTMN architecture improves over the existing network architectures because (i) it can be trained end-to-end, (ii) it is weakly supervised, and (iii) it can generate answers with multiple words.

3 Long-Term Memory Networks

In this section, we describe the proposed Long-Term Memory Network, shown in Figure 2. It includes four modules: an input module, a question module, a memory module and an answer module.

Figure 2: The proposed LTMN model.

The input module encodes raw text data (i.e., sentences) into vector representations. Similarly, the question module also encodes questions into vector representations. The input and question modules can use the same or different encoding methods. Given the input sentences' representations, the memory module calculates the matching probabilities between the question representation and the sentence representations, and then outputs the sum of the sentence representations weighted by the matching probabilities. Using this weighted sum vector and the question representation, the answer module finally generates the answer to the question.

3.1 Input Module and Question Module

Let {x_i}_{i=1}^{n} represent the set of input sentences. Each sentence x_i ∈ R^{|V|} contains words belonging to a dictionary V, and ends with an end-of-sentence token <EOS>. The goal of the input module is to encode sentences into vector representations. The question module, like the input module, aims to encode each question q ∈ R^{|V|} into a vector representation. Specifically, we use a matrix A ∈ R^{d×|V|} to embed sentences and a matrix B ∈ R^{d×|V|} for questions.

Several methods have been proposed to encode the input sentences or questions. In [SWF+15], an embedding matrix is employed to embed the sentences in a continuous space and obtain the vector representations. [KIO+16, Elm91] use a recurrent neural network to encode the input sentences into vector representations. Our objective is to learn the co-occurrence and sequence relationships between words in the text in order to generate a coherent sequence of words as answers. Thus, we employ a distributed representation learning technique, such as the paragraph vectors (paragraph2vec) model [LM14], to pre-train A and B (with A = B) for the real-world SQuAD dataset, which takes into account the order and semantics among words to encode the input sentences and questions1. For synthetic datasets, which are based on a small vocabulary, the embedding matrices A and B are learnt via back-propagation.

1 We use paragraph2vec in our implementation. Other representation learning mechanisms may be employed in the proposed LTMN model.
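As a hedged illustration (not the authors' code) of this paragraph2vec pre-training step, the sketch below uses gensim's Doc2Vec; the toy sentences are taken from the example in Figure 1 and the 100-dimensional size follows Section 4.2.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [['raskin', 'left', 'the', 'team', 'in', '1981'],
             ['why', 'did', 'raskin', 'leave', 'the', 'apple', 'team', 'in', '1981']]
docs = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate(sentences)]
p2v = Doc2Vec(docs, vector_size=100, min_count=1, epochs=50)

m_i = p2v.infer_vector(sentences[0])   # sentence representation (one row of memory)
u = p2v.infer_vector(sentences[1])     # question representation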

3.2 Memory Module

The input sentences {x_i}_{i=1}^{n} are embedded using the matrix A as m_i = A x_i, i = 1, 2, ..., n; m_i ∈ R^d, and stored in memory. Note that we use all the sentences before the question as input, which implies that the proposed model is weakly supervised. The question q is also embedded using the matrix B as u = Bq; u ∈ R^d. The memory module then calculates the matching probabilities between the sentences and the question, by computing the inner product followed by a softmax function as follows:

p_i = softmax(u^T m_i),   (1)

where softmax(z_i) = e^{z_i} / \sum_j e^{z_j}. The probability p_i is expected to be high for all the sentences x_i that are related to the question q.

The output of the memory module is a vector o ∈ R^d, which can be represented by the sum over the input sentence representations, weighted by the matching probability vector, as follows:

o = \sum_i p_i m_i.   (2)

This approach, known as the soft attention mechanism, has been used by [SWF+15, BCB15]. The benefit of this approach is that it is easy to compute gradients and back-propagate through this function.
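The following is a direct numpy transcription of the memory module (Eqs. (1) and (2)), given as an illustrative sketch.

import numpy as np

def memory_module(M, u):
    """M: (n, d) stored sentence embeddings m_i; u: (d,) question embedding."""
    scores = M @ u                       # u^T m_i for every stored sentence
    p = np.exp(scores - scores.max())
    p /= p.sum()                         # Eq. (1): p_i = softmax(u^T m_i)
    o = p @ M                            # Eq. (2): o = sum_i p_i m_i
    return o, p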

3.3 Answer Module

Based on the output vector o from the memory module and the word representations from the input module, the answer module generates answers to questions. As our objective is to generate answers with multiple words, we employ a Long Short-Term Memory (LSTM) network [HS97] to generate answers.

The core of the LSTM neural network is a memory unit whose behavior is controlled by a set of three gates: input, output and forget gates. The memory unit accumulates the knowledge from the input data at each time step, based on the values of the gates, and stores this knowledge in its internal state. The initial input to the LSTM is the embedding of the begin-of-answer (<BOA>) token and its state. We use the output of the memory module o, the question representation u, a weight matrix W^{(o)} and a bias b_o to generate the embedding of <BOA>, a_0, as follows:

a_0 = softmax(W^{(o)}(o + u) + b_o).   (3)

Using a_0 and the initial state s_0, the LSTM can generate the first word w_1 and its corresponding predicted output y_1 and state s_1. At each time step t, the LSTM takes the embedding of word w_{t-1} and the last hidden state s_{t-1} as input to generate the new word w_t:

v_t = [w_{t-1}]   (4)
i_t = σ(W_{iv} v_t + W_{im} y_{t-1} + b_i)   (5)
f_t = σ(W_{fv} v_t + W_{fm} y_{t-1} + b_f)   (6)
o_t = σ(W_{ov} v_t + W_{om} y_{t-1} + b_o)   (7)
s_t = f_t ∘ s_{t-1} + i_t ∘ tanh(W_{sv} v_t + W_{sm} y_{t-1})   (8)
y_t = o_t ∘ s_t   (9)
w_t = argmax[softmax(W^{(t)} y_t + b_t)]   (10)

where [w_t] is the embedding of word w_t learnt from the input module, σ and ∘ denote the sigmoid function and the Hadamard product respectively, W^{(t)} is a weight matrix, and b_t is a bias vector.

The model is trained end-to-end with the loss defined by the cross-entropy between the true answer and the predicted output w_t, represented using one-hot encoding. The predicted answer is generated by concatenating all the words generated by the model.
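For concreteness, the recurrences of the answer module can be transcribed in numpy as below. P is a hypothetical dictionary of the trained weight matrices and bias vectors and vocab maps indices to words; this is an illustrative sketch, not the authors' implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def answer_step(v, y_prev, s_prev, P, vocab):
    """One decoding step, Eqs. (5)-(10); returns the next word and the new LSTM outputs."""
    i = sigmoid(P['Wiv'] @ v + P['Wim'] @ y_prev + P['bi'])              # input gate,  Eq. (5)
    f = sigmoid(P['Wfv'] @ v + P['Wfm'] @ y_prev + P['bf'])              # forget gate, Eq. (6)
    o = sigmoid(P['Wov'] @ v + P['Wom'] @ y_prev + P['bo'])              # output gate, Eq. (7)
    s = f * s_prev + i * np.tanh(P['Wsv'] @ v + P['Wsm'] @ y_prev)       # state,       Eq. (8)
    y = o * s                                                            # output,      Eq. (9)
    w = vocab[int(np.argmax(softmax(P['Wt'] @ y + P['bt'])))]            # next word,   Eq. (10)
    return w, y, s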

4 Experiments

In this section, we compare the performance of the proposed LTMN model with the current state-of-the-art models for question answering.

4.1 Datasets

We use three datasets: the real-world Stanford question answering dataset (SQuAD) [RZLL16], the synthetic single-word answer bAbI dataset [WBC+16], and the synthetic multi-word answer bAbI dataset, generated by performing vocabulary replacements in the single-word answer bAbI dataset.

The Stanford Question Answering Dataset (SQuAD) [RZLL16] contains 100,000+ questions labeled by crowd workers on a set of Wikipedia articles. The answer to each question is a segment of text from the corresponding paragraph. In order to convert the data to the input format of our model (shown in Figure 1), we use NLTK to detect the boundaries of sentences and assign an index to each sentence and question, in accordance with the starting index of the answer provided by the crowd workers. The dataset is thus transformed into a question answer dataset containing 18,893 stories and 69,523 questions2. For our experiments, we randomly selected 1,248 questions for training and 1,248 questions for testing. Each answer contains at most five words.
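A hedged sketch of this conversion step: each Wikipedia paragraph is split into indexed sentences with NLTK, and each question is attached to the sentence whose character span contains the start index of its answer. The input format of the triples is illustrative.

from nltk.tokenize import sent_tokenize

def to_story(paragraph, qas):
    """qas: list of (question, answer_text, answer_start) triples (illustrative format)."""
    sents = sent_tokenize(paragraph)
    starts, pos = [], 0
    for s in sents:
        pos = paragraph.find(s, pos)
        starts.append(pos)
        pos += len(s)
    story = list(enumerate(sents, 1))                      # indexed supporting sentences
    for question, answer, ans_start in qas:
        support = max(i for i, st in enumerate(starts, 1) if st <= ans_start)
        story.append((support, question, answer))          # question tagged with its supporting line
    return story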

The single-word answer bAbI dataset [WBC+16] is a synthetic dataset created to benchmark question answering models. It contains 20 types of question answering tasks, and each task comprises a set of statements followed by a single-word answer. For each question, only some of the statements contain the relevant information. The training and test data contain 1,000 examples for each task.

The multi-word answer bAbI dataset. As the goal of the proposed model is to generate multi-word answers, we manually generated a new dataset from the Facebook bAbI dataset, by replacing a few words, such as "bedroom" and "bathroom", with "guest room" and "shower room", respectively. The replacements are listed in Table 1.

Table 1: Replacements made in the vocabulary of the bAbI dataset to generate the multi-word answer bAbI dataset.

Original word → Replacement
hallway → entrance way
bathroom → shower room
office → computer science office
bedroom → guest room
milk → hot water
Bill → Bill Gates
Fred → Fred Bush
Mary → Mary Bush
green → bright green
yellow → bright yellow
hungry → extremely hungry
tired → extremely tired

4.2 Parameters and Baselines

We use 10% of the training data for model validation to choose the best parameters. The best performance was obtained when the learning rate was set to 0.002, the batch size was set to 32, and the weights were initialized randomly from a Gaussian distribution with zero mean and 0.1 variance. The model was trained for 200 epochs. The paragraph2vec model was set to generate 100-dimensional representations for the input sentences and the questions.

2 The dataset can be downloaded from http://www.acsu.buffalo.edu/~fenglong/

We first compare the performance of the proposed LTMN model with a simple Long Short-Term Memory (LSTM) model, as implemented in [SVL14] to predict sequences. The LSTM model works by reading the story until it comes across a question and then outputs an answer, using the information obtained from the sentences read so far. Unlike the LTMN model, it does not have an external memory component.

On the single-word answer bAbI dataset, we also compare our results with those of the attention-based LSTM model (LSTM + Attention) [HKG+15], which propagates dependencies between input sentences using an attention mechanism, MemNN [WCB15], DMN [KIO+16], and MemN2N [SWF+15]. These models cannot be applied as-is to the SQuAD and multi-word answer bAbI datasets because they are only capable of generating single-word answers.

4.3 Evaluation Measures

In order to evaluate the performance of all the methods, the following measures are used (a small sketch of the accuracy measures follows the list):

• Exact Match Accuracy (EMA) represents the ratio of predicted answers which exactly match the true answers.

• Partial Match Accuracy (PMA) is the ratio of generated answers that partially match the correct answers.

• The BLEU score [CC14], widely used to evaluate machine translation models, measures the quality of the generated answers.
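The accuracy measures can be sketched as below. The exact-match definition follows the text; the word-overlap criterion used here for partial matches is an assumption for illustration, and BLEU would normally come from an existing implementation such as NLTK's.

def exact_match_accuracy(predictions, references):
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def partial_match_accuracy(predictions, references):
    # assumption: any shared word between prediction and reference counts as a partial match
    hits = sum(bool(set(p.split()) & set(r.split())) for p, r in zip(predictions, references))
    return hits / len(references)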

Table 2: Test accuracy on the SQuAD dataset.

Measure LSTM LTMN

EMA 8.3 10.6
BLEU 12.4 17.0
PMA 22.8 27.4

4.4 Results

The performance of the LTMN model is shown in Tables 2, 3, and 4 for the SQuAD, single-word answer bAbI and multi-word answer bAbI datasets, respectively.

We observe that LTMN performs better than LSTM in terms of all three evaluation measures, on all the datasets. On the SQuAD dataset, as the vocabulary is large (8,969 words), the LSTM model cannot learn the embedding matrices accurately, leading to its poor performance. However, as the LTMN model employs paragraph2vec, it learns richer vector representations of the sentences and questions. In addition, it can memorize and reason over the facts better than the simple LSTM model. On the multi-word answer bAbI dataset, the LTMN model is significantly better than the LSTM model, especially on tasks 1, 4, 12, 15, 19, and 20. The average EMA, BLEU, and PMA scores of LTMN are about 30% higher than those of the LSTM model. The single-word answer bAbI dataset's vocabulary is small (about 20 words), so we learn the embedding matrices A and B using back-propagation, instead of using paragraph2vec to obtain the vector representations. In Table 3, we observe that the LTMN model achieves accuracy close to that of the strongly supervised MemNN and DMN models on 4 out of the 20 bAbI tasks, despite being weakly supervised, and achieves better accuracy than the weakly supervised LSTM+Attention and MemN2N on 7 tasks. The proposed LTMN model also offers the additional capability of generating multi-word answers, unlike these baseline models.

5 Conclusions

Question answering is an important and challenging task in natural language processing. Traditional question answering approaches are simple query-based approaches, which cannot memorize and reason over the input text. Deep neural networks with memory have been employed in the literature to alleviate this challenge.

In this paper, we proposed the Long-Term Memory Network, a novel recurrent neural network, which can encode raw text information (the input sentences and questions) into vector representations, form memories, find relevant information in the input sentences to answer the questions, and finally generate multi-word answers using a long short-term memory network. The proposed architecture is a weakly supervised model and can be trained end-to-end. Experiments on both synthetic and real-world datasets demonstrate the remarkable performance of the proposed architecture.

In our experiments on the bAbI question answering tasks, we found that the proposed model fails to perform as well as the completely supervised memory networks on certain tasks. In addition, the model performs poorly when the input sentences are very long and the vocabulary is large, as it cannot calculate the supporting facts efficiently. In the future, we plan to expand the model to handle long input sentences, and improve the performance of the proposed network.

Table 3: Test accuracy (EMA) on the single-word answer bAbI dataset

Task    Weakly Supervised    Strongly Supervised

LSTM    LSTM + Attention    MemN2N    LTMN    MemNN    DMN

1: Single Supporting Fact 50 98.1 96 98.2 100 100
2: Two Supporting Facts 20 33.6 61 41.6 100 98.2
3: Three Supporting Facts 20 25.5 30 23.8 100 95.2
4: Two Argument Relations 61 98.5 93 98.1 100 100
5: Three Argument Relations 70 97.8 81 79.5 98 99.3
6: Yes/No Questions 48 55.6 72 81.8 100 100
7: Counting 49 80.0 80 80.2 85 96.9
8: Lists/Sets 45 92.1 77 72.6 91 96.5
9: Simple Negation 64 64.3 72 65.4 100 100
10: Indefinite Knowledge 46 57.2 63 87.0 98 97.5
11: Basic Coreference 62 94.4 89 84.7 100 99.9
12: Conjunction 74 93.6 92 97.9 100 100
13: Compound Coreference 94 94.4 93 90.3 100 99.8
14: Time Reasoning 27 75.3 76 74.3 99 100
15: Basic Deduction 21 57.6 100 100 100 100
16: Basic Induction 23 50.4 46 43.5 100 99.4
17: Positional Reasoning 51 63.1 57 57.0 65 59.6
18: Size Reasoning 52 92.7 90 90.7 95 95.3
19: Path Finding 8 11.5 9 11.4 36 34.5
20: Agent's Motivations 91 98.0 100 100 100 100

Mean (%) 48.8 71.7 73.9 73.9 93.4 93.6

Table 4: Test accuracy on the multi-word answer bAbI dataset.

Task    LSTM    LTMN

EMA BLEU PMA    EMA BLEU PMA

1: Single Supporting Fact 36.5 38.8 41.1 97.0 97.2 97.3
2: Two Supporting Facts 26.6 29.7 32.7 31.3 34.5 37.6
3: Three Supporting Facts 17.1 20.3 23.6 24.5 27.2 29.8
4: Two Argument Relations 48.2 50.1 51.9 97.9 98.0 98.0
5: Three Argument Relations 45.3 49.3 53.2 77.9 80.1 82.2
6: Yes/No Questions 53.8 53.8 53.8 66.1 66.1 66.1
7: Counting 69.5 69.5 69.5 78.4 78.4 78.4
8: Lists/Sets 62.1 66.7 71.8 82.1 85.6 89.3
9: Simple Negation 57.4 57.4 57.4 69.2 69.2 69.2
10: Indefinite Knowledge 44.4 44.4 44.4 84.7 84.7 84.7
11: Basic Coreference 33.1 35.1 37.0 83.3 83.7 84.0
12: Conjunction 33.1 35.7 38.2 99.3 99.3 99.4
13: Compound Coreference 33.6 35.8 37.9 87.7 88.5 89.2
14: Time Reasoning 24.6 24.6 24.6 74.4 74.4 74.4
15: Basic Deduction 46.4 46.4 46.4 100 100 100
16: Basic Induction 46.8 51.6 56.3 42.4 47.0 51.6
17: Positional Reasoning 55.1 55.1 55.1 55.5 55.5 55.5
18: Size Reasoning 51.9 51.9 51.9 89.6 89.6 89.6
19: Path Finding 8.1 35.1 56.4 11.3 59.1 100
20: Agent's Motivations 83.3 84.6 85.3 100 100 100

Mean (%) 42.2 46.8 49.4 72.6 75.9 78.8

References

[BCB15] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[BCFL13] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In EMNLP, 2013.
[BEP+08] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.
[BGWB12] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, 2012.
[BL14] Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. In ACL, 2014.
[BLK+09] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Soren Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia - a crystallization point for the web of data. Web Semantics, 2009.
[CC14] Boxing Chen and Colin Cherry. A systematic comparison of smoothing techniques for sentence-level BLEU. In SMT, 2014.
[CVMBB14] Kyunghyun Cho, Bart Van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[Elm91] Jeffrey L Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 1991.
[GJWCL61] Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. Baseball: an automatic question-answerer. In Western Joint IRE-AIEE-ACM Computer Conference, 1961.
[Gra13] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[GWD14] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
[HKG+15] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, 2015.
[HS97] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
[IBGC+14] Mohit Iyyer, Jordan L Boyd-Graber, Leonardo Max Batista Claudino, Richard Socher, and Hal Daume III. A neural network for factoid question answering over paragraphs. In EMNLP, 2014.
[JM15] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.
[KIO+16] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016.
[LM14] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
[MCZ+17] Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In KDD, 2017.
[MD93] Michael C Mozer and Sreerupa Das. A connectionist symbol manipulator that discovers the structure of context-free languages. In NIPS, 1993.
[Pas03] Marius Pasca. Open-domain question answering from large text collections. Computational Linguistics, 2003.
[RZLL16] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[SMC+17] Qiuling Suo, Fenglong Ma, Giovanni Canino, Jing Gao, Aidong Zhang, Pierangelo Veltri, and Agostino Gnasso. A multi-task framework for monitoring health conditions via attention-based recurrent neural networks. In AMIA, 2017.
[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[SWF+15] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In NIPS, 2015.
[WBC+16] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR, 2016.

[WCB15] Jason Weston, Sumit Chopra, and AntoineBordes. Memory networks. In ICLR, 2015.

[WGL+16] Bingning Wang, Shangmin Guo, Kang Liu,Shizhu He, and Jun Zhao. Employing exter-nal rich knowledge for machine comprehen-sion. In IJCAI, 2016.

[YJW+16] Quanzeng You, Hailin Jin, Zhaowen Wang,Chen Fang, and Jiebo Luo. Image captioningwith semantic attention. In CVPR, 2016.

[ZC05] Luke S. Zettlemoyer and Michael Collins.Learning to map sentences to logical form:Structured classification with probabilisticcategorial grammars. In UAI, 2005.

[ZHLZ16] Yuanzhe Zhang, Shizhu He, Kang Liu, andJun Zhao. A joint model for question answer-ing over multiple knowledge bases. In AAAI,2016.