olga khomitsevich - flexible context extraction for keywords in russian automatic speech...

1

TITLE OF PRESENTATION (FORMAT: TAHOMA 27, UPPER CASE)

Subtitle (FORMAT: TAHOMA 22)

FLEXIBLE CONTEXT EXTRACTION FOR KEYWORDS IN RUSSIAN AUTOMATIC SPEECH RECOGNITION RESULTS

O. Khomitsevich, K. Boyarsky, E. Kanevsky, A. Bulusheva, V. [email protected]

Financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0008, ID RFMEFI57914X0008.

AIST 2016


CONTENTS

Introduction
The proposed method
The SemSin system
Rules for context extraction
Examples for context extraction
Experiments and results
Discussion and future developments


INTRODUCTION

Issues
  Keyword search tasks
  Thematic clustering tasks

Existing methods
  Output the whole sentence
  Output a window of n words to the right and left of the keyword

Problems
  The sentence may be very long
  Poorly punctuated recognizer output
  The window may miss important information
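The window baseline listed under the existing methods can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `window_context` is hypothetical, the keyword is matched by exact word form (no lemmatization), and punctuation has been stripped by hand from the news sentence that appears in a later slide of this deck.

```python
def window_context(words, keyword, n):
    """Return, for every occurrence of keyword, the n words on each side of it."""
    contexts = []
    for i, word in enumerate(words):
        if word == keyword:  # assumption: exact form match, no lemmatization
            start = max(0, i - n)
            contexts.append(" ".join(words[start:i + n + 1]))
    return contexts


# Assumption: the sentence is pre-tokenized on whitespace, punctuation removed.
SENTENCE = (
    "Полевой командир талибов Маулави Сангин сообщил в четверг западным "
    "информационным агентствам что военнослужащий США пропавший в афганской "
    "провинции Пактика в конце июня находится в руках боевиков"
).split()

for context in window_context(SENTENCE, "США", 4):
    print(context)
```

With n=4 the window around США stops before the predicate находится, which illustrates the last problem listed above: a fixed window may miss important information.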



THE PROPOSED METHOD



THE SEMSIN SYSTEM

SemSin is based on three databases:

  Morphological database
  Database of idioms
  Database of prepositions

SemSin is a system for syntactic and semantic analysis of Russian text. It combines the functions of a part-of-speech tagger, an ontology and a syntactic parser.



THE SEMSIN SYSTEM

The SemSin system analyses text by paragraph, involving the following steps:

  Each word is processed by the morphological analyser (lemma, POS, grammatical form, semantic class and syntactic dependents).
  The text is tokenized and divided into sentences by the pre-syntax module.
  Syntactic parse trees are constructed for each sentence by applying about 400 rules.



THE SEMSIN SYSTEM

The following features are represented in a resulting xml file:

  Id is the unique ID of the token inside the sentence
  lemma is the base form of the word
  morph contains the information about the POS and grammatical features of the word (animacy, gender, number, case, tense, etc.)
  class is a number that refers to the semantic class of the word
  rel is the tag containing information about relations between words in the sentence
  id_head contains the Id of the parent node
  type indicates the type of the dependency relation between the two words



THE SEMSIN SYSTEM

A fragment of a resulting file

“Саудовская Аравия предпочитает” (Translation: “Saudi Arabia prefers”)

<w Id="1" lemma="САУДОВСКИЙ" morph="ПРИЛ жр,ед,им" class="$715">
  <rel id_head="2" type="Часть_Назв"/>
  Саудовская
</w>
<w Id="2" lemma="АРАВИЯ" morph="СУЩ но,жр,ед,им" class="$1231000">
  <rel id_head="3" type="Субъект"/>
  Аравия
</w>
<w Id="3" lemma="ПРЕДПОЧИТАТЬ" morph="Г пе,нс,дст,нст,3л, ед" class="$1241/41561">
  <rel id_head="" type=""/>
  предпочитает
</w>
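A fragment like this can be read into a list of tokens with Python's standard `xml.etree.ElementTree`. The sketch below is only an illustration of the attributes described on the previous slide: the wrapping `<s>` element, the function name `parse_tokens` and the dictionary keys are assumptions made here, since the slide shows a bare fragment and does not describe the layout of the full SemSin file.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<w Id="1" lemma="САУДОВСКИЙ" morph="ПРИЛ жр,ед,им" class="$715">
  <rel id_head="2" type="Часть_Назв"/> Саудовская </w>
<w Id="2" lemma="АРАВИЯ" morph="СУЩ но,жр,ед,им" class="$1231000">
  <rel id_head="3" type="Субъект"/> Аравия </w>
<w Id="3" lemma="ПРЕДПОЧИТАТЬ" morph="Г пе,нс,дст,нст,3л, ед" class="$1241/41561">
  <rel id_head="" type=""/> предпочитает </w>
"""


def parse_tokens(fragment):
    """Parse <w> elements into dictionaries; head is None for the root word."""
    # Assumption: wrap the bare fragment in a hypothetical <s> root element.
    root = ET.fromstring("<s>" + fragment + "</s>")
    tokens = []
    for w in root.iter("w"):
        rel = w.find("rel")
        head = rel.get("id_head") if rel is not None else ""
        tokens.append({
            "id": int(w.get("Id")),
            # The word form is the text that follows the <rel/> tag.
            "form": "".join(w.itertext()).strip(),
            "lemma": w.get("lemma"),
            "morph": w.get("morph"),
            "sem_class": w.get("class"),
            "head": int(head) if head else None,
            "rel_type": rel.get("type") if rel is not None else "",
        })
    return tokens


for token in parse_tokens(FRAGMENT):
    print(token["id"], token["form"], token["head"], token["rel_type"])
```

An empty `id_head` is mapped to `None`, so the predicate предпочитает comes out as the root of the tree.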



RULES FOR CONTEXT EXTRACTION

The algorithm extracts:

  all words immediately dependent on the keyword;
  the topmost node of the clause (normally the predicate) and all the nodes between it and the target word;
  the subject of the predicate (unless it is already extracted or coincides with the target word);
  the direct object of the predicate, and, for verbs of the class “speech/information/reporting”, the object denoting the content of the report;
  prepositional and other groups linked to the predicate by a “where?”-type link;
  all the words in genitive case that depend on those already extracted.
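The rules above can be sketched over a small hand-built dependency tree. This is a simplified illustration under stated assumptions, not the authors' code: the `Token` class, the relation labels (`subject`, `object`, `content`, `where`, `prep`), the `reporting` flag and the tree itself are hypothetical and are not SemSin output. The input is assumed to be a single clause, so the root of the tree is taken as the topmost node of the clause, and a "where?"-type group is approximated by the linked word plus its preposition.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    id: int
    form: str
    head: Optional[int]      # None marks the topmost node of the clause
    rel: str = ""            # hypothetical relation label to the head
    case: str = ""           # "gen" marks genitive case
    reporting: bool = False  # hypothetical flag for speech/reporting verbs


def extract_context(tokens, keyword_id):
    by_id = {t.id: t for t in tokens}
    children = defaultdict(list)
    for t in tokens:
        if t.head is not None:
            children[t.head].append(t)

    # Rule 1: the keyword and all words immediately dependent on it.
    selected = {keyword_id}
    selected.update(c.id for c in children[keyword_id])

    # Rule 2: climb to the topmost node, keeping every node on the path.
    node = by_id[keyword_id]
    while node.head is not None:
        node = by_id[node.head]
        selected.add(node.id)
    predicate = node

    # Rules 3-5: subject, direct object, report content, "where?" groups.
    for c in children[predicate.id]:
        keep = c.rel in ("subject", "object", "where")
        keep = keep or (c.rel == "content" and predicate.reporting)
        if keep:
            selected.add(c.id)
            if c.rel == "where":
                selected.update(p.id for p in children[c.id] if p.rel == "prep")

    # Rule 6: genitive words depending on already extracted words.
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t.id not in selected and t.case == "gen" and t.head in selected:
                selected.add(t.id)
                changed = True

    return " ".join(by_id[i].form for i in sorted(selected))


# Hand-built, shortened tree for one clause (an assumption, not parser output).
TOKENS = [
    Token(1, "военнослужащий", 7, "subject"),
    Token(2, "США", 1, "attr", case="gen"),
    Token(3, "пропавший", 1, "attr"),
    Token(4, "в", 5, "prep"),
    Token(5, "провинции", 3, "where"),
    Token(6, "Пактика", 5, "name"),
    Token(7, "находится", None),
    Token(8, "в", 9, "prep"),
    Token(9, "руках", 7, "where"),
    Token(10, "боевиков", 9, "attr", case="gen"),
]

print(extract_context(TOKENS, keyword_id=2))
```

The participial group пропавший в провинции Пактика is left out because none of the rules selects it: it does not depend on the keyword, is not on the path to the predicate, and is not linked directly to the predicate.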



EXAMPLE FOR CONTEXT EXTRACTION

News articles:

Keyword: США (Translation: “USA”).

The original transcript: Полевой командир талибов Маулави Сангин сообщил в четверг западным информационным агентствам, что военнослужащий США, пропавший в афганской провинции Пактика в конце июня, находится в руках боевиков. (Translation: “Taliban field commander Mawlawi Sangin informed Western information agencies on Thursday that the USA serviceman who went missing in the Afghan Paktika province at the end of June is in the hands of militants”).

Context: военнослужащий США находится в руках боевиков (Translation: “the USA serviceman is in the hands of militants”)



EXAMPLE FOR CONTEXT EXTRACTION

Recognition output:

Keyword: льготный (Translation: “relating to benefits”).

The recognized transcript: меня очень интересует, почему у нас так плохо стало с с лекарством бы льготным лекарствам. (Approximate translation: “I’m really interested why for us it has become so bad with with medicine to benefit medicines”).

The original transcript: меня очень интересует, почему у нас так плохо стало с лекарством, льготным лекарством (Translation: “I’m really interested why for us it has become so bad with a medicine, a benefit medicine”).

Context: у нас плохо стало льготным лекарствам (Translation: “for us it has become bad to benefit medicines”)



EXPERIMENTS AND RESULTS

News articles

20 human experts
2 context quality measures (from 1 to 10, higher is better): completeness and conciseness

Test case: 500 sentences from news articles, 55 keywords. 237 contexts were extracted.

Algorithm                     Avg. completeness   Avg. conciseness
Context with window n=4       6.2                 7.64
Context with window n=5       6.74                7.3
Flexible context extraction   7.34                8.5



EXPERIMENTS AND RESULTS

Recognition output

Test case: 2000 sentences on social topics, produced by a Russian ASR system with 80% accuracy. 23 keywords. 223 contexts were extracted.

Algorithm                     Avg. completeness   Avg. conciseness
Context with window n=4       7.59                7.44
Context with window n=5       7.92                7.24
Flexible context extraction   7.41                8.28



DISCUSSION AND FUTURE DEVELOPMENTS

We are going to:

  add new syntactic dependencies;
  make the context shorter or longer according to the user’s needs;
  include more advanced NLP methods;
  make the syntactic parser more robust for spontaneous speech recognition results;
  test the use of the extracted contexts in a clustering task.



THANK YOU

CONTACTS

Russia
4 Krasutskogo street, St. Petersburg, 196084
Tel.: +7 812 325-8848
Fax: +7 812 327 9297
Email: [email protected]

USA
Suite 316, 369 Lexington ave
New York, NY, 10017
Tel.: +1 646 237 7895
Email: [email protected]

ABOUT THE COMPANY

STC-Innovations is a leader in the multimodal biometric market. STC-Innovations develops multimodal biometric solutions based on person-identifying technologies via voice, face and other noncontact biometric features.

STC-Innovations is a spin-off company of the Speech Technologies Center, a leading global provider of innovative systems in high-quality recording, audio and video processing and analysis, speech synthesis and recognition, and real-time, high-accuracy voice and facial biometrics solutions, with over 20 years of research, development and implementation experience in Russia and internationally. STC is ISO 9001:2008 certified.

