lidia pivovarova
Post on 10-May-2015
Lidia Pivovarova
PhD student, lecturer, researcher
Saint-Petersburg State University
My supervisor
V. Sh. Rubashkin, Doctor of Technical Sciences, professor, PhD in Philosophy
Our goals (in general)
• Natural Language Understanding (NLU)
• Conceptual Modeling
• The title of my future PhD thesis:
Ontology-based Information Extraction for newspaper texts
Ontology & Ontoeditor
• We are developing a universal ontology
• It means:
– Top-level
– General model applicable to all domains
– Several domains developed in depth
Conceptual model
• Common approach:
– Hierarchy of objects
Our approach: attribute tree
Рубашкин В. Ш. Представление и анализ смысла в интеллектуальных информационных системах [Representation and Analysis of Meaning in Intelligent Information Systems]. М.: Наука, Гл. ред. физ.-мат. лит., 1989. 192 с. (Проблемы искусственного интеллекта). ISBN 5-02-01-4213-1 [In Russian]
InTez ontology
• An attribute tree: objects alternate with attributes
• A small fragment: TRANSPORT
o BY ENERGY SOURCE
– ELECTRIC TRANSPORT
– ATOMIC TRANSPORT
– FUEL TRANSPORT
– WIND-DRIVEN TRANSPORT
o BY ENVIRONMENT TYPE
– AIR TRANSPORT
– WATER TRANSPORT
– LAND TRANSPORT
– SPACE TRANSPORT
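The TRANSPORT fragment can be sketched as a small data structure in which concepts and attributes alternate. This is only an illustration: the class names `Concept` and `Attribute` and the helper `subclasses` are assumptions made here, not part of InTez.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """An object class, e.g. TRANSPORT (hypothetical representation)."""
    name: str
    attributes: list["Attribute"] = field(default_factory=list)


@dataclass
class Attribute:
    """A classification criterion, e.g. BY ENERGY SOURCE."""
    name: str
    values: list[Concept] = field(default_factory=list)


def subclasses(concept: Concept) -> dict[str, list[str]]:
    """Group the direct subclasses by the attribute that distinguishes them."""
    return {a.name: [c.name for c in a.values] for a in concept.attributes}


transport = Concept("TRANSPORT", [
    Attribute("BY ENERGY SOURCE", [Concept(n) for n in (
        "ELECTRIC TRANSPORT", "ATOMIC TRANSPORT",
        "FUEL TRANSPORT", "WIND-DRIVEN TRANSPORT")]),
    Attribute("BY ENVIRONMENT TYPE", [Concept(n) for n in (
        "AIR TRANSPORT", "WATER TRANSPORT",
        "LAND TRANSPORT", "SPACE TRANSPORT")]),
])
```

Calling `subclasses(transport)` returns one list of subclasses per attribute, which is what distinguishes this layout from a plain object hierarchy.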
Attribute tree: the most natural way to present different links between concepts
• value <-> attribute: *great color vs. great volume
• attribute <-> object class: SOLID -> SHAPE vs. *LIQUID -> SHAPE
• extension relations: incompatibility, intersection, inclusion
e.g. HUMAN
AGE
– CHILD
– ADULT
– AGED
SEX
– MALE
– FEMALE
Formal definitions:
Girl = child & female
Boy = child & male
BOY, GIRL: incompatibility
BOY, STUDENT: intersection
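The formal definitions and the three extension relations can be worked through with plain sets. This is a minimal sketch: modelling an individual as an (age, sex) pair and the names `extension` and `relation` are assumptions made for illustration only.

```python
from itertools import product

AGE = ("CHILD", "ADULT", "AGED")
SEX = ("MALE", "FEMALE")
# Toy universe: every individual is one AGE value paired with one SEX value.
UNIVERSE = set(product(AGE, SEX))


def extension(age=None, sex=None):
    """All individuals compatible with the given attribute values."""
    return {(a, s) for a, s in UNIVERSE
            if age in (None, a) and sex in (None, s)}


CHILD = extension(age="CHILD")
MALE = extension(sex="MALE")
FEMALE = extension(sex="FEMALE")
GIRL = CHILD & FEMALE   # Girl = child & female
BOY = CHILD & MALE      # Boy = child & male


def relation(x, y):
    """Classify how two concept extensions relate to each other."""
    if not x & y:
        return "incompatibility"
    if x <= y or y <= x:
        return "inclusion"
    return "intersection"
```

With these definitions, `relation(BOY, GIRL)` is incompatibility, `relation(BOY, CHILD)` is inclusion, and `relation(CHILD, FEMALE)` is intersection, all derived from the attribute values rather than stated by hand.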
«Associative» relations
- unified
• PART -> WHOLE
• OBJECT -> LOCALIZATION
• OBJECT -> FUNCTION
- specialized
• COUNTRY -> CAPITAL
• ORGANIZATION -> CHIEF
Lexicon
- an internal part of the working ontology
[Diagram: CONCEPTS and LEXICAL UNITS]
Lexical units:
• words or collocations
• standard terms (names of ontology concepts) or “synonyms”
Functionality
An ontology is a model of a terminology system.
From the technological point of view, the ontology is a library of program functions (*.dll).
The functions take the form:
– F(1)(D), F(2)(D1, D2),
where D, D1, D2 are ontology concepts.
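To make the F(D) and F(D1, D2) shapes concrete, here is a minimal sketch of one unary and one binary function over concepts. The function names `ancestors` and `is_a` and the toy taxonomy are hypothetical; the slides do not list the actual functions exported by the library.

```python
# Hypothetical toy taxonomy: child concept -> parent concept.
PARENT = {
    "ELECTRIC TRANSPORT": "TRANSPORT",
    "AIR TRANSPORT": "TRANSPORT",
    "TRANSPORT": None,
}


def ancestors(d):
    """Unary function F(D): all concepts above D, nearest first."""
    chain = []
    while PARENT.get(d) is not None:
        d = PARENT[d]
        chain.append(d)
    return chain


def is_a(d1, d2):
    """Binary function F(D1, D2): does concept D1 fall under concept D2?"""
    return d1 == d2 or d2 in ancestors(d1)
```

A caller would only see concept names going in and answers coming out, which fits the description of the ontology as a function library.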
Ontoeditor InTez
Developed by: V.Sh. Rubashkin and B.U. Chuprin
http://inttez.ru/ - in Russian, sorry
Information Extraction
Valery Sh. Rubashkin, Boris Chuprin, Lidia Pivovarova, Anton Babanov, Olga Usmanova
We are developing an intelligent environment that supports the domain expert's activity and can adapt to the features of the texts. The environment has to minimize the expert's effort, not replace the expert.
General System Description
[Diagram components: TEXTS; The Ontology; Morph. analyzer; Semantic analyzer; lemmatization, part-of-speech tagging, semantic mark-up; Search Patterns; Situation State]
The Factors
Factors: the required aspects of information.
~100 factors
Factors:
- qualitative
e.g. social tension, investment attractiveness, level of sovereignty, human rights activity
- quantitative
e.g. the number of unemployed, the average salary, the inflation level, the amount of imports
Numerical values
Qualitative factors:
very small, small, less than average, average, more than average, large, very large.
Quantitative factors:
the number + <unit>
e. g.
the average salary -> monetary unit (ruble, $, …)
the number of unemployed -> no units
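The two kinds of values can be sketched as two small record types. The class names and the `rank` helper are assumptions made for illustration; only the seven-step scale and the "number + unit" shape come from the slides.

```python
from dataclasses import dataclass
from typing import Optional

# The seven-step scale for qualitative factors, smallest first.
QUALITATIVE_SCALE = (
    "very small", "small", "less than average", "average",
    "more than average", "large", "very large",
)


@dataclass(frozen=True)
class QualitativeValue:
    factor: str
    level: str

    def __post_init__(self):
        if self.level not in QUALITATIVE_SCALE:
            raise ValueError(f"unknown level: {self.level!r}")

    @property
    def rank(self) -> int:
        """Position on the scale, 0 for 'very small'."""
        return QUALITATIVE_SCALE.index(self.level)


@dataclass(frozen=True)
class QuantitativeValue:
    factor: str
    number: float
    unit: Optional[str] = None  # None for unitless counts


tension = QualitativeValue("social tension", "large")
salary = QuantitativeValue("the average salary", 30000, "ruble")
unemployed = QuantitativeValue("the number of unemployed", 1200)
```

Keeping the scale ordered makes qualitative values comparable through `rank`, while quantitative values carry their unit explicitly or leave it empty.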
The Patterns
Qualitative factors -> “factor + numerical value” patterns.
e. g. Social tension <-- spontaneous meeting (large)
Quantitative factors -> “only factor” patterns.
e. g. The number of unemployed <-- become unemployed
Search algorithm
1) find a pattern
2) find a number + unit
if none is found:
3) find words such as large, small, increase, decrease, etc.
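The three steps can be sketched as follows. This is a simplified illustration on English text: the function name `extract_value`, the unit list, and the substring-based pattern match are assumptions, not the system's actual matching logic.

```python
import re

# Assumed unit list, for illustration only.
NUMBER_UNIT = re.compile(r"(\d+(?:[.,]\d+)?)\s*(rubles?|euro|\$|%)?")
MAGNITUDE_WORDS = ("large", "small", "increase", "decrease")


def extract_value(sentence, pattern_words):
    """Return a (kind, value, unit) triple, or None if the pattern is absent."""
    text = sentence.lower()
    # 1) find a pattern: here, every pattern word must occur in the sentence
    if not all(word in text for word in pattern_words):
        return None
    # 2) find a number + unit
    match = NUMBER_UNIT.search(text)
    if match:
        return ("number", match.group(1), match.group(2))
    # 3) otherwise fall back to magnitude words
    for token in re.findall(r"[a-z]+", text):
        for word in MAGNITUDE_WORDS:
            if token.startswith(word):
                return ("word", word, None)
    return ("pattern only", None, None)
```

For example, `extract_value("The average salary rose to 30000 rubles.", ["average salary"])` takes the second branch, while a sentence with no digits falls through to the magnitude words.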
Pattern Formation Process
A pattern is a set of words and ontology concepts.
Ontology provides:
- pattern generalization
- synonym accumulation
- information about units
Pattern formation: the user marks a relevant fragment in a text or chooses a concept from the ontology.
Example
As is known, the European Union strictly demanded that Latvia close both generating units of the Ignalinskaya nuclear power station. It also promised to remit 3 billion euro for this purpose.
Factors:
EU pressure on Latvia.
EU financial aid to Latvia.
Discussion
• I am not sure that such things as the level of sovereignty can be found in newspapers
• We had very few examples, so we weren't able to test it
• I think that it is necessary to collaborate with experts (sociologists) to address such a task
• Russian language: very few resources
Ontology Learning
V. Bocharov, L. Pivovarova, V. Rubashkin, B. Chuprin. Ontological Parsing of Encyclopedia Information. In: Computational Linguistics and Intelligent Text Processing, 11th International Conference, CICLing 2010, Iasi, Romania, March 21-27, 2010, Proceedings. Lecture Notes in Computer Science. Springer, Berlin / Heidelberg, 2010, pp. 564-579
~2500 concepts
~1000 words and collocations
Should include ~100000 concepts
- reuse of traditional lexicographical information
A. M. Prohorov (ed.). Russian Encyclopedic Dictionary, Moscow (2001) [In Russian]
- without toponyms and proper names
26,375 entries
21,782 different terms
Ontology learning: our approach
Basic hypothesis
Usually, the hypernym of a dictionary term is the first nominative-case noun of its definition (the “basic word”).
ПЕРИСТИЛЬ – прямоугольный двор, сад, площадь, окруженные с 4 сторон крытой колоннадой.
PERISTYLE – a colonnade surrounding a building or court.
ЯТАГАН – рубяще-колющее оружие (среднее между саблей и кинжалом) у народов Ближнего и Среднего Востока (известно с 16 в.).
YATAGHAN - a long knife or short saber that lacks a guard for the hand at the juncture of blade and hilt and that usually has a double curve to the edge and a nearly straight back.
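The basic hypothesis amounts to a single scan over the tagged definition. The sketch below assumes the definition has already been lemmatized and tagged; the tag names and the hand-assigned analysis of the ЯТАГАН definition are illustrative assumptions, not output of the actual analyzer.

```python
def basic_word(tagged_definition):
    """Return the first nominative-case noun of a definition, or None."""
    for lemma, pos, case in tagged_definition:
        if pos == "NOUN" and case == "nom":
            return lemma
    return None


# Hand-tagged start of the ЯТАГАН definition (bracketed text already removed).
yataghan = [
    ("рубяще-колющий", "ADJ", "nom"),
    ("оружие", "NOUN", "nom"),
    ("у", "PREP", None),
    ("народ", "NOUN", "gen"),
]

# A definition with no nominative noun yields no basic word.
no_noun = [("оканчивающий", "PARTICIPLE", "nom")]
```

Here `basic_word(yataghan)` picks оружие (weapon) as the hypernym candidate, and a definition without a nominative noun returns `None`.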
Basic hypothesis
General framework
Dictionary entry (text + labels + abbreviations)
-> Lexicographical processing -> Dictionary entry (text only)
-> Morphology and syntax -> Dependency tree
-> Relation extraction -> Relation (term <-> basic word)
-> Import to ontology
Lexicographical processing
• term recognition
• replacement of abbreviations by full forms of words
• removal of labels
• elimination of bracketed text
Lexicographical processing
на Сев. Кавказе (at N. Caucasus) -> на Северном Кавказе (at the North Caucasus)
в 18 в. (in 18 c.) -> в 18 веке (in the 18th century)
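Two of these steps, abbreviation replacement and bracket elimination, can be sketched with regular expressions. This is a deliberately naive illustration: the table below hard-codes the inflected full form for the two examples shown, whereas a real system has to choose the correct case form from context.

```python
import re

# Hypothetical abbreviation table; the full forms are fixed to the
# prepositional case needed by the two examples.
ABBREVIATIONS = {
    r"\bСев\.": "Северном",
    r"\bв\.": "веке",
}


def expand_abbreviations(text):
    """Replace each known abbreviation with its full form."""
    for pattern, full_form in ABBREVIATIONS.items():
        text = re.sub(pattern, full_form, text)
    return text


def strip_brackets(text):
    """Remove bracketed text together with the whitespace before it."""
    return re.sub(r"\s*\([^()]*\)", "", text)
```

Note that the preposition «в» in «в 18 в.» is left alone because it is not followed by a period, so only the abbreviation of «веке» is expanded.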
Morphology and syntax
• Simple context-free grammar (noun groups only) – Tomita formalism
• AOT tool to compile grammar (immediate constituent structure)
• Dependency tree
[ANP] -> [ADJ] [NP root]
: $0.grm := case_number_gender($1.grm, $2.type_grm, $2.grm);
[GP] -> [NP root] [NP grm="рд"];
[PP] -> [PREP root] [NP];
[NP] -> [NOUN];
[NP] -> [NP root] [PP] ;
[NP] -> [PP] | [GP] | [ANP];
Morphology and syntax: example
Халат - верхняя одежда у некоторых азиатских народов.
Oriental robe – outdoor clothes of some Asian nations.
Immediate constituent structure
[Figure: constituent tree over ВЕРХНЯЯ ОДЕЖДА У НЕКОТОРЫХ АЗИАТСКИХ НАРОДОВ, with node labels ANP, ANP, ANP, PP, NP]
Dependency tree
[Figure: dependency tree over the same words]
Disambiguation
Average number of lemmas for one word form: 1.27 before syntax, 1.06 after syntax
Average number of morphological analysis outputs for one word form: 2.26 before syntax, 1.64 after syntax
Disambiguation
• о чукотском море (about the Chukchi Sea)
• море
– мор (pestilence), prepositional case, singular, masculine gender;
– море (sea), prepositional case, singular, neuter gender;
– мора (mora), prepositional case, singular, feminine gender
• чукотском (Chukchi): adjective in prepositional case and masculine or neuter gender
• мора has to be rejected
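The rejection of мора follows from adjective-noun agreement, which can be sketched as a filter over candidate readings. The dictionaries below restate the analyses listed on the slide; the function names and feature labels are assumptions made for illustration.

```python
FEATURES = ("case", "number", "gender")

# Candidate readings of the word form «море».
NOUN_READINGS = [
    {"lemma": "мор", "case": "prep", "number": "sg", "gender": "masc"},
    {"lemma": "море", "case": "prep", "number": "sg", "gender": "neut"},
    {"lemma": "мора", "case": "prep", "number": "sg", "gender": "fem"},
]

# Candidate readings of the word form «чукотском».
ADJ_READINGS = [
    {"lemma": "чукотский", "case": "prep", "number": "sg", "gender": "masc"},
    {"lemma": "чукотский", "case": "prep", "number": "sg", "gender": "neut"},
]


def agrees(adjective, noun):
    """An adjective reading agrees with a noun reading on all features."""
    return all(adjective[f] == noun[f] for f in FEATURES)


def filter_noun_readings(adjectives, nouns):
    """Keep only the noun readings that some adjective reading agrees with."""
    return [n for n in nouns if any(agrees(a, n) for a in adjectives)]


surviving = [n["lemma"] for n in filter_noun_readings(ADJ_READINGS, NOUN_READINGS)]
```

After filtering, `surviving` contains мор and море but not мора: syntax narrows the ambiguity without resolving it completely, which matches the drop in the averages reported above.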
Relation recognition
Relation description                      Notation
GENERALIZATION (IS-A), default value      Gen
INSTANCE (reverse to Gen)                 Spec
IDENTITY                                  Same
PART                                      Part
WHOLE (reverse to Part)                   Whole
FUNCTION                                  Func
OTHER                                     Other
Logical-linguistic rules
• a specific rule is attached to a certain word
• it describes, first, the type of relation indicated by this word
• and, second, a directive either to save this word as a basic word or to reject it and take the next basic word candidate
Examples of GENERALIZATION relation rules
Basic word: род, вид, сорт, тип… (kind, sort, type, class, etc.)
Example: ПИДЖИНЫ – тип языков, используемых как средство межэтнического общения в среде разноязычного населения.
PIDGINS – a sort of languages, used for communication between people with different languages.
Rule:
1. Save default type of relation (<Gen>)
2. Save next noun as a basic word
Result:
ПИДЖИН язык GEN
PIDGIN language GEN
Examples of GENERALIZATION relation rules
Basic word: жанр (genre)
Example: МИСТЕРИЯ – жанр средневекового западноевропейского религиозного театра.
MYSTERY – a genre of the religious medieval theatre.
Rule:
1. Save word as a basic word with default relation type
2. Save default type of relation (<Gen>)
3. Save the next noun as a basic word context.
Result:
МИСТЕРИЯ жанр GEN
МИСТЕРИЯ театр GEN
MYSTERY genre GEN
MYSTERY theatre GEN
Main types of rules
1.– save the first basic word;
– change the type of relation;
– save the next basic word.
2.– reject the first basic word;
– change the type of relation;
– save the next basic word.
“Complicated” rules
Basic word: инструмент, прибор, аппарат... (instrument, tool, device, etc.)
Example: ФЕН – электрический аппарат для сушки волос.
HAIRDRYER – an electric device for hair drying.
Rule:
Save word – move to the next preposition
If it is для (for):
- change relation type to <Func>
- save next noun
Result:
ФЕН аппарат GEN
ФЕН сушка FUNC
HAIRDRYER device GEN
HAIRDRYER drying FUNC
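The three example rules can be sketched as a small dispatcher keyed on the first noun of the definition. This is a simplification under stated assumptions: it walks a flat list of (lemma, part of speech) pairs instead of a dependency tree, and the word sets and function names are illustrative, not the system's rule base.

```python
REJECT_AND_TAKE_NEXT = {"род", "вид", "сорт", "тип"}
SAVE_AND_TAKE_NEXT = {"жанр"}
FUNCTION_AFTER_FOR = {"инструмент", "прибор", "аппарат"}


def next_noun(tokens, start):
    """Index of the first noun at or after position start, or None."""
    for i in range(start, len(tokens)):
        if tokens[i][1] == "NOUN":
            return i
    return None


def extract_relations(term, tokens):
    """Return (term, basic word, relation) triples for one definition."""
    i = next_noun(tokens, 0)
    if i is None:
        return []
    word = tokens[i][0]
    if word in REJECT_AND_TAKE_NEXT:
        # Reject the first basic word, keep <Gen>, save the next noun.
        j = next_noun(tokens, i + 1)
        return [(term, tokens[j][0], "Gen")] if j is not None else []
    relations = [(term, word, "Gen")]
    if word in SAVE_AND_TAKE_NEXT:
        j = next_noun(tokens, i + 1)
        if j is not None:
            relations.append((term, tokens[j][0], "Gen"))
    elif word in FUNCTION_AFTER_FOR:
        # Move to the next preposition; only «для» triggers <Func>.
        for k in range(i + 1, len(tokens)):
            if tokens[k][1] == "PREP":
                j = next_noun(tokens, k + 1)
                if tokens[k][0] == "для" and j is not None:
                    relations.append((term, tokens[j][0], "Func"))
                break
    return relations


pidgin = [("тип", "NOUN"), ("язык", "NOUN")]
mystery = [("жанр", "NOUN"), ("средневековый", "ADJ"), ("театр", "NOUN")]
hairdryer = [("электрический", "ADJ"), ("аппарат", "NOUN"),
             ("для", "PREP"), ("сушка", "NOUN"), ("волос", "NOUN")]
```

Run on the three hand-tagged definitions, the sketch reproduces the results shown on the slides; any first noun not covered by a rule is simply saved with the default `Gen` relation.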
OTHER relation
АБОРТ – прерывание беременности в сроки до 28 недель (то есть до момента, когда возможно рождение жизнеспособного плода).
ABORTION – the termination of a pregnancy after, accompanied by, resulting in, or closely followed by the death of the embryo or fetus.
OTHER relation
ХОМИНГ – способность животного возвращаться со значительного расстояния на свой участок обитания, к гнезду, логову и т. д.
HOMING – the ability of an animal to return from a considerable distance to its home range, nest, lair, etc.
OTHER relation
- features: характеристика (characteristic), признак (attribute), свойство (property), число (number), показатель (index), степень (degree), количество (quantity), характер (character), масса (mass), состояние (condition), способность (ability), место (place), источник (source)
- transformations: переход (transition), извлечение (extraction), превращение (transformation), введение (introduction), выделение (emission), возникновение (origination), нарушение (deviation), прерывание (termination), развитие (evolution), образование (formation), увеличение (increase), уменьшение (decrease)
The last slide about rules
• 18 rules
• 91 basic words
• 8484 dictionary entries where rules are used
• 4679 different basic words
• 1978 basic terms
Most frequent basic words
 1  УСТРОЙСТВО    DEVICE       332
 2  МИНЕРАЛ       MINERAL      322
 3  ЕДИНИЦА       UNIT         293
 4  ПРИБОР        INSTRUMENT   292
 5  ВЕЩЕСТВО      SUBSTANCE    277
 6  ПРОЦЕСС       PROCESS      243
 7  ИНСТРУМЕНТ    TOOL         235
 8  ЭЛЕМЕНТ       ELEMENT      228
 9  ЗАБОЛЕВАНИЕ   DISEASE      210
10  НАУКА         DISCIPLINE   199
11  СОЕДИНЕНИЕ    COMPOUND     184
12  БОЛЕЗНЬ       ILLNESS      174
13  ПОРОДА        BREED        170
14  ОРГАН         ORGAN        168
Evaluation
• Expert evaluation, 200 entries
• For 90% of entries (179 of 200), the results obtained by the expert and by our software are identical.
• 21 dictionary entries are processed incorrectly by the program:
– 16 of 21 can be eliminated by minor modifications
– in 5, a basic word is missing from the definition text
Inconvenient entries
• АБРАЗИВНЫЙ ИНСТРУМЕНТ – служит для механической обработки (шлифование, притирка и другие).
• ABRASIVE TOOL – is designed for mechanical processing (grinding, reseating, etc.).
• АБИТУРИЕНТ – оканчивающий среднее учебное заведение.
• COLLEGE APPLICANT – a person graduating from high school.
Import to ontology
Manual process
• choosing a basic word in the ontology taxonomy (attribute tree)
• forming a subset of dictionary entries
• adding subset terms to the ontology
• post-editing
Wikipedia
– Article design … varies
• where is «the first sentence of the definition»?
– Topics … are peculiar
• computer games ~ 2000 articles
– Articles without definitions
• «List of FTP server return codes»
• «March 25 is the 84th day of the year…»
Wikipedia: preliminary results
• Expert evaluation, 500 entries
• For 82% of entries (410 of 500), the results obtained by the expert and by our software are identical.
• 40% of the errors (36 of 90 entries) are due to irregularities in the article texts
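The percentages reported for both evaluations can be rechecked from the raw counts. The helper name `share` is an assumption; all figures are the ones given on the slides.

```python
def share(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Encyclopedia: 179 of 200 entries agree with the expert.
encyclopedia_agreement = share(179, 200)   # 89.5, reported as 90%
encyclopedia_errors = 200 - 179            # 21 = 16 fixable + 5 without a basic word

# Wikipedia: 410 of 500 entries agree with the expert.
wikipedia_agreement = share(410, 500)      # 82.0
wikipedia_errors = 500 - 410               # 90
irregular_share = share(36, wikipedia_errors)  # 40.0
```

The counts are internally consistent; the only rounding involved is the encyclopedia agreement, where 89.5% is reported as 90%.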
Wikipedia vs. Encyclopedia
Basic word                    Wikipedia   Encyclopedia
род (kind)                    3084        58
вид (sort)                    2526        384
образование (formation)       2215        114
название (name)               2129        594
персонаж (character)          1809        8
семейство (family)            1644        7
растение (plant)              1388        146
птица (bird)                  1319        5
единица (unit)                1316        286
система (system)              1239        391
район (region)                1182        9
группа (group)                1077        224
организация (organization)    1005        50
Wikipedia vs. Encyclopedia
• Device
Encyclopedia – 331, Wikipedia – 672
For example: A Stargate is a portal device within the Stargate fictional universe that allows practical, rapid travel between two distant locations
• Science
Encyclopedia – 196, Wikipedia – 338
For example: A vampirology
Future work
1. An improvement of ontoeditor
2. An expansion of syntax
3. An expansion of rules
4. Collocation extraction techniques
5. Better evaluation
6. Studies of dictionary structures
What else about me?
• Teaching: Information Retrieval, Information Systems
• Supervising: Lena Bilyk & Lena Sergeeva, “Citation Extraction from Newspaper Texts”
• Co-organizing: – Natural Language Processing seminar
– Russian Summer School of Information Retrieval
Thank you!