Information Access I Multilingual Text Summarization GSLT, Göteborg, October 2003 Barbara Gawronska, Högskolan i Skövde


Types of summaries (Spärck Jones 1999, Hovy & Lin 1999)

With respect to content:
• Indicative: provide an idea of what the text is about, but do not render the content
• Informative: shortened versions of the text

With respect to the way of creating:
• Extracts: reused portions of the text
• Abstracts: re-generated text reflecting the important content
• Compressed texts (Knight & Marcu 2000): compressing syntactic parse trees in order to get a shorter text

Text compression (Knight & Marcu 2000, Lin 2003)

”Given the original sentence t, find the best short sentence s generated from t, i.e. maximize P(s|t).”

Original sentence (Lin 2003):

In Louisiana, the hurricane landed with wind speeds of about 120 miles per hour and caused severe damage in small coastal centres such as Morgan City, Franklin and New Iberia

Text compression (2) (fragments of Fig. 1 in Lin 2003)

Number of words | Adjusted log-prob | Raw log-prob | Sentence
14 | -9.212 | -128.967 | In Louisiana, the hurricane landed with wind speeds of about 120 miles per hour.
14 | -9.216 | -129.022 | The hurricane landed and caused severe damage in small centres such as Morgan City.
12 | -9.252 | -111.020 | In Louisiana, the hurricane landed with wind speeds and caused severe damage.
14 | -9.315 | -130.406 | In Louisiana the hurricane landed with wind speeds of about 120 miles per hour.
12 | -9.372 | -112.459 | In Louisiana the hurricane landed with wind speeds and caused severe damage.
12 | -9.680 | -116.158 | The hurricane landed with wind speeds of about 120 miles per hour.
10 | -9.821 | -98.210 | The hurricane landed with wind speeds and caused severe damage.
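The adjusted scores in the fragment above are numerically consistent with dividing the raw log-probability by the number of words, which keeps shorter candidates from winning merely because they are short. The sketch below checks this on three of the rows and ranks them; the function and variable names are illustrative assumptions, not taken from Lin (2003).

```python
# Length-normalised ranking of compression candidates (illustrative sketch).
# Each tuple: (number of words, raw log-prob, sentence), copied from the table.
candidates = [
    (14, -128.967, "In Louisiana, the hurricane landed with wind speeds "
                   "of about 120 miles per hour."),
    (12, -111.020, "In Louisiana, the hurricane landed with wind speeds "
                   "and caused severe damage."),
    (10, -98.210, "The hurricane landed with wind speeds and caused "
                  "severe damage."),
]


def adjusted_log_prob(raw_log_prob: float, n_words: int) -> float:
    """Per-word log-probability (assumed normalisation: raw / length)."""
    return raw_log_prob / n_words


# Higher (less negative) adjusted score ranks first.
ranked = sorted(candidates,
                key=lambda c: adjusted_log_prob(c[1], c[0]),
                reverse=True)

for n_words, raw, sentence in ranked:
    print(f"{adjusted_log_prob(raw, n_words):.3f}  {sentence}")
```

On raw log-probability alone the 10-word candidate would rank first; after normalisation the 14-word candidate does, as in the table.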

Different genres and tasks require different summaries (informative summaries are not so good for detective stories)

and

Different texts require different summarization techniques

A special case: dialogue summarization: selecting successful ’dialog transactions’, the game-theoretical approach (Verbmobil: Wahlster, Alexandersson)

Multilingual summarization:
• Extracting/compressing + MT, or
• Abstracting + multilingual generation

A possible combination system including multilingual summarization of news reports

[Figure: system diagram. Components: lexical databases and grammar rules; speech recognition; parsing; information extraction; key words and phrases; news reports; text generation; speech synthesis.]

The main objectives of the Newspeak project:

• Evaluation of different methods of semantic classification in the lexicon
• Development of a summarization module that would be well-suited for the news domain
• A comparison between ‘traditional’ machine translation (MT) on the one side, and information extraction (IE) combined with reading comprehension (RC) and multilingual text generation (MTG) on the other side
• Exploration of the interplay between textual structure, syntax, and prosodic markers

GUERILLA FIGHTS IN LEBANON

Israeli warplanes and artillery attacked suspected guerrilla hideouts Friday following a series of clashes in south Lebanon. Four guerrillas were reportedly killed. Guerrillas of the Syrian-backed Amal group attacked Israeli and allied militia positions in the Israeli-occupied zone at daybreak, Lebanese security officials said. Three guerrillas were killed in the assaults, said an Israeli army spokesman in Jerusalem. Amal said none of its fighters was killed.

One of the main problems with media texts: no possibility of stating what is a true fact (hence, some criticism could be raised against TREC factoid questions...)

The Theory of Mental Spaces (Fauconnier 1985, Fauconnier and Sweetser 1996)

Max believes the woman with green eyes has blue eyes

[Figure: base space B contains element a (a: green eyes); belief space M contains element a1 (a1: blue eyes).]

The notion of ’mental spaces’ (Fauconnier 1985, Sweetser & Fauconnier 1996, Sanders & Redeker 1996)

"The man was clearly on the run from the police", the spokesman said.

[Figure: spaces B1 and M = B2, connected by ”said”; s in B1; x(m) and x'(m') linked by an identity (ID) connector.]

s = spokesman
m = the man
x = clearly on the run from the police
B1 = Narrator's reality, base space
M = Character's reality, embedded space
B2 = Character's reality, new base space

According to the spokesman, the man was "clearly on the run from the police".

[Figure: spaces B1, M and B2, with M introduced by ”according to”; s in B1; x, x'(m) and x''(m') linked by an identity (ID) connector; the quoted material is marked with quotation marks.]

s = spokesman
m = the man
x = clearly on the run from the police
B1 = Narrator's reality, base space
M = Character's reality, embedded space
B2 = Character's reality, new base space


’Mental Spaces’ in sample text 1

M: Sender: news agency
M1: Israel attacks guerrilla hideouts; four guerrillas killed. Time: Friday
M2: Sender: Lebanese security officials. Amal attacked Israel. Place: Israeli-occupied zone. Time: Saturday daybreak
M3: Sender: Israeli army spokesman. Result in M3: three guerrillas dead
M4: Sender: Amal. Result in M4: no guerrillas dead
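One way to picture the spaces listed for sample text 1 is as records that keep each claim attached to its sender, so that contradictory results (M3 vs. M4) can coexist without the system deciding which one is true. The following is a minimal sketch under that assumption; the class and field names are hypothetical and not taken from the Newspeak implementation, which worked on Prolog lists.

```python
from dataclasses import dataclass, field


@dataclass
class MentalSpace:
    """A claim container tied to the sender who is responsible for it."""
    name: str
    sender: str
    claims: dict = field(default_factory=dict)


def conflicting(a: MentalSpace, b: MentalSpace, key: str) -> bool:
    """True if both spaces make a claim about `key` and the values differ."""
    return key in a.claims and key in b.claims and a.claims[key] != b.claims[key]


def attributed(space: MentalSpace, key: str) -> str:
    """Render a claim together with its sender."""
    return f"According to {space.sender}: {key} = {space.claims[key]}"


# Values follow the slide for sample text 1.
m3 = MentalSpace("M3", "an Israeli army spokesman", {"guerrillas_dead": 3})
m4 = MentalSpace("M4", "Amal", {"guerrillas_dead": 0})

if conflicting(m3, m4, "guerrillas_dead"):
    # Neither claim is promoted to a fact; both are kept with their sources.
    for space in (m3, m4):
        print(attributed(space, "guerrillas_dead"))
```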

Sample text 2

BEIT JALA, West Bank

Israeli troops pulled out of Beit Jala before dawn on Thursday, leaving the Palestinian town quiet amid reports of fresh violence in other West Bank towns. The Palestinians said the Israel Defence Forces had staged incursions into Hebron, killing one and injuring 16 others, and Tulkarem, killing one and injuring 10. The Israel Defence Forces (IDF) had no immediate comment on the accusation that troops had entered Tulkarem, and strongly denied there was an incursion at Hebron.

’Mental Spaces’ in sample text 2

M: inf_source: CNN; place: Beit Jala; time: August 30 2001
M1: inf_source: news agency; claim: [place: Beit Jala, time: before dawn on Thursday, action: Israeli troops pulled out, result: no dead, no injured]
M2: inf_source: reports; claim: [place: West Bank towns, time: Thursday, action: violence, result: not known]
M3: inf_source: Palestinians
M3a: claim: [place: Hebron, time: Thursday, action: Israeli incursions, result: 1 dead, 16 injured]
M3b: claim: [place: Tulkarem, time: Thursday, action: Israeli incursions, result: 1 dead, 10 injured]
M4: inf_source: The Israel Defense Forces; claim: the event reported in M3a is not true

Newspeak: the extraction and generation modules

[Figure: processing chain with supporting resources.]

Processing steps:
• English news
• The reading component: a procedure transforming the news text file into Prolog lists
• Preparsing and preliminary subdomain identification
• Identification of mental spaces
• Template filling
• Generation of a summary in the required target language, provided with prosodic markers
• Infovox text-to-speech system

Resources:
• English lexicon
• Subdomain-specific templates, lists of key words and semantic features
• Target language lexicon and grammar

WordNet classification: exploding objects (missile, bomb)

• anything having existence (living or nonliving)
• a physical (tangible and visible) entity
• a man-made object
• an artifact (or system of artifacts) that is instrumental in accomplishing some end
• something that serves as a means of transportation
• weapons considered collectively
• weaponry used in fighting or hunting
• a conveyance that transports people or objects
• any vehicle propelled by a rocket engine
• a rocket-propelled vehicle carrying passengers or instruments or a warhead (missile)
• a body that is thrown or projected (missile)
• an instrumentality invented for a particular purpose
• bursts with sudden violence from internal energy
• an explosive device fused to detonate under specific conditions (bomb)

WordNet vs. Newspeak noun classification

Synset (WordNet definition) | Sample word | Category in Newspeak | Dominating category in Newspeak
a group of people who work together | army, military, troops, organisation, agency, party | Group of people |
group of people willing to obey orders | army, military, troops, police… | Armed forces | Group of people
a conveyance that transports people or objects | tank, missile, rocket… | Means of transportation |
a vehicle that moves on wheels and usually has a container for transporting things or people | tank, car… | Means of transportation, medium: earth surface | Means of transportation
a vehicle used by the armed forces | tank | Destruction as function |
bursts with sudden violence from internal energy | bomb | Destruction as function |

WordNet vs. Newspeak noun classification (2)

Synset (WordNet definition) | Sample word | Category in Newspeak | Dominating category in Newspeak
a speech act that conveys information | report | Source of information |
a speech act that conveys information | report | neutral | Speech act
a means or instrumentality for communicating | radio, newspaper | Source of information |
a large indefinite location on the surface of the Earth | region, town, country | place 2D + conv. borders | Place
any road or path affording passage from one place to another | street | place-path (1D) | Place
a structure that has a roof and walls and stands more or less permanently in one place | restaurant, church | place_3D | Place
the solid part of the earth's surface | peninsula, mountains | place 2D + natural borders | Place

The outline of the summarization process

Processing steps:
• English news
• Tokenization
• Named Entity Recognition
• Semantic classification, identification of closed-class words
• Identification of words denoting speech acts
• Identification of ”senders” (coreference identification included)
• Identification of ”Mental Spaces”, selection decisions
• ”Squeezing” or template filling
• TL-summary generation

Resources:
• Closed-class lexicon
• VerbNet
• WordNet
• Domain-specific reclassification patterns
• Subdomain-specific templates
• TL-generators

Named Entity Recognition and Classification

[Figure: flowchart; the Yes/No branch labels are not recoverable from the transcript. Boxes and decision nodes:]
• Start
• Place pointer at the first word in the sentence
• First letter uppercase?
• Word in ’NO-ProperName’ DB?
• Add to Proper Name Candidate String
• Word in Proper Name Indicator DB?
• Proper Name Candidate String empty?
• The 1st word in Proper Name Candidate String = 2nd word in the sentence?
• The 1st word in the sentence = closed-class word?
• Add to Proper Name Candidate String (initial position)
• Semantic Classification of Proper Name (clear Proper Name string)
• Closed-class word?
• Semantic Classification
• Is the word all uppercase and more than one token?
• Is the sentence's first word classified?
• More words in the sentence?
• Move pointer to next word
• End
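Since the branch structure of the flowchart did not survive extraction, the following is only a loose sketch of the core idea it names: walk through the sentence, collect runs of capitalised words into a proper-name candidate string, skip words listed in a ’NO-ProperName’ database, and treat the sentence-initial word specially because its capital letter is uninformative. The two word lists and the function name are hypothetical placeholders, and the subsequent semantic classification step is omitted.

```python
# Hypothetical stand-ins for the closed-class lexicon and the
# 'NO-ProperName' database mentioned in the flowchart.
CLOSED_CLASS = {"the", "a", "an", "in", "on", "of", "and", "his", "is"}
NO_PROPER_NAME = {"Monday", "Tuesday", "Wednesday", "Thursday", "Friday"}


def proper_name_candidates(tokens):
    """Collect maximal runs of capitalised tokens as proper-name candidates."""
    candidates = []
    current = []  # the running Proper Name Candidate String
    for i, tok in enumerate(tokens):
        capitalised = tok[:1].isupper()
        # A sentence-initial closed-class word is capitalised only by position.
        initial_function_word = i == 0 and tok.lower() in CLOSED_CLASS
        if capitalised and tok not in NO_PROPER_NAME and not initial_function_word:
            current.append(tok)
        elif current:
            # The run has ended: emit the candidate and clear the string.
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


sentence = "The Palestinians said the Israel Defence Forces had staged incursions into Hebron"
print(proper_name_candidates(sentence.split()))
```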

Iraqi President Saddam Hussein is striking a defiant tone a day after U.S. President George Bush's State of the Union address, saying his nation is ready to "destroy and defeat" any American attack.

In a televised meeting with his military commanders on Wednesday, Saddam said the U.S. had no right to attack his country, and every American soldier is coming "as an aggressor."

"If they have illusions, by God, America will be harmed," the Iraqi leader said. "[It is] not in the American people's interest that such harm come to it, its reputation and economy."

In a powerful address Tuesday evening, Bush braced Americans and the rest of the world for a possible war with Iraq, warning that America was determined in its resolve to see Saddam disarmed.

Sample text 3

[source(semcat(Iraqi President Saddam Hussein,[propername,human([]),human([high_status])])),semcat(tone,[[],speech_act(manner)]),circ([semcat(is,[[],cop([])]),semcat(striking,[[],[]]),semcat(a,[[],det([])]),semcat(defiant,[[],[]])]),said([semcat(a,[[],det([])]),semcat(day,[[],time_period([])]),semcat(after,[[],prep([])]),semcat(U.S. President George Bush_s State,[propername,place([country]),group_of_people([]),human([high_status]),human([]),place([d23,convent_borders])]),semcat(of,[[],prep([])]),semcat(the,[[],det([])]),semcat(Union,[propername,explosion([]),group_of_people([]),place([country])]),semcat(address,[[],speech_act([neutral]),place([d2])]),semcat(saying,[[],say_verb([neutral])]),semcat(his,[[],poss([])]),semcat(nation,[[],place([country]),group_of_people([])]),semcat(is,[[],cop([])]),semcat(ready,[[],[]]),semcat(to,[[],prep([])]),semcat(",[[],[]]),semcat(destroy,[[],[]]),semcat(and,[[],konj([])]),semcat(defeat,[[],[]]),semcat(",[[],[]]),semcat(any,[[],det([])]),semcat(American,[propername,human([])]),semcat(attack,[[],military_operation([])]),semcat(.,[[],[]])]),[]]

[source(semcat(Saddam,[propername,[]])),semcat(said,[[],say_verb([neutral])])…

Sample output from SemCat + speaker and speech act identification

coreference checked

[source(semcat(Iraqi President Saddam Hussein,[propername,human([]),human([high_status])])),semcat(said,[[],say_verb([])]),…

Sample output from SemCat + speaker and speech act identification (2)
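The difference between the two outputs is that the short sender ”Saddam” is replaced by the full, already classified name once coreference has been checked. How Newspeak performed this check is not stated in the slides; the sketch below assumes a simple token-overlap heuristic (a later mention is resolved to the most recent earlier sender whose name contains all of its tokens). The function name is hypothetical.

```python
def resolve_sender(mention, earlier_senders):
    """Resolve a short sender mention to an earlier, fuller sender name.

    Assumed heuristic: pick the most recent earlier sender whose tokens
    include every token of the mention; otherwise keep the mention as-is.
    """
    mention_tokens = mention.split()
    for full_name in reversed(earlier_senders):
        full_tokens = full_name.split()
        if all(tok in full_tokens for tok in mention_tokens):
            return full_name
    return mention


senders_so_far = ["Iraqi President Saddam Hussein"]
print(resolve_sender("Saddam", senders_so_far))
```

A mention with no matching earlier sender, such as ”Bush” in this list, is returned unchanged and would then be classified on its own.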

Searle’s classification of illocutionary acts

Macro-class | Words-world relation | The psychological state of the sender | Sample verbs
Representatives | The speaker fits his words to the world | Belief that p | Claim, announce, forecast, predict
Directives | Attempt to achieve a situation where the world fits the words | Wanting that p | Ask, beg, order, forbid, instruct
Commissives | Commit the speaker to act in order to fit the world to the words | Intending p | Promise, offer, swear, threaten
Declarations | Alter the world | | Wed, baptise, name, call, dub
Expressives | No dynamic world-words relationship | Specified in the sincerity condition expressed by the propositional content | Thank, apologize, congratulate, regret, pardon

The classification of speech act phrases in the Newspeak lexicon (1)

Phrases | Intention: S wants R to believe that… | Feature(s) in the system lexicon | Macro group in the system lexicon
say, claim, report, announce, inform | p | informative(neutral) | informative
confirm | p | informative(positive) | informative
deny | not p | informative(negative) | informative
call p p’ | p = p’ | interpretation(neutral) | interpretation
condemn | p & negative(p) | interpretation(negative) | interpretation
praise | p & positive(p) | interpretation(positive) | interpretation
forecast, predict, assume, hypothesise | p is placed in a hypothetical (future) mental space | hypothese(neutral) | hypothese

The classification of speech act phrases in the system lexicon (2)

Phrases | Intention: S wants R to believe that… | Feature(s) in the system lexicon | Macro group in the system lexicon
offer, promise, swear | p is placed in a hypothetical (future) mental space & positive(p) | hypothese(positive) | hypothese
warn, threaten | p is placed in a hypothetical (future) mental space & negative(p) | hypothese(negative) | hypothese
blame x on p, accuse x of p | p & cause(x,p) & negative(p) | cause_interpretation(negative) | interpretation
suspect x for p | p & negative(p) & in a hypothetical space cause(x,p) | hypothetical_cause_interpretation | interpretation, hypothese
declined to say, refused to say, neither confirmed nor denied, had no comments | preferably not p | utterance_refusal | utterance_refusal
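The two classification tables can be read as a lookup from a speech act phrase to a macro group plus a polarity feature. A minimal sketch of such a lookup follows; the feature names (including the spelling `hypothese`) are taken from the tables, while the dictionary layout and the function name are assumptions made for illustration. Only single-word entries are included.

```python
# (macro group, polarity) per speech act verb, following the tables.
SPEECH_ACT_LEXICON = {
    **{v: ("informative", "neutral")
       for v in ("say", "claim", "report", "announce", "inform")},
    "confirm": ("informative", "positive"),
    "deny": ("informative", "negative"),
    "condemn": ("interpretation", "negative"),
    **{v: ("hypothese", "neutral")
       for v in ("forecast", "predict", "assume", "hypothesise")},
    **{v: ("hypothese", "positive") for v in ("offer", "promise", "swear")},
    **{v: ("hypothese", "negative") for v in ("warn", "threaten")},
}


def classify_speech_act(verb):
    """Return (macro group, polarity), or None for verbs not in the lexicon."""
    return SPEECH_ACT_LEXICON.get(verb)


for verb in ("say", "deny", "threaten", "walk"):
    print(verb, classify_speech_act(verb))
```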

Some principles for selection of claims to be rendered:

1) Informatives:

• Neutral, the sender is not marked for high status: officials said, the news agency reported, reportedly…
A claim p introduced by a neutral informative is rendered in the summary; the source is omitted if there are no denials or confirmations of p in the text and if the source is not marked for high status, like ‘President’.

• Neutral, the sender marked for high status, and ‘declarations’: the President said…, the government condemned…
The source is rendered if it is marked for high status.

• Affirmative; confirmations of explicit claims: Israeli sources confirmed that…
Confirmations of previous explicit claims are omitted in the summary.

• Affirmative; confirmations of claims that are not explicitly mentioned:
Both the information source and the claim, including the type of the speech act phrase, are rendered in the summary if the speech act is a confirmation of a claim not present in the news report.

1) Informatives (continued):

• Negative, or neutral followed by denied claims: The president denied…, The Israeli source said that it is not true…
Both the initial claim and its denial are rendered in the summary together with the information about the senders.

2) Utterance refusal, negated speech act phrases, hypotheses, commissives, interpretations: The Israeli sources neither denied nor confirmed…, the minister did not say if…, the defense secretary declined to say…, the government had no immediate comments…

Utterance refusals or negated speech act phrases related to an explicit claim are omitted.

If a source refuses to confirm/deny a claim that has not been explicitly mentioned in the previous part of the text, the whole speech act is rendered, including the type of the speech act.

Hypotheses and commissives are rendered together with their sources and marked for unsure epistemic status.

3) Epistemic spaces: e.g. no one knows if the device was planted deliberately or if it was left over from New Year’s Eve

If two claims would exclude each other in the same mental space, and if no source in the text takes responsibility for either of these claims, both claims are to be rendered as hypotheses.
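The selection principles for informatives, utterance refusals and hypotheses can be sketched as a small rule function. This is an illustrative reading of the rules as stated on the slides, not the Newspeak code: the `Claim` fields, the function name and the output wording are all assumptions, and interpretations and epistemic spaces are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    source: str
    group: str      # 'informative', 'hypothese' or 'utterance_refusal'
    polarity: str   # 'neutral', 'positive' or 'negative'
    content: str
    high_status: bool = False        # e.g. 'President'
    about_explicit_claim: bool = False  # refers to a claim already in the text
    contested: bool = False          # denied or confirmed elsewhere in the text


def render(claim: Claim) -> Optional[str]:
    """Return the summary rendering of a claim, or None if it is omitted."""
    if claim.group == "utterance_refusal":
        if claim.about_explicit_claim:
            return None
        return f"{claim.source} declined to say whether {claim.content}"
    if claim.group == "hypothese":
        # Rendered with its source and marked as epistemically unsure.
        return f"According to {claim.source}, possibly: {claim.content}"
    if claim.group == "informative":
        if claim.polarity == "positive":
            if claim.about_explicit_claim:
                return None  # confirmation of an explicit claim is omitted
            return f"{claim.source} confirmed that {claim.content}"
        if claim.polarity == "negative":
            return f"{claim.source} denied that {claim.content}"
        if claim.high_status or claim.contested:
            return f"{claim.source} said that {claim.content}"
        return claim.content  # neutral, low-status, uncontested: drop source
    return None


print(render(Claim("officials", "informative", "neutral",
                   "four guerrillas were killed")))
```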

Sample input text

RAMALLAH, West Bank -- Palestinian leader Yasser Arafat said Thursday that elections as part of a reform of the Palestinian Authority will be held this winter, whether or not Israeli forces withdraw from the Palestinian territories. That represented a change of course from Arafat, who said last week that no elections would be held until the Israelis pulled back. Shortly after Arafat's announcement, a committee he had appointed to set up elections resigned, according to Israel Radio, because Arafat would not agree to a specific date for the elections. Other Palestinian leaders said the resignations were a procedural matter. Arafat also condemned Wednesday's suicide bombing in the Israeli town of Rishon Letzion. Two Israelis were killed and at least 37 others wounded when the bomber detonated explosives in the center of a crowded pedestrian district. The terror attack marked the second time in two weeks a suicide bombing directed at civilians has rocked Rishon Letzion, a town about 15 miles southeast of Tel Aviv. On May 8, a suicide attack at a pool hall killed 15 people and wounded dozens of others. "Suddenly there was an explosion," 16-year-old Shmuel Voller told The Associated Press on Wednesday. The bombing occurred on Rothschild Street in the heart of the town around 9:15 p.m. (2:15 p.m. ET).

Generation: sample summary

RAMALLAH, West Bank -- Palestinian leader Yasser Arafat said Thursday that elections as part of a reform of the Palestinian Authority will take place this winter, whether or not Israeli forces withdraw from the Palestinian territories. On Wednesday, a suicide bombing took place in the Israeli town of Rishon Letzion, on Rothschild Street in the center of a crowded pedestrian district, around 9:15 p.m. (2:15 p.m. ET). Two Israelis were killed and at least 37 others wounded. Arafat condemned the attack.

Swedish:
Israeliska trupper tågade ut ur Beit Jala
Israeli+pl troops marched out of/left Beit Jala
(tågade ut ur instead of *drog ut av)

Polish:
Wojska izraelskie wycofały się z Beit Jala
Troops Israeli backed out from Beit Jala
(wycofały się instead of *wyciągnęły or *wyciągały)

Generation

TL vocabulary is more restricted than SL vocabulary

TL patterns fit textual/semantic relations

E: A bomb exploded in Bilbao, Spain, early Friday morning.
S: En bomb exploderade i den spanska staden Bilbao tidigt på fredagsmorgonen.
a bomb explode-past in def Spanish city Bilbao early on Friday-morning-def

E: There were no injuries.
S: Inga personskador rapporterades.
no person-injuries report-past-passive

E: ETA is suspected of being responsible for the attack.
S: Förmodligen ligger ETA bakom bombdådet.
presumably lie-pres ETA behind bomb-outrage-def

The grammatical and semantic characteristics of Polish nouns

Animacy degree | Grammatical gender | Semantic features | Accusative form | Adjective ending in plural | Verb ending in plural, past tense
inanimate | +ma/+fe, or +ne | -alive (for +ma/+fe), +/- alive (for +ne) | acc=nom | -e | -ły
semianimate | +ma | -alive, +mobile or +spherical | sg: acc=gen or acc=nom, pl: acc=nom | -e | -ły
animate | +ma/+fe | +alive | sg: acc=gen, pl: acc=nom | -e | -ły
superanimate | +ma | +human | acc=gen | -i/-y | -li

Krzesła sta-ły
Chair+PL stand+PAST+PL
’The chairs were standing (there)’

Psy sta-ły
Dog+PL stand+PAST+PL
’The dogs were standing (there)’

Duchy sta-ły
Ghost+PL stand+PAST+PL
’The ghosts were standing (there)’

Dziewczynki sta-ły
Girl+PL stand+PAST+PL
’The girls were standing (there)’

Chłopcy sta-li
Boy+PL stand+PAST+PL+MALE+HUMAN
’The boys were standing (there)’
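The agreement pattern in these examples reduces to one decision in generation: the plural past-tense verb takes -li only when the subject is superanimate (+ma, +human), and -ły otherwise. A minimal sketch of that choice follows; the function name and the boolean parameters are illustrative, and the stem is taken as given.

```python
def past_plural(stem: str, masculine: bool, human: bool) -> str:
    """Plural past-tense form: -li for +ma,+human subjects, else -ły."""
    return stem + ("li" if masculine and human else "ły")


# Subjects from the examples: (noun, +ma, +human)
subjects = [
    ("Krzesła", False, False),      # chairs
    ("Psy", True, False),           # dogs
    ("Duchy", True, False),         # ghosts
    ("Dziewczynki", False, True),   # girls
    ("Chłopcy", True, True),        # boys
]

for noun, masculine, human in subjects:
    print(noun, past_plural("sta", masculine, human))
```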

Extracting ’superanimate’ nouns (1)

Pojawili się więc Algierczycy, Jemeńczycy, obywatele Bangladeszu, Uzbecy, Kirgizi i Tadżycy.
’There arrived Algerians, Yemenis, citizens of Bangladesh, Uzbeks, Kirgizis, and Tadjiks’

Stop-list and a suffix list with declension numbers

Stop-list item: Pojawili | v | hum | ma | pl

PolWN Database:

Word | Cat | SemCat | Gender | Number | Case | Decl
Algierczycy | n | hum | ma | pl | nom | 35
Jemeńczycy | n | hum | ma | pl | nom | 35
obywatele | n | hum | ma | pl | nom | 14
Uzbecy | n | hum | ma | pl | nom | 36
Kirgizi | n | hum | ma | pl | nom | 38
Tadżycy | n | hum | ma | pl | nom | 35

Extracting ’superanimate’ nouns (2)

Postverbal subjects:
We wtorek w stolicy Kataru zebrali się na nieformalnej konferencji ministrowie 22 państw Ligi Arabskiej.
‘The ministers of the 22 Arab League countries gathered at an informal conference in the capital of Qatar on Tuesday.’

Preverbal subjects:
W przyjętej w Dausze wspólnej deklaracji Arabowie zdecydowanie potępili terroryzm we wszelkich formach.
‘In the joint declaration adopted in Doha, the Arabs strongly condemned terrorism in all its forms.’

Antecedents of the relative pronoun ’którzy’:
Komórka składała się z wielu dziesiątek osób, w tym dwóch pilotów, którzy kształcili się w tych samych szkołach amerykańskich, co Mohammed Atta.
‘The cell consisted of dozens of people, including two pilots, who had completed their education at the same American schools that Mohammed Atta attended.’

The decrease of unknown superanimate noun forms during the training phase (training on 4 files, ca 11 000 words each), normalized data

[Figure: x-axis: corpus (1 to 4); y-axis: count; series: normalized unique forms, normalized unknown forms.]

[Figure: Percent correctly classified nouns. x-axis: corpus (1 to 4); y-axis: %.]

[Figure: Correctly classified nouns, by category: +ma,+hum,+pl,+nom; +ma,+hum; +ne,+sg,+nom; other nouns; other categories.]

The results of post-editing after the training phase

The lexical coverage of different text domains

 | World news | Sport | Science | Business
Nouns (types) found in the database | 166 | 47 | 64 | 48
Nouns (types) added to the database | 24 | 121 | 98 | 78
Total | 190 | 168 | 162 | 126
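The coverage figures are easier to compare as proportions. The short calculation below derives, for each domain, the share of noun types that were already in the database; the counts are copied from the table, while the percentage view is an addition made here for illustration.

```python
# domain: (noun types found in the database, noun types added)
counts = {
    "World news": (166, 24),
    "Sport": (47, 121),
    "Science": (64, 98),
    "Business": (48, 78),
}

# Share of noun types already covered by the database, in percent.
coverage = {
    domain: 100.0 * found / (found + added)
    for domain, (found, added) in counts.items()
}

for domain, pct in coverage.items():
    found, added = counts[domain]
    print(f"{domain}: {found}/{found + added} = {pct:.1f}%")
```

World news, the training domain, comes out at roughly 87%, while the other three domains lie between about 28% and 40%.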

The general procedure for extracting and classifying different word classes in Polish

[Figure: cyclic procedure around a central database, shown as an animated sequence. Components:]
• A: Corpus study, linguistic hypotheses
• B: Stop-list: items with a high markedness degree
• C: Agreeing items in sentence or phrase context
• D: Generating inflectional forms
• E: Post-editing
• F: Most frequent inflectional forms
• Database

Walk-through shown in the animation:
• Stop-list item: pojawili (v: +pl,+ma,+hum)
• Agreeing item: Algierczycy (n: +pl,+ma,+hum,+nom)
• Generated inflectional forms: Algierczyk (+sg,+nom), Algierczyków (+pl,+gen)
• New item with a high markedness degree: Algierczyków (+pl,+gen)
• Agreeing item: protestujących (’protesting’: prt,+pl,+gen)
• Generated inflectional forms: protestować (v, inf), protestujący (prt,+pl,+nom), protestującą (prt,+sg,+fe,+acc)
• New stop-list item: protestującą (prt,+sg,+fe,+acc)
• Agreeing item: grupę (’group’, n,+sg,+fe,+acc)
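The cycle in the diagram amounts to a bootstrapping loop: a highly marked form identifies agreeing items in its context, their inflectional forms are generated, and the marked forms among those seed the next round. The toy sketch below replays the slide's example with hard-coded lookup tables standing in for the corpus search (step C) and the form generator (step D); both tables and the function name are hypothetical, every generated form is simply re-tried as a seed, and post-editing (step E) is omitted.

```python
from collections import deque

# Step C stand-in: marked form -> agreeing items found in context.
AGREEING = {
    "pojawili": ["Algierczycy"],
    "Algierczyków": ["protestujących"],
    "protestującą": ["grupę"],
}

# Step D stand-in: item -> generated inflectional forms.
GENERATED = {
    "Algierczycy": ["Algierczyk", "Algierczyków"],
    "protestujących": ["protestować", "protestujący", "protestującą"],
}


def bootstrap(seed):
    """Grow the database from one marked seed form, breadth-first."""
    database = []
    queue = deque([seed])
    seen = {seed}
    while queue:
        marked = queue.popleft()
        for item in AGREEING.get(marked, []):
            if item not in database:
                database.append(item)
            for form in GENERATED.get(item, []):
                if form not in database:
                    database.append(form)
                if form not in seen:
                    # Generated forms are tried as seeds in the next round.
                    seen.add(form)
                    queue.append(form)
    return database


print(bootstrap("pojawili"))
```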

GUI for linking WN synsets and Polish words

Further development

• Domain extension
• Further work on the target lexicon
• Feedback from the generation module into the source lexicon
• Continued study of relations between textual structure and prosody