ParlaCLARIN II: Session 2 11 May 2020 Virtual Valley https://www.clarin.eu/ParlaCLARIN-II


Page 1:

ParlaCLARIN II: Session 2

11 May 2020, Virtual Valley

https://www.clarin.eu/ParlaCLARIN-II

Page 2:

Session 2: Tools for parliamentary corpora
Francesca Frontini

● Who mentions whom? Recognizing political actors in proceedings. Lennart Kerkvliet, Jaap Kamps and Maarten Marx
● Challenges of Applying Automatic Speech Recognition for Transcribing EU Parliament Committee Meetings: A Pilot Study. Hugo de Vos and Suzan Verberne
● Parsing Icelandic Alþingi Transcripts: Parliamentary Speeches as a Genre. Kristján Rúnarsson and Einar Freyr Sigurðsson

Bootstrapping NE annotation for Dutch Parliamentary proceedings
Visualisation

Page 3:

Session 2: Tools for parliamentary corpora

● Who mentions whom? Recognizing political actors in proceedings. Lennart Kerkvliet, Jaap Kamps and Maarten Marx
● Challenges of Applying Automatic Speech Recognition for Transcribing EU Parliament Committee Meetings: A Pilot Study. Hugo de Vos and Suzan Verberne
● Parsing Icelandic Alþingi Transcripts: Parliamentary Speeches as a Genre. Kristján Rúnarsson and Einar Freyr Sigurðsson

EU parliament committee meetings (English)

- not transcribed
- ASR needed
- challenging but interesting data

Page 4:

Session 2: Tools for parliamentary corpora

● Who mentions whom? Recognizing political actors in proceedings. Lennart Kerkvliet, Jaap Kamps and Maarten Marx
● Challenges of Applying Automatic Speech Recognition for Transcribing EU Parliament Committee Meetings: A Pilot Study. Hugo de Vos and Suzan Verberne
● Parsing Icelandic Alþingi Transcripts: Parliamentary Speeches as a Genre. Kristján Rúnarsson and Einar Freyr Sigurðsson

Syntactic parsing of Icelandic Parliamentary transcripts

Linguistic analysis and comparison with other speech genres

Page 5:

Session 2: Tools for parliamentary corpora

Questions

1) Currently the ParlaCLARIN projects are converging towards the use of a common TEI-XML format (https://clarin-eric.github.io/parla-clarin), which is intended to represent the metadata and structure of the transcripts. Do you think that the tools and annotation methods you have described can be adapted to process and exploit this type of input?

2a) What are the interoperability requirements for the tool and the level(s) of annotation you have described in your paper, in order to allow for cross-corpus and cross-language comparisons and analysis?

2b) Two of the papers directly address the question of Named Entities, the need to recognise them and the challenge they represent. An important requirement is to make NE annotation and referencing more interoperable across corpora and languages, and thus to allow for cross-corpus searches for mentions of persons (“Donald Trump” or “the president of the United States”), places (“Germany”) or institutions (“WHO” = “OMS”). Have you thought about this issue, and the possible alignment with Semantic Web resources (ISNI and VIAF, Geonames, DBpedia/Wikidata, etc.)?
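
One way to picture the alignment asked about in 2b is a shared identifier per entity, so that surface forms from different corpora and languages resolve to the same key. The sketch below is a minimal illustration under stated assumptions: the alias table, the `link` and `count_mentions` helpers and the sample mentions are all hypothetical and not taken from any of the papers; the identifiers are Wikidata QIDs.

```python
from collections import defaultdict

# Hypothetical alias table (an assumption for illustration): surface forms
# from different languages mapped to one shared identifier (Wikidata QIDs).
ALIASES = {
    "who": "Q7817",                        # World Health Organization
    "oms": "Q7817",                        # French/Spanish acronym for the WHO
    "world health organization": "Q7817",
    "germany": "Q183",
    "deutschland": "Q183",
    "donald trump": "Q22686",
}


def normalize(mention):
    """Lowercase and collapse whitespace so lookups are form-insensitive."""
    return " ".join(mention.lower().split())


def link(mention):
    """Return the shared identifier for a mention, or None if it is unknown."""
    return ALIASES.get(normalize(mention))


def count_mentions(corpora):
    """Map {corpus: [mention, ...]} to {identifier: {corpus: count}}."""
    counts = defaultdict(dict)
    for corpus, mentions in corpora.items():
        for mention in mentions:
            qid = link(mention)
            if qid is not None:  # unlinked mentions are simply skipped here
                counts[qid][corpus] = counts[qid].get(corpus, 0) + 1
    return {qid: dict(per_corpus) for qid, per_corpus in counts.items()}


# Made-up sample mentions from an English and a French corpus.
corpora = {
    "en": ["WHO", "Germany", "the WHO"],
    "fr": ["OMS", "OMS"],
}
result = count_mentions(corpora)
print(result)
```

A role description such as “the president of the United States” cannot be resolved by a static table like this one, because its referent depends on the date of the speech.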

Page 6:

Who mentions whom? Recognizing political actors in proceedings. Lennart Kerkvliet, Jaap Kamps and Maarten Marx

Answers to questions.

1. spaCy NER can easily be applied to data in TEI-XML.
2. Cross-corpus comparisons are possible, e.g. on NE/token ratios (also grouped by NE type).
3. NE linking is usually done to local databases on MPs etc.
   a. Coverage of Wikipedia is not always enough (especially for historical data).
   b. Local databases change all the time (e.g. committee names and memberships, names and party affiliations of MPs).
   c. This is hard to keep up to date.
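
Answers 1 and 2 can be sketched together: pull the utterance text out of a TEI file, run a tagger over it, and report NE/token ratios per entity type. This is an illustrative sketch only, not the authors' pipeline. It assumes speech is encoded in `<u who="...">` elements in the TEI namespace, and `toy_tagger` is a hypothetical gazetteer stand-in so that the example runs without a trained model. With spaCy, the same two values could be read from a `Doc` (its tokens and `doc.ents`).

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample document; the <u>/<seg> layout is an assumption.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#speaker1"><seg>Germany supports the WHO proposal.</seg></u>
    <u who="#speaker2"><seg>The committee in Brussels disagrees.</seg></u>
  </div></body></text>
</TEI>"""


def utterances(tei_xml):
    """Yield (speaker reference, whitespace-normalized text) per <u> element."""
    root = ET.fromstring(tei_xml)
    for u in root.iterfind(".//tei:u", TEI_NS):
        yield u.get("who"), " ".join("".join(u.itertext()).split())


def toy_tagger(text):
    """Hypothetical stand-in for an NER model.

    Returns (tokens, [(entity_text, label), ...]) from a tiny gazetteer.
    """
    gazetteer = {"Germany": "GPE", "Brussels": "GPE", "WHO": "ORG"}
    tokens = text.replace(".", " .").split()
    entities = [(tok, gazetteer[tok]) for tok in tokens if tok in gazetteer]
    return tokens, entities


def ne_token_ratios(tei_xml, tagger=toy_tagger):
    """Return {entity label: entity count / total token count}."""
    n_tokens = 0
    by_label = Counter()
    for _speaker, text in utterances(tei_xml):
        tokens, entities = tagger(text)
        n_tokens += len(tokens)
        by_label.update(label for _text, label in entities)
    if n_tokens == 0:
        return {}
    return {label: n / n_tokens for label, n in by_label.items()}


ratios = ne_token_ratios(SAMPLE)
print(ratios)
```

Because the ratios are normalized by token count, they can be compared across corpora of different sizes, provided the tokenization and the label set are comparable.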

Page 7:

Parsing Icelandic Alþingi Transcripts: Parliamentary Speeches as a Genre. Kristján Rúnarsson and Einar Freyr Sigurðsson

● We have made a new addition of parliament transcripts to the Icelandic Parsed Historical Corpus (IcePaHC)

● Our source is the Icelandic Gigaword Corpus, which uses a TEI XML format, so we already have scripts to work with such a format

● A TEI XML output format would also be a nice option for users and will be considered when we package IcePaHC in the future

● The Penn-based format will likely be retained for internal use and as a more human-readable option for users

● We consider XML to be impractical for manual editing of parse trees in text files, which we still have to do quite often.

Page 8:

Parsing Icelandic Alþingi Transcripts: Parliamentary Speeches as a Genre. Kristján Rúnarsson and Einar Freyr Sigurðsson

● IcePaHC was based on the Penn historical corpora and the format is largely compatible with them and several other treebanks for English and some for other languages, e.g. Greek, German

● The same corpus search tools, etc. can be used

● Linguistic analysis and annotation conventions stay the same where possible

● Tools to transform IcePaHC into the Universal Dependencies (UD) format have just been developed; the UD version of IcePaHC adds much more interoperability

Page 9:

Parsing Icelandic Alþingi Transcripts: Parliamentary Speeches as a Genre. Kristján Rúnarsson and Einar Freyr Sigurðsson

● We have not considered NER for IcePaHC, but it would be good to facilitate it, as NER is very relevant for parliament transcripts

● We have discussed finding better ways of parsing proper names. Currently, “proper nouns” are tagged at the word level, which has been problematic for multi-word proper names, especially ones with a more complex inner structure, such as prepositional phrases. What about tagging these words as regular nouns, adjectives, prepositions, etc. and encapsulating the entire proper name in a phrase tag?

● Facilitating NER would be additional motivation for such an approach, as it would provide a clear anchor point for a named entity ID
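
The phrase-level proposal can be illustrated with a small reader for Penn-style bracketed trees. This is a sketch under stated assumptions: the label `NP-NAME`, the other tags and the English example tree are hypothetical and are not IcePaHC's actual annotation scheme. The point is only that a single phrase node spans the whole multi-word name, so an entity ID would have one place to attach.

```python
def parse(bracketed):
    """Parse one Penn-style bracketed tree into nested (label, children) tuples.

    Children are either nested tuples or plain word strings.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def words(tree):
    """Return the words under a node, in surface order."""
    _label, children = tree
    out = []
    for child in children:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


def name_phrases(tree, name_label="NP-NAME"):
    """Collect the text of every phrase node carrying the (hypothetical) name label."""
    label, children = tree
    if label == name_label:
        return [" ".join(words(tree))]
    found = []
    for child in children:
        if isinstance(child, tuple):
            found.extend(name_phrases(child, name_label))
    return found


# Hypothetical tree: the name contains a prepositional phrase, and its words
# carry ordinary tags; only the enclosing phrase marks it as a proper name.
TREE = "(IP (NP-SBJ (NP-NAME (N University) (PP (P of) (NP (N Iceland))))) (VBD opened))"
names = name_phrases(parse(TREE))
print(names)
```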

Page 10:

Challenges of Applying Automatic Speech Recognition for Transcribing EU Parliament Committee Meetings: A Pilot Study. Hugo de Vos and Suzan Verberne

EU Parliament: too shallow

EU Parliament committees: more detail, yet no transcriptions

Making transcriptions with ASR is feasible

Page 11:

Challenges of Applying Automatic Speech Recognition for Transcribing EU Parliament Committee Meetings: A Pilot Study. Hugo de Vos and Suzan Verberne

1. Depends on the quality of the ASR (can we recognize entities well enough?)

Currently we are pessimistic

We are looking into recognizing whether an entity was named

Page 12:

Challenges of Applying Automatic Speech Recognition for Transcribing EU Parliament Committee Meetings: A Pilot Study. Hugo de Vos and Suzan Verberne

2. Aims:

- Quality as high as possible
- Interoperability is desired but of later concern
- Further development based on needs of Political Science community

Page 13:

Challenges of Applying Automatic Speech Recognition for Transcribing EU Parliament Committee Meetings: A Pilot Study. Hugo de Vos and Suzan Verberne

Questions for community

- Is correcting transcriptions an acceptable way to make a reference text in ASR?
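
The choice of reference text matters because ASR output is commonly scored against it as a word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function below is a generic sketch of that standard metric, not the evaluation code used in the pilot study; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


# Made-up example: one substitution and one deletion against five reference words.
wer = word_error_rate(
    "the committee adopted the report",
    "the committee adapted report",
)
print(wer)
```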
