balricling athens vg

Upload: voula-giouli

Post on 06-Apr-2018


TRANSCRIPT

  • 8/2/2019 BalricLing Athens VG

    1/23

    Standards for collection and annotation, processing and advanced HLT applications.

    Dissemination of experience in linguistics infrastructure harmonisation wrt. corpora

  • 2/23

    Basic and applied research in the field of Natural Language Processing, focusing on the design of computational models for natural language recognition and "understanding", with application to two interwoven tracks:

    - information processing, extraction and retrieval

    - multilingual information processing (multilingual applications and translation systems)

    What is a corpus

  • 3/23

    With the use of corpora:

    corpus-based NLP

    Research and development of NLP tools

    Testing and evaluation at component and system level

    Automatic creation and maintenance of other linguistic resources (lexica, name lists, etc.)

  • 4/23

    design and implementation of annotation environments for training, development and evaluation of components

    compilation of linguistic resources

    design and implementation of components

    Information Processing, Extraction & Retrieval

  • 5/23

    Levels of Annotation

    1. Surface Text Analysis (tokenisation and handling)

    2. Morphosyntactic Annotation

    3. Lemmatisation

    4. Named Entity Recognition

    5. Surface Syntactic Analysis

    6. Computation of Grammatical Functions in sentence parses

    7. Coreference Annotation

    8. Term detection

  • 6/23

    B. Multilingual Information Processing

    parallel text processing
    - text alignment at three levels: sentence, clause, word/term

    example-based translation
    - text matching
    - template extraction
    - template matching

  • 7/23

    Information Processing, Extraction and Retrieval

    [Architecture diagram]

    Input and output: Input Document, Template

    Processing stages: Surface Text Analysis, Morphosyntactic Annotation, Lemmatisation, Named Entity Recognition, Surface Syntactic Analysis, Functional Analysis, Coreference Resolution, Template Construction

    Tools: Text Handler, POS Tagger & Lemmatiser, Name Recogniser, Shallow Parser & Semantic Processor, Discourse Interpreter

    Resources: Name Lists, Lexicon, Grammar Rules, Context Rules, NE Rules, Subcategorisation Frames, Domain Model, Inference Rules
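The stages and tools listed above suggest a chain in which each component adds one annotation layer to a shared document. A minimal sketch in Python, assuming a plain dict as the document; the stage functions and their trivial bodies are illustrative placeholders, not the actual components:

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]

def text_handler(doc: Document) -> Document:
    # Placeholder tokenisation: whitespace split only.
    doc["tokens"] = doc["text"].split()
    return doc

def pos_tagger(doc: Document) -> Document:
    # Placeholder tagging: every token receives the dummy tag "X".
    doc["tags"] = ["X" for _ in doc["tokens"]]
    return doc

def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Each stage reads the layers written by earlier stages and adds its own.
    for stage in stages:
        doc = stage(doc)
    return doc

result = run_pipeline({"text": "a small example"}, [text_handler, pos_tagger])
```

Further stages (name recognition, shallow parsing, discourse interpretation) would slot into the same list in the order the levels of annotation are given.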

  • 8/23

    Surface Text Analysis

    Purpose

    identification of word and sentence boundaries

    detection of dates, numbers, enumeration lists, acronyms, abbreviations and punctuation

    according to Multext specifications

    Tool: Handler

    a set of filters chained together to form the entire segmentation tool:

    - split text
    - isolate punctuation
    - identify abbreviations, dates, numbers and enumerations
    - identify sentences

    Resources

    Regular Expression Grammars

    Abbreviation Lists
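The chained filters can be pictured with a small regular-expression tokeniser. This is only a sketch: the patterns, the three-entry abbreviation list and the type labels DATE, NUM, ABBR and PUNCT are assumptions made for illustration (only TOK appears in the sample output later in the slides), and English text stands in for the real input.

```python
import re

ABBREVIATIONS = {"Dr.", "etc.", "e.g."}  # stand-in for a real abbreviation list

TOKEN_RE = re.compile(
    r"\d{1,2}/\d{1,2}/\d{2,4}"   # dates such as 8/2/2019
    r"|\d+(?:[.,]\d+)*"          # numbers such as 600 or 1,5
    r"|\w+\.?"                   # words, optionally with a trailing period
    r"|[^\w\s]"                  # any other single punctuation mark
)

def split_tokens(text):
    tokens = []
    for raw in TOKEN_RE.findall(text):
        # Detach a trailing period unless the token is a listed abbreviation.
        if len(raw) > 1 and raw.endswith(".") and raw not in ABBREVIATIONS:
            tokens.extend([raw[:-1], "."])
        else:
            tokens.append(raw)
    return tokens

def classify(token):
    if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{2,4}", token):
        return "DATE"
    if re.fullmatch(r"\d+(?:[.,]\d+)*", token):
        return "NUM"
    if token in ABBREVIATIONS:
        return "ABBR"
    if re.fullmatch(r"[^\w\s]", token):
        return "PUNCT"
    return "TOK"

def split_sentences(tokens):
    # A sentence ends at a free-standing terminator; abbreviation periods stay attached.
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

tokens = split_tokens("Dr. Smith arrived on 8/2/2019. He paid 600 euros.")
sentences = split_sentences(tokens)
```

Because "Dr." is in the abbreviation list, its period is not treated as a sentence boundary, which is the reason the abbreviation filter has to run before sentence identification.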

  • 9/23

    POS Tagging & Lemmatisation

    Purpose

    assignment of unique part of speech & lemma according to the token's local function

    according to PAROLE specifications

    Tools: POS Tagger & Lemmatiser

    Rule Based Tagging

    automatic induction of transformation rules from annotated corpus

    twofold nature of rules: lexical & contextual

    application of rules on local context to assign tags to ambiguous or unknown words (à la Brill)

    Resources

    manually annotated corpus using a Parole-compliant tagset extended with tokenisation tags (~600 tags)

    morphological lexicon
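Transformation-based tagging in the style of Brill first assigns each word an initial tag and then rewrites tags with ordered rules. The sketch below shows only the application of one contextual rule; the rule is hand-written rather than induced from a corpus, and the lexicon, the two-letter tags and the rule format are assumptions chosen for illustration.

```python
from typing import Dict, List, NamedTuple

class ContextRule(NamedTuple):
    from_tag: str
    to_tag: str
    prev_tag: str  # the rule fires when the preceding token carries this tag

def initial_tags(words: List[str], lexicon: Dict[str, str], default: str = "No") -> List[str]:
    # Lexicon lookup gives the initial tag; unknown words get a default tag.
    return [lexicon.get(w.lower(), default) for w in words]

def apply_rules(tags: List[str], rules: List[ContextRule]) -> List[str]:
    tags = list(tags)
    for rule in rules:  # rules are applied in their given order
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags

lexicon = {"the": "At", "can": "Vb", "rusts": "Vb"}   # toy lexicon
rules = [ContextRule(from_tag="Vb", to_tag="No", prev_tag="At")]
words = ["The", "can", "rusts"]
tags = apply_rules(initial_tags(words, lexicon), rules)
```

Here "can" starts out as a verb and is retagged as a noun because an article precedes it, while "rusts" keeps its verb tag.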

  • 10/23

    Named Entity Recognition

    Purpose

    recognition and classification of named entities

    classification into person, organisation, location, time, date, money, percent

    according to MUC-7 specifications

    Tools: Name Recogniser

    cascaded finite-state machines recognising regular expression based grammars

    Resources

    name lists

    trigger word lists
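A two-stage cascade driven by name lists and trigger words might look as follows. This is a sketch under stated assumptions: the lists, the one-letter class codes and the two patterns are invented for the example, and Python regular expressions stand in for compiled finite-state machines.

```python
import re

NAME_LIST = {"John", "Smith"}           # hypothetical name list
ORG_TRIGGERS = {"Ltd", "Inc", "Bank"}   # hypothetical trigger words

def token_class(token):
    # Stage 1: map every token to a one-letter class.
    if token in NAME_LIST:
        return "N"
    if token in ORG_TRIGGERS:
        return "T"
    if token[:1].isupper():
        return "C"
    return "o"

# Stage 2: patterns over the class string, tried in priority order.
PATTERNS = [
    ("organisation", re.compile(r"[CN]+T")),  # capitalised words ending in a trigger
    ("person", re.compile(r"N+")),            # one or more listed names
]

def recognise(tokens):
    classes = "".join(token_class(t) for t in tokens)
    entities, taken = [], set()
    for label, pattern in PATTERNS:
        for m in pattern.finditer(classes):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):  # keep earlier, higher-priority matches
                taken.update(span)
                entities.append((label, " ".join(tokens[m.start():m.end()])))
    return entities

entities = recognise("Yesterday John Smith joined Acme Ltd".split())
```

Since each token maps to exactly one character, match offsets in the class string are token indices, which keeps the second stage a plain regular-expression match.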

  • 11/23

    Surface Syntactic & Functional Analysis

    Purpose

    identification of chunk boundaries and phrase structure

    computation of grammatical functions

    according to EAGLES specifications

    Tools

    cascaded finite-state machines recognising regular expression based grammars

    Resources

    regular expression grammars

    subcategorisation frames derived from the Parole syntactic layer
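Cascaded chunking can be sketched as regular expressions applied level by level, where each level sees the chunks built by the previous one as single symbols. The grammar below (np, vg, then pp) and the one-letter encoding are assumptions for illustration; only the two-letter tag prefixes follow the tagset used in the slides.

```python
import re

CODES = {"At": "d", "Aj": "a", "No": "n", "As": "p", "Vb": "v"}

LEVELS = [
    # Level 1: noun phrases and verb groups over part-of-speech symbols.
    [("np", "N", re.compile(r"d?a*n+")), ("vg", "V", re.compile(r"v+"))],
    # Level 2: a preposition followed by a noun phrase built at level 1.
    [("pp", "P", re.compile(r"pN"))],
]

def cascade(tags):
    # Each item is (symbol, start, end) over token positions.
    items = [(CODES.get(t[:2], "x"), i, i + 1) for i, t in enumerate(tags)]
    chunks = []
    for level in LEVELS:
        for label, symbol, pattern in level:
            string = "".join(sym for sym, _, _ in items)
            rebuilt, pos = [], 0
            for m in pattern.finditer(string):
                rebuilt.extend(items[pos:m.start()])
                start, end = items[m.start()][1], items[m.end() - 1][2]
                chunks.append((label, start, end))
                rebuilt.append((symbol, start, end))  # chunk becomes one symbol
                pos = m.end()
            rebuilt.extend(items[pos:])
            items = rebuilt
    return chunks

tags = ["AsPpSp", "AjBaFeSgAc", "NoCmFeSgAc", "VbMnIdPr03PlXxIpAvXx",
        "VbMnNfXxXxXxXxPePvXx", "AtDfFePlNm", "NoCmFePlNm"]
chunks = cascade(tags)
```

Each chunk is reported as a label with a half-open token span, so the pp at positions 0 to 3 contains the np at positions 1 to 3.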

  • 12/23

    Anaphoric Relations

    Purpose

    detection and classification of candidate antecedents

    creation of coreferential chains by linking each anaphor to its antecedent

    according to MUC-7 and MATE specifications

    Tools: Discourse Interpreter

    candidate antecedents detection -> weight estimation of their features -> anaphoric expression detection -> candidate elimination on the basis of selectional restrictions -> estimation of salience value for candidate antecedents -> selection of the most salient candidate -> discourse model update

    Resources

    domain model
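The elimination and salience steps of this procedure can be illustrated with a toy resolver. Everything specific here is an assumption: gender and number agreement stands in for selectional restrictions, and the role weights and the recency term are invented values, not the system's actual feature weights.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Mention:
    text: str
    position: int   # token offset in the discourse
    gender: str
    number: str
    role: str       # grammatical function, e.g. "subject"

ROLE_WEIGHT = {"subject": 3.0, "object": 2.0, "other": 1.0}  # assumed weights

def resolve(anaphor: Mention, candidates: List[Mention]) -> Optional[Mention]:
    # Eliminate candidates that follow the anaphor or disagree with it.
    compatible = [
        c for c in candidates
        if c.position < anaphor.position
        and c.gender == anaphor.gender
        and c.number == anaphor.number
    ]
    if not compatible:
        return None

    def salience(c: Mention) -> float:
        # Role weight plus a recency bonus that shrinks with distance.
        return ROLE_WEIGHT.get(c.role, 1.0) + 1.0 / (anaphor.position - c.position)

    return max(compatible, key=salience)

candidates = [
    Mention("the bank", 1, "fe", "sg", "subject"),
    Mention("the shares", 5, "fe", "pl", "object"),
    Mention("the company", 8, "fe", "sg", "other"),
]
anaphor = Mention("it", 12, "fe", "sg", "subject")
antecedent = resolve(anaphor, candidates)
chain = [antecedent.text, anaphor.text] if antecedent else [anaphor.text]
```

The plural candidate is eliminated by agreement, and the subject outranks the more recent non-subject because its role weight dominates the recency bonus.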

  • 13/23

    (SENT
    1\59 TOK THN
    1\64 TOK
    1\71 TOK
    1\78 TOK
    1\84 TOK
    1\93 TOK
    1\96 TOK
    1\113 TOK
    1\117 TOK
    1\121 TOK
    1\129 TOK
    1\133 TOK
    1\145 TOK O
    1\156 TOK
    1\172 OPUNCT (
    1\173 TOK BTC
    1\176 CPUNCT )
    1\178 TOK
    1\182 TOK
    1\186 TOK O
    1\196 TOK
    1\212 TOK
    1\216 TOK
    1\224 OPUNCT (
    1\225 TOK O
    1\228 CPUNCT )
    1\274 PTERM_P .
    )SENT

    Sample Handled Text in Tipster Format
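Records of this kind are straightforward to read back into a program. The parser below is a sketch that assumes one record per line, made up of two integers joined by a backslash, a type label, and optional trailing fields; the field names `prefix` and `offset` are guesses at what the two integers mean and are not taken from the Tipster specification.

```python
import re
from typing import List, NamedTuple, Optional

class Record(NamedTuple):
    prefix: int        # integer before the backslash (meaning assumed)
    offset: int        # integer after the backslash (meaning assumed)
    kind: str          # e.g. TOK, OPUNCT, CPUNCT, PTERM_P
    fields: List[str]  # token and any further annotations, possibly empty

RECORD_RE = re.compile(r"^(\d+)\\(\d+)\s+(\S+)(?:\s+(.*))?$")

def parse_line(line: str) -> Optional[Record]:
    m = RECORD_RE.match(line.strip())
    if m is None:
        return None  # structural markers such as "(SENT" and ")SENT"
    rest = m.group(4).split() if m.group(4) else []
    return Record(int(m.group(1)), int(m.group(2)), m.group(3), rest)

sample = ["(SENT", r"1\59 TOK THN", r"1\172 OPUNCT (", r"1\64 TOK", ")SENT"]
records = [r for r in map(parse_line, sample) if r is not None]
```

Lines that do not start with the integer pair are skipped, so sentence brackets can be handled separately by the caller.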

  • 14/23

    (SENT
    1\59 TOK THN AsPpPaFeSgAc
    1\64 TOK AjBaFeSgAc
    1\71 TOK NoCmFeSgAc
    1\78 TOK VbMnIdPr03PlXxIpAvXx
    1\84 TOK VbMnNfXxXxXxXxPePvXx
    1\93 TOK AtDfFePlNm
    1\96 TOK NoCmFePlNm
    1\113 TOK AsPpSp
    1\117 TOK AtDfFeSgAc
    1\121 TOK NoCmFeSgNm
    1\129 TOK AtDfMaSgGe
    1\133 TOK AjBaMaSgGe
    1\145 TOK O NoCmMaSgGe
    1\156 TOK NoCmFePlGe
    1\172 OPUNCT ( ( OPUNCT
    1\173 TOK BTC BTC RgFwOr
    1\176 CPUNCT ) ) CPUNCT
    1\178 TOK AsPpSp
    1\182 TOK AtDfMaSgAc
    1\186 TOK O NoCmMaSgAc
    1\196 TOK NoCmFePlGe
    1\212 TOK AtDfFeSgGe
    1\216 TOK NoPrFeSgGe
    1\224 OPUNCT ( ( OPUNCT
    1\225 TOK O O RgAnXx
    1\228 CPUNCT ) ) CPUNCT
    1\277 PTERM_P . . PTERM_P
    )SENT

    Sample POS-Tagged & Lemmatised Text in Tipster Format

  • 15/23

    (SENT
    1\59 TOK THN AsPpPaFeSgAc
    1\64 TOK AjBaFeSgAc
    1\71 TOK NoCmFeSgAc
    1\78 TOK VbMnIdPr03PlXxIpAvXx
    1\84 TOK VbMnNfXxXxXxXxPePvXx
    1\93 TOK AtDfFePlNm
    1\96 TOK NoCmFePlNm
    1\113 TOK AsPpSp
    1\117 TOK AtDfFeSgAc
    1\121 TOK NoCmFeSgNm orgproperty
    1\129 TOK AtDfMaSgGe
    NE [org
    1\133 TOK AjBaMaSgGe nationality
    1\145 TOK O NoCmMaSgGe company
    1\156 TOK NoCmFePlGe
    NE /org]
    1\178 TOK AsPpSp
    1\182 TOK AtDfMaSgAc
    NE [org
    1\186 TOK O NoCmMaSgAc orgdesign
    1\196 TOK NoCmFePlGe
    1\212 TOK AtDfFeSgGe
    1\216 TOK NoPrFeSgGe
    NE /org]
    1\277 PTERM_P . . PTERM_P
    )SENT

    Name Recogniser Output in Tipster Format

  • 16/23

    SYN [cl
    SYN [pp
    1\59 TOK THN AsPpPaFeSgAc as_other
    SYN [np_ac
    SYN [adjp_ac
    1\64 TOK AjBaFeSgAc ajbasgac
    SYN /adjp_ac]
    1\71 TOK NoCmFeSgAc
    SYN /np_ac]
    SYN /pp]
    SYN [vg
    1\78 TOK VbMnIdPr03PlXxIpAvXx vb_exw
    1\84 TOK VbMnNfXxXxXxXxPePvXx
    SYN /vg]
    SYN [np_nm
    1\93 TOK AtDfFePlNm atdfplnm
    1\96 TOK NoCmFePlNm
    SYN /np_nm]
    1\113 TOK AsPpSp as_gia
    1\117 TOK AtDfFeSgAc atdfsgac
    SYN [np_nm
    1\121 TOK NoCmFeSgNm nosgnm
    SYN /np_nm]
    SYN [np_ge
    1\129 TOK AtDfMaSgGe atdfsgge
    SYN [adjp_ge
    1\133 TOK AjBaMaSgGe ajbasgge
    SYN /adjp_ge]
    1\145 TOK O NoCmMaSgGe nosgge
    SYN /np_ge]
    SYN [np_ge
    1\156 TOK NoCmFePlGe
    SYN /np_ge]

    Sample Syntactically Analysed Text in Tipster Format

  • 17/23

    Marker: A unified multi-level annotation tool

  • 18/23

    Marker: A unified multi-level annotation tool

  • 19/23

    Term Extraction

    Purpose

    identification of candidate (indexing) terms in a text

    Tools

    finite-state machines recognising regular expression based grammars

    filters implementing a number of statistical scores

    Resources

    regular expression grammars describing major term patterns
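Pattern-based term extraction followed by a statistical filter can be sketched as below. The single pattern (optional adjectives followed by nouns), the multiword restriction and the use of raw frequency as the score are all assumptions; the slides do not say which patterns or scores are actually used.

```python
import re
from collections import Counter

def candidate_terms(tagged):
    # tagged is a list of (word, tag) pairs; tags are reduced to one letter each.
    codes = "".join({"Aj": "a", "No": "n"}.get(tag, "x") for _, tag in tagged)
    for m in re.finditer(r"a*n+", codes):
        if m.end() - m.start() >= 2:  # keep multiword candidates only
            yield " ".join(w.lower() for w, _ in tagged[m.start():m.end()])

def score_terms(sentences, min_freq=2):
    # Statistical filter: keep candidates seen at least min_freq times.
    counts = Counter(t for s in sentences for t in candidate_terms(s))
    return {term: c for term, c in counts.items() if c >= min_freq}

sentences = [
    [("Parallel", "Aj"), ("corpora", "No"), ("help", "Vb"),
     ("translation", "No"), ("systems", "No")],
    [("We", "Pn"), ("align", "Vb"), ("parallel", "Aj"), ("corpora", "No")],
    [("Translation", "No"), ("memory", "No"), ("grows", "Vb")],
]
terms = score_terms(sentences)
```

A more discriminating score could replace the frequency threshold without changing the pattern-matching stage.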

  • 20/23

    Term Normalisation

    Purpose

    robust processing of queries allowing for error-tolerant matching of query terms and index terms

    Resources

    list of functional words for each language (optional)

    Tools

    finite state machines

    combinatorial processor

  • 21/23

    Robust Querying of Text Memories

    Purpose

    robust processing of queries allowing for error-tolerant matching of query terms and index terms

    retrieval of appropriate document portions

    Resources

    list of functional words for each language (optional)

    Tools

    textual database indexing tool

    fuzzy matching tool
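Error-tolerant matching of query terms against index terms can be approximated with the standard library. In this sketch `difflib.get_close_matches` is a swapped-in stand-in for the fuzzy matching tool, and the function-word list, the index terms and the 0.8 cutoff are assumptions.

```python
import difflib

FUNCTION_WORDS = {"of", "the", "for", "in"}  # assumed function-word list

def normalise(term):
    # Lowercase and drop function words before comparison.
    return " ".join(w for w in term.lower().split() if w not in FUNCTION_WORDS)

def match_query(query, index_terms, cutoff=0.8):
    normalised = {normalise(t): t for t in index_terms}
    hits = difflib.get_close_matches(
        normalise(query), list(normalised), n=3, cutoff=cutoff
    )
    return [normalised[h] for h in hits]

index_terms = [
    "alignment of parallel texts",
    "named entity recognition",
    "term extraction",
]
matches = match_query("named entitty recogniton", index_terms)
```

The misspelled query still retrieves the intended index term because the character-level similarity stays above the cutoff.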

  • 22/23

    Multilingual Information Processing

    parallel text processing

    automatically induce translation equivalencies at different levels

    Methodological principles

    knowledge acquisition through hybrid (probabilistic & linguistic) corpus-based methods

    make effort to devise methods and tools that are as language independent as possible

    To this end:

    as little linguistic processing as possible

    exploit the power of statistical tools
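One language-independent statistical tool for inducing translation equivalences is a co-occurrence score over aligned sentence pairs. The sketch below uses the Dice coefficient on a toy English-German corpus; the choice of Dice and the data are assumptions made for illustration, not the method reported in the slides.

```python
from collections import Counter
from itertools import product

def dice_scores(aligned_pairs):
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in aligned_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count every source/target word pair that shares an aligned sentence pair.
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }

pairs = [("the house", "das Haus"), ("the book", "das Buch"), ("a house", "ein Haus")]
scores = dice_scores(pairs)
```

Word pairs that always appear together score 1.0, while pairs that merely share one sentence score lower, and no linguistic processing beyond whitespace tokenisation is needed.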

  • 23/23

    Parallel Text Processing

    text alignment: three levels of alignment

    sentence
    - compact translation ambiguities within sentence
    - use in translation aid tools for translation or translation example retrieval

    word/term
    - create candidate bilingual equivalencies
    - use in translation and cross-lingual IR

    clause
    - create candidate bilingual equivalent chunks
    - use in draft translation production
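Sentence-level alignment is often done from sentence lengths alone, by dynamic programming. The sketch below is a simplified length-based aligner that allows only one-to-one links and skips; the cost function, the skip penalty and the example sentences are assumptions, and the slides do not state which alignment algorithm is used.

```python
def align_sentences(src, tgt, skip_penalty=20):
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            moves = []
            if i < n and j < m:  # link src[i] to tgt[j]; cost is the length difference
                moves.append((i + 1, j + 1, abs(len(src[i]) - len(tgt[j]))))
            if i < n:            # leave src[i] unaligned
                moves.append((i + 1, j, skip_penalty))
            if j < m:            # leave tgt[j] unaligned
                moves.append((i, j + 1, skip_penalty))
            for ni, nj, c in moves:
                if cost[i][j] + c < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + c
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end and collect the one-to-one links.
    links, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if (i - pi, j - pj) == (1, 1):
            links.append((src[pi], tgt[pj]))
        i, j = pi, pj
    return links[::-1]

src = ["Hello.", "The committee approved the proposal yesterday.", "Thanks."]
tgt = ["Hallo.", "Der Ausschuss hat den Vorschlag gestern angenommen.",
       "Beifall im Saal.", "Danke."]
links = align_sentences(src, tgt)
```

The target sentence with no counterpart is skipped because paying the skip penalty once is cheaper than forcing a badly matched link.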