ISOcat Data Model: Workflow & Guidelines
Marc Kemps-Snijders (a), Sue Ellen Wright (b), Menzo Windhouwer (a)
(a) Max Planck Institute for Psycholinguistics, (b) Kent State University
Marc.kemps-snijders@mpi.nl, [email protected], [email protected]
NEERI Helsinki Standards Workshop, 2009-09-30


Data category

The result of the specification of a given data field
A data category is an elementary descriptor in a linguistic structure or an annotation scheme (ISO 1087-2)
Linguistic data categories: /part of speech/, /noun/, /verb/, /definition/, /context/, etc.


Data Category Applications

DCs are used as:
  Field names in databases
  Permissible values for closed and constrained data categories
  Tag names and attribute values in annotation frameworks

DCs are used by:
  Different broad thematic domains (e.g., terminology, morphosyntax, lexicography, etc.)
  Different communities of practice within a given domain

Data category selections exist as:
  Resource tag sets (e.g., tagsets used in major corpora)
  Standardized sets of field names and values (e.g., TBX Basic, TBX [ISO 30042])


Data category

TC 37 practice treats both data fields and enumerated domain values as data categories:
  Open data categories: e.g., term, which can take any value designated as a term
  Closed data categories: e.g., grammatical gender, which takes a set of enumerated values as its content
  Constrained data categories: e.g., Olympic years, which takes as its content values defined by a formal constraint (i.e., every fourth year starting from a certain date)
  Simple data categories: e.g., masculine, a member of an enumerated value domain
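The four kinds listed above can be pictured with a small Python sketch. It is only an illustration, not how ISOcat is implemented: the class names, the `accepts` method, and the start year 1896 are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class SimpleDC:
    """A simple data category: one member of an enumerated value domain."""
    name: str


@dataclass(frozen=True)
class OpenDC:
    """An open data category: any value is acceptable."""
    name: str

    def accepts(self, value: str) -> bool:
        return True


@dataclass(frozen=True)
class ClosedDC:
    """A closed data category: content is one of its enumerated simple DCs."""
    name: str
    values: FrozenSet[SimpleDC]

    def accepts(self, value: str) -> bool:
        return any(v.name == value for v in self.values)


@dataclass(frozen=True)
class ConstrainedDC:
    """A constrained data category: content must satisfy a formal constraint."""
    name: str
    constraint: Callable[[str], bool]

    def accepts(self, value: str) -> bool:
        return self.constraint(value)


def is_olympic_year(value: str, start: int = 1896) -> bool:
    # Hypothetical constraint: every fourth year counting from `start`.
    if not (value.isascii() and value.isdigit()):
        return False
    year = int(value)
    return year >= start and (year - start) % 4 == 0


term = OpenDC("term")
gender = ClosedDC(
    "grammatical gender",
    frozenset({SimpleDC("masculine"), SimpleDC("feminine"), SimpleDC("neuter")}),
)
olympic_years = ConstrainedDC("Olympic years", is_olympic_year)
```

With these definitions, `gender.accepts("feminine")` and `olympic_years.accepts("2008")` hold, while `gender.accepts("dual")` and `olympic_years.accepts("2009")` do not.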


Data category types

[Diagram: data category types]
complex:
  closed: grammaticalGender (enumerated string)
  constrained: emailAddress (constrained string), Constraint: .+@.+
simple:
  neuter, masculine, feminine (the value domain of grammaticalGender)
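The emailAddress constraint can be tried out directly. The sketch below assumes that the pattern must match the whole value (as XML Schema patterns do); the slide does not say which matching semantics ISOcat applies, and the function name is hypothetical.

```python
import re

# The constraint shown on the slide for emailAddress.
EMAIL_CONSTRAINT = re.compile(r".+@.+")


def satisfies_email_constraint(value: str) -> bool:
    # Assumption: the whole value must match, not just a substring.
    return EMAIL_CONSTRAINT.fullmatch(value) is not None


for candidate in ["user@example.org", "a@b", "@example.org", "user@"]:
    print(candidate, satisfies_email_constraint(candidate))
```

The pattern is deliberately loose: it only requires at least one character on each side of an `@`, so it rejects `@example.org` and `user@` but accepts `a@b`.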


Data category relationships

Value domain membership
Subsumption relationships between simple data categories
Relationships between complex data categories are not stored in the DCR

[Diagram: partOfSpeech (enumerated string) and pronoun]
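The two stored relationship types can be sketched as plain lookups. This is an assumed representation, not the DCR's storage format, and the entry `personalPronoun` is a hypothetical narrower category added only to give the subsumption check something to traverse.

```python
from typing import Dict, Optional, Set

# Value domain membership: which simple DCs a closed DC may take as values.
VALUE_DOMAIN: Dict[str, Set[str]] = {
    "partOfSpeech": {"pronoun", "personalPronoun"},
}

# Subsumption between simple DCs: narrower -> broader (hypothetical entry).
BROADER: Dict[str, str] = {"personalPronoun": "pronoun"}


def subsumes(broader: str, narrower: str) -> bool:
    """True if `broader` equals `narrower` or lies above it in the chain."""
    current: Optional[str] = narrower
    seen: Set[str] = set()
    while current is not None and current not in seen:
        if current == broader:
            return True
        seen.add(current)  # guards against accidental cycles in the data
        current = BROADER.get(current)
    return False
```

Relationships between complex data categories are deliberately absent from the sketch, since the DCR does not store them.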


Data Category Registry (DCR)

Set of data categories to be used as a reference for the definition of linguistic annotation schemes or any other formats used in the area of language resources
Implemented as the TC 37 ISOcat registry
Registration Authority: Max Planck Institute for Psycholinguistics, Nijmegen
Open and accessible at: http://www.isocat.org
Come play with the cat!
But he’s a bit fussy and likes to have people follow some simple rules!
Simple rules are spelled out in the DCR Guidelines.


ISOcat model and mission & metaphor

Not a layered onion …
A segmented aggregate, like a knob of garlic instead:
  “Cloves” are sets of private data categories
  The center stem represents the standardization core
Many DCs and DCSs may never be intended for standardization
Only the standardized core is described in ISO 12620:2009
Need to define non- and pre-standardization procedures


ISOcat Data model

The ISO 12620 data model consists of 3 main parts:
  Administrative part
    Administration and identification
  Descriptive part
    Documentation and information for working language or languages
    Data element names and identifiers
    Data element concept definitions
  Linguistic part
    Conceptual domain of object language
    Data element type declarations
    Special object language constraints


Data Model and DC life cycle

Part 1 of the ISOcat data model reflects the DC standardization cycle
  Major steps in the workflow = classes in the DC model
But the creation cycle precedes standardization
  A DC must be created, and ideally discussed in a group, before the standardization process even begins.
  Not all DCs will be standardized.

The process starts out here, and we need to define this process.


Non- & Pre-Standardization Workflow

DC created in private work space
  Option: DC remains private
  Option: assign DC to a group
  Option: DC discussed & revised in the group to achieve consensus
  Option: DC used in group
  Option: DC used widely by public group
  Option: DC submitted for standardization
Standards process starts with submission

[Diagram: the DCR, with a Standardized Core and multiple DCSs]
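One way to read the options above is as a small state machine. The sketch below is an interpretation, not a documented ISOcat mechanism: the stage names and, in particular, the set of allowed transitions (for instance, that a private DC may be submitted without group discussion) are assumptions.

```python
from enum import Enum, auto
from typing import Dict, Set


class Stage(Enum):
    PRIVATE = auto()    # created in a private work space
    GROUP = auto()      # assigned to, discussed and used in a group
    PUBLIC = auto()     # used widely by a public group
    SUBMITTED = auto()  # submitted; the standards process starts here


# Assumed transitions; every stage may also simply stay where it is.
ALLOWED: Dict[Stage, Set[Stage]] = {
    Stage.PRIVATE: {Stage.GROUP, Stage.SUBMITTED},
    Stage.GROUP: {Stage.PUBLIC, Stage.SUBMITTED},
    Stage.PUBLIC: {Stage.SUBMITTED},
    Stage.SUBMITTED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move a DC to `target`, or raise ValueError if the step is not allowed."""
    if target is current or target in ALLOWED[current]:
        return target
    raise ValueError(f"cannot move from {current.name} to {target.name}")
```

A DC that never leaves `Stage.PRIVATE` is a perfectly valid outcome in this model, matching the point that not all DCs are meant for standardization.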


Cascade of Responsibility

ISOcat Model
  Design by the ISOcat development group
  Approved by TC 37
  Standardized in ISO 12620:2009
ISOcat input template, interface presentation
  Implementation by the ISOcat programmer/system administrator
  Approved by development group
  Scrutiny of beta testers, user community
ISOcat Guidelines for data category specifications
  http://www.isocat.org/manual/DCRGuidelines.pdf
  Instantiation by the individual expert user
  Scrutiny by other users, eventually by DCR TDGs/DCRB


ISO 12620: Overview

Three parts
Linchpin: Data Category

[Diagram: UML overview of the DCR data model, in three parts: Figure 4 - The administration part; Figure 5 - The description part; Figure 6 - The linguistic part]

Part 1: Global Information & Administration Information Section

[Diagram: the administration part of the DCR data model (Figure 4). Classes shown: DCR, Data Category, Global Information, Administration Information Section, Administration Record, Registration Group, Submission Group, Stewardship Group, Decision Group, Change]


Identifiers – Responsibilities

Required


Justification – Creator Responsibility

Justification for /part of speech/:
  For part of speech the justification is obvious, but this is not true of every DC for every potential user.
Required for standardization
Highly desirable for any DC that will be shared outside a private scope


Required


Administration Information Section

Implementation of the standardization workflow
Embodied in the information workflow associated with the standardization process
Standardized in ISO 12620:2009 in compliance with ISO Directives Annex ST for Standards as Databases
Represented by the flowcharts in slides 19 and 20 (TDG Role: Maintenance Team; DCRB Role: Validation Role)
Responsibility:
  Thematic Domain Groups (TDGs), which act as stewards in maintaining data category specifications (DCs) and data category selections (DCSs)
  Data Category Registry Board (DCRB), which validates DCs and DCSs and endeavors to harmonize among TDGs


Data category: The standardization option

Data categories can be kept private or submitted to the standardization process, in which case they are assigned to a Thematic Domain Group which judges them.

[Diagram: DCR Board; TDG metadata, TDG morphosyntax, TDG terminology, TDG …]

At regular intervals, snapshots of the standardized subset of the DCR will be submitted to ISO to form a “standard as database” according to Annex ST of the ISO/IEC Directives.


TDG Role: Maintenance Team

DCRB Role: Validation Role

Part 2: Descriptive Part

[Diagram: the description part of the DCR data model (Figure 5). Classes shown: Data Category, Description Section, Language Section, Name Section, Definition, Example, Explanation, Data Element Name]

Diagram annotations:
  Describes equivalents in working languages; English data element name, definition, and justification comment required
  Database, format or application specific data element names
  Rigorous terminological definition consisting of a single sentence fragment linked to a logical concept system


Part 2: Guideline Responsibilities

Data Element Name:
  Language-independent name for the data category used in a specific application domain (specified in the Source)
  PoS / POS / pos are all common short forms used for /part of speech/ in various application environments.

Name Section in a Language Section
  (Min. one required in English Language Section)
  (Multiple in multiple Language Sections permitted)
  Human-legible (mnemonic) name
  ‘part of speech’ in the English language section
  ‘partie du discours’ in the French language section


One en Name required.

Multiple Names optional.

Multiple Names in other languages optional.
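The naming rules above (at least one English name, any number of further names in any language section) lend themselves to a simple check. The following is a minimal sketch under assumed conventions: language sections are keyed by language code, and the function name `check_names` and its messages are hypothetical, not part of the ISOcat Checker.

```python
from typing import Dict, List


def check_names(language_sections: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the names pass."""
    problems: List[str] = []
    # Rule: at least one name is required in the English language section.
    if not language_sections.get("en"):
        problems.append("at least one English (en) name is required")
    # Extra sanity check (an assumption): names must not be blank.
    for lang, names in language_sections.items():
        if any(not name.strip() for name in names):
            problems.append(f"empty name in language section {lang!r}")
    return problems


part_of_speech = {
    "en": ["part of speech"],
    "fr": ["partie du discours"],
}
print(check_names(part_of_speech))
```

Data element names such as PoS or pos are language-independent and would live outside these language sections, so they are not covered by this check.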


Part 2: Guideline Responsibilities

Definition:
  Rigorous intensional definitions (ISO 704)
  Single sentence fragment
  Additional information in comments fields, justification, etc.
  Example: Die Klasse von Wörtern einer Sprache (broader concept) … auf Grund der Zuordnung (characteristic) nach gemeinsamen grammatischen Merkmalen. (characteristic)
  [English: the class of words of a language … on the basis of assignment according to shared grammatical features.]
Source:
  The source for any quoted material; here: Wikipedia


Part 3: Linguistic Part

[Diagram: the linguistic part of the DCR data model (Figure 6). Classes shown: Data Category, Complex Data Category, Simple Data Category, Closed Data Category, Open Data Category, Constrained Data Category, Linguistic Section, Closed Linguistic Section, Constrained Linguistic Section, Conceptual Domain, Value Domain, Open Conceptual Domain, Schema Specific Domain, Profile Value Domain, Example, Explanation]

Data category: Linguistic part

[Diagram: the linguistic part of the DCR data model, with the following annotations]
  Complex, constrained and simple data categories are explicitly modeled here
  Constraints for a given object language
  Enumeration of permissible values in closed value domains


Data category: Linguistic part (example)

Data category: /grammatical gender/
  Conceptual domain: /masculine/, /feminine/, /neuter/
    Lists all admissible values for all languages
  Linguistic Section
    Language: fr
    Value Domain: /masculine/, /feminine/
    Lists all admissible values for French
    Linguistic section values must be a subset of the defined conceptual domain.

Data category: /part of speech/, value: /partitive/
  Limited in the Linguistic Section to French
  Issue with the partitive case in Finnish: some values are very language dependent
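The subset rule in the /grammatical gender/ example can be expressed in a few lines. This is a sketch under stated assumptions: the dictionary layout and function names are hypothetical, and falling back to the conceptual domain when a language has no linguistic section is an interpretation, not something the slides specify.

```python
from typing import Dict, Optional, Set

# Conceptual domain: all admissible values across languages.
CONCEPTUAL_DOMAIN: Set[str] = {"masculine", "feminine", "neuter"}

# Linguistic sections: admissible values per object language.
LINGUISTIC_SECTIONS: Dict[str, Set[str]] = {"fr": {"masculine", "feminine"}}


def values_outside_domain(value_domain: Set[str]) -> Set[str]:
    """Values that violate the rule 'must be a subset of the conceptual domain'."""
    return value_domain - CONCEPTUAL_DOMAIN


def is_admissible(value: str, language: Optional[str] = None) -> bool:
    # Assumption: without a linguistic section, the conceptual domain applies.
    if language in LINGUISTIC_SECTIONS:
        return value in LINGUISTIC_SECTIONS[language]
    return value in CONCEPTUAL_DOMAIN


# Every linguistic section must stay inside the conceptual domain.
for lang, domain in LINGUISTIC_SECTIONS.items():
    assert not values_outside_domain(domain), lang
```

Here `is_admissible("neuter", "fr")` is false even though /neuter/ is in the conceptual domain, which is exactly the language-specific narrowing the linguistic section provides.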


QA Components

Option for ad hoc group validation
TDG approval during standardization
DCRB harmonization & validation
ISOcat Checker


Thank you for your attention
Come play with the cat!

http://www.isocat.org

http://blogs.warwick.ac.uk/jmiles/tag/shadow_of_the_colossus/