hitime project

HiTiME project description Christian Roosendaal ([email protected]), Vyacheslav Tykhonov ([email protected]), HiTiME System developers IISH Amsterdam

Upload: vty

Post on 15-May-2015


TRANSCRIPT

Page 1: HiTIME project

HiTiME project description

Christian Roosendaal ([email protected]),

Vyacheslav Tykhonov ([email protected]),

HiTiME System developers IISH Amsterdam

Page 2: HiTIME project

HiTiME prototype data flow

[Data flow diagram with numbered steps 1 to 7. Components: Source data, CMS (Drupal, WordPress, …), Input DB, Processing module, Entity Recognize module, NER (shown three times), Meanings module, Knowledge Base, Training sets from IISH archives]

Processing module
● Check for new documents
● Split into words
● Store in DB

Entity Recognize module
● Retrieve document tokens
● Send to NER by telnet
● If token is recognized entity → store in DB

Meanings module
● Look for sequences of entities
● Replace with known composite entities

Page 3: HiTIME project

IISH systems integration

HiTiME application
- Persons
- Organizations
- Locations
- Dates
- Professions

Clio-Infrastructure
● Infrastructure to store data from different systems
● Connect dates and locations with datasets
● Find relevant documents in time/location domain
● Visualize trends relevant to documents

LINKS
Database with 8000+ professions
● Create training sets

Evergreen library system
● Create training sets for authority records
● Improve MARC21 records

Search (search.iisg.nl)
● Improve metadata
● Extend functionality with new filters

PID service
● Store entities

Knowledge base
Export data to e.g. RDF, OWL, XML

OCR application
● Scans, posters, archives

External applications
● BWSA
● Timeline
● Visual Mets

Page 4: HiTIME project

System design

[Diagram components: Input data, HiTiME core, KB]

doc_id      last_modified   data
Document 1  12-13-12 12:04  “Petrus Alma is great...”
Document 2  12-13-12 11:37  “...”

doc_id      last_modified   data
Document 1  12-13-12 12:04  “<person>Petrus Alma</person> is great...”
Document 2  12-13-12 11:37  “...”

● HiTiME core checks for new or updated documents in input database
● Input database can be any type of database with timestamps
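The check for new or updated documents can be sketched as a timestamp query. This is only an illustration, not the HiTiME implementation: SQLite stands in for the input database, the table name `documents` and the function `fetch_changed_documents` are hypothetical, and timestamps are assumed to be stored in ISO format so that string comparison follows time order.

```python
import sqlite3


def fetch_changed_documents(conn, last_check):
    """Return rows whose last_modified is later than last_check."""
    cur = conn.execute(
        "SELECT doc_id, last_modified, data FROM documents "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_check,),
    )
    return cur.fetchall()


# Hypothetical input database holding the two example documents.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id INTEGER PRIMARY KEY, last_modified TEXT, data TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (1, "2012-12-13 12:04", "Petrus Alma is great..."),
        (2, "2012-12-13 11:37", "..."),
    ],
)

# Only documents touched since the previous run need to be processed again.
changed = fetch_changed_documents(conn, "2012-12-13 12:00")
```

Because the query relies only on a `last_modified` column, the same approach should carry over to any input database that records timestamps.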

Page 5: HiTIME project

Database design (1/2)

Example string: “Petrus Alma is great”

Split text into words and store words separately in table:

doc_id  word_id  word
0       0        Petrus
0       1        Alma
0       2        is
0       3        great

Store coordinates of each word in coordinate table:

doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        0
0       0            1         1        0
0       0            2         2        0
0       0            3         3        0
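A minimal sketch of this step, assuming SQLite, a crude whitespace split, and a text that consists of a single sentence. The table names `words` and `coordinates` and the function `store_document` are illustrative, not taken from the prototype.

```python
import sqlite3


def store_document(conn, doc_id, text):
    """Split text into words and record each word with its coordinates."""
    # Crude whitespace split; the whole text is treated as sentence 0.
    for position, word in enumerate(text.split()):
        word_id = position  # assumption: word_id counts words within a document
        conn.execute(
            "INSERT INTO words VALUES (?, ?, ?)", (doc_id, word_id, word)
        )
        # meaning_flag starts at 0 and identity_id stays empty until later steps.
        conn.execute(
            "INSERT INTO coordinates VALUES (?, ?, ?, ?, ?, ?)",
            (doc_id, 0, position, word_id, 0, None),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (doc_id INTEGER, word_id INTEGER, word TEXT)")
conn.execute(
    "CREATE TABLE coordinates (doc_id INTEGER, sentence_id INTEGER, "
    "position INTEGER, word_id INTEGER, meaning_flag INTEGER, "
    "identity_id INTEGER)"
)

store_document(conn, 0, "Petrus Alma is great")
words = conn.execute(
    "SELECT doc_id, word_id, word FROM words ORDER BY word_id"
).fetchall()
coords = conn.execute("SELECT * FROM coordinates ORDER BY position").fetchall()
```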

Page 6: HiTIME project

Database Design (2/2)

Processing of text by NER. Output of NER:

“Petrus” → PERS
“Alma” → PERS
“is” → 0
“great” → 0

Store in decision table:

word_id  NER   Frog  Heidel  UCTO  Decision
0        PERS  -     -       -     PERS
1        PERS  -     -       -     PERS

Update meaning_flag in coordinate table:

doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        1
0       0            1         1        1
0       0            2         2        0
0       0            3         3        0
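The two updates can be sketched as follows. This assumes SQLite, the hypothetical table names `decisions` and `coordinates`, and a `word_id` that identifies a word on its own (as in the single-document example). Since the prototype only uses NER, the sketch copies the NER label into the decision column.

```python
import sqlite3


def record_ner_output(conn, ner_labels):
    """Store recognized labels and flag the matching words."""
    for word_id, label in ner_labels.items():
        if label == "0":  # "0" marks a word that NER did not recognize
            continue
        # With a single tagger, the decision is simply the NER label.
        conn.execute(
            "INSERT INTO decisions (word_id, ner, decision) VALUES (?, ?, ?)",
            (word_id, label, label),
        )
        conn.execute(
            "UPDATE coordinates SET meaning_flag = 1 WHERE word_id = ?",
            (word_id,),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE coordinates (doc_id INTEGER, sentence_id INTEGER,
        position INTEGER, word_id INTEGER, meaning_flag INTEGER,
        identity_id INTEGER);
    CREATE TABLE decisions (word_id INTEGER, ner TEXT, frog TEXT,
        heidel TEXT, ucto TEXT, decision TEXT);
    """
)
conn.executemany(
    "INSERT INTO coordinates VALUES (0, 0, ?, ?, 0, NULL)",
    [(i, i) for i in range(4)],
)

# NER output for "Petrus Alma is great", keyed by word_id.
record_ner_output(conn, {0: "PERS", 1: "PERS", 2: "0", 3: "0"})
flags = [
    row[0]
    for row in conn.execute(
        "SELECT meaning_flag FROM coordinates ORDER BY position"
    )
]
decisions = conn.execute(
    "SELECT word_id, ner, decision FROM decisions ORDER BY word_id"
).fetchall()
```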

Page 7: HiTIME project

Improvement: Integration of FROG, UCTO and HeidelTime

● Prototype only uses NER, and crude methods to split raw text into sentences and words
● Splitting can be made more reliable with UCTO and FROG
● Time expressions are not recognized in prototype → HeidelTime

Page 8: HiTIME project

Improvement: Disambiguation of recognized entities (1/2)

Word       NER  Frog  Heidel  ...  Decision
Amsterdam  LOC  -     -       -    LOC

Amsterdam is a location. Seems right, but what if the text means the VOC ship “Amsterdam”?

Page 9: HiTIME project

Improvement: Disambiguation of recognized entities (2/2)

NER can be trained to improve accuracy. By making use of differently trained NERs, we can build an Expert System:

Word       NER  Frog  Heidel  NER2  NER3  Decision
Amsterdam  LOC  -     -       SHIP  BAND  ?

Final decision can be made based on priorities of trained models. Our idea is to assign the lowest priorities to wide-scope models.
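One way to read the priority idea is sketched below. The function `decide` and the numeric priorities are hypothetical: the slides only state that wide-scope models get the lowest priority, so the ordering between the narrower models (NER2 above NER3) is an arbitrary choice made for illustration.

```python
def decide(labels, priorities):
    """Pick the label from the highest-priority model that produced one."""
    candidates = [
        (priorities[model], label)
        for model, label in labels.items()
        if label is not None
    ]
    if not candidates:
        return None
    # On equal priority, max() keeps the first candidate it encountered.
    return max(candidates, key=lambda item: item[0])[1]


# Hypothetical priorities: the general-purpose NER is the widest in scope,
# so it gets the lowest value. The NER2/NER3 order is an assumption.
PRIORITIES = {"NER": 1, "NER2": 3, "NER3": 2}

decision = decide({"NER": "LOC", "NER2": "SHIP", "NER3": "BAND"}, PRIORITIES)
fallback = decide({"NER": "LOC", "NER2": None, "NER3": None}, PRIORITIES)
```

With these assumed values, the specialized ship model overrides the general location label, while a word that only the general model recognizes keeps its LOC label.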

Ships
Amsterdam (VOC ship), an 18th century cargo ship
MS Amsterdam, a cruise ship owned and operated by Holland America Line

Music
Amsterdam (band), a pop band from the United Kingdom
"Amsterdam" (Jacques Brel song), a song by Jacques Brel

Page 10: HiTIME project

Improvement: “composite” entities (1/2)

In our prototype:

“Petrus Alma is great”
“Petrus” → Recognized as person
“Alma” → Recognized as person

Should be:

“Petrus Alma is great”
“Petrus Alma” → Recognized as one person

Page 11: HiTIME project

Improvement: “composite” entities (2/2)

Possible solution: Keep track of known entities in separate entities table:

identity_id  name              type
0            Petrus Alma       PERS
1            Aron van Dam      PERS
2            Frederik Feringa  PERS

Search for sequences of recognized entities in coordinate table:

doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        1             0
0       0            1         1        1             0
0       0            2         2        0
0       0            3         3        0

“Petrus Alma”

Compare these sequences with entities in entities table:

Final decision about entity:

identity_id  name         type
0            Petrus Alma  PERS
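The sequence search and lookup can be sketched in plain Python. The function `link_composites` and the dictionary `ENTITIES` are hypothetical stand-ins for the coordinate and entities tables, and the sketch assumes that a composite entity is a run of adjacent words with meaning_flag 1 whose joined text matches a known name exactly.

```python
from itertools import groupby

# Known entities, keyed by name, with their identity_id as value.
ENTITIES = {"Petrus Alma": 0, "Aron van Dam": 1, "Frederik Feringa": 2}


def link_composites(words, flags, entities):
    """Return one identity_id (or None) per word of a sentence."""
    identity = [None] * len(words)
    position = 0
    # groupby yields maximal runs of adjacent words sharing the same flag.
    for flagged, group in groupby(zip(words, flags), key=lambda p: p[1] == 1):
        run = [word for word, _ in group]
        if flagged:
            name = " ".join(run)
            if name in entities:
                for i in range(position, position + len(run)):
                    identity[i] = entities[name]
        position += len(run)
    return identity


ids = link_composites(
    ["Petrus", "Alma", "is", "great"], [1, 1, 0, 0], ENTITIES
)
```

Exact matching on whole runs is a simplification: a name such as “Aron van Dam” would only be linked if “van” were flagged as well, and partial or misspelled names would need fuzzier comparison.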

Page 12: HiTIME project

BWSA application before processing

Page 13: HiTIME project

BWSA application after processing