hitime project

Post on 15-May-2015


HiTiME project description

Christian Roosendaal (christian.roosendaal@gmail.com)

Vyacheslav Tykhonov (vty@iisg.nl)

HiTiME system developers, IISH Amsterdam

HiTiME prototype data flow

[Diagram: source data, CMS (Drupal, WordPress, …), Input DB, three NER instances and the Knowledge Base, connected by steps numbered 1 to 7]

Processing module
● Check for new documents
● Split into words
● Store in DB

Entity Recognize module
● Retrieve document tokens
● Send to NER by telnet
● If token is recognized entity → store in DB

Meanings module
● Look for sequences of entities
● Replace with known composite entities
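The Entity Recognize module listed on this slide could look roughly like the sketch below. It is not the project's code. It assumes a line-based protocol that the slides do not confirm: one UTF-8 token per connection goes in, one label line comes back, and `0` means "no entity". The host, port and both function names are hypothetical, and the returned list stands in for the DB write.

```python
import socket


def ask_ner_server(token, host="localhost", port=9191, timeout=5.0):
    """Send one token to a line-based NER server and return its label.

    Assumed protocol (not confirmed by the slides): one UTF-8 line in,
    one label line out, e.g. "PERS" or "0".
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall((token + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reply:
            return reply.readline().strip()


def recognize_entities(tokens, ask=ask_ner_server):
    """Return (position, token, label) for every token the NER recognizes."""
    recognized = []
    for position, token in enumerate(tokens):
        label = ask(token)
        if label and label != "0":
            # In the prototype this is where the entity would be stored in the DB.
            recognized.append((position, token, label))
    return recognized
```

Passing the lookup function as the `ask` parameter keeps the loop independent of the network, so it can be tried with a stub such as `lambda t: "PERS" if t in ("Petrus", "Alma") else "0"` before a real server is attached.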

IISH systems integration

HiTiME application
- Persons
- Organizations
- Locations
- Dates
- Professions

Training sets from IISH archives

Clio-Infrastructure
● Infrastructure to store data from different systems
● Connect dates and locations with datasets
● Find relevant documents in time/location domain
● Visualize trends relevant to documents

LINKS
Database with 8000+ professions
● Create training sets

Evergreen library system
● Create training sets for authority records
● Improve MARC21 records

Search (search.iisg.nl)
● Improve metadata
● Extend functionality with new filters

PID service
● Store entities

Knowledge base
Export data to e.g. RDF, OWL, XML

OCR application
● Scans, posters, archives

External applications
● BWSA
● Timeline
● Visual Mets
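As an illustration of the knowledge base export mentioned on this slide, here is a minimal sketch of an XML export using Python's standard library. The element and attribute names (`entities`, `entity`, `id`, `type`) and the function name are assumptions made for this example. The slides do not specify an export schema.

```python
import xml.etree.ElementTree as ET


def entities_to_xml(entities):
    """Serialize (identity_id, name, type) rows as an XML string."""
    root = ET.Element("entities")  # element names are illustrative only
    for identity_id, name, entity_type in entities:
        node = ET.SubElement(root, "entity", id=str(identity_id), type=entity_type)
        node.text = name
    return ET.tostring(root, encoding="unicode")
```

Calling `entities_to_xml([(0, "Petrus Alma", "PERS")])` yields a single `entities` element with one `entity` child. An RDF or OWL export would need an agreed vocabulary, which is outside the scope of this sketch.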

System design

[Diagram: input data, HiTiME core, KB]

Input data:

doc_id      last_modified   data
Document 1  12-13-12 12:04  “Petrus Alma is great...”
Document 2  12-13-12 11:37  “...”

KB:

doc_id      last_modified   data
Document 1  12-13-12 12:04  “<person>Petrus Alma</person> is great...”
Document 2  12-13-12 11:37  “...”

● HiTiME core checks for new or updated documents in input database
● Input database can be any type of database with timestamps
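The check for new or updated documents can be sketched as a timestamp query. This is a minimal sketch under stated assumptions: SQLite stands in for "any type of database with timestamps", the table name `documents` and the function name are hypothetical, and timestamps are stored as ISO-style strings so that string comparison matches chronological order.

```python
import sqlite3


def fetch_changed_documents(conn, last_run):
    """Return documents whose last_modified is later than the previous run."""
    cursor = conn.execute(
        "SELECT doc_id, last_modified, data FROM documents "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_run,),
    )
    return cursor.fetchall()


# Demo with an in-memory database standing in for the input DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id INTEGER, last_modified TEXT, data TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (1, "2012-12-13 12:04", "Petrus Alma is great..."),
        (2, "2012-12-13 11:37", "..."),
    ],
)
# Only documents modified after the previous run are picked up.
changed = fetch_changed_documents(conn, "2012-12-13 12:00")
```

After each run the core would record the latest `last_modified` value it has seen and pass it as `last_run` next time.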

Database design (1/2)

Example string: “Petrus Alma is great”

Split text into words and store words separately in table:

doc_id  word_id  word
0       0        Petrus
0       1        Alma
0       2        is
0       3        great

Store coordinates of each word in coordinate table:

doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        0
0       0            1         1        0
0       0            2         2        0
0       0            3         3        0
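The two tables on this slide can be produced with a splitter along these lines. This is a minimal sketch with deliberately crude, regex-based sentence and word splitting. The function name is hypothetical, and numbering `word_id` per document in reading order is an assumption read off the example rows.

```python
import re


def split_document(doc_id, text):
    """Split text into words and build rows for the word and coordinate tables."""
    word_rows = []        # (doc_id, word_id, word)
    coordinate_rows = []  # (doc_id, sentence_id, position, word_id, meaning_flag, identity_id)
    word_id = 0
    # Crude sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence_id, sentence in enumerate(sentences):
        for position, word in enumerate(re.findall(r"\w+", sentence)):
            word_rows.append((doc_id, word_id, word))
            # meaning_flag starts at 0 and identity_id is empty until later stages.
            coordinate_rows.append((doc_id, sentence_id, position, word_id, 0, None))
            word_id += 1
    return word_rows, coordinate_rows


words, coordinates = split_document(0, "Petrus Alma is great")
```

For the example string this gives four word rows and four coordinate rows, all in sentence 0 with `meaning_flag` 0.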

Database design (2/2)

Processing of text by NER. Output of NER:

“Petrus” → PERS
“Alma” → PERS
“is” → 0
“great” → 0

Store in decision table:

word_id  NER   Frog  Heidel  UCTO  Decision
0        PERS                      PERS
1        PERS                      PERS

Update meaning_flag in coordinate table:

doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        1
0       0            1         1        1
0       0            2         2        0
0       0            3         3        0
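The two updates on this slide, writing the decision table and setting `meaning_flag`, can be sketched together. This is an illustrative sketch only: rows are plain dictionaries instead of database records, the function name is hypothetical, and with a single tagger the decision is assumed to equal the NER label.

```python
def apply_ner_output(coordinate_rows, ner_labels):
    """Build decision rows and set meaning_flag for recognized words.

    coordinate_rows: list of dicts with the coordinate-table columns.
    ner_labels: dict mapping word_id to a label, with "0" for no entity.
    """
    decisions = []
    for row in coordinate_rows:
        label = ner_labels.get(row["word_id"], "0")
        if label != "0":
            # With only one tagger, the decision is simply the NER label.
            decisions.append({"word_id": row["word_id"], "NER": label, "Decision": label})
            row["meaning_flag"] = 1
    return decisions


coordinates = [
    {"doc_id": 0, "sentence_id": 0, "position": p, "word_id": p,
     "meaning_flag": 0, "identity_id": None}
    for p in range(4)
]
decisions = apply_ner_output(coordinates, {0: "PERS", 1: "PERS", 2: "0", 3: "0"})
```

Only recognized words get a decision row, which matches the two-row decision table shown on the slide.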

Improvement: Integration of FROG, UCTO and HeidelTime

● Prototype only uses NER, and crude methods to split raw text into sentences and words
● Splitting can be made more reliable with UCTO and FROG
● Time expressions are not recognized in prototype → HeidelTime

Improvement: Disambiguation of recognized entities (1/2)

Word       NER  Frog  Heidel  ...  Decision
Amsterdam  LOC                     LOC

Amsterdam is a location. Seems right, but what if the text means the VOC ship “Amsterdam”?

Improvement: Disambiguation of recognized entities (2/2)

NER can be trained to improve accuracy. By making use of differently trained NERs we can build an expert system:

Word       NER  Frog  Heidel  NER2  NER3  Decision
Amsterdam  LOC                SHIP  BAND  ?

The final decision can be made based on the priorities of the trained models. Our idea is to assign the lowest priorities to wide-scope models.
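A priority-based decision of this kind could be sketched as follows. The priority numbers and the function name are hypothetical. The slide only states that wide-scope models should rank lowest, so the general `NER` model gets priority 0 here, and the order between `NER2` and `NER3` is arbitrary.

```python
# Hypothetical priorities: a higher number wins. The wide-scope general
# model ("NER") gets the lowest priority, as the slide proposes.
MODEL_PRIORITY = {"NER": 0, "NER2": 1, "NER3": 2}


def decide(labels, priority=MODEL_PRIORITY):
    """Pick the label of the highest-priority model that recognized the word.

    labels: dict mapping model name to a label, or None if the model gave none.
    """
    candidates = [
        (priority[model], label)
        for model, label in labels.items()
        if label is not None and model in priority
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


# The specialised ship model overrides the general location label.
ship_case = decide({"NER": "LOC", "NER2": "SHIP", "NER3": None})
# With no specialised hit, the general label is kept.
plain_case = decide({"NER": "LOC", "NER2": None, "NER3": None})
```

When two specialised models both fire, as in the SHIP versus BAND row, a fixed priority alone is a blunt tool, which is why the slide leaves that decision open.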

Some meanings of “Amsterdam”:

Ships
● Amsterdam (VOC ship), an 18th century cargo ship
● MS Amsterdam, a cruise ship owned and operated by Holland America Line

Music
● Amsterdam (band), a pop band from the United Kingdom
● "Amsterdam" (Jacques Brel song), a song by Jacques Brel

Improvement: “composite” entities (1/2)

In our prototype:

“Petrus Alma is great”
“Petrus” → recognized as person
“Alma” → recognized as person

Should be:

“Petrus Alma is great”
“Petrus Alma” → recognized as one person

Improvement: “composite” entities (2/2)

Possible solution: Keep track of known entities in separate entities table.

Search for sequences of recognized entities in coordinate table:

doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        1             0
0       0            1         1        1             0
0       0            2         2        0
0       0            3         3        0

Compare these sequences with entities in entities table:

“Petrus Alma”

identity_id  name              type
0            Petrus Alma       PERS
1            Aron van Dam      PERS
2            Frederik Feringa  PERS

Final decision about entity:

identity_id  name         type
0            Petrus Alma  PERS
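The composite-entity lookup on this slide can be sketched as a scan for runs of adjacent flagged words that are then compared with the entities table. This is a minimal sketch, not the prototype's code: the function name is hypothetical, rows are plain dictionaries, and `words` is keyed by `word_id` alone, which only works for a single document.

```python
def link_composites(coordinate_rows, words, entities):
    """Set identity_id on runs of adjacent flagged words that match a known entity.

    coordinate_rows: dicts with the coordinate-table columns, in reading order.
    words: dict mapping word_id to word.
    entities: dict mapping entity name to identity_id (the entities table).
    """
    def close(run):
        name = " ".join(words[r["word_id"]] for r in run)
        if name in entities:
            for r in run:
                r["identity_id"] = entities[name]

    run = []
    for row in coordinate_rows:
        adjacent = (
            run
            and row["doc_id"] == run[-1]["doc_id"]
            and row["sentence_id"] == run[-1]["sentence_id"]
            and row["position"] == run[-1]["position"] + 1
        )
        # A run ends at an unflagged word or at a gap in position or sentence.
        if run and not (row["meaning_flag"] == 1 and adjacent):
            close(run)
            run = []
        if row["meaning_flag"] == 1:
            run.append(row)
    if run:
        close(run)
    return coordinate_rows


words = {0: "Petrus", 1: "Alma", 2: "is", 3: "great"}
entities = {"Petrus Alma": 0, "Aron van Dam": 1, "Frederik Feringa": 2}
rows = [
    {"doc_id": 0, "sentence_id": 0, "position": p, "word_id": p,
     "meaning_flag": 1 if p < 2 else 0, "identity_id": None}
    for p in range(4)
]
link_composites(rows, words, entities)
```

The sketch only matches a whole run against the table. A run that contains a known name plus extra flagged words would need a sub-sequence search, which is left out here.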

BWSA application before processing

BWSA application after processing
