premapper: improving entity extraction accuracy in the digital humanities
Post on 18-Dec-2014
IBM Haifa Research Lab
© 2014 IBM Corporation
PreMapper: Improving Entity Extraction Accuracy in the Digital Humanities
Cormac Hampson (cormac.hampson@scss.tdc.ie)
Ella Rabinovich (ellak@il.ibm.com), Sara Porat (porat@il.ibm.com)
Maya Koleva (maya.koleva@commetric.com), Ivan Uzunov (ivan.uzunov@commetric.com)
Owen Conlan (owen.conlan@scss.tcd.ie)
What is CULTURA?
• Digital humanities portal supporting the exploration of cultural heritage collections by a range of different users
  • Professional researchers and historians
  • Students with little or no experience of a particular archive
• There are three digitised collections in the portal
  • 1641 Depositions (http://cultura-project.eu/1641/)
  • Bureau of Military History - 1916 Rising (http://cultura-project.eu/1916)
  • IPSA Collection (http://cultura-project.eu/ipsa)
Smart Content Analysis with Entity-Relationship Extraction
• A powerful technique for injecting semantics into unstructured text
  • Employs Natural Language Processing (NLP)
  • Involves training on a dataset and/or using prior knowledge (e.g., dictionaries) so that specific entities can be identified within the text
• Each collection introduces its own entity-relationship model
  • Entities, e.g., Person, Location, Event
  • Entity attributes, e.g., Person.occupation, Deposition.mentioned_date
  • Relations between entities, e.g., Person at Location
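A minimal sketch of how such an entity-relationship model could be held in code. The class names, field names and sample values below are illustrative assumptions, not the CULTURA schema; only the entity types, the `occupation` attribute and the "Person at Location" relation come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: str
    entity_type: str  # e.g. "Person", "Location", "Event"
    attributes: dict = field(default_factory=dict)  # e.g. {"occupation": ...}


@dataclass(frozen=True)
class Relation:
    source_id: str
    relation_type: str  # e.g. "at" for "Person at Location"
    target_id: str


# Made-up sample values, for illustration only.
person = Entity("p1", "Person", {"occupation": "knight"})
location = Entity("l1", "Location", {"name": "example town"})
relation = Relation(person.entity_id, "at", location.entity_id)
```

Keeping attributes in a plain dictionary lets each collection define its own attribute set without changing the classes.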
Entity Extraction – Example
<title> first-name last-name
sir Robert Andrew
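The pattern above can be approximated with a small dictionary-plus-regex matcher. This is a simplified stand-in for illustration, not the NLP pipeline used in CULTURA; the title list and the function name `extract_persons` are assumptions.

```python
import re

# Hypothetical dictionary of titles; a real one would be curated per collection.
TITLES = ("sir", "lord", "lady", "captain")

# <title> first-name last-name, with capitalised first and last names.
PERSON_PATTERN = re.compile(
    r"\b(?P<title>" + "|".join(TITLES) + r")\s+"
    r"(?P<first>[A-Z][a-z]+)\s+(?P<last>[A-Z][a-z]+)\b"
)


def extract_persons(text):
    """Return one dict (title, first, last) per pattern match in the text."""
    return [match.groupdict() for match in PERSON_PATTERN.finditer(text)]


persons = extract_persons("the said sir Robert Andrew deposeth")
# persons == [{'title': 'sir', 'first': 'Robert', 'last': 'Andrew'}]
```

A rule this strict misses variant spellings and lower-cased names, which is one reason noisy historical text needs manual correction.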
Manual Updates of Extracted Entities - Motivation
• The automatic task of entity extraction cannot provide full accuracy
• The 1641 Depositions collection introduces additional difficulty due to its noisy text and inconsistent grammar and spelling
• Extraction errors can damage a curator’s trust in the automatic processing, as well as an end user’s overall confidence in the system
• Approaches to improve the accuracy of entity extraction are of major benefit to the CULTURA environment
Entity Visualisation and Modification with PreMapper
• PreMapper is a web-based visualization and analysis tool that is integrated into the CULTURA environment
• Provides visualisation and editing of entities, maps, flows and relationships between individuals and groups
• Entities (people, organizations) are represented by nodes; links represent relationships between these nodes
Manual Changes of Extracted Entities
PreMapper enables curators of the collection to make manual changes to the extracted entities using a GUI
• Add/delete/update entity
• Merge two entities into a single entity (entity disambiguation)
• Add/delete relationship between entities
The entity “sir phelim” can be merged with the entity “phelim neil” if an expert deems that these entities refer to the same person
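A minimal sketch of what a merge could do to the underlying data, assuming entities are stored as a dictionary keyed by id and relationships as (source, label, target) triples. The storage layout, the function name `merge_entities` and the third sample entity are hypothetical and do not describe PreMapper's internals.

```python
def merge_entities(entities, relations, keep_id, drop_id):
    """Merge entity drop_id into keep_id and re-point its relationships."""
    kept, dropped = entities[keep_id], entities[drop_id]
    aliases = set(kept.get("aliases", [])) | set(dropped.get("aliases", []))
    aliases.add(dropped["name"])  # remember the merged-away name

    merged_entities = {k: v for k, v in entities.items() if k != drop_id}
    merged_entities[keep_id] = {"name": kept["name"], "aliases": sorted(aliases)}

    merged_relations = set()
    for source, label, target in relations:
        source = keep_id if source == drop_id else source
        target = keep_id if target == drop_id else target
        if source != target:  # drop links that would become self-loops
            merged_relations.add((source, label, target))
    return merged_entities, merged_relations


# "e3" is a made-up entity used only to show relationship re-pointing.
entities = {
    "e1": {"name": "sir phelim"},
    "e2": {"name": "phelim neil"},
    "e3": {"name": "example place"},
}
relations = {("e1", "at", "e3"), ("e2", "at", "e3")}

new_entities, new_relations = merge_entities(entities, relations, "e2", "e1")
```

The function returns new collections instead of mutating its inputs, so a change can be reviewed before it replaces the extracted data.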
General Flow
Entity Disambiguation via PreMapper
• The task of determining the identity of entities mentioned in the text
  • e.g., based on an entity’s key attributes
• Entity disambiguation in historical content is one of the main challenges for CULTURA’s professional users
  • Are “sir Phelim” and “Phelim o neil” the same person?
  • Are “Rob. Meredith” and “Robert Meredith” the same person?
  • Entity scope matters (disambiguation of entities found in the same deposition vs. entities found in different depositions)
• Non-functional challenges
  • Authorization – who is allowed to make changes?
  • Personalization – what is the scope of a specific change (specific researcher, group of researchers, the entire professional community)?
  • Verification – who verifies the changes?
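The name-matching questions on this slide can be partly supported by suggesting merge candidates automatically and leaving the decision to an expert. The sketch below uses plain string similarity from Python's standard library; the 0.8 threshold and the function names are assumptions, and this is not the method used in CULTURA.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    """Lower-case, strip punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^a-z\s]", "", name.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """String similarity in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def merge_candidates(names, threshold=0.8):
    """Return name pairs similar enough to show to an expert for review."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if similarity(a, b) >= threshold
    ]


names = ["sir Phelim", "Phelim o neil", "Rob. Meredith", "Robert Meredith"]
candidates = merge_candidates(names)
# candidates == [('Rob. Meredith', 'Robert Meredith')]
```

With this threshold the abbreviation pair is suggested, while “sir Phelim” and “Phelim o neil” are not, since they share only one name token. Cases like that still depend on an expert's judgement and on scope, such as whether both names occur in the same deposition.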
Summary and Future Work
• Entity-relationship extraction is a powerful technique for extracting structured information from unstructured documents
• PreMapper is a visualization tool that allows domain experts to improve the accuracy of the entity-relationship data
• Domain experts’ feedback is important in refining the user interface with the CULTURA environment
  • It becomes vital when entity extraction is error-prone, as with the 1641 Depositions collection, which contains a lot of noise and misspellings
• Future work includes further exploration and design of the fully integrated end-to-end solution
http://staging1.commetric.com:8080/cultura/?q=1641&ids=836062r034&nodeTypeId=7&layout=circle#svg-graph-editor-switch