Mary McGee Wood, Shenghui Wang
Dept of Computer Science, U. of Manchester
Valentin Tablan, Diana Maynard, Hamish Cunningham
Dept of Computer Science, U. of Sheffield
Susannah Lydon
Earth Science Education Unit, U. of Keele
Populating a Database from Parallel Texts using “Ontology-based” Information Extraction
The hypothesis
Overview
Parallel texts
Legacy data in the natural sciences
“Ontology-based” Information Extraction
NLDB’04 - a few running threads
Multiple / semi-overlapping text sources
Sophisticated vs shallow or statistical text processing
“Ontologies” are not the same as gazetteers or lexicons (or semantic nets!)
Autonomous agents vs HCC (Human-Computer Collaborative) approaches
We are doing…
Highly homogeneous data sources
Shallow text processing
“Ontologies” only as a last resort
HCC approach
We are not doing…
Heterogeneous data sources
Sophisticated language processing
Improvement of single-source IE or question-answering
Autonomous agents
Parallel texts
Text descriptions in the traditional descriptive sciences.
Descriptions of protein sequences and functions in molecular biology.
Press coverage of news stories.
Police witness-of-crime reports.
(Semi-) automatic marking of free text answers in examinations.
Legacy data in the natural sciences
Text descriptions in the traditional descriptive sciences:
Species descriptions in botany and zoology
Descriptions of diseases in medicine.
Five species of Ranunculus (buttercups)
Six botanists’ text descriptions (Floras)
Data sources
R. acris L. - Meadow Buttercup. Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak; 2n=14.
Typical data
Hand Parsing & Correlation
Sources: CTM, FE, FNA, GLEASON, GRAY, STACE
Petals | Petals | petals | Petals | petals
number: 5 | 5
length: usually 10-15 mm | 9-13 mm | 8-14 mm. long | 0.8-1.4 cm long
width: 8-11 mm | nearly as broad
shape: broadly obovate with cuneate base | broadly obovate | rounded-obovate
colour: bright glossy yellow, rarely paler or white | yellow
Hand-parsed species descriptions for Ranunculus bulbosus
Results of hand-analysis of Ranunculus descriptions from six sources
- Most data from one source only
- Individual texts contain on average 39% of the total information for each species
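The 39% figure is a coverage ratio: how much of the pooled information for a species any one text supplies. A minimal sketch of that calculation follows, assuming each description has already been reduced to a set of filled slots. The source names and slot sets are invented toy values, not the study's data, and the function names are hypothetical.

```python
def coverage_per_source(slots_by_source):
    """For each source, the fraction of all pooled slots that it fills."""
    # The pooled information is the union of slots filled by any source.
    all_slots = set().union(*slots_by_source.values())
    return {
        source: len(slots) / len(all_slots)
        for source, slots in slots_by_source.items()
    }


def average_coverage(slots_by_source):
    """Mean coverage across sources (the analogue of the 39% figure)."""
    coverage = coverage_per_source(slots_by_source)
    return sum(coverage.values()) / len(coverage)


# Toy input (hypothetical): three descriptions of the same species.
toy = {
    "source_a": {("petal", "number"), ("petal", "length")},
    "source_b": {("petal", "length"), ("petal", "colour")},
    "source_c": {("petal", "shape")},
}

print(coverage_per_source(toy))
print(average_coverage(toy))
```

With the toy input, four distinct slots are pooled, so a source filling two of them has coverage 0.5.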
MultiFlora I: Automatic compilation of accurate taxonomic databases from multiple non-computerised sources
Department of Computer Science, University of Manchester: Mary McGee Wood, David Rydeheard, Susannah Lydon
Department of Botany, Natural History Museum, London: Rob Huxley, David Sutton
Supported by the BBSRC / EPSRC joint Bioinformatics Initiative, grant reference number 34/BIO12072
GATE I
Tagger output
Parse trees
Names & verbs
‘Basal leaves more or less deeply divided…’
1231 semantics 179 191 (qlf:[ne_tag(e13, offsets(179, 184)), name(e13, 'Basal'), realisation(e13, offsets(179, 184)), leave(e12), time(e12, present), aspect(e12, simple), voice(e12, active), realisation(e12, offsets(185, 191)), realisation(e12, offsets(185, 191)), lsubj(e12, e13)])
1247 semantics 200 226 (qlf:[divide(e14), adv(e14, less), adv(e14, deeply), time(e14, none), aspect(e14, simple), voice(e14, passive), into(e14, e15), count(e15, 3), realisation(e15, offsets(225, 226)), realisation(e14, offsets(200, 226)), realisation(e14, offsets(200, 226))])
Template output (1)
Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent;
HEAD KIND FEATURE TYPE KIND
Erect
Perennial
to 1m measure unknown
basal position pubescent
leaves Prefix
deeply
palmately
lobed
Template output (2)
flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak;
HEAD KIND FEATURE TYPE KIND NEGATION
flowers 15-25mm measure width
across
sepals reflexed true
achenes short
hooked
smooth
glabrous
2-3.5mm measure unknown
MultiFlora II: Combining Information Extraction and Knowledge Representation for Biodiversity Informatics
Department of Computer Science, University of Manchester: Mary McGee Wood, Susannah Lydon, Alan Rector
Department of Botany, Natural History Museum, London: Rob Huxley
Natural Language Processing Group, University of Sheffield: Hamish Cunningham, Valentin Tablan, Diana Maynard
Supported by the BBSRC Bioinformatics and E-science Programme, grant reference number 34/BEP17049
GATE II
“Ontology-based” Information Extraction
“Ontology” – classes of heads, properties, and features
Gazetteers – instances of these classes
(Lexicons – not currently used)
Head categories
Specific plant parts:
Flower: Flower, floret, Fl
Leaf: leaf, leaves, Fronds
Petal: petal, honey-leaf, vexillum
Collective categories:
PlantSeparatablePart: appendage, glume, tuber
PlantUnseparatablePart: beak, lobe, segment
SpecificRegionOfWhole: apex, border, head
Ontology: Heads
[Figure: ontology-heads.eps]
Properties
2DShape: arching, linear, toothed
3DShape: branching, thickened, tube
Colour: glossy, golden, greenish
Count: numerous, several
Ontology: Properties
Features
Habit: bush, shrub, succulent
MorphologicalProperty: dense, contiguous, separate
SurfaceProperty: pilose, pitted, rugose
Ontology: Features
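The split between ontology classes and gazetteer instances can be illustrated with a simple term lookup. This is a minimal sketch, not the GATE implementation: the class names and terms are copied from the slides, while the `GAZETTEER` layout, `build_index`, and `annotate` are assumptions made for illustration.

```python
# Classes (keys) come from the "ontology"; terms (values) are gazetteer instances.
# Names and terms are taken from the slides; the data layout is an assumption.
GAZETTEER = {
    "head": {
        "Flower": ["flower", "floret", "fl"],
        "Leaf": ["leaf", "leaves", "fronds"],
        "Petal": ["petal", "honey-leaf", "vexillum"],
    },
    "property": {
        "Colour": ["glossy", "golden", "greenish"],
        "Count": ["numerous", "several"],
    },
    "feature": {
        "SurfaceProperty": ["pilose", "pitted", "rugose"],
    },
}


def build_index(gazetteer):
    """Map each lower-cased term to its (kind, class) pair."""
    index = {}
    for kind, classes in gazetteer.items():
        for cls, terms in classes.items():
            for term in terms:
                index[term.lower()] = (kind, cls)
    return index


def annotate(text, index):
    """Return (token, kind, class) for every token found in the gazetteer."""
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return [(tok, *index[tok]) for tok in tokens if tok in index]


index = build_index(GAZETTEER)
for hit in annotate("leaves numerous, pilose; flower golden", index):
    print(hit)
```

Exact-match lookup of this kind is shallow by design: a token such as "petals" is missed unless the plural is listed as its own gazetteer instance.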
Perennial herb with overwintering lf-rosettes from the short oblique to erect premorse stock up to 5 cm, rarely longer and more rhizome-like; roots white, rather fleshy, little branched.
More typical data
System output
Head Class | Head | Property | FeatClass | Feature
Plant | herb | hasLifeform | Lifeform | Perennial
Leaf | lf-rosettes | hasLifeform | Lifeform | overwintering
PlantSepPart | stock | hasRelProperty | RelProperty | short
PlantSepPart | stock | hasOrientation | Orientation | oblique to erect
PlantSepPart | stock | hasLength | Length | up to 5 cm
PlantSepPart | stock | hasRelProperty | RelProperty | rhizome-like
Root | roots | hasColour | Colour | white
Root | roots | hasShape3D | Shape3D | rather fleshy
Root | roots | hasShape3D | Shape3D | little branched
Precision (%)
Measure | R. acris | R. bulbosus | R. hederaceus | Avg
Single description, average | 78 | 60 | 83 | 74
Single description, average, for whole template | 78 | 60 | 83 | 74
Merged, for whole template | 63 | 58 | 69 | 63
Recall (%)
Measure | R. acris | R. bulbosus | R. hederaceus | Avg
Single description, average | 70 | 55 | 74 | 66
Single description, average, for whole template | 22 | 18 | 26 | 22
Merged, for whole template | 69 | 61 | 82 | 71
F-measure
Measure | R. acris | R. bulbosus | R. hederaceus | Avg
Single description, average | 73.78 | 57.39 | 78.24 | 69.77
Single description, average, for whole template | 34.32 | 27.69 | 39.60 | 33.92
Merged, for whole template | 65.86 | 59.46 | 74.94 | 66.76
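The F-measure figures are consistent with the balanced F-measure, F = 2PR / (P + R), applied to the precision and recall values reported above. The slides do not state the formula, so this is an inference from the numbers. A short sketch that recomputes the "Avg" column (the function name `f_measure` is hypothetical):

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs for the "Avg" column of the tables.
avg_scores = {
    "Single description, average": (74, 66),
    "Single description, average, for whole template": (74, 22),
    "Merged, for whole template": (63, 71),
}

for label, (p, r) in avg_scores.items():
    print(f"{label}: {f_measure(p, r):.2f}")
# prints 69.77, 33.92 and 66.76, matching the reported averages
```

Because the harmonic mean is dominated by the smaller value, the low whole-template recall for single descriptions pulls that F-measure down sharply, and merging recovers most of it.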
Information merging
Measure | R. acris | R. bulbosus | R. hederaceus | Avg
Of all instances of missed information, percentage compensated for by merging | 50 | 46 | 55 | 50
Of total number of slots in template, percentage where merging allowed compensation for missed information | 25 | 29 | 18 | 24
These figures based on human judgement
Automated “merging reasoner” under active construction
Information merging
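One way to picture the merging step is as a union of slot fillers across sources, with provenance kept so that a person can resolve disagreements, in line with the HCC approach. This is a hypothetical sketch only, not the merging reasoner under construction: the slot names, source names, and the assignment of values to sources are invented for illustration.

```python
from collections import defaultdict


def merge_templates(templates):
    """Merge per-source slot fillers, recording which sources gave each value."""
    merged = defaultdict(dict)
    for source, slots in templates.items():
        for slot, value in slots.items():
            merged[slot].setdefault(value, []).append(source)
    return dict(merged)


def needs_review(merged):
    """Slots whose sources disagree, left for a human to resolve."""
    return sorted(slot for slot, values in merged.items() if len(values) > 1)


# Toy templates (hypothetical slot names and source labels).
templates = {
    "source_a": {"petal.number": "5", "petal.colour": "yellow"},
    "source_b": {"petal.number": "5", "petal.length": "9-13 mm"},
    "source_c": {"petal.length": "8-14 mm"},
}

merged = merge_templates(templates)
print(merged["petal.colour"])  # filled by one source only
print(needs_review(merged))    # slots with conflicting values
```

A slot missed in one description is filled as soon as any other source supplies it, which is the compensation effect the merging figures measure.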
Future work – short term
Fine-tuning to improve precision
(Semi-) automatic template correlation heuristics
(Semi-) automatic data correlation heuristics
Extend coverage and evaluation
Future targets
Techniques:
Merging reasoner
Temporal reasoner
Data types:
Large-scale legacy data in biodiversity studies
Free text annotations in Bioinformatics databases
…