on embedding machine-processable semantics into documents

Post on 14-Jan-2016




1

On Embedding Machine-Processable Semantics into Documents

Krishnaprasad Thirunarayan

Department of Computer Science & Engineering

Wright State University

Dayton, OH-45435, USA

2

Talk Outline

Background and Motivation (Why?)

Goals (What?)

Details (How?)

Conclusions

3

Background and Motivation

4

Heterogeneous Doc. Spec. Defn. Rep.

Content Extraction:

Formalize doc, using controlled vocabulary

5

Problems with this approach to content extraction

Archiving the spec (for human comprehension) separately from its formalization is not conducive to traceability.

Manual extraction from the spec (from scratch) for each use is labor intensive, time consuming, and prone to typographical errors.

6

Observation

Conceptually, every piece of information in an extraction owes its existence to a phrase in the spec and, possibly, to the controlled vocabulary. So, explore techniques to maintain the correspondence between a spec fragment and its formalization.

7

Goal

8

General Problem

Embed domain-specific mark-up (annotations) into a human-sensible document to make explicit the semantics of “content” text and complex data, and to augment an interpretation in a modular fashion.

Document text: human comprehensible

Semantic mark-up: machine processable

9

Details (How?)

10

Nature of Specs

Semi-structured

Heterogeneous: text, tables, images

Constrained technical vocabulary

Available as MS Word document

11

Pre-processing Spec

Abstract content from the spec document by removing display-oriented information:

Save text

Save tabular data, preserving grid layout

Retain links to images

…

Note: the “Save As text” option in MS Word is inadequate

12

Heterogeneous Document

13

XML generated by Majix

14

ASCII Output

15

Annotating Pre-processed Spec

Embedding Machine Processable Semantics

Recognizing and tagging text using controlled vocabulary

By-product of: document indexing and semantic search

Tagging tabular data to make explicit its semantics: same grid layout, but different interpretation and dependencies based on headings

Explore: XML-based programming language Water for defining data and its behavior (semantics)
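As a rough illustration of recognizing and tagging controlled-vocabulary terms in text, the sketch below wraps each known phrase in a tag while leaving the original wording in place. The vocabulary entries, the `<term>` element, and the function name `tag_terms` are all assumptions made for this example; the talk does not specify the tagging tool or the tag format.

```python
import re

# Hypothetical controlled vocabulary: surface phrase -> canonical term.
# The canonical names echo the table headings used later in the talk.
VOCABULARY = {
    "tensile strength": "strength.tensile",
    "yield strength": "strength.yield",
    "thickness": "thickness",
}


def tag_terms(text, vocabulary=VOCABULARY):
    """Wrap each vocabulary phrase found in text in a <term> element."""
    # List longer phrases first so a multi-word phrase wins over a shorter one.
    phrases = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )

    def annotate(match):
        phrase = match.group(1)
        name = vocabulary[phrase.lower()]
        # Keep the author's wording inside the tag so the text stays readable.
        return f'<term name="{name}">{phrase}</term>'

    return pattern.sub(annotate, text)


print(tag_terms("Minimum Tensile Strength depends on thickness."))
```

Because the human-readable phrase stays inside the annotation, the tagged text can still be traced back to the spec fragment it came from.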

16

Locating Controlled Vocabulary Terms

17

Example Table

Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    165                       155
0.50 – 1.00       160                       150
1.00 – 1.50       155                       145

18

Example of Tagged Table

Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi)

table.<setHeading thickness strength.tensile strength.yield/>

0.50 and under 165 155

table.<addRow 0 0.50 165 155 />

0.50 - 1.00 160 150

table.<addRow 0.50 1.00 160 150 />

1.00 - 1.50 155 145

table.<addRow 1.00 1.50 155 145 /> ...

19

Example of Processing Code

<defclass table rows=required=vector heading=optional=vector>

<defmethod setHeading t=required ts=required ys=required>

<set heading=<vector t ts ys/>/>

</>

<defmethod addRow smin smax ts ys>

<set rows=

table.rows.<insert <vector smin smax ts ys/>/>/>

</>

<defmethod computeYieldStrength> … </>

<defmethod computeTensileStrength> … </>

</>

20

(cont’d)

<defclass table rows=required=vector heading=optional=vector>

<defmethod computeTensileStrength>

<set temp=fluid.Thickness/>

<set i=0/>

<do>

<until <and temp.<less table.rows.<get i/>.1/>

temp.<more_or_equal table.rows.<get i/>.0/> /> >

table.rows.<get i/>.2

</until>

<set i=i.<plus 1/>/>

</do>

</>

</>

21

(cont’d)

<defclass table rows=required=vector heading=optional=vector>

</>

fluid.<set Thickness=0.60/>

<try

<set TensileStrength=table.<computeTensileStrength/>/>

TensileStrength

>

"TABLE: out of range error occurred"

</try>
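For readers unfamiliar with Water, the table class on the preceding slides can be approximated in Python. This is a minimal sketch rather than a translation of the author's code: the class and method names are adapted, the thickness is passed as an argument instead of being read from `fluid.Thickness`, and the yield-strength lookup is assumed to mirror the tensile one because the slides elide its body.

```python
class OutOfRangeError(LookupError):
    """Raised when no row covers the requested thickness."""


class StrengthTable:
    """Rows are (smin, smax, tensile, yield), as in the addRow calls."""

    def __init__(self):
        self.heading = []
        self.rows = []

    def set_heading(self, t, ts, ys):
        self.heading = [t, ts, ys]

    def add_row(self, smin, smax, ts, ys):
        self.rows.append((smin, smax, ts, ys))

    def _find_row(self, thickness):
        # Same test as the Water loop: smin <= thickness < smax.
        for row in self.rows:
            smin, smax = row[0], row[1]
            if smin <= thickness < smax:
                return row
        raise OutOfRangeError(f"TABLE: thickness {thickness} is out of range")

    def compute_tensile_strength(self, thickness):
        return self._find_row(thickness)[2]

    def compute_yield_strength(self, thickness):
        # Assumption: the slides leave this method body elided.
        return self._find_row(thickness)[3]


table = StrengthTable()
table.set_heading("thickness", "strength.tensile", "strength.yield")
table.add_row(0, 0.50, 165, 155)
table.add_row(0.50, 1.00, 160, 150)
table.add_row(1.00, 1.50, 155, 145)

try:
    print(table.compute_tensile_strength(0.60))
except OutOfRangeError:
    print("TABLE: out of range error occurred")
```

Raising an exception for an uncovered thickness plays the role of the `<try>` fallback in the Water example.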

22

Water

XML-based OO scripting language

Facilitates creating Web Services: run methods remotely via web browser

Generalizes dynamic typing to constraint checking: conformance of actuals to formals
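The idea of checking that actual arguments conform to constraints attached to formal parameters can be imitated in Python. The sketch below is only an analogy and does not reproduce Water's mechanism: the decorator `constrained`, the function `make_row`, and the particular constraints are all hypothetical.

```python
import functools
import inspect


def constrained(**constraints):
    """Check each actual argument against a predicate named for its formal."""

    def decorate(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Match actuals to formals exactly as a normal call would.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, predicate in constraints.items():
                value = bound.arguments[name]
                if not predicate(value):
                    raise ValueError(f"{name}={value!r} violates its constraint")
            return func(*args, **kwargs)

        return wrapper

    return decorate


# Hypothetical constraints on a table row: bounds must be non-negative
# and the upper bound must be positive.
@constrained(smin=lambda v: v >= 0, smax=lambda v: v > 0)
def make_row(smin, smax, ts, ys):
    return (smin, smax, ts, ys)


print(make_row(0, 0.50, 165, 155))
```

A predicate can express more than a type test, for example a numeric range, which is the sense in which constraint checking generalizes dynamic typing.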

23

Pros and cons

Encoding improvement: the amount of tagging can be controlled by suitably delimiting table data and annotating it with a corresponding “string-processing” method

Master copy update: changes to the spec require manual modification of the archived annotated version

Irregular tables in specs: different units, etc.
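The encoding improvement can be pictured as follows: instead of tagging every row by hand, the raw table text is delimited once and handed to a single string-processing routine. This is a sketch under assumptions; the function `parse_rows`, the regular expressions, and the treatment of "and under" as a lower bound of 0 (as in the `addRow 0 0.50 165 155` call) are illustrative choices, not the author's method.

```python
import re

# The delimited table body, copied from the example table in the talk.
RAW_TABLE = """\
0.50 and under 165 155
0.50 - 1.00 160 150
1.00 - 1.50 155 145
"""

# "low - high tensile yield"; accepts a hyphen or an en dash in the range.
RANGE_ROW = re.compile(r"^(\d+\.?\d*)\s*[-–]\s*(\d+\.?\d*)\s+(\d+)\s+(\d+)$")
# "high and under tensile yield"; the lower bound is taken to be 0.
UNDER_ROW = re.compile(r"^(\d+\.?\d*)\s+and under\s+(\d+)\s+(\d+)$")


def parse_rows(raw):
    """Turn delimited table text into (smin, smax, tensile, yield) tuples."""
    rows = []
    for line in raw.strip().splitlines():
        line = line.strip()
        m = RANGE_ROW.match(line)
        if m:
            rows.append((float(m[1]), float(m[2]), int(m[3]), int(m[4])))
            continue
        m = UNDER_ROW.match(line)
        if m:
            rows.append((0.0, float(m[1]), int(m[2]), int(m[3])))
            continue
        # Irregular rows are reported rather than silently skipped.
        raise ValueError(f"unrecognized table row: {line!r}")
    return rows


for row in parse_rows(RAW_TABLE):
    print(row)
```

Rejecting unrecognized rows keeps irregular tables (different units and so on) visible to whoever maintains the annotated copy.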

24

Some Related Work

Microsoft Smart Tags: recognize “controlled” words in Office 2003 documents and associate a predefined list of actions with each occurrence

SHOE

Table data in a declarative (logic) language

25

Prolog rendition

strengthTableRow(0,    0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
...

strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L =< Thickness, U > Thickness.

thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).

thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).

?- thicknessToYieldStrength(0.6,YS).

26

Conclusions

27

A Step towards Holy Grail

Ultimately, enable authoring and/or extracting the human-comprehensible and machine-processable parts of a document “hand in hand”, and keep them “side by side”.
