On Embedding Machine-Processable Semantics into Documents

Learn more at: http://cecs.wright.edu

1
On Embedding Machine-Processable Semantics into
Documents
  • Krishnaprasad Thirunarayan
  • Department of Computer Science and Engineering
  • Wright State University
  • Dayton, OH-45435, USA

2
Talk Outline
  • Background and Motivation (Why?)
  • Goals (What?)
  • Details (How?)
  • Conclusions

3
Background and Motivation
4

Content Extraction: formalize the document using a controlled vocabulary
[Diagram labels: Heterogeneous Doc.; Spec.; Defn. Rep.]
5
Problems with this approach to content extraction
  • Archiving the spec (for human comprehension)
    separately from its formalization is not
    conducive to traceability.
  • Manual extraction from spec (from scratch) for
    each use is labor intensive, time consuming, and
    prone to typographical errors.

6
Observation
  • Conceptually, every piece of information in an
    extraction owes its existence to a phrase in
    the spec and, possibly, the controlled vocabulary.
  • So, explore techniques to maintain correspondence
    between a spec fragment and its formalization.
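
One way to picture such a correspondence is a record that keeps each formalized fact together with the character span of the spec phrase it came from. The sketch below is only an illustration: the `Extraction` class, its field names, and the sample spec sentence are assumptions made up for this example, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """A formalized fact plus the spec span it was derived from (hypothetical)."""
    term: str   # controlled-vocabulary term, e.g. "strength.tensile"
    value: str  # formalized value
    start: int  # offset of the source phrase in the spec text
    end: int    # end offset (exclusive)

    def source_text(self, spec: str) -> str:
        """Return the spec fragment this fact traces back to."""
        return spec[self.start:self.end]


# Made-up spec sentence, used only to exercise the record.
spec = "Tensile strength shall be 165 ksi for thickness 0.50 and under."
phrase = "165 ksi"
start = spec.index(phrase)
fact = Extraction("strength.tensile", "165", start, start + len(phrase))
```

Calling `fact.source_text(spec)` recovers the originating phrase, which is the traceability the slide asks for.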

7
Goal
8
General Problem
  • Embed domain-specific mark-up (annotations) into
    a human-sensible document
  • to make explicit the semantics of content text and
    complex data, and
  • to augment an interpretation in a modular
    fashion.
  • Document text: human comprehensible
  • Semantic mark-up: machine processable

9
Details (How?)
10
Nature of Specs
  • Semi-structured
  • Heterogeneous: text, tables, images
  • Constrained technical vocabulary
  • Available as an MS Word document

11
Pre-processing Spec
  • Abstract content from spec document by removing
    display oriented information
  • Save text
  • Save tabular data, preserving grid layout
  • Retain links to images
  • Note: the "Save As text" option in MS Word is inadequate

12
Heterogeneous Document
13
XML generated by Majix
14
ASCII Output
15
Annotating Pre-processed Spec
  • Embedding machine-processable semantics
  • Recognizing and tagging text using the controlled
    vocabulary
  • By-product of document indexing and semantic
    search
  • Tagging tabular data to make its semantics
    explicit: same grid layout, but different
    interpretation and dependencies based on
    headings
  • Explore the XML-based programming language Water for
    defining data and its behavior (semantics)
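
The text-tagging step can be approximated with a simple phrase matcher. This is a minimal sketch, not the tool used in the talk: the `VOCABULARY` mapping, the `tag_terms` function, and the `<term name="...">` element are hypothetical; only the canonical names `thickness`, `strength.tensile`, and `strength.yield` come from the tagged-table slide.

```python
import re

# Hypothetical controlled vocabulary: surface phrase -> canonical term.
VOCABULARY = {
    "tensile strength": "strength.tensile",
    "yield strength": "strength.yield",
    "thickness": "thickness",
}


def tag_terms(text):
    """Wrap each vocabulary phrase in a <term> element, keeping the original text."""
    # Try longer phrases first so multi-word terms win over shorter ones.
    phrases = sorted(VOCABULARY, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match):
        name = VOCABULARY[match.group(1).lower()]
        return f'<term name="{name}">{match.group(1)}</term>'

    return pattern.sub(wrap, text)


tagged = tag_terms("Yield strength depends on thickness.")
```

The human-readable sentence survives unchanged inside the mark-up, so the annotated text can still serve as the archived copy.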

16
Locating Controlled Vocabulary Terms
17
Example Table
18
Example of Tagged Table
  • Thickness (mm)   Tensile Strength (ksi)   Yield Strength (ksi)
  • table.<setHeading thickness strength.tensile strength.yield/>
  • 0.50 and under   165   155
  • table.<addRow 0 0.50 165 155/>
  • 0.50 - 1.00   160   150
  • table.<addRow 0.50 1.00 160 150/>
  • 1.00 - 1.50   155   145
  • table.<addRow 1.00 1.50 155 145/> ...

19
Example of Processing Code
  • <defclass table rows=required=vector heading=optional=vector>
  • <defmethod setHeading t=required ts=required ys=required>
  • <set heading=<vector t ts ys/>/>
  • </>
  • <defmethod addRow smin smax ts ys>
  • <set rows=
  • table.rows.<insert <vector smin smax ts ys/>/>/>
  • </>
  • <defmethod computeYieldStrength> </>
  • <defmethod computeTensileStrength> </>
  • </>

20
(contd)
  • <defclass table rows=required=vector heading=optional=vector>
  • <defmethod computeTensileStrength>
  • <set temp=fluid.Thickness/>
  • <set i=0/>
  • <do>
  • <until <and temp.<less table.rows.<get i/>.1/>
  • temp.<more_or_equal table.rows.<get i/>.0/> /> >
  • table.rows.<get i/>.2
  • </until>
  • <set i=i.<plus 1/>/>
  • </do>
  • </>
  • </>

21
(contd)
  • <defclass table rows=required=vector heading=optional=vector>
  • </>
  • fluid.<set Thickness=0.60>
  • <try
  • <set TensileStrength=table.<computeTensileStrength/>/>
  • TensileStrength
  • >
  • "TABLE out of range error occurred"
  • </try>
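
For readers unfamiliar with Water, the same table behavior can be sketched in Python. This is an illustrative analogue, not a translation endorsed by the talk: the class name `StrengthTable` and the helper `_lookup` are assumptions, and thickness is passed as an argument rather than through a fluid variable. The half-open test (lower bound inclusive, upper bound exclusive) mirrors the loop on the slides; the row label "0.50 and under" suggests the spec itself may intend an inclusive upper bound.

```python
class StrengthTable:
    """Rows of (min_thickness, max_thickness, tensile_ksi, yield_ksi)."""

    def __init__(self):
        self.heading = None
        self.rows = []

    def set_heading(self, t, ts, ys):
        self.heading = (t, ts, ys)

    def add_row(self, smin, smax, ts, ys):
        self.rows.append((smin, smax, ts, ys))

    def _lookup(self, thickness, column):
        # Same test as the Water loop: smin <= thickness < smax.
        for row in self.rows:
            if row[0] <= thickness < row[1]:
                return row[column]
        raise LookupError("TABLE out of range error occurred")

    def compute_tensile_strength(self, thickness):
        return self._lookup(thickness, 2)

    def compute_yield_strength(self, thickness):
        return self._lookup(thickness, 3)


# Rows taken from the tagged-table slide.
table = StrengthTable()
table.set_heading("thickness", "strength.tensile", "strength.yield")
table.add_row(0, 0.50, 165, 155)
table.add_row(0.50, 1.00, 160, 150)
table.add_row(1.00, 1.50, 155, 145)

tensile = table.compute_tensile_strength(0.60)
```

A thickness outside every row raises `LookupError`, playing the role of the error branch in the Water `try` form.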

22
Water
  • XML-based OO Scripting Language
  • Facilitates creating Web Services
  • Run methods remotely via web-browser
  • Generalizes dynamic typing to constraint checking
  • Conformance of actuals to formals

23
Pros and cons
  • Encoding Improvement
  • Amount of tagging can be controlled by suitably
    delimiting table data and annotating it with
    corresponding string-processing method
  • Master Copy Update
  • Changes to the spec require manual modification of
    the archived annotated version.
  • Irregular Tables in Specs
  • Different units, etc.
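
The "encoding improvement" idea, delimiting the raw table text and attaching one string-processing method to it instead of tagging every row, might look like the following sketch. The function name `parse_strength_rows` and the assumption that a row is either "X and under" or "X - Y" followed by two numbers are mine; irregular tables of the kind the slide warns about would need more cases.

```python
def parse_strength_rows(block):
    """Turn rows like '0.50 - 1.00 160 150' into (smin, smax, ts, ys) tuples."""
    rows = []
    for line in block.strip().splitlines():
        fields = line.split()
        # The last two fields are tensile and yield strength.
        ts, ys = float(fields[-2]), float(fields[-1])
        bounds = fields[:-2]
        if bounds[-2:] == ["and", "under"]:   # e.g. "0.50 and under"
            smin, smax = 0.0, float(bounds[0])
        else:                                 # e.g. "0.50 - 1.00"
            smin, smax = float(bounds[0]), float(bounds[2])
        rows.append((smin, smax, ts, ys))
    return rows


# The untagged table text, delimited as a single block.
BLOCK = """
0.50 and under 165 155
0.50 - 1.00 160 150
1.00 - 1.50 155 145
"""

rows = parse_strength_rows(BLOCK)
```

Only the block as a whole carries an annotation, so the spec text stays close to its original form.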

24
Some Related Work
  • Microsoft Smart Tags
  • Recognize controlled words in Office 2003
    documents and associate predefined list of
    actions with each occurrence
  • SHOE
  • Table data in a declarative (logic) language

25
Prolog rendition
  • strengthTableRow( 0, 0.50, 165, 155).
  • strengthTableRow(0.50, 1.00, 160, 150).
  • strengthTableRow(1.00, 1.50, 155, 145).
  • ...
  • strengthTable(Thickness, TensileStrength, YieldStrength) :-
  •     strengthTableRow(L, U, TensileStrength, YieldStrength),
  •     L =< Thickness, U > Thickness.
  • thicknessToTensileStrength(Thickness, TensileStrength) :-
  •     strengthTable(Thickness, TensileStrength, _).
  • thicknessToYieldStrength(Thickness, YieldStrength) :-
  •     strengthTable(Thickness, _, YieldStrength).
  • ?- thicknessToYieldStrength(0.6,YS).

26
Conclusions
27
A Step towards Holy Grail
  • Ultimately enable authoring and/or extracting the
    human-comprehensible and machine-processable
    parts of a document hand in hand, and keep them
    side by side.