Formalizing and Querying Heterogeneous Documents with Tables - PowerPoint PPT Presentation

About This Presentation
Title:

Formalizing and Querying Heterogeneous Documents with Tables

Description:

Formalizing and Querying Heterogeneous Documents with Tables Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering – PowerPoint PPT presentation

Slides: 29
Provided by: Krishn54
Learn more at: http://cecs.wright.edu

Transcript and Presenter's Notes


1
Formalizing and Querying Heterogeneous Documents
with Tables
  • Krishnaprasad Thirunarayan and Trivikram Immaneni
  • Department of Computer Science and Engineering
  • Wright State University
  • Dayton, OH-45435

2
Overall R&D Agenda
  • Develop semi-automatic techniques for
    information extraction/retrieval to enable man
    and machine to complement each other in
    assimilation of semi-structured, heterogeneous
    documents
  • => Semantic Web Technologies.

3
  • Goal (What?)
  • Background and Motivation (Why?)
  • Implementation Details (How?)
  • Evaluation and Applications (Why?)
  • Conclusions

4
Goal
5
  • Define, embed, and use metadata in
    semi-structured documents containing tables.
  • Content-oriented/domain-specific annotation of
    human-sensible documents.
  • Makes explicit semantics of complex data
  • Enables augmentation of an interpretation in a
    modular fashion.

6
Heterogeneous Document
7
Background and Motivation
8
  • Generate an XML Master Document that is machine
    processable and can also serve as the basis for
    human-sensible presentation.
  • Basis of semi-automation in practice.

9
  • Embedding metadata improves traceability, thereby
    facilitating:
  • Content Extraction
  • Verification
  • Update

10
Implementation Details (How?)
11
XML Technology
  • Document-Centric View: XML is used to annotate
    documents for use by humans in the realm of
    document processing and content extraction.
  • Data-Centric View: XML is used as a text-based
    format for information exchange / serialization
    in the context of Web Services.

12
Basic idea behind our approach
  • Unify the two views by using XML-elements to
    materialize abstract syntax, and together with
    XML attributes and XML element definitions,
    formalize the content.
  • Key advantage: Minimizes maintenance of
    additional data structures to relate the original
    document with its formalization.

13
Two Concrete Implementations
  • Use the Web Services language Water, which
    amalgamates XML technology with programming
    language concepts.
  • Use the XML/XSLT infrastructure.

14
Water-based approach
  • Each annotation reflects the semantics of the
    text fragment it encloses.
  • The annotated data can be interpreted by viewing
    it as a function/procedure call in Water. The
    correspondence between formal parameter and
    actual argument is position-based.
  • The semantics of annotation is defined in Water
    as a method definition in a class, separately.

15
Example Table
Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    165                       155
0.50 - 1.00       160                       150
1.00 - 1.50       155                       145
16
Example of Tagged Table
  • Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi)
  • table.<setHeading thickness strength.tensile strength.yield/>
  • 0.50 and under 165 155
  • table.<addRow 0 0.50 165 155/>
  • 0.50 - 1.00 160 150
  • table.<addRow 0.50 1.00 160 150/>
  • 1.00 - 1.50 155 145
  • table.<addRow 1.00 1.50 155 145/> ...

17
Example of Processing Code
  • <defclass table rows=required=vector heading=optional=vector>
  •   <defmethod setHeading t=required ts=required ys=required>
  •     <set heading=<vector t ts ys/>/>
  •   </>
  •   <defmethod addRow smin smax ts ys>
  •     <set rows=
  •       table.rows.<insert <vector smin smax ts ys/>/>/>
  •   </>
  •   <defmethod computeYieldStrength> </>
  •   <defmethod computeTensileStrength> </>
  • </>
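
For readers unfamiliar with Water, the class on this slide can be approximated in Python. This is a rough analogue only, not the authors' code: the names `Table`, `set_heading`, and `add_row` are hypothetical stand-ins for the Water definitions, and the two `compute...` methods are left out because the slide shows no bodies for them. The sketch shows how each annotation becomes a method call whose arguments bind by position.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# One row: (min thickness, max thickness, tensile strength, yield strength).
Row = Tuple[float, float, float, float]


@dataclass
class Table:
    """Hypothetical Python stand-in for the Water `table` class."""
    rows: List[Row] = field(default_factory=list)
    heading: Optional[Tuple[str, str, str]] = None

    def set_heading(self, t: str, ts: str, ys: str) -> None:
        # Mirrors setHeading: three required arguments, bound by position.
        self.heading = (t, ts, ys)

    def add_row(self, smin: float, smax: float, ts: float, ys: float) -> None:
        # Mirrors addRow: append one row vector to the table.
        self.rows.append((smin, smax, ts, ys))


# Each annotation in the tagged table corresponds to one call.
table = Table()
table.set_heading("thickness", "strength.tensile", "strength.yield")
table.add_row(0, 0.50, 165, 155)
table.add_row(0.50, 1.00, 160, 150)
table.add_row(1.00, 1.50, 155, 145)
```

Because binding is positional, swapping two arguments in an annotation silently changes its meaning, which is one motivation for the name-based variant that follows.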

18
XML/XSLT-based approach
  • Each annotation reflects the semantics of the
    text fragment it encloses.
  • To make the annotated data XML-compliant, dummy
    attributes such as one, two, three, etc. are
    introduced. The correspondence between formal
    attribute and the actual value is name-based.
  • The semantics is defined modularly by
    interpreting XML elements and their XML attributes
    via XSLT, separately.

19
Example of Tagged Table
  • <table type="Tensile">
  •   <dependency name="Yield Offset" value="0.2"/>
  •   <tableSchema one="Thickness(min)" two="Thickness(max)"
        three="Tensile Strength" four="Yield Strength"/>
  •   <tableUnits one="in" two="in" three="ksi" four="ksi"/>
  •   <tableData one="0" two="0.50" three="165" four="155"/>
  •   <tableData one="0.50" two="1.00" three="160" four="150"/>
  •   ...
  • </table>
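
The name-based correspondence can be illustrated by reading the tagged table with Python's standard `xml.etree.ElementTree` module, used here in place of XSLT purely for illustration. The helper `read_table` and the choice to convert every data value to `float` are assumptions, and only the two data rows shown on the slide are included.

```python
import xml.etree.ElementTree as ET

# The tagged table from the slide, limited to the rows it actually shows.
TAGGED_TABLE = """
<table type="Tensile">
  <dependency name="Yield Offset" value="0.2"/>
  <tableSchema one="Thickness(min)" two="Thickness(max)"
               three="Tensile Strength" four="Yield Strength"/>
  <tableUnits one="in" two="in" three="ksi" four="ksi"/>
  <tableData one="0" two="0.50" three="165" four="155"/>
  <tableData one="0.50" two="1.00" three="160" four="150"/>
</table>
"""


def read_table(xml_text):
    """Return (schema, units, rows) with the dummy attribute names resolved."""
    root = ET.fromstring(xml_text.strip())
    # Maps each dummy attribute to its column name, e.g. "one" -> "Thickness(min)".
    schema = dict(root.find("tableSchema").attrib)
    # Units keyed by column name rather than by dummy attribute.
    units = {schema[k]: v for k, v in root.find("tableUnits").attrib.items()}
    # Each data row keyed by column name; values are assumed to be numeric.
    rows = [
        {schema[k]: float(v) for k, v in data.attrib.items()}
        for data in root.findall("tableData")
    ]
    return schema, units, rows


schema, units, rows = read_table(TAGGED_TABLE)
print(rows[0]["Yield Strength"], units["Yield Strength"])
```

Since values are matched to columns by attribute name, the attributes of a `tableData` element could appear in any order without changing the result.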

20
XSLT Stylesheets can be used to
  • Query: to perform table look-ups.
  • Transform: to change units of measure such as
    from standard SI units to FPS units and vice
    versa.
  • Format: to display the table in HTML form.
  • Extract: to recover the original table.
  • Verify: to check static semantic constraints on
    table data values.
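
The slides do not show the stylesheets themselves, so rather than guess at the XSLT, three of these operations are sketched below in Python over an in-memory copy of the example table. The dictionary keys, the function names, and the two constraints checked by `verify` are all assumptions made for illustration. The only external fact used is that one inch equals 25.4 mm.

```python
from html import escape

# Hypothetical in-memory form of the example table; key names are illustrative.
ROWS = [
    {"tmin": 0.0, "tmax": 0.50, "tensile": 165, "yield": 155},
    {"tmin": 0.50, "tmax": 1.00, "tensile": 160, "yield": 150},
    {"tmin": 1.00, "tmax": 1.50, "tensile": 155, "yield": 145},
]

MM_PER_INCH = 25.4


def transform_to_mm(rows):
    """Transform: convert the thickness bounds from inches to millimetres."""
    return [
        {**r, "tmin": r["tmin"] * MM_PER_INCH, "tmax": r["tmax"] * MM_PER_INCH}
        for r in rows
    ]


def format_html(rows, headings):
    """Format: render the rows as a plain HTML table."""
    keys = ("tmin", "tmax", "tensile", "yield")
    head = "".join(f"<th>{escape(h)}</th>" for h in headings)
    body = "".join(
        "<tr>" + "".join(f"<td>{r[k]}</td>" for k in keys) + "</tr>"
        for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def verify(rows):
    """Verify: return the violated constraints (an empty list if none)."""
    problems = []
    for i, r in enumerate(rows):
        # Assumed constraint: each thickness range is non-empty.
        if r["tmin"] >= r["tmax"]:
            problems.append(f"row {i}: min thickness is not below max thickness")
    for i in range(1, len(rows)):
        # Assumed constraint: consecutive ranges meet without gaps or overlap.
        if rows[i - 1]["tmax"] != rows[i]["tmin"]:
            problems.append(f"rows {i - 1} and {i}: ranges are not contiguous")
    return problems
```

A call such as `verify(transform_to_mm(ROWS))` checks the converted table with the same rules, which reflects the point that each operation is defined once and applied to any table of the same type.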

21
Evaluation and Applications (Why?)
22
Advantage
  • Only tabular data in each document is annotated.
    The annotation definition is factored out as
    background knowledge.
  • Thus, the semantics of each table type is
    specified just once outside the document and is
    reused with different documents containing
    similar tables.

23
Disadvantage
  • Both avenues require mature tool support for
    widespread adoption.
  • For example, develop an MS FrontPage-like interface
    in which the Master document is the annotated form
    and the user explicitly interacts with and edits
    only a view of the annotated document, for
    readability, with support for export to a
    well-formed XML document.

24
Prolog rendition
  • strengthTableRow(0, 0.50, 165, 155).
  • strengthTableRow(0.50, 1.00, 160, 150).
  • strengthTableRow(1.00, 1.50, 155, 145).
  • ...
  • strengthTable(Thickness, TensileStrength, YieldStrength) :-
  •     strengthTableRow(L, U, TensileStrength, YieldStrength),
  •     L < Thickness, U > Thickness.
  • thicknessToTensileStrength(Thickness, TensileStrength) :-
  •     strengthTable(Thickness, TensileStrength, _).
  • thicknessToYieldStrength(Thickness, YieldStrength) :-
  •     strengthTable(Thickness, _, YieldStrength).
  • ?- thicknessToYieldStrength(0.6,YS).
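
To trace how the query is answered: Prolog tries each `strengthTableRow` fact in turn until one has bounds that enclose the given thickness. The Python sketch below mirrors that search with a generator. The function names are hypothetical, only the three rows shown on the slide are included, and the comparisons are taken as strict, as they read on the slide. Whether boundary values such as 0.50 should match a row is not settled there.

```python
from typing import Iterator, Optional, Tuple

# Facts: (lower bound, upper bound, tensile strength, yield strength).
STRENGTH_TABLE_ROWS = [
    (0.0, 0.50, 165, 155),
    (0.50, 1.00, 160, 150),
    (1.00, 1.50, 155, 145),
]


def strength_table(thickness: float) -> Iterator[Tuple[int, int]]:
    """Yield (tensile, yield) for every row whose bounds enclose thickness."""
    for lower, upper, tensile, yield_strength in STRENGTH_TABLE_ROWS:
        # Strict on both sides, following the slide as written.
        if lower < thickness < upper:
            yield tensile, yield_strength


def thickness_to_yield_strength(thickness: float) -> Optional[int]:
    """Return the first matching yield strength, or None if no row matches."""
    for _, yield_strength in strength_table(thickness):
        return yield_strength
    return None


print(thickness_to_yield_strength(0.6))  # prints 150
```

A thickness of 0.6 falls in the second row, so the query binds `YS` to 150. A thickness outside every range yields no answer, which corresponds to the Prolog query failing.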

25
Conclusion and Future Work
26
  • Develop a catalog of predefined tables,
    specifying them using Semantic Web formalisms
    (such as RDF and OWL) and mapping the tabular
    data into a set of predefined tables, possibly
    qualified.
  • Develop techniques for manual mapping of complex
    tables into simpler ones
  • To provide semantics to data.
  • To improve traceability.
  • To facilitate automatic manipulation.

27
  • Tailor and improve IE and IR techniques developed
    in the context of text processing for Semantic
    Web documents (such as XML and RDF), benefiting
    from the additional support of ontologies (such
    as those written in OWL).

28
Holy Grail
  • Ultimately, develop principles, techniques, and
    tools to author and extract the human-readable and
    machine-comprehensible parts of a document hand
    in hand, and keep them side by side.