Representing Multilingual and Annotated Text in Memory and in a Relational Database

1
Representing Multilingual and Annotated Text in
Memory and in a Relational Database
  • John Thomson
  • SIL International

2
What's hard about text?
  • It has complex structure.
  • Characters, words, phrases, sentences,
    paragraphs, sections.
  • It has many kinds of related information.
  • Formatting, annotations, language.
  • Length is highly variable.

3
Relational DBMS issues
  • Normalization.
  • Each entity represented only once.
  • All relationships explicit.
  • Many-many relationships require a separate table.
  • Sequence is difficult to handle.
  • RDBMS is built on unordered sets.
  • Text is (usually) an exception.
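
The sequence problem can be illustrated with a minimal sketch. It assumes a hypothetical `word` table with an explicit `pos` column; the table, column, and function names are invented for illustration and are not taken from FieldWorks. Because rows form an unordered set, order has to be stored as data, and an insertion in the middle renumbers every later row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per word, with order stored explicitly in pos.
conn.execute("CREATE TABLE word (para_id INTEGER, pos INTEGER, form TEXT)")
# Rows are inserted out of order on purpose: physical row order means nothing.
conn.executemany(
    "INSERT INTO word VALUES (?, ?, ?)",
    [(1, 2, "ants"), (1, 0, "Everyone"), (1, 1, "ate")],
)


def read_paragraph(conn, para_id):
    """Rebuild the text; without ORDER BY the word order is not guaranteed."""
    cur = conn.execute(
        "SELECT form FROM word WHERE para_id = ? ORDER BY pos", (para_id,)
    )
    return " ".join(row[0] for row in cur)


def insert_word(conn, para_id, pos, form):
    """Insert a word at pos; every later word must be renumbered first."""
    conn.execute(
        "UPDATE word SET pos = pos + 1 WHERE para_id = ? AND pos >= ?",
        (para_id, pos),
    )
    conn.execute("INSERT INTO word VALUES (?, ?, ?)", (para_id, pos, form))
```

A single-word insertion therefore touches as many rows as there are words after the insertion point, which is one reason sequence is awkward in a purely relational design.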

4
What is an entity?
  • How much text corresponds to a record or object?
  • Character?
  • Word?
  • Paragraph?
  • Document?
  • The answer we would like varies for different
    tasks.

5
Examples.
  • Layout wants to treat a paragraph as a unit.
  • Language and formatting may change within a
    paragraph.
  • Lexical annotation works on words or morphemes.
  • Grammatical annotation represents hierarchy.
  • Topical annotations may overlap.

6
Characters as entities
  • Pure relational semantics, but
  • Too expensive.
  • Can't use standard tools.
  • Sequence and editing get messy.

7
Documents as entities
  • The database engine doesn't help with sub-document
    relationships.
  • May be inconveniently large.

8
Runs as entities
  • Definition: a sequence of characters sharing the same
    properties.
  • Same for ALL properties.

9
Runs (advantages)
  • Complete information (like the character-level
    scheme).
  • Usually much less overhead.
  • Usable. We did this in our previous product.

10
Runs (disadvantages)
  • Still quite costly for many annotations or
    frequent changes.
  • Editing is difficult.
  • Many operations create and destroy (split and
    combine) runs.
  • Hard to hide this from users.
  • Simple-looking selections have complex entity
    structure.

11
Run complexity
Everyone ate ants at the wedding feast. No one
got sick.
  • If the red text is selected, and paragraphs and
    runs are both entities, we have selected...
  • Half of one run ("ts").
  • One whole run ("at the wedding feast").
  • Together making part of one paragraph.
  • Plus a whole run and part of a run which make
    part of another paragraph.

12
Deleting
Everyone ate ants at the wedding feast. No one
got sick.
  • Now if we delete this selection:
  • Links to the "at the wedding feast" and "No" runs
    are (probably) lost.
  • A paragraph object is gone (which one?)
  • What do we do with links to the deleted
    paragraph?

13
Paragraphs as entities
  • In FieldWorks texts, paragraphs are the smallest
    entities.
  • Run structure is represented as a binary field.
  • A one-paragraph or smaller text occupies two
    fields (X and X_fmt).
  • Larger texts are StText entities, made up of
    paragraph entities.
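
A minimal sketch of the two-field idea follows. The slides do not describe the actual binary layout of the format field, so the layout here (a run count, then per run a character length and a JSON property blob) is an invented assumption, as are the names `encode_paragraph` and `decode_paragraph`.

```python
import json
import struct


def encode_paragraph(runs):
    """Turn (text, props) runs into a plain text field and a binary format field."""
    text = "".join(run_text for run_text, _ in runs)
    parts = [struct.pack("<I", len(runs))]  # assumed header: number of runs
    for run_text, props in runs:
        blob = json.dumps(props, sort_keys=True).encode("utf-8")
        # Assumed record: run length in characters, then property blob size.
        parts.append(struct.pack("<II", len(run_text), len(blob)))
        parts.append(blob)
    return text, b"".join(parts)


def decode_paragraph(text, fmt):
    """Rebuild the runs from the two fields produced by encode_paragraph."""
    (count,) = struct.unpack_from("<I", fmt, 0)
    pos, char_pos, runs = 4, 0, []
    for _ in range(count):
        run_len, blob_len = struct.unpack_from("<II", fmt, pos)
        pos += 8
        props = json.loads(fmt[pos : pos + blob_len].decode("utf-8"))
        pos += blob_len
        runs.append((text[char_pos : char_pos + run_len], props))
        char_pos += run_len
    return runs
```

The text field stays searchable as ordinary text, while the run structure travels alongside it as an opaque value that only the editing code needs to understand.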

14
In memory
  • Each paragraph is an object.
  • The object stores the annotations with the text.
  • Custom editing tools maintain the relationships.

15
Advantages (1)
  • No records or joiner tables for runs.
  • A simple text need not occupy even one record
    (just two fields).
  • Fewer joins.
  • No entities to create or destroy for
    sub-paragraph editing.
  • Only one level of entities to deal with for
    multi-paragraph editing.

16
Advantages (2)
  • Efficient loading into memory.
  • Two fields rather than iterating through a
    RowSet.
  • Pattern match scope is natural (paragraph).
  • Can even match targets that cross run boundaries.

17
Disadvantages
  • Run links not explicit in database.
  • Can't have the DB build indexes to find runs with
    particular properties.
  • Can't base queries on run properties.

18
Overcoming problems
  • We can create redundant tables linking
    annotations to paragraphs.
  • These can be updated when we record a new version
    of a paragraph.
  • We need only do this where indexing or querying
    is needed.
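
One way such a redundant table might look is sketched below. All table, column, and function names are assumptions for illustration, and the format field is stored as JSON only to keep the example short. The index rows for a paragraph are rebuilt in the same transaction that records the new version of the paragraph.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE paragraph (
        id INTEGER PRIMARY KEY, contents TEXT, contents_fmt BLOB
    );
    -- Redundant table: derived entirely from the paragraph's run properties.
    CREATE TABLE paragraph_property (para_id INTEGER, name TEXT, value TEXT);
    CREATE INDEX paragraph_property_idx ON paragraph_property (name, value);
    """
)


def save_paragraph(conn, para_id, runs):
    """Store the paragraph and refresh its rows in the redundant table."""
    text = "".join(run_text for run_text, _ in runs)
    fmt = json.dumps([[len(t), props] for t, props in runs]).encode("utf-8")
    pairs = {(n, str(v)) for _, props in runs for n, v in props.items()}
    with conn:  # commit both tables together, or neither
        conn.execute(
            "INSERT OR REPLACE INTO paragraph VALUES (?, ?, ?)",
            (para_id, text, fmt),
        )
        conn.execute(
            "DELETE FROM paragraph_property WHERE para_id = ?", (para_id,)
        )
        conn.executemany(
            "INSERT INTO paragraph_property VALUES (?, ?, ?)",
            [(para_id, n, v) for n, v in sorted(pairs)],
        )


def paragraphs_with(conn, name, value):
    """Find paragraphs containing at least one run with the given property."""
    cur = conn.execute(
        "SELECT DISTINCT para_id FROM paragraph_property "
        "WHERE name = ? AND value = ? ORDER BY para_id",
        (name, str(value)),
    )
    return [row[0] for row in cur]
```

The query narrows the search to candidate paragraphs; locating the exact run still means decoding that paragraph's format field.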

19
Demonstration
  • Data notebook is a tool for managing
    anthropological notes.
  • (A linguistic tool soon, we hope.)
  • Views.
  • Annotations.
  • Editing.
  • Committed to open source.

20
Questions?
  • John_thomson_at_sil.org
  • www.sil.org
  • http://fieldworks.sil.org/