Title: Flexible Interfaces in the Application of Language Technology to an eScience Corpus
1Flexible Interfaces in the Application of
Language Technology to an eScience Corpus
C.J. Rupp, Ann Copestake, Simone Teufel, Benjamin Waldron
Computer Laboratory, University of Cambridge
2Outline
- Two key interfaces
- SciXML: XML markup for the logical structure of research papers
- SAF: Standoff Annotation Formalism for diverse linguistic information
- Both
- coded in XML and designed for flexibility,
- But
- what that means is distinct in the two cases.
3SciBorg Architecture
[Architecture diagram. Labels: OSCAR, RASP, ERG/PET, POS tagging, WSD, anaphora, rhetorical analysis, RMRS merge, tasks, standoff annotation]
4Sciborg Corpus
- A corpus of Chemistry research papers from 3 publishers
- The Royal Society of Chemistry (RSC),
- The Nature Publishing Group (NPG), and
- The International Union of Crystallography (IUCr).
- Provided in publishers' XML markup, but with distinct markup schemes.
5Conversion to SciXML
[Diagram. Inputs: RSC papers, Nature papers, IUCr papers, PLOS Biology papers, Biology and CL papers (PDF). Output: SciXML]
6SciXML Interface Requirements
- Extensible
- So we can add additional publications
- Neutral
- So as not to compromise any IP issues
- Compatible with existing software
- Expressive enough
- For adequate rendering in applications
7Rendering Issues
- We assume an application will display the paper
- Probably in Hypertext
- We must retain enough information to do this effectively
- Previous versions of SciXML have focused on the logical structure of scientific papers.
8The Development of SciXML
- Developed for a medical corpus (2000)
- Extracted from HTML web pages
- Extended for a Computational Linguistics corpus
- First from LaTeX
- Then from PDF via OCR
- Now defined as a RELAX NG schema
9Legacy Issues
- The original SciXML schema had to interpret formatting.
- Lacking any organisation by function
- Dictating a flat paragraph structure
- Collecting all floats and notes in end lists
- But excluding text formatting
10Adapted from Publishers Markup
- List and Table formats
- Inline text formatting
- Functional paragraph types (e.g. Theorem)
- Position markers for floats
11Conversion by XSLT
- Most constructs can be handled quite simply
    <xsl:template match="sec">
      <DIV DEPTH="{@level}">
        <xsl:apply-templates/>
      </DIV>
    </xsl:template>
- Making the script virtually a stylesheet
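The template above can be mirrored outside XSLT. The following is a minimal Python sketch, assuming a publisher element `sec` with a `level` attribute as in the template. The function name `convert` and the rule that all other elements are copied unchanged are assumptions made for illustration, not the project's actual stylesheet.

```python
import xml.etree.ElementTree as ET

def convert(node):
    """Copy a publisher XML tree, renaming <sec level="n"> to <DIV DEPTH="n">."""
    if node.tag == "sec":
        out = ET.Element("DIV", {"DEPTH": node.get("level", "")})
    else:
        # Assumption: elements without a dedicated rule are copied as they are.
        out = ET.Element(node.tag, dict(node.attrib))
    out.text, out.tail = node.text, node.tail
    for child in node:
        out.append(convert(child))  # plays the role of xsl:apply-templates
    return out

source = ET.fromstring(
    '<body><sec level="1"><p>Intro</p><sec level="2"><p>Detail</p></sec></sec></body>'
)
result = ET.tostring(convert(source), encoding="unicode")
print(result)
```

Each further construct would add one more branch, which is why a conversion of this kind stays close to a plain list of rules.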
12Schema Development
- Both the XSLT stylesheet and RNG Schema have been developed on a naïve basis.
- Coding conversion for constructs that occur in the corpus
- Eventually we have a big enough bag of tricks to make extension quite painless.
13SciXML Constructs
- Paper Identifiers
- Unique identifiers, titles and authors
- Sections
- Divisions embed recursively with headers
- Inline text markup
- Font settings and LaTeX inclusion
- Paragraph structure
- Paragraph elements and sub paragraph boundaries
in lists, abstracts, captions, etc.
14SciXML Constructs
- Citations and Cross References
- Citations are significant, but we also need textual cross references, compound references, footnote markers, float markers.
- Equations and examples
- (Linguistic) examples and equation environments
- Lists, tables and figures
- Lists, including definition lists, tables, figures, and various other sections for (external) data.
- Bibliography
- The bibliography section is important for
citation tracking
15RNG Schema (Fragment)
    <define name="PAPER.ELEMENT">
      <element name="PAPER">
        <ref name="METADATA.ELEMENT" />
        <optional><ref name="PAGE.ELEMENT" /></optional>
        <ref name="TITLE.ELEMENT" />
        <optional><ref name="AUTHORLIST.ELEMENT" /></optional>
        <optional><ref name="ABSTRACT.ELEMENT" /></optional>
        <element name="BODY">
          <zeroOrMore>
            <ref name="DIV.ELEMENT" />
          </zeroOrMore>
        </element>
        <optional>
          <element name="ACKNOWLEDGMENTS">
            <zeroOrMore>
              <choice>
                <ref name="REF.ELEMENT" />
                <ref name="INLINE.ELEMENT" />
              </choice>
            </zeroOrMore>
          </element>
        </optional>
        <optional>
          <ref name="REFERENCELIST.ELEMENT" />
        </optional>
        <optional>
          <ref name="AUTHORNOTELIST.ELEMENT" />
        </optional>
        <optional>
          <ref name="FOOTNOTELIST.ELEMENT" />
        </optional>
        <optional>
          <ref name="FIGURELIST.ELEMENT" />
        </optional>
        <optional>
          <ref name="TABLELIST.ELEMENT" />
        </optional>
      </element>
    </define>
    <define name="REFERENCELIST.ELEMENT">
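The content model of PAPER in the fragment above is an ordered sequence with optional members, so its top level can be checked with a regular expression over the child element names. This is only a sketch: it assumes each `X.ELEMENT` definition declares an element called `X` (only PAPER, BODY and ACKNOWLEDGMENTS are actually named in the fragment), it treats ACKNOWLEDGMENTS as optional, and the function name `has_valid_top_level` is hypothetical. A real check would use a RELAX NG validator.

```python
import re
import xml.etree.ElementTree as ET

# Child order taken from the PAPER fragment; "?" marks the optional members.
# Assumption: METADATA.ELEMENT defines <METADATA>, TITLE.ELEMENT <TITLE>, etc.
PAPER_CONTENT = re.compile(
    r"METADATA (PAGE )?TITLE (AUTHORLIST )?(ABSTRACT )?BODY (ACKNOWLEDGMENTS )?"
    r"(REFERENCELIST )?(AUTHORNOTELIST )?(FOOTNOTELIST )?(FIGURELIST )?(TABLELIST )?"
)

def has_valid_top_level(paper_xml):
    """Check only the order of PAPER's direct children, not their content."""
    root = ET.fromstring(paper_xml)
    children = "".join(child.tag + " " for child in root)
    return root.tag == "PAPER" and PAPER_CONTENT.fullmatch(children) is not None

ok = has_valid_top_level(
    "<PAPER><METADATA/><TITLE>T</TITLE><BODY/><FIGURELIST/></PAPER>"
)
bad = has_valid_top_level("<PAPER><TITLE>T</TITLE><METADATA/><BODY/></PAPER>")
print(ok, bad)
```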
16Language Technology in Sciborg
- The goal is Information Extraction from Chemistry research papers.
- Various analysis components interfacing
- Different levels of analysis
- Different analysis methods
- Specialised and General analysers
- But a common semantic representation: RMRS (Robust Minimal Recursion Semantics)
- And a common interface structure: SAF
17Multiple Analysis Components
- PET/ERG: deep analysis using detailed (HPSG) grammars and lexicons
- RASP: robust shallow parsing with a statistically trained grammar
- Each strand has a tokeniser, tagger and parser
- OSCAR-3 analyses Chemistry terms and notation
18Getting the Text out of SciXML
- Only some spans of marked up text contain linguistic text.
- Using SciXML we can divide elements into
- Text (<P>), Markup (<IT>), Non-Text elements (<SUP>).
- The analysers process, ignore and skip these, respectively.
- We also use OSCAR-3 to detect data sections without significant text portions.
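The three-way division above can be sketched as a recursive walk over the SciXML tree. The reading used here is an interpretation: "process" means collect the text, "ignore" means drop the tag but keep its content, and "skip" means drop tag and content. The element sets, the `HEADER` element in the sample and the function name `linguistic_text` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEXT = {"P"}        # elements whose content is linguistic text
MARKUP = {"IT"}     # inline formatting: tag is ignored, content is kept
NON_TEXT = {"SUP"}  # content is skipped entirely

def linguistic_text(node, inside_text=False):
    """Collect the text an analyser would see below `node`."""
    if node.tag in NON_TEXT:
        return ""
    inside = inside_text or node.tag in TEXT
    parts = [node.text or ""] if inside else []
    for child in node:
        parts.append(linguistic_text(child, inside))
        if inside:
            # Text following a child belongs to the parent, even after a skip.
            parts.append(child.tail or "")
    return "".join(parts)

doc = ET.fromstring(
    "<DIV><HEADER>Results</HEADER>"
    "<P>The yield was <IT>calculated</IT> twice<SUP>3</SUP>.</P></DIV>"
)
text = linguistic_text(doc)
print(text)
```

Markup elements need no branch of their own in this sketch: anything inside a text element that is not listed as non-text is treated as transparent.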
19SciBorg Parsing Architecture
[Parsing architecture diagram. Labels: SciXML, Sentence splitter, OSCAR, Tokeniser for RASP, POS tagging, RASP parser, Tokeniser for ERG, PET parser, SAF Lattice]
20SAF Interface Requirements
- Support results from different analysis components.
- Allow the combination of complementary results
- But they will assign conflicting structures
- Ambiguity is common
- Analyses will form a graph or lattice (cf. chart parsing and word lattices)
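A lattice of this kind can be held as a set of edges between numbered nodes, with competing analyses appearing as alternative paths. The sketch below uses invented data: the node names and the two tokenisations of the formula are hypothetical and only illustrate how conflicting results coexist in one structure.

```python
from collections import defaultdict

# Hypothetical edges: (source node, target node, value).
# Two components disagree about how to tokenise the formula.
edges = [
    ("v0", "v1", "calculated"),
    ("v1", "v2", "for"),
    ("v2", "v5", "C11H18O3"),   # one token
    ("v2", "v3", "C11"),        # or three tokens
    ("v3", "v4", "H18"),
    ("v4", "v5", "O3"),
]

outgoing = defaultdict(list)
for source, target, value in edges:
    outgoing[source].append((target, value))

def paths(node, goal):
    """Enumerate every sequence of edge values leading from node to goal."""
    if node == goal:
        return [[]]
    return [[value] + rest
            for target, value in outgoing[node]
            for rest in paths(target, goal)]

all_paths = paths("v0", "v5")
for path in all_paths:
    print(path)
```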
21Motivating Standoff
- XML can only combine linguistic and formatting markup if they share the same tree structure
- calculated for C11H18O3
- <IT>calculated for</IT> C<SB>11</SB>H<SB>18</SB>O<SB>3</SB>
- <v>calculated</v> <pp>for <ne>C11H18O3</ne></pp>
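The conflict in this example can be made explicit by writing both markups as character spans over the plain string and testing for spans that overlap without nesting. The span tables below are read off the two markups by hand, and the helper name `crosses` is hypothetical.

```python
from itertools import product

TEXT = "calculated for C11H18O3"

# Character spans (start, end) implied by the two markups of the example.
formatting = {"IT": (0, 14), "SB-11": (16, 18), "SB-18": (19, 21), "SB-3": (22, 23)}
linguistic = {"v": (0, 10), "pp": (11, 23), "ne": (15, 23)}

def crosses(a, b):
    """True if the spans overlap without one containing the other."""
    (s1, e1), (s2, e2) = sorted([a, b])
    return s1 < s2 < e1 < e2

conflicts = [
    (f, l)
    for (f, fs), (l, ls) in product(formatting.items(), linguistic.items())
    if crosses(fs, ls)
]
print(conflicts)
```

A single crossing pair is enough to rule out one shared XML tree, since one of the two elements would have to close inside the other.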
22Standoff Annotation
- A common solution is to separate the flow of text from the annotations representing its analysis
- The connection is formed by indexing at some consistent common level
- SAF supports character offset indexing and XPoint indexing
23Character Offset Indexing
- Formatted text: Come here!
- Raw text: "<p>Come <i>here</i>!</p>"
- Unicode character points (each dot is a point, numbered 0 to 24 from left to right)
- .<.p.>.C.o.m.e. .<.i.>.h.e.r.e.<./.i.>.!.<./.p.>.
- Tokens
- <token from='3' to='7' value='Come'/>
- <token from='11' to='15' value='here'/>
- <token from='19' to='20' value='!'/>
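Character offsets of this kind can be computed directly from the raw string. The sketch below counts inter-character points from 0 over the raw text including its tags, so that `to` minus `from` equals the token length. The tokenisation rule (word characters or a single punctuation mark) and the function name are assumptions, not the behaviour of the project's tokenisers.

```python
import re

raw = "<p>Come <i>here</i>!</p>"

def tokens_with_offsets(markup):
    """Yield (from, to, value) for word and punctuation tokens outside tags."""
    pattern = re.compile(r"<[^>]*>|(\w+|[^\w\s<])")
    for match in pattern.finditer(markup):
        if match.group(1):  # group 1 is None when the match is a tag
            yield match.start(), match.end(), match.group(1)

tokens = list(tokens_with_offsets(raw))
for start, end, value in tokens:
    print(f"<token from='{start}' to='{end}' value='{value}'/>")
```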
24XPoint Indexing
Root (/)
  P (/1)
    text (/1/1): .C.o.m.e.
    I (/1/2)
      text (/1/2/1): .h.e.r.e.
    text (/1/3): .!.
25Index Conversion
- We currently use both character offset and XPoint indexing.
- The choice is influenced by the XML parser.
- This implies maintaining a conversion table for a (SciXML) file.
- /1/3/0 <-> 19
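A conversion table of this kind can be built in one pass over the raw markup. The sketch below is a simplification under stated assumptions: XPoints are written as node path plus `/offset` (as in the example line above), child positions count element and text nodes together, character offsets count from 0 over the raw string including tags, and the markup has no entities, comments or attributes containing `>`. The function name `conversion_table` is hypothetical.

```python
import re

def conversion_table(markup):
    """Map 'path/offset' XPoints inside text nodes to raw character offsets."""
    table = {}
    path = []        # child indices of the currently open elements
    counters = [0]   # number of children seen so far at each open level
    for match in re.finditer(r"<(/?)[^>]*>|[^<]+", markup):
        token = match.group(0)
        if not token.startswith("<"):          # a text node
            counters[-1] += 1
            node = "/" + "/".join(map(str, path + [counters[-1]]))
            for i in range(len(token) + 1):    # points before, between, after
                table[f"{node}/{i}"] = match.start() + i
        elif match.group(1):                   # closing tag
            path.pop()
            counters.pop()
        elif token.endswith("/>"):             # empty element: a child, no content
            counters[-1] += 1
        else:                                  # opening tag
            counters[-1] += 1
            path.append(counters[-1])
            counters.append(0)
    return table

table = conversion_table("<p>Come <i>here</i>!</p>")
reverse = {offset: xpoint for xpoint, offset in table.items()}
print(table["/1/3/0"], reverse[11])
```

Because tags always separate two text nodes, no two XPoints share a character offset here, so the table can be inverted without loss.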
26Standards for Standoff Annotation
- MAF: ISO standard for morphological annotation
- SMAF: an emergent standard extending this to sentences, e.g. for parser input
- SAF includes all annotations for a paper in one file
27Types of SAF Annotation
- Sentence segments
    <annot type='sentence' id='s133' from='42065' source='v4987' target='v5154' to='43039' value='calculated for C11H18O3.'/>
- Tokens
    <annot type='token' id='t5151' from='42988' to='43030' deps='s133' source='v5150' target='v5151' value='calculated'/>
    <annot type='token' id='t5152' from='43031' to='43034' deps='s133' source='v5151' target='v5152' value='for'/>
    <annot type='token' id='t5153' from='43035' to='43043' deps='s133' source='v5152' target='v5153' value='C11H18O3'/>
28Types of SAF Annotation
- Part of Speech (POS) Tags
    <annot type='pos' id='p5151' deps='t5151' source='v5150' target='v5151' value='VVN'/>
    <annot type='pos' id='p5152' deps='t5152' source='v5151' target='v5152' value='IF'/>
    <annot type='pos' id='p5153' deps='t5153' source='v5152' target='v5153' value='NP1'/>
- OSCAR (NER) mark up
    <annot from="/1/5/6/27/51/2/83.1" to="/1/5/6/27/51/2/88/1.1" type="oscar" id="o554">
      <slot name="type">compound</slot>
      <slot name="surface">C11H18O3</slot>
      <slot name="provenance">formulaRegex</slot>
    </annot>
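The `deps` attribute links each annotation to the one it was derived from, so a POS tag can be traced back through its token to its sentence. The sketch below reuses the sentence, token and POS annotations shown on these slides. The `<annotations>` wrapper element is a hypothetical container, and treating `deps` as a single id is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# The <annotations> wrapper is a hypothetical container for this sketch.
SAF = """<annotations>
<annot type='sentence' id='s133' from='42065' to='43039' source='v4987' target='v5154' value='calculated for C11H18O3.'/>
<annot type='token' id='t5151' from='42988' to='43030' deps='s133' source='v5150' target='v5151' value='calculated'/>
<annot type='token' id='t5152' from='43031' to='43034' deps='s133' source='v5151' target='v5152' value='for'/>
<annot type='token' id='t5153' from='43035' to='43043' deps='s133' source='v5152' target='v5153' value='C11H18O3'/>
<annot type='pos' id='p5151' deps='t5151' source='v5150' target='v5151' value='VVN'/>
<annot type='pos' id='p5152' deps='t5152' source='v5151' target='v5152' value='IF'/>
<annot type='pos' id='p5153' deps='t5153' source='v5152' target='v5153' value='NP1'/>
</annotations>"""

annots = {a.get("id"): a for a in ET.fromstring(SAF).iter("annot")}

def dependency_chain(annot_id):
    """Follow `deps` links from an annotation to the one it ultimately rests on."""
    chain = []
    while annot_id is not None:
        annot = annots[annot_id]
        chain.append((annot.get("type"), annot.get("value")))
        annot_id = annot.get("deps")  # assumption: at most one id per deps
    return chain

chain = dependency_chain("p5153")
print(chain)
```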
29Types of SAF Annotation
- RMRS analyses
    <rmrs cfrom='42329' cto='43303'>
    <label vid='420'/>
    ...
    <ep cfrom='43258' cto='43288'><gpred>proper_q_rel</gpred><label vid='409'/><var sort='x' vid='410'/></ep>
    <ep cfrom='43258' cto='43288'><gpred>named_rel</gpred><label vid='411'/><var sort='x' vid='410'/></ep>
    ...
    <rarg><rargname>RSTR</rargname><label vid='409'/><var sort='h' vid='412'/></rarg>
    <rarg><rargname>BODY</rargname><label vid='409'/><var sort='h' vid='413'/></rarg>
    <rarg><rargname>CARG</rargname><label vid='411'/><constant>c11h18o3</constant></rarg>
    ...
    <hcons hreln='qeq'><hi><var sort='h' vid='412'/></hi><lo><label vid='411'/></lo></hcons>
    </rmrs>
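In RMRS the arguments are stored separately from the predications and attached to them by label. The sketch below regroups them, using only the lines of the fragment that are shown (the elided parts are left out). The output layout, with variables written as sort plus id such as `x410`, is a choice made for this illustration.

```python
import xml.etree.ElementTree as ET

RMRS = """<rmrs cfrom='42329' cto='43303'>
<label vid='420'/>
<ep cfrom='43258' cto='43288'><gpred>proper_q_rel</gpred><label vid='409'/><var sort='x' vid='410'/></ep>
<ep cfrom='43258' cto='43288'><gpred>named_rel</gpred><label vid='411'/><var sort='x' vid='410'/></ep>
<rarg><rargname>RSTR</rargname><label vid='409'/><var sort='h' vid='412'/></rarg>
<rarg><rargname>BODY</rargname><label vid='409'/><var sort='h' vid='413'/></rarg>
<rarg><rargname>CARG</rargname><label vid='411'/><constant>c11h18o3</constant></rarg>
<hcons hreln='qeq'><hi><var sort='h' vid='412'/></hi><lo><label vid='411'/></lo></hcons>
</rmrs>"""

root = ET.fromstring(RMRS)

# Group the separately stored arguments by the label of their predication.
args = {}
for rarg in root.findall("rarg"):
    label = rarg.find("label").get("vid")
    var, const = rarg.find("var"), rarg.find("constant")
    value = const.text if const is not None else var.get("sort") + var.get("vid")
    args.setdefault(label, {})[rarg.findtext("rargname")] = value

# One entry per elementary predication: (predicate, main variable, arguments).
predications = [
    (
        ep.findtext("gpred"),
        ep.find("var").get("sort") + ep.find("var").get("vid"),
        args.get(ep.find("label").get("vid"), {}),
    )
    for ep in root.findall("ep")
]
for predication in predications:
    print(predication)
```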
30SAF Flexibility
- The standoff supports a variety of annotation types
- Which communicate between different levels of analysis
- And between different analysis paths
- Hence it is also the main route for communication in the architecture
31SciXML Flexibility
- A common representation for the logical structure and essential formatting of research papers
- Conversion from various publishers' markup schemes
- And, also, from HTML, LaTeX and PDF
- Applied to several disciplines