Title: Illinois D-Lib Testbed: Technologies for Converting Legacy Mathematics for Display on the Web
1Illinois D-Lib Testbed: Technologies for Converting Legacy Mathematics for Display on the Web
- Timothy W. Cole
- Thomas G. Habing
- William H. Mischo
- Grainger Engineering Library Information Center
- University of Illinois at Urbana-Champaign
- http://dli.grainger.uiuc.edu/Publications/MathMLConf/
- thabing_at_uiuc.edu
2Project Background & Objectives
- Funded 1994-98 under DLI-I (NSF, DARPA, NASA)
- Continued 1998-2001 under CNRI's D-Lib Test Suite
- Objectives
  - Construct Large-Scale, Multipublisher, Markup-Based Full-Text Journal Testbed.
  - Investigate Processing, Indexing, Normalization, Retrieval, Rendering and Linking.
  - Study End-User Searching Behavior and Needs.
- Testbed contains 60,000 Articles from 50 Journal Titles
  - Received as SGML (various DTDs), converted to XML
  - Content support from AIP, APS, ASCE, IEE, ASM, ACM, Elsevier
  - Additional support from IEEE, NRL, NTT Learning Systems
3Project Background (cont.)
- Accomplishments
  - Process & Retrieve from Multiple Publishers, Heterogeneous DTDs.
  - SGML to XML Conversion.
  - Metadata Extraction, Representation, Merging.
  - Dynamic Linking: Forward/Backward, from/to A&I DBs.
- Current Investigations
  - Mathematics Markup & Rendering Issues
  - Metadata Harvesting: Replicative & Distributed
  - E-Journal Archiving
  - Local Resource Resolution
  - Asynchronous Searching of Multiple Resources
4Converting Legacy Markup to MathML
- Goal: Convert publisher-specific XML math markup to standard presentation MathML
  - Desired result: can then focus on a single rendering solution
- Groundrules
  - Minimize need for human intervention
  - Utilize standards-based techniques (e.g., XSLT, JavaScript, DOM)
  - Embed MathML in full XML document
  - Validate success of conversion based on quality of presentation
  - Strive for consistency across MathML viewers
- Scope
  - E.g., in 17,000 APS articles, > 2.3 M instances of math (100 K block)
  - http://dli.grainger.uiuc.edu/MathMLStyle/math_sample.htm
5Mathematics Markup Transformations
- Identify & translate mathematical character references
- Identify & tokenize mathematical content
- Recognize & transform mathematical markup (e.g., embellishments, script & limit schemata, etc.)
Presentational MathML:
  <math xmlns="http://www.w3.org/">
    <msubsup>
      <mrow><mi>&alpha;</mi></mrow>
      <mrow><mi>i</mi></mrow>
      <mrow><mn>2</mn></mrow>
    </msubsup>
  </math>
ISO 12083 Math:
  <dformula>
    <g>a</g>
    <sup>2</sup>
    <inf>i</inf>
  </dformula>
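As a rough illustration of the first two steps on this slide (translating character references and tokenizing content), here is a minimal Python sketch. The testbed did this work in JavaScript called from XSLT; the lookup table `CHAR_MAP`, the function `tokenize`, and the classification rules (digits become `<mn>`, single letters `<mi>`, everything else `<mo>`) are assumptions made for illustration, and the sketch assumes entity references arrive unresolved in the text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lookup table: entity name -> (MathML token element, Unicode character).
CHAR_MAP = {
    "alpha": ("mi", "\u03b1"),
    "beta": ("mi", "\u03b2"),
    "times": ("mo", "\u00d7"),
}

# One alternative per token kind: entity reference, number, single letter,
# or any other non-space character (treated as an operator).
TOKEN_RE = re.compile(r"&(\w+);|(\d+(?:\.\d+)?)|([A-Za-z])|(\S)")


def tokenize(text):
    """Return a list of MathML token elements for a run of character data."""
    tokens = []
    for entity, number, letter, other in TOKEN_RE.findall(text):
        if entity:
            # Unknown entity names are kept visible rather than silently dropped.
            tag, value = CHAR_MAP.get(entity, ("mtext", "&" + entity + ";"))
        elif number:
            tag, value = "mn", number
        elif letter:
            tag, value = "mi", letter
        else:
            tag, value = "mo", other
        element = ET.Element(tag)
        element.text = value
        tokens.append(element)
    return tokens


if __name__ == "__main__":
    for token in tokenize("2&alpha;x + 1.5"):
        print(ET.tostring(token, encoding="unicode"))
```

Treating every letter as its own identifier is a simplification; multi-letter function names such as "sin" would need an extra rule.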
6Approach Algorithm
- For each XML document: Identify mathematical nodes (e.g., <dformula>, <formula>)
- Recursively apply templates to every child node within mathematical nodes
  - Look up entities & special characters, convert to appropriate MathML characters & tokenize (JavaScript)
  - Tokenize remaining PCDATA (JavaScript)
  - Convert postfix markup to MathML (e.g., <sup>, <inf>)
  - Re-tag one-to-one transformations (e.g., <sum>, <ul>, <ll>)
- Transformed mathematical nodes (<math>) replace original mathematical nodes in document
  - Include default namespace attribute
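The outer loop of this algorithm (find the mathematical nodes, transform their children, swap in a `<math>` element carrying the namespace attribute) can be sketched in Python as follows. This is not the project's XSLT. The function names, the `RETAG` table and its target elements are placeholders, since the slides do not give the actual mappings. The namespace URI is the standard MathML one, written here as a literal `xmlns` attribute to keep the sketch short.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Legacy elements that mark a mathematical node (names taken from the slide).
MATH_NODES = {"dformula", "formula"}

# Hypothetical one-to-one re-tagging table; real mappings are publisher-specific.
RETAG = {"ll": "mrow", "ul": "mrow"}


def convert_node(node, retag):
    """Recursively copy a legacy node, renaming tags that appear in `retag`."""
    new = ET.Element(retag.get(node.tag, node.tag))
    new.text, new.tail = node.text, node.tail
    for child in node:
        new.append(convert_node(child, retag))
    return new


def convert_document(root, retag):
    """Replace each legacy math node under `root` with a <math> element."""
    # ElementTree keeps no parent pointers, so collect (parent, child) pairs
    # before changing the tree.
    pairs = [(parent, child)
             for parent in root.iter()
             for child in parent
             if child.tag in MATH_NODES]
    for parent, old in pairs:
        # Emit the default namespace as a plain attribute on <math>.
        math = ET.Element("math", {"xmlns": MATHML_NS})
        math.text, math.tail = old.text, old.tail
        for child in old:
            math.append(convert_node(child, retag))
        position = list(parent).index(old)
        parent.remove(old)
        parent.insert(position, math)
    return root


if __name__ == "__main__":
    doc = ET.fromstring(
        "<article><p>Energy <formula><ll>i</ll><ul>n</ul></formula>"
        " is small.</p></article>"
    )
    print(ET.tostring(convert_document(doc, RETAG), encoding="unicode"))
```

The sketch assumes math nodes are not nested inside one another and leaves tokenization and script handling to separate steps.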
7Approach Algorithm (cont.)
- Illustrative XSLT
  <xsl:when test="sup or inf">
    <xsl:for-each select="child::node()">
      <xsl:choose>
        <xsl:when test="name(self::node())='sup' and name(following-sibling::node()[1])='inf'">
          <xsl:element name="msubsup" namespace="http://www.w3.org/">
            <xsl:element name="mrow" namespace="http://www.w3.org/">
              <xsl:apply-templates select="preceding-sibling::node()[1]"/>
            </xsl:element>
            <xsl:element name="mrow" namespace="http://www.w3.org/">
              <xsl:apply-templates select="following-sibling::node()[1]"/>
            </xsl:element>
            <xsl:element name="mrow" namespace="http://www.w3.org/">
              <xsl:apply-templates select="self::node()"/>
            </xsl:element>
          </xsl:element>
        </xsl:when>
        . . . THERE ARE FOUR MORE CASES TO HANDLE!
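For readers less familiar with XSLT axes, the same idea (a postfix `<sup>` or `<inf>` attaches to the sibling before it, and a `<sup>`/`<inf>` pair becomes one `<msubsup>` with base, subscript, superscript in that order) can be sketched in Python. The function names are hypothetical, the input is assumed to be already tokenized, and the four combinations handled here are an assumption; they are not necessarily the cases the slide alludes to.

```python
import xml.etree.ElementTree as ET


def row_of(children):
    """Wrap the given elements in an <mrow>, as the XSLT above does."""
    row = ET.Element("mrow")
    row.extend(children)
    return row


def convert_scripts(children):
    """Rewrite postfix <sup>/<inf> siblings into msup, msub, or msubsup."""
    out = []
    i = 0
    while i < len(children):
        node = children[i]
        if node.tag in ("sup", "inf") and out:
            base = out.pop()  # the script applies to the preceding sibling
            scripts = {node.tag: node}
            nxt = children[i + 1] if i + 1 < len(children) else None
            if nxt is not None and nxt.tag in ("sup", "inf") and nxt.tag != node.tag:
                scripts[nxt.tag] = nxt  # sup+inf or inf+sup pair
                i += 1
            if len(scripts) == 2:
                new = ET.Element("msubsup")
                parts = [scripts["inf"], scripts["sup"]]  # subscript, then superscript
            elif "sup" in scripts:
                new = ET.Element("msup")
                parts = [scripts["sup"]]
            else:
                new = ET.Element("msub")
                parts = [scripts["inf"]]
            new.append(row_of([base]))
            for part in parts:
                new.append(row_of(list(part)))
            out.append(new)
        else:
            out.append(node)
        i += 1
    return out


if __name__ == "__main__":
    formula = ET.fromstring(
        "<dformula><mi>a</mi><sup><mn>2</mn></sup><inf><mi>i</mi></inf></dformula>"
    )
    for element in convert_scripts(list(formula)):
        print(ET.tostring(element, encoding="unicode"))
```

A script with no preceding sibling is passed through unchanged, which makes such cases easy to spot in later checks.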
8Remaining Issues
- JavaScript from within XSLT
  - Rely on MS-specific mechanisms to invoke extension functions
- Inconsistent Rendering by MathML Viewers
  - Validating against TechExplorer, Amaya, Mozilla, MS IE (w/ CSS)
  - Incomplete MathML implementations
- Ambiguity & Overuse of <mrow>
  - Limited impact on appearance
  - Verbosity -- 60% increase for inline, 15% increase for block
- Character / glyph issues
  - STIX project / Unicode update will provide some relief
- Automated Checking for Errors / Problems
- Rendering System Performance
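One simple form of automated checking is to verify that generated MathML elements have the number of children the MathML specification requires (for example, `<msubsup>` takes exactly three). The Python sketch below illustrates that idea only. It is not a check the project is known to have used, and `check_arity` and `REQUIRED_CHILDREN` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Child counts required by the MathML specification for these elements.
REQUIRED_CHILDREN = {
    "msub": 2, "msup": 2, "msubsup": 3,
    "mfrac": 2, "mroot": 2,
    "munder": 2, "mover": 2, "munderover": 3,
}


def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag, if present."""
    return tag.rsplit("}", 1)[-1]


def check_arity(math):
    """Return (element, expected, actual) for each element with a wrong child count."""
    problems = []
    for node in math.iter():
        name = local_name(node.tag)
        expected = REQUIRED_CHILDREN.get(name)
        if expected is not None and len(node) != expected:
            problems.append((name, expected, len(node)))
    return problems


if __name__ == "__main__":
    sample = ET.fromstring(
        '<math xmlns="http://www.w3.org/1998/Math/MathML">'
        "<msubsup><mi>a</mi><mi>i</mi></msubsup>"
        "<msup><mi>x</mi><mn>2</mn></msup>"
        "</math>"
    )
    print(check_arity(sample))
```

A check like this catches structural slips in the conversion but says nothing about rendering quality, which still has to be judged in the viewers.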
9Status
- Developing publisher-specific XSLT stylesheets
  - See sample transformed issue of Physical Review Letters
- XSLT allows us to generate standard MathML from publisher-dependent SGML math markup
  - Moves customization to pre-processing stage
  - Allows for single, common rendering solution
- MathML can be rendered in some browsers / tools without the need to style (Mozilla, techexplorer, Mathematica)