Title: Parsing TeX into Mathematics
1 Parsing TeX into Mathematics
- Richard Fateman
- Eylon Caspi
- Computer Science
- University of California at Berkeley
- ISSAC-99 Poster Session
- July, 1999
2 Abstract
Communication, storage, transmission, and
searching of complex material have become
increasingly important. Mathematical computing in
a distributed environment is also becoming more
plausible as libraries and computing facilities
are connected with each other and with user
facilities. TeX is a well-known mathematical
typesetting language, and from the display
perspective it might seem that it could be used
for communication between computer systems as
well as an intermediate form for the results of
OCR (optical character recognition) of
mathematical expressions. There are flaws in this
reasoning, since exchanging mathematical
information requires a system to parse and
semantically understand the TeX, even if it is
ambiguous notationally. A program we developed
can handle 43% of the 10,740 TeX formulas in a
well-known table of integrals. We expect that a
higher success rate can be achieved by further
tuning.
3 Why are we doing this?
The initial idea was quite simple: read the
integral table by Gradshteyn and Ryzhik (GR), as
well as any other large tables we can find, into a
computer so that a very good look-up program can
solve problems, including ones that challenge
computer algebra system algorithms. The need to
scan the books seemed to dissolve when we
encountered Academic Press's electronic (Dynabook)
version of GR, which seemed to have formulas
equivalent to TeX. All we needed to do was take
the TeX and parse it into real mathematics: in
our case, a Lisp data structure suitable for
storing in our database. The unfortunate reality
is that the TeX formulas in GR do not abide by a
simple grammar. They are ambiguous, at the very
least. The Academic Press typists concentrated on
appearance only. Had Academic Press started with a
true grammar and valid mathematical input and
macro-expanded it into TeX, this could have been a
piece of cake. Our program is mostly successful
nevertheless, but it relies on heuristics based on
context.
4 What does TeX know about math?
Surprisingly little. It doesn't know or care
that 234 is a number, and it does not really
distinguish it from xyz or cos. 234 typesets the
same as 234
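
To make the point concrete, here is a minimal sketch of how little structure survives a TeX-style reading of math input. The function name math_tokens and the token pattern are our own simplification for illustration, not TeX's actual scanner: a control sequence is one token, every other visible character is its own token, and spaces are dropped.

```python
import re

# Simplified stand-in for how TeX reads math-mode input: a control
# sequence (backslash + letters) is one token, every other visible
# character is its own token, and spaces are dropped.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\S")

def math_tokens(source):
    """Split a math-mode string into TeX-like tokens (illustrative only)."""
    return TOKEN.findall(source)

print(math_tokens("234"))      # three separate character tokens
print(math_tokens("2 3 4"))    # same tokens: the spaces vanish
print(math_tokens("xyz"))      # letters are handled just like digits
print(math_tokens("cos x"))    # c, o, s, x: no function in sight
print(math_tokens(r"\cos x"))  # only the macro marks cos as a unit
```

Under this reading, nothing tells a parser that 234 is one number or that the letters c, o, s name a function unless the typist happened to use a macro; that knowledge has to come from heuristics.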
5 Ambiguity of spaces
6 Ambiguity of precedence and associativity
We mostly believe that if you see cos(p)q, the
argument of cos is p. What about this piece
of clip-art from GR 4.384.4?
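
The ambiguity can be sketched by enumerating the candidate readings. The helper below is hypothetical (the name readings and the tuple encoding are assumptions for illustration, not the parser described on this poster): given a function name followed by juxtaposed factors, it lists every place the argument might end, with the remaining factors multiplying the result.

```python
def readings(func, factors):
    """List every way a run of juxtaposed factors can be split into
    (argument of func) followed by (trailing multipliers)."""
    results = []
    for k in range(1, len(factors) + 1):
        arg, rest = factors[:k], factors[k:]
        # A single factor is the argument itself; several form a product.
        arg_tree = arg[0] if len(arg) == 1 else ("*", *arg)
        applied = (func, arg_tree)
        results.append(applied if not rest else ("*", applied, *rest))
    return results

for tree in readings("cos", ["p", "q"]):
    print(tree)
```

For cos followed by p and q this yields both (cos p) times q and cos of (p times q). A grammar alone cannot choose between them; parentheses, spacing, or knowledge of the surrounding formula has to break the tie.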
7 Ad hoc parsers and a sanity check
Print the original TeX expression T. Parse T
into Lisp, convert that to Macsyma, and ask
Macsyma to produce from it a TeX expression S.
Print S. Compare T and S by eye. This does not
really work, though. Humans are easily fooled.
Simplifications can make identical expressions
look different. Semantically different
expressions can print the same. Much that can be
uttered in TeX cannot be said in current computer
algebra systems (cross-references? provenance?
English footnotes?).
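
A small sketch shows why the printed comparison is weak. The renderer below is a toy written for this illustration (to_tex, looks_the_same, and the FUNCTIONS set are assumptions; it is not Macsyma's TeX output routine). It prints function names as plain letters, as the GR source sometimes does, and two different trees then come out looking alike.

```python
FUNCTIONS = {"sin", "cos", "tg", "ctg"}  # names treated as functions (assumption)

def to_tex(tree):
    """Render a prefix-form tree as a TeX-like string (illustrative only)."""
    if not isinstance(tree, tuple):
        return str(tree)
    op, *args = tree
    if op == "*":
        # Products are printed by juxtaposition.
        return " ".join(to_tex(a) for a in args)
    if op == "power":
        base, exponent = args
        if isinstance(base, tuple) and base[0] in FUNCTIONS:
            # f(x)^n is conventionally printed as f^n x.
            return f"{base[0]}^{{{to_tex(exponent)}}} {to_tex(base[1])}"
        return f"{to_tex(base)}^{{{to_tex(exponent)}}}"
    if op in FUNCTIONS:
        return f"{op} {to_tex(args[0])}"
    raise ValueError(f"unknown operator: {op}")

def looks_the_same(a, b):
    """Compare renderings the way a reader would, ignoring spaces."""
    return to_tex(a).replace(" ", "") == to_tex(b).replace(" ", "")

product = ("*", "t", ("power", "g", 2), "v")  # t times g squared times v
tangent = ("power", ("tg", "v"), 2)           # tangent of v, squared

print(to_tex(product))
print(to_tex(tangent))
print(looks_the_same(product, tangent))
```

The two trees differ, yet their printed forms agree once spacing is ignored, so a visual check of S against T would pass a wrong parse.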
8 GR forms which currently require intervention,
and may not ever be worth parsing by machine
Formulas with English hints in footnotes.
Formulas with ... (ellipses). Formulas which
require reference to a geometric figure.
9 On the other hand, parsing TeX is easier than OCR
Reading TeX is reducible to OCR (Optical
Character Recognition): just print the material
on paper. It is EASIER to read TeX because the
characters are correctly identified, the
positions can be computed (look at a DVI file),
and, since there is no scanner, there is no
mechanical or optical noise.
10 Despite shortcomings, here's a successful
example, 6.124 from GR
11 Starting/Ending Encoding
Lisp

    (stmt ()
      (= (integrate
           (* (userfunc E x k)
              (/ 1 (power (* (- (power (sin x) 2) (power (sin u) 2))
                             (- (power (sin v) 2) (power (sin x) 2)))
                          (/ 1 2))))
           (x u n))
         (+ (* (* (/ 1 (* (* 2 (cos u)) (sin v)))
                  (userfunc bold_E k))
               (userfunc bold_K
                 (power (- 1 (/ (* (* t (power g 2)) u)
                                (* (* t (power g 2)) v)))
                        (/ 1 2))))
            (* (/ (* (power k 2) (sin v)) (* 2 (cos u)))
               (userfunc bold_K
                 (power (- 1 (/ (power (sin (* 2 u)) 2)
                                (power (sin (* 2 v)) 2)))
                        (/ 1 2))))))
      (= (power k 2)
         (- 1 (* (power (ctg u) 2) (power (ctg v) 2)))))

TeX

    \def\UUU{\quad \int_u^n{E(x,\tsp k) dx\over\sqrt{(\sin^2 x-\sin^2 u)
      (\sin^2 v-\sin^2 x)}}}
    \def\UU{\hphantom{\UUU}}
    \displaylines{\UUU = {1\over2\cos u\sin v}\mbi{E}(k)\mbi{K}
      \left(\sqrt{1-{tg^2\tsp u\over tg^2\tsp v}}\right)\hfill\cr
    \UU\hphantom{=}+{k^2\sin v\over2\cos u}
      \mbi{K}\left(\sqrt{1-{\sin^2 2u\over\sin^2 2v}}\right)\hfill\cr
    \hfill k^2=1-\ctg^2\tsp u\ctg^2\tsp v. \qquad\cr}
Although this looks pretty good at first sight,
in fact there are at least two errors in GR in
this input. First, tg means tangent, and thus
tg^2 v should not be (* t (power g 2) v) but
(power (tan v) 2). Failing to use \tg instead of
tg seems like a simple typo, but it makes the
formula nonsensical when processed purely
mechanically. A human probably wouldn't even
notice. Second, the upper limit should be v, not
n. But the table is so smudged in my printed copy
that you'd have trouble identifying that
character except by context.
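
One kind of context-based repair for the first error can be sketched as a pre-pass over the TeX source. This is only an illustration under stated assumptions: the function name restore_function_macros and the short list of names (tg, ctg) are ours, and the sample input uses a plausibly braced form of the formula fragment rather than the exact GR source.

```python
import re

# Match tg or ctg typed as bare letters: not already part of a control
# sequence or a longer word, and not followed by another letter.
BARE_FUNCTION = re.compile(r"(?<![\\A-Za-z])(ctg|tg)(?![A-Za-z])")

def restore_function_macros(tex):
    """Rewrite bare tg/ctg as \\tg/\\ctg before parsing (heuristic)."""
    return BARE_FUNCTION.sub(lambda m: "\\" + m.group(1), tex)

sample = r"\sqrt{1-{tg^2\tsp u\over tg^2\tsp v}}"
print(restore_function_macros(sample))
print(restore_function_macros(r"\ctg^2\tsp u"))  # already a macro: unchanged
```

The rewrite is a guess, not a proof: a formula that really does multiply a variable t by a variable g would be damaged by it, which is why such rules need a human check or further context before they are trusted.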
12 Future
Academic Press owns the rights to GR. We expect
AP will re-do their electronic version so that it
can be computer-processed in a semantically
meaningful way. This should be linked to a web
site with algorithms for lookup and computation.
(D. Zwillinger, AzTech Inc.) They may re-enter
everything by hand (again) in a suitable
semantic system, or perhaps use our program for
a first run-through. It is not clear how many
more large-scale TeX-based references or journals
may be worth attempting, but our experience
suggests that, with some effort, additional
programming could bring this or similar material
up to an 80% or higher recognition rate. The last
few percent will probably require human
intelligence.