Title: Unicode Support for Mathematics
1 Unicode Support for Mathematics
- Murray Sargent III
- Microsoft
2 Overview
- Unicode math characters
- Semantics of math characters
- Unicode and markup
- Multiple ways of encoding math characters
- Not yet standardized math characters
- Inputting math symbols
3 Unicode Math Characters
- 340 math chars exist in ASCII, U+2200 to U+22FF,
arrows, and combining marks of Unicode 3.0
- 996 math alphanumeric characters are proposed to
be added as requested by the STIX project (Plane 1)
- 951 new math symbols and operators are proposed
for the BMP
- One math variant code
- One new combining character (reverse solidus)
4 Math Alphanumeric Characters
- Math needs various Latin and Greek alphabets, such as
normal, bold, italic, script, Fraktur, and
open-face
- These may appear to be font variations, but they have
distinct semantics
- Without these distinctions, you get gibberish,
violating the Unicode rule that plain text must contain
enough info to permit the text to be rendered
legibly, and nothing more
- Plain-text searches should distinguish between
alphabets, e.g., a search for script H shouldn't
match H, etc.
- Reduces markup verbosity
5 Legibility Loss
- Without math alphabets, the Hamiltonian formula
- ℋ = ∫ dτ (εE² + μH²)
- becomes an integral equation
- H = ∫ dτ (εE² + μH²)
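The search and legibility points can be checked directly. The following is a minimal sketch, assuming the Hamiltonian formula reads ℋ = ∫ dτ (εE² + μH²); the helper name `positions` is hypothetical. It shows that a plain-text search separates script H from H, and that compatibility normalization (NFKC) erases the difference.

```python
import unicodedata

SCRIPT_H = "\u210B"  # SCRIPT CAPITAL H, used here for the Hamiltonian

# Assumed reading of the formula: script H = integral d(tau) (eps E^2 + mu H^2)
formula = "\u210B = \u222B d\u03C4 (\u03B5E\u00B2 + \u03BCH\u00B2)"


def positions(text: str, symbol: str) -> list:
    """Return every index at which `symbol` occurs in `text`."""
    return [i for i, ch in enumerate(text) if ch == symbol]


script_hits = positions(formula, SCRIPT_H)  # finds the Hamiltonian only
plain_hits = positions(formula, "H")        # finds the magnetic field only

# Compatibility normalization (NFKC) folds script H to plain H, which
# reproduces the loss of distinction described on this slide.
folded = unicodedata.normalize("NFKC", formula)
```

After folding, both symbols are the same character, so a search can no longer tell the Hamiltonian from the field.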
6 Math Alphanumeric Chars (cont)
- Bold a-z, A-Z, 0-9, α-ω, Α-Ω
- Italic a-z, A-Z, α-ω, Α-Ω
- Bold italic a-z, A-Z, α-ω, Α-Ω
- Script a-z, A-Z
- Bold script a-z, A-Z
- Fraktur a-z, A-Z
- Bold Fraktur a-z, A-Z
- Open-face a-z, A-Z, 0-9
- Sans-serif a-z, A-Z, 0-9
- Sans-serif bold a-z, A-Z, 0-9, α-ω, Α-Ω
- Sans-serif italic a-z, A-Z
- Sans-serif bold italic a-z, A-Z, α-ω, Α-Ω
- Monospace a-z, A-Z, 0-9
7 How to Display Math Alphabets?
- Can use Unicode surrogate pair mechanisms
available on the OS
- Alternatively, bind to standard fonts and use
the corresponding BMP characters
- The second approach is probably faster, and displaying
Unicode requires font binding in any event
- A single math font may look more consistent
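For the surrogate-pair route, a Plane 1 code point is split into two UTF-16 code units. A minimal sketch follows; the function name is hypothetical, and U+1D400 is used as the sample because it is MATHEMATICAL BOLD CAPITAL A in the current Unicode character database.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


BOLD_CAPITAL_A = 0x1D400
high, low = to_surrogate_pair(BOLD_CAPITAL_A)
```

The result can be cross-checked against Python's own UTF-16 encoder, which should produce the same two code units.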
8 Multiple Character Encodings
- As with nonmath characters, math symbols can
often be encoded in multiple ways, composed and
decomposed
- E.g., ≠ can be U+003D, U+0338 or U+2260
- Recommendation: use the fully composed symbol,
e.g., U+2260 for ≠
- For alphabetic characters, use the fully
decomposed sequence, e.g., use U+0061, U+0308 for
ä, not U+00E4
- Some representations use markup for the
alphabetic cases. This allows multicharacter
combining marks.
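The two recommendations (composed symbols, decomposed letters) can be sketched with the standard normalization forms. This is one possible reading, not a defined Unicode normalization form: the text is fully decomposed (NFD), and then only clusters whose base character is not alphabetic are recomposed (NFC). The function names are hypothetical.

```python
import unicodedata


def _finish_cluster(cluster: str) -> str:
    """Keep letters decomposed; recompose everything else."""
    if cluster[0].isalpha():
        return cluster
    return unicodedata.normalize("NFC", cluster)


def normalize_math_text(text: str) -> str:
    """Symbols fully composed, alphabetic characters fully decomposed."""
    pieces = []
    cluster = ""
    for ch in unicodedata.normalize("NFD", text):
        # A character with combining class 0 starts a new cluster.
        if unicodedata.combining(ch) == 0 and cluster:
            pieces.append(_finish_cluster(cluster))
            cluster = ""
        cluster += ch
    if cluster:
        pieces.append(_finish_cluster(cluster))
    return "".join(pieces)


sample = "\u00E4 \u003D\u0338 b"            # precomposed ä, decomposed ≠
normalized = normalize_math_text(sample)    # decomposed ä, composed ≠
```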
9 Compatibility Holes
- Compatibility holes (reserved positions) exist in
some Unicode sequences to avoid duplicate
encodings (ugh!)
- E.g., U+2071-U+2073 are holes for ¹²³, which are
U+00B9, U+00B2, and U+00B3, respectively
- Math alphanumerics have holes corresponding to
Letterlike symbols
- Recommendation: you can use the hole codes
internally, but should import and export the
standard codes
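The export side of that recommendation might look like the sketch below for script capitals. It uses code points from the current Unicode character database, which may differ in detail from the proposal these slides describe; the table and function names are hypothetical.

```python
import unicodedata

# Script capitals whose Plane 1 slots are holes, mapped to the
# Letterlike Symbols that already encode them in the BMP.
SCRIPT_CAPITAL_HOLES = {
    "B": 0x212C, "E": 0x2130, "F": 0x2131, "H": 0x210B,
    "I": 0x2110, "L": 0x2112, "M": 0x2133, "R": 0x211B,
}
SCRIPT_CAPITAL_A = 0x1D49C  # MATHEMATICAL SCRIPT CAPITAL A


def script_capital(letter: str) -> str:
    """Return the standard (exportable) script form of an ASCII capital."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single ASCII capital letter")
    if letter in SCRIPT_CAPITAL_HOLES:
        return chr(SCRIPT_CAPITAL_HOLES[letter])
    return chr(SCRIPT_CAPITAL_A + ord(letter) - ord("A"))


hamiltonian = script_capital("H")   # a Letterlike Symbol, not a Plane 1 code
hole_name = unicodedata.name(chr(SCRIPT_CAPITAL_A + 1), None)  # slot for B
```

An application could index the contiguous Plane 1 range internally and call a mapping like this only when importing or exporting text.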
10 Math Glyph Variants
- One approach to the math alphanumerics was to use
a set of math glyph variant tags
- Such a tag follows a base character, imparting a
math style
- The approach was dropped since it seemed likely to be
abused
- One math variant tag does exist, for the purpose of
offering a different line slant for some
composite symbols
11 Nonstandard Characters
- People will always invent new math characters
that aren't yet standardized
- Use the private use area for these, with a
higher-level marking that these are for math
- This approach can lead to collisions in the math
community (unless a standard is maintained)
- Cut/copy in plain text can have collisions with
other uses of the private use area
12 Unicode and Markup
- Unicode was never intended to represent all
aspects of text
- Language attribute: sort order, word breaks
- Rich (fancy) text formatting: built-up fractions
- Content tags: headings, abstract, author, figure
- Glyph variants: Poetica font has 58 ampersands,
Mantinia font has novel ligatures (TT, TE, etc.)
- MathML adds XML tags for math constructs, but
seems awfully wordy
13 Unicode Plain Text
- Can do a lot with plain text, e.g., BiDi
- Grey zone: use of embedded codes
- Unicode ascribes semantics to characters, e.g.,
paragraph mark, right-to-left mark
- Lots of interesting punctuation characters in the
range U+2000 to U+204F
- Extensive character semantics/properties tables,
including mathematical, numerical
14 Unicode Character Semantics
- Math characters have the math property
- Math characters are numeric, variable, or
operator, but not a combination
- Properties are useful in parsing math plain text
- MathML doesn't use these properties: every
quantity is explicitly tagged
- Properties can still be useful for inputting text
for MathML (no one wants to type all those tags!)
- Sometimes default properties need to be overruled
- Might be useful to have more math properties
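A rough numeric / variable / operator classification can be approximated from the general category. Python's `unicodedata` module does not expose the Math property itself, so this sketch uses category `Sm` (Symbol, math) plus a short list of ASCII operators as a stand-in; both the list and the function name are assumptions.

```python
import unicodedata

# ASCII operators whose general category is not Sm (assumed list).
EXTRA_OPERATORS = "-*/^_"


def math_role(ch: str) -> str:
    """Classify one character as numeric, variable, operator, or other."""
    category = unicodedata.category(ch)
    if category.startswith("N"):
        return "numeric"
    if category.startswith("L"):
        return "variable"
    if category == "Sm" or ch in EXTRA_OPERATORS:
        return "operator"
    return "other"


roles = [math_role(ch) for ch in "\u210B=\u222B2"]  # script H, =, integral, 2
```

A parser could consult such a table for defaults and let markup or user input overrule them where the default role is wrong.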
15 Plain Text Encoding
- TeX: the fraction numerator is what follows a { up to the
keyword \over
- The denominator is what follows the \over up to the
matching }
- The { } are not printed
- Simple rules give unambiguous plain text, but
results don't look like math
- How to make a plain text that looks like math?
16 Simple plain text encoding
- A simple operand is a span of non-operator
characters
- E.g., a simple numerator or denominator is
terminated by any operator
- Operators include arithmetic operators,
whitespace characters, all U+22xx, an argument
break operator (displayed as a small raised dot),
and sub/superscript operators
- The fraction operator is given by the Unicode
fraction slash operator U+2044
17 Fractions
- abc/d gives a built-up fraction with numerator abc
and denominator d
- More complicated operands use parentheses ( ),
brackets [ ], or braces { }
- Outermost parens aren't displayed in built-up
form
- E.g., plain text (a + c)/d displays as the built-up
fraction a + c over d
- Easier to read than TeX's, e.g., {a + c \over d}
- MathML: <mfrac><mrow><mi>a</mi><mo>+</mo><mi>c</mi></mrow>
<mrow><mi>d</mi></mrow></mfrac>
- Neat feature: plain text usually looks like math
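The operand rules for fractions can be sketched as a small converter from the linear form to LaTeX. This is an illustration under stated assumptions, not the author's implementation: the operator set is a guess, only the first fraction in the string is handled, and all function names are hypothetical.

```python
FRACTION_SLASHES = "/\u2044"  # ASCII slash and U+2044 FRACTION SLASH
OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPENERS.values())
# Assumed operator set for this sketch; any of these ends a simple operand.
OPERATORS = set("+-*=<>, ") | set(FRACTION_SLASHES)


def _is_operand_char(ch: str) -> bool:
    return ch not in OPERATORS and ch not in OPENERS and ch not in CLOSERS


def _left_start(text: str, end: int) -> int:
    """Index where the operand that ends just before `end` begins."""
    if end > 0 and text[end - 1] in CLOSERS:
        depth = 0
        for i in range(end - 1, -1, -1):
            if text[i] in CLOSERS:
                depth += 1
            elif text[i] in OPENERS:
                depth -= 1
                if depth == 0:
                    return i
        raise ValueError("unbalanced brackets in numerator")
    i = end
    while i > 0 and _is_operand_char(text[i - 1]):
        i -= 1
    return i


def _right_end(text: str, start: int) -> int:
    """Index one past the operand that begins at `start`."""
    if start < len(text) and text[start] in OPENERS:
        depth = 0
        for i in range(start, len(text)):
            if text[i] in OPENERS:
                depth += 1
            elif text[i] in CLOSERS:
                depth -= 1
                if depth == 0:
                    return i + 1
        raise ValueError("unbalanced brackets in denominator")
    i = start
    while i < len(text) and _is_operand_char(text[i]):
        i += 1
    return i


def _strip_outer(operand: str) -> str:
    """Outermost brackets are not displayed in built-up form."""
    if len(operand) >= 2 and OPENERS.get(operand[0]) == operand[-1]:
        return operand[1:-1]
    return operand


def build_up_first_fraction(text: str) -> str:
    """Rewrite the first linear fraction in `text` as LaTeX \\frac{num}{den}."""
    slash = next((i for i, ch in enumerate(text) if ch in FRACTION_SLASHES), -1)
    if slash < 0:
        return text
    start = _left_start(text, slash)
    end = _right_end(text, slash + 1)
    numerator = _strip_outer(text[start:slash])
    denominator = _strip_outer(text[slash + 1:end])
    prefix, suffix = text[:start], text[end:]
    return f"{prefix}\\frac{{{numerator}}}{{{denominator}}}{suffix}"
```

For example, `build_up_first_fraction("(a + c)/d")` drops the outer parentheses from the numerator, while in `"x+abc/d+y"` the `+` signs terminate both simple operands.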
18 Subscripts and Superscripts
- Unicode has numeric subscripts and superscripts
along with some operators (U+2070-U+208E)
- Others need some kind of markup like
<msup></msup>
- With special subscript and superscript operators
(not yet in Unicode), these scripts can be
encoded in a nestable way
- Use parentheses as for fractions to overrule
the built-in precedence order
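The numeric subscripts and superscripts can be reached with a simple translation table. In this sketch (all names hypothetical), anything outside the digits and the operators + - = ( ) raises an error, standing in for the cases that need markup.

```python
_PLAIN = "0123456789+-=()"
# Superscript 1, 2, 3 live in Latin-1; the rest are in U+2070 to U+207E.
_SUPERSCRIPTS = (
    "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079"
    "\u207A\u207B\u207C\u207D\u207E"
)
# Subscripts are contiguous: U+2080 to U+208E.
_SUBSCRIPTS = "".join(chr(0x2080 + i) for i in range(len(_PLAIN)))

_TO_SUPER = str.maketrans(_PLAIN, _SUPERSCRIPTS)
_TO_SUB = str.maketrans(_PLAIN, _SUBSCRIPTS)


def _convert(text: str, table: dict, kind: str) -> str:
    unsupported = [ch for ch in text if ord(ch) not in table]
    if unsupported:
        raise ValueError(f"no {kind} form for {unsupported!r}; markup needed")
    return text.translate(table)


def superscript(text: str) -> str:
    """Map digits and + - = ( ) to their Unicode superscript forms."""
    return _convert(text, _TO_SUPER, "superscript")


def subscript(text: str) -> str:
    """Map digits and + - = ( ) to their Unicode subscript forms."""
    return _convert(text, _TO_SUB, "subscript")
```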
19 Unicode TeX Example
20 Symbol Entry
- GUI PCs can display myriad glyphs, mathematics
symbols, and international characters
- Hard to input special symbols. Menu methods are
slow. Hot keys are great but hard to learn
- Reexamine and improve symbol-input and storage
methods
- With left/right Ctrl/Alt keys, the PC keyboard gives
direct access to 600 symbols. Maximum possible:
2^100 ≈ 10^30
- Use on-screen, customizable keyboards and symbol
boxes
- Drag and drop any symbol into apps or onto keyboards
21 Hex to Unicode Input Method
- Type the Unicode character's hexadecimal code
- Make corrections as need be
- Type Alt+x to convert to the character
- Type Alt+x to convert back to hex (useful
especially for a missing-glyph character)
- Resolve ambiguities by selection
- Input higher-plane chars using a 5- or 6-digit code
- New MS Office standard
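The toggle behaviour can be sketched as a pure function on the text before the caret. This is an illustration of the idea only, not a description of how any Office application implements it: the function name is hypothetical, and the rule "take the longest valid hex tail of 2 to 6 digits" is an assumption.

```python
import string


def alt_x(text: str) -> str:
    """Toggle the end of `text` between a hex code and its character."""
    # Try the longest hex tail first so 5- and 6-digit codes are reachable.
    for width in range(6, 1, -1):
        tail = text[-width:]
        if len(tail) == width and all(c in string.hexdigits for c in tail):
            code_point = int(tail, 16)
            is_surrogate = 0xD800 <= code_point <= 0xDFFF
            if code_point <= 0x10FFFF and not is_surrogate:
                return text[:-width] + chr(code_point)
    if text:
        # No hex tail: convert the last character back to its hex code.
        return text[:-1] + format(ord(text[-1]), "04X")
    return text
```

A word that happens to end in hex digits (such as "be") is ambiguous under this rule, which is where selecting the intended digits first would come in.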
22 Built-Up Formula Heuristics
- Math characters identify themselves and neighbors
as math
- E.g., fraction (U+2044), ASCII operators,
U+2200-U+22FF, and U+20D0-U+20FF identify neighbors
as mathematical
- Math characters include various English and Greek
alphabets
- When heuristics fail, the user can select math mode:
WYSIWYG instead of visible math on/off codes
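A token-level version of this heuristic might look like the sketch below. The choice of ASCII operators, the whitespace tokenization, and the "one neighbor on each side" rule are all assumptions made for illustration; the function names are hypothetical.

```python
ASCII_OPERATORS = "+=<>^"  # assumed subset; "-" and "/" are common in prose


def is_math_marker(ch: str) -> bool:
    """True for characters that mark themselves and neighbors as math."""
    code_point = ord(ch)
    return (
        ch in ASCII_OPERATORS
        or code_point == 0x2044                # FRACTION SLASH
        or 0x2200 <= code_point <= 0x22FF      # Mathematical Operators
        or 0x20D0 <= code_point <= 0x20FF      # combining marks for symbols
    )


def math_tokens(text: str) -> list:
    """Return the tokens that contain or directly neighbor a math marker."""
    tokens = text.split()
    flagged = [any(is_math_marker(ch) for ch in token) for token in tokens]
    result = []
    for i, token in enumerate(tokens):
        # A token is math if it, or an immediate neighbor, holds a marker.
        if any(flagged[max(0, i - 1):i + 2]):
            result.append(token)
    return result
```

Such a rule will sometimes pull an ordinary word into the math zone, which is the failure case the last bullet leaves to the user.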
23 Operator Precedence
- Everyone knows that multiply takes precedence
over add, e.g., 3+5×3 = 18, not 24
- C-language precedence is too intricate for most
programmers to use extensively
- TeX doesn't use precedence; it relies on { } to
define operator scope
- In general, ( ) can be used to clarify or
overrule precedence
- Precedence reduces clutter, so some precedence is
desirable (else things look like LISP!)
- But keep it simple enough to remember easily
24 Layout Operator Precedence
- Subscript, superscript
- Integral, sum: ∫ ∑ ∏
- Functions: √
- Times, divide: /
- Other operators: Space " . , - LF Tab
- Right brackets: )
- Left brackets: (
- End of paragraph: FF CR EOP
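A small precedence table is enough to drive a parser. The sketch below uses precedence climbing over single-character tokens. It assumes the list above runs from highest to lowest precedence, uses `^` and `_` as hypothetical stand-ins for the superscript and subscript operators, and treats every operator as left-associative for brevity.

```python
# Higher number binds tighter; three levels taken from the list above.
PRECEDENCE = {"^": 3, "_": 3, "/": 2, "*": 2, "+": 1, "-": 1}


def parse(tokens: list, min_level: int = 1):
    """Parse single-character tokens (consumed in place) into nested tuples."""
    left = tokens.pop(0)
    if left == "(":
        left = parse(tokens, 1)
        tokens.pop(0)  # discard the closing ")"
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_level:
        operator = tokens.pop(0)
        # The right operand may only absorb operators that bind tighter.
        right = parse(tokens, PRECEDENCE[operator] + 1)
        left = (operator, left, right)
    return left


tree = parse(list("a+b/c"))            # division groups before addition
overruled = parse(list("(a+c)/d"))     # parentheses overrule precedence
```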
25 Mathematics as a Programming Language
- Fortran made great steps in getting computers to
understand mathematics
- Java accepts Unicode variable names
- C++ has a preprocessor and operator overloading,
but needs extensions to be really powerful
- Use Unicode characters including math
alphanumerics
- Use plain-text encoding of mathematical
expressions
- Can't use all mathematical expressions as code,
but can go much further than current languages go
- When to multiply? In the abstract, multiplication
is infinitely fast and precise, but not on a
computer
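Some current languages already go part of the way. As a sketch, Python accepts Greek letters in identifiers, and it also accepts math alphanumerics but applies NFKC normalization to identifiers, so a mathematical italic H names the same variable as a plain H. The function names and the sample expression γ·sqrt(1 + I2) are illustrative assumptions.

```python
from math import sqrt


def saturated_rate(γ: float, I2: float) -> float:
    """Return γ·sqrt(1 + I2), written with a Greek identifier."""
    return γ * sqrt(1 + I2)


def defined_names(source: str) -> set:
    """Execute `source` and return the names it defines."""
    namespace = {}
    exec(source, namespace)
    return {name for name in namespace if not name.startswith("__")}


# U+1D43B MATHEMATICAL ITALIC CAPITAL H is accepted as an identifier, but
# the parser folds it to "H", so the math-alphabet distinction is lost.
italic_h_names = defined_names("\U0001D43B = 1")
```

A language that wanted script H and plain H to be different variables would have to skip that normalization step.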
26
/* All variables are assumed to be declared at file scope. */
void IHBMWM(void)
{
    gammap = gamma*sqrt(1 + I2);
    upsilon = cmplx(gamma+gamma1, Delta);
    alphainc = alpha0*(1-(gamma*gamma*I2/gammap)/(gammap + upsilon));
    if (!gamma1 && fabs(Delta*T1) < 0.01)
        alphacoh = -half*alpha0*I2*pow(gamma/gammap, 3);
    else
    {
        Gamma = 1/T1 + gamma1;
        I2sF = (I2/T1)/cmplx(Gamma, Delta);
        betap2 = upsilon*(upsilon + gamma*I2sF);
        beta = sqrt(betap2);
        alphacoh = 0.5*gamma*alpha0*(I2sF*(gamma + upsilon)
            /(gammap*gammap - betap2))
            *((1+gamma/beta)*(beta - upsilon)/(beta + upsilon)
            - (1+gamma/gammap)*(gammap - upsilon)/(gammap + upsilon));
    }
    alpha1 = alphainc + alphacoh;
}
27 (No Transcript)
28 (No Transcript)
29 Conclusions
- Unicode provides great support for math in both
marked up and plain text
- Unicode character properties facilitate
plain-text encoding of mathematics but aren't
used in MathML
- Heuristics allow plain text to be built up
- Need two more Unicode assignments: subscript and
superscript operators
- On-screen keyboards and symbol boxes aid formula
entry
- Unicode math characters could be useful for
programming languages