Title: Software Localization
1Software Localization
Lecture 2
Dr. Gregory M. Shreve Institute for Applied
Linguistics
2Intercultural Communication
3Multicultural/Multilingual Information Delivery
In order to reach a large international user base
it is necessary that the globalisation of a
software system, web page or electronic document
is done properly. The culturally and
linguistically dependent parts of the software
must be isolated, a process referred to as
internationalisation. These parts include text
manipulation and display, character-encoding
methods, collation sequences, hyphenation and
morphological rules, formats used for numbers and
dates, as well as more subtle cultural
conventions such as the use of icons, symbols and
colour. The local market requirements for these
items are encapsulated in the term locale.
4Understand the Culture
Let's say your web site features a JavaScript
graphic of a cartoon character waving at the
reader. You know, just...waving. Friendly
gesture, right? Not in Greece. Not in Nigeria. In
those nations, the palm-forward wave is a nasty
gesture indeed. As is the thumbs-up signal (so
common in U.S. product reviews) in Iran. Not to
mention the thumb-and-index-finger "OK" sign,
which is most definitely "not the OK sign in
Brazil," according to Wei-Tai Kwok, president of
DAE Interactive Marketing, a San Francisco-based
full-service Web development company that
specializes in helping U.S. companies build a
global presence. Cultural assessment or cultural
auditing is often necessary to understand the
full requirements for the localization of any
product. Such an assessment can produce the
localization requirements needed for a
successful project.
5Cultural Attitudes and Assumptions
The credit card is the linchpin of e-commerce in
the United States. In many countries, this causes
all sorts of payment trouble. Many Germans view
plastic as a crutch for weaklings who can't
control their finances; they far prefer direct
bank-account debit. Mark Lancaster, CEO and chair
of SDL International, a globalization solutions
provider based in Berkshire, England, points to
currency and date formats as two obvious but
easy-to-miss snags for global sites. "But the big
one," Lancaster says, "is examples. Most
e-commerce stuff is typically marketing oriented,
and you know they really work hard to spin that
stuff. But the right spin in the United States
typically doesn't have the same impact elsewhere.
There's a lot of baseball knowledge in the
States; you might want to rethink that example
in, say, the Netherlands."
6Culture And Technology
But there's culture and then there's culture
intertwined with the use of technology. Hong Kong
lacks any sort of ZIP code equivalent. "A guy in
my office just moved here from Hong Kong," Kwok
says. "The ZIP code field is a major annoyance to
him. He gets all the way through an order, puts
in his address, and since he doesn't put in a ZIP
code, some systems ding him. Won't accept the
order. He's taken to punching in 00000 and hoping
for the best." Pointing out that China uses a
six-digit geographic code while Great Britain and
other nations use letters, Kwok advises that ZIP
codes should be assigned a free-form field that
accepts any input, rather than just five- or
nine-digit U.S.-style codes. "The technology
should be dumb enough not to be too smart," he
says.
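Kwok's advice can be sketched in a few lines of Python. This is only an illustration: the function name `postal_code_ok`, the two-letter country keys and the two patterns are assumptions made for the example, not a complete or authoritative rule set. It is also slightly stricter than a fully free-form field, because it still checks the format where one is known.

```python
import re

# Hypothetical, deliberately incomplete table: a pattern is listed only
# where the format is known. Everything else is treated as free-form.
KNOWN_PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),  # five- or nine-digit ZIP code
    "CN": re.compile(r"\d{6}"),           # six-digit code
}


def postal_code_ok(country: str, value: str) -> bool:
    """Accept any input unless a pattern is known for the country."""
    pattern = KNOWN_PATTERNS.get(country)
    if pattern is None:
        # No rule known (Hong Kong, for example, has no postal code):
        # accept whatever was typed, including an empty field.
        return True
    return pattern.fullmatch(value.strip()) is not None


print(postal_code_ok("HK", ""))          # True
print(postal_code_ok("GB", "SW1A 1AA"))  # True
print(postal_code_ok("US", "ABC"))       # False
```

The point of the design is the fallback branch: an unknown country never causes an order to be rejected.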
7Some Issues are Simpler
Some Issues are Not Simple
- Non-Language Symbologies and Allowed Usages
  (flags, sacred symbols)
- Icons
- Meaning of Graphics (allowed content)
- Meaning of Colors
- Formats / Conventions: Money, Time, Date, Postal
  Codes, Address Formats
Some Issues are Very Complex Indeed
- Language
- Writing System, Characters
- Translation (esp. humour, idiom, allusion)
- Document Structure
8Culture, Law And Design
In most countries, a company selling a product
might feature a typical "click-here" contract
page. But guess what? "In Italy and Mexico, they
don't view online contracts as legally binding,"
says Jeff Anderson, a manager for GE TradeWeb. In
those countries, potential customers are instead
shown a "please contact your local office" page.
Once they make contact, that local office faxes
them a contract. This cuts down on efficiency and
speed, of course, but the law's the law. Allison
Kurlya, senior vice president and technology
director at Digitas, says that "in Latin America
they do a lot of business by phone calls." So a
webpage that would be transaction-oriented in the
United States might be a request for a fax in
many European countries and a form that says,
"When's the best time to call you?" in South
America, Kurlya says.
9Culture Specific Design
Research by Barber & Badre (2000) indicates the
existence of culture-specific web design elements
across web sites from various cultural origins,
for which they coined the term "cultural
markers". Cultural markers are design elements
that are repeatedly employed in web pages,
creating recognisable design patterns across
culturally different web sites. For instance,
Middle Eastern sites predictably tended to
orient text and graphics from right to left.
This orientation makes sense to a Middle Eastern
reader, whilst it would not to a Westerner. Also, across
all Brazilian sites examined and of all genres,
there was a noticeable preference for use of many
colours, while Lebanese sites of all genres
tended to be text-oriented rather than
graphics-based.
10An Example: Color
Colors, too, carry different meanings in
different places. A few years ago, Kwok recalls,
a large technology vendor was building a global
website that was entirely black, a color that
connotes hipness and sophistication in the United
States. When the site was unveiled, the vendor's
webmaster told Kwok, "Hong Kong and China
objected. Black, that means death, unlucky,
morbid." DAE Interactive Marketing, like many
similar agencies and services, performs a
"cultural audit" on clients' sites to make sure
innocent mistakes like these don't kill a
company's global push.
11The selection of colours on a web page carries
a subtle but still significant amount of
information (e.g. Badre & Barber, 2000; Gould &
Marcus, 2000). Different cultures have different
associations for specific colours. For instance,
while in western cultures red and green represent
prohibition and permission respectively, this is
not true for the Chinese.
12Number
Number formats, despite the almost universal use
of Arabic numerals, vary across different regions
of the world. For instance, in much of
continental Europe a period (.) separates groups
of digits in large numbers (e.g. 1.000.000) and a
comma (,) separates the whole part of a figure
from the decimal digits (1,5), whilst in America
the reverse convention applies.
Continental Europe: 1.000.000
America: 1,000,000
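The two conventions can be illustrated with a small Python sketch. The `SEPARATORS` table, its locale names and the function `format_number` are hypothetical names chosen for this example; a real product would normally take these values from its platform's locale data rather than hard-code them.

```python
# Hypothetical separator table: (grouping separator, decimal separator).
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
}


def format_number(value, locale_name, decimals=2):
    """Format a number with the separators listed for locale_name."""
    group_sep, decimal_sep = SEPARATORS[locale_name]
    # Format with US conventions first, then swap in the separators.
    us_text = f"{value:,.{decimals}f}"
    whole, _, fraction = us_text.partition(".")
    whole = whole.replace(",", group_sep)
    return whole + (decimal_sep + fraction if fraction else "")


print(format_number(1000000, "en_US", 0))  # 1,000,000
print(format_number(1000000, "de_DE", 0))  # 1.000.000
print(format_number(1.5, "de_DE", 1))      # 1,5
```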
13Date and Time
Date and time formats also vary greatly across
cultures, both in the order in which day, month
and year appear and in the use of a 12- or
24-hour clock. Furthermore, some Asian
countries don't use western calendars, or their
calendars are based on a decade counting unit.
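A minimal Python sketch of the same moment shown with different field orders and clock conventions. The pattern table `DATE_TIME_PATTERNS` and the function `format_moment` are assumptions made for illustration, and only the Gregorian calendar is covered.

```python
from datetime import datetime

# Hypothetical per-locale patterns in strftime syntax.
DATE_TIME_PATTERNS = {
    "en_US": "%m/%d/%Y %I:%M %p",  # month/day/year, 12-hour clock
    "en_GB": "%d/%m/%Y %H:%M",     # day/month/year, 24-hour clock
    "ja_JP": "%Y/%m/%d %H:%M",     # year/month/day, 24-hour clock
}


def format_moment(moment, locale_name):
    """Render one datetime using the pattern listed for locale_name."""
    return moment.strftime(DATE_TIME_PATTERNS[locale_name])


moment = datetime(2000, 3, 4, 15, 30)
for name in DATE_TIME_PATTERNS:
    print(name, format_moment(moment, name))
# en_GB gives 04/03/2000 15:30 and ja_JP gives 2000/03/04 15:30;
# the en_US line starts with 03/04/2000 03:30 followed by the AM/PM marker.
```

Note that 03/04/2000 and 04/03/2000 are the same day here, which is exactly why an unlabelled numeric date is ambiguous to an international audience.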
14Icons
Images are media rich in meaning, conveying
messages that textual information may fail to
transmit. Therefore, they must be meaningfully
converted for use in another cultural context.
For example, the icon commonly used by Sun
Microsystems to represent an e-mail mailbox or a
link for sending e-mail may be incomprehensible
to users who have never seen a similar real-life
mailbox. An even more common example is the
widely used "home" icon. This icon usually points
to the "home" or "root" location in a web
browser's toolbar or to a web site's home page.
However, some users may fail to recognise its
meaning if they have not encountered "homes" like
the one represented by this icon. Presumably,
this will be more frequent among novice users,
since constant practice on the web will result in
establishing the icon's meaning.
15Image Acceptability
Furthermore, a software or web developer should
consider an image's or icon's acceptability
before using it, namely whether a particular
image is offensive to another culture. More
specifically, religious symbols such as stars
and crosses, images of the body and body parts,
images of women, and hand gestures might have a
negative impact on other nations' moral
perception. Characteristically, Khaslavsky (1998)
reports that in Japan isolated body parts are
perceived particularly negatively.
16Symbols
Symbols, as well as graphics, might be
incomprehensible if not properly localised. For
example, whilst the cross in the Christian
western world represents prohibition, in Arab
countries it doesn't. Again, religious symbols or
ones that are subject to misinterpretation and
ambiguity should be carefully localised.
17Flow of Information
The way textual and graphical information is
comprehensively and meaningfully aligned on a web
page varies significantly among the cultures. In
America a series of columns and tables will be
aligned from left to right and top to bottom.
Nevertheless, in the Arab world that will not
make sense, because there the logical arrangement
of information is from right to left. What is
more, Chinese is traditionally read vertically,
from top to bottom, and not horizontally.
18Aesthetics
Tractinsky (1997), citing Jakob Nielsen's work on
international user interfaces, ascribes usability
of a computerised system to five attributes:
learnability, efficiency, memorability, errors
and satisfaction. The latter may imply that the
aesthetic impact of a web site may enhance user
performance. Indeed, Tractinsky's experiments
found strong relations between aesthetics and
user performance. Clearly, aesthetic perceptions
are culturally dependent (Fernandes, 1994). In
light of the above, one might expect that an
additional adaptation may indeed be essential: a
web page should comply with the culturally
established aesthetic conventions of the
country in question. The above general
guidelines highlight the necessity to consider a
wider array of design variables during the
internationalisation and localisation process
than the readily apparent ones.
19Geert Hofstede
A variety of factors must be considered when
localising a web site or piece of software. Text,
colours, graphics, navigation and functions need
to be carefully employed to fit the target
culture's mental models, patterns of thought and
action, and expectations. The work of the Dutch
cultural anthropologist Geert Hofstede in the
1980s has provided web developers with
significant results and suggestions from which to
develop guidelines for producing successfully
localised software and web sites. Hofstede
describes cultures along five dimensions:
- Power-distance
- Collectivism versus individualism
- Femininity versus Masculinity
- Uncertainty avoidance
- Long- versus short-term orientation
20Language, Writing and Computers
By far the most complex cultural issue is
language. A language is a way that humans
interact. In computerised form, a text in a
written language can be expressed as a string of
characters. The same set of characters can often
be used for many written languages, and many
written languages can be expressed using
different scripts. Concepts like character set
and encoding describe the way text is stored in
computers, in files and data structures, and how
applications handle such text. When you use a
computer to write and file your master's thesis
or your mother's Black Forest cake recipe, you
produce text that you expect your computer to
store, to display on your home page, or to send
in e-mail.
21Language and Writing: Text
[Figure: the letter A shown as rich text and as plain text]
Text consists of characters, mostly. Fancy text
or rich text includes display properties like
color, italics, and superscript styles, but it is
still based on characters forming plain text.
Sometimes the distinction between fancy text and
plain text is complex, and the distinction may
depend on the application. Here, we focus on
plain text. So, what is a character? Typically, a
letter. Also, a digit, a period, a hyphen,
punctuation, and math symbols. There are also
control characters (typically not visible) that
define the end of a line or paragraph. There is a
character for tabulation, and a few others in
common use.
22Language and Writing: Glyph
[Figure: the character A shown as several different glyphs]
The same characters are often shown with somewhat
different glyphs (shapes) depending on the font
used, the automatic shaping applied, or the
automatic formation of ligatures. In addition,
the same characters can be shown with somewhat
different glyphs depending on the language being
used, even within the same font or through an
automatic font change. Some glyphs are very
different even though they represent the same
abstract character, as for instance the lowercase
Cyrillic p.
23Language and Writing: Glyph
Characters may also take on different shapes in
different contexts. So, for example, the Arabic
character hah may have four different basic
shapes.
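Unicode happens to carry separate compatibility code points for the four contextual shapes of hah, which makes the character/glyph distinction easy to inspect from Python's standard `unicodedata` module. This is a sketch for illustration only; in normal text only the abstract character U+062D is stored and the rendering software chooses the shape.

```python
import unicodedata

# The abstract character hah and its four "presentation form"
# code points (isolated, final, initial, medial).
HAH = "\u062D"
PRESENTATION_FORMS = ["\uFEA1", "\uFEA2", "\uFEA3", "\uFEA4"]

for form in PRESENTATION_FORMS:
    # NFKC normalization folds each contextual shape back to the
    # single abstract character.
    folded = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(folded):04X}")

all_same = all(
    unicodedata.normalize("NFKC", form) == HAH for form in PRESENTATION_FORMS
)
print(all_same)  # True
```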
24Abstract Character Repertoire (ACR)
In internationalization we are concerned
primarily with representing the abstract
character, not its rich display formats nor its
font (script) variations. In any given
language-using application there is a set of
definable (in the abstract) characters that can
occur. This is a non-coded character set, also
called an Abstract Character Repertoire.
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV
WXYZ1234567890-!@()<>?,./
25Coded Character Sets (CCS)
Thus, to design a character set, you first decide
how many and which characters you need. These
characters are the repertoire that you will work
with. Then you give each character an integer
number, and you've got a character set. The
result is called a Coded Character Set (CCS).
Before you assign the numbers, the collection of
characters is called an Abstract Character
Repertoire (non-coded character set).
[Figure: the abstract character A in the Abstract Character Repertoire is assigned the code point 65 in the Coded Character Set]
26Character Mapping
Getting from the ACR to the CCS is accomplished by
a scheme that assigns the numbers to the abstract
characters. This is a character mapping.
[Figure: a Character Mapping takes the character A in the Abstract Character Repertoire to the code point 65 in the Coded Character Set]
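The step from repertoire to coded character set can be sketched in Python. The nine-character repertoire and the names `toy_ccs` and `toy_decode` are invented for this example, and the numbering is arbitrary; the built-in `ord` and `chr` show the mapping that ASCII and Unicode actually use for the letter A.

```python
# A toy abstract character repertoire: characters only, no numbers yet.
repertoire = ["A", "B", "C", "a", "b", "c", "0", "1", "!"]

# Character mapping: give each character an integer (its code point).
# The numbering is arbitrary and purely illustrative.
toy_ccs = {char: number for number, char in enumerate(repertoire)}
toy_decode = {number: char for char, number in toy_ccs.items()}

print(toy_ccs["A"])              # 0 in this toy set
print(toy_decode[toy_ccs["!"]])  # !

# Python's built-ins expose the mapping used by ASCII and Unicode.
print(ord("A"))                  # 65
print(chr(65))                   # A
```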
27Single-Byte 8 Bit Character Set
A CCS like US-ASCII or ISO-8859-1 with 256 or
fewer characters and no integer value above 255
can easily serve as a single-byte 8-bit charset
where each octet of 8 bits (byte) is taken as a
binary number to look up the one coded character
it represents: 01000001 -> 65 -> 'A'.
[Figure: the character A has the code point 65, stored as the single code unit 01000001 (one 8-bit octet)]
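The lookup 01000001 -> 65 -> 'A' can be reproduced directly in Python. The variable names are chosen for this example; `latin-1` is Python's codec name for ISO-8859-1.

```python
octet = "01000001"            # the eight bits of one byte
code_point = int(octet, 2)    # read the bits as a binary number
character = bytes([code_point]).decode("latin-1")  # look the number up
print(code_point, character)  # 65 A

# 0xE4 lies beyond ASCII but is still a single byte in ISO-8859-1.
print(bytes([0xE4]).decode("latin-1"))  # ä
```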
28Character Encoding Form (CEF)
US-ASCII maps from a set of integers to a single
code unit that is 8 bits wide. The character
encoding form (CEF) is a mapping from the set of
integers used in a CCS to the set of sequences of
code units. A code unit is an integer occupying a
specified binary width in a computer
architecture, such as a septet, an octet, or a
16-bit unit. An octet is a small unit of data
with a numerical value between 0 and 255,
inclusive. The encoding form enables character
representation as actual data in a computer.
A character may be represented by more than one
code unit, and encoding forms differ in the width
of their code units.
[Figure: a Character Encoding Form maps a character's code point (an integer) to code units; here the character maps to 01000001, 1 code unit of length 8]
29Character Encoding Form (CEF)
- A CEF can specify multiple code units of varying
  length.
- A character encoding form whose code unit
  sequences are all of the same length is known as
  fixed width.
- A character encoding form whose code unit
  sequences are not all of the same length is
  known as variable width.
The encoding form defines one of the fundamental
relations that internationalized software cares
about: how many code units there are for each
character and what their size is. This used to be
expressed in terms of how many bytes each
character was represented by. With the
introduction of UCS-2, UTF-16, UCS-4, and UTF-32
with wider code units for Unicode and 10646, this
is generalized to two pieces of information: a
specification of the width of the code unit, and
the number of code units used to represent each
character.
30- Examples of fixed-width encoding forms other than
ASCII:
  - 7-bit: each encoded character is represented in
    a 7-bit quantity, for example as in ISO 646
  - 8-bit: each encoded character is represented in
    an 8-bit quantity
  - 8-bit EBCDIC: each encoded character is
    represented in an 8-bit quantity, with the EBCDIC
    conventions rather than ASCII conventions
  - 16-bit (UCS-2): each encoded character is
    represented in a 16-bit quantity
  - 32-bit (UCS-4): each encoded character is
    represented in a 32-bit quantity within a code
    space of 0..7FFFFFFF
  - 32-bit (UTF-32): each encoded character is
    represented in a 32-bit quantity within a code
    space of 0..10FFFF
- Examples of variable-width encoding forms:
  - UTF-8: used only with Unicode/10646; a mix of
    one to four 8-bit code units in Unicode and one
    to six code units in 10646
  - UTF-16: used only with Unicode/10646; a mix of
    one to two 16-bit code units
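The difference between fixed and variable width can be observed by counting code units per character with Python's built-in codecs. The helper name `code_units_per_char` and the four sample characters are assumptions made for this sketch.

```python
def code_units_per_char(text, encoding, unit_bytes):
    """Return how many code units each character of text occupies."""
    return [len(ch.encode(encoding)) // unit_bytes for ch in text]


# U+0041, U+00E4, U+20AC and U+1D11E (a character outside the BMP).
sample = "A\u00E4\u20AC\U0001D11E"

print(code_units_per_char(sample, "utf-8", 1))      # [1, 2, 3, 4]
print(code_units_per_char(sample, "utf-16-le", 2))  # [1, 1, 1, 2]
print(code_units_per_char(sample, "utf-32-le", 4))  # [1, 1, 1, 1]


def is_fixed_width(counts):
    """Fixed width on this sample: every character uses the same count."""
    return len(set(counts)) == 1


print(is_fixed_width(code_units_per_char(sample, "utf-32-le", 4)))  # True
print(is_fixed_width(code_units_per_char(sample, "utf-8", 1)))      # False
```

The `-le` codec variants are used only so that no byte order mark is added to the output.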
31Character Encoding Scheme (CES)
Given a CCS and a CEF decision, then one can
construct a character encoding scheme, a mapping
of code units into serialized byte sequences. The
CES provides a set of rules for mapping the code
units into a stream of bytes (and back
again). Most fixed-width byte-oriented encoding
forms have a trivial mapping into a CES: each
7-bit or 8-bit code unit maps to a byte of the
same value. A scheme based on 8 bits can
represent only 256 characters. However, more
complex schemes are possible. A 16-bit mapping
allows more than 65,000 characters to be
represented. We need more complex schemes to
represent multiple languages in one character set.
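When a code unit is wider than a byte, the encoding scheme must also fix the byte order. A minimal Python sketch, assuming the two 16-bit code units for "Aä" and using the standard `struct` module to serialize them both ways:

```python
import struct

code_units = [0x0041, 0x00E4]  # 16-bit code units for "Aä"

# Serialize the same code units with the high byte first, then low first.
big_endian = struct.pack(">2H", *code_units)
little_endian = struct.pack("<2H", *code_units)

print(big_endian.hex())     # 004100e4
print(little_endian.hex())  # 4100e400

# Python's UTF-16 codecs implement the same two schemes.
print(big_endian == "A\u00E4".encode("utf-16-be"))     # True
print(little_endian == "A\u00E4".encode("utf-16-le"))  # True
```

The code units are identical in both cases; only the serialized byte stream differs, which is why a receiver has to know which scheme was used.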
32Character Set Size
How big a repertoire do you need? For the English
alphabet, with some digits and little more, maybe
around 60 characters. The Western European
Teletex standard comes with about 330 characters
for its many languages. Korean has almost 12,000
syllables, and some comprehensive Chinese
dictionaries list far more than 50,000 characters
in their script. There are also hundreds of other
characters in common use, such as math and
currency symbols.
33Why Character Encoding Schemes?
Inside a computer program or data file, text is
stored as a sequence of numbers, just like
everything else. These sequences are integers of
various sizes, values, and interpretations. Now
that we know what a character is, what number is
assigned to each one? A simple character such as
the letter "a" may have different integer values
in different programs or data files. In some
instances, there may not even be a number for a
certain character. The integers used for
characters have different sizes, or numbers of
bits. If the character is really an "ä", an
"a" with dots above it, then it might be stored
as two characters with two integer values: one
for the "a" and one for the dots.
[Figure: A is stored as 1 integer; Ä may be stored as 2 integers]
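In Unicode terms, the one-integer and two-integer forms of "ä" are the composed and decomposed representations. The sketch below uses Python's standard `unicodedata` module; the variable names are chosen for the example.

```python
import unicodedata

composed = "\u00E4"  # "ä" stored as one integer
# NFD splits it into "a" plus a combining diaeresis.
decomposed = unicodedata.normalize("NFD", composed)

print([hex(ord(c)) for c in composed])    # ['0xe4']
print([hex(ord(c)) for c in decomposed])  # ['0x61', '0x308']

# The two strings display alike but do not compare equal...
print(composed == decomposed)             # False
# ...until both are brought to the same normalization form.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

This is one reason string comparison and searching in internationalized software usually normalize text first.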
34Legacy Coded Character Sets
Historically, computers were pretty slow and had
fairly little memory. Some of our character sets
date back to that punch-card age and are designed
with these cards in mind. In fact, most of the
character sets that we have to this day are based
on those 1960s design decisions! In the early
days of computers, every computer maker invented
their own machine and memory layout. At first,
this wasn't a problem, because there was no
Internet where everything needed to fit together
-- every vendor just did what fit their
customers. As a result, there was a great variety
of bits per byte and bits per machine word (byte
groups), and different computer architectures
came with different character sets and encodings.
Characters were stored with anywhere from 5 to 9
bits each.
35Legacy CCS: ASCII, EBCDIC, BAUDOT
The two character set dinosaurs that are still
roaming the circuits of the networks are ASCII
and EBCDIC, both from the 1960s. Where there is
still a Telex (TTY) terminal, there is also the
much older Baudot-code. Baudot was designed for
5-bit units, ASCII for 7 bits, and EBCDIC for 8
bits. Another important legacy from those days is
the fact that some of the Internet e-mail system
is still only prepared to handle 7-bit bytes.
Fortunately, 7-bit e-mail gateways are a dying
species. Every modern computer architecture uses
bytes and machine words whose widths are at least
8 bits and are powers of 2 (8, 16, 32, 64, and so on).
36Encodings and Byte Streams
Inside a program, it is often best to deal
directly with fixed-length units according to the
character set so that each unit contains a single
character. When you do that, then following text
forward or backward is easy -- you just always go
to the next or previous unit. When you write text
into a file or send it over a network, then you
almost always read and write a number of bytes,
and if your units are bigger than bytes, then you
need to transform them in a defined and
reproducible way to make them fit. As we have
said, this is called a character encoding scheme:
the way you get characters into byte streams,
and, more importantly, how you interpret byte
streams to get characters.
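A short Python sketch of both directions, and of what happens when a byte stream is interpreted with the wrong scheme. The sample word is arbitrary.

```python
text = "K\u00E4se"             # "Käse", four characters
stream = text.encode("utf-8")  # characters -> byte stream
print(len(text), len(stream))  # 4 5

# Interpreting the stream with the scheme it was written in
# recovers the characters.
print(stream.decode("utf-8"))    # Käse

# Interpreting the same bytes as ISO-8859-1 yields different characters.
print(stream.decode("latin-1"))  # KÃ¤se
```

Nothing in the bytes themselves says which interpretation is right, so the encoding has to be communicated or agreed on separately.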
37Multiple and Variable Bytes
As we have said, the same character value can be
encoded with multiple bytes, even with different
bytes in different parts of the same byte stream.
When the character set units fit into single
bytes, the encoding is trivial and
indistinguishable from the character set itself.
For character sets with units that are larger
than bytes, there are often several encodings to
fit different needs, and one single encoding
might carry characters from more than one
character set to make them even more versatile.
ASCII is a character set using 7-bit units,
with a trivial encoding designed for 7-bit bytes.
It is the most important character set out there,
despite its limitation to very few characters,
because its design is the foundation for most
modern character sets.
38ASCII has only 95 Real Code Points
Only 95 ASCII code points are used for "real"
text-characters (or 94, not counting the space
character). These graphic characters are mostly
Latin upper- and lower-case letters, digits, and
punctuation, plus some special braces, an
underline, and some accent marks. It is a good
base for the American market, but not for
European languages with their accented letters,
and does not cover any other scripts. A character
code is a mapping, often presented in tabular
form, which defines a one-to-one correspondence
between characters in a character repertoire and
a set of nonnegative integers. That is, it
assigns a unique numerical code, a code point, to
each character in the repertoire.
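The count of 95 can be checked in Python by testing every ASCII code point with the built-in `str.isprintable`, which treats the space as printable and the control characters as not. A small sketch, with names chosen for the example:

```python
# Collect every ASCII code point (0..127) that is a graphic character.
graphic = [chr(code) for code in range(128) if chr(code).isprintable()]

print(len(graphic))         # 95
print(repr(graphic[0]))     # ' '
print("".join(graphic[1:])) # the 94 visible characters

# An accented letter has no ASCII code point at all.
try:
    "\u00E4".encode("ascii")
    has_a_umlaut = True
except UnicodeEncodeError:
    has_a_umlaut = False
print(has_a_umlaut)         # False
```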
40Character Sets for Many Characters
The most common encodings (character encoding
schemes) use a single byte per character, and
they are often called single-byte character sets
(SBCS). They are all limited to 256 characters.
Because of this, none of them can even cover all
of the accented letters for the Western European
languages. Consequently, many different such
encodings were created over time to fulfill the
needs of different user communities. The most
widely used SBCS encoding today, after ASCII, is
ISO-8859-1. It is an 8-bit superset of ASCII and
provides most of the characters necessary for
Western Europe. A modernized version,
ISO-8859-15, also has the euro symbol and some
more French and Finnish letters.
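The relationship between ASCII, ISO-8859-1 and ISO-8859-15 can be probed with Python's built-in codecs. This is a sketch using the euro sign as the test case; the variable names are assumptions made for the example.

```python
euro = "\u20AC"  # the euro sign

# ISO-8859-15 has a code point for the euro sign; ISO-8859-1 does not.
print(euro.encode("iso-8859-15"))  # b'\xa4'
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
print(latin1_has_euro)             # False

# The same byte therefore means different characters in the two sets.
print(b"\xa4".decode("iso-8859-1"))   # ¤
print(b"\xa4".decode("iso-8859-15"))  # €

# Both are supersets of ASCII: ASCII bytes are unchanged.
print("A".encode("iso-8859-1") == "A".encode("ascii"))  # True
```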
41Double and Multiple-Byte Sets
Double-byte character sets (DBCS) were developed
to provide enough space for the thousands of
ideographic characters in East Asian writing
systems. Here, the encoding is still byte-based,
but each two bytes together represent a single
character. Even in East Asia, text contains
letters from small alphabets like Latin or
Katakana. These are represented more efficiently
with single bytes. Multi-byte character sets
(MBCS) provide for this by using a variable
number of bytes per character, which
distinguishes them from the DBCS encodings. MBCSs
are often compatible with ASCII; that is, the
Latin letters are represented in such encodings
with the same bytes that ASCII uses. Some less
often used characters may be encoded using three
or even four bytes. Examples of commonly used
MBCS encodings are Shift-JIS and EUC-JP (for
Japanese), with up to two and three bytes per
character, respectively.
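The variable byte counts can be seen with Python's `shift_jis` and `euc_jp` codecs. The three sample characters and the helper `byte_lengths` are choices made for this sketch; it shows one- and two-byte cases only.

```python
def byte_lengths(ch):
    """Return (Shift-JIS length, EUC-JP length) in bytes for one character."""
    return len(ch.encode("shift_jis")), len(ch.encode("euc_jp"))


samples = {
    "Latin A": "A",
    "hiragana a": "\u3042",
    "half-width katakana a": "\uFF71",
}

for label, ch in samples.items():
    sjis_len, euc_len = byte_lengths(ch)
    print(f"{label}: Shift-JIS {sjis_len} byte(s), EUC-JP {euc_len} byte(s)")

# The Latin letter keeps its ASCII byte in both encodings.
print("A".encode("shift_jis") == b"A", "A".encode("euc_jp") == b"A")  # True True
```

The half-width katakana character takes one byte in Shift-JIS but two in EUC-JP, so the same text can have different lengths depending on the encoding chosen.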
42ISO/IEC 10646
The ISO 10646 Universal Character Set (UCS,
Unicode) is a coded character set. Unicode is a
standard, by the Unicode Consortium, which
defines a character repertoire and character code
intended to be fully compatible with ISO 10646,
and an encoding for it. In principle, ISO 10646
is more general in nature and Unicode corresponds
to the "Basic Multilingual Plane (BMP)" of ISO
10646; however, other "planes" haven't even been
defined yet. In practice, people usually talk
about Unicode rather than ISO 10646, partly
because we prefer names to numbers.
43Why Unicode?
Hundreds of encodings have been developed, each
for small groups of languages and special
purposes. As a result, the interpretation of
text, input, sorting, display, and storage
depends on the knowledge of all the different
types of character sets and their encodings.
Programs are written to either handle one single
encoding at a time and switch between them, or to
convert between external and internal encodings.
There is no single, authoritative source of
precise definitions of many of the encodings and
their names. Transferring of text from one
machine to another one often causes some loss of
information. Also, if a program has the code and
the data to perform conversion between a
significant subset of traditional encodings, then
it carries several megabytes of data around.
44Unicode provides a single character set that
covers the languages of the world, and a small
number of machine-friendly encoding forms and
schemes to fit the needs of existing applications
and protocols. It is designed for best
interoperability with both ASCII and ISO-8859-1,
the most widely used character sets, to make it
easier for Unicode to be used in applications and
protocols. Unicode is in use today, and it is
the preferred character set for the Internet,
especially for HTML and XML. It is slowly being
adopted for use in e-mail, too. Its most
attractive property is that it covers all the
characters of the world (with exceptions, which
will be added in the future). Unicode makes it
possible to access and manipulate characters by
unique numbers -- their Unicode code points --
and use older encodings only for input and
output, if at all.
45Unicode: The Last Character Set?
The Unicode standard specifies a character set
and several encodings. As of early 2000, it
contains almost 50,000 characters. It is an open
character set, which means that it keeps growing
and adding less frequently used characters. The
standard assigns numbers from 0 to 0x10FFFF,
which is more than a million possible numbers for
characters. About 5% of this space is used.
Another 5% is in preparation, about 13% is
reserved for private use (anyone can place any
character in there), and about 2% is reserved and
not to be used for characters. The remaining 75%
is open for future use but not by any means
expected to be filled up. In other words, there
is finally a character set with plenty of space!
46Unicode: UTF Encodings
For single characters, 32-bit integer variables
are most appropriate for the value range of
Unicode. For strings, however, storing 32 bits
for each character takes up too much space,
especially considering that the highest value,
0x10FFFF, takes up only 21 bits. 11 bits are
always unused in a 32-bit word storing a Unicode
code point. Therefore, you will find that
software generally uses 16-bit or 8-bit units as
a compromise, with a variable number of code
units per Unicode code point. It is a trade-off
between ease of programming and storage space. As
a result, there are three common ways to store
Unicode strings:
- UTF-32, with 32-bit code units, each storing a
  single code point
- UTF-16, with one or two 16-bit code units for
  each code point; it is extremely well designed
  and is the default CES for Unicode
- UTF-8, with one to four 8-bit code units (bytes)
  for each code point
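The storage trade-off can be measured on a short sample string with Python's built-in codecs. The sample text and variable names are arbitrary choices for this sketch; the `-le` codec variants are used so that no byte order mark is counted.

```python
# 14 code points: ASCII text, two BMP characters, one supplementary one.
text = "Unicode: \u00E4 \u20AC \U0001D11E"

utf32_bytes = len(text.encode("utf-32-le"))
utf16_bytes = len(text.encode("utf-16-le"))
utf8_bytes = len(text.encode("utf-8"))

print(len(text))    # 14 code points
print(utf32_bytes)  # 56 bytes: 14 code units of 4 bytes
print(utf16_bytes)  # 30 bytes: 15 code units of 2 bytes
print(utf8_bytes)   # 20 bytes: 20 code units of 1 byte

# The highest code point needs only 21 of the 32 bits.
print((0x10FFFF).bit_length())  # 21
```

For mostly-ASCII text UTF-8 is the most compact of the three, while UTF-32 is the only one in which every code point occupies exactly one code unit.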
47Unicode Encoding is Rich in Information
The Unicode Standard specifies a numeric value
and a name for each of its characters. In this
respect, it is similar to other character
encoding standards from ASCII onward. In addition
to character codes and names, other information
is crucial to ensure legible text: a character's
case, directionality, and alphabetic properties
must be well defined. The Unicode Standard
defines this and other semantic information.
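Some of this per-character information is exposed by Python's standard `unicodedata` module. The sketch below prints the name, general category and directionality class of a few characters; the selection of characters is arbitrary.

```python
import unicodedata

# Latin capital and small letter, a digit, a Hebrew and an Arabic letter.
for ch in ["A", "a", "5", "\u05D0", "\u062D"]:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # e.g. Lu = uppercase letter
        unicodedata.bidirectional(ch),  # e.g. L = left-to-right
    )

# Case is a character property as well.
print("A".lower(), "a".upper())  # a A
```

The directionality classes are what a renderer uses to lay out right-to-left scripts correctly when they are mixed with left-to-right text.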
48There are Still MANY Issues: UniHan
Han (from the Han dynasty, 206 B.C.E. to 220
C.E.): the set of glyphs common to Chinese (where
they are called "hanzi"), Japanese (where they
are called "kanji"), and Korean (where they are
called "hanja"). Modern Korean, Chinese and
Japanese fonts may represent a given Han
character as somewhat different glyphs. However,
in the formulation of Unicode, these differences
were folded, in order to conserve the number of
code points necessary for all of CJK. This
unification is referred to as "Han Unification",
with the resulting character repertoire sometimes
referred to as "Unihan". It is a hot political
issue and has caused problems because of the
large number of ancient characters.
[Figure: examples of characters that were "unified"]