Open standards in use in localisation - an engineering approach

About This Presentation

Slides: 58

Provided by: tektransla

Transcript and Presenter's Notes

1
Open standards in use in localisation - an
engineering approach
Andrés Vega, DCU, Dublin, Ireland 12th June
2009
2
Agenda

Introduction: Why Standards?
Part 1: Unicode and OpenType Fonts
Part 2: XML, CMS and DITA
Part 3: TMX, XLIFF, TBX and SRX
Final thoughts and Q&A
About the author and Tek

3
Why Standards?

Allow faster technology development
Assembling standard components
Concentrating effort on specialisation
Increase competence, focused on features (not
compatibility)
Facilitate inter-operability
Open standards allow information to be shared
(Not locked on proprietary standards)
Complementary tools may be developed
Choose tool/resource for each job
Guarantee future compatibility
Provide conformance validation mechanisms
Standard verification serves as QA procedure

4
Part 1 Encodings and Unicode

Terminology
Pre-Unicode Encodings (ASCII, ANSI, Multibyte)
Unicode
Unicode Workflow example
Unicode Transition issues (FrameMaker)
Unicode Transition issues (QuarkXpress)
OpenType fonts

5
Terminology

Coded Character Set
(Set of characters associated with codes)
Defined in RFC 2978

Code point (number associated with a
character)

Encoding / Charset
(Coded character set with a character
encoding scheme)

Character mapping
(Relation between code points of two
different encodings)

Alias (Alternate name for an encoding)
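
The terms above can be tried out in a few lines of Python. This is only an illustrative sketch: the sample character and the byte 0x80 are choices made here, not taken from the slides.

```python
import codecs

ch = "\u00e9"  # é

# Code point: the number the coded character set assigns to the character.
code_point = ord(ch)

# Encoding (charset): how that code point is serialised to bytes.
latin1_bytes = ch.encode("iso-8859-1")   # one byte
utf8_bytes = ch.encode("utf-8")          # two bytes

# Alias: several names resolve to the same encoding.
canonical = {codecs.lookup(name).name for name in ("latin1", "L1", "iso-8859-1")}

# Character mapping: the same byte corresponds to different code points
# in two different encodings.
byte = b"\x80"
in_cp1252 = byte.decode("cp1252")        # EURO SIGN
in_latin1 = byte.decode("iso-8859-1")    # C1 control character

print(f"U+{code_point:04X}", latin1_bytes.hex(), utf8_bytes.hex(), canonical)
print(f"U+{ord(in_cp1252):04X}", f"U+{ord(in_latin1):04X}")
```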

6
Chronology

Proprietary encodings (manufacturer dependent)
ASCII (ANSI X3.4 1968, 7-bit encoding)
ASCII national variants (ISO 646)
MS-DOS code pages (1980, 8-bit encoding)
Double-byte and multibyte encodings
ISO-8859-n many 8-bit encodings defined by ISO
Windows CPs
Unicode

7
ASCII -> ANSI

ASCII (American Standard Code for Information
Interchange): 128 characters, US English
only
Positions 0 - 31 and 127 are reserved for control
characters. They have standardized names and
descriptions, but usage varies.
American English characters range from 32
(space) to 126 (tilde, ~).
There are several national variants of ASCII
(still only 128 characters). In such variants, some
special characters have been replaced by national
letters (and other symbols).
Positions 128 - 255 are not used in ASCII. They
belong to ANSI
ANSI code pages extend the ASCII character set to
support specific languages/scripts. There
are five main groups:
Windows CPs (and also old MS-DOS CPs)
Mac CPs
ISO-8859-n CPs (n=1, Latin-1, to n=16)
Other ASCII-compatible CPs (KOI-8, ASMO, ...)
IBM EBCDIC
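
As a rough illustration of why positions 128 - 255 are code page dependent, the sketch below decodes the same two bytes with several of Python's built-in codecs. The byte 0xE9 and the list of code pages are picked for the example only.

```python
raw = bytes([0x41, 0xE9])   # 'A' plus one byte from the 128-255 range

# The ASCII half is stable; the upper half depends on the code page.
for codepage in ("cp1252", "iso-8859-1", "mac_roman", "cp437", "koi8_r"):
    print(f"{codepage:12} {raw.decode(codepage)!r}")

# Strict ASCII has no character at position 0xE9.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# EBCDIC is a separate family: even 'A' has a different byte value.
ebcdic_a = "A".encode("cp500")
print(ascii_ok, ebcdic_a.hex())
```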

8
ASCII National Variants ISO 646
9
Visual comparison Western codepages

Code pages compared: ASCII (ANSI_X3.4-1968); Windows Western
(CP1252); ISO Latin1 (8859-1) UNIX;
EBCDIC (Western 500V1); Mac Roman
New line
Unix: LF (0A)
Mac: CR (0D)
Win/DOS: CR LF (0D 0A)

10
Examples with codepoints

11
Then came Unicode

Challenges
Too Many Character sets
Three great families (ANSI, DBCS, BiDi) three
application types
Multilingual data (storage, display, processing)
Cross-platform and character set
inter-conversion issues
Information loss: WROCŁAW -> WROC?AW
Fallback: WROCLAW (CE text within ASCII)
Cross-platform: WROCAW (Mac)
Misreading: WROCxW (Trad Ch)
What Unicode is
Universal character encoding standard by the
Unicode Consortium
21-bit character set with 3 main encoding forms
(UTF-32, UTF-16, UTF-8)
Not just the character set
Character properties (Name, Category, Casing,
Decomposition, ...)
Annexes, Technical Reports (Comparison,
Sorting, Hyphenation, ...)
What Unicode is not
Glyph repertoire glyphs provided are examples,
not canonical!
Unicode alone does not provide language support!
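
The information-loss, fallback and misreading cases from the Challenges list, and the point that Unicode also defines character properties, can be reproduced with a short sketch. The fallback table is a hypothetical one-letter mapping written for this example, and the cp1250/cp1252 pair is just one possible mismatch, not necessarily the one shown on the slide.

```python
import unicodedata

city = "WROC\u0141AW"  # WROCŁAW

# Information loss: the target charset has no slot for the letter.
lossy = city.encode("ascii", errors="replace").decode("ascii")

# Fallback: an explicit, hand-maintained table (hypothetical). The letter
# has no Unicode decomposition, so normalisation alone would not help.
FALLBACK = str.maketrans({"\u0141": "L", "\u0142": "l"})
fallback = city.translate(FALLBACK)

# Misreading: bytes written in one code page and read in another.
misread = city.encode("cp1250").decode("cp1252")

# Character properties published with the standard.
name = unicodedata.name("\u0141")
category = unicodedata.category("\u0141")

print(lossy, fallback, misread, name, category)
```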

12
Unicode (Benefits and Issues)

Unicode benefits
One vendor neutral encoding standard for all
languages
Stable, but it keeps evolving
Multilingual rendering/storage/transfer (No
conversion - No corruption)
Unified content processes (Globalized, Web
enabled)
Internationalisation
Easy conversion from/to/between legacy codepages
Issues or drawbacks with Unicode
Size (ANSI 1 byte, DBCS 1-2 bytes, UTF-8 1-4 bytes,
UTF-16 2 or 4 bytes)
UniHan related (Font dependence, Gaiji and
variants)
Inconsistencies on implementation choices across
scripts
Several ways to generate pre-composed characters
Implementation issues
Script Enabling requires Input, Display,
Storage, Retrieval, Output
Bidirectional support, Complex Scripts issues
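
Two of the issues above, storage size and the existence of more than one way to represent an accented character, can be made concrete with a small sketch. The sample characters are arbitrary choices for the example.

```python
import unicodedata

samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]   # A, é, 中, 😀

# Bytes needed per character in each encoding form.
sizes = {
    ch: tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))
    for ch in samples
}
for ch, (u8, u16, u32) in sizes.items():
    print(f"U+{ord(ch):04X}  UTF-8: {u8}  UTF-16: {u16}  UTF-32: {u32}")

# The same visible character, stored two ways.
precomposed = "\u00e9"     # single code point
decomposed = "e\u0301"     # base letter + COMBINING ACUTE ACCENT
same_before = precomposed == decomposed
same_after = unicodedata.normalize("NFC", decomposed) == precomposed
print(same_before, same_after)
```

Comparing or searching text without normalising first is one way this issue surfaces in TM matching and QA checks.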

13
Unicode encodings

Unicode encoding forms examples
UTF-16 Little Endian (least significant byte
first)
ÿþT h i s i s U n i c o d e t e x t
FFFE540068006900730020006900730020001C2055006E00690063006F0064006500200074006500780074001D20
UTF-16 Big Endian (most significant byte first)
þÿ T h i s i s U n i c o d e t e x t
FEFF00540068006900730020006900730020201C0055006E00690063006F0064006500200074006500780074201D
UTF-8 (byte-based encoding, uses 1, 2, 3, or 4
bytes)
ï»¿This is â€œUnicode textâ€
EFBBBF5468697320697320E2809C556E69636F64652074657874E2809D
BOM (Byte Order Mark): character U+FEFF
UTF-16LE: FF FE (required)
UTF-16BE: FE FF (required)
UTF-8: EF BB BF (can be omitted)
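
The hex dumps on this slide can be regenerated with Python's standard codecs. The sketch assumes the sample sentence used curly quotes (U+201C and U+201D), which is what the byte values 1C20/201C and 1D20/201D in the dumps indicate.

```python
import codecs

text = "This is \u201cUnicode text\u201d"

# The endian-specific codecs do not write a BOM, so it is prepended explicitly.
utf16_le = (codecs.BOM_UTF16_LE + text.encode("utf-16-le")).hex().upper()
utf16_be = (codecs.BOM_UTF16_BE + text.encode("utf-16-be")).hex().upper()

# "utf-8-sig" writes the optional UTF-8 signature EF BB BF.
utf8 = text.encode("utf-8-sig").hex().upper()

print(utf16_le)
print(utf16_be)
print(utf8)
```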

14
Unicode streamlines workflows

Pre-Unicode Workflow (FrameMaker 7)
Character corruption risks in all orange (middle
3 groups) steps
Final document presents issues in TOC and index
generation and in searches
Unicode Workflow (FrameMaker 8)

[Workflow diagram. Stage labels: Files to localize, File Preparation, Translation Review, DTP and Merge, Back Conversion.
Pre-Unicode (FrameMaker 7): English FrameMaker with design fonts; one RTF per code page (Western, CE, Cyrillic, Turkish, Greek, Baltic); the same RTFs with code page fonts; one FM file per font (Design, CE, Cyrillic, Turkish, Greek, Baltic); multilingual target document with several ANSI fonts.
Unicode (FrameMaker 8): English FrameMaker with design fonts; UTF-8 XML; UTF-16 TTX and fonts; UTF-8 FM with original design fonts; multilingual document with design fonts.]

15
Example 1 ANSI codepages RTF issues

Trados saves .doc files as RTF files before
processing
Word .doc segmented file: Apie šį vadovą
(correct Lithuanian)
.RTF saved on English PC: Apie ðá vadovà
Header: \rtf1\adeflang1025\ansi\ansicpg1252\uc1
\adeff0\deff0\stshfdbch13\stshfloch0\stshfhich0\stshfbi0\deflang1033\deflangfe1042
\fonttbl\f0\froman\fcharset0\fprq2\\panose
02020603050405020304Times New Roman
\f1\fswiss\fcharset0\fprq2\\panose
020b0604020202020204Arial
(...)
100\>\rtlch\fcs1 \af0 \ltrch\fcs0
\cf1\lang1063\langfe1033\loch\af1015
\hich\af1015\dbch\af0\langnp1063\insrsid13789330\charrsid337550
Apie \'f0\'e1 vadov\'e0
.RTF saved on Baltic PC: Apie šį vadovą
Header: \rtf1\ansi\ansicpg1252\uc1
\deff0\deflang1033\deflangfe1033\fonttbl
\f0\froman\fcharset186\fprq2\\panose
02020603050405020304Times New Roman
\f1\fswiss\fcharset186\fprq2\\panose
020b0604020202020204Arial
(...)
100\>\cf1\lang1063\langfe1033\loch\af2462\hich
\af2462\dbch\af0\langnp1063
Apie \'f0\'e1 vadov\'e0
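
Both RTF files carry the same escaped bytes (\'f0, \'e1, \'e0); only the font charset declared in the header differs (\fcharset0 versus \fcharset186). A minimal sketch of the effect, assuming the usual association of charset 0 with Windows Western (cp1252) and charset 186 with Windows Baltic (cp1257):

```python
# Hand-written lookup for the two charsets that appear in the headers above.
FCHARSET_TO_CODEPAGE = {0: "cp1252", 186: "cp1257"}

# Bytes behind "Apie \'f0\'e1 vadov\'e0" in the RTF body.
raw = b"Apie \xf0\xe1 vadov\xe0"

as_western = raw.decode(FCHARSET_TO_CODEPAGE[0])
as_baltic = raw.decode(FCHARSET_TO_CODEPAGE[186])

print(as_western)   # what the file saved on the English PC displays
print(as_baltic)    # what the file saved on the Baltic PC displays
```

The data is identical in both files; the corruption comes purely from which code page is used to interpret it.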

16
Unicode simplifies processes
Hardware VPN for Mac OSX Japanese and Mac OS9
Chinese: specific setup for Chinese and Japanese, removed
after the Quark migration to Unicode.
[Network diagram labels: Client, TEK, Internet, Router, STEP Server, Software VPN, VPN client, PC Setup (Quark 7 OpenType), Hardware VPN, STEPXpress (Mac OS9 Chinese), STEPXpress (Mac OSX Japanese)]
17
Example 2 Quark western PC mapping issue

Quark 6: Polish text imported on a PC. The file
displays correctly after a CE font is applied (lower half)

18
Example 2 Quark western PC mapping issue

But when opened on a Mac:
Extended characters are read as if they were
Windows Western.
Some can be mapped to Mac Roman, but they
do not correspond to the same CE character in the Mac
Latin II encoding.
Other characters cannot be mapped and are replaced
by fallbacks.

19
Unicode transition issues

Transition issues
Mixed content: legacy and UTF-8 (FrameMaker)
[Diagram labels: FM7 to FM8 update, English seen OK; import of old vars template, corrupted variables; filter version corrupts ANSI.
File states shown:
UTF-8 Content, ANSI Variables, ANSI Template
ANSI Content, ANSI Variables, ANSI Template
UTF-8 Content, ANSI Variables, ANSI Template
UTF-8 Content, Corrupt Vars, ANSI Template
TTX]
Localisation tools, filters, etc. not fully
adapted or tested
Example: style names containing extended
characters
New filter for FrameMaker 8: English names
are OK (UTF-8 = ASCII)
German-designed file: filter does
not accept UTF-8 style names
Backwards conversions: Unicode version saved as
non-Unicode version
20
Example 3 Trados 7 TM imported in Trados 6

Trados 7 export is UTF-8, but Trados 6 does not
recognize it and imports it as ANSI.
Issue seen as UTF-8

Issue seen as ANSI
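
The effect of importing UTF-8 data as ANSI can be simulated in a few lines. The sample word and the choice of cp1252 as the "ANSI" code page are assumptions for illustration; the actual Trados behaviour is only described, not reproduced, here.

```python
original = "Traducci\u00f3n"              # Traducción

exported = original.encode("utf-8")       # what the newer tool writes
misread = exported.decode("cp1252")       # what the older tool displays

# As long as nothing was saved or edited in between, the damage is reversible.
repaired = misread.encode("cp1252").decode("utf-8")

print(misread)
print(repaired)
```

The round trip only works when every UTF-8 byte happens to be defined in the ANSI code page; cp1252 leaves a few byte values unassigned, so some characters fail to decode at all.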

21
OpenType fonts

Challenges
Two font families (TrueType and PostScript),
two font technologies
Inter-platform issues
Benefits of OpenType
Support large character sets (Unicode,
multiscript)
Glyph variants supported: solves Unicode UniHan
ambiguities
Supports advanced typography
Font embedding control
Features
Can contain either TrueType or PostScript (CFF) outline
data
Glyph substitution
Glyph positioning
Script and language information

22
Part 2 XML and CMS

Markup languages
XML
CMS
DITA

23
Markup languages SGML, HTML, XML

Markup text
Plain text tags
Tags define the structure, layout and/or
formatting of the text
Markup languages timeline
GML: IBM, 1978
SGML: ISO standard, 1986 (meta-language to create
markup languages)
HTML: HyperText Markup Language (hypertext
links)
- Derived from SGML, 1980-90
- HTML 2.0: first proper HTML specification
(1995)
XML: eXtensible Markup Language (1998)
Other markup languages
XHTML, DHTML (HTML + CSS + scripting + DOM), ...
XML-based standards: DITA, S1000D, TMX, TBX,
XLIFF, ...
Other: RTF, MIF, DocBook, TeX, ...

24
HTML vs XML Visual comparison of markup

HTML
Declaration: does not exist
Doctype: HTML or none
Elements
The HTML element can be omitted
Defined by pairs of start and end tags. Some tags
may have no closing partner (<p>, <hr>)
Names are case insensitive
Tag pairs can be interwoven
<b><I>bold and italic</B></i>
Attributes
Names and sometimes values are already defined by the
standard
Quotes around values are optional

XML
Declaration: recommended (required in XML 1.1)
Doctype: may link to a DTD or Schema
Elements
Only one root element, and it is required
All tags must be closed (or self-closed)
<LineBreak/>
- Element names are case sensitive
- Tags have to be correctly nested
<b><i>bold and italic</i></b>
Attributes
Any names and values can be defined
- All attribute values must be quoted, single or double
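
The XML rules in this comparison are exactly what a parser's well-formedness check enforces. A minimal sketch using Python's standard library parser; the sample snippets are made up for the example:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

samples = {
    "correctly nested": "<p><b><i>bold and italic</i></b></p>",
    "interwoven tags": "<p><b><i>bold and italic</b></i></p>",
    "case mismatch": "<p><b>bold</B></p>",
    "unclosed tag": "<p>line<br></p>",
    "self-closed tag": "<p>line<LineBreak/></p>",
    "unquoted attribute": "<p class=intro>text</p>",
    "two root elements": "<p>one</p><p>two</p>",
}

results = {label: is_well_formed(markup) for label, markup in samples.items()}
for label, ok in results.items():
    print(f"{label:20} {'well-formed' if ok else 'rejected'}")
```

Markup in the style of the HTML column (interwoven tags, unclosed elements, unquoted values) is rejected by an XML parser, which is what makes XML checkable before and after translation.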

25
HTML example: translator's view
Edit view within Trados
TagEditor
26
XML example: translator's view
Edit view within Trados
TagEditor
27
XML

eXtensible Markup Language (Meta-language for
markup languages)
Used to define, share and validate information
(data and structure)
An XML document contains
XML declaration: <?xml version='1.1'
encoding='UTF-8' standalone='yes'?>
Document Type declaration(s): <!DOCTYPE root
SYSTEM "rootDTD.dtd">
Elements: <element attribute="value">Content</element>
or <element/>
Other: comments, entities/NCRs, processing instructions,
conditional sections
Specific syntax (well-formed XML)
Only one root element
Tags in nested open/close pairs: <tag> </tag>
Element names obey certain conventions
Elements may contain attributes
DTD (Valid XML)
Defines rules on structure, valid tags and
attributes and valid data
Guarantees reliable data exchange between
different systems
Can be included in each XML, but is normally
external

28
XML (General benefits)

Simple (XML is plain text) but can embed any
content type
Platform independent, Unicode encoded
Content is easily validated cross-platform data
transfer is safer
Structured (defines structural relationships
within data)
Open and Extensible well supported standard
Metadata and version control capable
Format independent
Powerful data transformation tools (XSL)
Multiple outputs

29
XML (Localisation benefits and issues)

Localisation benefits
Structured: content can be detached and merged (update
handling)
XML support easily implemented on Localisation
processes/tools
Easy validation versus DTD
Extensible XML based localisation standards
XLIFF, TMX, TBX,...
Metadata (source/target version control,
updates, element status)
Format independent
Single-sourcing (localized once, published into
many formats)
Source content and formatting changes are not
inter-dependant
Content localisation and proofreading before
formatting (DTP)
Issues
Transition needs to be well planned and
performed
Segmentation issues (DTD needs to be
multilingual aware)
Source: For more information see page <xref
ref=pagexxx>
Japanese: ????<xref ref=pagexxx>??????????

30
Content Management Systems

What are Content Management Systems?
Sets of tools configured around a data
repository (database)
Designed to manage information in small
meaningful bits
Product based
Topic based
Information is isolated from format
Store localized content layers (as other
alternative content layers)
Provide tools for
Consistent content authoring (Style and
Terminology)
Version control
Change tracking
Workflow capabilities

31
CMS (Benefits)

General benefits
Granularity (no redundancy)
Reuse (content reuse and multi output)
Improved Quality and Consistency
Single-source and multi-publishing
Easy rebranding/reformatting
Metadata info and version control
Workflow and Automation
Localisation benefits
Workflow status control features
Localisation of updates via content deltas
improved time-to-market
Localisation independent from output format
(better matching)

32
CMS (Issues)

Issues
Authoring for reuse (topic model, single-source,
cross-reference)
Segmentation issues
LF chars (0A): no validation! Segmentation
issue
Localisation readiness
CMS must be multilingual enabled (storage, I/O,
processing)
Localisation workflow support
Strong version control and version rollback
Capability to export up-to-date paired TM
content
Integration with LQA tools
Will not increase ROI in the short run (DTP is
still needed!)

33
CMS Localisation Workflow
Client
Tek
Client Validators
Select only delta content
Translation (TTX format)
Revision (TTX format)
XML
CMS
Content Validation in Tracked-changes RTF
Prepared for Proofreading (Colour-coded RTF
format)
Insertion of Validation changes (TTX TMs)
XML
XML
Full document in XML
Preprocessing of XML
Layout Consistence Validation in PDF file
Import to FrameMaker
DTP in FrameMaker
Delivery in FrameMaker
34
DITA

DITA (Darwin Information Typing Architecture)
Topic-based XML framework for writing and
delivering information
Developed by IBM (1999-2000) to replace the
complex IBMDoc format
Later became a public OASIS standard (2005)
Fast implementation on Authoring and Content
Management
DITA model consists of
A Document Type Definition (DTD)
Specifies base DITA types, their elements and how
they can be defined
(base DITA information types are Topic and
Concept, Task, Reference).
A set of XSLT stylesheets that control the
output.
Writers use them in conjunction with an XML
processor to convert
DITA documents to more usable formats, such as
HTML or PDF.

35
DITA (Components)

DITA topics
XML elements that contain the information of each
information 'topic'. Each topic can consist of a
concept, a related task with its action steps and
a set of references to other topics.
DITA Maps
XML elements that establish hierarchical
relationships among topics.
Relationship Tables
XML elements that establish non-hierarchical
relationships among topics.
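
To make the role of a DITA map more tangible, the sketch below reads a tiny hand-written map and prints the topic hierarchy it defines. The file names and title are hypothetical, and a real map would normally carry a DOCTYPE and more attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal map: nesting of <topicref> elements defines the hierarchy.
DITAMAP = """<map>
  <title>User guide</title>
  <topicref href="intro.dita">
    <topicref href="install.dita"/>
    <topicref href="configure.dita"/>
  </topicref>
  <topicref href="reference.dita"/>
</map>"""

def outline(node, depth=0):
    """Walk nested topicref elements and return (depth, href) pairs."""
    rows = []
    for ref in node.findall("topicref"):
        rows.append((depth, ref.get("href")))
        rows.extend(outline(ref, depth + 1))
    return rows

hierarchy = outline(ET.fromstring(DITAMAP))
for depth, href in hierarchy:
    print("  " * depth + href)
```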

36
DITA (Benefits)

DITA aims for
Reuse: not only of content, but also of design
and processes.
Content reuse: being topic based, each element
has complete meaning and can be created and
maintained separately, yet it can be combined
with other topics for different outputs.
Single-sourcing: form is separated from
content.
Design reuse: allows information sharing while
making it easy to develop to cover specific
needs.
Process reuse: uses overrides to inherit all
basic and intermediate processes and still allow
for custom processes when needed.
Standardization: intended to last without major
rework.
Strongly typed: strong but generic core that can
be used as a fallback for light implementations.
Flexible through specialization: new types can be
created based on the core types, so
specializations can be defined and implemented
for specific uses.

37
DITA Example
38
Part 3 Interchange formats

TMX
XLIFF
TBX
SRX

39
TMX

What is TMX?
Translation Memory eXchange
Standard by LISA (Localization Industry
Standards Association)
Provides a standard method for TM data
description
XML-compliant (validated against its TMX DTD)
Uses other ISO standards for date, time, language,
country
Consists of
Container format specification
Translation unit elements <tu>
Optional format description elements (font
change, ...)
Subflows (footnotes, index entries)
Low-level meta-markup format for segment content
Segment element <seg>
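
A minimal sketch of the container structure described above: a hand-written TMX document with one translation unit, read back with Python's standard XML parser. The header values (tool names, languages) and the segment text are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="ExampleTM" adminlang="en-US"
          srclang="en-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Translatable content.</seg></tuv>
      <tuv xml:lang="es-ES"><seg>Contenido traducible.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def read_units(tmx_bytes):
    """Return one {language: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_bytes)
    units = []
    for tu in root.iter("tu"):
        units.append({
            tuv.get(XML_LANG): "".join(tuv.find("seg").itertext())
            for tuv in tu.findall("tuv")
        })
    return units

units = read_units(TMX.encode("utf-8"))
print(units)
```

Using `itertext()` keeps the text of any inline markup inside `<seg>`; a real importer would need to decide how to treat those inline codes.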

40
TMX (Benefits)

Transfer TM assets across tools/vendors
Prevents character corruption (Unicode)
Provides clients with control over their
translated assets
Non-proprietary and vendor neutral
Can be integrated with LQA tools
Provides Translators/Vendors with freedom of
tool choice
Specialized tools share TM assets
Tools may be outdated, assets will not
Facilitates work distribution/outsourcing

41
TMX (Issues)

Issues
Tag handling issues
TMX DTD cannot validate inline codes
TMX compliance level varies
Segmentation issues
Different segmentation rules on different CAT
tools
Sentence based (TM) vs Field based (CMS,
Database export)
Consequence: reduced translation leverage

42
TMX (Examples)
TMX Version 1.4b (exported from Trados 7)
43
TMX (Examples)
Translation unit in TMX 1.1 format from Trados
7 Translation unit in TMX 1.4b format from
Trados 7
44
XLIFF

XML Localisation Interchange File Format
Standard by OASIS
Tool-neutral, XML-based standard localisation
resource container format
To store/transfer/manipulate localizable
content, context and other info
Has built-in support for CAT tools and related
standards (TBX, TMX)
Features
Translation suggestions (TM, Glossary, MT) to
approve or edit
Metadata: translate flag, notes, context info,
version
Hierarchical data structures
Abstraction of formatting and inline codes
Structural formatting stored in the skeleton
file
Inline formatting can be dealt with in two ways
Replaced by <g> (paired) and <x/> (isolated) tags
(OpenTag style)
Encapsulated in <bpt>, <ept> (paired), <it> or <ph>
(isolated) tags

45
XLIFF (Description)

Separates localizable and non-localizable
content
Non-localizable: skeleton (separate or embedded)
Localizable: 'file' elements with header
(metadata) and body
Body can contain 'trans-unit' and 'bin-unit'
elements
Each trans-unit can have
<trans-unit id="abc123" resname="resourceID"
restype="string" translate="yes">
Unique id, resource id, resource type,
translate yes/no
<source xml:lang="en-US">Translatable
content.</source>
Translatable content: source and language
<target xml:lang="es" state="needs-review-translation">Traducción.</target>
Currently validated translation
<alt-trans match-quality="100" tool="TM">
<source>Translatable content.</source>
<target xml:lang="es">Contenido
traducible.</target> </alt-trans>
alt-trans: translation suggestion(s)
</trans-unit> (closing tag)
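
The trans-unit walked through on this slide can be wrapped in a minimal XLIFF file and read with a standard XML parser. This is a sketch only: the XLIFF 1.2 namespace and the attributes on the file element (including the file name) are assumptions added to make the fragment a complete document.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="example.properties" source-language="en-US"
        target-language="es" datatype="plaintext">
    <body>
      <trans-unit id="abc123" resname="resourceID" restype="string" translate="yes">
        <source xml:lang="en-US">Translatable content.</source>
        <target xml:lang="es" state="needs-review-translation">Traducción.</target>
        <alt-trans match-quality="100" tool="TM">
          <source>Translatable content.</source>
          <target xml:lang="es">Contenido traducible.</target>
        </alt-trans>
      </trans-unit>
    </body>
  </file>
</xliff>"""

def read_trans_units(xliff_text):
    """Collect id, source, target, state and suggestions for each trans-unit."""
    root = ET.fromstring(xliff_text)
    units = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        target = unit.find("x:target", NS)   # direct child only, not the alt-trans one
        units.append({
            "id": unit.get("id"),
            "source": unit.find("x:source", NS).text,
            "target": target.text,
            "state": target.get("state"),
            "suggestions": [
                (alt.get("match-quality"), alt.find("x:target", NS).text)
                for alt in unit.findall("x:alt-trans", NS)
            ],
        })
    return units

units = read_trans_units(XLIFF)
print(units)
```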

46
XLIFF (Benefits for translation)

Benefits For the translation process
One common format on which to translate
One (or few) translatable documents
Control on Translatable/Non-translatable content
Better information handling (context, notes,
metadata)
Better TM matching due to formatting abstraction
Concurrent tool processing visible at review
stage
Support for all localisation phases
Supports metrics info on each trans-unit

47
XLIFF (Other Benefits and Drawbacks)

Benefits For localisation tool developers
Common platform for tool developers to write to
Easy adoption of new formats (new filters to
XLIFF)
All generic XML processing benefits
Drawbacks
Conversion tools needed into XLIFF and back
Many XLIFF features are not implemented by most
tools
Segmentation is inherent to XLIFF file
generation
As opposed to tailored tools, WYSIWYG is
difficult to attain

48
XLIFF Workflow

No XLIFF Scenario
XLIFF Scenario

Many Formats!
SGML Editor
.mif
.xml
.htm
.rtf
Software Editor
.dll
.rc
.resx
SGML Editor
Many Filters!
XLIFF
.mif
.xml
.htm
.rtf
.dll
.rc
Software Editor
.resx
LQA
49
LISA terminology exchange standard TBX

What is TBX?
Term Base eXchange standard by LISA
XML based, vendor-neutral, open standard
Why TBX?
Terminology handled using proprietary standards
Difficult to share
Difficult to develop tools to enhance term
adherence
Glossary format choice linked to translation
tool
Glossary usually maintained by LSP
Limited client control

50
TBX (Benefits and Implementation status)

Benefits
Better control of terminology (source
consistency)
Improved quality
Improved consistency
Improved terminology control at target
Reduced glossarisation effort (localisation
phase)
Master provided with source
Allows automated QA checks
Platform and tool independent glossaries (global
consistency)
Unify terminology across platforms/formats/vendors
Current status
TBX Basic (Lighter approach)
TBX Checker

51
TBX Example

TBX Basic example from LISA

52
LISA segmentation rules standard SRX

What is SRX?
Segmentation Rules eXchange format
Describes how localisation tools segment text
for processing
Benefits
Standardises segmentation process (avoid
segmentation issues)
Structure and Elements
<srx>: root element, contains one each of
<header>, <body>
<header>: attrs segmentsubflows, cascade; may
contain <formathandle>
<formathandle>: defines how to handle boundary
formatting
<body>: contains one each of <languagerules>,
<maprules>
<languagerules>: contains one or more
<languagerule>
<languagerule>: contains one or more <rule>
<rule>: attr break; contains a pair
<beforebreak>, <afterbreak>
<beforebreak>, <afterbreak>: contain the
segmentation regular expressions
<maprules>: encloses several <languagemap>
defining rule precedence
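
The before-break/after-break idea can be sketched without any SRX file: each rule is a pair of regular expressions plus a break flag, and at every position the first rule whose two expressions match decides whether to split. The two rules below are invented for the example, and the loop is a simplification rather than a conforming SRX engine.

```python
import re

# (break?, before-break regex, after-break regex), in precedence order.
RULES = [
    (False, r"\b(?:Mr|Dr|etc)\.", r"\s"),   # exception: no break after abbreviations
    (True, r"[.?!]", r"\s"),                # break after sentence-final punctuation
]

def segment(text, rules=RULES):
    """Split text at positions where a break rule is the first matching rule."""
    compiled = [
        (brk, re.compile(f"(?:{before})\\Z"), re.compile(after))
        for brk, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for brk, before, after in compiled:
            # before must match text ending at pos, after must match text starting at pos
            if before.search(text[:pos]) and after.match(text, pos):
                if brk:
                    segments.append(text[start:pos])
                    start = pos
                break   # first matching rule wins
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]

result = segment("Dr. Smith arrived. Did he stay? No.")
print(result)
```

Placing the no-break exception before the general rule is what keeps "Dr." attached to the rest of its sentence; two tools that order or define these rules differently will segment the same text differently, which is the leverage problem SRX is meant to address.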

53
SRX example

SRX tool within Passolo

54
Final Thoughts

Unicode
As a rule, use it. If delivery requires another
encoding, convert at the final stage
XML
Powerful for single-source, multi-output
requirements
CMS
Costly. Depends on volume. First consider an XML-only
model, then migrate
DITA
Use it if it matches your data model. It will
reduce the effort of migrating to a CMS
TMX
Use for safe tool-to-tool TM transfer, especially
software into doc
XLIFF
Still not fully implemented. Good alternative
for Java and Web content.
Use it to unify side processes (LQA)
TBX
Use to exchange glossary info. Good for clients
SRX
Very much needed, but still few implementations.
55
About the Author - Andrés Vega

9 years of experience as a Localisation Engineer
with Tek Translation International.
Specializing in complex project engineering with
special focus on CMS, encodings and complex
scripts.
Previous work as a programming languages teacher:
OO programming, C and Java.
Background in Chemistry and Healthcare.

56
About Tek: Multilingual translation and
localisation business solutions designed to meet
the needs of Life Sciences, IT and Manufacturing