Title: Project Management in the Language Industry: Lecture 3
1Project Management in the Language Industry
Lecture 3
- Dr. Gregory M. Shreve
- Kent State University
- Institute for Applied Linguistics
2Documents and LPM
- Language projects always involve the document
pool of the client. Translation and localization
can become part of a company's document life
cycle if it markets products in other linguistic
and cultural locales. A document pool is the
logical object representing all documents
created, acquired, stored, and otherwise used by
a client. The effective language project manager
has to understand the nature of the client's
document pool and the processes associated with
it. More importantly, the manager has to
understand what can be done with a document pool.
The subset of the pool that is machine-readable
(MR) and resides on a single structured data
source is a corpus (unified corpus).
MR subset
document pool
corpus
3Documents
- Documents are information organized and
presented for human understanding: an interface
- Documents are where information meets
people and their work: a conduit
- Documents are a reflection of social relations:
an interaction structure
4Document Life Cycle
- The major phases of a document's life cycle are
process-oriented. Document processes include
- creation
- storing
- rendering
- distribution
- acquisition
- retrieval
5Document Creation
- Document creation is also known as Authoring.
Computer-Assisted Authoring is (will be) a
significant trend in the next decade. Authoring
tools are used in the document creation stage of
the document life cycle.
documents
6Computer Assisted Authoring
- Existing tools do not exhaust the range of
language-based technologies which can help in
document creation
style checking
grammar checking
spell checking
7Document Storage
Space, speed and ease of access are the most
important parameters for document storage
technologies. Space constraints can be overcome
by compression, but speed and ease of access are
the trade-off. (Why?)
storage medium
uncompressed
compressed
8Uncompressed data need not be decompressed
upon retrieval
Some forms of document storage use document
images, and the compression techniques are image
compression algorithms.
storage medium
compression
access
There are also algorithms for compressing textual
data, e.g. linguistically-based compression
algorithms.
decompression
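The space versus access trade-off can be illustrated with a minimal sketch. It assumes Python's standard zlib module as a stand-in for whatever compression a real storage system applies; the function name and sample text are invented for illustration.

```python
import zlib

def store_and_retrieve(text: str) -> dict:
    """Compare uncompressed and compressed storage of one document."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, level=9)   # space is saved at storage time...
    restored = zlib.decompress(packed)     # ...but retrieval needs this extra step
    return {
        "raw_bytes": len(raw),
        "compressed_bytes": len(packed),
        "round_trip_ok": restored == raw,
    }

# Repetitive text, common in technical documentation, compresses well.
sample = "Remove the filter. Install the new filter. " * 200
report = store_and_retrieve(sample)
print(report)
```

The uncompressed copy can be read directly, while the compressed copy always pays the decompression step before access.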
9Rendering
- Rendering is the process of reproducing the
document in multiple desired formats. In the past
such activity was restricted to copying or
printing, but now other output formats are
possible.
Print PDA Fax HTML PDF
storage medium
10Distribution
- Distribution consists of filtering and routing.
With the increase in electronically available
information, the demand for automatic filtering
and routing has become critical. Current e-mail
and work group support systems have rudimentary
capabilities for filtering and routing.
Distribution can be supported by representations
of document workflow.
document workflow agent
11A workflow agent allows supervisors to visualize
the movement of workfolders (documents) through a
process. The document work flow is developed in
a Work Flow Representation. An agent implements
the representation by moving the work folder to
the next cell automatically upon completion of a
defined task. Workflow representation systems
allow supervisors to design visual document work
flows in minutes, with defined identification
numbers, names, allowable task times,
requirements, constraints and priorities.
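The behavior of a workflow agent can be sketched as follows. This is an illustrative model only, not the design of any particular product: the class names, fields, and the three-task workflow are assumptions, and only identification numbers, names, and allowable task times are represented.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: int           # defined identification number
    name: str
    allowed_hours: float   # allowable task time

@dataclass
class WorkFolder:
    name: str
    position: int = 0                        # index of the current cell
    history: list = field(default_factory=list)

class WorkflowAgent:
    """Moves a work folder to the next cell when its current task completes."""

    def __init__(self, tasks):
        self.tasks = list(tasks)             # the workflow representation

    def current_task(self, folder):
        if folder.position < len(self.tasks):
            return self.tasks[folder.position]
        return None                          # folder has left the workflow

    def complete(self, folder, hours_spent):
        task = self.current_task(folder)
        if task is None:
            raise ValueError("work folder has already finished the workflow")
        overdue = hours_spent > task.allowed_hours
        folder.history.append((task.name, hours_spent, overdue))
        folder.position += 1                 # automatic routing to the next cell
        return self.current_task(folder)

agent = WorkflowAgent([
    Task(1, "translate", 8.0),
    Task(2, "edit", 4.0),
    Task(3, "proofread", 2.0),
])
folder = WorkFolder("user-manual-de")
next_task = agent.complete(folder, hours_spent=9.5)
print(next_task.name, folder.history)
```

The history list is what a supervisor would visualize: where each folder has been, how long each task took, and whether the allowable time was exceeded.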
12Acquisition
- The difficulty of integrating the world of paper
documents into the world of electronic document
management is a proven productivity sink. The
role of natural language models in improving
optical character recognition and document
reconstruction is highly under-exploited and
just now being reflected in commercial products.
13Access
- An organization's cost for accessing a document
far dominates the cost of filing it in the first
place. The integration of work flow systems with
content-based document access systems promises to
expand one of the fastest growing segments of the
enterprise level software market (work flow) from
the niche of highly structured and transaction
oriented organizations (e.g., insurance claim
processing), to the general office which traffics
in free text documents, and not just forms. The
access phase is a ripe area for the productivity
enhancing injection of language processing
technology.
14DocumentManagement
- A precondition for document access is storage
followed by accession to a document management
system (DMS). Most current systems provide
- scanning
- image compression
- storage
- OCR
- DBMS support
- hierarchical archival
- indexing by special field
- keyword extraction
- search (keyword, full-text)
- import / export (acquisition/rendering)
- security and access control
- workflow
DMS
15Many systems provide retrieval access to the DMS
only on the basis of indexes of meta-document
data that have been provided about the document
by DMS administrators and users or by the
DMS itself. Workflow is also meta-document
data. Increasingly, systems are providing access
on the basis of content-based data, i.e., data
extracted from the document itself.
content-based document data: keywords, indexes,
summary, language, physical data (fonts)
meta-document data: size, location, author,
creation date, revision date, name, department,
workflow
16Corpus
- A corpus is a collection of documents; it is the
logical entity (object) corresponding to the
physical body of documents in the DMS. It is the
digital incarnation of the document pool. A
collection of all corpora is a unified corpus. A
unified corpus does not exist when separate(d)
corpora exist in MR format in multiple locations.
corpus
DMS
17Structured / Free Text
- The documents in a corpus range in a cline from
structured / transaction oriented to free text.
Structured texts are the easiest to extract
content from. Why?
more structured
more free
essay
patent
form
Because a document representation is easier to
construct.
18Document Retrieval
- Document retrieval is defined as the matching of
some stated user query against documents and/or
the useful parts of documents. These records
could be any type of mainly unstructured text,
such as bibliographic records, newspaper
articles, or paragraphs in a manual. User queries
could range from multi-sentence full descriptions
of an information need to a few words. The vast
majority of retrieval systems currently in use
apply simple Boolean systems on keyword or
full-text searching, though some use statistical
or natural language processing techniques. Many
systems will just retrieve the document(s)
satisfying the query. Of greater value would be
the retrieval of document segments and/or other
objects derivable from the document set retrieved
by the query.
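A simple Boolean keyword search that also returns document segments might look like the sketch below. The function names and the two sample documents are invented for illustration; real systems use indexes rather than scanning every text.

```python
import re

def tokenize(text):
    """Lower-case a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_and_search(documents, query):
    """Return (doc_id, matching_segments) for documents containing every query term."""
    terms = tokenize(query)
    results = []
    for doc_id, text in documents.items():
        if not terms <= tokenize(text):
            continue                      # Boolean AND: all terms must occur
        # Segment-level retrieval: keep sentences that contain any query term.
        segments = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                    if terms & tokenize(s)]
        results.append((doc_id, segments))
    return results

docs = {
    "manual-1": "Check the oil filter weekly. Replace the air filter every year.",
    "manual-2": "Check the tire pressure. Rotate the tires every year.",
}
hits = boolean_and_search(docs, "oil filter")
print(hits)
```

Returning the matching sentences rather than the whole document is a small step toward the segment retrieval described above.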
19DMS
keyword phrase
QUERY
query language query interface
document
document segment
document segment
document segment
document segment
20QUERY mechanism
DMS
term / keyword full-text string search
phrase full-text string search / enhanced by NP
tagging via POS tagger
concept structure
21QUERY Retrieval mechanisms
parse the corpus to locate content-carrying
terms, discover relationships between these
terms, and then use these terms to expand or
modify the queries
NLP-based using NLP resources lexicons,
dictionaries, thesauri, proper name recognizers
traditional statistically based
future research looks to integrate statistical
and NLP approaches
22Other Corpus-derived Objects
concept structures term lists context sets syntax
trees tagged phrase sets summaries /
abstracts . . .
Other information can be extracted from corpora.
This process is called extraction and involves
algorithms for document interpretation and
analysis.
23Extraction
structured formats DBMS TDB MD
free text
Content-based document retrieval depends on
information extraction methods being applied to
the corpus of texts in the DMS. Keyword
extraction is a common example, but terminology
mining might be another useful application.
Extraction produces structured data from free
text. A common technique is text skimming, where
automatic routines isolate key identifying
artifacts in the text, such as proper names,
dates, times, and locations, and then use a
combination of linguistic constraints and domain
knowledge to identify the important content of
each relevant text.
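The first stage of text skimming, isolating key artifacts, can be approximated with regular expressions. The patterns below are deliberately crude assumptions (one date format, a capitalization heuristic for names and locations); the later stage, applying linguistic constraints and domain knowledge, is not covered.

```python
import re

DATE = re.compile(r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
                  r"August|September|October|November|December) \d{4}\b")
TIME = re.compile(r"\b\d{1,2}:\d{2}\b")
# Crude proper-name heuristic: two or more consecutive capitalized words.
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def skim(text):
    """Turn free text into a structured record of key artifacts."""
    return {
        "dates": DATE.findall(text),
        "times": TIME.findall(text),
        "names": NAME.findall(text),
    }

record = skim("Maria Lopez met the client in New York on 3 March 1999 at 14:30.")
print(record)
```

The result is structured data (a record with named fields) produced from free text, which is the defining property of extraction.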
24Data Mining
- Extraction is related to data mining. Data mining
is the process of discovering and extracting
knowledge (patterns, associations, changes,
anomalies, significant structures) from document
databases. It is sometimes referred to as
knowledge discovery in databases. Data mining
consists of several steps
- data cleaning (handles noisy, erroneous, missing,
or irrelevant data)
- data integration (multiple data sources merged
into one DMS/corpus)
- data selection (data relevant to the task
retrieved from the DMS)
- data transformation (data consolidated into
forms appropriate for mining by performing
aggregation or summary techniques)
- data mining (intelligent methods applied to
extract data patterns)
- pattern evaluation (identify interesting
patterns representing knowledge based on
interest measures)
- knowledge presentation (visualization and
knowledge representation techniques are used to
present results to the user)
25Terminology Mining
- We can see terminology mining as an important
subset of data mining, where the same basic steps
apply, including the need for selection,
transformation and mining methods to discover and
represent terminological (and the underlying
conceptual) structures in documents and document
corpora.
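One very simple form of terminology mining is to count recurring two-word sequences and treat the frequent ones as term candidates. The sketch below assumes a toy stopword list, a frequency threshold of two, and an invented three-document corpus; it finds candidate terms only and does not recover the underlying conceptual structure.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "on", "for", "with"}

def candidate_terms(documents, min_count=2):
    """Count two-word sequences without stopwords; frequent ones are term candidates."""
    counts = Counter()
    for text in documents:
        # Transformation: lower-case and split into sentences, then words.
        for sentence in re.split(r"[.!?]", text.lower()):
            words = re.findall(r"[a-z]+", sentence)
            for first, second in zip(words, words[1:]):
                if first not in STOPWORDS and second not in STOPWORDS:
                    counts[(first, second)] += 1
    # Pattern evaluation: keep only pairs that recur often enough.
    return [(" ".join(pair), n) for pair, n in counts.most_common() if n >= min_count]

corpus = [
    "Replace the fuel filter. The fuel filter is in the engine bay.",
    "Inspect the fuel filter and the drive belt.",
    "The drive belt is worn.",
]
terms = candidate_terms(corpus)
print(terms)
```

Candidates produced this way would still need validation by a terminologist before entering a term list.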
document
document
t1
DMS
t2
cn
document
document
26Leveraging the Document Pool
Data mining and terminology mining emphasize the
fact that it is possible to leverage a document
pool. Leveraging means exploiting the document
pool as a business resource with the objectives
of increasing benefit while reducing costs. A
Language PM can sell strategies, methods and
tools for leveraging to a client, or,
alternatively, retain these resources and sell
the client language services which will leverage
the document pool.
27Two Business Approaches
Language Service Company
2 apply the tools, methods, strategies sell the
service
1 sell tools, methods strategies
Client
28Language Client-Server Model
Language Services Server
DMS
Client
Document Server Authoring Server Terminology
Server Translation Server
29- In general, certain principles apply when
leveraging the document pool
- principle of corpus structure
- principle of corpus scope
- principle of corpus size
- principle of document structure
- principle of domain shift
- principle of linguistic variation
- principle of terminological control
- principle of reusability
- principle of validation
30Corpus Structure
- Corpora may have a segmented or partitioned
structure. The structure may be hierarchical or
associative. A hierarchical structure may
involve division of the corpus into
- subcorpora based on language
- subcorpora based on domain
- subcorpora based on document type
- subcorpora based on transaction type or workflow
- In general, some partitioning of the
corpus will be useful or necessary during the
leveraging process.
31Multilingual Corpora
- A DMS can store documents in multiple languages.
In such cases we can consider the two or more
document sets as partitioned sub-corpora of the
unified corpus that represents all MR documents
in the company residing on a single DMS.
Parallel Corpora
If the documents in sub-corpora are translations
of one another they are referred to as parallel
corpora. Typically we expect such corpora to be
translation unit aligned. Two such aligned
corpora constitute a translation memory.
Comparable Corpora
If the documents in sub-corpora are in similar
domains they are referred to as comparable corpora.
Such corpora could be concept aligned and
structurally aligned.
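How two translation-unit-aligned corpora behave as a translation memory can be sketched with a small lookup function. The segments are invented, the 0.75 threshold is an assumption, and Python's standard difflib is used here as a simple similarity measure rather than as the matching method of any commercial tool.

```python
import difflib

# Translation-unit-aligned parallel sub-corpora: index i in one list
# is the translation of index i in the other (sample data is invented).
source_en = ["Remove the filter.", "Install the new filter.", "Attach the cover."]
target_de = ["Entfernen Sie den Filter.", "Setzen Sie den neuen Filter ein.",
             "Bringen Sie die Abdeckung an."]

def tm_lookup(segment, source, target, threshold=0.75):
    """Return (stored source, stored translation, similarity) for the best match, or None."""
    best = None
    for src, tgt in zip(source, target):
        score = difflib.SequenceMatcher(None, segment.lower(), src.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (src, tgt, score)
    return best

exact = tm_lookup("Remove the filter.", source_en, target_de)
fuzzy = tm_lookup("Install the filter.", source_en, target_de)
print(exact)
print(fuzzy)
```

An exact match can be reused directly; a fuzzy match gives the translator a starting point that still needs editing.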
32corpus
- Sub-corpora and corpora may divide into a variety
of partitions. Some of these may be constructed ad
hoc based on real time processing of the corpus,
or based on pre-existing tags relating to
specific user views or schemes. Corpora may be
more finely structured than just having
subcorpora. The granularity of the partitioning
of corpora may vary. Partitioning below the
document level involves document structure. In
general the finer the granularity of structuring
the greater the control over retrieval, access,
data mining and other operations.
sub corpora
sub corpora
document
document element
sentence
phrase
lexeme
33Corpus Scope
- The goal should be to make the client's document
pool and the corpus isomorphic, i.e., the more
documents that are in MR format the better. This
implies acquisition and import strategies to
convert existing documents and to ensure that
incoming documents enter the corpus. Corpus scope
also implies that the range of documents in the
corpus be co-extensive with the range of
documents in the pool. Why?
external pools
document pool
corpus
acquisition
import
34Corpus Size
- The larger the corpus the more valuable the
resource. As corpus size increases the pattern
strength of consistent features of the corpus
grows. Statistical analysis methods, neural
networks and pattern recognition algorithms may
be applied profitably. The corpus size of
aligned corpora is a significant factor in the
usability of translation memory and of comparable
corpora.
corpus
corpus
35Document Structure
- The more structured the documents in the corpus,
the easier they are to process. While it is
possible to discover document structure via
real-time processing, the construction of new
corpora for a client offers an opportunity to
introduce document structuring strategies
from the start. SGML-based
methods including the TEI conventions could be
used. Structuring documents allows for easier
extraction, and can be used, as well, in
authoring and translation applications to enforce
or apply styles, identify document internal
structures, identify terms, tag reusable
components, etc.
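Why structure makes extraction easier can be shown with a short sketch. It uses XML and Python's standard xml.etree.ElementTree as a stand-in for an SGML toolchain; the element names, the `reusable` attribute, and the sample content are assumptions made for illustration and do not follow the TEI or any specific DTD.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<article>
  <title>Test Document</title>
  <para>Tighten the <term>drive belt</term> before use.</para>
  <para reusable="yes">Disconnect the power supply first.</para>
</article>
"""

def extract(xml_text):
    """Pull the title, tagged terms, and reusable paragraphs out of a structured document."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        "terms": [t.text for t in root.iter("term")],
        "reusable": [p.text for p in root.iter("para") if p.get("reusable") == "yes"],
    }

info = extract(SAMPLE)
print(info)
```

Because the terms and reusable components are tagged, extraction is a lookup rather than an interpretation problem.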
[Figure: SGML-tagged sample "Test Document" with a title, a test paragraph with a footnote, index terms, and a program listing]
36Domain Shift
- The more constrained the domain shift(s), the
easier knowledge-based techniques will be to
implement. Domain shift is a measure of the range
of variation of the knowledge domains contained
in a corpus or sub-corpus. Controlling domain
shift is a major factor in the ability to apply
machine translation and knowledge-based
methods.
domain
domain
domain
domain
domain
corpus
corpus
37Linguistic Variation
- The more constrained the language variation the
easier language-based processes will be to
implement over a corpus. The primary tools in
controlling linguistic variation are the
application of style guides, the use of document
validation methods (e.g., DTD validation) and,
especially, controlled language approaches.
- Writers, especially technical writers, tend to
develop special vocabularies (jargons), styles,
and grammatical constructions. Technical language
becomes opaque not just to ordinary readers, but
to experts as well. The problem becomes
particularly acute when such text is translated
into another language, since the translator may
not even be an expert in the technical domain.
Controlled Languages (CL) have been developed to
counter the tendency of writers to use unusual,
overly specialized, or inconsistent language.
38A CL is a form of language with special
restrictions on grammar, style, and vocabulary
usage. Typically, the restrictions are placed on
technical or other specialized documents,
including instructions, procedures, descriptions,
reports, and cautions. Where formal written
English applies to society as a whole, CLs apply
to the specialized sublanguages of particular
domains. By now, hundreds of companies have
turned to CLs as a means of improving readability
or facilitating translation to other languages.
Formal written English
Controlled English
Rule set syntactic correctness semantic
constraints
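A controlled-language rule set can be pictured as a checker applied to each sentence. The sketch below covers only two kinds of restriction, an approved vocabulary and a sentence-length limit; the word list and the 20-word limit are toy assumptions and are not taken from any published CL standard. Grammar and semantic constraints are not modeled.

```python
import re

# Toy approved vocabulary (an assumption for illustration only).
APPROVED = {"remove", "the", "filter", "install", "new", "close", "cover",
            "before", "you", "start", "engine"}
MAX_WORDS = 20   # assumed sentence-length limit

def check_sentence(sentence):
    """Return a list of rule violations for one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    problems = []
    if len(words) > MAX_WORDS:
        problems.append(f"too long: {len(words)} words (limit {MAX_WORDS})")
    for word in words:
        if word not in APPROVED:
            problems.append(f"unapproved word: {word}")
    return problems

ok = check_sentence("Remove the filter before you start the engine.")
bad = check_sentence("Extract the filter.")
print(ok, bad)
```

A checker of this kind can run inside an authoring tool, which is one way controlled languages support Computer-Assisted Authoring.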
39- The original CL was Caterpillar Fundamental
English (CFE), created by the Caterpillar Tractor
Company (USA) in the 1960s. Perhaps the best
known recent controlled language is AECMA
Simplified English [AEC95], which is unique in
that it has been adopted by an entire industry,
namely, the aerospace industry. The standard was
developed to facilitate the use of maintenance
manuals by non-native speakers of English.
Aerospace manufacturers are required to write
aircraft maintenance documentation in Simplified
English. Some other well-known CLs are Smart's
Plain English Program (PEP), White's
International Language for Serving and
Maintenance (ILSAM), Perkins Approved Clear
English (PACE), and COGRAM. Many CL standards are
considered proprietary by the companies that have
developed them. Controlling languages has
beneficial effects on the prospects for both
Computer-Assisted Authoring and MT.
40Terminological Control
- Controlling terminology is as important as
controlling linguistic variation and document
structure. Terminological control strategies need
to be applied early in the LI-PM's dealings with
the client. Terminology documenting protocols,
standards for machine-readable terminology
databases, and terminology databases are
important tools in the quest for terminology
control. Terminology checkers / validators, and the
connection of MT, CAT and CAA systems to
controlled terminology resources, form an essential
strategy.
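A terminology checker can be sketched as a scan for deprecated variants listed in a controlled termbase. The termbase entries and the function name below are hypothetical; a production checker would read from a terminology database and handle inflected forms.

```python
import re

# Hypothetical controlled termbase: deprecated variant -> approved term.
TERMBASE = {
    "hard disk": "hard drive",
    "screen": "display",
}

def check_terminology(text):
    """Return (deprecated variant, approved term, character offset) for each hit."""
    findings = []
    for variant, approved in TERMBASE.items():
        pattern = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            findings.append((match.group(0), approved, match.start()))
    return sorted(findings, key=lambda f: f[2])   # report in reading order

issues = check_terminology("Clean the screen. Back up the hard disk.")
print(issues)
```

The same lookup could feed an authoring tool, a CAT tool, or an MT pre-editing step, so that all of them draw on one controlled terminology resource.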
41Reusability
- Documents, portions of documents, sentences,
translations, phrases and terms, i.e., any object
in the corpus or derivable from it, can be re-used
once it is validated and once its relationship to
other objects and their object instances is
understood.
- In most corporate corpora there is a relatively
restricted range of domains (low domain shift),
low linguistic variation, and a finite
terminology set. If document structures are
understood then relationships between document
elements can be defined and represented and many
reusable document objects may be specified.
42variables
corpus
term set
document elements
Assembled Document
document structure
translations
Reusable Objects
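Assembling a document from reusable objects can be sketched with Python's standard string.Template. The element texts, variable names, and structure list are invented for illustration; in practice the elements would be validated objects drawn from the corpus.

```python
from string import Template

# Validated reusable objects (contents are invented for illustration).
elements = {
    "warning": Template("Disconnect the $product from the power supply."),
    "step": Template("Open the $part."),
}
term_set = {"product": "printer", "part": "paper tray"}   # variables filled from the term set
structure = ["warning", "step"]          # order given by the document structure

def assemble(structure, elements, variables):
    """Build a document by filling variables into reusable elements, in order."""
    return "\n".join(elements[name].substitute(variables) for name in structure)

document = assemble(structure, elements, term_set)
print(document)
```

Changing the term set (for another product, or for a translated term set paired with translated elements) yields a new assembled document without new authoring.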
43Validation
- Validation/verification methods need to apply to
all objects added to a corpus (e.g., documents),
added to a document (e.g., structure, tags) or
extracted from a corpus (e.g., terminology,
concept maps). Validation is a requirement for
corpus integrity and a component of corpus and
document quality assurance.
corpus
44Internationalization
- Many of the principles of corpora management
enunciated here, in particular terminology
control, domain shifting, controlled languages
and document structuring are key issues in
engineering a corpus for easier and more accurate
localization / translation.
corpus
45A Complex Equation
- Leveraging the client's corpus for the best
cost-benefit ratio implies that you, the project
manager, understand the nature of the corpus and
the strategies, methods and tools that can be
brought to bear on it.
Document Processes: acquisition, authoring,
distribution, extraction, importing, rendering,
representation, retrieval, storage, structuring,
terminology, translation, validation
standardize represent automate
Corpora Principles / Parameters: corpus structure,
corpus scope, corpus size, document structure,
domain shift, linguistic variation,
terminological control, reusability, validation
corpus
manage control exploit