Project Management in the Language Industry: Lecture 3

1
Project Management in the Language Industry
Lecture 3
  • Dr. Gregory M. Shreve
  • Kent State University
  • Institute for Applied Linguistics

2
Documents and LPM
  • Language projects always involve the document
    pool of the client. Translation and localization
    can become part of a company's document life
    cycle if it markets products in other linguistic
    and cultural locales. A document pool is the
    logical object representing all documents
    created, acquired, stored, and otherwise used by
    a client. The effective language project manager
    has to understand the nature of the client's
    document pool and the processes associated with
    it. More importantly, the manager has to
    understand what can be done with a document pool.
    The subset of the pool that is machine-readable
    and resides on a single structured data source
    is a corpus (unified corpus).

MR subset
document pool
corpus
3
Documents
  • Documents are information organized and
    presented for human understanding: an interface
  • Documents are where information meets with
    people and their work: a conduit
  • Documents are a reflection of social relations:
    an interaction structure

4
Document Life Cycle
  • The major phases of a document's life cycle are
    process-oriented. Document processes include:
  • creation
  • storing
  • rendering
  • distribution
  • acquisition
  • retrieval

5
Document Creation
  • Document creation is also known as Authoring.
    Computer-Assisted Authoring is (will be) a
    significant trend in the next decade. Authoring
    tools are used in the document creation stage of
    the document life cycle.

documents
6
Computer Assisted Authoring
  • Existing tools do not exhaust the range of
    language-based technologies which can help in
    document creation

style checking
grammar checking
spell checking
7
Document Storage
Space, speed and ease of access are the most
important parameters for document storage
technologies. Space constraints can be overcome
by compression, but speed and ease of access are
the trade-off. (Why?)
storage medium
uncompressed
compressed
8
Uncompressed data need not be decompressed
upon retrieval
Some forms of document storage use document
images, and the compression techniques are image
compression algorithms.
storage medium
compression
access
There are also algorithms for compressing textual
data, e.g. linguistically-based compression
algorithms.
decompression
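The space versus access trade-off on slides 7 and 8 can be illustrated with a minimal sketch. It assumes Python's standard zlib module as a stand-in for whatever compression a storage system actually uses, and the sample text is invented for illustration.

```python
import time
import zlib

# Hypothetical sample document; a real document pool would be far larger.
text = ("Language projects always involve the document pool of the client. "
        * 2000).encode("utf-8")

# Space: the compressed form occupies fewer bytes on the storage medium.
compressed = zlib.compress(text, 9)
ratio = len(compressed) / len(text)

# Access: compressed data must be decompressed before it can be read,
# which is the extra step that uncompressed storage avoids.
start = time.perf_counter()
restored = zlib.decompress(compressed)
decompress_seconds = time.perf_counter() - start

print(f"uncompressed: {len(text)} bytes")
print(f"compressed:   {len(compressed)} bytes ({ratio:.1%} of original)")
print(f"extra access step: {decompress_seconds:.6f} s to decompress")
```

The timing printed will vary by machine; the point is only that retrieval from compressed storage includes a decompression step.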
9
Rendering
  • Rendering is the process of reproducing the
    document in multiple desired formats. In the past
    such activity was restricted to copying or
    printing, but now other output formats are
    possible.

Print PDA Fax HTML PDF
storage medium
10
Distribution
  • Distribution consists of filtering and routing.
    With the increase in electronically available
    information, the demand for automatic filtering
    and routing has become critical. Current e-mail
    and work group support systems have rudimentary
    capabilities for filtering and routing.
    Distribution can be supported by representations
    of document workflow.

document workflow agent
11
A workflow agent allows supervisors to visualize
the movement of workfolders (documents) through a
process. The document workflow is developed in
a workflow representation. An agent implements
the representation by moving the work folder to
the next cell automatically upon completion of a
defined task. Workflow representation systems
allow supervisors to design visual document work
flows in minutes, with defined identification
numbers, names, allowable task times,
requirements, constraints and priorities.
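A workflow representation and its agent might be sketched as follows. This is an illustration only: the class names, fields, and the three example tasks are assumptions, not taken from any particular workflow product.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # A cell in the workflow representation (all fields are illustrative).
    task_id: int
    name: str
    allowed_hours: float


@dataclass
class WorkFolder:
    name: str
    position: int = 0                      # index of the current cell
    history: list = field(default_factory=list)


class WorkflowAgent:
    """Moves a work folder to the next cell when a task is completed."""

    def __init__(self, tasks):
        self.tasks = tasks

    def current_task(self, folder):
        if folder.position < len(self.tasks):
            return self.tasks[folder.position]
        return None                        # folder has left the workflow

    def complete_current_task(self, folder):
        task = self.current_task(folder)
        if task is None:
            raise ValueError("work folder has already left the workflow")
        folder.history.append(task.name)
        folder.position += 1


workflow = [
    Task(1, "authoring", 8.0),
    Task(2, "translation", 16.0),
    Task(3, "validation", 4.0),
]
agent = WorkflowAgent(workflow)
folder = WorkFolder("user-manual")

agent.complete_current_task(folder)
print(agent.current_task(folder).name)
```

A supervisor's view of the process would then amount to reading `position` and `history` for every open work folder.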
12
Acquisition
  • The difficulty of integrating the world of paper
    documents into the world of electronic document
    management is a proven productivity sink. The
    role of natural language models in improving
    optical character recognition and document
    reconstruction is highly under-exploited and
    just now being reflected in commercial products.

13
Access
  • An organization's cost for accessing a document
    far dominates the cost of filing it in the first
    place. The integration of work flow systems with
    content-based document access systems promises to
    expand one of the fastest growing segments of the
    enterprise level software market (work flow) from
    the niche of highly structured and transaction
    oriented organizations (e.g., insurance claim
    processing), to the general office which traffics
    in free text documents, and not just forms. The
    access phase is a ripe area for the productivity
    enhancing injection of language processing
    technology.

14
Document Management
  • A precondition for document access is storage
    followed by accession to a document management
    system (DMS). Most current systems provide:
  • scanning
  • image compression
  • storage
  • OCR
  • DBMS support
  • hierarchical archival
  • indexing by special field
  • keyword extraction
  • search (keyword, full-text)
  • import / export (acquisition/rendering)
  • security and access control
  • workflow

DMS
15
Many systems provide retrieval access to the DMS
only on the basis of indexes of meta-document
data that have been provided about the document
by DMS administrators and users or by the DMS
itself. Workflow is also meta-document data.
Increasingly, systems are providing access on the
basis of content-based data, i.e., data extracted
from the document.
content-based document data: keywords, indexes,
summary, language, physical data (fonts)
meta-document data: size, location, author,
creation date, revision date, name, department,
workflow
16
Corpus
  • A corpus is a collection of documents; it is the
    logical entity (object) corresponding to the
    physical body of documents in the DMS. It is the
    digital incarnation of the document pool. A
    collection of all corpora is a unified corpus. A
    unified corpus does not exist when separate(d)
    corpora exist in MR format in multiple locations.

corpus
DMS
17
Structured / Free Text
  • The documents in a corpus range in a cline from
    structured / transaction oriented to free text.
    Structured texts are the easiest to extract
    content from. Why?

more structured
more free
essay
patent
form
Because a document representation is easier to
construct.
18
Document Retrieval
  • Document retrieval is defined as the matching of
    some stated user query against documents and/or
    the useful parts of documents. These records
    could be any type of mainly unstructured text,
    such as bibliographic records, newspaper
    articles, or paragraphs in a manual. User queries
    could range from multi-sentence full descriptions
    of an information need to a few words. The vast
    majority of retrieval systems currently in use
    apply simple Boolean matching to keyword or
    full-text searches, though some use statistical
    or natural language processing techniques. Many
    systems will just retrieve the document(s)
    satisfying the query. Of greater value would be
    the retrieval of document segments and/or other
    objects derivable from the document set retrieved
    by the query.
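The simple Boolean keyword retrieval described on this slide can be sketched with an inverted index. The three documents and the function names are hypothetical; a real DMS would add stemming, ranking, and segment-level retrieval.

```python
import re
from collections import defaultdict

# Hypothetical miniature corpus keyed by document identifier.
documents = {
    "d1": "The translation memory stores aligned translation units.",
    "d2": "A terminology database stores validated terms.",
    "d3": "Translation and terminology work share one corpus.",
}


def tokenize(text):
    # Lowercase the text and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


# Inverted index: keyword -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in tokenize(text):
        index[token].add(doc_id)


def boolean_and(*terms):
    # Documents that contain every query term.
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(*terms):
    # Documents that contain at least one query term.
    return set().union(*(index.get(t.lower(), set()) for t in terms))


print(sorted(boolean_and("translation", "terminology")))  # ['d3']
print(sorted(boolean_or("memory", "database")))           # ['d1', 'd2']
```

Both queries return whole documents, which is exactly the limitation the slide points out.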

19
DMS
keyword phrase
QUERY
query language query interface
document
document segment
document segment
document segment
document segment
20
QUERY mechanism
DMS
term / keyword: full-text string search
phrase: full-text string search, enhanced by NP
tagging via POS tagger
concept structure
21
QUERY: Retrieval mechanisms
  • traditional: statistically based
  • NLP-based: using NLP resources (lexicons,
    dictionaries, thesauri, proper name recognizers)
  • parse the corpus to locate content-carrying
    terms, discover relationships between these
    terms, and then use these terms to expand or
    modify the queries
  • future research looks to integrate statistical
    and NLP approaches
22
Other Corpus-derived Objects
concept structures, term lists, context sets,
syntax trees, tagged phrase sets, summaries /
abstracts, ...
Other information can be extracted from corpora.
This process is called extraction and involves
algorithms for document interpretation and
analysis.
23
Extraction
structured formats DBMS TDB MD
free text
Content-based document retrieval depends on
information extraction methods being applied to
the corpus of texts in the DMS. Keyword
extraction is a common example, but terminology
mining might be another useful application.
Extraction produces structured data from free
text. A common technique is text skimming, where
automatic routines isolate key identifying
artifacts in the text, such as proper names,
dates, times, and locations, and then use a
combination of linguistic constraints and domain
knowledge to identify the important content of
each relevant text.
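Text skimming can be approximated with a few regular expressions. The sketch below is deliberately crude: the sample sentence is invented, the proper-name rule is only a heuristic (runs of capitalized words), and locations and domain knowledge are not handled at all.

```python
import re

# Invented sample sentence used only to exercise the patterns.
text = ("On 12 March 1999 at 14:30, Maria Lopez sent the revised manual "
        "to Acme Tools in Cleveland.")

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")
TIME = re.compile(r"\b\d{1,2}:\d{2}\b")
# Heuristic: two or more consecutive capitalized words count as a name.
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

# Extraction turns free text into a structured record.
record = {
    "dates": DATE.findall(text),
    "times": TIME.findall(text),
    "names": NAME.findall(text),
}
print(record)
```

Note that the single-word location "Cleveland" is missed by the name heuristic, which is where linguistic constraints and domain knowledge would come in.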
24
Data Mining
  • Extraction is related to data mining. Data mining
    is the process of discovering and extracting
    knowledge (patterns, associations, changes,
    anomalies, significant structures) from document
    databases. It is sometimes referred to as
    knowledge discovery in databases. Data mining
    consists of several steps:
  • data cleaning (handles noisy, erroneous, missing,
    or irrelevant data)
  • data integration (multiple data sources merged
    into one DMS/corpus)
  • data selection (data relevant to the task
    retrieved from the DMS)
  • data transformation (data consolidated into
    forms appropriate for mining by performing
    aggregation or summary techniques)
  • data mining (intelligent methods applied to
    extract data patterns)
  • pattern evaluation (identify interesting
    patterns representing knowledge based on
    interest measures)
  • knowledge presentation (visualization and
    knowledge representation techniques are used to
    present results to the user)
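The seven steps above can be lined up as a small pipeline. This is a toy sketch under stated assumptions: the two sources, the keyword records, and the use of co-occurrence counting as the "intelligent method" are all invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical data sources holding keyword lists per document.
source_a = [{"id": 1, "keywords": ["corpus", "alignment", "memory"]},
            {"id": 2, "keywords": ["corpus", "terminology", None]}]
source_b = [{"id": 3, "keywords": ["corpus", "alignment", "terminology"]},
            {"id": 4, "keywords": []}]


def clean(records):
    # data cleaning: drop missing keywords and records left empty
    cleaned = []
    for r in records:
        kws = [k for k in r["keywords"] if k]
        if kws:
            cleaned.append({"id": r["id"], "keywords": kws})
    return cleaned


def integrate(*sources):
    # data integration: merge several sources into one collection
    return [r for s in sources for r in s]


def select(records, required):
    # data selection: keep only records relevant to the task
    return [r for r in records if required in r["keywords"]]


def transform(records):
    # data transformation: reduce each record to a sorted keyword list
    return [sorted(set(r["keywords"])) for r in records]


def mine(keyword_sets):
    # data mining: count keyword pairs that occur in the same document
    pairs = Counter()
    for kws in keyword_sets:
        pairs.update(combinations(kws, 2))
    return pairs


def evaluate(pairs, min_support=2):
    # pattern evaluation: keep pairs seen in at least min_support documents
    return {p: n for p, n in pairs.items() if n >= min_support}


def present(patterns):
    # knowledge presentation: a plain-text report
    return [f"{a} + {b}: {n} documents"
            for (a, b), n in sorted(patterns.items())]


corpus = integrate(clean(source_a), clean(source_b))
patterns = evaluate(mine(transform(select(corpus, "corpus"))))
print("\n".join(present(patterns)))
```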

25
Terminology Mining
  • We can see terminology mining as an important
    subset of data mining, where the same basic steps
    apply, including the need for selection,
    transformation and mining methods to discover and
    represent terminological (and the underlying
    conceptual) structures in documents and document
    corpora.
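One very simple form of terminology mining is to count recurring multi-word candidates across documents. The sketch below assumes a tiny invented corpus, a hand-made stopword list, and a frequency threshold of 2; production term extractors would normally add part-of-speech patterns and statistical association measures.

```python
import re
from collections import Counter

# Illustrative stopword list; not a complete one.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "for",
             "each", "with"}

# Hypothetical documents from a client's corpus.
documents = [
    "The translation memory stores each translation unit.",
    "Align each translation unit with the translation memory.",
    "The terminology database is linked to the translation memory.",
]


def candidate_terms(text, n=2):
    # Yield n-word sequences that contain no stopword.
    tokens = re.findall(r"[a-z]+", text.lower())
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if not any(t in STOPWORDS for t in gram):
            yield " ".join(gram)


counts = Counter(term for doc in documents for term in candidate_terms(doc))

# Keep candidates that recur; these would go to a terminologist for validation.
term_list = [term for term, freq in counts.most_common() if freq >= 2]
print(term_list)  # ['translation memory', 'translation unit']
```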

document
document
t1
DMS
t2
cn
document
document
26
Leveraging the Document Pool
Data mining and terminology mining emphasize the
fact that it is possible to leverage a document
pool. Leveraging means exploiting the document
pool as a business resource with the objectives
of increasing benefit while reducing costs. A
Language PM can sell strategies, methods and
tools for leveraging to a client, or,
alternatively, retain these resources and sell
the client language services which will leverage
the document pool.
27
Two Business Approaches
Language Service Company
1 sell tools, methods, strategies
2 apply the tools, methods, strategies; sell the
service
Client
28
Language Client-Server Model
Language Services Server
DMS
Client
Document Server Authoring Server Terminology
Server Translation Server
29
  • In general, certain principles apply when
    leveraging the document pool:
  • principle of corpus structure
  • principle of corpus scope
  • principle of corpus size
  • principle of document structure
  • principle of domain shift
  • principle of linguistic variation
  • principle of terminological control
  • principle of reusability
  • principle of validation

30
Corpus Structure
  • Corpora may have a segmented or partitioned
    structure. The structure may be hierarchical or
    associative. A hierarchical structure may
    involve division of the corpus into
  • subcorpora based on language
  • subcorpora based on domain
  • subcorpora based on document type
  • subcorpora based on transaction type or workflow
  • In general, some partitioning of the
    corpus will be useful or necessary during the
    leveraging process.
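A hierarchical partitioning of a corpus into subcorpora can be sketched as nested grouping on document metadata. The records and field names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical document records; field names are illustrative only.
documents = [
    {"name": "pump_manual_en", "language": "en", "domain": "engineering"},
    {"name": "pump_manual_de", "language": "de", "domain": "engineering"},
    {"name": "claim_form_en", "language": "en", "domain": "insurance"},
    {"name": "pump_patent_en", "language": "en", "domain": "engineering"},
]


def partition(docs, *keys):
    """Group documents hierarchically by the given metadata keys."""
    if not keys:
        return [d["name"] for d in docs]
    groups = defaultdict(list)
    for d in docs:
        groups[d[keys[0]]].append(d)
    # Recurse so that each group is partitioned by the remaining keys.
    return {value: partition(members, *keys[1:])
            for value, members in groups.items()}


# Language first, then domain, as in the hierarchy on this slide.
subcorpora = partition(documents, "language", "domain")
print(subcorpora["en"]["engineering"])
```

Changing the order of the keys gives a different hierarchy over the same documents, which is one way to think about ad hoc partitions.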

31
Multilingual Corpora
  • A DMS can store documents in multiple languages.
    In such cases we can consider the two or more
    document sets as partitioned sub-corpora of the
    unified corpus that represents all MR documents
    in the company residing on a single DMS.

Parallel Corpora
If the documents in sub-corpora are translations
of one another, they are referred to as parallel
corpora. Typically we expect such corpora to be
aligned by translation unit. Two such aligned
corpora constitute a translation memory.
Comparable Corpora
If the documents in the sub-corpora are in similar
domains, they are referred to as comparable corpora.
Such corpora could be concept aligned and
structurally aligned.
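The idea that two aligned parallel corpora form a translation memory can be sketched as a list of aligned unit pairs with a fuzzy lookup. The English and German units are invented, and the similarity measure (difflib's SequenceMatcher) and the 0.75 threshold are assumptions chosen for illustration, not how any particular TM tool scores matches.

```python
from difflib import SequenceMatcher

# Hypothetical parallel subcorpora, already aligned unit by unit.
source_units = ["Press the start button.",
                "Close the cover.",
                "Remove the battery."]
target_units = ["Drücken Sie die Starttaste.",
                "Schließen Sie die Abdeckung.",
                "Entfernen Sie die Batterie."]

# The translation memory is the set of aligned translation units.
translation_memory = list(zip(source_units, target_units))


def lookup(segment, memory, threshold=0.75):
    """Return (score, source, target) for the best match, or None."""
    best = None
    for source, target in memory:
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None


exact = lookup("Close the cover.", translation_memory)
fuzzy = lookup("Press the stop button.", translation_memory)
print(exact[2])
print(fuzzy[1], "->", fuzzy[2])
```

The fuzzy match returns the stored translation of a similar sentence, which a translator would then edit rather than translate from scratch.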
32
corpus
  • Sub-corpora and corpora may divide into a variety
    of partitions. Some of these may be constructed ad
    hoc based on real time processing of the corpus,
    or based on pre-existing tags relating to
    specific user views or schemes. Corpora may be
    more finely structured than just having
    subcorpora. The granularity of the partitioning
    of corpora may vary. Partitioning below the
    document level involves document structure. In
    general the finer the granularity of structuring
    the greater the control over retrieval, access,
    data mining and other operations.

sub corpora
sub corpora
document
document element
sentence
phrase
lexeme
33
Corpus Scope
  • The goal should be to make the client's document
    pool and the corpus isomorphic; that is, the more
    documents that are in MR format the better. This
    implies acquisition and import strategies to
    convert existing documents and to ensure that
    incoming documents enter the corpus. Corpus scope
    also implies that the range of documents in the
    corpus be co-extensive with the range of
    documents in the pool. Why?

external pools
document pool
corpus
acquisition
import
34
Corpus Size
  • The larger the corpus the more valuable the
    resource. As corpus size increases the pattern
    strength of consistent features of the corpus
    grows. Statistical analysis methods, neural
    networks and pattern recognition algorithms may
    be applied profitably. The corpus size of
    aligned corpora is a significant factor in the
    usability of translation memory and of comparable
    corpora.

corpus
corpus
35
Document Structure
  • The more structured the documents in the corpus,
    the easier they are to process. While it is
    possible to discover document structure via
    real-time processing, it might be useful to
    introduce document structuring strategies during
    the construction of new corpora for a client
    (i.e., when one has the opportunity). SGML-based
    methods including the TEI conventions could be
    used. Structuring documents allows for easier
    extraction, and can be used, as well, in
    authoring and translation applications to enforce
    or apply styles, identify document internal
    structures, identify terms, tag reusable
    components, etc.
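Why structure makes extraction easier can be seen in a short sketch. It uses XML parsed with Python's standard library as a stand-in for SGML, and the element names (article, title, para, term, footnote) are hypothetical rather than taken from the TEI conventions.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document; element names are illustrative.
sample = """<article>
  <title>Test Document</title>
  <para>This is a test paragraph.<footnote>This is a footnote.</footnote></para>
  <para>The <term>document pool</term> feeds the <term>corpus</term>.</para>
</article>"""

root = ET.fromstring(sample)

# With explicit markup, extraction is a lookup rather than an inference.
title = root.findtext("title")
terms = [t.text for t in root.iter("term")]
footnotes = [f.text for f in root.iter("footnote")]

print(title)
print(terms)
print(footnotes)
```

The same three items would require heuristics or language processing to recover from an untagged version of the text.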

[Figure: sample SGML-tagged "Test Document" (title, paragraph, footnote, index term, program listing)]
36
Domain Shift
  • The more constrained the domain shift(s), the
    easier knowledge-based techniques will be to
    implement. Domain shift is a measure of the range
    of variation of the knowledge domains contained
    in a corpus or sub-corpus. Controlling domain
    shift is a major factor in the ability to apply
    machine translation and knowledge-based
    methods.

domain
domain
domain
domain
domain
corpus
corpus
37
Linguistic Variation
  • The more constrained the language variation the
    easier language-based processes will be to
    implement over a corpus. The primary tools in
    controlling linguistic variation are the
    application of style guides, the use of document
    validation methods (e.g., DTD validation) and,
    especially, controlled language approaches.
  • Writers, especially technical writers, tend to
    develop special vocabularies (jargons), styles,
    and grammatical constructions. Technical language
    becomes opaque not just to ordinary readers, but
    to experts as well. The problem becomes
    particularly acute when such text is translated
    into another language, since the translator may
    not even be an expert in the technical domain.
    Controlled Languages (CL) have been developed to
    counter the tendency of writers to use unusual or
    overly-specialized, inconsistent language.

38
A CL is a form of language with special
restrictions on grammar, style, and vocabulary
usage. Typically, the restrictions are placed on
technical or other specialized documents,
including instructions, procedures, descriptions,
reports, and cautions. Where formal written
English applies to society as a whole, CLs apply
to the specialized sublanguages of particular
domains. By now, hundreds of companies have
turned to CLs as a means of improving readability
or facilitating translation to other languages.
Formal written English
Controlled English
Rule set: syntactic correctness, semantic
constraints
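A controlled language rule set can be pictured as a checker that flags violations. The sketch below is illustrative only: the 20-word sentence limit and the small list of unapproved words are invented and do not reproduce the rules of any published controlled language.

```python
import re

# Hypothetical rule set; real controlled languages define their own rules.
MAX_WORDS_PER_SENTENCE = 20
UNAPPROVED = {"utilize": "use", "commence": "start", "terminate": "stop"}


def check(text):
    """Return a list of (sentence_number, message) rule violations."""
    issues = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    for number, sentence in enumerate(sentences, start=1):
        words = re.findall(r"[A-Za-z']+", sentence)
        # Style restriction: sentence length.
        if len(words) > MAX_WORDS_PER_SENTENCE:
            issues.append((number,
                           f"sentence has {len(words)} words "
                           f"(limit {MAX_WORDS_PER_SENTENCE})"))
        # Vocabulary restriction: unapproved words and their replacements.
        for word in words:
            preferred = UNAPPROVED.get(word.lower())
            if preferred:
                issues.append((number,
                               f"replace '{word}' with '{preferred}'"))
    return issues


issues = check("Utilize the wrench to remove the bolt. Commence the test.")
for number, message in issues:
    print(f"sentence {number}: {message}")
```

Grammar restrictions are not covered here; they would need a parser rather than word lists.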
39
  • The original CL was Caterpillar Fundamental
    English (CFE), created by the Caterpillar Tractor
    Company (USA) in the 1960s. Perhaps the best
    known recent controlled language is AECMA
    Simplified English [AEC95], which is unique in
    that it has been adopted by an entire industry,
    namely, the aerospace industry. The standard was
    developed to facilitate the use of maintenance
    manuals by non-native speakers of English.
    Aerospace manufacturers are required to write
    aircraft maintenance documentation in Simplified
    English. Some other well-known CLs are Smart's
    Plain English Program (PEP), White's
    International Language for Serving and
    Maintenance (ILSAM), Perkins Approved Clear
    English (PACE), and COGRAM. Many CL standards are
    considered proprietary by the companies that have
    developed them. Controlling languages has
    beneficial effects on the prospects for both
    Computer-Assisted Authoring and MT.

40
Terminological Control
  • Controlling terminology is as important as
    controlling linguistic variation and document
    structure. Terminological control strategies need
    to be applied early in the LI-PM's dealings with
    the client. Terminology documentation protocols,
    standards for machine-readable terminology
    databases, and the terminology databases
    themselves are important tools in the quest for
    terminology control. Terminology checkers /
    validators and the connection of MT, CAT and CAA
    systems to controlled terminology resources are
    essential strategies.
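A terminology checker / validator can be sketched as a scan for deprecated variants listed in a termbase. The termbase entries and the sample sentence are hypothetical.

```python
import re

# Hypothetical termbase: preferred term -> deprecated variants.
termbase = {
    "translation memory": ["translation database", "TM database"],
    "power cord": ["power cable", "mains lead"],
}


def check_terminology(text, termbase):
    """Return (position, found_variant, preferred_term), sorted by position."""
    findings = []
    for preferred, variants in termbase.items():
        for variant in variants:
            pattern = r"\b" + re.escape(variant) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                findings.append((match.start(), match.group(0), preferred))
    return sorted(findings)


text = "Connect the power cable, then open the translation database."
findings = check_terminology(text, termbase)
for position, found, preferred in findings:
    print(f"offset {position}: '{found}' should be '{preferred}'")
```

An authoring or translation tool connected to the same termbase could run this check as the text is written.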

41
Reusability
  • Documents, portions of documents, sentences,
    translations, phrases and terms (i.e., any object
    in the corpus or derivable from it) can be re-used
    once validated and once their relationships to
    other objects and their object instances are
    understood.
  • In most corporate corpora there is a relatively
    restricted range of domains (low domain shift),
    low linguistic variation, and a finite
    terminology set. If document structures are
    understood then relationships between document
    elements can be defined and represented and many
    reusable document objects may be specified.

42
variables
corpus
term set
document elements
Assembled Document
document structure
translations
Reusable Objects
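The assembly of a document from reusable objects, variables, and a document structure might look like the sketch below. Everything in it is hypothetical: the object names, the product and company values, and the English and German texts.

```python
from string import Template

# Hypothetical validated reusable objects, each stored with its translation.
reusable_objects = {
    "safety_warning": {
        "en": Template("Disconnect the $product before servicing."),
        "de": Template("Trennen Sie das Gerät $product vor der Wartung."),
    },
    "support_note": {
        "en": Template("Contact $company support for help."),
        "de": Template("Wenden Sie sich an den Support von $company."),
    },
}

# Variables filled in at assembly time.
variables = {"product": "X100", "company": "Acme"}

# Document structure: the ordered list of elements to assemble.
structure = ["safety_warning", "support_note"]


def assemble(structure, objects, variables, language):
    """Build a document by substituting variables into reusable objects."""
    return "\n".join(objects[name][language].substitute(variables)
                     for name in structure)


english = assemble(structure, reusable_objects, variables, "en")
german = assemble(structure, reusable_objects, variables, "de")
print(english)
print(german)
```

Because each object carries its translation, assembling the target-language document reuses the same structure and variables.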
43
Validation
  • Validation/verification methods need to apply to
    all objects added to a corpus (e.g., documents),
    added to a document (e.g., structure, tags) or
    extracted from a corpus (e.g., terminology,
    concept maps). Validation is a requirement for
    corpus integrity and a component of corpus and
    document quality assurance.

corpus
44
Internationalization
  • Many of the principles of corpora management
    enunciated here, in particular terminology
    control, domain shifting, controlled languages
    and document structuring are key issues in
    engineering a corpus for easier and more accurate
    localization / translation.

corpus
45
A Complex Equation
  • Leveraging the client's corpus for the best
    cost-benefit ratio implies that you, the project
    manager, understand the nature of the corpus and
    the strategies, methods and tools that can be
    brought to bear on it.

Document Processes: acquisition, authoring,
distribution, extraction, importing, rendering,
representation, retrieval, storage, structuring,
terminology, translation, validation
standardize represent automate
Corpora Principles / Parameters: corpus structure,
corpus scope, corpus size, document structure,
domain shift, linguistic variation, terminological
control, reusability, validation
corpus
manage control exploit