Title: ISO TC 37 / SC4 Language Resources
1 ISO TC 37 / SC4 Language Resources
- An overview
- (Amended 2-5 February 2002)
- Laurent Romary
2 Standards for language processing
- Access protocols. Standards: CORBA, SOAP
- Primary resources (text, dialogues): structural mark-up, basic annotations. Standards: TEI, MPEG7, TMX (XHTML), etc.
- Knowledge structures: hierarchies of types, relations between concepts (subjects/topics etc.), links to primary resources. Standards: Topic Maps
- NLP structures (annotations): POS tagging, chunks (cf. Named Entities), deep syntactic structures, co-references etc. Standards: Eagles/ISLE, CES
- Lexical structures (language models): terminologies, transfer lexica, LTAG/HPSG/LFG lexica. Standards: TBX, OLIF, Eagles/ISLE (Genelex)
- Meta-data. Standards: Dublin Core, OLAC, ISLE, MPEG7, RDF
- ISO TC37 - Terminology and other language resources
- SC3 - Computer applications in terminology
- ISO 12200 - Martif
- Latest version of TEI Terminology chapter
- ISO 12620 - Data categories
- ISO CD (DIS under ballot) 16642 - TMF (Terminological Markup Framework)
- SC4 - Language resources
4 TC37/SC4 details
- Scope: Platform for designing and implementing linguistic resource formats and processes
- Multi-layer annotation of linguistic resources
- Exchange of information between NLP modules
- General strategy
- Involve a wide community from academia and industry
- Identification of experts in the various work items
- Involvement through national standardizing bodies
- Agenda
- Current identification of possible work items and working groups
- Constituency meeting and technical workshop at LREC (May 2002)
- Secretary
- Prof. Key-Sun Choi, Korea
- Chair
- Laurent Romary, France
- International Advisory Committee
- Permanent Chair Prof. Antonio Zampolli, Italy
6 SC4 and other standardizing bodies
Contributing organizations
- text representation
- Reference for primary sources
- e.g. text archives
- W3C
- basic protocols and formats
- XML (Schemas)
- XPath
- XPointer
- ISO TC37/SC4: language resources, NLP perspective, e.g. linguistic annotations, lexical
Technical background
- What about gestures?
- Kinesics in the TEI
- MPEG: multimedia, XML-based, e.g. MPEG7-4, word and phone lattices
7 Working groups
- WG1 Basic descriptors and mechanisms for language resources
- Convener Laurent Romary
- WG2 Representation schemes
- Convener Kiyong Lee
- WG3 Multilingual text representation
- Convener Alan K. Melby
- WG4 Lexical databases
- Convener ??
- WG5 Workflow of Language Resource Management
- Convener Christian Galinski
8 TC37/SC4 Work Items
- WG1/WI-0 Terminology of Language Resources
- WG1/WI-1 Linguistic annotation framework
- WG1/WI-2 Meta-data for multimodal and multilingual information
- WG2/WI-3 Structural content representation scheme
- WG2/WI-4 Multimodal content representation scheme
- WG2/WI-5 Discourse level representation scheme
9 TC37/SC4 Work Items - cont.
- WG3/WI-6a Translation Memory, Alignment of parallel corpora
- WG3/WI-6b Segmentation and counting algorithms (characters, words, sentences etc.)
- WG3/WI-6c Meta-markup for GIL (Globalization, Internationalization and Localization)
- WG4/WI-7 NLP Lexica
- WG5/WI-8 Validation of language resources
- WG5/WI-9 Net-based distributed cooperative work for the creation of LRs
- Terminology of Language Resources
- Basic terminology of the various sub-fields of language resources and general methodology
- Project leader Klaus-Dirk Schmitz
- Sources
- ISO 1087
- LREC proceedings KAIST
- English dictionaries in Linguistics?
- Support from GTW
- Linguistic annotation framework
- Basic mechanisms and data structures for linguistic annotation and representation data architecture
- Methods and principles for the design of an annotation scheme
- Structural nodes and information units, Data category specification
- Linking and pointing mechanisms, Feature Structures, Meta-Markup
- Stand-off and in-line views: equivalences, combining levels
- Administrative data categories
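The equivalence between stand-off and in-line views mentioned above can be illustrated with a small sketch. It is not taken from the work item itself: the element name `w`, the attribute `pos`, the offset convention (start inclusive, end exclusive) and the function `to_inline` are all assumptions chosen for illustration, and the sketch assumes non-overlapping spans and text that needs no XML escaping.

```python
# Hypothetical illustration: one POS annotation layer in two views.
text = "Time flies"

# Stand-off view: annotations live outside the primary text and point
# into it by character offsets (start inclusive, end exclusive).
standoff = [
    {"start": 0, "end": 4, "pos": "NOUN"},
    {"start": 5, "end": 10, "pos": "VERB"},
]

def to_inline(text, annotations):
    """Derive an in-line view by wrapping each annotated span in a tag.

    Assumes spans do not overlap and the text needs no XML escaping.
    """
    out = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a["start"]):
        out.append(text[cursor:ann["start"]])  # unannotated gap
        span = text[ann["start"]:ann["end"]]
        out.append('<w pos="{}">{}</w>'.format(ann["pos"], span))
        cursor = ann["end"]
    out.append(text[cursor:])  # trailing unannotated text
    return "".join(out)

inline = to_inline(text, standoff)
print(inline)
```

Because the stand-off layer never modifies the primary text, several such layers (for example POS tags and co-reference links) could point at the same source, which is one way to read "combining levels".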
12 WI-1 - cont.
- Project leader Nancy Ide (TBC)
- Contributors Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary
- Possible sources
- TMF, ISO 12620 (revised), Mate (general methodology)
- TEI (Linking mechanisms, feature structures)
- Link with Linguistic DS
- Meta-data for multimodal and multilingual information
- Description of a meta-data representation scheme to document linguistic information structures and processes
- General content description
- Local content description
- Project leader Peter Wittenburg, MPI (Nijmegen, NL)
- Participants Steven Bird, TEI aware person
- Possible sources
- OLAC, Mile, TEI Header
- Liaison TC46 (SC9), MPEG7/MDS, SCORM
- Structural content representation scheme
- Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes
- Meta-model for morpho-syntactic annotation
- Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures)
- Corresponding data category registries
15 WI-3 - cont.
- Project leader John Carroll ??
- Participants Nuria Bell, representatives from existing TreeBanks initiatives
- Possible sources
- Eagles, TAGML, Linguistic DS
- Multimodal meaning representation scheme
- Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural)
- Meta-model for content representation (events, participants, etc.)
- Data category registry for multimodal content
- Project leader Harry Bunt (id1)
- Possible sources
- SIGSEM working group on semantic content
- Chair 1
- Liaison
- Semantic web activities
- Discourse level representation scheme
- Meta-model for discourse and dialogue representation
- Meta-model for discourse level annotation (e.g. reference annotation)
- Corresponding data category (DatCat) registry
- Possible sources
- DRI - Discourse Resource Initiative
- Mate
18 WI 6a
- Translation Memory, Alignment of parallel corpora
- Provides formats for the representation of multilingual textual data as produced in translation activities or constructed from existing primary sources
- Sources
- OSCAR/TMX for translation memories
- TEI based linking mechanism (or see WI-1) for parallel texts
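As a rough illustration of what a translation memory entry holds, the sketch below builds one aligned segment pair using the TMX element names `tu` (translation unit), `tuv` (one language variant) and `seg` (the segment text). The helper `make_tu` and the sample strings are hypothetical, and the use of `xml:lang` to mark the language is an assumption about the TMX version in use; a real TMX file also needs the enclosing `tmx`, `header` and `body` elements, omitted here.

```python
import xml.etree.ElementTree as ET

# Qualified name of the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def make_tu(pairs):
    """Build one TMX-style translation unit from (language, text) pairs.

    Hypothetical helper: it covers only the tu/tuv/seg core of an entry.
    """
    tu = ET.Element("tu")
    for lang, text in pairs:
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        seg = ET.SubElement(tuv, "seg")
        seg.text = text
    return tu

tu = make_tu([
    ("en", "Language resources"),
    ("fr", "Ressources linguistiques"),
])
serialized = ET.tostring(tu, encoding="unicode")
print(serialized)
```

The same pairing could instead be expressed with stand-off links between two separate monolingual texts, which is the role the TEI based linking mechanism plays for parallel corpora.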
19 WI 6b
- Segmentation and counting algorithms (characters, words, sentences etc.)
- Provide methods for segmenting streams of text with markup and means for counting the corresponding segments
- Possible sources
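To make the task concrete, here is a deliberately naive sketch of segmenting marked-up text and counting the resulting units. It is not a proposed algorithm: the function `count_segments`, the regular expressions and the sample string are assumptions for illustration. In particular, the sentence rule (split after `.`, `!` or `?` followed by whitespace) fails on abbreviations, and the word rule does not suit languages written without spaces, which is the kind of problem a standard in this area would need to address.

```python
import re

def count_segments(marked_up):
    """Naive counts over text carrying XML-like markup (illustrative only)."""
    # Drop tags but keep their textual content.
    plain = re.sub(r"<[^>]+>", "", marked_up)
    # Sentence boundary: whitespace preceded by ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", plain.strip()) if s]
    # Word: a maximal run of letters, digits or underscores.
    words = re.findall(r"\w+", plain)
    return {
        "characters": len(plain),
        "words": len(words),
        "sentences": len(sentences),
    }

sample = "<p>Hello <b>world</b>. How are you?</p>"
counts = count_segments(sample)
print(counts)
```

Whether markup, whitespace and punctuation count as characters is itself a design choice; this sketch counts everything that remains after tags are removed.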
20 WI 6c
- Meta-markup for GIL (Globalization, Internationalization and Localization)
- Identification of the specific markup modules needed to perform GIL activities
- Possible sources
- OSCAR/OpenTag
- NLP lexica
- Lexicon representation formats for the various types of NLP applications (Machine Readable Lexica)
- Define a set of meta-models (classes of applications)
- Specific data categories (derivation, phonology, etc.)
- Based on the work done in other work items
- Possible sources
- Eagles
- Multext
- ISLE Computational lexicon Working group
- Validation of language resources
- Defines guidelines and requirements for producing and distributing high quality language resources
- Contacts
- Possible sources
- To be defined
- Net-based distributed cooperative work for the creation of LRs
- Principles and methods for designing collaborative and cooperative compilation of LRs
- Define what is specific to LRs with regard to:
- Traceability of resources, version control, validation, quality management
- Protocols (CORBA, SOAP), Workflow standards, Data management
- Contacts Christian Galinski, Remi Zajac
- Sources To be defined
24 Liaison - OSCAR (AKM)
- Brief history of LR exchange standards
- Parallel events since 1997
- Open Tag - meta-markup (XML vs. Others)
- Major current OSCAR activities
- TMX - Translation Memory eXchange
- Counting and segmentation algorithms
- TBX (Terminologies) and OLIF (MT lexica)
- XLIFF and CGS - Annotation of source code and localisation of web sites
- xml:lang etc. J. DeCamp and S.-E. Wright
25 Liaison - TEI (LR)
- General architecture and data modeling
- WI-1
- Annotations (paragraph level, external annotations)
- WI-1
- TEI Header
- WI-2
- NLP lexica (with regard to terminologies and dictionaries)
- WI-7
- Feature structures
- WI-1