Title: Standards, Use and Prospects for Language Resource Management
1Standards, Use and Prospects for Language
Resource Management
- Key-Sun Choi
- 16 Aug. 2008
- TII, Moscow
2MOTIVATION
3Wikipedia
- Web-based collaborative authoring multi-lingual
encyclopedia - 8.29 M pages/ 253 languages (2007/9)
- 2.0 M pages/ English (2007/9) now 5.0 M pages
Computer science
Category Classification
Databases
Computer scientists
Algorithms
Category Page
Martin Kay
Robert Watson
Parallel database
SQL
Divide and Conquer
4Problem IS-A Relation Extraction from Wikipedia
- Relation Classification from Category System
- By Term Formation Rule, Wikipedia Structure
(Ponzetto & Strube, 2007)
Relation Classification
IS-A relation
Upper-lower level Category relation
Not IS-A relation
Computer science
Not IS-A
IS-A
IS-A
Databases
Computer scientists
Algorithms
5Relation Extraction by Pattern
- (Ryu & Choi, 2007)
- http://cseight.kaist.ac.kr:8080/RelExt
Computer display mode
IS-A
Text mode
6Problem IS-A Relation Extraction from Wiktionary
- Web-based Collaborative Multilingual Dictionary
- 617,639 entries/401 languages
- IS-A relation extraction from Definition Pattern
- http://cseight.kaist.ac.kr:8080/Wiktionary
IS-A
IS-A
7Problem IS-A Relation Extraction from WordNet
- Semantic Word Net (English)
- 117,798 nouns, 82,115 synsets (Ver. 3.0)
- IS-A relation extraction through IS-A relations between synsets
Synset 12
engineering, applied science
IS-A
IS-A
Synset 22
Synset 23
Synset 33
chemical engineering
computer science, computing
electrical engineering
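A minimal sketch of how IS-A links between synsets can be lifted to IS-A pairs between terms, using the engineering example from this slide. The synset ids, the dictionary layout and the function name are assumptions made for illustration; this does not use the WordNet API.

```python
from itertools import product

# Toy data mirroring the slide; the synset ids are hypothetical labels.
SYNSETS = {
    "engineering": ["engineering", "applied science"],
    "chem_eng": ["chemical engineering"],
    "comp_sci": ["computer science", "computing"],
    "elec_eng": ["electrical engineering"],
}

# child synset id -> parent (hypernym) synset id
HYPERNYM_OF = {
    "chem_eng": "engineering",
    "comp_sci": "engineering",
    "elec_eng": "engineering",
}


def term_isa_pairs(synsets, hypernym_of):
    """Pair every term of a child synset with every term of its parent synset."""
    pairs = []
    for child_id, parent_id in hypernym_of.items():
        for child, parent in product(synsets[child_id], synsets[parent_id]):
            pairs.append((child, "IS-A", parent))
    return pairs


pairs = term_isa_pairs(SYNSETS, HYPERNYM_OF)
for pair in pairs:
    print(pair)
```

Each synset-level link yields one term-level pair per combination of child and parent lemma, so synonyms such as "computing" inherit the same IS-A relations as "computer science".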
8LMF
9Wikipedia IS-A Annotation
IS-A (Entry, Term in Page)
IS-A (Term in Page, Term in Page)
Synonymy (Entry, Term in Page)
10What is common representation?
11Linguistic Annotation Framework
- ISO-GrAF Graph Structure-based Annotation
- GrAF XML schema type hierarchy
- graphElementType Attributes ID, type
- edgeType extends graphElementType
- nodeType extends graphElementType
- spanType extends nodeType Attributes start, end
- graphElementSetType
- edgeSetType extends graphElementSetType
- nodeSetType extends graphElementSetType
- featureStructureType
- featureType
- annotationSetType
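The type hierarchy above can be sketched with Python dataclasses. This is an illustrative model only, not the GrAF schema itself: the class names follow the slide, while the Edge endpoints, the Annotation class and the helper function are assumptions added to make the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GraphElement:
    id: str
    type: str


@dataclass
class Node(GraphElement):
    pass


@dataclass
class Span(Node):
    # Character offsets into the primary text (stand-off annotation).
    start: int = 0
    end: int = 0


@dataclass
class Edge(GraphElement):
    # Assumption: an edge refers to its endpoints by node id.
    source: str = ""
    target: str = ""


@dataclass
class Annotation:
    # Assumption: a flat feature structure attached to one graph element.
    target: str
    features: Dict[str, str] = field(default_factory=dict)


def make_span(span_id, text, phrase):
    """Build a Span over the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Span(id=span_id, type="NP", start=start, end=start + len(phrase))


text = "Skin cancer is caused by sun exposure"
effect = make_span("n1", text, "Skin cancer")
cause = make_span("n2", text, "sun exposure")
link = Edge(id="e1", type="cause", source=cause.id, target=effect.id)
note = Annotation(target=link.id, features={"cue": "caused by"})

print(text[cause.start:cause.end], "->", text[effect.start:effect.end])
```

Because the spans only hold offsets, the primary text stays untouched and several annotation layers can point at the same region.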
12Problem Causality between Terms
- Causal relation between terms
- Term clustering based on inter-term causality
- Terms with similar causality tend to be similar concepts.
- Realization and evaluation
Skin cancer usually appears in adulthood, but
it is caused by sun exposure and sunburns
that began in childhood .
TG
Stat5
Interleukin-2
IL-2
Egr-1
IFN-gamma
13Is it true? Terms with similar causality tend to be similar concepts.
- The oral bacteria that cause gum disease appear
to be the culprit. Cigarette smoking and use of
smokeless tobacco products may also cause gum
disease. Gum disease is the second most common
cause of toothache
Periodontal disease can lead to toothache.
Cigarette smoking is the number one environmental
risk for periodontal disease.
14What to do
- Is it true?
- Terms with similar causality tend to be similar concepts
- We try to test term clustering based on causal information
- Prove that causality is one of the effective features for term clustering.
- Focus on
- Causal NP pair extraction (Chang and Choi, 2004)
- Causal term pair extraction
- Term clustering based on causal similarity
- Term clustering evaluation
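One way to picture "causal similarity" is to give each term a profile of its causes and effects and compare profiles. The sketch below uses the gum disease sentences from the previous slide; the pair list is read off those sentences by hand, and the Jaccard measure is an assumption chosen for illustration, not necessarily the measure used in the original work.

```python
from collections import defaultdict

# (cause, effect) pairs read off the example sentences by hand.
CAUSAL_PAIRS = [
    ("oral bacteria", "gum disease"),
    ("cigarette smoking", "gum disease"),
    ("smokeless tobacco", "gum disease"),
    ("gum disease", "toothache"),
    ("periodontal disease", "toothache"),
    ("cigarette smoking", "periodontal disease"),
]


def causal_profiles(pairs):
    """Map each term to the set of its known causes and effects."""
    profiles = defaultdict(set)
    for cause, effect in pairs:
        profiles[cause].add(("effect", effect))
        profiles[effect].add(("cause", cause))
    return profiles


def jaccard(a, b):
    """Overlap of two profiles: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


profiles = causal_profiles(CAUSAL_PAIRS)
similarity = jaccard(profiles["gum disease"], profiles["periodontal disease"])
print(similarity)
```

"Gum disease" and "periodontal disease" share a cause (cigarette smoking) and an effect (toothache), so their profiles overlap even though the two terms share only the word "disease".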
15Features on term clustering (1/3)
- Useful features for Term clustering
- Internal feature
- Word lexicon/structure in terms
- (Bourigault and Jacquemin, 1999) POS sequences including insertion
- NPDNInsAj Noun1 ((Adv? Adj)0-3 Prep Det? (Adv? Adj)0-3) Noun3
- 9398 precision
- Outer-term feature
- Structural modifier/modifiee of term
- Some words nearby term
- (Maynard et al., 2000)
- Hand-made semantic frame information
16Feature Structure Representation
- (1) Employee
- <SEX, female>, <NAME, Sandy Jones>, <AGE, 30>
- (2) Sound segment /p/
- <CONSONANTAL, +>, <ANTERIOR, +>, <VOICED, ->, <CONTINUANT, ->
- (3) Grammatical features of the verb love
- <POS, verb>, <VALENCE, transitive>, <SEMANTIC_RELATION, loving>,
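The feature-value pairs above map naturally onto dictionaries. The sketch below is illustrative: the "+" and "-" values for /p/ are assumed binary feature values, and the unify helper is an addition (flat unification only, no nested structures) that shows why a feature-value representation is convenient.

```python
# Feature structures as plain dictionaries of feature -> value.
EMPLOYEE = {"SEX": "female", "NAME": "Sandy Jones", "AGE": "30"}

# Assumption: "+" and "-" are the binary values of the phonological features.
SEGMENT_P = {"CONSONANTAL": "+", "ANTERIOR": "+", "VOICED": "-", "CONTINUANT": "-"}

VERB_LOVE = {"POS": "verb", "VALENCE": "transitive", "SEMANTIC_RELATION": "loving"}


def unify(a, b):
    """Merge two flat feature structures; return None if any value conflicts."""
    result = dict(a)
    for feature, value in b.items():
        if feature in result and result[feature] != value:
            return None
        result[feature] = value
    return result


merged = unify({"POS": "verb"}, {"VALENCE": "transitive"})
clash = unify(SEGMENT_P, {"VOICED": "+"})
print(merged)
print(clash)
```

Two compatible structures merge into one that carries both sets of features, while a conflicting value (a voiced /p/) makes unification fail.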
17FSR Graph vs. Matrix Notation
18Related Works on term clustering (3/3)
- Discussion
- Causal information is one of long-distance
contextual information
Cigarette smoking and use of smokeless tobacco
products may also cause gum disease.
cause
use
Gum disease
Smokeless tobacco product
19Event ternary extraction
Skin cancer usually appears in adulthood , but it
is caused by sun exposure and sunburns that began
in childhood .
Dependency Structure
appears
caused
by
and
usually
in
but
Skin cancer
it
is
Sun exposure
adulthood
Sunburns
began
in
that
child
NP chunking
Reference finding
Cue phrases filtering
Verb selection
Causal event pair candidate <cause event, cue phrase, effect event>
Skin cancer RNP caused by CNP sun exposure
Skin cancer RNP caused by CNP sunburns
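A toy sketch of the last step, turning a clause into <cause event, cue phrase, effect event> candidates. It assumes NP chunking and reference finding have already been done (so "it" is already resolved to "skin cancer"), handles a single cue phrase, and splits coordinated causes on "and"; the function name and the regular expressions are illustrative, not the original system.

```python
import re

CUE_PHRASE = "caused by"


def causal_candidates(effect_np, clause, cue=CUE_PHRASE):
    """Return (cause, cue, effect) candidates found in one clause."""
    match = re.search(rf"\b{re.escape(cue)}\s+(.+)", clause)
    if match is None:
        return []
    # Coordinated noun phrases after the cue each yield their own candidate.
    causes = [part.strip() for part in re.split(r"\band\b", match.group(1))]
    return [(cause, cue, effect_np) for cause in causes if cause]


candidates = causal_candidates(
    "skin cancer", "it is caused by sun exposure and sunburns"
)
for candidate in candidates:
    print(candidate)
```

The coordination "sun exposure and sunburns" produces two candidates for the same effect, matching the two pairs listed on the slide.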
20Representation Scheme
- Morpho-syntactic Annotation Framework
- Syntactic Annotation Framework
21Morpho-Syntactic Annotation Framework MAF
- <token id="t1">to</token>
- <token id="t2">eventually</token>
- <token id="t3">decide</token>
- <wordForm lemma="to_decide" tokens="t1 t3"/>
- <wordForm lemma="eventually" tokens="t2"/>
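The example shows that a word form may point at non-adjacent tokens ("to ... decide"). A minimal sketch of resolving those references with the Python standard library follows; the enclosing maf root element is an assumption added so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# The <maf> wrapper is hypothetical; tokens and word forms follow the slide.
MAF_FRAGMENT = """<maf>
  <token id="t1">to</token>
  <token id="t2">eventually</token>
  <token id="t3">decide</token>
  <wordForm lemma="to_decide" tokens="t1 t3"/>
  <wordForm lemma="eventually" tokens="t2"/>
</maf>"""

root = ET.fromstring(MAF_FRAGMENT)

# Index tokens by id, then resolve each word form's token references.
tokens = {tok.get("id"): tok.text for tok in root.iter("token")}
word_forms = {
    wf.get("lemma"): [tokens[ref] for ref in wf.get("tokens").split()]
    for wf in root.iter("wordForm")
}

print(word_forms)
```

Keeping tokens and word forms in separate elements lets the split infinitive map to one lemma without reordering the text.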
22MAF token
- <token id="t1">The</token>
- <token id="t2">victim</token>
- <token id="t3">'s</token>
- <token id="t4">friends</token>
- <token id="t5">told</token>
- <token id="t6">police</token>
- <token id="t7">that</token>
- <token id="t8">Krueger</token>
- <token id="t9">drove</token>
- <token id="t10">into</token>
- <token id="t11">the</token>
- <token id="t12">quarry</token>
- <token id="t13">and</token>
- <token id="t14">never</token>
- <token id="t15">surfaced</token>
- <token id="t16">.</token>
23Syntactic Annotation Framework
24Semantic Annotation Framework TimeML
- no more than 60 days
- <TIMEX3 tid="t1" type="DURATION" value="P60D" mod="EQUAL_OR_LESS">no more than 60 days</TIMEX3>
- the dawn of 2000
- <TIMEX3 tid="t2" type="DATE" value="2000" mod="START">the dawn of 2000</TIMEX3>
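The value attribute carries a normalized form of the expression, and mod qualifies it. A small sketch of reading the duration example with the Python standard library is given below; it only handles day-based durations of the form PnD, and the function name is illustrative.

```python
import re
import xml.etree.ElementTree as ET
from datetime import timedelta

TIMEX = (
    '<TIMEX3 tid="t1" type="DURATION" value="P60D" '
    'mod="EQUAL_OR_LESS">no more than 60 days</TIMEX3>'
)


def read_day_duration(xml_fragment):
    """Return (timedelta, mod) for a DURATION TIMEX3 of the form PnD, else None."""
    elem = ET.fromstring(xml_fragment)
    if elem.tag != "TIMEX3" or elem.get("type") != "DURATION":
        return None
    match = re.fullmatch(r"P(\d+)D", elem.get("value", ""))
    if match is None:
        return None
    return timedelta(days=int(match.group(1))), elem.get("mod")


duration, modifier = read_day_duration(TIMEX)
print(duration.days, modifier)
```

The surface text "no more than 60 days" is thus split into a machine-readable length (60 days) and a modifier stating that it is an upper bound.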
25ONTOLOGY EXTRACTION/LEARNING AND
QUESTION-ANSWERING
27Word Segmentation
28MULTILINGUAL INFORMATION FRAMEWORK
29IT Ontology
IT Core Ontology
30A Scenario
Control Server
Ontology Reasoner
Rule Reasoner
User
What is the best RTOS Vendor?
Do you know?
No
What is RTOS?
Real-time Operating System
What are instances?
VxWorks
Vendor?
Wind River
. .
Microsoft
Which is better?
31Dialogue acts
- Well-known examples of communicative functions (core dialogue acts)
- question
- WH-question
- YN-question
- check/verification
- statement/inform
- answer (WH-answer, YN-answer)
- confirmation, disconfirmation
- request
- instruct
- promise
- acknowledgement
- greeting
32General-purpose functions
- Applicable in any dimension are
- Information-seeking functions
- WH-question, YN-question, Alternatives-question, Check,..
- Information-providing functions
- Inform, WH-Answer, YN-Answer, Confirmation, Disconfirmation, Agreement, Correction,..
- Commissive functions
- Offer, Promise, AcceptRequest,..
- Directive functions
- Instruct, Request, Suggest,..
33DiaML concrete syntax
- <diaML id=d2 speaker=s addressee=a markable=m1 commfunctions=cfs1>
- <sourceText id=m1 sb1..se1=blabla sb3..se3=blabla>
- <cfs id=cfs1 taskFun=f1 feedbackFun=f2>
- <comfun id=f1 function=answer respTo=d1>
- <comfun id=f2 function=positiv respTo=d1>
- </cfs>
- </diaML>
34From sentence to ontologies
artifact
contents
device
ontology
camera
video
(camera, ISA, device) (camera, hasPropertyOf,
that AND (take video))
Triplets extraction
Dependency analysis
camera
is
device
takes
that
video
A camera is a device that takes video.
Term recognition
Sentence
A camera is a device that takes video.
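The path from the sentence to triplets can be imitated with a single definition pattern. This is a toy sketch: a regular expression stands in for the term recognition and dependency analysis steps, the pattern and function name are assumptions, and the property is kept in its surface form ("takes video") rather than lemmatized as on the slide.

```python
import re

# "A <term> is a <hypernym> [that <property>]." as a stand-in for parsing.
DEFINITION = re.compile(
    r"^(?:A|An|The)\s+(?P<term>\w+)\s+is\s+(?:a|an|the)\s+(?P<hyper>\w+)"
    r"(?:\s+that\s+(?P<prop>.+?))?\.?$",
    re.IGNORECASE,
)


def extract_triplets(sentence):
    """Return ISA and hasPropertyOf triplets for a simple definition sentence."""
    match = DEFINITION.match(sentence.strip())
    if match is None:
        return []
    triplets = [(match["term"], "ISA", match["hyper"])]
    if match["prop"]:
        triplets.append((match["term"], "hasPropertyOf", match["prop"]))
    return triplets


triplets = extract_triplets("A camera is a device that takes video.")
for triplet in triplets:
    print(triplet)
```

The ISA triplet places "camera" under "device" in the ontology, and the relative clause supplies the property that distinguishes it from other devices.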
35Standards for language processing
Access protocols: Corba, SOAP
Primary resources (text, dialogues): structural mark-up, basic annotations. TEI, MPEG7, TMX (XHTML), etc.
Knowledge structures: hierarchies of types, relations between concepts (subjects/topics etc.), links to primary resources. Topic Maps, OIL, RDF
Links
NLP structures (annotations): POS tagging, chunks (cf. Named Entities), deep syntactic structures, co-references etc. Eagles/ISLE, CES, MATE,
Lexical structures (Language models): terminologies, transfer lexica, LTAG/HPSG/LFG lexica. TBX, OLIF, Eagles/ISLE (Genelex)
Meta-data: Dublin core, OLAC, ISLE, MPEG7, RDF
36Context
- ISO TC37 - Terminology and other language resources
- SC3 - Computer applications in terminology
- ISO 12200 - Martif
- Latest version of TEI Terminology chapter
- ISO 12620 - Data categories
- ISO CD (DIS under ballot) 16642 - TMF (Terminological Markup Framework)
- SC4 - Language resources
37TC37/SC4 details
- Scope: Platform for designing and implementing linguistic resource formats and processes
- Multi-layer annotation of linguistic resources
- Exchange of information between NLP modules
- General strategy
- Involve a wide community from academia and industry
- Identification of experts in the various work items
- Involvement through national standardizing bodies
- Agenda
- Current identification of possible work items and working groups
- Constituency meeting and technical workshop at LREC (May 2002)
38Organization
- Chair
- Laurent Romary, France
- Secretary
- Key-Sun Choi, Korea
- International Advisory Committee
- Chair Prof. Antonio Zampolli, Italy
39SC4 and other standardizing bodies
- TEI
- text representation
- Reference for primary sources
- e.g. text archives
Oscar
Text
- W3C
- basic protocols and formats
- XML (Schemas)
- XPath
- XPointer
- RDF, SVG, SMIL, SOAP
ISO TC37/SC4 - language resources, NLP
perspective e.g. linguistic annotations, lexical
formats
Technical background
- What about gestures?
- Kinetic in the TEI
- SMIL?
MPEG - Multimedia, XML based e.g. MPEG7-4 Word
and phone lattices
Audio/Speech
40TC37/SC4 Work Items
- WG1/WI-0 Terminology of Language Resources
- WG1/WI-1 Linguistic annotation framework
- WG1/WI-2 Meta-data for multimodal and multilingual information
- WG2/WI-3 Structural content representation scheme
- WG2/WI-4 Multimodal content representation scheme
- WG2/WI-5 Discourse level representation scheme
41TC37/SC4 Work Items - cont.
- WG3/WI-6a Multilingual text representation
- WG4/WI-7 NLP Lexica
- WG5/WI-8 Net-based distributed cooperative work
for the creation of LRs
42WI-0
- Terminology of Language Resources
- Basic terminology of the various sub-fields of language resources and general methodology
- Project leader Klaus-Dirk Schmitz
- Sources
- ISO 1087
- LREC proceedings KAIST
- English dictionaries in Linguistics?
- Support from GTW
43WI-1
- Linguistic annotation framework
- Basic mechanisms and data structures for linguistic annotation and representation data architecture
- Methods and principles for the design of an annotation scheme
- Structural nodes and information units, Data category specification
- Linking and pointing mechanisms, Feature Structures, Meta-Markup
- Stand-off and in-line views - equivalences, combining levels.
- Administrative data categories
44WI-1 - cont.
- Project leader Nancy Ide (TBC)
- Contributors Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary
- Possible sources
- TMF, iso12620-revised, Mate (general methodology)
- TEI (Linking mechanisms, feature structures)
- Link with Linguistic DS
45WI-2
- Meta-data for multimodal and multilingual information
- Description of a meta-data representation scheme to document linguistic information structures and processes
- General content description
- Local content description
- Project leader Peter Wittenburg, MPI (Nijmegen, NL)
- Participants Steven Bird, TEI aware person
- Possible sources
- OLAC, Mile, TEI Header
- Liaison TC46 (SC9), MPEG7/MDS, SCORM
46WI-3
- Structural content representation scheme
- Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes
- Meta-model for morpho-syntactic annotation
- Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures)
- corresponding Data category registries
- Project leader John Carroll ??
- Participants Nuria Bell
- Possible sources
- Eagles, TAGML, Linguistic DS
- SIGPARSE
- Working group with representatives from existing
TreeBanks initiatives
47WI-4
- Multimodal meaning representation scheme
- Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural)
- Meta-model for content representation (Events, participants, etc.)
- Data category registry for multimodal content
- Project leader Harry Bunt (id1)
- Possible sources
- SIGSEM working group on semantic content
- Chair 1
- Liaison
- Semantic web activities
48WI-5
- Discourse level representation scheme
- Meta-model for discourse and dialogue representation
- Meta-model for discourse level annotation (e.g. reference annotation)
- corresponding DatCat registry
- Possible sources
- SIGDIAL
- DRI - Discourse Resource Initiative
- Mate
49WI-6
- Multilingual text representation scheme
- Framework for representing language specific and multi-lingual textual information
- Translation Memory
- Alignment of parallel corpora
- Word count algorithms (characters, words, segments)
- Possible sources
- TMX for translation memories
- TEI based linking mechanism (or see WI-1) for
Parallel texts
50- WI 6A
- Translation Memory, Alignment of parallel corpora
- Sources
- OSCAR/TMX for translation memories
- TEI based linking mechanism (or see WI-1) for
Parallel texts
51- WI 6b
- Segmentation and counting algorithms (characters, words, sentences etc.)
- Sources
- OSCAR
52- WI 6C
- Meta-markup for GIL (Globalization, Internationalization and Localization)
- Possible sources
- OSCAR/OpenTag
53WI-7
- NLP lexica
- Lexicon representation formats for the various types of NLP applications (Machine Readable Lexica)
- Define a set of meta-models (classes of applications)
- Specific data categories (derivation, phonology, etc.)
- Based on the work done in other work items
- Sources
- Eagles/multext
- ISLE Computational Working group/Genelex
- OLIF
54WI-8
- Net-based distributed cooperative work for the creation of LRs
- Principles and methods for designing collaborative and cooperative compilation of LRs
- Define what is specific to LRs with regards
- Traceability of resources, version control, validation, quality management
- Protocols (Corba, SOAP), Workflow standards, Data management
- Contacts Christian Galinski, Remi Zajac
- Sources
55Liaison - OSCAR
- Brief history of LR exchange standards
- Parallel events since 1997
- Open Tag - meta-markup (XML vs. Others)
- Major current OSCAR activities
- TMX - Translation Memory eXchange
- Counting and segmentation algorithms
- TBX (Terminologies) and OLIF (MT lexica)
- XLIFF and CGS - Annotation of source code and localisation of web sites
- xml:lang etc.
- J. DeCamp and S.-E. Wright
56Liaison - TEI
- General architecture and data modeling
- WI-1
- Annotations (paragraph level, external annotations)
- WI-1
- TEI Header
- WI-2
- NLP lexica
- WI-7
- Feature structures
- WI-1