1 European Language Resources Association
A European Infrastructure for Language Resource
Distribution and HL Technology Evaluation
Khalid CHOUKRI, ELRA/ELDA, 55 Rue Brillat-Savarin,
F-75013 Paris, France. Tel.: 33 1 43 13 33 33;
Fax: 33 1 43 13 33 30. Email: choukri_at_elda.fr. Web:
http://www.elda.fr/
2 Outline
- Rationale behind ELRA
- ELRA's mission, structure, and services
- Membership
- Identification
- Distribution
- Legal issues
- Market forecasts, needs, and requirements
- Promotion
- ...
- ELRA Catalogue: a quick overview, BLARK, ...
- Activities in Europe / European national scenes;
role of ELRA; the ENABLER initiative
- Conclusion
3 European Language Resources Association: An
Improved Infrastructure for Data Sharing
Centralized not-for-profit organization for the
collection, distribution, and validation of
Language Resources and tools.
Operational agency: ELDA
(Evaluation and Language Resources Distribution
Agency)
4 The Association
- Membership Drive
- ELRA is open to European and non-European
institutions
- Resources are available to members and non-members
- Pay per resource
- Some of the benefits of becoming a member
- Substantial discounts on LR prices (over 70%)
- Legal and contractual assistance with respect to
LR matters
- Access to validation and production manuals
(quality assessment)
- Figures and facts about the market (results of
ELRA surveys)
- Newsletter and other publications
5 European Language Resources Association: An
Improved Infrastructure for Data Sharing
An association of users of Language Resources
- A Repository Center?
- Technical and logistic issues
- Commercial issues (prices, fees, royalties)
- Legal issues (licensing, IPR)
- Information dissemination
Application to Norwegian
Infrastructure for the evaluation of Human
Language Technologies: providing resources,
tools, methodologies, logistics, exit strategies
/ capitalization on evaluation packages
6 ELRA Offer
7 Membership Drive
- Colleges: Speech, Written, Terminology
- Membership fees: 4 categories
8 Legal Issues: Licensing
Provider-User Agreements
9 Legal Issues: Licensing
Distribution Agreement
10 Quick Overview: Basic Language Resources ---
Spoken and Written Resources
- What should be available for all languages
- Lexicons based on
- PAROLE
- SIMPLE
- (Euro)WordNet
- and more generally EAGLES/ISLE
- Corpus
- (Country/language) National Corpus
- (...) Business/scientific Corpus
- (...) Broadcast News - Transcriptions
- Multilingual/Bilingual
- Lexica
- Corpora (Comparable / Aligned / Parallel)
11 Quick Overview: Basic Language Resources ---
Spoken and Written Resources
- What should be available for all languages
- Articulatory databases (e.g. ACCOR)
- Basic speech data (some phonetic material and
some phonetic sequences, by a small number of
speakers, recorded in a quiet environment)
(EUROM 1, BABEL)
- Pronunciation lexicon (BDLEX, PHONOLEX)
- Proper names pronunciation lexicon (ONOMASTICA)
- Newspaper read text (BREF, Siemens-100, Apasci)
- Basic telephone speech (SPEECHDAT)
- Telephone-based speaker verification (PolyVar)
- Text corpora for language models (MLCC, Le
Monde, ...)
12 BLARK: Basic LAnguage Resource Kit
13 Basic Speech Resources (Europe)
Legend:
- A: Available through ELRA
- S: Available through ELRA within the next quarter
- E: Exist/identified but not (never!) available
- "blank": Probably not available / has not been identified
- U: Under completion / well-advanced project with
distribution plans
We exclude the lexicon that comes with SpeechDat.
Available through German telecoms.
14 Languages
15 Funding(s) of Language Resources
- Public funding
- Commission of the European Union (e.g. R&D FPs)
- National agencies and authorities
- Private funding
Criteria for Language Resources funding ... !
16 Brief Overview of Recent Activities in
Europe: European Union Level
Some projects within FP5 and previous FPs
related to our concerns:
- Resources production: SpeechDat family
- Specifications of new types of resources: Natural
Interaction and Multimodality, within the ISLE
(International Standards for Language Engineering) project
- Standards/metadata: EAGLES and its extension,
the EU/US collaborative project ISLE;
coming soon: INTERA
- Coordination: ENABLER; coming soon: NEMLAR
- Information gathering and dissemination: Euromap
and its follow-up Hope
17 SpeechDat Family
- SpeechDat(M): fixed telephone network,
1K speakers
- SpeechDat-II: fixed, mobile, 1-5K speakers
- SpeechDat-II: speaker verification
- SpeechDat-E (CEE: Polish, Czech, Slovak, Russian,
Hungarian)
- SALA (Speech Across Latin America) and now
SALA-II
- SpeechDat-Car (inc. cellular)
- SpeeCon (consumer products)
- Orientel
18 SpeechDat Family
19 SpeeCon Project
20 SpeechDat Family: SALA-II, what you may get
with PRIVATE funding
SALA-II: cellular/mobile network (1000
speakers)
Partner
Latin America, US and Canada
21 Brief Overview of Recent Activities at National
Level
Top-down vs bottom-up approaches
22 Examples of National Projects/Programs
Over nine national projects, among which:
Netherlands and Belgium: continuing; now at Release
5; data available via ELRA (release of April 2002)
France: Action Techno-Langue
Italy: Infrastruttura nazionale per le risorse
linguistiche nel settore del trattamento
automatico della lingua naturale parlata e
scritta (National infrastructure for language
resources in the field of automatic processing of
spoken and written natural language)
Norway: Norwegian Language Bank
23 Dutch and Flemish
24 National Projects/Programs: Example of Italy
With contribution from N. Calzolari and A.
Zampolli
- 2 national projects under 2 different programs.
- The programs were not specific to HLT, but
general:
- one for industrial R&D
- and the other for the South of Italy.
- Both projects are coordinated by A. Zampolli in
Pisa.
- Goal: to extend core resources built in EU
projects, create new LR, the tools needed to
manage the resources, a platform for NLP
development, and technology transfer towards SMEs.
25 National Projects/Programs: Example of Italy
TAL - Infrastruttura nazionale per le risorse
linguistiche nel settore del trattamento
automatico della lingua naturale parlata e
scritta (with 13 partners of private
organisations). Duration: 2 years, finished in
2002.
Partners:
- CPR - Consorzio Pisa Ricerche
- ITC - Istituto Trentino di Cultura
- CSELT - Centro Studi e Laboratori Telecomunicazioni
- SYNTHEMA
- CVR - Consorzio Venezia Ricerche
- CERTIA - Centro per la Ricerca, Sviluppo,
Formazione nelle Tecnologie e Applicazioni Informatiche
- QUINARY
- ALCEO
- COMPUTER SHARING
- DELCO
- GST - Gruppo Soluzioni Tecnologiche
- INTERACTIVE MEDIA
- NECSY - Network Control Systems
26 National Projects/Programs: Example of Italy
- LCRMM
- Linguistica computazionale: ricerche monolingui
e multilingui (Computational linguistics:
monolingual and multilingual research)
- (cluster "Linguistica", Law 488, with 16
partners of private and public organisations).
- Duration: 3 years; will finish in 2003.
- Partners
- CPR, Pisa; CIRASS, Napoli; THAMUS, Salerno;
ILC-CNR, Pisa; SYNTHEMA, Pisa
- Istituto Universitario Orientale, Napoli;
Dipartimento di Scienze Storiche del Mondo
Antico, Università di Pisa; Sportello per la
Cooperazione Scientifica e Tecnologica con i
Paesi del Mediterraneo (SMED) del CNR, Napoli.
27 National Projects/Programs: Example of Italy
Italy: Infrastruttura nazionale per le risorse
linguistiche nel settore del trattamento
automatico della lingua naturale parlata e
scritta
- ItalWordNet (50,000 entries).
- Corpus di italiano parlato (corpus of spoken
Italian): 100 hours of speech consisting of
- a) 10h radio/TV broadcast data (news bulletins,
interviews, talk shows),
- b) 60h Map Task-like collection
- c) 5h lab data for lexical coverage
- d) 10h telephone conversational speech
- e) 10h domain-specific (finance, tourist
information, etc.)
- Annotated dialogues for speech interfaces (H-H
and H-M interactions)
- (Dialoghi annotati per applicazioni di
interfacce vocali avanzate: annotated dialogues
for advanced voice interface applications)
- 450 dialogues annotated at all levels
(morphological, prosody, semantics, ...)
Bergen 2002/10/24-25 Norwegian Language Bank
28 National Projects/Programs: Example of Italy
The goals were to extend core resources built in EU
projects, create new LR, the tools needed to manage the
resources, a platform for NLP development, and
technology transfer towards SMEs. The total cost
was about 7 million euro, with funding of almost
5 million euro. The costs were equally divided
between the Spoken and Written areas. In both projects
the consortia agreed to distribute the LR through
ELRA (with a special price for Italian
users). Now, after the TIPI conference in Rome,
under the sponsorship of the Ministry of
Communications, the topic of HLT has been
inserted in the Framework Programme for the
financing of R&D in Italy. It was also decided
to constitute a Forum for HLT, of which Zampolli
is president. The Forum will start working soon,
also to prepare new national initiatives, to
maintain LR, to write a white paper on HLT in
Italy, to coordinate with national activities in
other EU countries, etc.
29 Example of France: National Projects/Programs
France: TechnoLangue Action
With contribution from J. Mariani
30 Ministère de la Culture et de la
Communication; Ministère de la Jeunesse, de
l'Education Nationale et de la Recherche; Ministère
de l'Economie, des Finances et de
l'Industrie
Language Technologies: TechnoLangue Action
31 TechnoLangue action
- Report to Prime Minister (November 2000)
- Meeting of Min. Industry, Research, Culture (June
2001)
- Action: technology survey and evaluation
- Basic technological research
- Articulate with present actions
- Research and Innovation Technological Networks
- 4 ICT RRIT: Telecommunications, Software,
Micro-nanotechnologies, Audiovisual and multimedia
- Ministry of Research action on Technological
Survey (VSE)
32 TechnoLangue structure
- Infrastructure program to support technological
innovation, while existing R&D projects stay with
RRIT and VSE (120 M euro / year)
[Diagram labels: TECHNOLANGUE, RNRT, RNTL, RIAM, VSE]
33 Usage Evaluation
Meeting points with technology development
Quantitative Evaluation
Basic Research
Technology Development
Application Development
Technologies needed for applications
Bottleneck identification
Research results in quantitative evaluation
Technologies which have been validated for
applications
Long term / high risk
Large return on investment
Usability, acceptability
Evolutionary
34 TechnoLangue action
- Organization
- Executive Committee (EC) chaired by C. Fluhr
(CEA)
- Comprising 15 members
- 3 RRIT representatives: B. Bachimont (INA -
RIAM), C. Sedogbo (Thalès - RNTL), C. Waast (IBM
- RNRT)
- 3 Public research: C. Fluhr (CEA), E. Geoffrois
(DGA), P. Paroubek (Limsi-CNRS)
- 5 Industrials: K. Choukri (ELDA), B. Normier
(Lingway), J.-J. Rigoni (Elan Informatique), F.
Segond (Xerox), C. Sorin (FT R&D)
- 4 Administrations: S. Chaudiron (MR), J. Mariani
(MR), D. Malbert (MCC), J. Mathieu (MinEFI)
- Good balance between research and industry,
written and spoken
35 TechnoLangue action
- Install a User Committee
- Ministry of Foreign Affairs
- Automatic translation, multilingualism
- Ministry of Public Administration
- Simplification of administrative language...
- Ministry of National Education
- Training technologies, language training...
36 TechnoLangue Call
- International cooperation
- Cooperation mechanisms within TechnoLangue
- foreign entities may participate in the projects
- financing from their own funds
- Future cooperation among similar national
programs
- European countries (Italy, Germany, Norway, Spain,
Greece, The Netherlands, Switzerland)
- Prepare the construction of the European Research
Area (ERA)
- The EC supports the cost of coordination and
generic technologies
- Each country supports the cost of covering its
language(s): specific technology
development/adaptation, (annotated) corpus
(spoken/written), lexicon (incl. pronun.),
dictionaries...
- USA, Japan, South Africa
37 TechnoLangue Call
- 4 meetings of the Executive Committee
- A Call for Proposals with 4 parts
- Part 1: Language resources
- Part 2: Evaluation
- Part 3: Norms and standards
- Part 4: Technological survey
- Calendar
- Launched: April 15, 2002
- Deadline: May 31 / June 10 (electronic), June
17 (paper)
- Results: July 19, 2002
38 TechnoLangue Call
- Language resources
- Spoken/written data (corpus, dictionaries,
terminological data)
- Basic Language Processing Tools (Open Source)
- Production, validation, distribution (incl.
legal, economic aspects)
- For wide use by a large community (education,
training)
- Evaluation
- Technology (evaluation campaign)
- Applications (evaluation toolkits)
- Methodology (metrics / protocols)
- Norms and standards
- Shared effort to improve French participation
- Technological survey
- In relationship with ongoing actions (Euromap...)
39 Part 1: Language Resources
- Stimulate the production and the distribution of
language resources for
- answering minimal needs (Basic LAnguage Resource
Kit) for the French language
- promoting resource reusability
- supporting research
- helping industrial applications development
- decreasing the cost of entering the sector for
newcomers
- Should include the French language, possibly in
connection with other languages
40 Part 1: Language Resources
- Spoken and written data
- oral corpus, pronunciation lexicons, etc.
- databases for speech synthesis
- monolingual and multilingual text corpus
(parallel, comparable...)
- lexicons, terminology, grammars,...
- Lexical semantic resources: ontologies,
thesauri,...
- Multimodal corpus, etc.
- Basic software tools
- morphosyntactic taggers, syntactic parsers,
semantic tools,
- terminology extractors,
- language identifiers,
- corpus annotation tools,
- lemmatizers, etc.
41 Part 1: Language Resources
- Encourage and facilitate the use of those
resources
- Putting them in new (young) users' hands
- Same approach as for GUIs / VUIs
- Language Technology Kits with user's guide
- Distribution towards specialized education
entities (NLP, Document Engineering) and more
broadly towards training centers (Universities,
Technical Universities, Engineering schools...)
- While ensuring feedback from experience
- Open Source software economic model
42 Part 2: Evaluation
- 3 areas
- Technology evaluation
- Application evaluation
- Evaluation methodologies
43 Part 2: Evaluation
- Technology evaluation
- Organization of comparative evaluation campaigns
for technologies presently not covered by
European or international programs, or with a
complementary approach
- Includes the production of the data necessary for
the evaluation, in a monolingual, multilingual or
crosslingual context
- Scientific and industrial interest of the
evaluation should be apparent (large enough number of
participants)
- The projects must define the evaluation
methodology and justify the practical
organization aspects
44 Part 2: Evaluation
- Application evaluation
- The objective is to develop evaluation
methodologies for industrial or pre-industrial
products
- The methodologies may result in toolboxes, also
grouping user-oriented methodologies and
protocols, or in test software packages
- The methodologies should be generic (class of
applications)
- The proposals should demonstrate the project's
economic and industrial interest, and the
modalities for distributing the toolboxes
45 Part 2: Evaluation
- Evaluation methodologies
- Improve the present evaluation methodologies
- Identify new (quantitative and qualitative)
approaches for already evaluated technologies
- socio-technical and psycho-cognitive aspects
- cognitive modeling of evaluation
- Identify protocols for new technologies and
applications
- Virtual Reality, Multimodal interaction, Language
on the Internet...
46 Part 3: Standards
- Support the participation of French actors in
normalization and standardization bodies
- Presently weak participation of French actors in
normalization and standardization bodies
- Of strategic importance
- Variety of places where the normalization
activities are taking place: official or
non-official committees, forums, projects,...
47 Part 3: Standards
- Actions
- Support the creation of consortia to reinforce
the French presence in various bodies (ISO, CEN,
W3C,...)
- Help the sharing of efforts among French
participants
- Identify a topic and ensure a permanent
participation in all related bodies: character
sets, exchange format, phonetic alphabet
transcription, etc.
- Necessity of articulating the project with French
bodies already involved: AFNOR, W3C French
Chapter,...
48 Part 4: Survey
- Install an information survey
- Create a portal on Language Engineering in order
to give access to
- a panorama of the industrial and technological
offer
- the state of the art in science and technology
- identification of language resources
- identification of technological bottlenecks
- a list of Calls for Proposals
- a presentation of key market figures
- information on norms and standards (with
Internet links)
- Should be linked with existing sites
(Euromap,...)
49 Results
- 52 proposals submitted
- Total proposal costs: 35.9 M euro
- Total requested support: 21.7 M euro
- Clustering within each of the 4 topics
- 26 projects selected
- 173 participations, 94 participants
- 33 industry
- 39 public research
- 11 other (Associations, CEA, DGA)
- 11 foreign (Bell Labs, NII, EPFL, LATL)
- Budget: 6.2 M euro
50 Results
- 26 selected projects
- 8 on Language resources
- BLARK (cf. BNC), Fr-En, G, Sp, It, Arabic
dictionaries
- Specialized (aerospace, automotive), proper
names dictionaries
- Aligned corpus (7 novels of 19th-century literature
in 4 languages)
- 6 on Tools (Open source)
- Lemmatizer, Chunker, Guesser, Tagger, Parser,
Speaker recogn., Topic and NE detector, summarizer,
term. extractor, Search engine...
- 3 on Standards (Spoken / Written)
- 1 on Technological survey (Portal)
- 8 on Evaluation: 7 on technology, 1 on usage
evaluation
51 Technology Evaluation
- Written language
- Machine translation
- Text alignment
- Syntactic parsing
- Information query
- Spoken language
- Speech transcription / indexing (incl. Named
Entity)
- Speech synthesis
- Spoken dialog
52 French Techno-Langue: Conclusions
- Launch a large national program on Language
Technology (TechnoLangue)
- In the perspective of installing a permanent
infrastructure for Language Resources,
Evaluation, Standards and Survey
- Hope that it can participate in the construction
of the European Research Area
- And articulate well with international activities
53 Example of Norway: National Projects/Programs
Norway: Norwegian Language Bank
- language technology resources in Norway
- Launch conference: 24-25 October 2002 (Bergen,
Norway)
- The language bank will contain three types of
data: spoken data, text, and lexical resources.
- It will be organized as a foundation with state
ownership.
- The estimated budget is about NOK 100 million
(12 M)
54 ENABLER: European National Activities for Basic
Language Engineering Resources
55 Information Dissemination
(Bilingual English/French, issued each quarter)