Hamburg, 22-11-2004

1
The Basic Language Resources Kit (BLARK)
  • Steven Krauwer
  • Utrecht Institute of Linguistics UiL OTS / ELSNET

2
Overview
  • The BLARK Enterprise
  • How to arrive at it
  • The Dutch Language Union approach
  • Refining the concept
  • Defining a BLARK
  • Main beneficiaries
  • References
  • Concluding remarks

3
The BLARK Enterprise
  • Define the minimal set of language resources that
    is necessary to do any precompetitive R&D and
    professional education at all for a language (the
    Basic Language Resource Kit or BLARK)
  • Determine for each language which components are
    already available
  • Make a priority plan to complete the BLARK for
    each language
  • Ensure funding to get the work done

4
What are the components of a BLARK?
  • Lexicons (monolingual, multilingual, ...)
  • Corpora (language, speech; annotated,
    unannotated; mono- and multilingual; mono- and
    multimodal; ...)
  • Tools (annotation, exploration, ...)
  • Modules (lemmatizers, parsers, speech
    recognizers, TTS, transcribers, translation, ...)

5
What makes the BLARK Enterprise special?
  • The idea is to make a common generic BLARK
    definition, in principle applicable to all
    languages
  • The common definition will be based on the
    experience with different languages, and will
    prevent reinvention of wheels
  • The common definition will ensure
    interoperability and interconnectivity
    (especially for multilingual or cross-lingual
    applications)

6
Other benefits
  • Experience from other languages will help in
    making cost estimates
  • Adoption of a BLARK common to all languages may
    help in persuading funders to support the
    creation of the BLARK
  • Adoption of a common BLARK may facilitate porting
    of knowledge and expertise between languages

7
Words of caution
  • A BLARK definition will evolve over time, as new
    applications, application environments and
    technologies come up
  • A BLARK definition should be seen as a template
    rather than a dictate, as different languages may
    have different specific requirements
  • BLARK completion priorities may differ from
    language to language (on e.g. economic, social or
    political grounds)

8
How to define a BLARK and assign priorities
  • Methodology proposed by the Dutch Language Union
    (DLU) (Binnenpoorte et al., LREC 2002):
  • Identify a number of typical applications
  • Determine for each of them which technologies
    (modules) are needed to make them (-, +, ++, +++)
  • Identify for each module which resources it
    requires (-, +, ++, +++)
  • Assign the highest priority to the resources that
    support most applications
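
The prioritisation rule above can be sketched in a few lines of Python. This is only a minimal sketch, not the DLU's actual procedure: the application, module and resource names are hypothetical placeholders, and the sketch simply counts how many applications each resource supports, ignoring the importance ratings.

```python
from collections import Counter

# Hypothetical example data: which modules each application needs,
# and which resources each module requires (names are illustrative only).
APP_MODULES = {
    "dictation": ["speech recognizer", "language model"],
    "document search": ["tokenizer", "lemmatizer"],
    "translation aid": ["tokenizer", "lemmatizer", "parser"],
}
MODULE_RESOURCES = {
    "speech recognizer": ["speech corpus", "pronunciation lexicon"],
    "language model": ["text corpus"],
    "tokenizer": ["text corpus"],
    "lemmatizer": ["monolingual lexicon", "text corpus"],
    "parser": ["treebank", "monolingual lexicon"],
}


def resource_priorities(app_modules, module_resources):
    """Rank resources by the number of applications they support."""
    support = Counter()
    for modules in app_modules.values():
        # Count each resource once per application, however many of
        # that application's modules need it.
        needed = {r for m in modules for r in module_resources[m]}
        support.update(needed)
    # Highest support first; ties broken alphabetically for stable output.
    return sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))


priorities = resource_priorities(APP_MODULES, MODULE_RESOURCES)
```

With this toy data the text corpus comes first, because all three applications depend on it through at least one module.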

9
Proposed DLU priorities for NLP
  • treebank
  • robust parsers
  • tokenisation and named entity recognition
  • semantic annotations for the treebank
  • translation equivalents
  • evaluation benchmarks

10
Proposed DLU priorities for speech
  • automatic speech recognition
  • application-specific speech corpora
  • multi-media speech corpora
  • tools for transcription of speech data
  • speech synthesis
  • benchmarks for evaluation

11
Next steps by DLU
  • Make a survey of what exists and to what extent
    it is available (0-9 availability score)
  • Assign priorities (not just resources but also an
    infrastructure for maintenance and distribution)
  • Secure funding from Dutch and Flemish government
    for a national programme
  • Issue calls for proposals for collaborative
    resources projects (1st call closed Nov 2 2004)

12
Refining the concept
  • Items not really covered by the DLU teams
  • definition vs specification
  • availability
  • quality
  • quantity
  • standards
  • support
  • Addressed in the NEMLAR project

13
Definition / specification
  • It is not enough to say "a written language
    corpus"; what about:
  • size (types, tokens)
  • encoding
  • annotation
  • text types
  • representativity
  • domains
  • i.e. we need full specs

14
Availability
  • DLU: 0-9 scale, very impressionistic
  • Our proposal: 3 dimensions
  • accessibility
  • cost
  • modifiability
  • to each we assign a penalty score (0 is best)

15
Accessibility
  • 3 classes, with associated penalties
  • (3) existing, but only company-internal
  • (2) existing and freely usable for precompetitive
    research
  • (1) existing and freely usable for all R&D

16
Cost
  • 4 cost categories
  • (4) price over 10 keuro
  • (3) price between 1 and 10 keuro
  • (2) price between 100 and 1000 euro
  • (1) less than 100 euro
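
As an illustration, the four cost categories could be assigned by a small function like the one below. It is a sketch under an assumption: the slide does not say which class a price of exactly 100 or 1000 euro belongs to, so the boundary handling here is a guess, and the function name `cost_penalty` is made up for this example.

```python
def cost_penalty(price_eur: float) -> int:
    """Map a price in euro to a cost penalty (1 is cheapest, 4 most expensive).

    Assumption: prices of exactly 100 and 1000 euro go into the more
    expensive class; exactly 10 000 euro is still class 3, since class 4
    is described as "over 10 keuro".
    """
    if price_eur < 0:
        raise ValueError("price must be non-negative")
    if price_eur < 100:
        return 1
    if price_eur < 1000:
        return 2
    if price_eur <= 10_000:
        return 3
    return 4
```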

17
Modifiability
  • 3 categories
  • (3) black box: you get the resources as they are,
    and you cannot change or even inspect their
    internals
  • (2) glass box: you can't change them, but you can
    see what is inside
  • (1) open: resources freely manipulable

18
Comments on availability
  • we can now express availability in a 3-digit
    score (accessibility, cost, modifiability), which
    should be rather easy to assign objectively
  • the lowest scores are the best
  • if the accessibility score is 3, the other scores
    don't mean very much
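
One possible way to represent the 3-digit availability score is sketched below. The class and attribute names are invented for this example, and the value ranges follow the class numbers listed on the accessibility, cost and modifiability slides (1 is best in each).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Availability:
    accessibility: int  # 1-3, where 3 means company-internal only
    cost: int           # 1-4, where 4 is the most expensive class
    modifiability: int  # 1-3, where 3 means black box

    def __post_init__(self):
        # Reject values outside the classes listed on the slides.
        limits = {"accessibility": 3, "cost": 4, "modifiability": 3}
        for name, upper in limits.items():
            value = getattr(self, name)
            if not 1 <= value <= upper:
                raise ValueError(f"{name} must be between 1 and {upper}")

    @property
    def code(self) -> str:
        """The three penalties written as one 3-digit string."""
        return f"{self.accessibility}{self.cost}{self.modifiability}"

    @property
    def company_internal(self) -> bool:
        """True when cost and modifiability say little (accessibility 3)."""
        return self.accessibility == 3


example = Availability(accessibility=2, cost=1, modifiability=1)
```

Here `example.code` is the string "211": usable for precompetitive research, under 100 euro, and freely manipulable.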

19
Quality
  • We distinguish two types of quality: absolute
    (i.e. an inherent property of the resource) and
    relative (i.e. in relation to how you want to use
    it)
  • Absolute: standard-compliance and soundness
  • Relative: task-relevance and environment-relevance

20
Standard-compliance
  • criterion: to what extent is the resource based
    on a common standard (formal or de facto)?
  • possible values (penalty based):
  • (3) no standard
  • (2) standard, but not fully compliant
  • (1) standard and fully compliant

21
Soundness
  • criterion: to what extent is the resource based
    on well-defined specifications?
  • values:
  • (3) no specifications provided
  • (2) specs provided, but not fully compliant
  • (1) specs provided, fully compliant

22
Task-relevance
  • criterion (relative): to what extent is the
    resource suited for a specific task X?
  • values (3 binary values):
  • contains all information needed for X (yes/no)
  • has the proper size for X (yes/no)
  • based on a relevant selection of items for X
    (yes/no)

23
Environment-relevance
  • criterion: to what extent is the resource
    interoperable with its environment (other
    resources)?
  • values (3 binary values):
  • information matches (yes/no)
  • size matches (yes/no)
  • selection matches (yes/no)

24
Comments on quality
  • We can now express absolute quality objectively
    in terms of a pair of scores (standard-compliance,
    soundness); this score can be assigned by the
    provider
  • and relative quality (for our own purposes) in
    terms of two triples of yes/no answers
    (task-relevance, environment-relevance); this
    score can only be assigned by the user
  • other attributes may be added as long as they can
    be objectively assigned
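
The quality scores described on the preceding slides could be recorded roughly as follows. This is a sketch only: all class, attribute and method names are hypothetical, and the `fully_suitable` check is a convenience added for this example rather than anything the slides define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbsoluteQuality:
    """Assigned by the provider; penalty classes 1 (best) to 3 (worst)."""
    standard_compliance: int
    soundness: int


@dataclass(frozen=True)
class Relevance:
    """Three yes/no answers, used for both task- and environment-relevance."""
    information: bool
    size: bool
    selection: bool

    def all_yes(self) -> bool:
        return self.information and self.size and self.selection


@dataclass(frozen=True)
class RelativeQuality:
    """Assigned by the user, for one specific task and environment."""
    task: Relevance
    environment: Relevance

    def fully_suitable(self) -> bool:
        # Hypothetical summary: every one of the six answers is "yes".
        return self.task.all_yes() and self.environment.all_yes()


absolute = AbsoluteQuality(standard_compliance=2, soundness=1)
relative = RelativeQuality(
    task=Relevance(information=True, size=True, selection=True),
    environment=Relevance(information=True, size=False, selection=True),
)
```

Keeping the two records separate mirrors the slides: the provider can fill in `AbsoluteQuality` once, while each user fills in a `RelativeQuality` per intended use.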

25
Quantity
  • The DLU team did not try to formulate any
    quantitative requirements
  • We have tried to do this in the context of the
    NEMLAR project, see below for our tentative
    figures
  • Statistical approaches can swallow any amount of
    resources, and minimal figures are very hard to
    find
  • Our figure finding exercise has been very much
    example driven

26
Standards
  • Very few formal standards are around, although
    some exist (cf. Romary & Ide at LREC 2004
    workshop; Monachini et al., 2003)
  • Evolving de facto standards include:
  • Bottom-up work by committees (TEI)
  • Top-down actions
  • Projects aiming at standards (e.g. EAGLES, ISLE)
  • Example-setting R&D projects (e.g. WordNet,
    SpeechDat, Multext)
  • Our position: any standard is better than no
    standard at all

27
Defining a BLARK
  • Work carried out in the context of the NEMLAR
    project (www.nemlar.org), aimed at Arabic
    resources
  • Work described here based on project deliverables
    (see site), summarized in article by Maegaard,
    Krauwer, Choukri, Damsgaard presented at NEMLAR
    conference in Cairo (Sep 2004)

28
Approach adopted
  • Same strategy as Dutch Language Union
    (applications > modules > resources)
  • But with different results because of differences
    in social/economic situation and in language
    structure
  • Results follow, in terms of global definitions
    and tentative size indications (no specs provided
    at this stage, but project is still ongoing)
  • Feedback is welcome!!!!!!!!

29
Written resources (1)
  • Lexicon
  • For all components: 40 000 stems with POS and
    morphology
  • For sentence boundary detection: list of
    conjunctions and other sentence starters/stoppers
  • For named entity recognition: 50 000 human proper
    names
  • For semantic analysis: same 40 000, with
    subcategorization, shallow lexical semantic info,
    possibly a WordNet

30
Written resources (2)
  • Bi-/Multilingual lexicon
  • Same size as monolingual
  • Thesauri, ontologies, wordnets
  • Thesaurus subtree with ca 200-300 nodes for each
    domain
  • Ontologies and wordnets ideally same size as
    lexicon

31
Written resources (3)
  • Corpora
  • For term extraction: 100 million words
    unannotated
  • For small applications: 0.5 million words
    annotated
  • For statistical POS tagger: 1-3 million (ann)
  • Sentence boundary: 0.5-1.5 million (ann)
  • Named entity (stat based): 1.5 million (ann)
  • Term extraction: 100 million (ann)
  • Co-reference resolution: 1 million (ann)
  • WSD: 2-3 million (ann)
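
To show how such tentative figures might be used, the sketch below checks which tasks an annotated corpus of a given size would meet the minimum for. It makes two assumptions: where the slide gives a range, the lower bound is taken as the minimum, and the type of annotation is ignored (only size is compared). The names `MIN_ANNOTATED_MWORDS` and `supported_tasks` are invented for this example.

```python
# Tentative minimum sizes in millions of annotated words, taken from the
# slide; for ranges the lower bound is used (an assumption of this sketch).
MIN_ANNOTATED_MWORDS = {
    "small applications": 0.5,
    "statistical POS tagger": 1.0,
    "sentence boundary detection": 0.5,
    "named entity recognition (statistical)": 1.5,
    "co-reference resolution": 1.0,
    "word sense disambiguation": 2.0,
}


def supported_tasks(annotated_mwords: float) -> list[str]:
    """Return, alphabetically, the tasks whose minimum size is met."""
    return sorted(
        task
        for task, minimum in MIN_ANNOTATED_MWORDS.items()
        if annotated_mwords >= minimum
    )


tasks_for_one_million = supported_tasks(1.0)
```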

32
Written resources (4)
  • Multilingual corpora
  • For alignment: 0.5 million (tagged)
  • Multimodal corpora
  • For OCR (printed): ??
  • For OCR (hand-written): ??

33
Spoken resources (1)
  • Acoustic data
  • For dictation: 50-100 speakers, 20 min each,
    fully transcribed, plus 10 speakers for testing
  • For telephony: 500 speakers uttering 50 different
    sentences (SpeechDat, OrienTel based)
  • For embedded speech recognition: data similar to
    Speecon
  • For broadcast news transcription: 50-100 hours
    well-annotated, plus 1000 hours of
    non-transcribed data; should come with 300
    million words of non-annotated written text
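
For a rough sense of scale, the speaker-based figures above can be converted into total amounts of data. The sketch assumes that every speaker records the same amount and, for telephony, that each of the 500 speakers utters all 50 sentences; neither is stated explicitly on the slide.

```python
def total_hours(speakers: int, minutes_each: float) -> float:
    """Total recording time in hours for equally long speaker sessions."""
    return speakers * minutes_each / 60


# Dictation: 50-100 training speakers, 20 minutes each.
dictation_low = total_hours(50, 20)    # about 16.7 hours
dictation_high = total_hours(100, 20)  # about 33.3 hours

# Telephony: 500 speakers, 50 different sentences each (assumed).
telephony_utterances = 500 * 50        # 25 000 utterances
```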

34
Spoken resources (2)
  • Acoustic data (contd)
  • For conversational speech: data similar to
    CallHome/CallFriend from LDC
  • For speaker recognition: 500 speakers for
    training, 3 minutes each, transcribed, plus 100
    speakers for testing
  • For language/dialect identification: data similar
    to CallFriend, or from Broadcast News (esp. for
    variants of Arabic)
  • For speech synthesis: male and female speakers,
    15 hours, using a read text, phonetically
    balanced
  • For formant synthesis: same as above, with
    hand-labelled formants

35
Spoken resources (3)
  • Multimodal corpora
  • For lip movement reading: similar to M2VTS, with
    some 50 faces
  • Written corpora for speech technologies
  • General: 300 million words unannotated,
    preferably broadcast news or other press and
    media sources
  • For phonetic lexicon and language models: 1-5
    million words, annotated
  • For Arabic: vowelized and non-vowelized corpus

36
What next? (1)
  • Check definition and quantification for
    completeness and consistency, and correct where
    needed
  • Try to provide specs for every single item
  • Try to differentiate between general and Arabic
    in definitions and specs

37
What next? (2)
  • For each language
  • Take the BLARK definition and specs
  • Adapt to local conditions
  • Make a survey of what exists and what has to be
    made
  • Find the funds and build the BLARK for your
    language
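
The survey step listed above amounts to comparing the BLARK definition with an inventory of what already exists for a language. A minimal sketch, with hypothetical component names and a made-up inventory:

```python
def survey(blark_definition, inventory):
    """Split the required components into those available and those missing."""
    available = [c for c in blark_definition if c in inventory]
    missing = [c for c in blark_definition if c not in inventory]
    return available, missing


# Hypothetical component names and inventory for one language.
definition = ["monolingual lexicon", "annotated corpus", "treebank", "speech corpus"]
inventory = {"monolingual lexicon", "speech corpus"}

available, missing = survey(definition, inventory)
```

The `missing` list is what a priority plan and funding request would then have to cover.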

38
Prescriptive / descriptive
  • Prescriptive
  • the BLARK definition tells you which ingredients
    you need
  • the specification tells you what they should look
    like
  • Descriptive
  • a BLARK instantiation comes with a description of
    its components

39
Main beneficiaries (1)
  • academic and industrial researchers: material to
    try out ideas and conduct pilot studies
  • industrial developers: only for generic
    activities, since specific applications require
    more user and domain orientation
  • educators: material for experimental work by
    students in labs

40
Main beneficiaries (2)
  • probably not the main languages in Europe (EN,
    FR, GE) as they are pretty well covered anyway
  • mostly the languages that are not supported by a
    strong market (because of small size or poor
    economy)

41
References
  • Binnenpoorte et al. at LREC 2002 (see also
    www.elsnet.org/dox/lrec2002-binnenpoorte.pdf)
  • ELRA Newsletter vol 3, n 2, 1998 (see also
    www.elsnet.org/blark.html)
  • NEMLAR: see www.nemlar.org for
  • Arabic BLARK Report
  • NEMLAR presentation at Cairo conference
  • Romary & Ide at LREC 2004 (see also
    www.elsnet.org/lrec2004-roadmap/Romary-Ide.ppt)

42
Concluding remarks
  • The BLARK aims at providing a common definition
    of the notion "minimal set of resources"
  • It should help language communities to come
    closer to the idea of creating an equal playing
    field, in spite of market forces
  • It should facilitate porting of expertise
  • It is necessarily dynamic, as technologies evolve
    rapidly

43
Thanks!
  • Contact
  • steven.krauwer_at_elsnet.org