A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch
D. Binnenpoorte, F. de Vriend, J. Sturm, W. Daelemans, H. Strik, C. Cucchiarini

Introduction
A field survey of Dutch language resources has been carried out within the framework of a project launched by the Dutch Language Union (Nederlandse Taalunie) with the aim of strengthening the position of Dutch in Human Language Technologies (HLT). This field survey was done in three stages:
  • 1 BLARK: the Basic Language Resource Kit (BLARK) is a wish list for an ideal HLT field.
  • 2 Inventory of available HLT resources.
  • 3 Priority list: the priority list indicates which materials need to be developed to complete the BLARK. It was drawn up by comparing the inventory with the definition of the BLARK.
The method described can be adopted for languages other than Dutch.

Conclusions
  • The current HLT infrastructure is scattered, incomplete, and not sufficiently accessible.
  • The available modules and applications are often poorly documented.
  • There is a great need for objective and methodologically sound comparisons and benchmarking of the materials.
  • The components that constitute the BLARK should be available at low cost or free of charge.

Recommendations
  • Establish an HLT agency to collect, document and maintain existing parts of the BLARK.
  • Complete the BLARK by encouraging funding bodies to finance the development of the prioritized resources.
  • Make the BLARK available to academia and HLT industry under the conditions of open source development.
  • Develop benchmarks, test corpora, and a methodology for objective comparison, evaluation and validation of parts of the BLARK.
  • Promote HLT education.
  • Ensure that enough funding is assigned to fundamental research.

Feedback
Feedback from the HLT field (academia and industry) was collected at a workshop with about 100 participants.

1 BLARK
  • In defining the BLARK a distinction was made between:
      • Applications
      • Modules
      • Data
  • A matrix (fragment in Figure 1) was drawn up describing:
      • which modules are required for which applications
      • which data are required for which modules
      • the relative importance of the modules and data.
Figure 1: fragment of the matrix; cells mark modules and data as important or very important.
Based on the full matrix a BLARK was defined.

2 Inventory of available resources
A second matrix (fragment in Figure 2) describes the availability of the components in the BLARK.
Figure 2: fragment of the availability matrix, scored from 1 (module or data set is unavailable) to 10 (module or data set is easily obtainable).
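
The two matrices lend themselves to a simple data-structure sketch. What follows is a minimal illustration, not the survey's actual data: only the two importance levels and the 1-to-10 availability scale come from the poster. The numeric importance encoding, the application names, the matrix entries, the scores, and the function names are all hypothetical placeholders.

```python
# Assumed numeric encoding of the two importance levels in Figure 1.
IMPORTANT, VERY_IMPORTANT = 1, 2

# Requirements matrix, part 1: application -> {module: importance}.
# All entries are made-up placeholders, not survey results.
MODULES_FOR_APPLICATION = {
    "dictation": {"automatic speech recognition": VERY_IMPORTANT},
    "document retrieval": {
        "text pre-processing": VERY_IMPORTANT,
        "syntactic analysis": IMPORTANT,
    },
}

# Requirements matrix, part 2: module -> {data set: importance}.
DATA_FOR_MODULE = {
    "automatic speech recognition": {"speech corpora": VERY_IMPORTANT},
    "syntactic analysis": {
        "treebank": VERY_IMPORTANT,
        "monolingual lexicon": IMPORTANT,
    },
}

# Availability matrix: component -> score from 1 (unavailable)
# to 10 (easily obtainable). Scores are invented for illustration.
AVAILABILITY = {
    "automatic speech recognition": 4,
    "text pre-processing": 6,
    "syntactic analysis": 3,
    "speech corpora": 5,
    "treebank": 2,
    "monolingual lexicon": 8,
}


def components_for(application):
    """Collect the modules and data sets an application depends on.

    Returns {component: importance}, keeping the highest importance
    when a component is reached along more than one path.
    """
    needed = {}
    modules = MODULES_FOR_APPLICATION.get(application, {})
    for module, importance in modules.items():
        needed[module] = max(importance, needed.get(module, 0))
        for data_set, data_importance in DATA_FOR_MODULE.get(module, {}).items():
            needed[data_set] = max(data_importance, needed.get(data_set, 0))
    return needed


def availability_report(application):
    """Pair each required component with (importance, availability score)."""
    return {
        name: (importance, AVAILABILITY[name])
        for name, importance in components_for(application).items()
    }
```

Keeping the requirements and the availability scores in separate mappings mirrors the poster's split between the BLARK definition and the inventory, so either can be revised without touching the other.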
  • An inventory was made to establish which of the components (modules and data) that make up the BLARK are:
      • available, i.e. can be bought or are freely obtainable, e.g. as open source
      • (re-)usable.
  • The inventory was based on:
      • expert knowledge
      • information found on the internet and in the literature
      • personal communication with actors in the field.
  • Components can only be considered usable if they are of sufficient quality, which calls for a quality evaluation.
  • This evaluation was limited to a descriptive level: modules and data were checked against a list of evaluation criteria.
  • BLARK for speech technology
      • Modules
          • Automatic speech recognition
          • Speech synthesis
          • Tools for calculating confidence measures
          • Tools for identification
          • Tools for (semi-)automatic annotation of speech corpora
      • Data
          • Speech corpora for specific applications
          • Multi-modal speech corpora
          • Multi-media speech corpora
          • Multi-lingual speech corpora
          • Benchmarks for evaluation
  • BLARK for language technology
      • Modules
          • Robust modular text pre-processing
          • Morphological analysis and morpho-syntactic disambiguation
          • Syntactic analysis
          • Semantic analysis
      • Data
          • Monolingual lexicon
          • Annotated corpus of text (a treebank)
          • Benchmarks for evaluation

Comparison
3 Priority list
Speech technology
  1. Automatic speech recognition (including non-native ASR, robust ASR, adaptation, and prosody recognition)
  2. Speech corpora for specific applications (e.g. directory assistance, CALL)
  3. Multi-media speech corpora (speech corpora that also contain information from other media such as newspapers, WWW, etc.)
  4. Tools for (semi-)automatic transcription of speech data
  5. Speech synthesis (including tools for unit selection)
  6. Benchmarks for evaluation
Language technology
  1. Annotated corpus of written Dutch
  2. Syntactic analysis: robust recognition of sentence structure
  3. Robust text pre-processing: tokenization and named entity recognition
  4. Semantic annotations for the treebank mentioned above
  5. Translation equivalents
  6. Benchmarks for evaluation
  • Requirements for prioritization:
      • the components should be relevant for a large number of applications
      • the components should currently be either unavailable, inaccessible, or of insufficient quality
      • developing the components should be feasible in the short term.
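
The three requirements above can be read as a filter followed by a ranking. The sketch below is one possible reading, not the procedure used in the survey: the `Component` fields, the `prioritize` function, and both threshold values are assumptions made for illustration, and feasibility is treated as a yes/no expert judgment.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A BLARK module or data set (hypothetical record layout)."""
    name: str
    applications: int          # number of applications that need it
    availability: int          # 1 (unavailable) .. 10 (easily obtainable)
    sufficient_quality: bool   # outcome of the descriptive quality check
    feasible_short_term: bool  # expert judgment, assumed to be yes/no


def prioritize(components, min_applications=3, max_availability=5):
    """Return the components that meet all three requirements.

    The thresholds are illustrative defaults, not values from the survey.
    Results are ordered by relevance (most applications first), with the
    least available component first among equally relevant ones.
    """
    candidates = [
        c for c in components
        # Requirement 1: relevant for a large number of applications.
        if c.applications >= min_applications
        # Requirement 2: unavailable or inaccessible, or of insufficient quality.
        and (c.availability <= max_availability or not c.sufficient_quality)
        # Requirement 3: development is feasible in the short term.
        and c.feasible_short_term
    ]
    return sorted(candidates, key=lambda c: (-c.applications, c.availability))
```

A component that is easy to obtain and of sufficient quality is filtered out even when many applications need it, which matches the idea that the priority list only covers what is still missing from the BLARK.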