Technology is an effective tool to promote use of Basque: Strategies to develop HLT for minority languages

1
Technology is an effective tool to promote use
of Basque: Strategies to develop HLT for minority
languages
  • IXA Research Group on NLP
  • University of the Basque Country
  • Dublin 2006

2
Outline
  • Basque language
  • Ixa Group
  • Strategy to develop HLT
  • Applications
  • Tools
  • Linguistic resources

01/04/06
3
History of Basque
4
Basque nowadays
  • 1,033,900 speakers (first language: 700,000)
  • Non-homogeneous distribution!
  • Six different dialects!
5
Main reasons for the regression of Basque
  • Not an official language
  • Excluded from the education system
  • 6 dialects!
  • Absent from the media
  • Absent from industry

6
Main reasons for the regression of Basque
  • Not an official language
  • Excluded from the education system
  • 6 dialects!
  • Absent from the media
  • Absent from industry
  • But since 1980...
  • Co-official language
  • Integrated in education (even at university)
  • Unified Basque (1966)
  • TV, newspapers...
  • Absent from new ICTs???

7
Basque: linguistic features
  • Case suffixes and free order of sentence
    components
  • Example: "The dog brought the newspaper in his mouth"
    Txakur-rak      egunkari-a       aho-an          zekarren.
    the-dog         the-newspaper    in-his-mouth    brought
    ergative-3-s    absolutive-3-s   inessive-3-s
    Subject         Object           Modifier        Verb
  • Alternative possible orders
  • Txakur-rak aho-an egunkari-a zekarren.
  • Txakur-rak aho-an zekarren egunkari-a.
  • Egunkari-a txakur-rak zekarren aho-an.
  • ...
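
The reordering above works because each constituent carries its grammatical role in its case suffix rather than in its position. A minimal Python sketch of that idea, assuming the hyphen segmentation used on the slide and a three-entry suffix table read off this one sentence (`CASE_BY_SUFFIX` and `roles` are illustrative names, not part of any IXA tool):

```python
# Toy illustration only: the suffixes are copied from the hyphenated slide
# example; a real system would use a morphological analyser instead.
CASE_BY_SUFFIX = {
    "rak": "ergative (subject)",
    "a": "absolutive (object)",
    "an": "inessive (modifier)",
}

def roles(sentence):
    """Map each case-marked word to its role, ignoring word order."""
    found = {}
    for word in sentence.rstrip(".").split():
        stem, sep, suffix = word.partition("-")
        if sep:  # only hyphen-segmented words carry a case suffix here
            found[stem.lower()] = CASE_BY_SUFFIX[suffix]
    return found

orders = [
    "Txakur-rak egunkari-a aho-an zekarren.",
    "Txakur-rak aho-an egunkari-a zekarren.",
    "Txakur-rak aho-an zekarren egunkari-a.",
    "Egunkari-a txakur-rak zekarren aho-an.",
]
analyses = [roles(s) for s in orders]
```

All four orders yield the same role assignment, which is the point of the slide: the analysis depends on the suffixes, not on where the words stand.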

8
Basque: linguistic features
  • Ergative case: marks the subject of transitive verbs
  • I am: Ni naiz (absolutive)
  • I saw the cat: Nik katua ikusi nuen (ergative)
  • Agreement in number and person between the verb
    and its subject, object and indirect object
  • I saw the cat: Nik katua ikusi nuen
  • I saw the cats: Nik katuak ikusi nituen
  • I saw you: Nik zu ikusi zintudan
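
The agreement pattern in the three examples can be sketched as a lookup keyed by the person and number of subject and object. This is only a table of the three forms shown on the slide; the feature labels (`"1sg"`, `"3sg"`, `"3pl"`, `"2sg"`) and the names `AUXILIARY` and `saw` are assumptions made for illustration, not a model of the full Basque auxiliary paradigm:

```python
# (subject, object) features -> past-tense auxiliary, as on the slide.
# The feature labels are an illustrative assumption.
AUXILIARY = {
    ("1sg", "3sg"): "nuen",      # I saw it/him/her
    ("1sg", "3pl"): "nituen",    # I saw them
    ("1sg", "2sg"): "zintudan",  # I saw you
}

def saw(object_word, object_features):
    """Build 'I saw X'; the auxiliary agrees with both subject and object."""
    auxiliary = AUXILIARY[("1sg", object_features)]
    return f"Nik {object_word} ikusi {auxiliary}"

sentences = [saw("katua", "3sg"), saw("katuak", "3pl"), saw("zu", "2sg")]
```

Only the auxiliary changes between the three sentences; the participle ikusi stays the same.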

9
Outline
  • Basque language
  • Ixa Group
  • Strategy to develop HLT
  • Applications
  • Tools
  • Linguistic resources

10
IXA Research Group on NLP (UPV/EHU) (I)
  • Main research fields: NLP, computational
    linguistics, language engineering.
  • Goal: to collaborate on
  • laying foundations for research
  • the development of language processing software.
  • Application language: mainly Basque.

11
IXA Research Group on NLP (UPV/EHU) (II)
  • 1986/1987: 4-5 university lecturers (computer
    science)
  • 2005/2006: interdisciplinary team
  • 33 computer scientists, including 19 lecturers
    (11 doctors) and 13 PhD students (research grants)
  • 17 linguists, including 6 lecturers (4 doctors)
    and 11 PhD students (research grants)
  • 2 research assistants assigned to projects

12
IXA Group. Milestones
  • Timeline: 1987, 1990, 1995, 2000, 2006
  • Milestones shown: ArgazkiPress, Irion, MT-system
13
IXA Research Group on NLP (UPV/EHU) (III)
  • Relationships with other universities in: Euskal
    Herria, Madrid, Toulouse, Barcelona, Alicante,
    Vigo, Maryland and Las Cruces (USA), Sydney
    (Australia), Palmerston North (New Zealand), Rome
    (Italy), Helsinki (Finland) ...
  • And with companies:
  • Microsoft, Xerox, Lexiquest, LingSoft, Conexor,
    Eatoni,
  • Hizkia, Jalgi, Egunkaria...
  • Funding: local government, University of the
    Basque Country, Spanish Government, European
    Community, ...

14
Outline
  • Basque language
  • Ixa Group
  • Strategy to develop HLT
  • Applications
  • Tools
  • Linguistic resources

15
Underlying strategy
  • Resources must be standardized to be useful
    across different
  • research projects
  • tools
  • applications
  • Language foundations, tools, and applications
    need incremental design and development
  • in a parallel and coordinated way
  • in order to get the most benefit from them

16
Strategic priorities: from basic research to
application development
17
Linguistic foundations, resources, tools and
applications
  • Linguistic foundations and resources: necessary
    infrastructure for the automatic processing of a
    language.
  • Tools: mainly intended for application developers.
  • Applications: commercial or non-commercial, for
    non-specialised end-users.

18
Phase I: laying foundations
  • Phonetics, Lexicon, Morphology, Syntax, Semantics
19
Phase II: first basic tools and applications
  • Phonetics, Lexicon, Morphology, Syntax, Semantics
20
Phase III: more advanced tools and applications
  • Lemmatiser/Tagger
  • Morphological analyser
  • Statistical tools for the treatment of corpora
  • Phonetics, Lexicon, Morphology, Syntax, Semantics
21
Phase IV: multilinguality and general
applications
  • Electronic dictionaries
  • Grammar checker
  • Web crawler
  • Environment for linguistic tools integration
  • Lemmatiser/Tagger
  • Morphological analyser
  • Statistical tools for the treatment of corpora
  • Computational description of morphology
  • Computational grammar
  • MRDs
  • Lexical database
  • Phonetics, Lexicon, Morphology, Syntax, Semantics
22
Outline
  • Basque language
  • Ixa Group
  • Strategy to develop HLT
  • Applications
  • Tools
  • Linguistic resources

23
Commercial applications
  • Spelling checker/corrector
  • 3 lemmatization-based online bilingual/monolingual
    dictionaries
  • Lemmatization-based online dictionary of
    synonyms
  • Lemmatization-based search engine
  • The Basque component of a generator of weather
    reports
  • Spanish-Basque transfer-based MT system

24
Spelling checker/corrector
  • Late standardization (Unified Basque)
  • Morphology and verbs, 1966
  • Lexical standardization process is still going on
  • Adult speakers did not learn Basque at school
    > Many doubts in writing 'tree': zuhaitz?
    zugaitz? zuhaitx? zuhaitsa? sugatza?
    > Give up! Do it in Spanish or French!
  • The spell-checker is an EFFECTIVE TOOL in
    the ongoing standardization of Basque
25
Basque spelling checker/corrector
  • More than 20,000 downloads
  • Versions: Office, OpenOffice, PC, Mac, web service...
  • Not just a list of possible word-forms!
26
Spanish-Basque transfer MT
  • Open source
  • No lexical disambiguation, but idioms are handled!
  • No use of corpora
27
Spanish-Basque transfer MT: future work
  • Hybrid: SMT, EBMT, RBMT
  • Lexical disambiguation
  • Verb subcategorization
  • English
28
  • Search engine (based on lemmas)
  • Searches not for the word form saguarekin
  • but for its lemma sagu
  • Irrelevant similar words are removed
  • those beginning with sagu
  • but with a different lemma
  • e.g. saguzar
  • Word forms with other suffixes are found
  • saguen, saguaren, sagua, saguetan
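
The lemma-based matching described above can be sketched in a few lines of Python. The `LEMMA_OF` table is a hand-written stand-in for a real lemmatizer and contains only the word forms mentioned on the slide; `search` is a hypothetical helper name.

```python
# Hand-written stand-in for a lemmatizer: word form -> lemma.
LEMMA_OF = {
    "saguarekin": "sagu",
    "saguen": "sagu",
    "saguaren": "sagu",
    "sagua": "sagu",
    "saguetan": "sagu",
    "saguzar": "saguzar",  # starts with "sagu" but has a different lemma
}

def search(query, document_words):
    """Return the words that share the query's lemma."""
    target = LEMMA_OF.get(query, query)
    return [w for w in document_words if LEMMA_OF.get(w, w) == target]

words = ["saguen", "saguzar", "sagua", "saguetan", "saguaren"]
hits = search("saguarekin", words)

# For contrast: naive prefix matching also returns the unrelated saguzar.
prefix_hits = [w for w in words if w.startswith("sagu")]
```

Comparing `hits` with `prefix_hits` shows why the slide insists on lemmas: prefix matching cannot separate inflected forms of sagu from a different word that happens to share the prefix.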

29
Lemmatization-based online Basque-Spanish
bilingual dictionary
30
Thesaurus (lemmatizer inside)
31
Basque monolingual dictionary. Advanced
electronic version
32
Diccionario Básico Escolar Cubano
33
Integrated lookup across different dictionaries
34
Basque WordNet: ontology of word synsets
  • Figure labels: meaning number, synset
35
Multimeteo: the Basque component of a generator of
weather reports
  • Figure labels: title, paragraph, body
36
Second language learning system (based on learner
and error corpora)
37

Second language learning system (Automatic
generation of exercises)
38
Outline
  • Basque language
  • Ixa Group
  • Strategy to develop HLT
  • Applications
  • Tools
  • Linguistic resources

39
Methodology for stand-off corpus tagging (TEI,
feature structures and XML)
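
As a rough illustration of stand-off tagging, the Python sketch below leaves the text untouched and stores each annotation separately, pointing back into the text by character offsets. The element names `standoff` and `span` and the attribute layout are assumptions for this example; `fs` and `f` are borrowed from TEI feature structures in simplified form. Nothing here reproduces the group's actual schema, and the output is not validated against TEI.

```python
import xml.etree.ElementTree as ET

TEXT = ("Noizean behin, Informatika Fakultatearen aurreko "
        "zuhaitzak inausten dira.")

# (word form, lemma) pairs taken from the lemmatizer example in the slides.
ANALYSES = [("zuhaitzak", "zuhaitz"), ("inausten", "inausi"), ("dira", "izan")]

def build_standoff(text, analyses):
    """Create annotations that reference the text by offsets only."""
    root = ET.Element("standoff")
    for form, lemma in analyses:
        start = text.index(form)  # first occurrence; enough for this sketch
        span = ET.SubElement(root, "span",
                             start=str(start), end=str(start + len(form)))
        fs = ET.SubElement(span, "fs")
        feature = ET.SubElement(fs, "f", name="lemma")
        feature.text = lemma
    return root

root = build_standoff(TEXT, ANALYSES)
xml_string = ET.tostring(root, encoding="unicode")
```

Because the annotations live outside the text, several layers (morphology, syntax, entities) can be added or revised independently without rewriting the corpus itself.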
40
EULIA: tool for corpus tagging
(Screenshot of HDBL goes here)
41
CORPUSGILE: tool for querying corpora
42
ERAUZTERM: terminology extraction from corpora
43
EDBL: lexical database
  • Lexical basis for the automatic processing of
    Basque
  • 80,000 entries
  • Dictionary entries
  • Verb forms
  • Affixes
  • Updated and consistent
  • Built with ORACLE V7 and UNIX

44
Format of the lemmatizer output
/<Noizean_behin>/<HAUL_EDBL>/
    ("noizean_behin" ADB ADOARR @ADLG)
/<,>/<PUNT_KOMA>/
/<Informatika>/<HAS_MAI>/
    ("informatika" IZE ARR @KM>)
/<Fakultatearen>/<HAS_MAI>/
    ("fakultate" IZE ARR DEK GEN NUMS MUGM @IZLG> @<IZLG)
/<aurreko>/
    ("aurre" IZE ARR DEK NUMS MUGM DEK GEL @IZLG> @<IZLG)
/<zuhaitzak>/
    ("zuhaitz" IZE ARR DEK ABS NUMP MUGM @OBJ @SUBJ @PRED)
/<inausten>/
    ("inausi" ADI SIN AMM ADOIN ASP EZBU @-JADNAG)
/<dira>/
    ("izan" ADL A1 NR_HK @JADLAG)
/<.>/<PUNT_PUNT>/
Architecture of the lemmatizer/parser
  • Input: plain text
  • Morphosyntax: morphological analyzer
  • EUSTAGGER: linguistic disambiguation (CG),
    statistical disambiguation
  • Parsing, surface syntax: EIHERA entities (xfst),
    postpositions (CG), chunker (CG) for NP, PP, VP
  • Dependencies, deep syntax: syntactic dependencies (CG)
  • Output: parsed text
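
A downstream tool has to read the lemmatizer output shown on this slide back into a structure. The Python sketch below parses a cleaned-up excerpt of it, assuming one word-form line of the shape `/<form>/` followed by indented reading lines of the shape `("lemma" TAG ... @FUNCTION)`. The regular expressions and the `parse` function are illustrative assumptions, not the group's own code.

```python
import re

SAMPLE = '''/<zuhaitzak>/
    ("zuhaitz" IZE ARR DEK ABS NUMP MUGM @OBJ @SUBJ @PRED)
/<inausten>/
    ("inausi" ADI SIN AMM ADOIN ASP EZBU @-JADNAG)
/<dira>/
    ("izan" ADL A1 NR_HK @JADLAG)
/<.>/<PUNT_PUNT>/
'''

FORM_RE = re.compile(r'^/<(.+?)>/')                      # word-form line
READING_RE = re.compile(r'^\s*\("([^"]+)"\s*(.*)\)\s*$')  # analysis line

def parse(text):
    """Turn the output into a list of tokens with their readings."""
    tokens = []
    for line in text.splitlines():
        match = FORM_RE.match(line)
        if match:
            tokens.append({"form": match.group(1), "readings": []})
            continue
        match = READING_RE.match(line)
        if match and tokens:
            fields = match.group(2).split()
            tokens[-1]["readings"].append({
                "lemma": match.group(1),
                # tags starting with "@" are syntactic function labels
                "tags": [f for f in fields if not f.startswith("@")],
                "functions": [f for f in fields if f.startswith("@")],
            })
    return tokens

tokens = parse(SAMPLE)
```

Separating morphological tags from the `@` function labels makes the remaining ambiguity visible: zuhaitzak still carries three candidate functions at this stage.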
45
EIHERA: entity recognition
46
ZUHAITZA: deep syntactic analyzer
  • Example sentence: Noizean behin, Informatika
    Fakultatearen aurreko zuhaitzak inausten dira.
    ("From time to time, the trees in front of the
    Computer Science Faculty are pruned.")
47
Dictionary lookup on a PDA
48
Conclusion
  • A language that seeks to survive in the modern
    information society requires language technology
    products.
  • "Minority" languages have to make a great effort
    to face this challenge.
  • Need for a high degree of standardization
  • Reusing language foundations, tools, and
    applications
  • Incremental design and development of them

49
Technology is an effective tool to promote use
of Basque: Strategies to develop HLT for minority
languages
  • IXA Research Group on NLP
  • University of the Basque Country
