Title: Technology is an effective tool to promote use of Basque. Strategies to develop HLT for minority languages
1 Technology is an effective tool to promote use of Basque. Strategies to develop HLT for minority languages
- IXA Research Group on NLP
- University of the Basque Country
- Dublin 2006
2 Outline
- Basque language
- Ixa Group
- Strategy to develop HLT
- Applications
- Tools
- Linguistic resources
01/04/06
2
3 History of Basque
4 Basque nowadays
- 1,033,900 speakers (first language: 700,000)
- Non-homogeneous distribution!
- Six different dialects!
5 Main reasons of Basque regression
- No official language
- Out of the education system
- 6 dialects!
- Out of media
- Out of industry
6 Main reasons of Basque regression
- No official language
- Out of the education system
- 6 dialects!
- Out of media
- Out of industry
- But since 1980...
- Co-official language
- Integrated in education (even at university)
- Unified Basque (1966)
- TV, newspaper...
- Out of new ICTs ???
7 Basque. Linguistic features
- Case suffixes and free order of sentence components
- The dog brought the newspaper in his mouth
- Txakur-rak egunkari-a aho-an zekarren.
- The-dog the-newspaper in-his-mouth brought
- ergative-3-s absolutive-3-s inessive-3-s
- Subject Object Modifier Verb
- Alternative possible orders
- Txakur-rak aho-an egunkari-a zekarren.
- Txakur-rak aho-an zekarren egunkari-a.
- Egunkari-a txakur-rak zekarren aho-an.
- ...
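The free ordering above can be illustrated with a small Python sketch. It simply enumerates every permutation of the four constituents from the slide. This is an assumption made for illustration only: the slide lists just a few alternatives, and the sketch makes no claim about which orders are natural or how emphasis changes between them.

```python
from itertools import permutations

# The four constituents from the slide. The case suffix (-rak, -a, -an)
# marks each role, so the role survives any reordering.
CONSTITUENTS = ["txakur-rak", "egunkari-a", "aho-an", "zekarren"]


def candidate_orders(words):
    """Return every ordering of the words as a capitalised sentence."""
    sentences = []
    for order in permutations(words):
        text = " ".join(order) + "."
        sentences.append(text[0].upper() + text[1:])
    return sentences


orders = candidate_orders(CONSTITUENTS)
for sentence in orders:
    print(sentence)
```

The three alternative orders shown on the slide all appear among the generated sentences.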
8 Basque. Linguistic features
- Ergative case. Subject of transitive verbs
- I am: Ni naiz (absolutive)
- I saw the cat: Nik katua ikusi nuen (ergative)
- Agreement in number and person between verb and (subject, object and indirect object)
- I saw the cat: Nik katua ikusi nuen
- I saw the cats: Nik katuak ikusi nituen
- I saw you: Nik zu ikusi zintudan
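A minimal sketch of the object agreement shown in these examples, written as a Python lookup table. It covers only the three auxiliary forms that appear on the slide (first-person-singular subject, past tense). The key names `3sg`, `3pl` and `2sg` and the function `i_saw` are hypothetical labels introduced here, not part of any IXA tool.

```python
# Auxiliary forms taken from the slide, keyed by the person/number of the
# object. The subject is always "nik" (I, ergative) in these examples.
AUX_BY_OBJECT = {
    "3sg": "nuen",      # I saw the cat
    "3pl": "nituen",    # I saw the cats
    "2sg": "zintudan",  # I saw you (zu)
}


def i_saw(obj, obj_agreement):
    """Build 'I saw <obj>' with the auxiliary that agrees with the object."""
    return f"Nik {obj} ikusi {AUX_BY_OBJECT[obj_agreement]}"


print(i_saw("katua", "3sg"))
print(i_saw("katuak", "3pl"))
print(i_saw("zu", "2sg"))
```

A full treatment would also vary the subject and the indirect object, which is why the verb paradigm is much larger than this table.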
9 Outline
- Basque language
- Ixa Group
- Strategy to develop HLT
- Applications
- Tools
- Linguistic resources
10 IXA Research Group on NLP (UPV/EHU) (I)
- Main research fields: NLP, computational linguistics, language engineering.
- Goal: to collaborate on
- laying foundations for research
- the development of language processing software.
- Application language: mainly Basque.
11 IXA Research Group on NLP (UPV/EHU) (II)
- 1986/1987: 4-5 university lecturers (computer science)
- 2005/2006: interdisciplinary team
- 33 computer scientists
- 19 lecturers (11 doctors)
- 13 PhD students (research grants)
- 17 linguists
- 6 lecturers (4 doctors)
- 11 PhD students (research grants)
- 2 research assistants assigned to projects
12 IXA Group. Milestones
ArgazkiPress
Irion
MT-system
13 IXA Research Group on NLP (UPV/EHU) (III)
- Relationships with other universities in Euskal Herria, Madrid, Toulouse, Barcelona, Alicante, Vigo, Maryland, Las Cruces (USA), Sydney (Australia), Palmerston North (New Zealand), Rome (Italy), Helsinki (Finland), ...
- And companies
- Microsoft, Xerox, Lexiquest, LingSoft, Conexor, Eatoni,
- Hizkia, Jalgi, Egunkaria...
- Funding: local government, University of the Basque Country, Spanish Government, European Community, ...
14 Outline
- Basque language
- Ixa Group
- Strategy to develop HLT
- Applications
- Tools
- Linguistic resources
15 Underlying strategy
- Need for standardization of resources, so that they are useful in different
- research projects
- tools
- applications
- Need for incremental design and development of language foundations, tools, and applications
- in a parallel and coordinated way
- in order to get the best benefit from them
16 Strategic priorities: from basic research to application development
17 Linguistic foundations, resources, tools and applications
- Linguistic foundations and resources: necessary infrastructure for the automatic processing of a language.
- Tools: mainly intended for application developers.
- Applications: commercial or non-commercial, for non-specialised end-users.
18 Phase I: laying foundations
Phonetics Lexicon Morphology
Syntax Semantics
19 Phase II: first basic tools and applications
Phonetics Lexicon Morphology
Syntax Semantics
20 Phase III: more advanced tools and applications
Lemmatiser/Tagger
Morphological analyser
Statistical tools for the treatment of corpora
Phonetics Lexicon Morphology
Syntax Semantics
21 Phase IV: multilinguality and general applications
Electronic dictionaries
Grammar checker
Web crawler
Environment for linguistic tools integration
Lemmatiser/Tagger
Morphological analyser
Statistical tools for the treatment of corpora
Comp. description of morphology
Comp. grammar
MRDs
Lexical Database
Phonetics Lexicon Morphology
Syntax Semantics
22 Outline
- Basque language
- Ixa Group
- Strategy to develop HLT
- Applications
- Tools
- Linguistic resources
23 Commercial applications
- Spelling checker/corrector
- 3 lemmatization-based on-line bilingual/monolingual dictionaries
- Lemmatization-based on-line dictionary of synonyms
- Lemmatization-based search engine
- The Basque component of a generator of weather reports
- Spanish-Basque transfer-based MT system
24 Spelling checker/corrector
- Late standardization (Unified Basque)
- Morphology and verbs, 1966
- Lexical standardization process is still going on
- Adult speakers did not learn Basque at school
- > Many doubts in writing 'tree': zuhaitz? zugaitz? zuhaitx? zuhaitsa? sugatza?
- > Give up! Do it in Spanish or French!
- The spell-checker is an EFFECTIVE TOOL in the ongoing standardization of Basque
25 Basque spelling checker/corrector
- More than 20,000 downloads
- Versions: Office, OpenOffice, PC, Mac, web service...
- Not just a list of possible word-forms!
26 Spanish-Basque transfer MT
- Open source
- No lexical disambiguation, but idioms are handled!
- No use of corpus
27 Spanish-Basque transfer MT: future work
- Hybrid: SMT, EBMT, RBMT
- Lexical disambiguation
- Verb subcategorization
- English
28 Search engine (based on lemmas)
- The search is not for the word form saguarekin but for its lemma, sagu
- Irrelevant similar words are removed: those beginning with sagu but with a different lemma, e.g. saguzar
- Word forms with other suffixes are found: saguen, saguaren, sagua, saguetan
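The behaviour described on this slide can be sketched in Python as follows. The lemma table `LEMMA_OF` is a hypothetical, hand-written stand-in for the group's lemmatizer and contains only the word forms named on the slide. The "documents" are toy token lists, not real Basque sentences.

```python
# Hypothetical lemma table; a real system would call a morphological
# analyser/lemmatizer instead of a fixed dictionary.
LEMMA_OF = {
    "sagu": "sagu",
    "sagua": "sagu",
    "saguaren": "sagu",
    "saguarekin": "sagu",
    "saguen": "sagu",
    "saguetan": "sagu",
    "saguzar": "saguzar",  # shares the prefix "sagu" but is a different lemma
}


def lemma_search(query, documents):
    """Return (document index, word form) pairs whose lemma matches the query's lemma."""
    target = LEMMA_OF.get(query.lower(), query.lower())
    hits = []
    for index, document in enumerate(documents):
        for token in document.lower().split():
            word = token.strip(".,;:!?")
            if LEMMA_OF.get(word, word) == target:
                hits.append((index, word))
    return hits


docs = ["sagua saguzar", "saguen saguetan", "saguzar"]
hits = lemma_search("saguarekin", docs)
print(hits)
```

A plain prefix match on "sagu" would also return saguzar. Comparing lemmas is what filters it out while still finding the forms with other suffixes.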
29 Lemmatization-based on-line Basque-Spanish bilingual dictionary
30 Thesaurus (lemmatizer inside)
31 Basque monolingual dictionary. Advanced electronic version
32 Diccionario Básico Escolar Cubano
33 Integration of queries to different dictionaries
34 BasqueWORDNET. Ontology of word synsets
meaning number
Synset
35 Multimeteo. The Basque component of a generator of weather reports
Title
paragraph
body
36 Second language learning system (learner and error corpora based)
37 Second language learning system (Automatic
generation of exercises)
38 Outline
- Basque language
- Ixa Group
- Strategy to develop HLT
- Applications
- Tools
- Linguistic resources
39 Methodology for stand-off corpus tagging (TEI, feature structures and XML)
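Stand-off tagging keeps the annotations in a document separate from the text and points back to the text by identifier. The Python sketch below illustrates the idea with the standard `xml.etree.ElementTree` module. The element and attribute names (`text`, `w`, `id`, `annotations`, `ann`, `target`, `lemma`) are hypothetical simplifications. They are not the TEI feature-structure schema used by the group.

```python
import xml.etree.ElementTree as ET

# The text document: tokens only, each with an identifier.
TEXT_XML = (
    '<text>'
    '<w id="w1">zuhaitzak</w><w id="w2">inausten</w><w id="w3">dira</w>'
    '</text>'
)

# The stand-off annotation document: no text, only pointers and analyses.
ANN_XML = (
    '<annotations>'
    '<ann target="w1" lemma="zuhaitz"/>'
    '<ann target="w2" lemma="inausi"/>'
    '<ann target="w3" lemma="izan"/>'
    '</annotations>'
)


def resolve(text_xml, ann_xml):
    """Join each annotation with the token it points to."""
    words = {w.get("id"): w.text for w in ET.fromstring(text_xml).iter("w")}
    annotations = ET.fromstring(ann_xml).iter("ann")
    return [(words[a.get("target")], a.get("lemma")) for a in annotations]


pairs = resolve(TEXT_XML, ANN_XML)
print(pairs)
```

Because the text document is never modified, several annotation layers (lemmas, syntax, entities) can point at the same tokens independently.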
40 EULIA: tool for corpus tagging
41 CORPUSGILE: tool for querying corpora
42 ERAUZTERM: terminology extraction from corpora
43 EDBL lexical database
- Lexical basis of the automatic processing of Basque
- 80,000 entries
- Dictionary entries
- Verb-forms
- Affixes
- updated and consistent.
- built with ORACLE V7 and UNIX
44 Format of the lemmatizer output
/<Noizean_behin>/<HAUL_EDBL>/
  ("noizean_behin" ADB ADOARR @ADLG)
/<,>/<PUNT_KOMA>/
/<Informatika>/<HAS_MAI>/
  ("informatika" IZE ARR @KM>)
/<Fakultatearen>/<HAS_MAI>/
  ("fakultate" IZE ARR DEK GEN NUMS MUGM @IZLG> @<IZLG)
/<aurreko>/
  ("aurre" IZE ARR DEK NUMS MUGM DEK GEL @IZLG> @<IZLG)
/<zuhaitzak>/
  ("zuhaitz" IZE ARR DEK ABS NUMP MUGM @OBJ @SUBJ @PRED)
/<inausten>/
  ("inausi" ADI SIN AMM ADOIN ASP EZBU @-JADNAG)
/<dira>/
  ("izan" ADL A1 NR_HK @JADLAG)
/<.>/<PUNT_PUNT>/
Figure: Architecture of the lemmatizer/parser. Stages shown, from plain text to parsed text: morphosyntax (morphological analyzer, linguistic disambiguation, statistical disambiguation; EUSTAGGER), entity recognition (EIHERA), surface syntax (chunker, postpositions; NP, PP, VP) and deep syntax (syntactic dependencies). CG and xfst label individual modules.
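A minimal Python sketch of how the sample output could be read back into a data structure. It assumes the sample decodes to lines of the form `/<word-form>/`, optionally followed by a second `<...>/` mark, each followed by parenthesised readings in which `@` introduces syntactic function tags. The function `parse` and the dictionary keys are hypothetical, and the sketch does not interpret what the individual tags mean.

```python
import re

# A short excerpt in the assumed format.
SAMPLE = '''/<zuhaitzak>/
  ("zuhaitz" IZE ARR DEK ABS NUMP MUGM @OBJ @SUBJ @PRED)
/<dira>/
  ("izan" ADL A1 NR_HK @JADLAG)
/<.>/<PUNT_PUNT>/
'''

# A word-form line, with an optional second bracketed mark.
COHORT = re.compile(r"^/<(?P<form>.+?)>/(?:<(?P<mark>.+?)>/)?$")
# A reading line: quoted lemma followed by space-separated tags.
READING = re.compile(r'^\("(?P<lemma>[^"]+)"\s*(?P<tags>.*)\)$')


def parse(text):
    """Return one dict per word form, each with its list of readings."""
    tokens = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = COHORT.match(line)
        if match:
            tokens.append(
                {"form": match["form"], "mark": match["mark"], "readings": []}
            )
            continue
        match = READING.match(line)
        if match and tokens:
            tags = match["tags"].split()
            tokens[-1]["readings"].append({
                "lemma": match["lemma"],
                "tags": [t for t in tags if not t.startswith("@")],
                "functions": [t for t in tags if t.startswith("@")],
            })
    return tokens


tokens = parse(SAMPLE)
for token in tokens:
    print(token["form"], [r["lemma"] for r in token["readings"]])
```

Keeping morphological tags and `@` function tags in separate lists makes it easy to see where a word form is still ambiguous, as with zuhaitzak, which carries three candidate functions.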
45 EIHERA: entity recognition
46 Zuhaitza
Noizean behin, Informatika Fakultatearen aurreko zuhaitzak inausten dira. (From time to time, the trees in front of the Faculty of Informatics are pruned.)
ZUHAITZA: deep syntactic analyser
47 Dictionary lookup by PDA
48 Conclusion
- A language that seeks to survive in the modern information society requires language technology products.
- "Minority" languages have to make a great effort to face this challenge.
- Need for high standardization
- Reusing language foundations, tools, and applications
- Incremental design and development of them
49 Technology is an effective tool to promote use of Basque. Strategies to develop HLT for minority languages
- IXA Research Group on NLP
- University of the Basque Country