Title: The Corp
1 The Corpógrafo
- Belinda Maia and Luís Sarmento
- Polo FLUP
- LINGUATECA
2 A bit of history
- PALC 97: 'Do-it-yourself corpora ... with a little bit of help from your friends!'
- CULT 1998: Making corpora a learning process
- Contrastive linguistics
- Corpora linguistics
- Translation teaching
- General > specific language
3 A bit of history
- 2000: 1st Masters in Terminology and Translation at FLUP
- PALC 2001: Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora
- Specialized translation and terminology
- Contact with domain experts
- Importance of IT
- Need for technical help for more ambitious students!
4 A bit of history
- LREC 2002: Corpora for terminology extraction: the differing perspectives and objectives of researchers, teachers and language services providers
- 2002: 2nd Masters in Terminology and Translation at FLUP
- Plea for help to Diana Santos
- October 2002
- LINGUATECA - Polo FLUP
5 LINGUATECA
- See http://www.linguateca.pt
- Leader: Diana Santos (SINTEF Oslo)
- Objective - to create resources and tools for the computational processing of Portuguese
- Nodes at Oslo, Lisbon, Braga and Porto
- Porto - Polo CLUP/FLUP
6 Polo CLUP/FLUP: General focus
- See http://www.linguateca.pt/poloclup/
- On constructing resources specific to the needs of FLUP/CLUP
- For researchers, teachers and students
- For teaching methodology at FLUP
- BNC and Reuters corpora on intranet
- A small chat corpus
- Comparable corpora
7 More history
- 2003: Poster of the GC at CL2003
- 2003: 'What are comparable corpora?' at CL2003
- 2003: Experimentation with evaluation of Machine Translation
- 2003: Experimentation with GC
- 2003: 3rd Masters in Terminology and Translation at FLUP
8 Polo CLUP/FLUP: Research focus
- See http://www.linguateca.pt/poloclup/
- On-line suite of corpora tools to work with comparable corpora, with emphasis on bilingual research
- Focus on special domains
- Construction of terminology databases, ontologies and domain models
- Corpógrafo
9 And ...
- Evaluation of Machine Translation
- Experimentation with evaluation
- Teaching and research focus
- Tools for collecting empirical data
- Results
- TrAva: MT evaluation tool
- CorTA: corpus of 1 EN input + 4 MT output sentences
10 The Corpógrafo results from
- Terminology, translation and language study and research (Belinda)
- Computational linguistics research and production of resources (Diana)
- Information retrieval and artificial intelligence (Luís)
- Terminology data (Domain experts)
- Discussions on priorities!
11 GC: Integrated Web Environment for Corpora Linguistics
- Motivation
- Lack of comprehensive, wide-scope corpora tools
- Commercial packages are usually difficult to integrate/customize
- Tools are not prepared to support cooperative work
- Linguistic knowledge is not usually integrated in tools
[Diagram: GC architecture. Labels: corpora (BNC, CETEM Público, COMPARA, others), each with a custom interface; tool pool (concordance engine, taggers, semi-automatic aligner, corpora bot, statistics, custom tools); terminology DB, personal corpora, inter-user communication, virtual desktop, terminology extraction tool (auto/semi-auto); roles DEV, ADM, USER; Internet; file formats PDF, PS, RTF, TXT, HTML, DOC]
12 Working with the Corpógrafo
- Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
- All research done ONLINE
- Each username/password gets a separate space on our server
- At present, anyone can work with it using 10 MB of space for FREE
- BUT - you get an empty space, tools and a tutorial!
13 Corpora and Terminology
- Special Domain Corpora
- Terminology extraction
- Terminology databases
- Structuring of domain knowledge
- Further corpora and information retrieval
14 [Diagram with labels: Internet, Corpora, Corpora Analysis, Terminology Database, Text details]
15 Terminology: Prescription or Description?
- Prescriptive > descriptive
- Paper > digital form
- Static > dynamic resources
- Democratization of terminology
- ISO standards > socioterminology
- Knowledge structures increasingly recognized as structured but dynamic
16 Perspectives of terminology users
- Domain experts and vested interests
- Translators
- Information retrieval
- Knowledge engineering
- Standardized terminology
- The right word
- Finding information
- Perfecting Google
- Structuring knowledge
- Finding it fast
17 Bridging the Gap
- General linguists
- Translation teachers
- Translation students
- Corpus linguists
- Computational linguists
- Computer engineers
- Computer-phobia
- Computer-worship
18 Focus of Corpógrafo
- Design priorities are to
- See the Big Picture
- Create the Overall Framework
- Get feedback from users
- Develop according to real research needs
- Fill in details and improve techniques as needed
20 File Manager
- Area where each individual or group can
- Upload texts to space on server
- Convert various text formats to .txt
- Clean them of unnecessary material
- Check tokenization and sentence divisions
- Register full information on source, domain and text type
- Group and re-group texts into corpora
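The cleaning, sentence-division and tokenization steps listed for the File Manager can be sketched roughly as follows. This is only an illustration, not the Corpógrafo's own code: the `TextRecord` class, the function names and the splitting rules are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TextRecord:
    """Hypothetical record holding one text plus its registered metadata."""
    source: str
    domain: str
    text_type: str
    sentences: list = field(default_factory=list)


def clean(raw: str) -> str:
    """Collapse runs of whitespace and drop empty lines."""
    lines = [re.sub(r"\s+", " ", line).strip() for line in raw.splitlines()]
    return " ".join(line for line in lines if line)


def split_sentences(text: str) -> list:
    """Naive rule: split after . ! or ? when an uppercase letter follows."""
    return re.split(r"(?<=[.!?])\s+(?=[A-ZÀ-Ý])", text)


def tokenize(sentence: str) -> list:
    """Words (hyphenated words kept whole) and single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


raw = "Fire is a natural hazard.\n\n  Floods   are another.  "
record = TextRecord(
    source="example.txt",          # invented metadata for the example
    domain="Geography",
    text_type="textbook",
    sentences=split_sentences(clean(raw)),
)
```

A rule this simple will mis-split abbreviations such as "etc." followed by a capitalized word, which is one reason a manual check of the sentence divisions is still useful.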
21 General corpus analysis
- Concordancing tools allowing for
- Concordancing at sentence level
- KWIC concordancing
- Collocations
- N-gram tool
- Case-sensitive
- Alphabetical or frequency ordering
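As a rough illustration of what KWIC concordancing and an n-gram tool compute, here is a minimal sketch. It is not the Corpógrafo implementation; the function names, the `width` parameter and the ordering options are assumptions chosen to mirror the options listed above.

```python
import re
from collections import Counter


def tokenize(text):
    """Very simple word tokenizer; punctuation is discarded."""
    return re.findall(r"\w+", text)


def kwic(tokens, keyword, width=3, case_sensitive=False):
    """Return (left context, node, right context) for each hit of keyword."""
    norm = (lambda s: s) if case_sensitive else str.lower
    target = norm(keyword)
    hits = []
    for i, tok in enumerate(tokens):
        if norm(tok) == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def ngrams(tokens, n, by="frequency"):
    """Count n-grams; order by descending frequency or alphabetically."""
    counts = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    if by == "alphabetical":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


tokens = tokenize(
    "Coastal erosion threatens the coast. "
    "Coastal erosion is a natural hazard."
)
hits = kwic(tokens, "erosion", width=2)
bigrams = ngrams(tokens, 2)
```

Because the tokenizer drops punctuation, contexts and n-grams here can run across sentence boundaries; a sentence-level concordancer would apply the same functions to one sentence at a time.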
22 Corpora TDB
- Choose corpus
- Choose related TDB
- All terms, examples, definitions extracted (semi) automatically from corpus and transferred to TDB
- All metadata on texts providing data can be automatically transferred to TDB
23 Term extraction
- N-grams
- Unfiltered
- Filtered with restrictions on term in PT, EN, FR, IT, ES, DE
- Filtered with restrictions on term and context in PT, EN, FR, IT, ES, DE
- Singular and plural terms can be combined
- Existing terms in TDB need not appear
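One way to picture filtered n-gram extraction is the sketch below. It assumes a simple restriction (an n-gram may not begin or end with a stopword), a crude English-only plural folding rule, and a set of terms already in the TDB to leave out. The stopword list, the function names and the rules are illustrative assumptions; the filters actually used by the Corpógrafo are not described here.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real filter would be language-specific.
STOPWORDS_EN = {"a", "an", "the", "of", "in", "is", "are", "and", "to", "by"}


def singular(word):
    """Crude English-only plural folding: drop a final 's' (but not 'ss')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def candidate_terms(text, n=2, known_terms=frozenset()):
    """Count n-grams that neither start nor end with a stopword.

    Plural and singular forms of the last word are merged, and terms
    already present in known_terms (e.g. in the TDB) are skipped.
    Punctuation is ignored, so n-grams may span sentence boundaries.
    """
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if gram[0] in STOPWORDS_EN or gram[-1] in STOPWORDS_EN:
            continue
        gram[-1] = singular(gram[-1])
        term = " ".join(gram)
        if term not in known_terms:
            counts[term] += 1
    return counts.most_common()


text = (
    "Natural hazards include floods. A natural hazard is a risk. "
    "Coastal erosion is a natural hazard."
)
candidates = candidate_terms(text)
new_only = dict(candidate_terms(text, known_terms={"natural hazard"}))
```

The output is only a list of candidates: noise such as "hazards include" survives this filter, which is why each n-gram still has to be checked by a person against its concordances.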
24 Term selection from n-grams
- Consultation of list of n-grams
- Check term status of each n-gram via underlying concordances
- Check sources
- Send to TDB
25 Search for Candidates for Definitions and/or Semantic Relations
- Already possible via TDB
- Under development
- Research areas for Masters dissertations and research assistants
- Expressions that find definitions
- Expressions that find semantic relations
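"Expressions that find definitions" can be pictured as lexical patterns matched against corpus sentences. The sketch below uses two hypothetical English patterns ("X is defined as Y" and "X is a/an/the Y"). These are assumptions for illustration only and are not the expressions developed for the Corpógrafo.

```python
import re

# Hypothetical definition patterns; an optional leading article is skipped
# so that the captured term is the bare noun phrase.
DEFINITION_PATTERNS = [
    re.compile(
        r"^(?:(?:A|An|The) )?(?P<term>[\w-][\w -]*?) "
        r"(?:is|are) defined as (?P<definition>.+)$"
    ),
    re.compile(
        r"^(?:(?:A|An|The) )?(?P<term>[\w-][\w -]*?) "
        r"(?:is|are) (?P<definition>(?:a|an|the) .+)$"
    ),
]


def find_definitions(sentences):
    """Return (term, definition) candidates, one per matching sentence."""
    hits = []
    for sentence in sentences:
        for pattern in DEFINITION_PATTERNS:
            match = pattern.match(sentence.strip())
            if match:
                definition = match.group("definition").rstrip(".")
                hits.append((match.group("term"), definition))
                break  # first matching pattern wins
    return hits


sentences = [
    "A flood is an overflow of water onto dry land.",
    "Coastal erosion is defined as the loss of land along the shoreline.",
    "Floods damaged the village.",
]
found = find_definitions(sentences)
```

Patterns like these over-generate (any "X is a Y" sentence matches), so the results are candidates for a terminologist to accept or reject. Patterns for semantic relations, for example "X is a type of Y", could be written in the same way.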
26 TDB - Terminology database
- Databases are designed to be multilingual
- Terms listed alphabetically with language tag
- General data
- Morphological data
- Source metadata: authors, texts etc.
- Definitions: search for candidates
- Translation equivalents
- Semantic relations
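To make the kinds of data in a TDB entry concrete, here is a minimal sketch of one possible record layout. The class, its field names and the sample entries are invented for illustration; the actual Corpógrafo database schema is not given in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """Hypothetical layout for one entry in a multilingual term database."""
    term: str
    language: str                                     # language tag, e.g. "EN"
    part_of_speech: str = ""                          # morphological data
    sources: list = field(default_factory=list)       # source metadata
    definitions: list = field(default_factory=list)
    equivalents: dict = field(default_factory=dict)   # language tag -> term
    relations: list = field(default_factory=list)     # (relation, other term)


def listing(entries):
    """Terms in alphabetical order, each followed by its language tag."""
    ordered = sorted(entries, key=lambda e: (e.term.casefold(), e.language))
    return [f"{e.term} [{e.language}]" for e in ordered]


# Invented sample entries, for illustration only.
entries = [
    TermEntry(
        "natural hazard", "EN",
        part_of_speech="noun",
        equivalents={"PT": "risco natural"},
        relations=[("has_type", "flood")],
    ),
    TermEntry("erosão costeira", "PT", part_of_speech="noun"),
    TermEntry("flood", "EN", definitions=["an overflow of water onto dry land"]),
]
index = listing(entries)
```

Keeping the source metadata on each entry is what would allow a definition or example to be traced back to the text it was extracted from.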
27 Future developments
- General testing and improvement
- Development of new ideas or functions
- Isomorphic relationship between
- Research possibilities
- Researchers' needs
- Our skills
- Coordination of individual corpus projects into bigger projects, when possible or necessary
28 Theoretical questions / problems
- How large is a good domain corpus?
- Comparable corpora v. parallel corpora?
- How much information does a database need for information retrieval and knowledge engineering?
- How much does the user of a database need for translation, teaching etc.?
29 Corpógrafo and special domains
- Masters in Terminology and Translation
- Terminology projects with the support of domain specialists in
- Engineering: Electronics, Mechanical Engineering
- Geography: Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
- Medicine: Kidney support machines, Neurology
- Science: Genetics
- Technology: GPS (Geographical Positioning Systems)
30 Corpógrafo and terminology/translation research
- Ongoing dissertations on aspects of
- Terminology: neologisms, definition searches, semantic relations, conceptual analysis
- Corpora: text analysis, corpora construction
- Technical writing: Electrical Appliances
- Localization
- Terminology in documentaries
- Translation of Multimedia
31 Linguateca
- Linguateca's policy - all resources and tools freely available online
- Primary users - Portuguese and Brazilian
- Other users also welcome
32 Polo CLUP/FLUP
- Bi- or multi-lingual in interest
- Corpógrafo available for experiments on a small scale to the general public
- Possibilities of future work on projects with users from other universities and other countries
33 Corpógrafo team
- Belinda Maia - FLUP - Associate Professor
- Luís Sarmento - Linguateca, FCCN - Computer Engineer - Researcher-in-charge
- Luís Miguel Cabral - Linguateca, FCCN - Computer Engineer, Research assistant
- Débora Oliveira - Linguateca, FCCN - Research assistant
- Ana Sofia Pinto - FLUP - technical assistant
34 Contacts
- If you are interested in finding out more, please contact me
- Belinda Maia at bmaia_at_mail.telepac.pt
- Or
- Luís Sarmento at las_at_letras.up.pt
- The Corpógrafo can be used
- (with a username and password) at
- http://www.linguateca.pt/corpografo and
- http://poloclup.linguateca.pt/ferramentas/gc