The Corpógrafo - PowerPoint PPT Presentation

About This Presentation
Title:

The Corpógrafo

Description:

Title: The Corpógrafo: Theory and Practice. Author: bmaia. Last modified by: belinda. Created Date: 1/21/2004 4:44:13 AM. Document presentation format: PowerPoint PPT presentation

Slides: 36
Provided by: bmai8

Transcript and Presenter's Notes

1
The Corpógrafo
  • Belinda Maia Luís Sarmento
  • PoloFLUP
  • LINGUATECA

2
A bit of history
  • PALC 97 'Do-it-yourself corpora ... with a
    little bit of help from your friends!'
  • CULT 1998 - Making corpora a learning process
  • Contrastive linguistics
  • Corpora linguistics
  • Translation teaching
  • General > specific language

3
A bit of history
  • 2000: 1st Masters in Terminology and
    Translation at FLUP
  • PALC 2001 - Training Translators in Terminology
    and Information Retrieval using Comparable and
    Parallel Corpora
  • Specialized translation and terminology
  • Contact with domain experts
  • Importance of IT
  • Need for technical help for more ambitious
    students!

4
A bit of history
  • LREC 2002 - Corpora for terminology extraction:
    the differing perspectives and objectives of
    researchers, teachers and language services
    providers
  • 2002: 2nd Masters in Terminology and
    Translation at FLUP
  • Plea for help to Diana Santos
  • October 2002
  • LINGUATECA - Polo FLUP

5
LINGUATECA
  • See http://www.linguateca.pt
  • Leader: Diana Santos (SINTEF, Oslo)
  • Objective - to create resources and tools for the
    computational processing of Portuguese
  • Nodes at Oslo, Lisbon, Braga and Porto
  • Porto - Polo CLUP/FLUP

6
Polo CLUP/FLUP: General focus
  • See http://www.linguateca.pt/poloclup/
  • On constructing resources specific to the needs
    of FLUP/CLUP
  • For researchers, teachers and students
  • For teaching methodology at FLUP
  • BNC and Reuters corpora on intranet
  • A small chat corpus
  • Comparable corpora

7
More history
  • 2003: Poster of the GC at CL2003
  • 2003: What are comparable corpora? CL2003
  • 2003: Experimentation with evaluation of Machine
    Translation
  • 2003: Experimentation with GC
  • 2003: 3rd Masters in Terminology and
    Translation at FLUP

8
Polo CLUP/FLUP: Research focus
  • See http://www.linguateca.pt/poloclup/
  • On-line suite of corpora tools to work with
    comparable corpora with emphasis on bilingual
    research
  • Focus on special domains
  • Construction of terminology databases, ontologies
    and domain models
  • Corpógrafo

9
And ...
  • Evaluation of Machine Translation
  • Experimentation with evaluation
  • Teaching research focus
  • Tools for collecting empirical data
  • Results
  • TrAva: MT evaluation tool
  • CorTA: corpus of 1 EN input + 4 MT output
    sentences

10
The Corpógrafo results from
  • Terminology, translation and language study and
    research (Belinda)
  • Computational linguistics research and production
    of resources (Diana)
  • Information retrieval and artificial intelligence
    (Luís)
  • Terminology data (Domain experts)
  • Discussions on priorities!

11
GC: Integrated Web Environment for Corpora
Linguistics
  • Motivation
  • Lack of Comprehensive, wide-scope Corpora Tools
  • Commercial Packages are usually difficult to
    Integrate/Customize
  • Tools are not prepared to support cooperative
    work.
  • Linguistic knowledge is not usually integrated
    in tools.

[Diagram: GC architecture. Corpora (BNC, CETEM Público, COMPARA, others), each reached through a custom interface. Tool pool: concordance engine, taggers, aligner (semi-auto), corpora bot, statistics, custom tools. Other components: terminology DB, personal corpora, inter-user communication, virtual desktop, terminology extraction tool (auto/semi-auto). Roles: DEV, ADM, USER. Text formats: PDF, PS, RTF, TXT, HTML, DOC. Source of texts: Internet.]

12
Working with the Corpógrafo
  • Corpógrafo is a suite of integrated tools for
    INDIVIDUAL or GROUP research
  • All research done ONLINE
  • Each username/password = separate space on our
    server
  • At present: anyone can work with it using 10 MB
    space for FREE
  • BUT - you get an empty space + tools + tutorial!

13
Corpora and Terminology
  • Special Domain Corpora
  • Terminology extraction
  • Terminology databases
  • Structuring of domain knowledge
  • Further corpora and information retrieval

14
[Diagram labels: Internet; Corpora; Corpora Analysis; Terminology Database; Text details]

15
Terminology: Prescription or Description?
  • Prescriptive > descriptive
  • Paper > digital form
  • Static > dynamic resources
  • Democratization of terminology
  • ISO standards > socioterminology
  • Knowledge structures increasingly recognized as
    structured but dynamic

16
Perspectives of terminology users
  • Domain experts and vested interests
  • Translators
  • Information retrieval
  • Knowledge engineering
  • Standardized terminology
  • The right word
  • Finding information
  • Perfecting Google
  • Structuring knowledge
  • Finding it fast

17
Bridging the Gap
  • General linguists
  • Translation teachers
  • Translation students
  • Corpus linguists
  • Computational linguists
  • Computer engineers
  • Computer-phobia
  • Computer-worship

18
Focus of Corpógrafo
  • Design priorities are to
  • See the Big Picture
  • Create the Overall Framework
  • Get feedback from users
  • Develop according to real research needs
  • Fill in details and improve techniques as needed

20
File Manager
  • Area where each individual or group can
  • Upload texts to space on server
  • Convert various text formats to .txt
  • Clean them of unnecessary material
  • Check tokenization and sentence divisions
  • Register full information on source, domain and
    text type
  • Group - and re-group - texts into corpora
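
A minimal sketch of what checking tokenization and sentence divisions involves, assuming plain .txt input. The function names `split_sentences` and `tokenize` are hypothetical and the rules are deliberately naive; this is not the Corpógrafo's own code.

```python
import re

def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Words (keeping internal hyphens and apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

sample = "O Corpógrafo converte textos. Check the divisions by hand!"
for sentence in split_sentences(sample):
    # Printing each division lets a user spot wrong breaks, e.g. after abbreviations.
    print(tokenize(sentence))
```

Rules this simple break on abbreviations such as "etc." or "Dr.", which is one reason a manual check of the divisions is still worthwhile.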

21
General corpus analysis
  • Concordancing tools allowing for
  • Concordancing at sentence level
  • KWIC concordancing
  • Collocations
  • N-gram tool
  • Case-sensitive
  • Alphabetical or frequency ordering
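
KWIC concordancing with a case-sensitivity switch and alphabetical ordering can be sketched as follows. The names `kwic` and `sort_by_right_context` and the `width` parameter are assumptions made for illustration, not the interface of the Corpógrafo.

```python
import re

def kwic(text, keyword, width=30, case_sensitive=False):
    """Return (left context, match, right context) for each hit of keyword."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad both sides so the keyword lines up in a column when printed.
        lines.append((left.rjust(width), m.group(0), right.ljust(width)))
    return lines

def sort_by_right_context(lines):
    """Alphabetical ordering on the words that follow the keyword."""
    return sorted(lines, key=lambda line: line[2].strip().lower())

text = "A corpus is a collection of texts. The Corpus grows."
for left, hit, right in sort_by_right_context(kwic(text, "corpus", width=10)):
    print(left, hit, right)
```

Frequency ordering would need one more step: counting the context words across hits and sorting on those counts.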

22
Corpora and TDB
  • Choose corpus
  • Choose related TDB
  • All terms, examples, definitions extracted
    (semi) automatically from corpus and transferred
    to TDB
  • All metadata on texts providing data can be
    automatically transferred to TDB

23
Term extraction
  • N-grams
  • Unfiltered
  • Filtered with restrictions on term in PT, EN, FR,
    IT, ES, DE
  • Filtered with restrictions on term and context in
    PT, EN, FR, IT, ES, DE
  • Singular and plural terms can be combined
  • Existing terms in TDB need not appear
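
One possible reading of "filtered with restrictions on term" is sketched below: count n-grams and drop those that start or end with a function word. The stopword list, the `min_freq` threshold and the function names are assumptions for illustration; the restrictions the Corpógrafo actually applies are not described in these slides.

```python
import re
from collections import Counter

# Tiny illustrative English stopword list (an assumption, not the tool's own).
STOPWORDS = {"the", "of", "a", "an", "in", "and", "to", "is", "for"}

def ngrams(tokens, n):
    """All contiguous n-token sequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_terms(text, n=2, min_freq=2, stopwords=STOPWORDS):
    """Frequent n-grams whose first and last words are not stopwords."""
    # Punctuation is discarded, so n-grams may span sentence boundaries.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(ngrams(tokens, n))
    return [
        (" ".join(gram), freq)
        for gram, freq in counts.most_common()
        if freq >= min_freq
        and gram[0] not in stopwords
        and gram[-1] not in stopwords
    ]

text = ("Coastal erosion is a natural hazard. Coastal erosion affects "
        "the coast. The natural hazard grows.")
print(candidate_terms(text))
```

The unfiltered list is simply `Counter(ngrams(tokens, n))`; the filter only decides which of those n-grams are shown to the user as term candidates.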

24
Term selection from n-grams
  • Consultation of list of n-grams
  • Check term status of each n-gram via underlying
    concordances
  • Check sources
  • Send to TDB

25
Search for Candidates for Definitions and/or
Semantic Relations
  • Already possible via TDB
  • Under development
  • Research areas for Mestrado dissertations and
    research assistants
  • Expressions that find definitions
  • Expressions that find semantic relations
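
"Expressions that find definitions" can be illustrated with two regular expressions over English sentences. Both patterns and the name `find_definitions` are hypothetical examples of the idea, not the expressions under development in the Corpógrafo.

```python
import re

# Illustrative patterns (assumptions): "X is defined as Y." and "X is a/an/the Y."
DEFINITION_PATTERNS = [
    re.compile(r"(?P<term>[A-Z][\w -]+?) is defined as (?P<definition>[^.]+)\."),
    re.compile(r"(?P<term>[A-Z][\w -]+?) (?:is|are) (?:a|an|the) (?P<definition>[^.]+)\."),
]

def find_definitions(sentences):
    """Return (term, definition) candidates; a human still has to validate them."""
    candidates = []
    for sentence in sentences:
        for pattern in DEFINITION_PATTERNS:
            m = pattern.search(sentence)
            if m:
                candidates.append((m.group("term"), m.group("definition")))
                break  # keep only the first matching pattern per sentence
    return candidates

sentences = [
    "Coastal erosion is the loss of land along the shoreline.",
    "Tokenization is defined as the division of text into tokens.",
    "We uploaded three texts yesterday.",
]
print(find_definitions(sentences))
```

Patterns for semantic relations would follow the same shape, for example an expression built around "is a kind of" or "consists of".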

26
TDB - Terminology database
  • Databases are designed to be multilingual
  • Terms listed alphabetically, with language tag
  • General data
  • Morphological data
  • Source metadata: authors, texts, etc.
  • Definitions: search for candidates
  • Translation equivalents
  • Semantic relations
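
The fields listed above suggest a record along the following lines. This is a sketch only: the class `TermEntry`, its field names and the helper functions are assumptions, not the actual TDB schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TermEntry:
    term: str
    language: str                       # language tag, e.g. "pt" or "en"
    morphology: Dict[str, str] = field(default_factory=dict)   # e.g. {"gender": "f"}
    sources: List[str] = field(default_factory=list)           # text/author metadata
    definitions: List[str] = field(default_factory=list)
    equivalents: Dict[str, List[str]] = field(default_factory=dict)  # tag -> terms
    relations: List[Tuple[str, str]] = field(default_factory=list)   # (relation, term)

def link_equivalents(a, b):
    """Record two entries as translation equivalents of each other."""
    a.equivalents.setdefault(b.language, []).append(b.term)
    b.equivalents.setdefault(a.language, []).append(a.term)

def sorted_entries(entries):
    """Alphabetical listing, with the language tag as tie-breaker."""
    return sorted(entries, key=lambda e: (e.term.lower(), e.language))

pt = TermEntry("erosão costeira", "pt", morphology={"gender": "f"})
en = TermEntry("coastal erosion", "en")
link_equivalents(pt, en)
print([e.term for e in sorted_entries([pt, en])])
```

Keeping equivalents keyed by language tag is one simple way to let the same structure serve bilingual and multilingual databases.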

27
Future developments
  • General testing and improvement
  • Development of new ideas or functions
  • Isomorphic relationship between
  • Research possibilities
  • Researchers' needs
  • Our skills
  • Coordination of individual corpus projects into
    bigger projects, when possible or necessary

28
Theoretical questions / problems
  • How large is a good domain corpus?
  • Comparable corpora v. Parallel corpora?
  • How much information does a database need for
    information retrieval and knowledge engineering?
  • How much does the user of a database need for
    translation, teaching etc.?

29
Corpógrafo and special domains
  • Masters in Terminology and Translation
  • Terminology projects with the support of domain
    specialists in
  • Engineering: Electronics, Mechanical Engineering
  • Geography: Population Geography, Natural Hazards
    (Fire, Floods, Earthquakes, Coastal Erosion)
  • Medicine: Kidney support machines, Neurology
  • Science: Genetics
  • Technology: GPS (Global Positioning System)

30
Corpógrafo and terminology/translation research
  • Ongoing dissertations on aspects of
  • Terminology: neologisms, definition searches,
    semantic relations, conceptual analysis
  • Corpora: text analysis, corpora construction
  • Technical writing > Electrical Appliances
  • Localization
  • Terminology in documentaries
  • Translation of Multimedia

31
Linguateca
  • Linguateca's policy - all resources and tools
    freely available online
  • Primary users - Portuguese and Brazilian
  • Other users also welcome

32
Polo CLUP/FLUP
  • Bi- or multi-lingual in interest
  • Corpógrafo available for experiments on a small
    scale to the general public
  • Possibilities of future work on projects with
    users from other universities and other countries

33
Corpógrafo team
  • Belinda Maia - FLUP - Associate Professor
  • Luís Sarmento - Linguateca, FCCN - Computer
    Engineer, Researcher-in-charge
  • Luís Miguel Cabral - Linguateca, FCCN - Computer
    Engineer, Research assistant
  • Débora Oliveira - Linguateca, FCCN - Research
    assistant
  • Ana Sofia Pinto - FLUP - Technical assistant

34
Contacts
  • If you are interested in finding out more, please
    contact me
  • Belinda Maia at bmaia_at_mail.telepac.pt
  • Or
  • Luís Sarmento at las_at_letras.up.pt
  • The Corpógrafo can be used
  • (with a username and password) at
  • http://www.linguateca.pt/corpografo and
  • http://poloclup.linguateca.pt/ferramentas/gc
