Dictionaries

See Patrick Hanks, "Lexicography", chapter 3 of Mitkov, R. (ed.), The Oxford Handbook of Computational Linguistics, Oxford: OUP, 2004.

Slides: 20
Provided by: Harold112
1
Dictionaries
  • See Patrick Hanks, "Lexicography", chapter 3 of
    Mitkov, R. (ed.), The Oxford Handbook of
    Computational Linguistics, Oxford: OUP, 2004.

2
Dictionaries/Lexicons
  • Lexicography and the computer
  • Corpus-based lexicography
  • Machine-readable dictionaries (MRDs)
  • Dictionaries for NLP
  • Thesauri: structured lexicons

3
Computational lexicography
  • Restructuring and exploiting human dictionaries
    for use by computer programs
  • Using computational techniques to compile (new)
    dictionaries
  • Focus on English (and other well established
    languages)
  • Significantly different issues for other
    languages, especially
      • Alphabetization and arrangement
      • Compilation from scratch for previously
        unstudied languages

4
Human dictionaries
  • Traditional view of what a dictionary is
      • List of words, arranged (usually) alphabetically
      • Inclusion in dictionary lends authority, even
        proscriptively
  • Entry typically gives
      • spelling, alternate spellings
      • POS, morphology (if irregular)
      • core definition (using defining vocab?)
      • pronunciation (using own transcription)
      • etymology
      • examples of usage
          • as justification for inclusion
          • as illustration of use (esp. learners'
            dictionaries)
  • Entry typically doesn't give
      • help with spelling
      • morphology (if regular), especially derivational
      • subcategorization information
      • contrastive examples of use
      • indications of possible metaphorical extensions
        to meaning

5
Human dictionaries
  • Historically
      • bilingual dictionaries for translators
      • monolingual dictionary as (pre/proscriptive)
        definition of language, often polemical
      • OED (1884-1928): first dictionary on purely
        descriptive principle, relying on citations
  • Deficiencies and difficulties
      • What to include? (neologisms, slang)
      • Inclusion of names
      • Differentiating senses
6
Differentiating word senses
  • Dictionaries disagree widely
  • Probably no right answer
  • General principles ("look for excuse to split" vs
    "look for reason to lump")
  • Keep related words of different POS together?
  • Etymology can be misleading (e.g. crane, pupil)
  • Metaphorical extension of original meaning: how
    far do you go? (e.g. rose, bar)
  • Purpose of dictionary may help decide, e.g.
    translation

7
Citations
  • Senses and uses identified by collecting examples
    of use
  • Sent in on slips by informants
  • Lexicographer's job is to collate these
  • Criteria for a new word (or new meaning)
      • Number of citations
      • Source of citations
      • Veracity of use

8
Corpus-based dictionaries
  • Corpus: a collection of texts, usually collected
    with a specific purpose in mind
  • British National Corpus: attempt to capture a
    synchronic picture of BrE of the late 1980s (100m
    words)
  • COBUILD Bank of English: dynamic "monitor
    corpus" used to help lexicographers
    identify/define usage

9
Machine-readable dictionaries
  • "Machine" means computer
  • Dictionary stored in a format which makes it
    manipulable on a computer
  • Originally, derived from MR version of print
    dictionary (from typesetters' tapes)
  • Now the other way round: data stored as a
    database from which hard copy can be printed
    (inter alia)

10
MRDs - advantages
  • Flexibility of access and presentation
      • Not bound to alphabetical listing
      • Information presented can be filtered
      • Can be searched as a database
      • Different versions (for different users, serving
        different purposes) can be produced
  • Increased storage capacity
      • More information can be stored, especially
          • Implicit information can be made explicit
          • More examples, including negative data
11
Lexicons for NLP
  • Have to state everything we need to know about
    the word
  • Phonology: stress pattern, possible weak forms
  • Orthography: spelling alternatives, hyphenation
  • Morphology: inflectional paradigms, even if
    regular
      • Information about derivations
  • Syntax: explicit information about
    subcategorization
      • e.g. syntactic/semantic features of arguments
      • Any special interpretation of tenses
  • Lexical combinatorics: compounds, idioms
  • Semantics: definition, semantic features,
    semantic relations
  • Pragmatics: register, collocation, connotation
12
Lexicons for NLP - example
  • Information about derivations
      • Agentive derivation (-er) is very productive
      • Usually means the actor doing the action of a
        verb, e.g. swimmer, dancer, killer
      • Not available for some verbs, e.g. knower,
        cycler, sayer (though cf. soothsayer), hoper
      • May have a specialised meaning instead of or as
        well as the derived meaning, e.g. revolver,
        computer, washer, hitter
      • In some cases can mean the object undergoing the
        action (via ergative use of verb), e.g. taster
13
Subcategorization
  • Words are assigned to categories (i.e. parts of
    speech, POS), e.g. noun, verb
      • on basis of form, meaning, use
  • Syntactic behaviour is predictable from (or
    determined by) category
  • Within a category there are subcategories with
    specific patterns of behaviour, both syntactic
    and semantic, e.g.
      • transitive/intransitive verb: direct object?
        passivize?

14
Subcategorization
  • Subcat frames indicate complement patterns and
    preferences, e.g.
      • subj, obj, double obj, prep-obj, infinitival
        complement, that-complement etc.
      • semantic features of complements, e.g. obj of eat
        normally edible
  • Subcat information can help to disambiguate
      • cf. He told the man where the body was buried.
      • He found the place where the body was buried.
  • Much of this info can be captured in general
    rules
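
A minimal sketch of how subcat frames could drive the disambiguation in the tell/find pair. It assumes that "tell" lists a frame with a wh-clause complement and "find" does not. The frame inventory, the feature sets and the function names are hypothetical and far smaller than a real lexicon.

```python
# Hypothetical subcat frames; labels follow the slide's informal notation.
SUBCAT = {
    "tell": [("subj", "obj", "wh-clause"), ("subj", "obj", "that-clause")],
    "find": [("subj", "obj")],
    "eat": [("subj",), ("subj", "obj")],
}

# Hypothetical selectional preference: the object of "eat" is normally edible.
SELECTION = {"eat": {"obj": "edible"}}
NOUN_FEATURES = {"apple": {"edible"}, "stone": set()}


def attach_wh_clause(verb):
    """Decide where a 'where ...' clause after verb + object attaches."""
    if any("wh-clause" in frame for frame in SUBCAT.get(verb, [])):
        return "complement of verb"
    return "relative clause modifying object"


def plausible_object(verb, noun):
    """Check the noun against the verb's preference for its object."""
    required = SELECTION.get(verb, {}).get("obj")
    return required is None or required in NOUN_FEATURES.get(noun, set())


print("tell:", attach_wh_clause("tell"))
print("find:", attach_wh_clause("find"))
print(plausible_object("eat", "apple"), plausible_object("eat", "stone"))
```

In "He told the man where the body was buried" the clause fills a slot in the frame of "tell". With "find" no such slot exists, so the clause is read as modifying "the place".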

15
  • Have to state everything we need to know about
    the word, though not necessarily explicitly
  • There can be rules to capture inheritance of
    properties, e.g.
      • accomplishment verb in progressive tense implies
        incompletion
      • cf. She was baking a cake when she dropped dead
        → no cake
      • She was stroking the cat when she dropped dead
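
Inheritance of this kind can be sketched as a lookup that falls back from the word to its class. The class names follow the slide's "accomplishment" and add "activity" for the stroking example. The property name, the table layout and the function name are assumptions for illustration. Treating "bake" as an accomplishment regardless of its object is a simplification.

```python
# Properties stated once per aspectual class rather than once per verb.
ASPECT_CLASSES = {
    # "was baking a cake" does not entail "baked a cake"
    "accomplishment": {"progressive_entails_simple_past": False},
    # "was stroking the cat" does entail "stroked the cat"
    "activity": {"progressive_entails_simple_past": True},
}

# Each verb states only its class; everything else is inherited.
VERBS = {
    "bake": {"aspect_class": "accomplishment"},
    "stroke": {"aspect_class": "activity"},
}


def get_property(verb, name):
    """Look in the verb's own entry first, then in its class."""
    entry = VERBS[verb]
    if name in entry:
        return entry[name]
    return ASPECT_CLASSES[entry["aspect_class"]][name]


for verb in ("bake", "stroke"):
    print(verb, get_property(verb, "progressive_entails_simple_past"))
```

The entry for "bake" never mentions the progressive, yet the "no cake" inference follows from its class. That is the sense in which the information is stated "though not necessarily explicitly".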

16
Exploiting human dictionaries in NLP
  • In all NLP applications, the lexicon is a major
    bottleneck
  • Availability of MRD versions of human
    dictionaries provided a possible solution
  • Obviously, an MRD gives a list of words, and some
    information
  • Extract further information about verb frames by
    analysing the examples
  • Identify semantic features from definitions
      • e.g. a plant which..., a person who...
  • Identify hidden arguments
      • e.g. to lock: to close sthg using a key
      • cf. He locked the door. The key was heavy.
      • He emptied his pockets. The key was heavy.
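
Both kinds of extraction can be approximated with pattern matching over definition text. In this rough sketch the regular expressions, the feature labels and the sample definitions are assumptions chosen to mirror the slide's examples. Real definitions need much more robust parsing.

```python
import re

# "a plant which ...", "a person who ...": the word after the article is
# taken as the genus term of the definition.
GENUS = re.compile(r"^(?:an?|the)\s+(\w+)\s+(?:which|who|that)\b",
                   re.IGNORECASE)

# "... using a key": the noun after "using" is taken as a hidden instrument.
INSTRUMENT = re.compile(r"\busing\s+(?:an?|the)\s+(\w+)", re.IGNORECASE)

# Hypothetical mapping from genus terms to semantic features.
FEATURES = {"plant": "+plant", "person": "+human"}


def semantic_feature(definition):
    """Return the feature suggested by the genus term, if any."""
    match = GENUS.match(definition.strip())
    if not match:
        return None
    return FEATURES.get(match.group(1).lower())


def hidden_instrument(definition):
    """Return an instrument mentioned in the definition, if any."""
    match = INSTRUMENT.search(definition)
    return match.group(1).lower() if match else None


print(semantic_feature("a plant which grows in water"))
print(semantic_feature("a person who teaches"))
print(hidden_instrument("to close something using a key"))
```

The instrument found in the definition of "lock" is what makes "the key" interpretable after "He locked the door". Nothing comparable is available after "He emptied his pockets".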

17
Exploiting human dictionaries in NLP
  • Generic information about a word and its usage
    can be derived from definitions in which it
    occurs

  • Wine: alcoholic drink made from fermented juices,
    especially of grapes
  • Vintage: a season's yield of wine from a vineyard
  • Red wine: wine having a red colour derived from
    the skins of the grapes used ...
  • Vineyard: an orchard where grapes are grown for
    the purpose of wine making
  • Pinot noir: a dry red Californian table wine
  • Sake: Japanese rice wine
  • Claret: a dry red Bordeaux or Bordeaux-like wine
  • Sherry: a sweet white wine from the Jerez region
    of Spain
  • Riesling: a dessert wine made from white grapes
    grown historically in Germany ...

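
As a sketch of what "derived from definitions in which it occurs" might look like in practice, the code below scans the slide's definitions for the word "wine". It collects the headwords that appear to be kinds of wine and the modifiers that precede the word. The two heuristics and their names are assumptions for illustration. The definitions are the slide's, with the trailing "..." dropped.

```python
import re

DEFS = {
    "vintage": "a season's yield of wine from a vineyard",
    "red wine": "wine having a red colour derived from the skins of the grapes used",
    "vineyard": "an orchard where grapes are grown for the purpose of wine making",
    "pinot noir": "a dry red Californian table wine",
    "sake": "Japanese rice wine",
    "claret": "a dry red Bordeaux or Bordeaux-like wine",
    "sherry": "a sweet white wine from the Jerez region of Spain",
    "riesling": "a dessert wine made from white grapes grown historically in Germany",
}

# Crude heuristic: a function word before the target means it is not the head.
FUNCTION_WORDS = {"of", "for", "where", "from"}


def kinds_of(word, defs):
    """Headwords whose definition seems to have `word` as its head noun."""
    kinds = []
    for headword, text in defs.items():
        tokens = text.split()
        if word in tokens:
            before = set(tokens[:tokens.index(word)])
            if not FUNCTION_WORDS & before:
                kinds.append(headword)
    return kinds


def modifiers_of(word, defs):
    """Words found immediately before `word` in any definition."""
    pattern = re.compile(r"([\w-]+)\s+" + re.escape(word) + r"\b")
    found = set()
    for text in defs.values():
        found.update(m for m in pattern.findall(text)
                     if m not in FUNCTION_WORDS)
    return sorted(found)


print(kinds_of("wine", DEFS))
print(modifiers_of("wine", DEFS))
```

Even this crude pass separates kinds of wine (sake, claret, sherry) from words merely related to wine (vintage, vineyard).
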
18
Corpus-based lexicography revisited
  • Similarly, analysis of real examples can reveal
    patterns of usage
  • Identify primary meaning: not always what you'd
    expect (example of reckon)
  • Identify possible complementation patterns, and
    their relative frequency
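
Counting complementation patterns can be sketched with a handful of sentences. The five example sentences below are invented for illustration and are not corpus data. The classification by the word following the verb is a deliberately crude assumption.

```python
import re
from collections import Counter

# Invented example sentences standing in for corpus lines.
SENTENCES = [
    "I reckon that he will come",
    "They reckon on a good harvest",
    "We reckon that it is late",
    "She reckons to finish by noon",
    "I reckon so",
]

VERB = re.compile(r"\breckon(?:s|ed|ing)?\s+(\w+)", re.IGNORECASE)

# Map the word right after the verb to a complementation pattern.
PATTERNS = {"that": "that-clause", "on": "prep-obj (on)", "to": "to-infinitive"}


def complement_pattern(sentence):
    """Classify one sentence by the word that follows the verb."""
    match = VERB.search(sentence)
    if not match:
        return None
    return PATTERNS.get(match.group(1).lower(), "other")


counts = Counter(complement_pattern(s) for s in SENTENCES)
total = sum(counts.values())
for pattern, n in counts.most_common():
    print(pattern, n, round(n / total, 2))
```

On a real corpus the same tally, taken over thousands of lines, gives the relative frequencies the slide refers to.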

19
Structured dictionaries
  • Special type of dictionary in which words are
    grouped together according to their meaning:
    thesaurus
  • Classic example: Roget's Thesaurus (1852)
  • Structured vocabulary much used in field of
    terminology
  • Also now a valuable resource for NLP: Miller's
    (Princeton) WordNet (1985)