Introduction to Computational Linguistics - PowerPoint PPT Presentation

Slides: 27
Provided by: MikeR2

Transcript and Presenter's Notes

Title: Introduction to Computational Linguistics


1
Introduction to Computational Linguistics
  • The Lexicon

2
Introduction
  • An inventory of words is an essential component
    of programs for a wide variety of
    language-sensitive applications, such as:
  • Spellchecking, stylechecking
  • Information retrieval (IR), information
    extraction (IE), message understanding
  • Parsing, generation, machine translation (MT)
  • Text-to-speech (TTS) and speech-to-text (STT)
  • Such an inventory is usually called a dictionary
    or lexicon.

3
Dictionaries
  • The purpose of a dictionary is to provide a wide
    range of information about words
  • Some of this is linguistic information, e.g.
    syntactic category, pronunciation, distribution.
  • But dictionaries also contain definitions of word
    senses, thus providing knowledge not just about
    language but about the world itself.

4
What is "dog"?
  • dog (ANIMAL), noun [C]
  • a common four-legged animal, especially kept by
    people as a pet or to hunt or guard things
  • Examples: my pet dog; wild dogs; dog food; We
    could hear dogs barking in the distance.
  • (from Cambridge Advanced Learner's Dictionary)

5
Senses of Dog
  • dog was found in the Cambridge Advanced Learner's
    Dictionary at the entries listed below.
  • dog (ANIMAL)
  • dog (PERSON)
  • dog (FOLLOW)
  • dog (PROBLEM)

(These are different senses, or lexemes, for dog.)

6
Two Views of the Lexicon give rise to different
issues
  • Lexicon as word database:
  • How to represent the word collection
  • Lookup: given an arbitrary word, how to access
    the relevant entries
  • What information to provide and how to express
    it
  • Lexicon as database about word senses:
  • Representation of word sense
  • What are the relations between word senses?
  • How do word senses hook up with conceptual
    knowledge?

7
Lexicon as Word Database: Representing the Word
Collection
  • Some possible representations:
  • Text file
  • Finite state automaton
  • Other specialised data structure which allows for
    common prefixes, e.g. letter tree
  • Full form vs. lexeme plus morphological analysis

8
FSA for Sublexicon Fragment
[Figure: finite state automaton for a sublexicon fragment; the arc labels include the letters t, h, e, s, o, a and i.]

9
Letter Tree
ltree([
  [b,[a,[r,[k,bark]]]],
  [c,[a,[r,[r,[y,carry]]],
         [t,cat,
            [e,[g,[o,[r,[y,category]]]]]]]],
  [d,[e,[l,[a,[y,delay]]]]],
  [h,[e,[l,[p,help]]],
     [o,[p,hop,
           [e,hope]]]],
  [q,[u,[a,[r,[r,[y,quarry]]]],
        [i,[z,quiz]],
        [o,[t,[e,quote]]]]]
]).
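
The same letter tree can be sketched in Python as a nested dictionary, where words with a common prefix share the nodes for that prefix. This is only an illustration of the data structure, not the presenter's implementation; the names `build_letter_tree`, `lookup` and the `WORD` marker key are hypothetical.

```python
WORD = "$word"  # hypothetical key marking that a complete word ends at this node

def build_letter_tree(words):
    """Build a nested-dict letter tree in which shared prefixes share nodes."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[WORD] = word
    return root

def lookup(tree, word):
    """Return the stored word if it is in the tree, otherwise None."""
    node = tree
    for letter in word:
        node = node.get(letter)
        if node is None:
            return None
    return node.get(WORD)

# The word list from the slide.
WORDS = ["bark", "carry", "cat", "category", "delay", "help",
         "hop", "hope", "quarry", "quiz", "quote"]
tree = build_letter_tree(WORDS)

print(lookup(tree, "category"))  # found: "cat" and "category" share c-a-t
print(lookup(tree, "car"))       # a prefix only, so not a word: None
```

Lookup cost depends on the length of the word rather than on the size of the lexicon, which is the main attraction of this representation over a flat text file.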

10
Full Form Dictionary
  • There is an entry for every possible word.
  • No need for morphological processing
  • Exceptions are handled automatically
  • OK when number of entries is not too large.
  • Repeated information.
  • Because languages have different morphological
    properties, full form is better for some
    languages than for others.

11
Morphological Analysis
[Figure: diagram with the labels "Lexicon", "Input Word: cats" and "Morphological Analysis".]

12
Morphological Analysis
  • Very roughly, morphological analysis of a word
    involves two subproblems:
  • A segmentation problem: how to get from the
    written text to the sequence of morphemes that
    make it up.
  • A morphotactic problem: how to combine the
    individual morphemes in a legitimate way.

13
Segmentation/Morphotactic Subproblems
  • Segmentation problem:
  • enlargement → en + large + ment
  • Morphotactic problem: given what we know about
    en, large and ment, how can they be legitimately
    combined?
  • enlargement → (en + large) + ment
  • enlargement ↛ en + (large + ment)
  • en + ADJ → V
  • V + ment → N
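
A minimal Python sketch of the two subproblems for "enlargement" follows. It assumes a three-morpheme inventory and encodes the two rules from this slide (en combines with an adjective to give a verb; a verb combines with ment to give a noun). The table names and the functions `segment` and `analyse` are hypothetical, and real systems use far richer rule formalisms.

```python
# Hypothetical morpheme inventory: stems carry a category,
# affixes carry a (required category, resulting category) rule.
STEMS = {"large": "ADJ"}
PREFIXES = {"en": ("ADJ", "V")}    # en + ADJ -> V
SUFFIXES = {"ment": ("V", "N")}    # V + ment -> N
MORPHEMES = set(STEMS) | set(PREFIXES) | set(SUFFIXES)

def segment(word):
    """Segmentation: every way of splitting word into known morphemes."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in MORPHEMES:
            for rest in segment(word[i:]):
                results.append([head] + rest)
    return results

def analyse(morphs):
    """Morphotactics: all (category, bracketing) pairs the rules license."""
    if not morphs:
        return []
    if len(morphs) == 1:
        m = morphs[0]
        return [(STEMS[m], m)] if m in STEMS else []
    analyses = []
    first, last = morphs[0], morphs[-1]
    if first in PREFIXES:          # try attaching the prefix last
        needs, gives = PREFIXES[first]
        for cat, tree in analyse(morphs[1:]):
            if cat == needs:
                analyses.append((gives, (first, tree)))
    if last in SUFFIXES:           # try attaching the suffix last
        needs, gives = SUFFIXES[last]
        for cat, tree in analyse(morphs[:-1]):
            if cat == needs:
                analyses.append((gives, (tree, last)))
    return analyses

for morphs in segment("enlargement"):
    print(morphs, analyse(morphs))
```

Only the bracketing (en large) ment survives: ment needs a verb, and large on its own is an adjective, so en (large ment) is rejected.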

14
2-Level Morphology
  • In 1981 the four Ks (Kimmo Koskenniemi, Lauri
    Karttunen, Ronald M. Kaplan and Martin Kay) were
    working on morphological analysis (MA).
  • The basic idea was that MA is about computing a
    relation between sets of strings at two levels:
  • Surface level (string of lexical words made from
    the surface alphabet)
  • Lexical level (string of morphemes made from the
    lexical alphabet)
  • This relation can be computed using finite-state
    transducers.
  • The finite-state model is reversible: the same
    transducer can be used for analysis and for
    generation.
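
To make the two-level idea concrete, here is a toy finite-state transducer in Python that relates lexical strings such as "fox+s" to surface strings such as "foxes". It is a sketch under stated assumptions, not Koskenniemi's rule formalism: the single e-insertion rule, the state numbering, the `STEMS` set and the function names are all invented for illustration. The same arc table is run in both directions, which is what reversibility amounts to.

```python
SIBILANTS = set("sxz")
LETTERS = "abcdefghijklmnopqrstuvwxyz"
EPS = ""  # empty string: the arc consumes or emits nothing on that level

def build_arcs():
    """Arcs are (state, lexical symbol, surface symbol, next state)."""
    arcs = []
    for state in (0, 1):           # 0: after non-sibilant, 1: after sibilant
        for ch in LETTERS:
            arcs.append((state, ch, ch, 1 if ch in SIBILANTS else 0))
    arcs.append((0, "+", EPS, 2))  # morpheme boundary surfaces as nothing
    arcs.append((1, "+", "e", 2))  # after a sibilant it surfaces as "e"
    arcs.append((2, "s", "s", 3))  # plural suffix
    return arcs

ARCS = build_arcs()
FINAL = {3}

def transduce(text, source):
    """source=0 reads the lexical level (generation);
    source=1 reads the surface level (analysis)."""
    target = 1 - source
    results = set()
    stack = [(0, 0, "")]           # (state, position in text, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text) and state in FINAL:
            results.add(out)
        for arc in ARCS:
            if arc[0] != state:
                continue
            src, tgt = arc[1 + source], arc[1 + target]
            if src == EPS:
                stack.append((arc[3], pos, out + tgt))
            elif text.startswith(src, pos):
                stack.append((arc[3], pos + 1, out + tgt))
    return results

STEMS = {"cat", "fox"}             # hypothetical stem lexicon

def analyse(surface):
    """Keep only the analyses whose stem is in the lexicon."""
    return {a for a in transduce(surface, 1) if a.split("+")[0] in STEMS}

print(transduce("fox+s", 0))
print(analyse("foxes"))
```

Without the stem filter, the transducer alone also proposes "foxe+s" for "foxes", which shows why the lexicon is needed to constrain the analysis.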

15
What Information to Provide
  • Specific Information, e.g. for "kicks"
  • Syntactic Information:
  • POS: verb
  • Tense: pres
  • Number: singular
  • Person: 3
  • Type: Transitive
  • Semantic Information:
  • event-type: Physical Action
  • type-of subject: animate
  • type-of object: physical

16
What Information to Provide
  • General Information
  • Class Attributes:
  • Agreement has (Number, Gender)
  • Enumeration of possible values:
  • Gender: masc, fem
  • Number: sing, plur
  • Class Relationships:
  • Transitive isa Verb
  • Common isa Noun
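
One possible encoding of this specific and general information is sketched below in Python. The attribute names, the dictionary layout and the helper functions are assumptions made for illustration; the slides do not prescribe a representation. The entry uses "sing" for Number so that it matches the enumerated values.

```python
# General information: class relationships and enumerated attribute values.
ISA = {"Transitive": "Verb", "Common": "Noun"}
VALUES = {"Gender": {"masc", "fem"}, "Number": {"sing", "plur"}}

# Specific information for the word form "kicks".
KICKS = {
    "form": "kicks",
    "syntax": {
        "POS": "Verb",
        "Tense": "pres",
        "Number": "sing",
        "Person": 3,
        "Type": "Transitive",
    },
    "semantics": {
        "event-type": "Physical Action",
        "subject": "animate",
        "object": "physical",
    },
}

def is_a(cls, ancestor):
    """Follow isa links upward to test class membership."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False

def valid_value(attribute, value):
    """Check a value against the enumeration for its attribute."""
    return value in VALUES.get(attribute, set())

print(is_a(KICKS["syntax"]["Type"], "Verb"))
print(valid_value("Number", KICKS["syntax"]["Number"]))
```

Keeping the general information in separate tables means each entry only has to state what is specific to that word.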

17
Two Views of the Lexicon give rise to different
issues
  • Lexicon as word database:
  • How to represent the word collection
  • Access: given an arbitrary word, how to access
    the relevant entries
  • What information to provide and how to express
    it
  • Lexicon as database about word senses:
  • What are the relations between word senses?
  • How do word senses hook up with conceptual
    knowledge?

18
WordNet
  • In 1985 a group of psychologists and linguists at
    Princeton had the idea of searching dictionaries
    conceptually rather than alphabetically.
  • Attempt to organise a dictionary in terms of word
    meanings rather than word forms.
  • What is the nature and organisation of the
    lexicalised concepts that words can express?
  • Distinction between word forms, word meanings,
    and entries.

19
Lexical Matrix
[Figure: lexical matrix, with the labels "synonymy", "entries" and "polysemy".]

20
WordNet
  • A key aspect of WordNet is that a given meaning
    or word sense is represented as the set of words
    that can be used to express it.
  • These meanings are called synsets: sets of words
    with synonymous readings.
  • Synsets are established empirically according to
    a principle of substitutability that is
    relativised to context.
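
The relationship between word forms and synsets can be sketched in a few lines of Python. The three synsets below are toy data invented for illustration and are not taken from WordNet; the identifiers and function names are likewise hypothetical.

```python
from collections import defaultdict

# Toy synsets (hypothetical data): each meaning is the set of word forms
# that can express it.
SYNSETS = {
    "s1": {"board", "plank"},       # piece of sawn timber
    "s2": {"board", "committee"},   # group of people
    "s3": {"dog"},                  # the animal
}

def build_index(synsets):
    """Map each word form to the ids of the synsets it belongs to."""
    index = defaultdict(set)
    for sid, words in synsets.items():
        for word in words:
            index[word].add(sid)
    return index

INDEX = build_index(SYNSETS)

def synonyms(word):
    """Synonymy: other word forms sharing at least one synset with word."""
    result = set()
    for sid in INDEX.get(word, ()):
        result |= SYNSETS[sid]
    result.discard(word)
    return result

def is_polysemous(word):
    """Polysemy: a word form that expresses more than one meaning."""
    return len(INDEX.get(word, ())) > 1

print(sorted(synonyms("board")))
print(is_polysemous("board"), is_polysemous("plank"))
```

Here "board" is polysemous because it appears in two synsets, while "plank" is a synonym of "board" only in the timber sense.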

21
The Principle of Substitutability
  • Two expressions are synonymous if the
    substitution of one for another never alters the
    truth value of a sentence in which the
    substitution is made.
  • Two expressions are synonymous in linguistic
    context C if the substitution of one for the
    other in C does not alter the truth value.
  • e.g. plank/board in carpentry contexts

22
Lexical Matrix
[Figure: lexical matrix, with the label "entries".]

23
WordNet
  • In WordNet, the synonymy relation between words
    is fundamental.
  • Synsets can be thought of as representing
    concepts which stand in various semantic
    relations to each other:
  • X Antonym Y: meaning (synset) X is opposite to
    meaning (synset) Y (e.g. big, small)
  • X Hyponym Y: like isa (e.g. dog, mammal)
  • X Meronym Y: X is a part of Y (e.g. leg, man)

24
Lexicon as a Concept Graph
  • We can thus imagine the WordNet Lexicon as a
    gigantic graph whose nodes are synsets and whose
    arcs are semantic relations between synsets.
  • Such a structure can be regarded as a semantic
    map of the concepts used in a given language.
  • Many applications can be created using the
    WordNet graph as a resource
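
A toy version of such a graph is sketched below in Python, with synsets reduced to single labels. The first three arcs reuse the examples given for the semantic relations (big/small, dog/mammal, leg/man); the mammal to animal arc is an added assumption, and the function names are hypothetical. Real applications would query WordNet itself rather than a hand-built list.

```python
from collections import deque

# Arcs are (source synset, relation, target synset).
ARCS = [
    ("big", "antonym", "small"),
    ("dog", "hyponym", "mammal"),
    ("leg", "meronym", "man"),
    ("mammal", "hyponym", "animal"),  # assumption added for the example
]

def neighbours(node, relation):
    """Synsets reached from node by one arc of the given relation."""
    return [t for s, r, t in ARCS if s == node and r == relation]

def ancestors(node):
    """All synsets reachable by following hyponym (isa) arcs upward,
    in breadth-first order."""
    seen, queue = [], deque([node])
    while queue:
        current = queue.popleft()
        for parent in neighbours(current, "hyponym"):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen

print(ancestors("dog"))
print(neighbours("leg", "meronym"))
```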

25
Using WordNet to Measure Semantic Orientations of
Adjectives
Jaap Kamps, Maarten Marx, Robert J. Mokken,
Maarten de Rijke

26
Conclusion
  • The lexicon is a central building block of
    language-sensitive systems.
  • Schizophrenic status of lexical information:
    linguistic versus world knowledge.
  • As a wordlist, the lexicon has to solve the
    problems of representation and access.
    Morphological analysis can help to keep the
    number of entries to a manageable level.
  • As a collection of definitions, the lexicon has to
    deal with relationships between word meanings.