Title: Introduction to Computational Linguistics
1 Introduction to Computational Linguistics
2 Introduction
- An inventory of words is an essential component of programs for a wide variety of language-sensitive applications, such as:
  - Spellchecking, stylechecking
  - IR, IE, message understanding
  - parsing, generation, MT
  - TTS and STT
- Such an inventory is usually called a dictionary or lexicon.
3 Dictionaries
- The purpose of a dictionary is to provide a wide range of information about words.
- Some of this is linguistic information, e.g. syntactic category, pronunciation, distribution.
- But dictionaries also contain definitions of word senses, thus providing knowledge not just about language but about the world itself.
4 What is "dog"?
- dog (ANIMAL), noun [C]
- a common four-legged animal, especially kept by people as a pet or to hunt or guard things
- Examples: my pet dog; wild dogs; dog food; We could hear dogs barking in the distance.
- (from Cambridge Advanced Learner's Dictionary)
5 Senses of Dog
- "dog" was found in the Cambridge Advanced Learner's Dictionary at the entries listed below.
  - dog (ANIMAL)
  - dog (PERSON)
  - dog (FOLLOW)
  - dog (PROBLEM)
- These are different senses, or lexemes, for "dog".
6 Two Views of the Lexicon give rise to different issues
- Lexicon as word database
  - How to represent the word collection
  - Lookup: given an arbitrary word, how to access the relevant entries
  - What information to provide and how to express it
- Lexicon as database about word senses
  - Representation of word sense
  - What are the relations between word senses?
  - How do word senses hook up with conceptual knowledge?
7 Lexicon as Word Database: Representing the Word Collection
- Some possible representations
  - Text file
  - Finite state automaton
  - Other specialised data structure which allows for common prefixes, e.g. letter tree
- Full form vs. lexeme plus morphological analysis
8 FSA for Sublexicon Fragment
[Figure: finite state automaton whose arcs are labelled with the letters o, t, h, e, s, e, a, i, t, s]
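A finite state automaton for a small sublexicon can be sketched as a transition table plus a set of final states. The word set used here (these, those, that, this) is an assumption: it is consistent with the arc labels that survive from the figure, but the original diagram is not recoverable, so the state numbering and the exact automaton are illustrative only.

```python
# Transition table: (current state, letter) -> next state.
# Assumed word set: these, those, that, this (hypothetical reconstruction).
TRANSITIONS = {
    (0, "t"): 1,
    (1, "h"): 2,
    (2, "e"): 3,  # the-se
    (2, "o"): 3,  # tho-se (shares the "se" tail with "these")
    (3, "s"): 4,
    (4, "e"): 5,
    (2, "a"): 6,  # tha-t
    (6, "t"): 5,
    (2, "i"): 7,  # thi-s
    (7, "s"): 5,
}
FINAL_STATES = {5}


def accepts(word):
    """Return True if the automaton ends in a final state after reading word."""
    state = 0
    for letter in word:
        state = TRANSITIONS.get((state, letter))
        if state is None:  # no arc for this letter: reject
            return False
    return state in FINAL_STATES


print([w for w in ["these", "those", "that", "this", "the"] if accepts(w)])
```

Sharing the initial "th" arcs and the final "se" arcs is what makes the automaton more compact than a plain word list.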
9 Letter Tree
ltree( b, a, r, k, bark,
       c, a, r, r, y, carry,
             t, cat,
                e, g, o, r, y, category,
       d, e, l, a, y, delay,
       h, e, l, p, help,
          o, p, hop,
                e, hope,
       q, u, a, r, r, y, quarry,
             i, z, quiz,
             o, t, e, quote
     ).
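The letter tree above can be approximated with nested dictionaries, one level per letter, so that words with a common prefix share nodes. This is a minimal sketch, not the slide's own notation: the names `build_ltree`, `lookup` and the end-of-word marker `END` are assumptions introduced for illustration.

```python
WORDS = ["bark", "carry", "cat", "category", "delay", "help",
         "hop", "hope", "quarry", "quiz", "quote"]

END = "$"  # hypothetical marker key: a complete word ends at this node


def build_ltree(words):
    """Build a letter tree (trie) as nested dicts keyed by letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = word  # store the entry at the end of its letter path
    return root


def lookup(tree, word):
    """Follow the letters of word; return the stored entry or None."""
    node = tree
    for letter in word:
        node = node.get(letter)
        if node is None:
            return None
    return node.get(END)


tree = build_ltree(WORDS)
print(lookup(tree, "cat"), lookup(tree, "category"), lookup(tree, "ca"))
```

Lookup time depends on the length of the word, not on the number of entries, and "cat" and "category" share the path c, a, t.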
10 Full Form Dictionary
- There is an entry for every possible word.
- No need for morphological processing
- Exceptions are handled automatically
- OK when number of entries is not too large.
- Repeated information.
- Because languages have different morphological
properties, full form is better for some
languages than for others.
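A minimal sketch of a full-form dictionary, assuming a hypothetical four-entry lexicon with made-up field names (`lemma`, `pos`, `number`). It illustrates two of the points above: the irregular plural needs no special treatment, and information is repeated across the forms of the same lemma.

```python
# Every inflected form has its own entry (all entries are illustrative).
FULL_FORM = {
    "cat":   {"lemma": "cat",   "pos": "noun", "number": "sing"},
    "cats":  {"lemma": "cat",   "pos": "noun", "number": "plur"},
    "mouse": {"lemma": "mouse", "pos": "noun", "number": "sing"},
    "mice":  {"lemma": "mouse", "pos": "noun", "number": "plur"},
}


def lookup(word):
    """Direct lookup: no morphological processing is involved."""
    return FULL_FORM.get(word)


print(lookup("mice"))   # the exception is just another entry
print(lookup("dogs"))   # None: a form that was never listed is not found
```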
11 Morphological Analysis Lexicon
[Figure: the input word "cats" feeding a Morphological Analysis component]
12 Morphological Analysis
- Very roughly, morphological analysis of a word involves two subproblems:
  - A segmentation problem: how to get from the written text to the sequence of morphemes that make it up.
  - A morphotactic problem: how to combine the individual morphemes together in a legitimate way.
13 Segmentation/Morphotactic Subproblems
- Segmentation problem
  - enlargement -> en large ment
- Morphotactic problem: given what we know about en, large and ment, how can they be legitimately combined?
  - enlargement -> (en large) ment
  - enlargement -/-> en (large ment)
  - en ADJ -> V
  - V ment -> N
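The two subproblems can be sketched separately. The code below is an illustration under stated assumptions: the morpheme inventory, the `ROOTS` table and the function names are hypothetical, and only the two rules (en ADJ gives V, V ment gives N) come from the slide. Bracketings are written as nested pairs.

```python
MORPHEMES = {"en", "large", "ment"}          # assumed morpheme inventory
ROOTS = {"large": "ADJ"}                     # assumed root categories
PREFIX_RULES = {("en", "ADJ"): "V"}          # en ADJ -> V
SUFFIX_RULES = {("V", "ment"): "N"}          # V ment -> N
PREFIXES = {prefix for prefix, _ in PREFIX_RULES}
SUFFIXES = {suffix for _, suffix in SUFFIX_RULES}


def segmentations(word):
    """Segmentation: all ways to split word into known morphemes."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in MORPHEMES:
            for rest in segmentations(word[i:]):
                results.append([head] + rest)
    return results


def category(tree):
    """Morphotactics: category of a bracketing, or None if illegitimate."""
    if isinstance(tree, str):
        return ROOTS.get(tree)
    left, right = tree
    if isinstance(left, str) and left in PREFIXES:
        return PREFIX_RULES.get((left, category(right)))
    if isinstance(right, str) and right in SUFFIXES:
        return SUFFIX_RULES.get((category(left), right))
    return None


print(segmentations("enlargement"))
print(category((("en", "large"), "ment")))   # (en large) ment
print(category(("en", ("large", "ment"))))   # en (large ment)
```

The second bracketing fails because "ment" attaches to a verb, and "large" on its own is an adjective.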
14 2-Level Morphology
- In 1981 the four Ks (Kimmo Koskenniemi, Lauri Karttunen, Ronald M. Kaplan and Martin Kay) were working on morphological analysis (MA).
- The basic idea was that MA is about computing a relation between sets of strings at two levels:
  - Surface level (strings of words made from a surface alphabet)
  - Lexical level (strings of morphemes made from a lexical alphabet)
- The relation can be computed using finite state transducers.
- Reversibility of the finite-state model: the same transducer can be used for both analysis and generation.
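To make the reversibility point concrete, here is a toy transducer, not Koskenniemi's two-level rule formalism. It assumes a lexical alphabet of lower-case letters plus a morpheme boundary "+", and a single suffix "s"; the boundary corresponds to nothing on the surface. All names (`TRANSITIONS`, `transduce`) are hypothetical.

```python
import string

EPS = ""  # empty string: the symbol has no counterpart on the other level

# Each arc: (state, lexical symbol, surface symbol, next state).
TRANSITIONS = [(0, ch, ch, 0) for ch in string.ascii_lowercase]  # stem letters
TRANSITIONS += [
    (0, "+", EPS, 1),  # morpheme boundary is invisible on the surface
    (1, "s", "s", 2),  # plural suffix
]
FINAL_STATES = {0, 2}


def transduce(text, lexical_to_surface=True):
    """Run the same arcs in either direction and collect all outputs."""
    results = set()
    stack = [(0, 0, "")]  # (state, position in text, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text) and state in FINAL_STATES:
            results.add(out)
        for src, lex, surf, dst in TRANSITIONS:
            if src != state:
                continue
            inp, outp = (lex, surf) if lexical_to_surface else (surf, lex)
            if inp == EPS:                      # consume nothing, emit outp
                stack.append((dst, pos, out + outp))
            elif text.startswith(inp, pos):     # consume one input symbol
                stack.append((dst, pos + len(inp), out + outp))
    return results


print(transduce("cat+s"))                           # generation
print(transduce("cats", lexical_to_surface=False))  # analysis
```

Analysis returns both "cats" and "cat+s" because this sketch has no lexicon to rule out the unsegmented reading; a real system would constrain the lexical side with the lexicon.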
15 What Information to Provide
- Specific information, e.g. "kicks"
  - Syntactic information
    - POS: verb
    - Tense: pres
    - Number: singular
    - Person: 3
    - Type: transitive
  - Semantic information
    - event-type: physical action
    - type-of subject: animate
    - type-of object: physical
16 What Information to Provide
- General information
  - Class attributes
    - Agreement has (Number, Gender)
  - Enumeration of possible values
    - Gender: masc, fem
    - Number: sing, plur
  - Class relationships
    - Transitive isa Verb
    - Common isa Noun
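One way to picture how specific and general information fit together is an entry stored as attribute-value pairs, checked against the enumerated values and the class hierarchy. This is a sketch only: the dictionary layout and the helper names `is_a` and `has_valid_values` are assumptions, and the entry uses the abbreviated value "sing" from the enumeration.

```python
# General information (from the slides).
ISA = {"Transitive": "Verb", "Common": "Noun"}
POSSIBLE_VALUES = {"Gender": {"masc", "fem"}, "Number": {"sing", "plur"}}

# Specific information for "kicks" (layout is illustrative).
KICKS = {
    "POS": "Verb", "Tense": "pres", "Number": "sing", "Person": 3,
    "Type": "Transitive",
    "event-type": "physical action",
    "type-of subject": "animate",
    "type-of object": "physical",
}


def is_a(cls, ancestor):
    """Follow isa links upwards from cls until ancestor is found or links run out."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False


def has_valid_values(entry):
    """Check every enumerated attribute present in the entry."""
    return all(entry[attr] in allowed
               for attr, allowed in POSSIBLE_VALUES.items()
               if attr in entry)


print(is_a(KICKS["Type"], "Verb"), has_valid_values(KICKS))
```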
17 Two Views of the Lexicon give rise to different issues
- Lexicon as word database
  - How to represent the word collection
  - Access: given an arbitrary word, how to access the relevant entries
  - What information to provide and how to express it
- Lexicon as database about word senses
  - What are the relations between word senses?
  - How do word senses hook up with conceptual knowledge?
18 WordNet
- In 1985 a group of psychologists and linguists at Princeton had the idea of searching dictionaries conceptually rather than alphabetically.
- Attempt to organise a dictionary in terms of word meanings rather than word forms.
- What is the nature and organisation of the lexicalised concepts that words can express?
- Distinction between word forms, word meanings, and entries.
19 Lexical Matrix
[Figure: lexical matrix, with labels: synonymy, entries, polysemy]
20 WordNet
- A key aspect of WordNet is that a given meaning or word sense is represented as the set of words that can be used to express it.
- These meanings are called synsets: sets of words with synonymous readings.
- Synsets are established empirically according to a principle of substitutability that is relativised to context.
21 The Principle of Substitutability
- Two expressions are synonymous if the substitution of one for another never alters the truth value of a sentence in which the substitution is made.
- Two expressions are synonymous in linguistic context C if the substitution of one for the other in C does not alter the truth value.
  - e.g. plank/board in carpentry contexts
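The idea of a meaning as a set of word forms, with synonymy relative to a context, can be sketched as follows. The two synsets and their identifiers are hypothetical examples built around the plank/board case, not actual WordNet data, and the function names are assumptions.

```python
# Hypothetical synsets: identifier -> word forms that can express the meaning.
SYNSETS = {
    "board.timber": {"board", "plank"},
    "board.committee": {"board", "committee"},
}


def senses(word):
    """All synsets a word form belongs to (more than one means polysemy)."""
    return sorted(name for name, forms in SYNSETS.items() if word in forms)


def shared_senses(word_a, word_b):
    """Synsets in which the two forms are interchangeable (synonymy)."""
    return sorted(set(senses(word_a)) & set(senses(word_b)))


def is_polysemous(word):
    return len(senses(word)) > 1


print(senses("board"))
print(shared_senses("plank", "board"))
print(shared_senses("plank", "committee"))
```

Here "plank" and "board" are synonymous only in the timber sense, which mirrors the context-relative version of the principle.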
22 Lexical Matrix
[Figure: lexical matrix, with label: entries]
23 WordNet
- In WordNet, the synonymy relation between words is fundamental.
- Synsets can be thought of as representing concepts which stand in various semantic relations to each other.
  - X Antonym Y: meaning (synset) X is opposite to meaning (synset) Y (e.g. big, small)
  - X Hyponym Y: like isa (e.g. dog, mammal)
  - X Meronym Y: X is a part of Y (e.g. leg, man)
24 Lexicon as a Concept Graph
- We can thus imagine the WordNet lexicon as a gigantic graph whose nodes are synsets and whose arcs are semantic relations between synsets.
- Such a structure can be regarded as a semantic map of the concepts used in a given language.
- Many applications can be created using the WordNet graph as a resource.
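A small labelled graph illustrates the idea. This is a sketch under simplifying assumptions: each synset is represented by a single word, the arcs reuse the examples from the previous slide, and the arc from mammal to animal is added purely for illustration. It does not use the real WordNet database or any WordNet API.

```python
from collections import deque

# Arcs: (source synset, relation, target synset). Illustrative data only.
ARCS = [
    ("dog", "hyponym", "mammal"),
    ("mammal", "hyponym", "animal"),  # assumed extra arc
    ("leg", "meronym", "man"),
    ("big", "antonym", "small"),
]


def neighbours(node, relation):
    """Targets of all arcs with the given label leaving node."""
    return [dst for src, rel, dst in ARCS if src == node and rel == relation]


def reachable(start, relation):
    """Breadth-first search along arcs of one relation type."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours(node, relation):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(reachable("dog", "hyponym"))
```

Following hyponym arcs transitively is what lets an application conclude that a dog is an animal even though no single arc says so.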
25 Using WordNet to Measure Semantic Orientations of Adjectives
- Jaap Kamps, Maarten Marx, Robert J. Mokken, Maarten de Rijke
26 Conclusion
- The lexicon is a central building block of language-sensitive systems.
- Schizophrenic status of lexical information: linguistic versus world knowledge.
- As a wordlist, the lexicon has to solve the problems of representation and access. Morphological analysis can help to keep the number of entries to a manageable level.
- As a collection of definitions, the lexicon has to deal with relationships between word meanings.