Title: Tools and Algorithms for NLP
- Bertrand Gaiffe and Guy Perrier
1 - Summary notions of linguistics and
computational linguistics
- Generalities about natural languages
- Formal languages and natural languages
- Computational linguistics and Natural Language
Processing
1.1 - Generalities about natural languages
- Human language: the ability to represent and
communicate knowledge
- Human language operates in a society by means
of a system of signs, a natural language, used to
produce utterances.
- The basic signs of a natural language are its
words. A word is a pair made of a phonological
form (the signifiant, or signifier) and a
meaningful content (the signifié, or signified)
(Ferdinand de Saussure, Cours de linguistique
générale, 1916).
- From sounds to meaning, every natural language
presents different levels, which are autonomous
while interacting closely. Each level gives rise
to a field of linguistics.
- Phonetics deals with the physical aspects of
producing and perceiving the sounds (phones) of
a natural language.
- Phonology studies how sounds contribute to
building words via abstract units, phonemes, and
how phonemes are realized in utterances, subject
to prosody.
- Morphology concerns the way in which elementary
signs, morphemes, combine to build words.
- Syntax concerns the combination of words to build
sentences. There are two main views of syntax:
phrase structure grammars take the constituent as
the basic concept, whereas dependency grammars
take the dependency between words as the basic
concept.
- Semantics concerns the meaning of linguistic
utterances independently of their context of use.
Logic is a usual framework for representing the
meaning of utterances.
- Pragmatics concerns the meaning of linguistic
utterances relative to their context of use
(discourse, reference resolution, communicative
structure, dialogue).
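
To make the two views of syntax concrete, here is a minimal sketch of how the same sentence could be encoded under each of them. The example sentence, the category labels and the relation names are illustrative assumptions, not taken from a particular grammar.

```python
# Phrase structure view: a tree of constituents, encoded here as nested
# tuples (label, child, child, ...). Labels are illustrative.
constituency = (
    "S",
    ("NP", ("Det", "the"), ("N", "cat")),
    ("VP", ("V", "sleeps")),
)

# Dependency view: a set of (head, relation, dependent) arcs between words.
# Relation names are illustrative.
dependencies = [
    ("sleeps", "subject", "cat"),
    ("cat", "determiner", "the"),
]


def leaves(tree):
    """Return the words of a constituency tree, from left to right."""
    if isinstance(tree, str):
        return [tree]
    _label, *children = tree
    return [word for child in children for word in leaves(child)]


def dependents(head, arcs):
    """Return the words that depend directly on the given head."""
    return [dep for h, _relation, dep in arcs if h == head]


print(leaves(constituency))
print(dependents("sleeps", dependencies))
```

In the first encoding the basic objects are groups of words (NP, VP); in the second there are no groups at all, only word-to-word links.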
- The grammar of a natural language is a system of
categories and rules governing the phonological,
morphological, syntactic, semantic and pragmatic
levels of the language.
- The lexicon of a natural language is the set of
all its words with their linguistic properties. A
lexeme is an element of the lexicon and it
contains phonological, morphological, syntactic
and semantic information related to a word of the
language.
- The grammar and the lexicon are complementary:
both participate in the characterisation of the
language.
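
As an illustration only, a lexeme can be pictured as a record with one field per kind of information listed above. The class name, the field names and the sample entry are hypothetical; real lexicons use much richer, formalism-specific descriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lexeme:
    """One lexicon entry; every field is a simplified, illustrative value."""
    lemma: str
    phonology: str
    morphology: str
    syntax: str
    semantics: str


# A toy lexicon indexed by lemma (hypothetical content).
LEXICON = {
    "cat": Lexeme(
        lemma="cat",
        phonology="/kæt/",
        morphology="noun, regular plural in -s",
        syntax="common noun, can head a noun phrase",
        semantics="domestic feline animal",
    ),
}


def lookup(lemma: str) -> Optional[Lexeme]:
    """Return the entry for a lemma, or None if the lexicon lacks it."""
    return LEXICON.get(lemma)


entry = lookup("cat")
print(entry.syntax if entry else "unknown word")
```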
1.2 - Formal languages and natural languages
- A formal language L over a finite alphabet Σ of
symbols is a subset of the free monoid Σ* of
words built from the elements of Σ.
- The class of languages defined over Σ is equipped
with operations: intersection, union
(disjunction), concatenation, complementation,
Kleene closure...
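
The operations listed above can be tried out on small finite languages represented as Python sets of strings. This is a sketch under one stated assumption: since the Kleene closure and the complement are infinite in general, both are truncated to words of length at most max_len, a bound introduced only for this illustration.

```python
from itertools import product


def concat(l1, l2):
    """Concatenation: every word of l1 followed by every word of l2."""
    return {u + v for u, v in product(l1, l2)}


def kleene_star(lang, max_len):
    """Words of the Kleene closure of lang whose length is <= max_len."""
    result = {""}      # the empty word always belongs to the closure
    frontier = {""}
    while frontier:
        # Extend the newest words by one more word of lang, keeping only
        # short enough words that have not been seen yet.
        frontier = {w for w in concat(frontier, lang)
                    if len(w) <= max_len and w not in result}
        result |= frontier
    return result


def complement(lang, alphabet, max_len):
    """Words over the alphabet, of length <= max_len, that are not in lang."""
    return kleene_star(set(alphabet), max_len) - set(lang)


L1 = {"a", "ab"}
L2 = {"b", ""}
print(L1 | L2)                    # union
print(L1 & L2)                    # intersection
print(concat(L1, L2))             # concatenation
print(kleene_star({"ab"}, 4))     # closure, truncated to length 4
print(complement({"a"}, "ab", 1)) # complement, truncated to length 1
```

Union and intersection are simply the set operations, which is why they need no dedicated function here.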
- If L is infinite, it is important to have a
computation procedure for recognizing L, that is
for deciding whether any given string of Σ*
belongs to L. If such a procedure exists, L is
said to be recursive.
- If there exists a computation procedure for
enumerating the words of L, L is said to be
recursively enumerable (every recursive language
is recursively enumerable, but the converse does
not hold).
- A formal grammar is a concise definition of a
formal language in the form of initial data
and a procedure for generating the words of the
language from the initial data.
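
The notions of grammar, enumeration and recognition can be sketched on a toy example. The grammar below (S -> a S b | empty word, generating the words a^n b^n) is a hypothetical illustration chosen for its simplicity; generate enumerates the language from the initial data, and recognize is a decision procedure, which shows that this particular language is recursive.

```python
from itertools import islice

# Initial data of the toy grammar: one nonterminal S with two rules,
# S -> aSb and S -> "" (the empty word). Hypothetical example.
RULES = {"S": ["aSb", ""]}


def generate(rules, start="S"):
    """Enumerate the words of the language, one derivation step at a time."""
    forms = [start]
    while forms:
        next_forms = []
        for form in forms:
            # Position of the leftmost nonterminal, if any.
            idx = next((i for i, s in enumerate(form) if s in rules), None)
            if idx is None:
                yield form  # only terminals left: a word of the language
            else:
                for rhs in rules[form[idx]]:
                    next_forms.append(form[:idx] + rhs + form[idx + 1:])
        forms = next_forms


def recognize(word):
    """Decide membership in { a^n b^n }; always terminates."""
    n = len(word) // 2
    return word == "a" * n + "b" * n


print(list(islice(generate(RULES), 4)))
print(recognize("aabb"), recognize("aab"))
```

The enumeration never stops on its own, since the language is infinite; islice is used to look at a finite prefix of it.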
- In a formal language, symbols are concatenated to
build the words of the language in a potentially
infinite way. In a natural language, words are
concatenated to build the utterances of the
language in a potentially infinite way too.
- In a formal language, words are two-sided
objects: (form, meaning) pairs. In a natural
language, utterances are also two-sided
objects: they are (sound, meaning) pairs.
- Chomsky's results on the formalization of
grammars for natural languages (1956) played a
great part in the development of the theory of
formal languages.
- Ambiguity is excluded from formal languages
whereas it has an important place in natural
languages: an utterance is ambiguous if
multiple alternative linguistic structures
can be built from it. According to the
source of this multiplicity, we distinguish
lexical, phonological, syntactic and semantic
ambiguity.
- Formal languages are frozen whereas natural
languages evolve. Linked to this property, the
border between acceptable and unacceptable
linguistic utterances is fuzzy and shifting.
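
Syntactic ambiguity can be made tangible by counting the structures a grammar assigns to an utterance. The sketch below uses a chart-based (CYK-style) count over a small grammar in Chomsky normal form; the grammar, the lexicon and the example sentence are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Toy binary rules: (left child, right child) -> possible parent categories.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # the PP attaches to the verb phrase
    ("NP", "PP"): ["NP"],   # the PP attaches to the noun phrase
    ("Det", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}

# Toy lexicon: word -> possible categories.
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"], "a": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}


def count_parses(words, start="S"):
    """Count the distinct syntactic trees of the word sequence."""
    n = len(words)
    # chart[i, j, cat] = number of trees of category cat over words[i:j]
    chart = defaultdict(int)
    for i, word in enumerate(words):
        for cat in LEXICON.get(word, []):
            chart[i, i + 1, cat] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for (left, right), parents in BINARY.items():
                    ways = chart[i, k, left] * chart[k, j, right]
                    if ways:
                        for parent in parents:
                            chart[i, j, parent] += ways
    return chart[0, n, start]


print(count_parses("I saw the man with a telescope".split()))
print(count_parses("I saw the man".split()))
```

With this grammar the first sentence receives two structures, depending on whether "with a telescope" modifies the seeing or the man, while the second receives only one.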
1.3 - Computational Linguistics and Natural
Language Processing
- Computational Linguistics (CL) and Natural
Language Processing (NLP) are the application of
mathematics and computer science to linguistics.
The former is science-oriented whereas the latter
is application-oriented.
- They are driven by two paradigms:
  - The symbolic paradigm aims at modelling
  pre-existing linguistic knowledge with symbolic
  systems.
  - The stochastic paradigm aims at extracting
  linguistic information from corpora with
  stochastic methods.
- Natural language processing is structured
along two directions: analysis, from the
utterances to their pragmatic interpretation, and
generation, from pragmatic goals to the utterances
that realize them linguistically.