Title: Natural Language Processing
1. Natural Language Processing
- Artificial Intelligence
- Seminar Project
- Cristea Emilia, gr. 922
2. Introduction
- Dave Bowman: "Open the pod bay doors, HAL."
HAL: "I'm sorry Dave, I'm afraid I can't do
that." (Stanley Kubrick and Arthur C. Clarke,
screenplay of 2001: A Space Odyssey)
- The HAL 9000 computer from Stanley Kubrick's film
2001: A Space Odyssey is one of the most
recognizable characters in 20th-century cinema.
HAL is an artificial agent capable of decision
making, speaking and understanding English (and,
at a crucial moment in the plot, even reading
lips). Today it is clear that Arthur C. Clarke
was a little too optimistic in predicting when
such an entity would be available. But just how
far off was he?
- Minimally, such an agent would have to be capable
of interacting with humans via language, which
includes
- understanding humans through speech recognition
and natural language understanding
- communicating with humans through speech
synthesis and natural language generation
- It would also need to do information retrieval,
information extraction and inference (drawing
conclusions based on known facts).
- Although these problems are far from being
completely solved, much of the needed
language-related technology is currently being
developed (some of it is already available
commercially). Solving problems of this kind is
the main concern of the fields known as Natural
Language Processing, Computational Linguistics,
and Speech Recognition and Synthesis.
3. What is NLP?
- Natural language processing (NLP) is a subfield
of artificial intelligence and computational
linguistics. It studies the problems of automated
generation and understanding of natural human
languages.
- Natural-language-generation systems convert
information from computer databases into
normal-sounding human language.
- Natural-language-understanding systems convert
samples of human language into more formal
representations that are easier for computer
programs to manipulate.
- Computational linguistics is an interdisciplinary
field dealing with the statistical and/or
rule-based modeling of natural language from a
computational perspective. This modeling is not
limited to any particular field of linguistics.
4.
- Traditionally, computational linguistics was
usually performed by computer scientists.
- Recent research has shown that human language is
much more complex than previously thought, so
computational linguists often work as members of
interdisciplinary teams.
- In general, computational linguistics draws upon
the involvement of linguists, computer
scientists, experts in artificial intelligence,
cognitive psychologists, mathematicians, and
logicians, amongst others.
5. Applications of NLP
- Natural Language Processing (NLP) is the use of
computers to process written and spoken language
for some practical, useful purpose:
- to translate languages
- to get information from the web or from text data
banks so as to answer questions
- to carry on conversations with machines
- These are only examples of the major types of NLP
application; there is also a huge range of lesser
but interesting applications, e.g. getting a
computer to decide if one newspaper story has been
rewritten from another, or constructing a summary
of a given text.
- Language is the fabric of the web. The rapid
growth of the Internet/WWW and the emergence of
the information society pose exciting new
challenges to language technology. Although the
new media combine text, graphics, sound and
movies, the whole world of multimedia information
can only be structured, indexed and navigated
through language.
- For browsing, navigating, filtering and
processing the information on the web, we need
software that can get at the contents of
documents. Language technology for content
management is a necessary precondition for
turning the wealth of digital information into
collective knowledge.
6. Examples
- E.g. information retrieval
- E.g. summarization
7. More on NLP
- Natural Language Processing (NLP) is both a
modern computational technology and a method of
investigating and evaluating claims about human
language itself.
- NLP normally has an emphasis on the role of
knowledge representations, that is to say the
need for representations of our knowledge of the
world in order to understand human language with
computers.
- NLP is not simply applications but the core
technical methods and theories that the major
tasks above divide up into, such as Machine
Learning techniques. This last is closer to
Artificial Intelligence, and is an essential
component of NLP: if computers are to engage in
realistic conversations they must, like us, have
an internal model of the humans they converse
with.
- NLP is challenging:
- AI-complete: to solve NLP, you'd need to solve
all of the problems in AI. Natural-language
recognition seems to require extensive knowledge
about the outside world and the ability to
manipulate it.
- Turing test: posits that engaging effectively
in linguistic behavior is a sufficient condition
for having achieved intelligence.
8. Problems in NLP
- Limitations: In theory, natural-language
processing is a very attractive method of
human-computer interaction. Early systems such as
SHRDLU, working in restricted "blocks worlds"
with restricted vocabularies, worked extremely
well, leading researchers to excessive optimism,
which was soon lost when the systems were
extended to more realistic situations with
real-world ambiguity and complexity.
- Concrete problems: The sentences "We gave the
monkeys the bananas because they were hungry" and
"We gave the monkeys the bananas because they
were over-ripe" have the same surface grammatical
structure. However, the pronoun "they" refers to
the monkeys in one sentence and the bananas in the
other, and it is impossible to tell which without
knowledge of the properties of monkeys and
bananas.
- A string of words may be interpreted in different
ways. For example, in "Time flies like an arrow"
the word "flies" is most naturally read as a verb
and "like" as a preposition, whereas in "Fruit
flies like a banana" "flies" is a noun and "like"
is the verb. Each string also admits further, less
plausible readings.
- The sentence "Colorless green ideas sleep
furiously" is grammatically correct but
nonsensical. Because such a sentence is
grammatical yet had presumably never occurred in
any text, linguist Noam Chomsky concluded that
data-driven approaches will always suffer from a
lack of data, and hence are doomed to failure
(see Statistical NLP).
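The lexical ambiguity behind "Time flies like an arrow" can be made concrete by listing the part-of-speech assignments a system has to choose between. The following is a minimal sketch in Python: the toy lexicon and its tag sets are assumptions made for illustration, not entries from a real dictionary.

```python
from itertools import product

# Toy lexicon: each word maps to the parts of speech it may take.
# These tag sets are illustrative assumptions, not a real dictionary.
TOY_LEXICON = {
    "time": ["NOUN", "VERB"],
    "fruit": ["NOUN"],
    "flies": ["NOUN", "VERB"],
    "like": ["VERB", "PREP"],
    "an": ["DET"],
    "a": ["DET"],
    "arrow": ["NOUN"],
    "banana": ["NOUN"],
}


def tag_sequences(sentence):
    """Return every possible tag sequence for the sentence under TOY_LEXICON."""
    words = sentence.lower().split()
    options = [TOY_LEXICON[word] for word in words]
    # The Cartesian product pairs every tag choice of one word with
    # every tag choice of the others.
    return [list(zip(words, tags)) for tags in product(*options)]


if __name__ == "__main__":
    for text in ("Time flies like an arrow", "Fruit flies like a banana"):
        print(f"{text!r}: {len(tag_sequences(text))} candidate tag sequences")
```

Not every candidate sequence corresponds to a grammatical sentence. Filtering them requires a grammar, and choosing among the survivors requires world knowledge or statistics.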
9. Major Tasks in NLP
- Automatic summarization
- Foreign Language Reading Aid
- Foreign Language Writing Aid
- Information extraction
- Information retrieval
- Machine translation
- Named entity recognition
- Natural language generation
- Optical Character Recognition
- Question answering
- Speech recognition
- Spoken dialogue system
- Text simplification
- Text to speech
- Text-proofing
10. Major Obstacle in NLP
- Ambiguity, at all levels of analysis!
- Phonetics and phonology: concern how words are
related to the sounds that realize them ("I
scream" vs. "ice cream").
- Morphology: concerns how words are constructed
from sub-word units.
- Syntax: concerns sentence structure; a different
syntactic structure implies a different
interpretation.
- Semantics: concerns what words mean and how these
meanings combine to form sentence meanings (e.g.
in "Jack invited Mary to the Halloween ball", is
the ball a dance or some big sphere with Halloween
decorations?).
- Discourse: concerns how the immediately preceding
sentences affect the interpretation of the next
sentence.
11. Statistical NLP
- Statistical natural-language processing uses
stochastic, probabilistic and statistical methods
to resolve some of the difficulties discussed
above, especially those which arise because
longer sentences are highly ambiguous when
processed with realistic grammars, yielding
thousands or millions of possible analyses.
- Methods for disambiguation often involve the use
of corpora and Markov models. Statistical NLP
comprises all quantitative approaches to
automated language processing, including
probabilistic modeling, information theory, and
linear algebra. The technology for statistical
NLP comes mainly from machine learning and data
mining, both of which are fields of artificial
intelligence that involve learning from data.
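As an illustration of how a corpus and a Markov model can disambiguate, the sketch below estimates bigram probabilities from a tiny corpus and uses them to compare two word sequences that sound alike. The corpus is invented for this example, and the add-one smoothing is a simplifying assumption rather than a method prescribed by the slides.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
CORPUS = [
    "i like ice cream",
    "we like ice cream",
    "i scream loudly",
    "they eat ice cream",
]


def train_bigrams(sentences):
    """Count bigrams and their left contexts; <s> marks the sentence start."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def sequence_probability(words, model):
    """First-order Markov estimate: product of P(w_i | w_{i-1}), add-one smoothed."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + list(words)
    probability = 1.0
    for previous, current in zip(tokens, tokens[1:]):
        probability *= (bigrams[(previous, current)] + 1) / (contexts[previous] + len(vocab))
    return probability


if __name__ == "__main__":
    model = train_bigrams(CORPUS)
    for candidate in ("we like ice cream", "we like i scream"):
        print(candidate, sequence_probability(candidate.split(), model))
```

On this corpus the model prefers "we like ice cream" to "we like i scream", because the bigrams "like ice" and "ice cream" were observed more often than "like i" and "i scream".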
12. Statistical NLP vs. Linguistics
- We must not go overboard and mistakenly conclude
that the successes of statistical NLP render
linguistics irrelevant. The information and
insight that linguists, psychologists, and others
have gathered about language is invaluable in
creating high-performance broad-domain language
understanding systems.
- The head-driven phrase structure grammar (HPSG)
formalism is a way of analyzing natural language
utterances that truly marries deep linguistic
information with computer science mechanisms,
such as unification and recursive data-types, for
representing and propagating this information
throughout the utterance's structure.
- In sum, computational techniques and data-driven
methods are now an integral part both of building
systems capable of handling language in a
domain-independent, flexible, and graceful way,
and of improving our understanding of language
itself.
13. Lexical semantics: WordNet
- Handcrafted database of lexical relations.
- Three separate databases: nouns; verbs;
adjectives and adverbs.
- Each database is a set of lexical entries
(according to unique orthographic forms).
- Set of senses associated with each entry.
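The organization described above (one database per part-of-speech group, entries keyed by orthographic form, a set of senses per entry) can be pictured with a small data-structure sketch. This is not WordNet's actual file format or API: the class, the function names and the glosses are hypothetical, and the lexical relations between senses are left out.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One orthographic form together with its list of senses."""
    form: str
    senses: list = field(default_factory=list)


# One database per part-of-speech group, as in the description above.
DATABASES = {"noun": {}, "verb": {}, "adjective_adverb": {}}


def add_sense(database, form, gloss):
    """Attach a sense gloss to the entry for `form`, creating it if needed."""
    entry = DATABASES[database].setdefault(form, LexicalEntry(form))
    entry.senses.append(gloss)
    return entry


def lookup(form):
    """Collect the senses of a form across all three databases."""
    return {name: db[form].senses for name, db in DATABASES.items() if form in db}


# Placeholder glosses for illustration, not real WordNet data.
add_sense("noun", "pine", "an evergreen tree")
add_sense("verb", "pine", "to waste away through sorrow")
add_sense("noun", "cone", "a solid body that narrows to a point")
add_sense("noun", "cone", "the fruit of certain evergreen trees")

if __name__ == "__main__":
    print(lookup("pine"))
    print(lookup("cone"))
```

Keeping one entry per orthographic form means that a form such as "pine" appears in two databases, each with its own senses.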
14. Word sense disambiguation
- This is an NLP task with a long history, but one
which has come to prominence in recent years as a
new, and very high-level, application of
empirical and machine learning methods in NLP.
High levels of success have now been achieved
both with small selections of words in a corpus
and with the disambiguation of all content words.
The task now has its own competition, SENSEVAL,
and has been extended to a range of languages.
- Problem description: given a fixed set of senses
associated with a lexical item, determine
which of them applies to a particular instance of
the lexical item.
- Two fundamental approaches:
- Integrated approach: WSD occurs during semantic
analysis as a side-effect of the elimination of
ill-formed semantic representations.
- Stand-alone approach: WSD is performed
independently of, and prior to, compositional
semantic analysis. It makes minimal assumptions
about what information will be available from
other NLP processes.
15. Survey of WSD methods
- In general terms, word sense disambiguation (WSD)
involves the association of a given word in a
text or discourse with a definition or meaning
(sense) which is distinguishable from other
meanings potentially attributable to that word.
The task therefore necessarily involves two
steps:
- (1) the determination of all the different
senses for every word relevant (at least) to the
text or discourse under consideration; and
- (2) a means to assign each occurrence of a word
to the appropriate sense.
- Much recent work on WSD relies on pre-defined
senses for step (1), including:
- a list of senses such as those found in everyday
dictionaries
- a group of features, categories, or associated
words (e.g., synonyms, as in a thesaurus)
- an entry in a transfer dictionary which includes
translations in another language, etc.
- The precise definition of a sense is, however, a
matter of considerable debate within the
community. The variety of approaches to defining
senses has raised recent concern about the
comparability of much WSD work.
16. Dictionary-based approaches
- Rely on machine-readable dictionaries (MRDs).
- The initial implementation of this kind of
approach is due to Michael Lesk (1986).
- Given a word W to be disambiguated:
- Retrieve all of the sense definitions, S, for
W from the MRD.
- Compare each s in S to the dictionary
definitions of all the remaining words in the
context.
- Select the sense s with the most overlap with
(the definitions of) these context words.
- Example
- Word: cone
- Context: pine cone
- Sense definitions:
- pine 1: kind of evergreen tree with needle-shaped
leaves
- pine 2: waste away through sorrow or illness
- cone 1: solid body which narrows to a point
- cone 2: something of this shape whether solid or
hollow
- cone 3: fruit of certain evergreen trees
- Accuracy of 50-70% on short samples of text from
Pride and Prejudice.
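The three steps above can be sketched in a few lines of Python, using the sense definitions of the "pine cone" example. This is a simplified reading of the overlap idea, not Lesk's original program: the small stop-word list and the tie-breaking rule (the first sense wins) are assumptions of this sketch.

```python
# Sense definitions copied from the "pine cone" example above.
DEFINITIONS = {
    "pine": [
        "kind of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness",
    ],
    "cone": [
        "solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees",
    ],
}

# Hand-picked stop list (an assumption of this sketch): function words
# such as "of" and "or" would otherwise produce spurious overlaps.
STOP_WORDS = {"a", "of", "or", "to", "this", "which", "with", "through", "whether"}


def content_words(text):
    """Lower-case the text and drop the stop words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


def lesk(word, context):
    """Return (sense_number, overlap) for the best sense of `word` in `context`."""
    # Pool the definition words of all the other words in the context.
    context_vocab = set()
    for other in context:
        if other != word:
            for definition in DEFINITIONS.get(other, []):
                context_vocab |= content_words(definition)
    # Pick the sense whose definition shares the most words with that pool.
    best_sense, best_overlap = None, -1
    for number, definition in enumerate(DEFINITIONS[word], start=1):
        overlap = len(content_words(definition) & context_vocab)
        if overlap > best_overlap:
            best_sense, best_overlap = number, overlap
    return best_sense, best_overlap


if __name__ == "__main__":
    print(lesk("cone", ["pine", "cone"]))  # (3, 1)
    print(lesk("pine", ["pine", "cone"]))  # (1, 1)
```

Both decisions rest on a single shared word, "evergreen". Because words are matched exactly, "tree" and "trees" do not count as an overlap. Such thin evidence is one reason the accuracy of the approach is limited.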
17. Machine learning approaches
- Machine learning methods
- Supervised inductive learning
- Bootstrapping
- Unsupervised
- Emphasis is on acquiring the knowledge needed for
the task from data, rather than from human
analysts.
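To illustrate the supervised case, the sketch below trains a small Naive Bayes classifier on hand-labelled contexts of the word "cone". Naive Bayes is one possible choice of learner, picked here for brevity; the training sentences and the sense labels "fruit" and "shape" are invented for this example.

```python
import math
from collections import Counter, defaultdict

# Hand-labelled toy examples for the word "cone", invented for illustration.
TRAINING = [
    ("fruit", "the pine tree dropped a cone in the forest"),
    ("fruit", "squirrels eat seeds from the cone of a fir tree"),
    ("shape", "the traffic cone marked the road works"),
    ("shape", "ice cream served in a waffle cone"),
]


def train(examples):
    """Count how often each context word occurs with each sense."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, context in examples:
        words = context.split()
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def classify(context, model):
    """Return the sense with the highest add-one-smoothed log probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    scores = {}
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[sense].values()) + len(vocab)
        for word in context.split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[sense][word] + 1) / denominator)
        scores[sense] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train(TRAINING)
    print(classify("a cone fell from the tree", model))
    print(classify("ice cream in a cone", model))
```

The knowledge used to pick a sense comes entirely from the labelled examples. Bootstrapping and unsupervised methods aim to reduce or remove the need for such labels.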
18. WSD Applied: Dialog systems
- A dialog system is a computer system intended to
converse with a human, with a coherent structure.
Dialog systems have employed text, speech,
graphics, haptics, gestures and other modes for
communication on both the input and output
channels.
- Dialog systems: welcome to the wonderful world of
chat-bots
- Jabberwacky: http://www.jabberwacky.com/
- Joan, "Artificially Intelligent, speaking,
videocentric Avatar": http://www.icogno.com/
- Alice, winner of the 2004 Loebner Prize
competition: http://www.alicebot.org/
- Human: Name a philosopher.
ALICE: Hans Moravec. :-)
Human: Who is Hans Moravec?
ALICE: He is a famous robotics expert at Carnegie
Mellon.
Human: Does he like the movie you mentioned?
ALICE: Suppose he does.
Human: What was the name of the movie?
ALICE: It depends on the historical context.
- Human: What is the meaning of life?
ALICE: Actually, it's twenty-three, not forty-two.
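A simple way to build a text-only dialog system is to pair input patterns with response templates. The sketch below shows the idea in Python: the rules are hypothetical, written for this example, and are not taken from ALICE, Jabberwacky or any other real system.

```python
import re

# Hypothetical pattern/response rules, written for this sketch only.
RULES = [
    (re.compile(r"\bwho is (?P<name>[\w ]+)", re.IGNORECASE),
     "I do not know much about {name}."),
    (re.compile(r"\bmy name is (?P<name>\w+)", re.IGNORECASE),
     "Nice to meet you, {name}."),
    (re.compile(r"\bmeaning of life\b", re.IGNORECASE),
     "That is a hard question."),
]
DEFAULT_REPLY = "Tell me more."


def reply(utterance):
    """Answer with the template of the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Named groups captured from the input fill the template slots.
            return template.format(**match.groupdict())
    return DEFAULT_REPLY


if __name__ == "__main__":
    for line in ("Who is Hans Moravec?", "What is the meaning of life?", "Hello"):
        print("Human:", line)
        print("Bot:", reply(line))
```

The sketch keeps no memory of earlier turns, so it cannot resolve a reference such as "the movie you mentioned". Handling such references requires a model of the conversation so far.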
19. Bibliography
- Speech and Language Processing: An Introduction
to Natural Language Processing, Computational
Linguistics, and Speech Recognition. By Daniel
Jurafsky and James H. Martin, Prentice Hall, 2000.
- Computational Linguistics: Models, Resources,
Applications. By Igor Bolshakov and Alexander
Gelbukh, Ciencia de la Computación, 2004.
- "I'm sorry Dave, I'm afraid I can't do that":
Linguistics, Statistics, and Natural Language
Processing circa 2001. By Lillian Lee, Cornell
University. In Computer Science: Reflections on
the Field, Reflections from the Field, 2004.
- Natural Language Processing, Cornell University:
http://www.cs.cornell.edu/
- The Stanford Natural Language Processing Group:
http://nlp.stanford.edu/
- Association for the Advancement of Artificial
Intelligence (AAAI) (formerly the American
Association for A.I.):
http://www.aaai.org/home.html
- Natural Language Processing Research Group at the
University of Sheffield Department of Computer
Science: http://nlp.shef.ac.uk
- Open source NLP projects:
http://opennlp.sourceforge.net/projects.html
- Wikipedia: http://en.wikipedia.org/
20.
- "Hasta la vista, baby."
(Terminator 2: Judgment Day)