Title: Introduction to Natural Language Processing (NLP)
1Introduction to Natural Language Processing (NLP)
- Dekang Lin
- Department of Computing Science
- University of Alberta
- lindek_at_cs.ualberta.ca
2Outline
- What is NLP?
- Applications
- Challenges
- Linguistics Issues
- Course Overview
3Textbook
- Daniel Jurafsky and James H. Martin, Speech and Language Processing, Prentice-Hall, 2000.
- Note: errata are available on the website; please check them before reading each chapter.
4What is Natural Language Processing?
- Natural Language Processing
- Process information contained in natural language text.
- Also known as Computational Linguistics (CL), Human Language Technology (HLT), Natural Language Engineering (NLE)
- Can machines understand human language?
- Define "understand"
- Understanding is the ultimate goal. However, one doesn't need to fully understand to be useful.
5Why Study NLP?
- A hallmark of human intelligence.
- Text is the largest repository of human knowledge and is growing quickly.
- emails, news articles, web pages, IM, scientific articles, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, patent portfolios, court decisions, contracts, ...
- Are we reading any faster than before?
6NLP Applications
- Question answering
- Who is the first Taiwanese president?
- Text Categorization/Routing
- e.g., customer e-mails.
- Text Mining
- Find everything that interacts with BRCA1.
- Machine (Assisted) Translation
- Language Teaching/Learning
- Usage checking
- Spelling correction
- Is that just dictionary lookup?
8Challenges in NLP: Ambiguity
- Words or phrases can often be understood in multiple ways.
- Teacher Strikes Idle Kids
- Killer Sentenced to Die for Second Time in 10 Years
- They denied the petition for his release that was signed by over 10,000 people.
- child abuse expert/child computer expert
- Who does Mary love? (three-way ambiguous)
9Probabilistic/Statistical Resolution of Ambiguities
- When there are ambiguities, choose the interpretation with the highest probability.
- Example: how many times people say
- Mary loves
- the Mary love
- Which interpretation has the highest probability?
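As a minimal sketch of the idea on this slide, the snippet below counts how often each competing phrase occurs in a corpus and picks the interpretation with the highest relative frequency. The toy corpus, the interpretation labels, and the function names `count_phrase` and `most_probable` are assumptions made for illustration; a real system would estimate the counts from a large text collection.

```python
def count_phrase(tokens, phrase):
    """Count how often the token sequence `phrase` occurs in `tokens`."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def most_probable(tokens, candidates):
    """Pick the interpretation whose signal phrase is relatively most frequent.

    `candidates` maps an interpretation label to the phrase that signals it.
    Add-one smoothing keeps an unseen phrase from getting probability zero.
    """
    counts = {
        label: count_phrase(tokens, phrase.lower().split()) + 1
        for label, phrase in candidates.items()
    }
    total = sum(counts.values())
    probs = {label: c / total for label, c in counts.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy corpus, invented for illustration only.
corpus = "mary loves john . everyone knows mary loves music . the mary love story".split()

# Hypothetical labels for two of the competing readings.
candidates = {"Mary as subject": "Mary loves", "love as noun": "the Mary love"}

best, probs = most_probable(corpus, candidates)
print(best, probs)
```

With this toy corpus, "Mary loves" is seen more often than "the Mary love", so the reading in which Mary is the subject wins.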
10Challenges in NLP: Variations
- Syntactic Variations
- I was surprised that Kim lost
- It surprised me that Kim lost
- That Kim lost surprised me.
- The same meaning can be expressed in different ways
- Who wrote The Language Instinct?
- Steven Pinker, an MIT professor and author of The Language Instinct, ...
11Subareas of Linguistics
- Morphology
- structures and patterns in words
- analyzes how words are formed from minimal units of meaning, or morphemes, e.g., dogs = dog + s.
- Syntax
- structures and patterns in phrases
- how phrases are formed by smaller phrases and words
12Subareas of Linguistics
- Semantics: the meaning of a word or phrase within a sentence
- How to represent meaning?
- Semantic network? Logic? Policy?
- How to construct meaning representation?
- Is meaning compositional?
- Pragmatics: structures and patterns in discourses
- Co-reference resolution
- Jane races Mary on weekends. She often beats her.
- Implicatures
- How many times do you go skating each week?
- Speech acts
- Do you know the time?
13Morphology
- Morphology is concerned with the internal make-up of words
- Input: The fearsome cats attacked the foolish dog
- Output: The fear-some cat-s attack-ed the fool-ish dog
- Inflectional morphology
- Does not change the grammatical category of words: cats/cat-s, attacked/attack-ed
- Derivational morphology
- May involve changes to grammatical categories: fearsome/fear-some, foolish/fool-ish
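The input/output pair on this slide can be reproduced with a very small suffix-stripping sketch. The suffix inventories and the `segment` function below are assumptions chosen to cover only this one sentence; they are not a general morphological analyzer.

```python
# Hypothetical suffix inventories, just large enough for the slide's example.
DERIVATIONAL = ["some", "ish"]   # may change the grammatical category
INFLECTIONAL = ["ed", "s"]       # keeps the grammatical category


def segment(word, min_stem=3):
    """Split off one known suffix; return (segmented_word, kind_of_suffix)."""
    for kind, suffixes in (("derivational", DERIVATIONAL),
                           ("inflectional", INFLECTIONAL)):
        for suffix in suffixes:
            stem = word[:-len(suffix)]
            # Require a minimum stem length so short words stay whole.
            if word.endswith(suffix) and len(stem) >= min_stem:
                return f"{stem}-{suffix}", kind
    return word, None


sentence = "The fearsome cats attacked the foolish dog"
segmented = " ".join(segment(w)[0] for w in sentence.split())
print(segmented)  # The fear-some cat-s attack-ed the fool-ish dog
```

The second element returned by `segment` records whether the stripped suffix was inflectional or derivational, mirroring the distinction drawn on the slide.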
14Morphology Is not as Easy as It May Seem to be
- Examples from Woods et al. 2000
- delegate (de + leg + ate): take the legs from
- caress (car + ess): female car
- cashier (cashy + er): more wealthy
- lacerate (lace + rate): speed of tatting
- ratify (rat + ify): infest with rodents
- infantry (infant + ry): childish behavior
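A short sketch shows how these spurious analyses arise when a program simply looks for any way to cut a word into known morphemes. The morpheme inventory and the `splits` function are hypothetical, chosen to match the examples on this slide.

```python
# Hypothetical morpheme inventory covering the slide's examples.
MORPHEMES = frozenset({
    "de", "leg", "ate", "car", "ess", "lace", "rate",
    "rat", "ify", "infant", "ry",
})


def splits(word, morphemes=MORPHEMES):
    """Return every way to write `word` as a concatenation of known morphemes."""
    if not word:
        return [[]]  # one way to split the empty string: no pieces
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in morphemes:
            # Keep this prefix and split the remainder recursively.
            results.extend([prefix] + rest for rest in splits(word[i:], morphemes))
    return results


for w in ["delegate", "caress", "lacerate", "ratify", "infantry"]:
    print(w, splits(w))
```

Every word receives a formally valid but meaningless analysis, which is why a list of morphemes alone is not enough to analyze words correctly.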
15A Turkish Example (Oflazer and Guzey 1994)
- uygarlastiramayabileceklerimizdenmissinizcesine
- uygar/civilized las/BECOME tir/CAUS ama/NEG yabil/POT ecek/FUT ler/3PL imiz/POSS-1PL den/ABL mis/NARR siniz/2PL cesine/AS-IF
- an adverb meaning roughly "(behaving) as if you were one of those whom we might not be able to civilize."
16Why not just Use a Dictionary?
- How many words are there in a language?
- English: OED, 400K entries
- Turkish: 600 x 10^6 forms
- Finnish: 10^7 forms
- New words are being invented all the time
- e-mail
- IM
17Syntax is about Sentence Structures
- Sentences have structures and are made up of constituents.
- The constituents are phrases.
- A phrase consists of a head and modifiers.
- The category of the head determines the category of the phrase
- e.g., a phrase headed by a noun is a noun phrase
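The head-and-modifiers view of a phrase can be written down as a small data structure in which the phrase category is derived from the head. The class names `Word` and `Phrase` and the category table are assumptions made for this sketch.

```python
from dataclasses import dataclass, field

# Assumed mapping from the head's part of speech to the phrase category.
PHRASE_CATEGORY = {"N": "NP", "V": "VP", "A": "AP", "P": "PP"}


@dataclass
class Word:
    text: str
    pos: str  # part of speech, e.g. "N" for noun


@dataclass
class Phrase:
    head: Word
    modifiers: list = field(default_factory=list)

    @property
    def category(self):
        """The head's category determines the category of the whole phrase."""
        return PHRASE_CATEGORY[self.head.pos]


noun_phrase = Phrase(head=Word("kids", "N"), modifiers=[Word("idle", "A")])
print(noun_phrase.category)  # NP
```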
18Parsing
- Analyze the structure of a sentence
19Two parse trees for "Teacher strikes idle kids"
- [S [NP [N Teacher]] [VP [V strikes] [NP [A idle] [N kids]]]]
- [S [NP [N Teacher] [N strikes]] [VP [V idle] [NP [N kids]]]]
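The two readings of the headline "Teacher strikes idle kids" can be enumerated mechanically. The sketch below is a minimal exhaustive parser over a toy grammar; the grammar rules, the lexicon, and the function names `parses` and `show` are assumptions made for illustration, not the parser used in the course.

```python
from functools import lru_cache

# Toy grammar and lexicon (assumptions), just enough for the headline.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("N",), ("N", "N"), ("A", "N")],
    "VP": [("V", "NP")],
}
LEXICON = {
    "teacher": {"N"},
    "strikes": {"N", "V"},   # plural noun or verb
    "idle": {"A", "V"},      # adjective or verb
    "kids": {"N"},
}


def parses(words):
    """Return all parse trees rooted in S for the given list of words."""
    words = tuple(w.lower() for w in words)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        """All trees rooted in `symbol` that cover words[i:j]."""
        trees = []
        if j - i == 1 and symbol in LEXICON.get(words[i], ()):
            trees.append((symbol, words[i]))
        for rhs in GRAMMAR.get(symbol, ()):
            if len(rhs) == 1:
                trees += [(symbol, t) for t in derive(rhs[0], i, j)]
            else:
                left, right = rhs
                for k in range(i + 1, j):  # try every split point
                    for lt in derive(left, i, k):
                        for rt in derive(right, k, j):
                            trees.append((symbol, lt, rt))
        return tuple(trees)

    return list(derive("S", 0, len(words)))


def show(tree):
    """Render a tree as a bracketed string."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return f"[{label} {children[0]}]"
    return f"[{label} " + " ".join(show(c) for c in children) + "]"


trees = parses("Teacher strikes idle kids".split())
for t in trees:
    print(show(t))
```

The parser finds exactly two trees: one where "strikes" is the verb and "idle kids" its object, and one where "teacher strikes" is a noun compound and "idle" is the verb.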
20Syntax
- Syntax is the study of the regularities and constraints of word order and phrase structure
- How words are organized into phrases
- How phrases are combined into larger phrases (including sentences).
21Course Overview: Background Theories
- Linguistics
- Syntax
- Binding theory
- Probability and Information Theory
- Markov model
- Bayesian network
- EM (expectation/estimation maximization)
22Course Overview: Enabling Technologies
- Stemming
- Reduce detects, detected, detecting, detect to the same form.
- POS Tagging
- Determine for each word whether it is a noun, adjective, verb, ...
- Parsing
- sentence -> parse tree
- Word Sense Disambiguation
- orange juice vs. orange coat
- Learning from text
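The stemming example from this slide can be sketched with a few suffix rules. The suffix list and the `stem` function are assumptions that cover only the forms listed above; a practical stemmer needs many more rules and exceptions.

```python
# Hypothetical rule set: far smaller than what a real stemmer uses.
SUFFIXES = ("ing", "ed", "s")


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


forms = ["detects", "detected", "detecting", "detect"]
stems = {stem(w) for w in forms}
print(stems)  # {'detect'}
```

All four forms collapse to the same stem. The same rules mishandle words such as "running" (which becomes "runn"), which hints at why stemming is treated as a topic of its own.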
23Course Overview: Applications
- Question Answering
- Machine Translation
- Text Mining/Information Extraction