Title: Python for NLP and the Natural Language Toolkit
1 Python for NLP and the Natural Language Toolkit
- CS1573 AI Application Development, Spring 2003
- (modified from Edward Loper's notes)
2 Outline
- Review: Introduction to NLP (knowledge of language, ambiguity, representations and algorithms, applications)
- HW 2 discussion
- Tutorials: Basics, Probability
3 Python and Natural Language Processing
- Python is a great language for NLP
- Simple
- Easy to debug
- Exceptions
- Interpreted language
- Easy to structure
- Modules
- Object oriented programming
- Powerful string manipulation
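- As a quick illustration of the last point, here is a minimal sketch (not from the original slides) of everyday Python string manipulation:

      s = "The aluminum-export ban."
      print(s.lower())                  # 'the aluminum-export ban.'
      print(s.split())                  # ['The', 'aluminum-export', 'ban.']
      print(s.replace('-', ' '))        # 'The aluminum export ban.'
      print(s[:3])                      # 'The'  (strings support slicing)
      print(' '.join(['new', 'york']))  # 'new york'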
4 Modules and Packages
- Python modules package program code and data for reuse. (Lutz)
  - Similar to a library in C or a package in Java.
- Python packages are hierarchical modules (i.e., modules that contain other modules).
- Three commands for accessing modules:
  - import
  - from...import
  - reload
5 Modules and Packages: import
- The import command loads a module
  - Load the regular expression module:
      >>> import re
- To access the contents of a module, use dotted names
  - Use the search method from the re module:
      >>> re.search('\w', str)
- To list the contents of a module, use dir:
      >>> dir(re)
      ['DOTALL', 'I', 'IGNORECASE', ...]
6 Modules and Packages: from...import
- The from...import command loads individual functions and objects from a module
  - Load the search function from the re module:
      >>> from re import search
- Once an individual function or object is loaded with from...import, it can be used directly
  - Use the search method from the re module:
      >>> search('\w', str)
7 import vs. from...import
- import
  - Keeps module functions separate from user functions.
  - Requires the use of dotted names.
  - Works with reload.
- from...import
  - Puts module functions and user functions together.
  - More convenient names.
  - Does not work with reload.
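- A minimal sketch (not from the slides) of the namespace difference described above:

      import re                  # module object; contents need dotted names
      print(re.search('a', 'cat'))

      from re import search      # binds search directly into our namespace
      print(search('a', 'cat'))

      # Defining our own search() would now shadow the one imported from re --
      # this is the sense in which from...import puts module functions and
      # user functions together.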
8 Modules and Packages: reload
- If you edit a module, you must use the reload command before the changes become visible in Python:
      >>> import mymodule
      ...
      >>> reload(mymodule)
- The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
9 Introduction to NLTK
- The Natural Language Toolkit (NLTK) provides:
  - Basic classes for representing data relevant to natural language processing.
  - Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
  - Standard implementations of each task, which can be combined to solve complex problems.
10 NLTK Example Modules
- nltk.token: processing individual elements of text, such as words or sentences.
- nltk.probability: modeling frequency distributions and probabilistic systems.
- nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.
- nltk.parser: high-level interface for parsing texts.
- nltk.chartparser: a chart-based implementation of the parser interface.
- nltk.chunkparser: a regular-expression based surface parser.
11 NLTK Top-Level Organization
- NLTK is organized as a flat hierarchy of packages and modules.
- Each module provides the tools necessary to address a specific task.
- Modules contain two types of classes:
  - Data-oriented classes are used to represent information relevant to natural language processing.
  - Task-oriented classes encapsulate the resources and methods needed to perform a specific task.
12 To the First Tutorials
- Tokens and Tokenization
- Frequency Distributions
13 The Token Module
- It is often useful to think of a text in terms of smaller elements, such as words or sentences.
- The nltk.token module defines classes for representing and processing these smaller elements.
- What might be other useful smaller elements?
14 Tokens and Types
- The term "word" can be used in two different ways:
  - To refer to an individual occurrence of a word
  - To refer to an abstract vocabulary item
- For example, the sentence "my dog likes his dog" contains five occurrences of words, but four vocabulary items.
- To avoid confusion, use more precise terminology:
  - Word token: an occurrence of a word
  - Word type: a vocabulary item
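- A minimal plain-Python sketch (not from the slides) of the token/type distinction, using the example sentence above:

      sentence = "my dog likes his dog"
      tokens = sentence.split()   # each occurrence is a word token
      types = set(tokens)         # the distinct vocabulary items
      print(len(tokens))          # 5 word tokens
      print(len(types))           # 4 word types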
15 Tokens and Types (continued)
- In NLTK, tokens are constructed from their types using the Token constructor:
      >>> from nltk.token import Token
      >>> my_word_type = 'dog'
      >>> my_word_type
      'dog'
      >>> my_word_token = Token(my_word_type)
      >>> my_word_token
      'dog'@[?]
- Token member functions include type and loc
16 Text Locations
- A text location @[s:e] specifies a region of a text:
  - s is the start index
  - e is the end index
- The text location @[s:e] specifies the text beginning at s, and including everything up to (but not including) the text at e.
- This definition is consistent with Python slices.
- Think of indices as appearing between elements:
        I   saw   a   man
      0    1    2   3     4
- Shorthand notation when location width = 1.
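- A minimal sketch (not from the slides) of the Python slice behavior that the location notation mirrors:

      words = "I saw a man".split()   # indices fall between elements
      print(words[0:2])               # ['I', 'saw'] -- like a location @[0:2]
      print(words[2:3])               # ['a']        -- width 1; shorthand form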
17 Text Locations (continued)
- Indices can be based on different units:
  - character
  - word
  - sentence
- Locations can be tagged with sources (files, other text locations; e.g., the first word of the first sentence in the file)
- Location member functions:
  - start
  - end
  - unit
  - source
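- A conceptual sketch (hypothetical; not the actual nltk.token API) of the information a text location carries:

      class MyLocation:
          # Hypothetical stand-in for an NLTK text location.
          def __init__(self, start, end, unit, source):
              self._start = start    # region starts here...
              self._end = end        # ...and ends just before here
              self._unit = unit      # e.g. 'c' (char), 'w' (word), 's' (sentence)
              self._source = source  # e.g. a filename
          def start(self): return self._start
          def end(self): return self._end
          def unit(self): return self._unit
          def source(self): return self._source

      loc = MyLocation(0, 1, 'w', 'corpus.txt')  # first word of corpus.txt
      print(loc.start(), loc.unit(), loc.source())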
18 Tokenization
- The simplest way to represent a text is with a single string.
  - Difficult to process text in this format.
- Often, it is more convenient to work with a list of tokens.
- The task of converting a text from a single string to a list of tokens is known as tokenization.
19 Tokenization (continued)
- Tokenization is harder than it seems
  - "I'll see you in New York."
  - "The aluminum-export ban."
- The simplest approach is to use "graphic words" (i.e., separate words using whitespace)
- Another approach is to use regular expressions to specify which substrings are valid words.
- NLTK provides a generic tokenization interface: TokenizerI
20 TokenizerI
- Defines a single method, tokenize, which takes a string and returns a list of tokens
- tokenize is independent of the level of tokenization and of the implementation algorithm
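- A minimal sketch (hypothetical; plain Python rather than the NLTK classes) of two tokenizers behind the same tokenize-style interface, using an example from the previous slide:

      import re

      class MyWhitespaceTokenizer:
          # Hypothetical: "graphic words" -- split on whitespace.
          def tokenize(self, text):
              return text.split()

      class MyRegexpTokenizer:
          # Hypothetical: keep only substrings matching a word pattern.
          def tokenize(self, text):
              return re.findall(r"\w+(?:'\w+)?", text)

      text = "I'll see you in New York."
      print(MyWhitespaceTokenizer().tokenize(text))  # [..., 'New', 'York.']
      print(MyRegexpTokenizer().tokenize(text))      # [..., 'New', 'York']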
21 Example
    from nltk.token import WSTokenizer
    from nltk.draw.plot import Plot

    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)

    # Count up how many times each word length occurs
    wordlen_count_list = []
    for token in tokens:
        wordlen = len(token.type())
        # Add zeros until wordlen_count_list is long enough
        while wordlen >= len(wordlen_count_list):
            wordlen_count_list.append(0)
        # Increment the count for this word length
        wordlen_count_list[wordlen] += 1
    Plot(wordlen_count_list)
22 Next Tutorial: Probability
- An experiment is any process which leads to a well-defined outcome
- A sample is any possible outcome of a given experiment
- Rolling a die?
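- For the die question: rolling the die is the experiment, and each face 1-6 is a possible sample. A minimal sketch (not from the slides) that counts outcomes with a plain dictionary:

      import random

      counts = {}                        # sample -> number of occurrences
      for _ in range(100):               # run the experiment 100 times
          sample = random.randint(1, 6)  # one outcome of a die roll
          counts[sample] = counts.get(sample, 0) + 1
      print(counts)                      # e.g. {1: 14, 2: 18, ...}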
23 Outline
- Review: Basics
- Probability
  - Experiments and Samples
  - Frequency Distributions
  - Conditional Frequency Distributions
24 Review: NLTK Goals
- Classes for NLP data
- Interfaces for NLP tasks
- Implementations, easily combined (what is an example?)
25 Accessing NLTK
- What is the relation to Python?
26 Words
- Types and Tokens
- Text Locations
- Member Functions
27 Tokenization
- TokenizerI
- Implementations:
      >>> tokenizer = WSTokenizer()
      >>> tokenizer.tokenize(text_str)
      ['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w], 'a'@[4w], 'test'@[5w], 'file.'@[6w]]
28 Word Length Freq. Distribution Example
    from nltk.token import WSTokenizer
    from nltk.probability import SimpleFreqDist

    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)

    # Construct a frequency distribution of word lengths
    wordlen_freqs = SimpleFreqDist()
    for token in tokens:
        wordlen_freqs.inc(len(token.type()))

    # Extract the set of word lengths found in the corpus
    wordlens = wordlen_freqs.samples()
29 Frequency Distributions
- A frequency distribution records the number of times each outcome of an experiment has occurred:
      >>> freq_dist = FreqDist()
      >>> for token in document:
      ...     freq_dist.inc(token.type())
- Constructor, then initialization by storing experimental outcomes
30 Methods
- The freq method returns the frequency of a given sample.
- We can find the number of times a given sample occurred with the count method.
- We can find the total number of sample outcomes recorded by a frequency distribution with the N method.
- The samples method returns a list of all samples that have been recorded as outcomes by a frequency distribution.
- We can find the sample with the greatest number of outcomes with the max method.
31 Examples of Methods
      >>> freq_dist.count('the')
      6
      >>> freq_dist.freq('the')
      0.012
      >>> freq_dist.N()
      500
      >>> freq_dist.max()
      'the'
32 Simple Word Length Example
      >>> from nltk.token import WSTokenizer
      >>> from nltk.probability import FreqDist
      >>> corpus = open('corpus.txt').read()
      >>> tokens = WSTokenizer().tokenize(corpus)
      # What is the distribution of word lengths in a corpus?
      >>> freq_dist = FreqDist()
      >>> for token in tokens:
      ...     freq_dist.inc(len(token.type()))
- What is the "outcome" for our experiment?
33 Simple Word Length Example
      >>> from nltk.token import WSTokenizer
      >>> from nltk.probability import FreqDist
      >>> corpus = open('corpus.txt').read()
      >>> tokens = WSTokenizer().tokenize(corpus)
      # What is the distribution of word lengths in a corpus?
      >>> freq_dist = FreqDist()
      >>> for token in tokens:
      ...     freq_dist.inc(len(token.type()))
- The word's length is the "outcome" for our experiment, so we use inc() to increment its count in a frequency distribution.
34 Complex Word Length Example
      # Define vowels as "a", "e", "i", "o", and "u"
      >>> VOWELS = ('a', 'e', 'i', 'o', 'u')
      # What is the distribution of word lengths for words ending in vowels?
      >>> freq_dist = FreqDist()
      >>> for token in tokens:
      ...     if token.type()[-1].lower() in VOWELS:
      ...         freq_dist.inc(len(token.type()))
- What is the condition?
35 More Complex Example
- What is the distribution of word lengths for words following words that end in vowels?
      >>> ended_in_vowel = 0   # Did the last word end in a vowel?
      >>> freq_dist = FreqDist()
      >>> for token in tokens:
      ...     if ended_in_vowel:
      ...         freq_dist.inc(len(token.type()))
      ...     ended_in_vowel = token.type()[-1].lower() in VOWELS
36 Conditional Frequency Distributions
- A condition specifies the context in which an experiment is performed
- A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions
- The individual frequency distributions are indexed by the condition.
- NLTK ConditionalFreqDist class:
      >>> cfdist = ConditionalFreqDist()
      >>> cfdist
      <ConditionalFreqDist with 0 conditions>
37 Conditional Frequency Distributions (continued)
- To access the frequency distribution for a condition, use the indexing operator:
      >>> cfdist['a']
      <FreqDist with 0 outcomes>
- Record lengths of some words starting with 'a':
      >>> for word in 'apple and arm'.split():
      ...     cfdist['a'].inc(len(word))
- How many are 3 characters long?
      >>> cfdist['a'].freq(3)
      0.66667
- To list accessed conditions, use the conditions method:
      >>> cfdist.conditions()
      ['a']
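- Conceptually, a conditional frequency distribution is like a dictionary mapping each condition to its own frequency distribution. A minimal plain-Python sketch (hypothetical; not the NLTK class):

      class MyCondFreqDist:
          # Hypothetical stand-in: condition -> {sample: count}.
          def __init__(self):
              self._dists = {}
          def __getitem__(self, condition):
              # Accessing a new condition creates an empty distribution,
              # mirroring cfdist['a'] above.
              return self._dists.setdefault(condition, {})
          def conditions(self):
              return list(self._dists.keys())

      cfd = MyCondFreqDist()
      for word in 'apple and arm'.split():
          dist = cfd['a']
          dist[len(word)] = dist.get(len(word), 0) + 1
      print(cfd['a'])          # {5: 1, 3: 2}
      print(cfd.conditions())  # ['a']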
38 Example: Conditioning on a Word's Initial Letter
      >>> from nltk.token import WSTokenizer
      >>> from nltk.probability import ConditionalFreqDist
      >>> from nltk.draw.plot import Plot
      >>> corpus = open('corpus.txt').read()
      >>> tokens = WSTokenizer().tokenize(corpus)
      >>> cfdist = ConditionalFreqDist()
39 Example (continued)
- How does initial letter affect word length?
      >>> for token in tokens:
      ...     outcome = len(token.type())
      ...     condition = token.type()[0].lower()
      ...     cfdist[condition].inc(outcome)
- What are the condition and the outcome?
40 Example (continued)
- How does initial letter affect word length?
      >>> for token in tokens:
      ...     outcome = len(token.type())
      ...     condition = token.type()[0].lower()
      ...     cfdist[condition].inc(outcome)
- What are the condition and the outcome?
  - Condition: the initial letter of the token
  - Outcome: its word length
41 Prediction
- Prediction is the problem of deciding a likely outcome for a given run of an experiment.
- To predict the outcome, we first examine a training corpus.
- Training corpus:
  - The context and outcome for each run are known
- Given a new run, we choose the outcome that occurred most frequently for the context
- A conditional frequency distribution finds the most frequent occurrence
42 Prediction Example Outline
- Record each outcome in the training corpus, using the context that the experiment was run under as the condition
- Access the frequency distribution for a given context with the indexing operator
- Use the max() method to find the most likely outcome
43 Example: Predicting Words
- Predict a word's type, based on the preceding word's type
      >>> from nltk.token import WSTokenizer
      >>> from nltk.probability import ConditionalFreqDist
      >>> corpus = open('corpus.txt').read()
      >>> tokens = WSTokenizer().tokenize(corpus)
      >>> cfdist = ConditionalFreqDist()   # empty
44 Example (continued)
      >>> context = None   # The type of the preceding word
      >>> for token in tokens:
      ...     outcome = token.type()
      ...     cfdist[context].inc(outcome)
      ...     context = token.type()
45 Example (continued)
      >>> cfdist['prediction'].max()
      'problems'
      >>> cfdist['problems'].max()
      'in'
      >>> cfdist['in'].max()
      'the'
- What are we predicting here?
46 Example (continued)
- We predict the most likely word for any context
- Generation application:
      >>> word = 'prediction'
      >>> for i in range(15):
      ...     print word,
      ...     word = cfdist[word].max()
      prediction problems in the frequency distribution of the frequency distribution of the frequency distribution of
- Because max() always returns the same outcome for a given context, the generated text soon falls into a cycle.
47 For Next Time
- HW3
- To run NLTK from unixs.cis.pitt.edu, you should add /afs/cs.pitt.edu/projects/nltk/bin to your search path
- Regular Expressions (J&M handout, NLTK tutorial)