1
Python for NLP and the Natural Language Toolkit
  • CS1573 AI Application Development, Spring 2003
  • (modified from Edward Loper's notes)

2
Outline
  • Review: Introduction to NLP (knowledge of
    language, ambiguity, representations and
    algorithms, applications)
  • HW 2 discussion
  • Tutorials: Basics, Probability

3
Python and Natural Language Processing
  • Python is a great language for NLP
  • Simple
  • Easy to debug
    • Exceptions
    • Interpreted language
  • Easy to structure
    • Modules
    • Object oriented programming
  • Powerful string manipulation

4
Modules and Packages
  • Python modules package program code and data for
    reuse. (Lutz)
  • Similar to a library in C or a package in Java.
  • Python packages are hierarchical modules (i.e.,
    modules that contain other modules).
  • Three commands for accessing modules:
  • import
  • from...import
  • reload

5
Modules and Packages: import
  • The import command loads a module
  • Load the regular expression module:
  • >>> import re
  • To access the contents of a module, use dotted
    names
  • Use the search method from the re module:
  • >>> re.search('\w', str)
  • To list the contents of a module, use dir:
  • >>> dir(re)
    ['DOTALL', 'I', 'IGNORECASE', ...]

6
Modules and Packages: from...import
  • The from...import command loads individual
    functions and objects from a module
  • Load the search function from the re module:
  • >>> from re import search
  • Once an individual function or object is loaded
    with from...import, it can be used directly
  • Use the search method from the re module:
  • >>> search('\w', str)

7
import vs. from...import
  • import
  • Keeps module functions separate from user
    functions.
  • Requires the use of dotted names.
  • Works with reload.
  • from...import
  • Puts module functions and user functions
    together.
  • More convenient names.
  • Does not work with reload.

8
Modules and Packages: reload
  • If you edit a module, you must use the reload
    command before the changes become visible in
    Python:
  • >>> import mymodule
  • ...
  • >>> reload(mymodule)
  • The reload command only affects modules that have
    been loaded with import; it does not update
    individual functions and objects loaded with
    from...import.
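  • A minimal sketch of the difference (assuming a
    file mymodule.py that defines a function greet()):

    >>> import mymodule
    >>> from mymodule import greet
    >>> # ... edit greet() in mymodule.py on disk ...
    >>> reload(mymodule)
    >>> mymodule.greet()   # runs the new code
    >>> greet()            # still bound to the old function object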

9
Introduction to NLTK
  • The Natural Language Toolkit (NLTK) provides:
  • Basic classes for representing data relevant to
    natural language processing.
  • Standard interfaces for performing tasks, such as
    tokenization, tagging, and parsing.
  • Standard implementations of each task, which can
    be combined to solve complex problems.

10
NLTK Example Modules
  • nltk.token: processing individual elements of
    text, such as words or sentences.
  • nltk.probability: modeling frequency
    distributions and probabilistic systems.
  • nltk.tagger: tagging tokens with supplemental
    information, such as parts of speech or WordNet
    sense tags.
  • nltk.parser: a high-level interface for parsing
    texts.
  • nltk.chartparser: a chart-based implementation of
    the parser interface.
  • nltk.chunkparser: a regular-expression-based
    surface parser.

11
NLTK Top-Level Organization
  • NLTK is organized as a flat hierarchy of packages
    and modules.
  • Each module provides the tools necessary to
    address a specific task.
  • Modules contain two types of classes:
  • Data-oriented classes are used to represent
    information relevant to natural language
    processing.
  • Task-oriented classes encapsulate the resources
    and methods needed to perform a specific task.

12
To the First Tutorials
  • Tokens and Tokenization
  • Frequency Distributions

13
The Token Module
  • It is often useful to think of a text in terms of
    smaller elements, such as words or sentences.
  • The nltk.token module defines classes for
    representing and processing these smaller
    elements.
  • What might be other useful smaller elements?

14
Tokens and Types
  • The term word can be used in two different ways:
  • To refer to an individual occurrence of a word
  • To refer to an abstract vocabulary item
  • For example, the sentence "my dog likes his dog"
    contains five occurrences of words, but four
    vocabulary items.
  • To avoid confusion, use more precise terminology:
  • Word token: an occurrence of a word
  • Word type: a vocabulary item
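  • A quick plain-Python illustration of the
    distinction (a sketch; whitespace splitting is
    assumed for simplicity):

    >>> sentence = 'my dog likes his dog'
    >>> tokens = sentence.split()    # word tokens: occurrences
    >>> types = set(tokens)          # word types: vocabulary items
    >>> len(tokens), len(types)
    (5, 4)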

15
Tokens and Types (continued)
  • In NLTK, tokens are constructed from their types
    using the Token constructor:

    >>> from nltk.token import *
    >>> my_word_type = 'dog'
    >>> my_word_type
    'dog'
    >>> my_word_token = Token(my_word_type)
    >>> my_word_token
    'dog'@[?]

  • Token member functions include type and loc

16
Text Locations
  • A text location @[s:e] specifies a region of a
    text:
  • s is the start index
  • e is the end index
  • The text location @[s:e] specifies the text
    beginning at s, and including everything up to
    (but not including) the text at e.
  • This definition is consistent with Python slices.
  • Think of indices as appearing between elements:

    | I | saw | a | man |
    0   1     2   3     4

  • Shorthand notation when location width = 1.
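  • The same convention appears in ordinary Python
    slices (a plain-Python sketch; the word list
    stands in for a tokenized text):

    >>> words = ['I', 'saw', 'a', 'man']
    >>> words[1:3]    # from index 1 up to (but not including) 3
    ['saw', 'a']
    >>> words[2:3]    # a width-1 region
    ['a']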

17
Text Locations (continued)
  • Indices can be based on different units
  • character
  • word
  • sentence
  • Locations can be tagged with sources (files,
    other text locations; e.g., the first word of
    the first sentence in the file)
  • Location member functions:
  • start
  • end
  • unit
  • source

18
Tokenization
  • The simplest way to represent a text is with a
    single string.
  • Difficult to process text in this format.
  • Often, it is more convenient to work with a list
    of tokens.
  • The task of converting a text from a single
    string to a list of tokens is known as
    tokenization.

19
Tokenization (continued)
  • Tokenization is harder than it seems:
  • "I'll see you in New York."
  • "The aluminum-export ban."
  • The simplest approach is to use graphic words
    (i.e., separate words using whitespace)
  • Another approach is to use regular expressions to
    specify which substrings are valid words (see the
    sketch below)
  • NLTK provides a generic tokenization interface:
    TokenizerI
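  • A sketch of both approaches in plain Python (the
    regular expression is illustrative, not NLTK's
    actual pattern):

    >>> import re
    >>> text = "I'll see you in New York."
    >>> text.split()    # graphic words: split on whitespace
    ["I'll", 'see', 'you', 'in', 'New', 'York.']
    >>> re.findall(r"\w+(?:'\w+)?|\S", text)    # regex-defined words
    ["I'll", 'see', 'you', 'in', 'New', 'York', '.']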

20
TokenizerI
  • Defines a single method, tokenize, which takes a
    string and returns a list of tokens.
  • tokenize is independent of the level of
    tokenization and of the implementation algorithm.

21
Example
  • from nltk.token import WSTokenizer
    from nltk.draw.plot import Plot

    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)

    # Count up how many times each word length occurs
    wordlen_count_list = []
    for token in tokens:
        wordlen = len(token.type())
        # Add zeros until wordlen_count_list is long enough
        while wordlen >= len(wordlen_count_list):
            wordlen_count_list.append(0)
        # Increment the count for this word length
        wordlen_count_list[wordlen] += 1
    Plot(wordlen_count_list)

22
Next Tutorial Probability
  • An experiment is any process which leads to a
    well-defined outcome
  • A sample is any possible outcome of a given
    experiment
  • Rolling a die?
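  • For example, rolling a die is an experiment, and
    each of the outcomes 1-6 is a possible sample. A
    quick simulation sketch in plain Python:

    >>> import random
    >>> samples = [random.randint(1, 6) for i in range(1000)]
    >>> samples[:5]    # first five outcomes (your numbers will differ)
    [3, 1, 6, 6, 2]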

23
Outline
  • Review: Basics
  • Probability
  • Experiments and Samples
  • Frequency Distributions
  • Conditional Frequency Distributions

24
Review: NLTK Goals
  • Classes for NLP data
  • Interfaces for NLP tasks
  • Implementations, easily combined (what is an
    example?)

25
Accessing NLTK
  • What is the relation to Python?

26
Words
  • Types and Tokens
  • Text Locations
  • Member Functions

27
Tokenization
  • TokenizerI
  • Implementations
  • >>> tokenizer = WSTokenizer()
  • >>> tokenizer.tokenize(text_str)
    ['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w], 'a'@[4w],
     'test'@[5w], 'file.'@[6w]]

28
Word Length Freq. Distribution Example
  • from nltk.token import WSTokenizer
    from nltk.probability import SimpleFreqDist

    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)

    # Construct a frequency distribution of word lengths
    wordlen_freqs = SimpleFreqDist()
    for token in tokens:
        wordlen_freqs.inc(len(token.type()))

    # Extract the set of word lengths found in the corpus
    wordlens = wordlen_freqs.samples()

29
Frequency Distributions
  • A frequency distribution records the number of
    times each outcome of an experiment has occurred
  • >>> freq_dist = FreqDist()
    >>> for token in document:
    ...     freq_dist.inc(token.type())
  • Constructor, then initialization by storing
    experimental outcomes
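  • Conceptually, a frequency distribution is just a
    counting dictionary; a rough plain-Python
    analogue of the inc() loop above:

    freq_dist = {}
    for word in ['my', 'dog', 'likes', 'his', 'dog']:
        # get(word, 0) supplies a zero count for unseen samples
        freq_dist[word] = freq_dist.get(word, 0) + 1
    # freq_dist == {'my': 1, 'dog': 2, 'likes': 1, 'his': 1}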

30
Methods
  • The freq method returns the frequency of a given
    sample.
  • We can find the number of times a given sample
    occurred with the count method.
  • We can find the total number of sample outcomes
    recorded by a frequency distribution with the N
    method.
  • The samples method returns a list of all samples
    that have been recorded as outcomes by a
    frequency distribution.
  • We can find the sample with the greatest number
    of outcomes with the max method.

31
Examples of Methods
  • >>> freq_dist.count('the')
    6
  • >>> freq_dist.freq('the')
    0.012
  • >>> freq_dist.N()
    500
  • >>> freq_dist.max()
    'the'

32
Simple Word Length Example
  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))
  • What is the "outcome" for our experiment?

33
Simple Word Length Example
  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))
  • This length is the "outcome" for our experiment,
    so we use inc() to increment its count in a
    frequency distribution.

34
Complex Word Length Example
  • # Define vowels as "a", "e", "i", "o", and "u"
    >>> VOWELS = ('a', 'e', 'i', 'o', 'u')
    # What is the distribution of word lengths for
    # words ending in vowels?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if token.type()[-1].lower() in VOWELS:
    ...         freq_dist.inc(len(token.type()))
  • What is the condition?

35
More Complex Example
  • # What is the distribution of word lengths for
    # words following words that end in vowels?
    >>> ended_in_vowel = 0    # Did the last word end in a vowel?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if ended_in_vowel:
    ...         freq_dist.inc(len(token.type()))
    ...     ended_in_vowel = token.type()[-1].lower() in VOWELS

36
Conditional Frequency Distributions
  • A condition specifies the context in which an
    experiment is performed.
  • A conditional frequency distribution is a
    collection of frequency distributions for the
    same experiment, run under different conditions.
  • The individual frequency distributions are
    indexed by the condition.
  • NLTK's ConditionalFreqDist class:
  • >>> cfdist = ConditionalFreqDist()
    >>> cfdist
    <ConditionalFreqDist with 0 conditions>

37
Conditional Frequency Distributions (continued)
  • To access the frequency distribution for a
    condition, use the indexing operator:
  • >>> cfdist['a']
    <FreqDist with 0 outcomes>
  • Record lengths of some words starting with 'a':
  • >>> for word in 'apple and arm'.split():
    ...     cfdist['a'].inc(len(word))
  • How many are 3 characters long?
  • >>> cfdist['a'].freq(3)
    0.66667
  • To list the accessed conditions, use the
    conditions method:
  • >>> cfdist.conditions()
    ['a']

38
Example: Conditioning on a Word's Initial Letter
  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import ConditionalFreqDist
    >>> from nltk.draw.plot import Plot
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()

39
Example (continued)
  • # How does the initial letter affect word length?
    >>> for token in tokens:
    ...     outcome = len(token.type())
    ...     condition = token.type()[0].lower()
    ...     cfdist[condition].inc(outcome)
  • What are the condition and the outcome?

40
Example (continued)
  • # How does the initial letter affect word length?
    >>> for token in tokens:
    ...     outcome = len(token.type())
    ...     condition = token.type()[0].lower()
    ...     cfdist[condition].inc(outcome)
  • What are the condition and the outcome?
  • Condition: the initial letter of the token
  • Outcome: its word length

41
Prediction
  • Prediction is the problem of deciding a likely
    outcome for a given run of an experiment.
  • To predict the outcome, we first examine a
    training corpus.
  • Training corpus:
  • The context and outcome for each run are known
  • Given a new run, we choose the outcome that
    occurred most frequently for the context
  • A conditional frequency distribution finds the
    most frequent occurrence

42
Prediction Example Outline
  • Record each outcome in the training corpus, using
    the context that the experiment was run under as
    the condition
  • Access the frequency distribution for a given
    context with the indexing operator
  • Use the max() method to find the most likely
    outcome (see the plain-Python sketch below)
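  • A minimal plain-Python sketch of this recipe (a
    dictionary of counting dictionaries stands in for
    ConditionalFreqDist):

    def train(tokens):
        # Map each context (previous word) to outcome counts
        cfdist = {}
        context = None
        for word in tokens:
            counts = cfdist.setdefault(context, {})
            counts[word] = counts.get(word, 0) + 1
            context = word
        return cfdist

    def predict(cfdist, context):
        # Choose the outcome seen most often in this context
        counts = cfdist.get(context, {})
        if not counts:
            return None
        return max(counts, key=counts.get)

  • For example, predict(train(words), 'in') returns
    the word that most often followed 'in' in the
    training list words.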

43
Example Predicting Words
  • Predict a word's type, based on the preceding
    word's type
  • >>> from nltk.token import WSTokenizer
  • >>> from nltk.probability import ConditionalFreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()    # empty

44
Example (continued)
  • >>> context = None    # The type of the preceding word
    >>> for token in tokens:
    ...     outcome = token.type()
    ...     cfdist[context].inc(outcome)
    ...     context = token.type()

45
Example (continued)
  • >>> cfdist['prediction'].max()
    'problems'
    >>> cfdist['problems'].max()
    'in'
    >>> cfdist['in'].max()
    'the'
  • What are we predicting here?

46
Example (continued)
  • We predict the most likely word for any context
  • Generation application:
  • >>> word = 'prediction'
    >>> for i in range(15):
    ...     print word,
    ...     word = cfdist[word].max()
    prediction problems in the frequency
    distribution of the frequency distribution of
    the frequency distribution of

47
For Next Time
  • HW3
  • To run NLTK from unixs.cis.pitt.edu, you should
    add /afs/cs.pitt.edu/projects/nltk/bin to your
    search path
  • Regular Expressions (JM handout, NLTK tutorial)