1
Python for NLP and the Natural Language Toolkit
  • CS1573 AI Application Development, Spring 2003
  • (modified from Edward Loper's notes)

2
Outline
  • Review: Introduction to NLP (knowledge of
    language, ambiguity, representations and
    algorithms, applications)
  • HW 2 discussion
  • Tutorials: Basics, Probability

3
Python and Natural Language Processing
  • Python is a great language for NLP
  • Simple
  • Easy to debug
    • Exceptions
    • Interpreted language
  • Easy to structure
    • Modules
    • Object oriented programming
  • Powerful string manipulation

4
Modules and Packages
  • Python modules package program code and data for
    reuse. (Lutz)
  • Similar to a library in C or a package in Java.
  • Python packages are hierarchical modules (i.e.,
    modules that contain other modules).
  • Three commands for accessing modules:
  • import
  • from...import
  • reload

5
Modules and Packages: import
  • The import command loads a module
  • Load the regular expression module:
  • >>> import re
  • To access the contents of a module, use dotted
    names
  • Use the search method from the re module:
  • >>> re.search('\w', str)
  • To list the contents of a module, use dir:
  • >>> dir(re)
    ['DOTALL', 'I', 'IGNORECASE', ...]

6
Modules and Packages: from...import
  • The from...import command loads individual
    functions and objects from a module
  • Load the search function from the re module:
  • >>> from re import search
  • Once an individual function or object is loaded
    with from...import, it can be used directly
  • Use the search method from the re module:
  • >>> search('\w', str)

7
import vs. from...import
  • import
  • Keeps module functions separate from user
    functions.
  • Requires the use of dotted names.
  • Works with reload.
  • from...import
  • Puts module functions and user functions
    together.
  • More convenient names.
  • Does not work with reload.

8
Modules and Packages: reload
  • If you edit a module, you must use the reload
    command before the changes become visible in
    Python:
  • >>> import mymodule
  • ...
  • >>> reload(mymodule)
  • The reload command only affects modules that have
    been loaded with import; it does not update
    individual functions and objects loaded with
    from...import.
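  • A minimal sketch of the difference (assuming a
    file mymodule.py that defines a function greet()):

    >>> import mymodule
    >>> from mymodule import greet
    >>> # ... edit greet() in mymodule.py on disk ...
    >>> reload(mymodule)
    >>> mymodule.greet()   # runs the new code
    >>> greet()            # still bound to the old function object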

9
Introduction to NLTK
  • The Natural Language Toolkit (NLTK) provides:
  • Basic classes for representing data relevant to
    natural language processing.
  • Standard interfaces for performing tasks, such as
    tokenization, tagging, and parsing.
  • Standard implementations of each task, which can
    be combined to solve complex problems.

10
NLTK Example Modules
  • nltk.token: processing individual elements of
    text, such as words or sentences.
  • nltk.probability: modeling frequency
    distributions and probabilistic systems.
  • nltk.tagger: tagging tokens with supplemental
    information, such as parts of speech or WordNet
    sense tags.
  • nltk.parser: a high-level interface for parsing
    texts.
  • nltk.chartparser: a chart-based implementation of
    the parser interface.
  • nltk.chunkparser: a regular-expression-based
    surface parser.

11
NLTK Top-Level Organization
  • NLTK is organized as a flat hierarchy of packages
    and modules.
  • Each module provides the tools necessary to
    address a specific task.
  • Modules contain two types of classes:
  • Data-oriented classes are used to represent
    information relevant to natural language
    processing.
  • Task-oriented classes encapsulate the resources
    and methods needed to perform a specific task.

12
To the First Tutorials
  • Tokens and Tokenization
  • Frequency Distributions

13
The Token Module
  • It is often useful to think of a text in terms of
    smaller elements, such as words or sentences.
  • The nltk.token module defines classes for
    representing and processing these smaller
    elements.
  • What might be other useful smaller elements?

14
Tokens and Types
  • The term word can be used in two different ways:
  • To refer to an individual occurrence of a word
  • To refer to an abstract vocabulary item
  • For example, the sentence "my dog likes his dog"
    contains five occurrences of words, but four
    vocabulary items.
  • To avoid confusion, use more precise terminology:
  • Word token: an occurrence of a word
  • Word type: a vocabulary item
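  • A quick plain-Python illustration of the
    distinction (a sketch; whitespace splitting is
    assumed for simplicity):

    >>> sentence = 'my dog likes his dog'
    >>> tokens = sentence.split()    # word tokens: occurrences
    >>> types = set(tokens)          # word types: vocabulary items
    >>> len(tokens), len(types)
    (5, 4)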

15
Tokens and Types (continued)
  • In NLTK, tokens are constructed from their types
    using the Token constructor:

    >>> from nltk.token import *
    >>> my_word_type = 'dog'
    >>> my_word_type
    'dog'
    >>> my_word_token = Token(my_word_type)
    >>> my_word_token
    'dog'@[?]

  • Token member functions include type and loc

16
Text Locations
  • A text location @[s:e] specifies a region of a
    text:
  • s is the start index
  • e is the end index
  • The text location @[s:e] specifies the text
    beginning at s, and including everything up to
    (but not including) the text at e.
  • This definition is consistent with Python slices.
  • Think of indices as appearing between elements:

    | I | saw | a | man |
    0   1     2   3     4

  • Shorthand notation when location width = 1.
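  • The same convention appears in ordinary Python
    slices (a plain-Python sketch; the word list
    stands in for a tokenized text):

    >>> words = ['I', 'saw', 'a', 'man']
    >>> words[1:3]    # from index 1 up to (but not including) 3
    ['saw', 'a']
    >>> words[2:3]    # a width-1 region
    ['a']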

17
Text Locations (continued)
  • Indices can be based on different units
  • character
  • word
  • sentence
  • Locations can be tagged with sources (files,
    other text locations; e.g., the first word of
    the first sentence in the file)
  • Location member functions:
  • start
  • end
  • unit
  • source

18
Tokenization
  • The simplest way to represent a text is with a
    single string.
  • Difficult to process text in this format.
  • Often, it is more convenient to work with a list
    of tokens.
  • The task of converting a text from a single
    string to a list of tokens is known as
    tokenization.

19
Tokenization (continued)
  • Tokenization is harder than it seems:
  • "I'll see you in New York."
  • "The aluminum-export ban."
  • The simplest approach is to use graphic words
    (i.e., separate words using whitespace)
  • Another approach is to use regular expressions to
    specify which substrings are valid words (see the
    sketch below)
  • NLTK provides a generic tokenization interface:
    TokenizerI
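  • A sketch of both approaches in plain Python (the
    regular expression is illustrative, not NLTK's
    actual pattern):

    >>> import re
    >>> text = "I'll see you in New York."
    >>> text.split()    # graphic words: split on whitespace
    ["I'll", 'see', 'you', 'in', 'New', 'York.']
    >>> re.findall(r"\w+(?:'\w+)?|\S", text)    # regex-defined words
    ["I'll", 'see', 'you', 'in', 'New', 'York', '.']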

20
TokenizerI
  • Defines a single method, tokenize, which takes a
    string and returns a list of tokens.
  • tokenize is independent of the level of
    tokenization and of the implementation algorithm.

21
Example
  • from nltk.token import WSTokenizer
    from nltk.draw.plot import Plot

    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)

    # Count up how many times each word length occurs
    wordlen_count_list = []
    for token in tokens:
        wordlen = len(token.type())
        # Add zeros until wordlen_count_list is long enough
        while wordlen >= len(wordlen_count_list):
            wordlen_count_list.append(0)
        # Increment the count for this word length
        wordlen_count_list[wordlen] += 1
    Plot(wordlen_count_list)

22
Next Tutorial Probability
  • An experiment is any process which leads to a
    well-defined outcome
  • A sample is any possible outcome of a given
    experiment
  • Rolling a die?
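  • For example, rolling a die is an experiment, and
    each of the outcomes 1-6 is a possible sample. A
    quick simulation sketch in plain Python:

    >>> import random
    >>> samples = [random.randint(1, 6) for i in range(1000)]
    >>> samples[:5]    # first five outcomes (your numbers will differ)
    [3, 1, 6, 6, 2]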

23
Outline
  • Review: Basics
  • Probability
  • Experiments and Samples
  • Frequency Distributions
  • Conditional Frequency Distributions

24
Review: NLTK Goals
  • Classes for NLP data
  • Interfaces for NLP tasks
  • Implementations, easily combined (what is an
    example?)

25
Accessing NLTK
  • What is the relation to Python?

26
Words
  • Types and Tokens
  • Text Locations
  • Member Functions

27
Tokenization
  • TokenizerI
  • Implementations
  • >>> tokenizer = WSTokenizer()
  • >>> tokenizer.tokenize(text_str)
    ['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w], 'a'@[4w],
     'test'@[5w], 'file.'@[6w]]

28
Word Length Freq. Distribution Example
  • from nltk.token import WSTokenizer
    from nltk.probability import SimpleFreqDist

    # Extract a list of words from the corpus
    corpus = open('corpus.txt').read()
    tokens = WSTokenizer().tokenize(corpus)

    # Construct a frequency distribution of word lengths
    wordlen_freqs = SimpleFreqDist()
    for token in tokens:
        wordlen_freqs.inc(len(token.type()))

    # Extract the set of word lengths found in the corpus
    wordlens = wordlen_freqs.samples()

29
Frequency Distributions
  • A frequency distribution records the number of
    times each outcome of an experiment has occurred
  • >>> freq_dist = FreqDist()
    >>> for token in document:
    ...     freq_dist.inc(token.type())
  • Constructor, then initialization by storing
    experimental outcomes
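  • Conceptually, a frequency distribution is just a
    counting dictionary; a rough plain-Python
    analogue of the inc() loop above:

    freq_dist = {}
    for word in ['my', 'dog', 'likes', 'his', 'dog']:
        # get(word, 0) supplies a zero count for unseen samples
        freq_dist[word] = freq_dist.get(word, 0) + 1
    # freq_dist == {'my': 1, 'dog': 2, 'likes': 1, 'his': 1}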

30
Methods
  • The freq method returns the frequency of a given
    sample.
  • We can find the number of times a given sample
    occurred with the count method.
  • We can find the total number of sample outcomes
    recorded by a frequency distribution with the N
    method.
  • The samples method returns a list of all samples
    that have been recorded as outcomes by a
    frequency distribution.
  • We can find the sample with the greatest number
    of outcomes with the max method.

31
Examples of Methods
  • >>> freq_dist.count('the')
    6
  • >>> freq_dist.freq('the')
    0.012
  • >>> freq_dist.N()
    500
  • >>> freq_dist.max()
    'the'

32
Simple Word Length Example
  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))
  • What is the "outcome" for our experiment?

33
Simple Word Length Example
  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import FreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    # What is the distribution of word lengths in a corpus?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     freq_dist.inc(len(token.type()))
  • This length is the "outcome" for our experiment,
    so we use inc() to increment its count in a
    frequency distribution.

34
Complex Word Length Example
  • # Define vowels as "a", "e", "i", "o", and "u"
    >>> VOWELS = ('a', 'e', 'i', 'o', 'u')
    # What is the distribution of word lengths for
    # words ending in vowels?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if token.type()[-1].lower() in VOWELS:
    ...         freq_dist.inc(len(token.type()))
  • What is the condition?

35
More Complex Example
  • # What is the distribution of word lengths for
    # words following words that end in vowels?
    >>> ended_in_vowel = 0    # Did the last word end in a vowel?
    >>> freq_dist = FreqDist()
    >>> for token in tokens:
    ...     if ended_in_vowel:
    ...         freq_dist.inc(len(token.type()))
    ...     ended_in_vowel = token.type()[-1].lower() in VOWELS

36
Conditional Frequency Distributions
  • A condition specifies the context in which an
    experiment is performed.
  • A conditional frequency distribution is a
    collection of frequency distributions for the
    same experiment, run under different conditions.
  • The individual frequency distributions are
    indexed by the condition.
  • NLTK's ConditionalFreqDist class:
  • >>> cfdist = ConditionalFreqDist()
    >>> cfdist
    <ConditionalFreqDist with 0 conditions>

37
Conditional Frequency Distributions (continued)
  • To access the frequency distribution for a
    condition, use the indexing operator:
  • >>> cfdist['a']
    <FreqDist with 0 outcomes>
  • Record lengths of some words starting with 'a':
  • >>> for word in 'apple and arm'.split():
    ...     cfdist['a'].inc(len(word))
  • How many are 3 characters long?
  • >>> cfdist['a'].freq(3)
    0.66667
  • To list the accessed conditions, use the
    conditions method:
  • >>> cfdist.conditions()
    ['a']

38
Example: Conditioning on a Word's Initial Letter
  • >>> from nltk.token import WSTokenizer
    >>> from nltk.probability import ConditionalFreqDist
    >>> from nltk.draw.plot import Plot
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()

39
Example (continued)
  • # How does the initial letter affect word length?
    >>> for token in tokens:
    ...     outcome = len(token.type())
    ...     condition = token.type()[0].lower()
    ...     cfdist[condition].inc(outcome)
  • What are the condition and the outcome?

40
Example (continued)
  • # How does the initial letter affect word length?
    >>> for token in tokens:
    ...     outcome = len(token.type())
    ...     condition = token.type()[0].lower()
    ...     cfdist[condition].inc(outcome)
  • What are the condition and the outcome?
  • Condition: the initial letter of the token
  • Outcome: its word length

41
Prediction
  • Prediction is the problem of deciding a likely
    outcome for a given run of an experiment.
  • To predict the outcome, we first examine a
    training corpus.
  • Training corpus:
  • The context and outcome for each run are known
  • Given a new run, we choose the outcome that
    occurred most frequently for the context
  • A conditional frequency distribution finds the
    most frequent occurrence

42
Prediction Example Outline
  • Record each outcome in the training corpus, using
    the context that the experiment was run under as
    the condition
  • Access the frequency distribution for a given
    context with the indexing operator
  • Use the max() method to find the most likely
    outcome (see the plain-Python sketch below)
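  • A minimal plain-Python sketch of this recipe (a
    dictionary of counting dictionaries stands in for
    ConditionalFreqDist):

    def train(tokens):
        # Map each context (previous word) to outcome counts
        cfdist = {}
        context = None
        for word in tokens:
            counts = cfdist.setdefault(context, {})
            counts[word] = counts.get(word, 0) + 1
            context = word
        return cfdist

    def predict(cfdist, context):
        # Choose the outcome seen most often in this context
        counts = cfdist.get(context, {})
        if not counts:
            return None
        return max(counts, key=counts.get)

  • For example, predict(train(words), 'in') returns
    the word that most often followed 'in' in the
    training list words.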

43
Example Predicting Words
  • Predict a word's type, based on the preceding
    word's type
  • >>> from nltk.token import WSTokenizer
  • >>> from nltk.probability import ConditionalFreqDist
    >>> corpus = open('corpus.txt').read()
    >>> tokens = WSTokenizer().tokenize(corpus)
    >>> cfdist = ConditionalFreqDist()    # empty

44
Example (continued)
  • >>> context = None    # The type of the preceding word
    >>> for token in tokens:
    ...     outcome = token.type()
    ...     cfdist[context].inc(outcome)
    ...     context = token.type()

45
Example (continued)
  • >>> cfdist['prediction'].max()
    'problems'
    >>> cfdist['problems'].max()
    'in'
    >>> cfdist['in'].max()
    'the'
  • What are we predicting here?

46
Example (continued)
  • We predict the most likely word for any context
  • Generation application:
  • >>> word = 'prediction'
    >>> for i in range(15):
    ...     print word,
    ...     word = cfdist[word].max()
    prediction problems in the frequency
    distribution of the frequency distribution of
    the frequency distribution of

47
For Next Time
  • HW3
  • To run NLTK from unixs.cis.pitt.edu, you should
    add /afs/cs.pitt.edu/projects/nltk/bin to your
    search path
  • Regular Expressions (JM handout, NLTK tutorial)