CSA2050: Introduction to Computational Linguistics

Description:

Python supports modules and packages, which encourages program ... Named after Monty Python. Open Source and free. Download from www.python.org. April 2005 ...

Slides: 19
Provided by: MikeR2

1
CSA2050 Introduction to Computational Linguistics
  • NLTK

2
NLTK
  • A software package for manipulating linguistic
    data and performing NLP tasks
  • Advanced tasks are possible from an early stage
  • Permits projects at various levels
  • Consistent interfaces
  • Facilitates reusability of modules
  • Implemented in Python

3
Chart Parsing with NLTK

4
Why Python
  • Popular languages for NLP courses
      • Prolog (clean, learning curve, slow)
      • Perl (quick, syntax)
  • Why Python is better suited
      • Easy to learn, clean syntax
      • Interpreted, supporting rapid prototyping
      • Object oriented
      • Powerful

5
NLTK Structure
  • NLTK is implemented as a set of minimally
    interdependent modules.
  • Core modules
  • Basic data types
  • Task Modules
  • Tokenising
  • Parsing
  • Other NLP tasks

6
Token Class
  • The Token class is used to encode information
    about natural-language texts.
  • Each token instance represents a unit of text
    such as a word, a text, or a document.
  • A given instance is defined by a partial mapping
    from property names to property values.
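
A partial mapping from property names to values can be pictured as a dictionary that simply lacks the keys it has no value for. The sketch below is a hypothetical stand-in written in present-day Python 3, not the nltk.token implementation; the display format mimics the interpreter output shown on the following slides.

```python
class Token(dict):
    """Hypothetical stand-in: a partial mapping of property names to values."""

    def __init__(self, **properties):
        super().__init__(properties)

    def __repr__(self):
        # Show the text, followed by /TAG when a tag is present.
        text = self.get("TEXT", "")
        if "TAG" in self:
            return "<%s/%s>" % (text, self["TAG"])
        return "<%s>" % text


word = Token(TEXT="python", TAG="NN")
sentence = Token(TEXT="Hello World!")

print(word)               # <python/NN>
print("TAG" in sentence)  # False: this token has no TAG property
```

Because the mapping is partial, code that reads a property should check for its presence first or use get with a default.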

7
The TEXT Property
  • The TEXT property is used to encode a token's
    text content.
    >>> from nltk.token import *
    >>> Token(TEXT="Hello World!")
    <Hello World!>

8
TAG
  • The TAG property is used to encode a token's
    part-of-speech tag.
    >>> Token(TEXT="python", TAG="NN")
    <python/NN>

9
SUBTOKENS
  • The SUBTOKENS property is used to store a
    tokenized text.
    >>> from nltk.tokenizer import *
    >>> tok = Token(TEXT="Hello World!")
    >>> WhitespaceTokenizer().tokenize(tok)
    >>> print tok['SUBTOKENS']
    [<Hello>, <World!>]
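
The same idea in plain Python 3, as a minimal sketch that uses dictionaries in place of Token objects. The function name tokenize_whitespace is hypothetical; the point is only that the tokenizer writes its result into the token under SUBTOKENS and leaves TEXT in place.

```python
def tokenize_whitespace(token):
    """Split token["TEXT"] on whitespace and store the pieces as SUBTOKENS."""
    token["SUBTOKENS"] = [{"TEXT": piece} for piece in token["TEXT"].split()]
    return token


tok = {"TEXT": "Hello World!"}
tokenize_whitespace(tok)

texts = [sub["TEXT"] for sub in tok["SUBTOKENS"]]
print(texts)  # ['Hello', 'World!']
```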

10
Augmenting the Token with Information
  • Language processing tasks are formulated as
    annotations and transformations of tokens that
    add properties to the Token data structure, e.g.
  • word-sense disambiguation
  • chunking
  • parsing

11
Blackboard Architecture
  • Typically these modifications are monotonic:
    they add information but do not delete it.
  • Tokens serve as a blackboard where information
    about a piece of text is collated.
  • This architecture contrasts with the more typical
    pipeline architecture where each stage
    destructively modifies the input information.
  • This approach was chosen because it gives greater
    flexibility when combining tasks into a single
    system.
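
The contrast between the two architectures can be sketched in plain Python 3. All names and the three-word tag lexicon below are invented for illustration. In the blackboard style each stage adds a property to the same token; in the pipeline style each stage returns a new value and the earlier one is no longer available downstream.

```python
TOY_TAGS = {"the": "DT", "dog": "NN", "barks": "VBZ"}  # toy lexicon


# Blackboard style: each stage annotates the shared token in place.
def tokenize(token):
    token["SUBTOKENS"] = [{"TEXT": w} for w in token["TEXT"].split()]


def tag(token):
    for sub in token["SUBTOKENS"]:
        sub["TAG"] = TOY_TAGS.get(sub["TEXT"].lower(), "UNK")


# Pipeline style: each stage consumes its input and returns something new.
def pipeline_tokenize(text):
    return text.split()


def pipeline_tag(words):
    return [TOY_TAGS.get(w.lower(), "UNK") for w in words]


board = {"TEXT": "The dog barks"}
tokenize(board)
tag(board)
print(sorted(board))  # ['SUBTOKENS', 'TEXT']: the original text is kept

tags_only = pipeline_tag(pipeline_tokenize("The dog barks"))
print(tags_only)      # ['DT', 'NN', 'VBZ']: the words are gone
```

A later stage in the blackboard version can still consult the text, the words, and the tags together, which is the flexibility the slide refers to.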

12
Other Core Modules
  • The probability module defines classes for
    probability distributions and statistical
    smoothing techniques.
  • The cfg module defines classes for encoding
    context-free grammars (normal and probabilistic).
  • The corpus module defines classes for reading and
    processing different corpora.
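
As an illustration of what a smoothing technique does, here is a minimal add-one (Laplace) estimate in plain Python 3. The class name LaplaceDist and its interface are hypothetical and are not taken from the probability module; the bins parameter is the assumed number of possible outcomes.

```python
from collections import Counter


class LaplaceDist:
    """Add-one smoothed distribution over a fixed number of outcomes."""

    def __init__(self, samples, bins):
        self.counts = Counter(samples)
        self.total = sum(self.counts.values())
        self.bins = bins  # assumed number of possible outcomes

    def prob(self, sample):
        # Every outcome gets one extra count, so unseen ones are non-zero.
        return (self.counts[sample] + 1) / (self.total + self.bins)


dist = LaplaceDist(["NN", "NN", "VB", "DT"], bins=4)
print(dist.prob("NN"))  # 0.375, i.e. (2 + 1) / (4 + 4)
print(dist.prob("JJ"))  # 0.125, although "JJ" was never observed
```

Without smoothing, the unseen tag would get probability zero, which is usually undesirable when the estimate feeds a later statistical model.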

13
Using Brown Corpus
    >>> from nltk.corpus import brown
    >>> brown.groups()
    ['skill and hobbies', 'popular lore',
     'humor', 'fiction: mystery', ...]
    >>> brown.items('humor')
    ('cr01', 'cr02', 'cr03', 'cr04', 'cr05',
     'cr06', 'cr07', 'cr08', 'cr09')
    >>> brown.tokenize('cr01')
    <[<It/pps>, <was/bedz>, <among/in>,
    <these/dts>, <that/cs>, <Hinkle/np>,
    <identified/vbd>, <a/at>, ...]>
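
The tokens in the output above pair each word with its tag as word/tag. As a rough illustration only, and assuming tagged text arrives as whitespace-separated word/tag items (an assumption about the input format, not a description of the brown module itself), such a line can be turned into TEXT/TAG mappings with plain Python 3. The function name parse_tagged is hypothetical.

```python
def parse_tagged(line):
    """Split a line of word/tag items into a list of TEXT/TAG mappings.

    Hypothetical helper; assumes items are separated by whitespace and
    that the tag follows the last "/" in each item.
    """
    tokens = []
    for item in line.split():
        word, sep, tag = item.rpartition("/")
        if not sep:
            # No slash at all: keep the word and leave the tag unknown.
            word, tag = item, None
        tokens.append({"TEXT": word, "TAG": tag})
    return tokens


# Sample items copied from the slide's output.
tokens = parse_tagged("It/pps was/bedz among/in these/dts")
print(tokens[0])  # {'TEXT': 'It', 'TAG': 'pps'}
```

Splitting on the last slash keeps words that themselves contain a slash, such as 1/2, intact.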

14
Penn Treebank
    >>> from nltk.corpus import treebank
    >>> treebank.groups()
    ('raw', 'tagged', 'parsed', 'merged')
    >>> treebank.items('parsed')
    ['wsj_0001.prd', 'wsj_0002.prd', ...]
    >>> item = 'parsed/wsj_0001.prd'
    >>> sentences = treebank.tokenize(item)
    >>> for sent in sentences['SUBTOKENS']:
    ...     print sent.pp()  # pretty-print
    (S
      (NP-SBJ
        (NP <Pierre> <Vinken>)
        (ADJP
          (NP <61> <years>)
          <old>
        ) ...
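
The bracketed output above is a nested structure. The sketch below, in plain Python 3, shows one way a bracketed string could be read into nested (label, children) tuples. It is not the treebank module's reader: parse_tree and leaves are hypothetical names, and the sample string uses only the fragment visible on the slide, with the brackets closed by hand.

```python
import re


def parse_tree(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1]:
        words.extend(leaves(child))
    return words


tree = parse_tree("(S (NP-SBJ (NP Pierre Vinken) (ADJP (NP 61 years) old)))")
print(leaves(tree))  # ['Pierre', 'Vinken', '61', 'years', 'old']
```

The sketch does no error recovery: an unbalanced string raises an exception instead of returning a partial tree.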

15
Processing Modules
  • Each language processing algorithm is implemented
    as a class.
  • For example, the ChartParser and
    RecursiveDescentParser classes each define a
    single algorithm for parsing a text.
  • Each processing module defines an interface.
  • Interface classes are named with a trailing
    capital I, e.g. ParserI.
  • Such interface classes define one or more action
    methods that perform the task the module is
    supposed to perform.
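
The interface idea can be sketched in present-day Python 3 with an abstract base class. Everything below is an assumption made for illustration: the method signatures, the dictionary grammar format, and the toy grammar are hypothetical, and the class is a small top-down parser written for this sketch, not NLTK's own RecursiveDescentParser. The method names parse and parse_n are taken from the slides.

```python
from abc import ABC, abstractmethod


class ParserI(ABC):
    """Hypothetical parser interface with two action methods."""

    @abstractmethod
    def parse_n(self, words, n=None):
        """Return up to n parse trees for words (all of them if n is None)."""

    def parse(self, words):
        """Return the first parse tree found, or None if there is none."""
        trees = self.parse_n(words, n=1)
        return trees[0] if trees else None


class RecursiveDescentParser(ParserI):
    """Toy top-down parser; the grammar must not be left-recursive."""

    def __init__(self, grammar, start):
        self.grammar = grammar  # nonterminal -> list of right-hand sides
        self.start = start

    def _expand(self, symbol, words, pos):
        # Yield (tree, next_pos) for each way symbol can start at pos.
        if symbol not in self.grammar:  # terminal symbol
            if pos < len(words) and words[pos] == symbol:
                yield symbol, pos + 1
            return
        for rhs in self.grammar[symbol]:
            for children, end in self._expand_seq(rhs, words, pos):
                yield (symbol, children), end

    def _expand_seq(self, symbols, words, pos):
        # Yield (list_of_trees, next_pos) for a sequence of symbols.
        if not symbols:
            yield [], pos
            return
        for tree, mid in self._expand(symbols[0], words, pos):
            for rest, end in self._expand_seq(symbols[1:], words, mid):
                yield [tree] + rest, end

    def parse_n(self, words, n=None):
        trees = []
        for tree, end in self._expand(self.start, words, 0):
            if end == len(words):  # keep only parses covering all words
                trees.append(tree)
                if n is not None and len(trees) >= n:
                    break
        return trees


GRAMMAR = {  # toy grammar, invented for this example
    "S": [("NP", "VP")],
    "NP": [("the", "N")],
    "VP": [("V", "NP"), ("V",)],
    "N": [("dog",), ("cat",)],
    "V": [("sees",), ("barks",)],
}

parser = RecursiveDescentParser(GRAMMAR, "S")
tree = parser.parse("the dog sees the cat".split())
print(tree)
```

Because callers depend only on ParserI, a different algorithm, such as a chart parser, could be swapped in without changing the calling code.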

16
parse method, parse_n method

17
What is Python
  • Python is an interpreted, object-oriented
    programming language with dynamic semantics.
  • Attractive for Rapid Application Development
  • Easy-to-learn syntax emphasizes readability and
    therefore reduces the cost of program
    maintenance.
  • Python supports modules and packages, which
    encourages program modularity and code reuse.
  • Developed by Guido van Rossum in the early 1990s
  • Named after Monty Python
  • Open Source and free.
  • Download from www.python.org

18
Why Python
  • Prolog
  • clean, learning curve, slow
  • Lisp
  • old, syntax, big
  • Perl
  • quick,
  • C