PPT – Text statistics 7 Day 30 - 11/05/14 PowerPoint presentation

About This Presentation

Description:

Title: LING 681 Intro to Comp Ling
Subject: computational linguistics, natural language processing
Author: Harry Howard
Last modified by: Harry Howard

Slides: 22

Provided by: Harry235

Learn more at: http://www.tulane.edu

Transcript and Presenter's Notes

1
Text statistics 7 Day 30 - 11/05/14

LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University

2
Course organization

http://www.tulane.edu/howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control

3
Final project
4
Open Spyder
5
Review
6
ConditionalFreqDist

>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c,w)
...     for c in cat
...     for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
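
As a rough illustration of what ConditionalFreqDist builds from (condition, word) pairs like those above, here is a minimal sketch using only the standard library. The helper name build_cfd and the toy pairs are assumptions made for illustration; they are not part of NLTK and the counts are not real Brown corpus counts.

```python
from collections import Counter, defaultdict

def build_cfd(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

# Toy (category, word) pairs standing in for brown.words(); invented data.
toy_pairs = [
    ("news", "will"), ("news", "will"), ("news", "can"),
    ("romance", "could"), ("romance", "could"), ("romance", "will"),
]

toy_cfd = build_cfd(toy_pairs)
print(sorted(toy_cfd))            # the conditions that were seen
print(toy_cfd["news"]["will"])    # how often 'will' occurs under 'news'
print(toy_cfd["romance"]["can"])  # a sample never seen under a condition counts as 0
```

Each condition gets its own frequency table, so a lookup takes two keys: first the condition, then the sample.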

7
Conditional frequency distribution
8
A more interesting example
                   can could   may might  must  will
news                93    86    66    38    50   389
religion            82    59    78    12    54    71
hobbies            268    58   131    22    83   264
sci fi              16    49     4    12     8    16
romance             74   193    11    51    45    43
humor               16    30     8     8     9    13
9
Conditions = categories, samples = modal verbs

>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies',
...     'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might',
...     'must', 'will']
>>> catWord = [(c,w)
...     for c in cat
...     for w in brown.words(categories=c)
...     if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()

10
cfd.tabulate()

                   can could   may might  must  will
news                93    86    66    38    50   389
religion            82    59    78    12    54    71
hobbies            268    58   131    22    83   264
science_fiction     16    49     4    12     8    16
romance             74   193    11    51    45    43
humor               16    30     8     8     9    13
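
One way to read the table is to ask which modal verb dominates each category. The sketch below copies the counts from the table into a plain dictionary and picks the largest value per row; the names counts, most_frequent and top are illustrative assumptions, not NLTK API.

```python
modals = ["can", "could", "may", "might", "must", "will"]

# Counts copied from the tabulated output above, one row per category.
counts = {
    "news":            [93, 86, 66, 38, 50, 389],
    "religion":        [82, 59, 78, 12, 54, 71],
    "hobbies":         [268, 58, 131, 22, 83, 264],
    "science_fiction": [16, 49, 4, 12, 8, 16],
    "romance":         [74, 193, 11, 51, 45, 43],
    "humor":           [16, 30, 8, 8, 9, 13],
}

def most_frequent(row):
    """Return the modal verb with the highest count in one table row."""
    return max(zip(modals, row), key=lambda pair: pair[1])[0]

top = {category: most_frequent(row) for category, row in counts.items()}
for category, modal in top.items():
    print(f"{category:<16}{modal}")
```

With an actual ConditionalFreqDist, the same question for a single condition can presumably be answered with cfd['news'].max(), since each condition holds a FreqDist.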

11
cfd.plot()
12
Another example

The task is to find the frequency of 'America'
and 'citizen' in NLTK's corpus of presidential
inaugural addresses.
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt',
'1797-Adams.txt', ..., '2009-Obama.txt']
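
Because every file ID begins with a four-digit year, one possible way to set up this task is to use the target word as the condition and the year as the sample. The sketch below shows that pairing with the standard library only. It assumes case-insensitive prefix matching, so 'Americans' counts toward 'america'; the word lists in toy_corpus are invented placeholders, not the text of the real addresses, and the names toy_corpus, targets and toy_cfd2 are hypothetical.

```python
from collections import Counter, defaultdict

# Invented placeholder words keyed by real file IDs; NOT the actual speeches.
toy_corpus = {
    "1789-Washington.txt": ["Fellow", "Citizens", "of", "the", "Senate"],
    "2009-Obama.txt": ["My", "fellow", "citizens", "America", "Americans"],
}

targets = ["america", "citizen"]

# condition = target word, sample = year taken from the file ID
toy_cfd2 = defaultdict(Counter)
for fileid, words in toy_corpus.items():
    year = fileid[:4]  # the first four characters of the file ID are the year
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                toy_cfd2[target][year] += 1

for target in targets:
    print(target, sorted(toy_cfd2[target].items()))
```

Lower-casing before matching means 'Citizens' and 'citizens' fall under the same target, which is one design choice among several; exact matching would give smaller counts.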

13
cfd2.plot()
14
First try