Text statistics 7 Day 30 - 11/05/14

Title: LING 681 Intro to Comp Ling. Subject: computational linguistics, natural language processing. Author: Harry Howard. Last modified by: Harry Howard.

1
Text statistics 7 Day 30 - 11/05/14
  • LING 3820 & 6820
  • Natural Language Processing
  • Harry Howard
  • Tulane University

2
Course organization
  • http://www.tulane.edu/howard/LING3820/
  • The syllabus is under construction.
  • http://www.tulane.edu/howard/CompCultEN/
  • Chapter numbering
  • 3.7. How to deal with non-English characters
  • 4.5. How to create a pattern with Unicode
    characters
  • 6. Control

3
Final project
4
Open Spyder
5
Review
6
ConditionalFreqDist
  >>> from nltk.corpus import brown
  >>> from nltk.probability import ConditionalFreqDist
  >>> cat = ['news', 'romance']
  >>> # one (condition, sample) pair per word: condition = category, sample = word
  >>> catWord = [(c, w)
  ...            for c in cat
  ...            for w in brown.words(categories=c)]
  >>> cfd = ConditionalFreqDist(catWord)

7
Conditional frequency distribution
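
To make the idea concrete, here is a minimal sketch of what a conditional frequency distribution stores, written with the standard library only. It is not NLTK's implementation; the helper name `conditional_freq` and the toy pairs are hypothetical and are not taken from the Brown corpus.

```python
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Keep a separate frequency count of samples for each condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


# Hypothetical (category, word) pairs, made up for illustration.
pairs = [
    ("news", "will"), ("news", "can"), ("news", "will"),
    ("romance", "could"), ("romance", "will"), ("romance", "could"),
]

cfd = conditional_freq(pairs)
print(cfd["news"]["will"])            # count of 'will' under the 'news' condition
print(cfd["romance"].most_common(1))  # most frequent sample under 'romance'
```

Each condition gets its own counter, so the same word can have different counts under different categories.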
8
A more interesting example
                    can  could    may  might   must   will
news                 93     86     66     38     50    389
religion             82     59     78     12     54     71
hobbies             268     58    131     22     83    264
sci fi               16     49      4     12      8     16
romance              74    193     11     51     45     43
humor                16     30      8      8      9     13
9
Conditions = categories, samples = modal verbs
  >>> from nltk.corpus import brown
  >>> from nltk.probability import ConditionalFreqDist
  >>> cat = ['news', 'religion', 'hobbies',
  ...        'science_fiction', 'romance', 'humor']
  >>> mod = ['can', 'could', 'may', 'might',
  ...        'must', 'will']
  >>> # keep only the words that are modal verbs
  >>> catWord = [(c, w)
  ...            for c in cat
  ...            for w in brown.words(categories=c)
  ...            if w in mod]
  >>> cfd = ConditionalFreqDist(catWord)
  >>> cfd.tabulate()
  >>> cfd.plot()

10
cfd.tabulate()
                      can  could    may  might   must   will
  news                 93     86     66     38     50    389
  religion             82     59     78     12     54     71
  hobbies             268     58    131     22     83    264
  science_fiction      16     49      4     12      8     16
  romance              74    193     11     51     45     43
  humor                16     30      8      8      9     13
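
The counts in this table can be read more easily once they are normalized per category. The sketch below copies the numbers from the table into a plain dictionary and computes, for each category, the most frequent modal and its share of that category's modal tokens. The helper names `most_frequent` and `share` are hypothetical, not part of NLTK.

```python
modals = ["can", "could", "may", "might", "must", "will"]

# Counts copied from the tabulated output above.
counts = {
    "news":            [93, 86, 66, 38, 50, 389],
    "religion":        [82, 59, 78, 12, 54, 71],
    "hobbies":         [268, 58, 131, 22, 83, 264],
    "science_fiction": [16, 49, 4, 12, 8, 16],
    "romance":         [74, 193, 11, 51, 45, 43],
    "humor":           [16, 30, 8, 8, 9, 13],
}


def most_frequent(category):
    """Return the modal with the highest count in one category."""
    row = counts[category]
    return modals[row.index(max(row))]


def share(category, modal):
    """Return the fraction of a category's modal tokens taken by one modal."""
    row = counts[category]
    return row[modals.index(modal)] / sum(row)


for category in counts:
    top = most_frequent(category)
    print(f"{category:<16} {top:<6} {share(category, top):.2f}")
```

Raw counts depend on how much text each category contains, so comparing shares is usually a fairer way to contrast categories.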

11
cfd.plot()
12
Another example
  • The task is to find the frequency of 'America'
    and 'citizen' in NLTK's corpus of presidential
    inaugural addresses.
  • >>> from nltk.corpus import inaugural
  • >>> inaugural.fileids()
  • ['1789-Washington.txt', '1793-Washington.txt',
    '1797-Adams.txt', ..., '2009-Obama.txt']
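
Because every file id begins with a four-digit year, the year can be recovered by slicing the string. A small sketch using only the file ids listed above; the variable names are illustrative.

```python
# File ids as listed above (the elided middle of the list is left out).
fileids = ['1789-Washington.txt', '1793-Washington.txt',
           '1797-Adams.txt', '2009-Obama.txt']

# The first four characters are the year of the address.
years = [fileid[:4] for fileid in fileids]

# Characters between the hyphen and the '.txt' extension are the speaker.
speakers = [fileid[5:-4] for fileid in fileids]

print(years)
print(speakers)
```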

13
cfd2.plot()
14
First try
  from nltk.corpus import inaugural
  from nltk.probability import ConditionalFreqDist
  keys = ['america', 'citizen']
  # condition = the word as it appears, sample = year (first 4 characters of the file id)
  keyYear = [(w, title[:4])
             for title in inaugural.fileids()
             for w in inaugural.words(title)
             if w.lower() in keys]
  cfd2 = ConditionalFreqDist(keyYear)
  cfd2.plot()
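
A sketch of what this first attempt does to a handful of tokens, assuming a hypothetical token list standing in for one address (it is not taken from the inaugural corpus). It uses a standard-library counter in place of ConditionalFreqDist so that it runs without any corpus.

```python
from collections import Counter, defaultdict

keys = ['america', 'citizen']

# Hypothetical tokens, invented for illustration only.
tokens = ['America', 'american', 'Citizens', 'citizen', 'america']
year = '1789'

# Same test as in the first try: exact match after lowercasing,
# with the original spelling of the word used as the condition.
keyYear = [(w, year) for w in tokens if w.lower() in keys]

cfd = defaultdict(Counter)
for condition, sample in keyYear:
    cfd[condition][sample] += 1

print(sorted(cfd))   # the conditions that were created
print(len(keyYear))  # how many tokens matched
```

Two things show up in this toy run: 'America' and 'america' become separate conditions because the condition keeps the original capitalization, and inflected forms such as 'Citizens' are not counted at all.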

15
cfd2.plot()
16
Second try
  from nltk.corpus import inaugural
  from nltk.probability import ConditionalFreqDist
  keys = ['america', 'citizen']
  # condition = the key itself, so every matching word form is counted under it
  keyYear = [(key, title[:4])
             for title in inaugural.fileids()
             for w in inaugural.words(title)
             for key in keys
             if w.lower().startswith(key)]
  cfd3 = ConditionalFreqDist(keyYear)
  cfd3.plot()
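
The effect of prefix matching can be checked on a few tokens. This is a sketch on a hypothetical token list (not corpus data), again with a standard-library counter standing in for ConditionalFreqDist.

```python
from collections import Counter, defaultdict

keys = ['america', 'citizen']

# Hypothetical tokens, invented for illustration only.
tokens = ['America', 'american', 'Citizens', 'citizen', 'america']
year = '1789'

# Same test as in the second try: the lowercased word must start
# with the key, and the key (not the word) becomes the condition.
keyYear = [(key, year)
           for w in tokens
           for key in keys
           if w.lower().startswith(key)]

cfd = defaultdict(Counter)
for condition, sample in keyYear:
    cfd[condition][sample] += 1

print(sorted(cfd))
print(cfd['america'][year], cfd['citizen'][year])
```

Using the key as the condition collapses all spellings into two lines on the plot. The cost is that any word beginning with the key is counted, whether or not it is related in meaning.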

17
cfd3.plot()
18
Stemming
19
Third try
  from nltk.stem.snowball import EnglishStemmer
  stemmer = EnglishStemmer()
  from nltk.corpus import inaugural
  from nltk.probability import ConditionalFreqDist
  keys = ['america', 'citizen']
  # keep a word only if its stem is one of the keys
  keyYear = [(w, title[:4])
             for title in inaugural.fileids()
             for w in inaugural.words(title)
             if stemmer.stem(w) in keys]
  cfd4 = ConditionalFreqDist(keyYear)
  cfd4.plot()
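
To see which word forms the stem test lets through, the stemmer can be run on a few tokens by itself. This sketch assumes NLTK is installed; the token list is hypothetical and chosen only to show different word forms.

```python
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()
keys = ['america', 'citizen']

# Hypothetical tokens, invented for illustration only.
tokens = ['America', 'Americans', 'citizen', 'Citizens']

# Map each token to its stem; the stemmer lowercases its input.
stems = {w: stemmer.stem(w) for w in tokens}

# Same test as in the third try: keep the word if its stem is a key.
matched = [w for w in tokens if stems[w] in keys]

print(stems)
print(matched)
```

In this run the plural 'Citizens' is reduced to the key 'citizen', while 'Americans' stems to 'american', which is not in `keys`, so stemming alone does not group it with 'America'.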

20
cfd4.plot()
21
Next time
  • Twitter