1
I256 Applied Natural Language Processing
Marti Hearst Nov 8, 2006    
2
Today
  • Comparing term clustering and category output
  • Clustering in Weka
  • Data mining from blogs

3
LDA
  • Latent Dirichlet Allocation
  • Blei, Ng, Jordan, JMLR 03.
  • LDA is a hierarchical probabilistic model of
    documents.
  • LDA allows you to analyze a corpus and extract
    the topics that combine to form its documents.
  • http://www.cs.princeton.edu/blei/lda-c/
  • Not really clustering, but in the soft
    clustering ballpark.
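The "topics that combine to form documents" idea can be illustrated with a minimal sketch of the LDA generative story: draw a topic mixture for the document, then draw each word by first picking a topic and then picking a word from that topic. The two topics, their word probabilities, and the `alpha` values below are made up for illustration; this is not the lda-c code and does no inference.

```python
import random

def sample_dirichlet(alpha, rng):
    # Normalised independent Gamma draws give one Dirichlet sample.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to the mixture.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        # Pick a word from the chosen topic's word distribution.
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics over a tiny recipe vocabulary (assumption, not from the slides).
TOPICS = [
    {"flour": 0.4, "sugar": 0.4, "butter": 0.2},
    {"garlic": 0.5, "onion": 0.3, "butter": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=10, rng=rng)
print(theta, words)
```

Because every document gets its own mixture `theta` rather than a single cluster label, a word such as "butter" can belong to both topics, which is what puts LDA in the soft clustering ballpark.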

4
LDA on Recipes
  • http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/Flamenco

5
LDA on Recipes
  • http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/Flamenco

6
CastaNet
  • (Semi)automated facet creation
  • Stoica & Hearst
  • Build up from WordNet
  • Algorithm is fully automatic but we think you can
    improve results manually afterwards.

7
CastaNet on Recipes
  • http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/Flamenco

8
CastaNet on Recipes
  • http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/Flamenco

9
TopicSeek on Enron Email
  • Technique: pLSI (probabilistic LSI, Hofmann 99)
  • Hand-picked example for website
  • http://topicseek.com/enron.html

10
TopicSeek on Medline
  • Technique: pLSI (probabilistic LSI, Hofmann 99)
  • Hand-picked example for website
  • http://topicseek.com/pubmed.html

11
CastaNet on Medline Journal Titles
  • http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/medicine-automated/Flamenco

12
Clustering in Weka
13
(No Transcript)
14
(No Transcript)
15
(No Transcript)
16
Looking at Clustering Results
  • Weka lets you save cluster results to an ARFF
    file
  • I wrote some Python code to process this file and
    pull out the Subject headings for each newsgroup
    posting in each cluster.
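The original script is not included in the slides. The following is only a sketch of what such processing might look like, assuming the saved ARFF file has a string attribute named `subject` and a nominal attribute named `Cluster` (both names are assumptions, as is the sample data). It handles simple single-quoted values only, not backslash-escaped quotes or quoted attribute names.

```python
import csv
from collections import defaultdict

def subjects_by_cluster(arff_lines, subject_attr="subject", cluster_attr="Cluster"):
    """Group the subject heading of each instance by its assigned cluster."""
    attrs = []
    in_data = False
    groups = defaultdict(list)
    for line in arff_lines:
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # The second whitespace-separated token is the attribute name.
                attrs.append(line.split(None, 2)[1].strip("'\""))
            elif lower.startswith("@data"):
                in_data = True
            continue
        # Data rows are comma-separated; strings may be single-quoted.
        row = next(csv.reader([line], quotechar="'", skipinitialspace=True))
        record = dict(zip(attrs, row))
        groups[record[cluster_attr]].append(record[subject_attr])
    return dict(groups)

# Hypothetical file contents, for illustration only.
SAMPLE = """\
@relation newsgroups_clustered
@attribute Instance_number numeric
@attribute subject string
@attribute Cluster {cluster0,cluster1}
@data
0,'Re: X server crash',cluster0
1,'Hockey playoffs, round 2',cluster1
2,'Re: window manager fonts',cluster0
"""

clusters = subjects_by_cluster(SAMPLE.splitlines())
for name, subjects in sorted(clusters.items()):
    print(name, subjects)
```

Reading the subject lines side by side per cluster is a quick way to judge by eye whether a cluster corresponds to a single newsgroup topic.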

17
15-way clustering
18
(No Transcript)
19
Cobweb clustering
20
(No Transcript)
21
Blog Analysis
  • What's special about blogs?

22
Blog analysis sites
  • http://dijest.com/bc/
  • Called blogcount: lots of stats and news about
    blogs
  • http://blogcensus.net/?pagetools
  • Language, location, market share
  • http://www.perseus.com/blogsurvey/
  • Stats about biggest blogs, demographics
  • http://www.weblogs.com/
  • Notifies when new content is posted
  • http://blogpulse.com/
  • Trends and recent popular topics

23
Blogs vs. Newsgroups
  • Posting about products: what can we tell?
  • Blog
  • Newsgroup

Example from Glance, Hurst, and Tomokiyo 04
24
Analyzing Blogs for Market Data
  • Idea: examine comments about a product (or a
    product's competition or market) in an automated
    fashion.
  • Application area: handheld electronic devices.

Figure from Glance, Hurst, Nigam, Siegler,
Stockton, Tomokiyo, KDD05
25
Analyzing Blogs for Market Data
Figure from Glance, Hurst, Nigam, Siegler,
Stockton, Tomokiyo, KDD05
26
Technology used
  • Post segmentation
  • Important phrases
  • Foreground vs. background corpus
  • Background: text about product
  • Foreground: certain negative paragraphs about
    product
  • Sentiment classification
  • What do people talk about when saying negative
    things about product X?
  • Social network analysis (on discussion boards)
  • What does this group of people talk about when
    saying negative things about product X?
  • Author dispersion
  • Many people talking about it, or just a few?
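The foreground vs. background comparison can be sketched as a smoothed log-ratio of phrase frequencies: phrases that are much more frequent in the negative (foreground) text than in general text about the product (background) score highest. This is a stand-in scoring rule chosen for illustration, not necessarily the measure used in the cited work, and the example sentences are made up.

```python
import math
from collections import Counter

def bigrams(text):
    """Split text on whitespace and return adjacent word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def distinctive_phrases(foreground, background, top_n=3, smoothing=1.0):
    """Rank foreground bigrams by smoothed log-ratio against the background."""
    fg = Counter(p for doc in foreground for p in bigrams(doc))
    bg = Counter(p for doc in background for p in bigrams(doc))
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab = len(set(fg) | set(bg))
    scores = {}
    for phrase, count in fg.items():
        # Add-one style smoothing keeps unseen background phrases finite.
        p_fg = (count + smoothing) / (fg_total + smoothing * vocab)
        p_bg = (bg[phrase] + smoothing) / (bg_total + smoothing * vocab)
        scores[phrase] = math.log(p_fg / p_bg)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical posts: foreground = negative paragraphs, background = all product text.
foreground = [
    "the battery died after a week",
    "battery died again on the new phone",
]
background = [
    "the new phone has a nice screen",
    "the screen on the new phone is bright",
    "a week with the new phone",
]

top = distinctive_phrases(foreground, background, top_n=1)
print(top)
```

Phrases such as "new phone" occur in both corpora and are pushed down, while a complaint-specific phrase rises to the top.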

27
Example
  • What common phrases do people use when saying
    negative things about product X?

28
Example
  • What do people in this group say when saying
    negative things about product X?

29
Example
  • What do people in this group say when saying
    negative things about product X?

30
Predicting Film Sales
  • Idea
  • Use discussion before a film to predict its
    opening weekend box office scores
  • Use discussion afterwards to predict longer-term
    sales
  • Separate out topic labels from sentiment labels
  • Outcome
  • Good predictor for opening weekend, but not for
    longer term
  • Observation: the nature of discussion changes (and
    thus gets harder to analyze) after the film has
    been out a while.
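One simple way to check whether pre-release discussion is a "good predictor" is to correlate a per-film discussion measure with opening weekend gross. The sketch below computes a Pearson correlation by hand; the post counts and gross figures are invented for illustration and are not data from the cited study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-film values: positive pre-release posts and
# opening weekend gross in millions of dollars.
positive_posts = [120, 300, 80, 450, 210]
opening_gross = [15.0, 42.0, 9.5, 60.0, 25.0]

r = pearson(positive_posts, opening_gross)
print(round(r, 3))
```

Separating topic labels from sentiment labels, as the slide suggests, would mean computing one such correlation for raw mention counts and another for counts of positive mentions only, then comparing them.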

Example from Mishne & Glance, 2006
31
Predicting Film Sales
Example from Mishne & Glance, 2006
32
Predicting Film Sales
Example from Mishne & Glance, 2006
33
Predicting Film Sales
Example from Mishne & Glance, 2006
34
Analyzing Political Blogs
  • Analyze
  • Who links to whom
  • What the popularity profile looks like
  • A power law/Zipf/Pareto, of course
  • Look at structure of topic-specific blogs
  • By inbound links
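A rough check for a power law popularity profile is to sort blogs by inbound links and fit a straight line to log(links) against log(rank); a Zipf-like profile gives a slope near -1. The sketch below uses ordinary least squares, which is only a crude estimate of the exponent, and the link counts are synthetic.

```python
import math

def loglog_slope(inbound_links):
    """Least-squares slope of log(inbound links) against log(rank)."""
    counts = sorted((c for c in inbound_links if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic inbound-link counts that follow 1000 / rank exactly (assumption).
synthetic_links = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(synthetic_links)
print(round(slope, 3))
```

On real blog data the points will scatter around the line, especially in the tail where many blogs share the same small link count.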

Image from blogosphere ecosystem via Shirky
35
Analyzing Political Blogs
  • Earlier work examined books bought together in
    pairs at major retailers
  • Krebs, Divided We Stand??? http://www.orgnet.com/leftright.html
  • In other domains the groupings are more
    distributed.

36
http://www.orgnet.com/booknet.html
37
http://www.orgnet.com/leftright.html from Jan 2003
38
http://www.orgnet.com/divided.html from the 2004 election
39
Analyzing Political Blogs
  • Study by Adamic and Glance, 2005
  • Analyzed 40 most popular political blogs
  • 2 months preceding 2004 US presidential election
  • Also studied 1000 political blogs in a one-day
    snapshot
  • Findings for the latter
  • Liberal and conservative blogs had distinct lists
    of favorite news sources, people, and topics,
    with some overlap on current news
  • Use labels from aggregator sources
  • Linking patterns were indeed pretty internal (91%
    stayed within political leaning)
  • More and more frequent linking among
    conservatives
  • 82% of conservative blogs linked out vs. 74% of
    liberal
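The "stayed within political leaning" figure is a fraction over links: of all links between labelled blogs, how many connect two blogs with the same label. A minimal sketch, with hypothetical blog names and links:

```python
def within_leaning_fraction(links, leaning):
    """Fraction of links whose source and target share a political label."""
    # Only count links where both endpoints have a known label.
    labelled = [(s, t) for s, t in links if s in leaning and t in leaning]
    if not labelled:
        return 0.0
    same = sum(1 for s, t in labelled if leaning[s] == leaning[t])
    return same / len(labelled)

# Hypothetical labels (e.g. taken from aggregator listings) and links.
leaning = {
    "lib_a": "liberal",
    "lib_b": "liberal",
    "con_a": "conservative",
    "con_b": "conservative",
}
links = [
    ("lib_a", "lib_b"),
    ("lib_b", "lib_a"),
    ("con_a", "con_b"),
    ("con_b", "lib_a"),  # the only cross-leaning link
]

fraction = within_leaning_fraction(links, leaning)
print(fraction)
```

The same link list can be reused to count, per leaning, how many blogs have at least one outgoing link, which is the kind of comparison the last bullet reports.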

40
Analyzing Political Blogs
  • For the 40 most popular blogs
  • Looked for echo chamber effect
  • The conservative blogs are more tightly
    interlinked.
  • Question: do they repeat the same concepts more?
  • Measured textual similarity among blog posts
  • Slightly stronger within a political leaning than
    between, but not one orientation more than the
    other.
  • Looked for interaction with mainstream media
  • Found strong distinctions in which sources were
    cited
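The echo chamber question can be framed as comparing average textual similarity for pairs of posts within a leaning against pairs across leanings. The sketch below uses bag-of-words cosine similarity as a stand-in; the study's actual similarity measure may differ, and the posts are invented.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def mean_similarity(posts, same_leaning):
    """Average cosine over post pairs that do (or do not) share a leaning."""
    sims = [
        cosine(t1, t2)
        for (l1, t1), (l2, t2) in combinations(posts, 2)
        if (l1 == l2) == same_leaning
    ]
    return sum(sims) / len(sims) if sims else 0.0

# Hypothetical (leaning, post text) pairs.
posts = [
    ("liberal", "tax cuts hurt schools"),
    ("liberal", "schools need funding not tax cuts"),
    ("conservative", "tax cuts help business growth"),
    ("conservative", "business growth needs tax cuts"),
]

within = mean_similarity(posts, same_leaning=True)
between = mean_similarity(posts, same_leaning=False)
print(round(within, 3), round(between, 3))
```

Computing the within-leaning average separately for each side would show whether one orientation repeats itself more than the other, which is the comparison the slide reports as showing no difference.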

41
Image from Adamic & Glance 2005
42
Image from Adamic & Glance 2005
43
Image from Adamic & Glance 2005
44
Image from Adamic & Glance 2005
45
Image from Adamic & Glance 2005
46
Image from Adamic & Glance 2005
47
Next Time
  • Sentiment and Opinion Analysis