News and Blog - PowerPoint PPT Presentation
Provided by: ziza1
1
News and Blog Analysis with Lydia
PI: Steven Skiena
Postdoc: Charles Ward
IC Advisor: Arthur Becker
2
Lydia
  • Lydia performs named entity recognition and
    analysis over large text corpora.
  • Spidering: Lydia spiders and parses thousands of
    online news sources, including over 500 daily US
    newspapers.
  • Named Entity Recognition: Lydia identifies and
    classifies occurrences of proper entities
    (people, places, companies, etc.).
  • Sentiment Analysis: Lydia assigns sentiment
    scores to identified entities using shallow NLP
    techniques.
  • Data Analysis: Lydia digests marked-up text and
    produces usable entity statistics.
  • Synonym Set Identification: Lydia performs
    unsupervised coreference analysis to identify the
    aliases of entities.

3
www.textmap.com
4
(No Transcript)
5
Time Series
  • Reference Time Series
  • George Bush References, 1987-2004 (New York
    Times)
  • Sentiment Time Series
  • Enron General Sentiment Index, 1996-2005 (New
    York Times)

6
Time Series
  • New York Yankees References, 1987-2004 (New York
    Times)
  • Jeremiah Wright References, Jan. 2008 - Jun. 2008
    (US Newspapers)

7
Time Series
  • Barack Obama Sentiment Rank, Jan. 2008 - Feb.
    2008 (US Newspapers)
  • John McCain Sentiment Rank, Jan. 2008 - Feb. 2008
    (US Newspapers)

8
Eliot Spitzer's Very Bad Day
9
Juxtapositions
10
Juxtapositions
11
Juxtapositions
12
Geographic Biases
13
Geographic Bias
14
Geographic Biases
15
International News
16
Lydia Architecture
  • Spidering - text is retrieved from a given site
    on a daily basis using semi-custom spidering
    agents.
  • Normalization - clean text is extracted with
    semi-custom parsers and formatted for our
    pipeline.
  • Text Markup - parts of the clean text are
    annotated for storage and analysis.
  • Data Analysis and Visualization - we aggregate
    entity frequency and relational data for a
    variety of statistical analyses.
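
The four stages above can be sketched as a chain of functions. This is only an illustrative toy, not Lydia's code: the function names, the hard-coded page, and the capitalized-word heuristic standing in for real markup are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stand-in for a spidering agent: a real agent would fetch
# pages over HTTP; here one page is hard-coded so the sketch is runnable.
PAGES = {
    "example-news": "<html><body><p>Gordon Conway will retire.</p>"
                    "<p>Judith Rodin succeeds Gordon Conway.</p></body></html>"
}

def spider(site):
    """Spidering: return the raw HTML retrieved for a site."""
    return PAGES[site]

def normalize(raw_html):
    """Normalization: extract clean paragraph text from the raw HTML."""
    return re.findall(r"<p>(.*?)</p>", raw_html, flags=re.S)

def mark_up(paragraphs):
    """Text markup (toy): treat runs of two or more capitalized words as entities."""
    pattern = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")
    return [pattern.findall(p) for p in paragraphs]

def analyze(marked):
    """Data analysis: aggregate entity frequencies over all paragraphs."""
    return Counter(entity for entities in marked for entity in entities)

def run_pipeline(site):
    return analyze(mark_up(normalize(spider(site))))

counts = run_pipeline("example-news")
```

Keeping each stage as a separate function mirrors the slide's separation of concerns: a site-specific spider or parser can be swapped without touching the later stages.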

17
Text Markup
  • We apply natural language processing (NLP)
    techniques to annotate interesting features of
    the document.
  • Full parsing techniques are too slow to keep up
    with our volume of text, so we employ shallow
    parsing instead.
  • We can currently mark up approximately 2000
    newspapers per day per CPU.

18
Input
Dr. Judith Rodin, the former president of the
University of Pennsylvania, will become president
of the Rockefeller Foundation next year, the
foundation announced yesterday in New York. She
will take over in March 2005, succeeding Gordon
Conway, the foundation's first non-American
president. Mr. Conway announced last year that he
would retire at 66 in December and return to
Britain, where his children and grandchildren
live.
19
Sentence and Paragraph Identification
<p> Dr. Judith Rodin, the former president of the
University of Pennsylvania, will become president
of the Rockefeller Foundation next year, the
foundation announced yesterday in New
York. </p> <p> She will take over in March 2005,
succeeding Gordon Conway, the foundation's first
non-American president. Mr. Conway announced last
year that he would retire at 66 in December and
return to Britain, where his children and
grandchildren live. </p>
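
One difficulty visible in this example is that "Dr." and "Mr." end in a period without ending a sentence. A minimal sketch of a splitter that handles this, assuming a small hand-made abbreviation list (the slides do not say how Lydia handles it):

```python
# Assumed, deliberately tiny abbreviation list; a real system needs a larger one.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}

def split_sentences(text):
    """Split on tokens that end in ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

sample = ("Dr. Judith Rodin will become president. "
          "She will take over in March 2005. Mr. Conway will retire.")
sentences = split_sentences(sample)
```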
20
Part Of Speech Tagging
<p> Dr./NNP Judith/NNP Rodin/NNP ,/, the/DT
former/JJ president/NN of/IN the/DT
University/NNP of/IN Pennsylvania/NNP ,/, will/MD
become/VB president/NN of/IN the/DT
Rockefeller/NNP Foundation/NN next/JJ year/NN ,/,
the/DT foundation/NN announced/VBD yesterday/RB
in/IN New/NNP York/NNP ./. </p> <p> She/PRP
will/MD take/VB over/IN in/IN March/NNP 2005/CD
,/, succeeding/VBG Gordon/NNP Conway/NNP ,/,
the/DT foundation/NN 's/POS first/JJ
non-American/JJ president/NN ./. Mr./NNP
Conway/NNP announced/VBD last/JJ year/NN that/IN
he/PRP would/MD retire/VB at/IN 66/CD in/IN
December/NNP and/CC return/NN to/TO Britain/NNP
,/, where/WRB his/PRP children/NNS and/CC
grandchildren/NNS live/VBP ./. </p>
21
Proper Noun Extraction
<p> <pn> Dr./NNP Judith/NNP Rodin/NNP </pn> ,/,
the/DT former/JJ president/NN of/IN the/DT <pn>
University/NNP </pn> of/IN <pn> Pennsylvania/NNP
</pn> ,/, will/MD become/VB president/NN of/IN
the/DT <pn> Rockefeller/NNP </pn> Foundation/NN
next/JJ year/NN ,/, the/DT foundation/NN
announced/VBD yesterday/RB in/IN <pn> New/NNP
York/NNP </pn> ./. </p> <p> She/PRP will/MD
take/VB over/IN in/IN March/NNP 2005/CD ,/,
succeeding/VBG <pn> Gordon/NNP Conway/NNP </pn>
,/, the/DT foundation/NN 's/POS first/JJ
non-American/JJ president/NN ./. <pn> Mr./NNP
Conway/NNP </pn> announced/VBD last/JJ year/NN
that/IN he/PRP would/MD retire/VB at/IN 66/CD
in/IN December/NNP and/CC return/NN to/TO <pn>
Britain/NNP </pn> ,/, where/WRB his/PRP
children/NNS and/CC grandchildren/NNS live/VBP
./. </p>
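
The output on this slide is consistent with a simple rule: wrap each maximal run of consecutive NNP tokens in a pn element (written here as `<pn> ... </pn>`). The sketch below implements that rule as an assumption; the slides do not state the actual extraction logic.

```python
def extract_proper_nouns(tagged_text):
    """Wrap each maximal run of NNP-tagged tokens in <pn> ... </pn>."""
    out, run = [], []

    def flush():
        # Close the current run of proper-noun tokens, if any.
        if run:
            out.append("<pn> " + " ".join(run) + " </pn>")
            run.clear()

    for token in tagged_text.split():
        # Tokens look like word/TAG; split on the last slash.
        _word, _, tag = token.rpartition("/")
        if tag == "NNP":
            run.append(token)
        else:
            flush()
            out.append(token)
    flush()
    return " ".join(out)

tagged = ("Dr./NNP Judith/NNP Rodin/NNP ,/, the/DT "
          "University/NNP of/IN Pennsylvania/NNP")
marked = extract_proper_nouns(tagged)
```

Under this rule "University of Pennsylvania" is split at the non-NNP token "of", which matches the slide; the later Rewrite Rules step is where such fragments get merged.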
22
Actor Classification
<p> <pn category="PERSON"> Dr./NNP Judith/NNP
Rodin/NNP </pn> ,/, the/DT former/JJ president/NN
of/IN the/DT <pn category="UNKNOWN">
University/NNP </pn> of/IN <pn category="STATE">
Pennsylvania/NNP </pn> ,/, will/MD
become/VB president/NN of/IN the/DT
<pn category="UNKNOWN"> Rockefeller/NNP </pn> Foundation/NN
next/JJ year/NN ,/, the/DT foundation/NN
announced/VBD yesterday/RB in/IN <pn category="CITY">
New/NNP York/NNP </pn> ./. </p> <p> She/PRP
will/MD take/VB over/IN in/IN <embedded_date>
March/NNP 2005/CD </embedded_date> ,/,
succeeding/VBG <pn category="PERSON">
Gordon/NNP Conway/NNP </pn> ,/, the/DT
foundation/NN 's/POS <num type="ORDINAL">
first/JJ </num> non-American/JJ president/NN
./. <pn category="PERSON"> Mr./NNP Conway/NNP
</pn> announced/VBD last/JJ year/NN that/IN
he/PRP would/MD retire/VB at/IN
<num type="CARDINAL"> 66/CD </num> in/IN <embedded_date>
December/NNP </embedded_date> and/CC return/NN
to/TO <pn category="COUNTRY"> Britain/NNP </pn>
,/, where/WRB his/PRP children/NNS and/CC
grandchildren/NNS live/VBP ./. </p>
23
Rewrite Rules
<p> <appellation> Dr. </appellation>
<pn category="PERSON"> Judith Rodin </pn> , the former
president of the <pn category="UNIVERSITY">
University of Pennsylvania </pn> , will become
president of the <pn category="UNKNOWN">
Rockefeller Foundation </pn> next year , the
foundation announced yesterday in
<pn category="CITY"> New York </pn> . </p> <p> She will take
over in <embedded_date> March 2005
</embedded_date> , succeeding
<pn category="PERSON"> Gordon Conway </pn> , the foundation 's
<num type="ORDINAL"> first </num> non-American
president . <appellation> Mr. </appellation>
<pn category="PERSON"> Conway </pn> announced last
year that he would retire at
<num type="CARDINAL"> 66 </num> in <embedded_date> December
</embedded_date> and return to
<pn category="COUNTRY"> Britain </pn> , where his children and
grandchildren live . </p>
24
Alias Expansion
<p> <appellation> Dr. </appellation>
<pn category="PERSON"> Judith Rodin </pn> , the former
president of the <pn category="UNIVERSITY">
University of Pennsylvania </pn> , will become
president of the <pn category="UNKNOWN">
Rockefeller Foundation </pn> next year , the
foundation announced yesterday in
<pn category="CITY"> New York </pn> . </p> <p> She will take
over in <embedded_date> March 2005
</embedded_date> , succeeding
<pn category="PERSON"> Gordon Conway </pn> , the foundation 's
<num type="ORDINAL"> first </num> non-American
president . <appellation> Mr. </appellation>
<pn category="PERSON"> Gordon Conway </pn>
announced last year that he would retire at
<num type="CARDINAL"> 66 </num> in <embedded_date>
December </embedded_date> and return to
<pn category="COUNTRY"> Britain </pn> , where his
children and grandchildren live. </p>
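
The only change on this slide is that the later mention "Conway" becomes "Gordon Conway". A minimal sketch of one way to do that, assuming the rule "replace a short mention with the first earlier, longer mention that ends with the same words" (the slides do not specify Lydia's actual rule):

```python
def expand_aliases(mentions):
    """Expand each mention to an earlier, longer mention with the same trailing words."""
    seen, expanded = [], []
    for mention in mentions:
        tokens = mention.split()
        full = mention
        for earlier in seen:
            earlier_tokens = earlier.split()
            # "Conway" matches "Gordon Conway": longer, and same final tokens.
            if (len(earlier_tokens) > len(tokens)
                    and earlier_tokens[-len(tokens):] == tokens):
                full = earlier
                break
        expanded.append(full)
        if mention not in seen:
            seen.append(mention)
    return expanded

# PERSON mentions in document order, as in the example text.
people = ["Judith Rodin", "Gordon Conway", "Conway"]
resolved = expand_aliases(people)
```

Matching only against earlier mentions in the same document keeps the rule conservative: an unrelated "Conway" in another article is not affected.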
25
Geography Normalization
<p> <appellation> Dr. </appellation>
<pn category="PERSON"> Judith Rodin </pn> , the former
president of the <pn category="UNIVERSITY">
University of Pennsylvania </pn> , will become
president of the <pn category="UNKNOWN">
Rockefeller Foundation </pn> next year , the
foundation announced yesterday in
<pn category="CITY, STATE, COUNTRY"> New York City, New York,
USA </pn> . </p> <p> She will take over in
<embedded_date> March 2005 </embedded_date> ,
succeeding <pn category="PERSON"> Gordon Conway
</pn> , the foundation 's <num type="ORDINAL">
first </num> non-American president
. <appellation> Mr. </appellation>
<pn category="PERSON"> Gordon Conway </pn> announced last
year that he would retire at
<num type="CARDINAL"> 66 </num> in <embedded_date> December
</embedded_date> and return to
<pn category="COUNTRY"> Britain </pn> , where his children and
grandchildren live. </p>
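
Here the CITY entity "New York" is rewritten to a fully qualified place name. One plausible way to do this is a gazetteer lookup; the table and function below are hypothetical and contain only the entry needed for this example.

```python
# Hypothetical gazetteer: (surface form, category) -> (city, state, country).
GAZETTEER = {
    ("New York", "CITY"): ("New York City", "New York", "USA"),
}

def normalize_place(name, category):
    """Return (normalized name, normalized category); unknown places pass through."""
    entry = GAZETTEER.get((name, category))
    if entry is None:
        return name, category
    return ", ".join(entry), "CITY, STATE, COUNTRY"

city = normalize_place("New York", "CITY")
country = normalize_place("Britain", "COUNTRY")
```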
26
Sentiment Markup
  • Sentiment analysis lets us measure how
    positively or negatively an entity is regarded,
    not just how much it is talked about.
  • We use large-scale statistical analysis instead
    of careful NLP of individual sentences.
  • We expand small seed lists of +/- terms into
    large vocabularies using WordNet and
    path-counting algorithms.
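
A toy illustration of seed expansion by path counting. Everything here is an assumption made for the sketch: the synonym graph is hand-made rather than taken from WordNet, "paths" are counted as walks of bounded length, and a word takes the polarity of whichever seed set reaches it by more walks. The slides do not give the actual algorithm.

```python
from collections import Counter

# Hand-made synonym graph standing in for WordNet synonym links.
SYNONYMS = {
    "good": ["fine", "great"],
    "great": ["good", "superb"],
    "fine": ["good"],
    "superb": ["great"],
    "bad": ["poor", "awful"],
    "poor": ["bad"],
    "awful": ["bad", "terrible"],
    "terrible": ["awful"],
}

def count_paths(seeds, max_depth):
    """Count walks of length 1..max_depth from any seed to each word."""
    totals = Counter()
    frontier = Counter(seeds)
    for _ in range(max_depth):
        next_frontier = Counter()
        for word, n_walks in frontier.items():
            for neighbor in SYNONYMS.get(word, []):
                next_frontier[neighbor] += n_walks
        totals.update(next_frontier)
        frontier = next_frontier
    return totals

def expand_lexicon(pos_seeds, neg_seeds, max_depth=2):
    """Assign +1/-1 to every word reached more often from one seed set than the other."""
    pos = count_paths(pos_seeds, max_depth)
    neg = count_paths(neg_seeds, max_depth)
    lexicon = {word: 1 for word in pos_seeds}
    lexicon.update({word: -1 for word in neg_seeds})
    for word in set(pos) | set(neg):
        if word in lexicon:
            continue  # seeds keep their given polarity
        if pos[word] > neg[word]:
            lexicon[word] = 1
        elif neg[word] > pos[word]:
            lexicon[word] = -1
    return lexicon

lexicon = expand_lexicon(["good"], ["bad"])
```

Words reached equally often from both seed sets are left out of the lexicon, which is one simple way to avoid assigning a polarity to ambiguous terms.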

27
Multilingual Sentiment Analysis
  • Implementing language-specific sentiment analysis
    requires:
      • Language-specific NLP software (e.g., POS tagger)
      • Language-specific linguistic resources (e.g.,
        WordNet)
  • Instead, we couple machine translation with our
    English sentiment analysis.

(Diagram: Foreign text -> Translator -> English sentiment analyzer)
28
Other Lydia Features/Experiments
  • Synsets: unsupervised coreferential entity
    identification
  • Movie gross prediction: improving prediction
    models using entity news sentiment
  • Concordance-based entity search: finding
    entities relevant to arbitrary text queries
  • Constructing social networks from juxtapositions:
    growing communities from seed sets of entities,
    compared against Wikipedia data

29
Data Analysis (Old way)
  • The most interesting analysis occurs after
    markup, using our MySQL database of all
    occurrences of interesting entities.
  • Each day's worth of analysis yields about 10
    million occurrences of about 1 million different
    entities, so efficiency matters.
  • Linkage of each occurrence to source and time
    facilitates a variety of interesting analyses.

30
Lydia System Block Diagram (Old)
31
Freedonia
  • Freedonia is the new data analysis and
    visualization component of Lydia.
  • Old way
      • Perl and a centralized MySQL database
      • Did not scale well (essentially sequential)
      • Cumbersome to perform experiments
      • Slow
  • New way
      • Java and the Apache Hadoop Map/Reduce framework
      • Scales well
      • Fast
      • Facilitates experimentation

32
Lydia System Block Diagram (New)
33
Freedonia (New way)
  • Main Components
  • Map/Reduce jobs to process marked up text
  • A build system to manage dependencies of M/R
    jobs, and facilitate incremental updates
  • An API/Server to allow flexible queries of the
    computed statistics stored as indexed sorted map
    files.
  • A Website/Visualization front end
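
To make the Map/Reduce component concrete, here is an in-memory simulation of one such job: counting entity occurrences per (entity, date). It is written in Python rather than Java/Hadoop for brevity, and it assumes the marked-up text serializes entities as `<pn category="...">` elements; the article records and function names are made up for the sketch.

```python
import re
from collections import defaultdict

# Assumed serialization of marked-up entities.
PN = re.compile(r'<pn category="[^"]*">\s*(.*?)\s*</pn>')

def map_article(article):
    """Map: emit ((entity, date), 1) for every marked-up entity occurrence."""
    for entity in PN.findall(article["text"]):
        yield (entity, article["date"]), 1

def reduce_counts(key, values):
    """Reduce: sum the occurrence counts for one (entity, date) key."""
    return key, sum(values)

def run_job(articles):
    """Simulate the shuffle phase by grouping mapped values by key in memory."""
    groups = defaultdict(list)
    for article in articles:
        for key, value in map_article(article):
            groups[key].append(value)
    return dict(reduce_counts(k, vs) for k, vs in sorted(groups.items()))

articles = [{
    "date": "2004-12-01",
    "text": '<pn category="PERSON"> Gordon Conway </pn> retires . '
            '<pn category="PERSON"> Gordon Conway </pn> returns to '
            '<pn category="COUNTRY"> Britain </pn> .',
}]
counts = run_job(articles)
```

Because the map step looks at one article at a time and the reduce step at one key at a time, the same logic can be spread across many machines, which is the property the old sequential Perl/MySQL pipeline lacked.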

34
Build System
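
A build system that manages job dependencies and incremental updates can be pictured as a dependency graph: run jobs in topological order, and after a change rerun only the jobs downstream of it. The job names below are hypothetical, and the sketch uses Python's standard `graphlib` module (Python 3.9+) rather than anything from Freedonia.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each job maps to the set of jobs it depends on.
DEPENDENCIES = {
    "entity_counts": {"markup"},
    "juxtapositions": {"markup"},
    "sentiment_index": {"entity_counts"},
    "time_series": {"entity_counts", "sentiment_index"},
}

def build_order(deps):
    """Return all jobs in an order where dependencies come first."""
    return list(TopologicalSorter(deps).static_order())

def stale_jobs(deps, changed):
    """Return the changed jobs plus everything downstream, in build order."""
    order = build_order(deps)
    stale = set(changed)
    for job in order:
        # A job is stale if any of its dependencies is stale.
        if deps.get(job, set()) & stale:
            stale.add(job)
    return [job for job in order if job in stale]

order = build_order(DEPENDENCIES)
to_rerun = stale_jobs(DEPENDENCIES, {"entity_counts"})
```

In this example a change to `entity_counts` leaves `markup` and `juxtapositions` untouched, which is what makes daily incremental updates cheaper than a full rebuild.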
35
Statistics API/Server
  • Statistics stored in sorted key-indexed files
    (map files)
  • API provides methods to retrieve aggregated and
    filtered data from the statistics:
      • By date
      • By source set
      • By article type
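
A small sketch of how sorted, key-indexed storage supports such queries: with keys ordered by (entity, date, source), a date range for one entity is a contiguous slice that binary search can locate. The class, key layout, and sample numbers are assumptions for illustration, not the Freedonia API.

```python
from bisect import bisect_left, bisect_right

class SortedStats:
    """In-memory stand-in for a sorted map file keyed by (entity, date, source)."""

    def __init__(self, records):
        # records: iterable of ((entity, date, source), count); dates are ISO strings.
        self._items = sorted(records)
        self._keys = [key for key, _ in self._items]

    def time_series(self, entity, start, end, sources=None):
        """Aggregate counts per date for one entity, optionally filtered by source set."""
        lo = bisect_left(self._keys, (entity, start, ""))
        hi = bisect_right(self._keys, (entity, end, "\uffff"))
        series = {}
        for (_, date, source), count in self._items[lo:hi]:
            if sources is None or source in sources:
                series[date] = series.get(date, 0) + count
        return series

# Made-up sample data.
stats = SortedStats([
    (("Enron", "2001-12-02", "nyt"), 5),
    (("Enron", "2001-12-02", "wsj"), 3),
    (("Enron", "2001-12-03", "nyt"), 7),
    (("Enron", "2002-01-01", "nyt"), 1),
    (("Obama", "2001-12-02", "nyt"), 2),
])
december = stats.time_series("Enron", "2001-12-01", "2001-12-31")
nyt_only = stats.time_series("Enron", "2001-12-01", "2001-12-31", sources={"nyt"})
```

Only the slice between the two binary-search positions is scanned, so query cost depends on the size of the answer rather than the size of the whole statistics file.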

36
Freedonia Status
  • Nearing completion of working initial system
  • All major features of old system implemented
  • All major features provided through APIs capable
    of answering queries dynamically
  • Can retrieve and graph time series for arbitrary
    source and date ranges in a few seconds.
  • Capable of computing all statistics for the 1 TB
    four-year dailies corpus in days rather than
    months to years, while preserving fine
    granularity and all marked-up entities:
      • 74 million entities
      • 4 billion juxtaposition relationships

37
Experiments using Freedonia
  • Freedonia facilitates interesting experiments
    with data at two levels:
      • The API allows easy access to many types of
        interesting entity statistics.
      • New interesting statistics can be easily created
        using the M/R jobs and the build system.
  • Example: group statistics

38
Experiments (continued)
  • Muslim CEL Group References, 1987-2004 (New York
    Times)
  • Three M/R jobs allow the efficient aggregation of
    large groups of entities over all statistics we
    already compute.
  • Ethnicity classification by name based on
    Wikipedia data.

39
Experiments (continued)
40
Experiments (continued)
  • What places are people associated with?
  • How does this correlate with the predicted
    ethnicity? (US and Canada removed)

41
Future Work
  • Version 1
      • Complete statistics extraction pipeline with
        daily incremental updates will replace old
        infrastructure.
  • Experiments
      • Group statistics
          • How do groups interact in the news?
          • How are groups temporally and geographically
            biased?
      • Information diffusion
          • How do news sources interact?
          • What information flow models best fit our
            different text corpora?
          • Do opinion leaders exist?
          • How do news sources and blogs differ in these
            respects?

42
Thanks to
  • Steven Skiena
  • Arthur Becker
  • Mikhail Bautin
  • Anurag Ambekar, Yeongmi Jeon, Andrew Mehler,
    Dmytro Molkov, Shashank Naik, Akshay Patil,
    Swapna Reddy, Shrikant Shanbag, Wenbin Zhang