Title: News and Blog
1News and Blog Analysis with Lydia
PI: Steven Skiena. Postdoc: Charles Ward.
IC Advisor: Arthur Becker.
2Lydia
- Lydia performs named entity recognition and
analysis over large text corpora.
  - Spidering: Lydia spiders and parses thousands of
online news sources, including over 500 daily US
newspapers.
  - Named Entity Recognition: Lydia identifies and
classifies occurrences of proper entities
(people, places, companies, etc.).
  - Sentiment Analysis: Lydia assigns sentiment
scores to identified entities using shallow NLP
techniques.
  - Data Analysis: Lydia digests marked-up text and
produces usable entity statistics.
  - Synonym Set Identification: Lydia performs
unsupervised coreference analysis to identify the
aliases of entities.
3www.textmap.com
5Time Series
- Reference Time Series
  - George Bush References, 1987-2004 (New York
Times)
- Sentiment Time Series
  - Enron General Sentiment Index, 1996-2005 (New
York Times)
6Time Series
- New York Yankees References, 1987-2004 (New York
Times)
- Jeremiah Wright References, Jan. 2008 - Jun. 2008
(US Newspapers)
7Time Series
- Barack Obama Sentiment Rank, Jan. 2008 - Feb.
2008 (US Newspapers)
- John McCain Sentiment Rank, Jan. 2008 - Feb. 2008
(US Newspapers)
8Eliot Spitzer's Very Bad Day
9Juxtapositions
10Juxtapositions
11Juxtapositions
12Geographic Biases
13Geographic Bias
14Geographic Biases
15International News
16Lydia Architecture
- Spidering - text is retrieved from a given site
on a daily basis using semi-custom spidering
agents.
- Normalization - clean text is extracted with
semi-custom parsers and formatted for our
pipeline.
- Text Markup - parts of the clean text are
annotated for storage and analysis.
- Data Analysis and Visualization - we aggregate
entity frequency and relational data for a
variety of statistical analyses.
17Text Markup
- We apply natural language processing (NLP)
techniques to annotate interesting features of
the document.
- Full parsing techniques are too slow to keep up
with our volume of text, so we employ shallow
parsing instead.
- We can currently mark up approximately 2000
newspapers per day per CPU.
18Input
Dr. Judith Rodin, the former president of the
University of Pennsylvania, will become president
of the Rockefeller Foundation next year, the
foundation announced yesterday in New York. She
will take over in March 2005, succeeding Gordon
Conway, the foundation's first non-American
president. Mr. Conway announced last year that he
would retire at 66 in December and return to
Britain, where his children and grandchildren
live.
19Sentence and Paragraph Identification
<p> Dr. Judith Rodin, the former president of the
University of Pennsylvania, will become president
of the Rockefeller Foundation next year, the
foundation announced yesterday in New
York. </p>
<p> She will take over in March 2005,
succeeding Gordon Conway, the foundation's first
non-American president. Mr. Conway announced last
year that he would retire at 66 in December and
return to Britain, where his children and
grandchildren live. </p>
20Part Of Speech Tagging
<p> Dr./NNP Judith/NNP Rodin/NNP ,/, the/DT
former/JJ president/NN of/IN the/DT
University/NNP of/IN Pennsylvania/NNP ,/, will/MD
become/VB president/NN of/IN the/DT
Rockefeller/NNP Foundation/NN next/JJ year/NN ,/,
the/DT foundation/NN announced/VBD yesterday/RB
in/IN New/NNP York/NNP ./. </p>
<p> She/PRP
will/MD take/VB over/IN in/IN March/NNP 2005/CD
,/, succeeding/VBG Gordon/NNP Conway/NNP ,/,
the/DT foundation/NN 's/POS first/JJ
non-American/JJ president/NN ./. Mr./NNP
Conway/NNP announced/VBD last/JJ year/NN that/IN
he/PRP would/MD retire/VB at/IN 66/CD in/IN
December/NNP and/CC return/NN to/TO Britain/NNP
,/, where/WRB his/PRP children/NNS and/CC
grandchildren/NNS live/VBP ./. </p>
21Proper Noun Extraction
<p> <pn> Dr./NNP Judith/NNP Rodin/NNP </pn> ,/,
the/DT former/JJ president/NN of/IN the/DT <pn>
University/NNP </pn> of/IN <pn> Pennsylvania/NNP
</pn> ,/, will/MD become/VB president/NN of/IN
the/DT <pn> Rockefeller/NNP </pn> Foundation/NN
next/JJ year/NN ,/, the/DT foundation/NN
announced/VBD yesterday/RB in/IN <pn> New/NNP
York/NNP </pn> ./. </p>
<p> She/PRP will/MD
take/VB over/IN in/IN March/NNP 2005/CD ,/,
succeeding/VBG <pn> Gordon/NNP Conway/NNP </pn>
,/, the/DT foundation/NN 's/POS first/JJ
non-American/JJ president/NN ./. <pn> Mr./NNP
Conway/NNP </pn> announced/VBD last/JJ year/NN
that/IN he/PRP would/MD retire/VB at/IN 66/CD
in/IN December/NNP and/CC return/NN to/TO <pn>
Britain/NNP </pn> ,/, where/WRB his/PRP
children/NNS and/CC grandchildren/NNS live/VBP
./. </p>
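The marked-up example above is consistent with a simple rule: wrap each maximal run of consecutive NNP-tagged tokens as one proper-noun span. The following Python sketch illustrates that rule only. The function name `extract_proper_nouns` is hypothetical, and Lydia's actual extractor may apply further conditions.

```python
def extract_proper_nouns(tagged_tokens):
    """Wrap each maximal run of NNP-tagged tokens in <pn> ... </pn>.

    Assumption: tokens have the form word/TAG, as in the slide above.
    """
    out = []
    run = []  # the current run of consecutive NNP tokens
    for token in tagged_tokens:
        _word, _sep, tag = token.rpartition("/")
        if tag == "NNP":
            run.append(token)
            continue
        if run:
            out.append("<pn> " + " ".join(run) + " </pn>")
            run = []
        out.append(token)
    if run:  # flush a run that ends the input
        out.append("<pn> " + " ".join(run) + " </pn>")
    return " ".join(out)


tagged = ("Dr./NNP Judith/NNP Rodin/NNP ,/, the/DT former/JJ president/NN "
          "of/IN the/DT University/NNP of/IN Pennsylvania/NNP").split()
marked = extract_proper_nouns(tagged)
print(marked)
```

Because "of" is tagged IN, "University" and "Pennsylvania" come out as two separate spans at this stage, as on the slide.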
22Actor Classification
<p> <pn category="PERSON"> Dr./NNP Judith/NNP
Rodin/NNP </pn> ,/, the/DT former/JJ president/NN
of/IN the/DT <pn category="UNKNOWN">
University/NNP </pn> of/IN <pn
category="STATE"> Pennsylvania/NNP </pn> ,/, will/MD
become/VB president/NN of/IN the/DT <pn
category="UNKNOWN"> Rockefeller/NNP </pn> Foundation/NN
next/JJ year/NN ,/, the/DT foundation/NN
announced/VBD yesterday/RB in/IN <pn
category="CITY"> New/NNP York/NNP </pn> ./. </p>
<p> She/PRP will/MD take/VB over/IN in/IN <embedded_date>
March/NNP 2005/CD </embedded_date> ,/,
succeeding/VBG <pn category="PERSON">
Gordon/NNP Conway/NNP </pn> ,/, the/DT
foundation/NN 's/POS <num type="ORDINAL">
first/JJ </num> non-American/JJ president/NN
./. <pn category="PERSON"> Mr./NNP Conway/NNP
</pn> announced/VBD last/JJ year/NN that/IN
he/PRP would/MD retire/VB at/IN <num
type="CARDINAL"> 66/CD </num> in/IN <embedded_date>
December/NNP </embedded_date> and/CC return/NN
to/TO <pn category="COUNTRY"> Britain/NNP </pn>
,/, where/WRB his/PRP children/NNS and/CC
grandchildren/NNS live/VBP ./. </p>
23Rewrite Rules
<p> <appellation> Dr. </appellation> <pn
category="PERSON"> Judith Rodin </pn> , the former
president of the <pn category="UNIVERSITY">
University of Pennsylvania </pn> , will become
president of the <pn category="UNKNOWN">
Rockefeller Foundation </pn> next year , the
foundation announced yesterday in <pn
category="CITY"> New York </pn> . </p>
<p> She will take
over in <embedded_date> March 2005
</embedded_date> , succeeding <pn
category="PERSON"> Gordon Conway </pn> , the foundation 's
<num type="ORDINAL"> first </num> non-American
president . <appellation> Mr. </appellation> <pn
category="PERSON"> Conway </pn> announced last
year that he would retire at <num
type="CARDINAL"> 66 </num> in <embedded_date> December
</embedded_date> and return to <pn
category="COUNTRY"> Britain </pn> , where his children and
grandchildren live . </p>
24Alias Expansion
<p> <appellation> Dr. </appellation> <pn
category="PERSON"> Judith Rodin </pn> , the former
president of the <pn category="UNIVERSITY">
University of Pennsylvania </pn> , will become
president of the <pn category="UNKNOWN">
Rockefeller Foundation </pn> next year , the
foundation announced yesterday in <pn
category="CITY"> New York </pn> . </p>
<p> She will take
over in <embedded_date> March 2005
</embedded_date> , succeeding <pn
category="PERSON"> Gordon Conway </pn> , the foundation 's
<num type="ORDINAL"> first </num> non-American
president . <appellation> Mr. </appellation> <pn
category="PERSON"> Gordon Conway </pn>
announced last year that he would retire at <num
type="CARDINAL"> 66 </num> in <embedded_date>
December </embedded_date> and return to <pn
category="COUNTRY"> Britain </pn> , where his
children and grandchildren live . </p>
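In the example above, the later mention "Conway" is expanded to the earlier full name "Gordon Conway". One simple way to get that effect is to match a single-word person mention against the final word of earlier, longer mentions in the same document. The Python sketch below assumes that heuristic. It is not taken from Lydia's code, and `expand_aliases` is a hypothetical name.

```python
def expand_aliases(mentions):
    """Expand single-word PERSON mentions to an earlier full name.

    Assumed heuristic: a one-word mention is an alias of the most recent
    multi-word mention whose last word is identical to it.
    """
    seen = []      # multi-word mentions, in document order
    expanded = []
    for mention in mentions:
        words = mention.split()
        resolved = mention
        if len(words) == 1:
            for earlier in reversed(seen):
                if earlier.split()[-1] == words[0]:
                    resolved = earlier
                    break
        else:
            seen.append(mention)
        expanded.append(resolved)
    return expanded


people = ["Judith Rodin", "Gordon Conway", "Conway"]
resolved = expand_aliases(people)
print(resolved)
```

A mention with no matching earlier name is left unchanged, so the step cannot invent an entity that the document never names in full.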
25Geography Normalization
<p> <appellation> Dr. </appellation> <pn
category="PERSON"> Judith Rodin </pn> , the former
president of the <pn category="UNIVERSITY">
University of Pennsylvania </pn> , will become
president of the <pn category="UNKNOWN">
Rockefeller Foundation </pn> next year , the
foundation announced yesterday in <pn
category="CITY, STATE, COUNTRY"> New York City, New York,
USA </pn> . </p>
<p> She will take over in
<embedded_date> March 2005 </embedded_date> ,
succeeding <pn category="PERSON"> Gordon Conway
</pn> , the foundation 's <num type="ORDINAL">
first </num> non-American president
. <appellation> Mr. </appellation> <pn
category="PERSON"> Gordon Conway </pn> announced last
year that he would retire at <num
type="CARDINAL"> 66 </num> in <embedded_date> December
</embedded_date> and return to <pn
category="COUNTRY"> Britain </pn> , where his children and
grandchildren live . </p>
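The only change in this step is that the city mention "New York" becomes the fully qualified "New York City, New York, USA". A gazetteer lookup is one plausible way to do this. The Python sketch below assumes a tiny hand-written table, and both `GAZETTEER` and `normalize_place` are hypothetical names rather than parts of Lydia.

```python
# Hypothetical gazetteer: (surface form, category) -> (canonical name, category).
# A real table would be far larger and would have to handle ambiguous names.
GAZETTEER = {
    ("New York", "CITY"): ("New York City, New York, USA", "CITY, STATE, COUNTRY"),
}


def normalize_place(name, category):
    """Return the canonical (name, category) pair, or the input if unknown."""
    return GAZETTEER.get((name, category), (name, category))


print(normalize_place("New York", "CITY"))
print(normalize_place("Britain", "COUNTRY"))
```

Entries missing from the table pass through unchanged, which matches "Britain" keeping its COUNTRY category in the example.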
26Sentiment Markup
- Sentiment analysis lets us measure how
positively/negatively an entity is regarded, not
just how much it is talked about.
- We use large-scale statistical analysis instead
of careful NLP of individual sentences.
- We expand small seed lists of +/- terms into
large vocabularies using WordNet and
path-counting algorithms.
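The slide does not spell out the expansion, so the following Python sketch shows only one way a path-counting scheme could work. Every simple path from a seed word contributes to a word's score, antonym links flip the polarity, and longer paths count for less. The toy graph, the depth limit and the decay factor are all assumptions made for illustration. A real run would take its links from WordNet.

```python
from collections import defaultdict

# Hypothetical toy links: (word, neighbor, sign), where sign is +1 for a
# synonym link and -1 for an antonym link.
EDGES = [
    ("good", "fine", +1),
    ("fine", "decent", +1),
    ("good", "bad", -1),
    ("bad", "awful", +1),
]


def build_graph(edges):
    """Build an undirected adjacency list with signed links."""
    graph = defaultdict(list)
    for a, b, sign in edges:
        graph[a].append((b, sign))
        graph[b].append((a, sign))
    return graph


def expand(seeds, graph, max_depth=2, decay=0.5):
    """Score words by summing signed, depth-discounted simple paths from seeds."""
    scores = defaultdict(float)

    def walk(word, polarity, depth, visited):
        if depth > 0:
            scores[word] += polarity * decay ** depth
        if depth == max_depth:
            return
        for nxt, sign in graph[word]:
            if nxt not in visited:
                walk(nxt, polarity * sign, depth + 1, visited | {nxt})

    for seed, polarity in seeds.items():
        walk(seed, polarity, 0, {seed})
    return dict(scores)


scores = expand({"good": +1}, build_graph(EDGES))
print(scores)
```

In this toy graph, "fine" and "decent" receive positive scores and "bad" and "awful" negative ones, with magnitude shrinking as distance from the seed grows.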
27Multilingual Sentiment Analysis
- Implementing language-specific sentiment analysis
requires:
  - Language-specific NLP software (e.g., POS tagger)
  - Language-specific linguistic resources (e.g.,
WordNet)
- Instead, we couple machine translation with our
English sentiment analysis.
[Diagram: Foreign text -> Translator -> English sentiment analyzer]
28Other Lydia Features/Experiments
- Synsets: unsupervised coreferential entity
identification
- Movie gross prediction: improving prediction
models using entity news sentiment
- Concordance-based entity search: finding
relevant entities to arbitrary text queries
- Constructing social networks from juxtapositions:
grow communities from seed sets of entities,
compared to Wikipedia data.
29Data Analysis (Old way)
- The most interesting analysis occurs after
markup, using our MySQL database of all
occurrences of interesting entities.
- Each day's worth of analysis yields about 10
million occurrences of about 1 million different
entities, so efficiency matters.
- Linkage of each occurrence to source and time
facilitates a variety of interesting analyses.
30Lydia System Block Diagram (Old)
31Freedonia
- Freedonia is the new data analysis and
visualization component of Lydia.
- Old way
  - Perl and a centralized MySQL Database
  - Did not scale well (essentially sequential)
  - Cumbersome to perform experiments
  - Slow
- New way
  - Java and Apache Hadoop Map/Reduce Framework
  - Scales well
  - Fast
  - Facilitates experimentation
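To show why the Map/Reduce style scales where a sequential pass does not, here is a minimal sketch of counting entity references per day as a map step and a reduce step. It is written in Python for brevity, while Freedonia itself uses Java and Hadoop. The record layout, the function names and the sample articles are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_article(article):
    """Map step: emit ((entity, date), 1) for each occurrence in one article."""
    for entity in article["entities"]:
        yield (entity, article["date"]), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up sample input; the dates and entity lists are illustrative only.
articles = [
    {"date": "2004-11-30",
     "entities": ["Judith Rodin", "Gordon Conway", "Gordon Conway"]},
    {"date": "2004-11-30", "entities": ["Gordon Conway"]},
]
counts = reduce_counts(chain.from_iterable(map_article(a) for a in articles))
print(counts)
```

Each article is mapped independently of the others, so the map step can be spread across machines, and only the per-key sums need to be brought together.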
32Lydia System Block Diagram (New)
33Freedonia (New way)
- Main Components
  - Map/Reduce jobs to process marked-up text
  - A build system to manage dependencies of M/R
jobs and facilitate incremental updates
  - An API/Server to allow flexible queries of the
computed statistics, stored as indexed sorted map
files
  - A Website/Visualization front end
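A build system of this kind has to run jobs in dependency order and, for incremental updates, rerun only what a change affects. The Python sketch below illustrates both ideas with the standard library's `graphlib` (Python 3.9 or later). The job names are hypothetical and this is not Freedonia's actual build system.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each maps to the set of jobs whose output it consumes.
JOBS = {
    "parse_markup": set(),
    "entity_counts": {"parse_markup"},
    "juxtapositions": {"parse_markup"},
    "time_series": {"entity_counts"},
}


def build_order(jobs):
    """Return an order in which every job runs after its dependencies."""
    return list(TopologicalSorter(jobs).static_order())


def stale_jobs(jobs, changed):
    """Return the changed job plus every job that depends on it, transitively."""
    stale = {changed}
    grew = True
    while grew:
        grew = False
        for job, deps in jobs.items():
            if job not in stale and deps & stale:
                stale.add(job)
                grew = True
    return stale


order = build_order(JOBS)
print(order)
print(stale_jobs(JOBS, "entity_counts"))
```

In this example a change to `entity_counts` marks only `time_series` as stale in addition to itself, so `juxtapositions` is not recomputed.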
34Build System
35Statistics API/Server
- Statistics stored in sorted key-indexed files
(map files)
- API provides methods to retrieve aggregated and
filtered data from the statistics.
  - By date
  - By source set
  - By article type
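The three filters listed above can be pictured as optional arguments to a single query. The Python sketch below runs over an in-memory list and is not the real server, which reads indexed map files. The row layout, the function name `time_series` and the sample counts are all made up for illustration.

```python
from datetime import date

# Hypothetical rows: (entity, day, source, article_type, count).
ROWS = [
    ("Enron", date(2001, 12, 3), "nytimes", "news", 12),
    ("Enron", date(2001, 12, 3), "nytimes", "editorial", 2),
    ("Enron", date(2001, 12, 4), "latimes", "news", 5),
    ("Enron", date(2002, 1, 10), "nytimes", "news", 7),
]


def time_series(rows, entity, start=None, end=None,
                sources=None, article_types=None):
    """Daily reference counts for one entity after the optional filters."""
    series = {}
    for ent, day, source, kind, count in rows:
        if ent != entity:
            continue
        if start is not None and day < start:
            continue
        if end is not None and day > end:
            continue
        if sources is not None and source not in sources:
            continue
        if article_types is not None and kind not in article_types:
            continue
        series[day] = series.get(day, 0) + count
    return dict(sorted(series.items()))


december = time_series(ROWS, "Enron",
                       start=date(2001, 12, 1), end=date(2001, 12, 31))
nyt_news = time_series(ROWS, "Enron",
                       sources={"nytimes"}, article_types={"news"})
print(december)
print(nyt_news)
```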
36Freedonia Status
- Nearing completion of working initial system
- All major features of old system implemented
- All major features provided through APIs capable
of answering queries dynamically
- Can retrieve and graph time series for arbitrary
source and date ranges in a few seconds.
- Capable of computing all statistics for the 1 TB,
four-year dailies corpus in days, rather than
months to years, while preserving fine
granularity and all marked-up entities.
  - 74 million entities
  - 4 billion juxtaposition relationships
37Experiments using Freedonia
- Freedonia facilitates interesting experiments
with data at two levels:
  - The API allows easy access to many types of
interesting entity statistics.
  - New interesting statistics can be easily created
using the M/R jobs and the build system.
- Example: group statistics
38Experiments (continued)
- Muslim CEL Group References, 1987-2004 (New York
Times)
- Three M/R jobs allow the efficient aggregation of
large groups of entities over all statistics we
already compute.
- Ethnicity classification by name based on
Wikipedia data.
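Group statistics amount to summing the per-entity statistics of a group's members. The Python sketch below shows that aggregation for yearly reference counts. It runs in memory instead of as M/R jobs, and the entity names, the counts and the function name `group_series` are placeholders.

```python
from collections import Counter


def group_series(entity_series, members):
    """Sum the per-period counts of every group member that has data."""
    total = Counter()
    for entity in members:
        total.update(entity_series.get(entity, {}))
    return dict(sorted(total.items()))


# Placeholder data: entity -> {year: reference count}.
entity_series = {
    "Entity A": {1987: 3, 1988: 5},
    "Entity B": {1988: 2, 1989: 4},
    "Entity C": {1989: 9},
}
group = group_series(entity_series, ["Entity A", "Entity B"])
print(group)
```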
39Experiments (continued)
40Experiments (continued)
- What places are people associated with?
- How does this correlate with the predicted
ethnicity? (US and Canada removed)
41Future Work
- Version 1
  - Complete statistics extraction pipeline with
daily incremental updates will replace old
infrastructure.
- Experiments
  - Group statistics
    - How do groups interact in the news?
    - How are groups temporally and geographically
biased?
  - Information Diffusion
    - How do news sources interact?
    - What information flow models best fit our
different text corpora?
    - Do opinion leaders exist?
    - How do news sources and blogs differ in these
respects?
42Thanks to
- Steven Skiena
- Arthur Becker
- Mikhail Bautin
- Anurag Ambekar, Yeongmi Jeon, Andrew Mehler,
Dmytro Molkov, Shashank Naik, Akshay Patil,
Swapna Reddy, Shrikant Shanbag, Wenbin Zhang