Title: Some Linguistic Implications of the SHARES Project
1 System of Hypermatrix Analysis, Retrieval, Evaluation and Summarisation
Some Linguistic Implications of the SHARES Project
J. Banerjee, RDUES, University of Liverpool
2 Principle of Lexical Cohesion and Hypermatrix Structure
- Basic hypothesis: similar patterns of lexical repetition occur in texts on similar topics
- Networks of lexical repetition are used to identify closely related sentences across texts, and thereby similar texts
- The hypermatrix structure identifies links between repeated words and bonds between closely linked sentences
- The number of bonded sentences between a pair of texts gives their match score. This is higher for texts on similar topics
3 Links and Bonds
Indonesia and Malaysia have taken their first
sips of the bitter medicine of economic
retrenchment, scaling back their growth plans,
halting expensive building projects and
announcing austerity measures.

The central bank voted again on Tuesday not to
raise interest rates, apparently in the belief
that economic growth is now moderating and that
any tightening would risk further destabilizing
Asia at a time of political unrest in Indonesia.

Links = 3, Bond = 1 if Link Threshold = 3
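
The link count in this example can be reproduced with a small sketch. It assumes, since the slides do not say so explicitly, that a link is a content word shared by both sentences, that words are matched on their exact lower-cased form, and that a bond is formed once the link count reaches the threshold. The stop list and all function names are illustrative and are not taken from SHARES.

```python
import re

# Hypothetical stop list: function words that should not count as lexical links.
STOP_WORDS = {"a", "and", "any", "at", "have", "in", "is", "not", "of",
              "on", "that", "the", "their", "to", "would"}

SENT_A = ("Indonesia and Malaysia have taken their first sips of the bitter "
          "medicine of economic retrenchment, scaling back their growth plans, "
          "halting expensive building projects and announcing austerity measures.")
SENT_B = ("The central bank voted again on Tuesday not to raise interest rates, "
          "apparently in the belief that economic growth is now moderating and "
          "that any tightening would risk further destabilizing Asia at a time "
          "of political unrest in Indonesia.")


def content_words(sentence):
    """Return the set of lower-cased word forms that are not stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def shared_words(s1, s2):
    """Words that link the two sentences (assumed: exact repetition only)."""
    return content_words(s1) & content_words(s2)


def is_bonded(s1, s2, link_threshold=3):
    """Assumed rule: a bond exists when links >= link_threshold."""
    return len(shared_words(s1, s2)) >= link_threshold


print(sorted(shared_words(SENT_A, SENT_B)))
print(is_bonded(SENT_A, SENT_B))
```

With this stop list the shared words are economic, growth and Indonesia, which gives the three links shown on the slide.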
4 Links and Bonds
Article A
S1 w1 w2 w3 w4 w1 w5 w6 w7 w8
S2 w9 w10 w1 w3 w4 w11 w12 w5 w13
S8 w9 w2 w14 w15 w3 w16 w5 w17 w18
Article B
S9 w19 w20 w3 w4 w5 w11 w12 w5 w13 w21
Links = 3, Bond = 1
6 Links and Bonds
Article A
S1 w1 w2 w3 w4 w1 w5 w6 w7 w8
S2 w9 w10 w1 w3 w4 w11 w12 w5 w13
S8 w9 w2 w14 w15 w3 w16 w5 w17 w18
Article B
S9 w19 w20 w3 w4 w5 w11 w12 w5 w13 w21
Links = 3, Bond = 1
7 Links and Bonds
Article A
S1 w1 w2 w3 w4 w1 w5 w6 w7 w8
S2 w9 w10 w1 w3 w4 w11 w12 w5 w13
S8 w9 w2 w14 w15 w3 w16 w5 w17 w18
Article B
S9 w19 w20 w3 w4 w5 w11 w12 w5 w13 w21
Links = 6, Bond = 1
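
The counts in the schematic example can be checked mechanically. The sketch below assumes that each distinct word shared by two sentences counts as one link and that a sentence pair is bonded when its links reach the threshold. The dictionary layout and the function names are hypothetical, and only the sentences shown on the slide are included.

```python
from itertools import product

# Sentences as shown on the slide (S3 to S7 are not given there).
ARTICLE_A = {
    "S1": "w1 w2 w3 w4 w1 w5 w6 w7 w8",
    "S2": "w9 w10 w1 w3 w4 w11 w12 w5 w13",
    "S8": "w9 w2 w14 w15 w3 w16 w5 w17 w18",
}
ARTICLE_B = {
    "S9": "w19 w20 w3 w4 w5 w11 w12 w5 w13 w21",
}


def links(sent_a, sent_b):
    """Number of distinct words the two sentences have in common."""
    return len(set(sent_a.split()) & set(sent_b.split()))


def match_score(text_a, text_b, link_threshold=3):
    """Count cross-text sentence pairs whose links reach the threshold."""
    return sum(
        1
        for sa, sb in product(text_a.values(), text_b.values())
        if links(sa, sb) >= link_threshold
    )


for name, sentence in ARTICLE_A.items():
    print(name, "S9", links(sentence, ARTICLE_B["S9"]))
print("match score:", match_score(ARTICLE_A, ARTICLE_B))
```

Over the sentences shown, S2 and S9 share six words (w3, w4, w5, w11, w12, w13), S1 and S9 share three, and S8 and S9 share two, so with a threshold of 3 two sentence pairs bond across the articles.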
8 Visualisation Interface
9 TDT2 Corpus
- Corpus designed for the US Topic Detection and Tracking programme
- Consists of US newspaper articles (NYT), radio broadcasts (VOA, APW), television broadcasts (CNN, ABC)
- 64,527 articles, 1,111,445 sentences, 20,232,752 tokens
- 100 specified topics
- 8,040 articles with topics assigned (some articles with 2 or more topics, some articles misclassified)
10 Mini Test Corpus
- 33 articles, 11 topics with 3 articles per topic
- 1259 sentences, 27948 tokens, 5999 types
- Topic Groups:
  - A: Asian Economic Crisis
  - B: Monica Lewinsky
  - C: McVeigh's Navy Dismissal Fight
  - D: Fossett's Balloon Ride
  - E: Pope Visits Cuba
  - F: 1998 Winter Olympics
  - G: Current Conflict with Iraq
  - H: Violence in Algeria
  - I: Quality of Life, NYC
  - J: Superbowl '98
  - K: China Air Crash
11 Weighting Issues: Short Sentences
1136 Gosh, the dreaded AFC streak is over.
1137 Denver is finally king and so is the AFC.
1138 Brett Favre, the Packers quarterback, and his teammates tried desperately until the finish.
1139 The Packers had the ball at the Denver 31 with 32 seconds left and faced a fourth-and-6.
1140 Favre dropped back.
1141 The pocket pressure was intense.
1142 He looked for tight end Mark Chmura, but John Mobley was there to swat the ball away.
1143 Thus, for Denver, for the AFC, so was all of that misery.
1144 Swatted away.
1145 John Elway, the Denver quarterback, got his ring in his fourth Super Bowl try and the Broncos reign.
12 Weighting Issues: Long Sentences
236 In a case that has unified gay-rights groups
and advocates of cyberspace privacy, the petty
officer has filed suit in U.S. District Court in
Washington to save his Navy career, arguing that
his dismissal is the result of a violation both
of federal privacy laws and of the Defense
Department's don't ask, don't tell policy, which
was supposed to put an end to aggressive
campaigns to ferret out homosexuals in the
military. (72 tokens)
13 Weighting Issues: Frequent Words
- Article A (Asian Economic Crisis): 1 occurrence of New York/ers
  - "It's been quite shocking to see the situation deteriorate to the extent that it has," said Leslie Richardson, the managing director of the Asian Equities Division for SocGen-Crosby Securities in New York.
- Article B (Quality of Life in NYC): 10 occurrences of New York/ers, e.g.
  - A contract dispute has strained the mayor's relations with the rank-and-file officers, who have balked at ticketing jaywalking, which is illegal but practiced by most New Yorkers.
  - A Marist College poll this week placed the mayor first among registered voters in New York as a possible candidate in a Republican presidential primary; Gov. George Pataki came in fourth.
  - He added that he would push for city workers to treat New Yorkers politely and for children in the public elementary schools to take civics classes, an idea welcomed Wednesday by Schools Chancellor Rudy Crew.
  - These poses, political experts said, are aimed at that aspect of the city that inspires a mixture of fear and fascination among suburban sensibilities, like the character of the out-of-control New York cabdriver that David Letterman has popularized across the nation.
  - In an address Wednesday announcing the second phase of his quality of life campaign for New York City.
14 Weighting: Theme and Rheme
15 Weighting: Theme and Rheme
16 Weighting Strategies
- Bond counts are given weights based on factors such as:
  - Observed vs. expected frequencies of linking words (so that rare-word links are weighted higher than common-word links), e.g. expected frequency of New York in Article B = 1.42, observed frequency = 10 (high value indicates likely topic word)
  - Z scores of linking words, e.g. New York = 0.84 (low value indicates uniform distribution, i.e. non-topic word)
  - Document length
  - Sentence length
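
One way to read the observed vs. expected comparison is as a simple ratio. The sketch below assumes that the expected frequency of a word in a document is its relative frequency in a reference corpus multiplied by the document length. The slides do not state how the figure of 1.42 was derived or how the comparison enters the final weight, so both function names and the formula are illustrative only.

```python
def expected_frequency(corpus_count, corpus_tokens, doc_tokens):
    """Expected occurrences of a word in a document of doc_tokens tokens,
    if it occurred at its reference-corpus rate (assumed definition)."""
    return corpus_count / corpus_tokens * doc_tokens


def observed_expected_ratio(observed, expected):
    """Values well above 1 suggest the word is over-represented in the document."""
    return observed / expected


# Figures quoted on the slide for "New York" in Article B.
ratio = observed_expected_ratio(observed=10, expected=1.42)
print(round(ratio, 2))  # prints 7.04
```

On this reading, New York occurs about seven times more often in Article B than its expected frequency would predict.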
17 Link Matrix
18 Weighted Link Matrix
19 Similarity Scores
- Inter-document similarity score is generated:
  - Bond counts weighted by relative word frequency in reference TDT corpus and by document length
  - Weighted bond counts aggregated over sentence pairs within any document pair
  - Weighted bond counts scaled to similarity scores such that Similarity Score = 0 if two documents have no bonds, and = 1 if two documents are as bonded with each other as they are with themselves
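
The scaling condition can be met in more than one way, and the slides do not give the formula. The sketch below is one possibility: it divides the aggregated weighted bond count for a document pair by the geometric mean of each document's bond count with itself. The choice of the geometric mean and the function name are assumptions, not the SHARES implementation.

```python
import math


def similarity_score(bonds_ab, bonds_aa, bonds_bb):
    """Scale a weighted bond count to a similarity score.

    bonds_ab: aggregated weighted bonds between documents A and B
    bonds_aa, bonds_bb: the same measure for each document against itself
    Normalising by the geometric mean is an assumption made for this sketch.
    """
    if bonds_ab == 0 or bonds_aa == 0 or bonds_bb == 0:
        return 0.0  # no bonds, or nothing to normalise against
    return bonds_ab / math.sqrt(bonds_aa * bonds_bb)


print(similarity_score(0.0, 5.0, 7.0))  # no bonds between the documents
print(similarity_score(4.0, 4.0, 4.0))  # as bonded with each other as with themselves
print(similarity_score(2.0, 4.0, 4.0))  # half as bonded
```

This returns 0 when the documents share no bonds and 1 when the cross-document count equals the self-bonding level, as the slide requires.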
20 Similarity Score Matrix
21 Document Similarity Representation
22 Future Work
- Thesaural input to improve recall
- Investigation of lemmatisation
- Large-scale testing
- Proper noun identification/resolution