Title: Topic Detection and Tracking (TDT)
Topic Detection and Tracking (TDT)
Weizheng Gao Xuhao Lai Litao Ou
Outline
- Introduction
- Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection
- First Story Detection In TDT Is Hard
- Topic Detection System in Broadcast News
What is TDT?
I. INTRODUCTION [1]
- TDT refers to automatic techniques for finding topically related material in streams of data.
- For example, see the figure: blocks with the same color are stories about the same event in several media.
Research Applications Defined in TDT
- Story Segmentation
- Topic Tracking
- Topic Detection
- First Story Detection
- Link Detection
1. Story Segmentation
- Detect changes between topically cohesive sections
2. Topic Tracking
- Keep track of stories similar to a set of example stories
3. Topic Detection
- Build clusters of stories that discuss the same topic
4. First Story Detection
- Detect whether a story is the first story of a topic
5. Link Detection
- Detect whether or not two stories are topically linked
II. Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection [2]
- A Document Representation Strategy Using Lexical Chains
- Detection Using Two Classifiers
- Conclusions
A Document Representation Strategy Using Lexical Chains
- The cohesive structure of a document can be explored and represented by lexical chains.
- For example, a typical chain for the topic "airplanes" might consist of the following words: plane, airplane, pilot, cockpit, air hostess, wing, engine.
- Chain words address two linguistic problems associated with traditional syntactic representations: synonymy and polysemy.
A Document Representation Strategy Using Lexical Chains (cont.)
- Terms are added to the most recently updated chain; this prompts the correct disambiguation of a word based on the context in which it is used.
- Proper nouns are the second element of the combined document representation.
Detection Using Two Classifiers
- 1. Convert the current document into a weighted chain-word vector and a weighted proper-noun vector.
- 2. The first document becomes the first cluster.
- 3. Subsequent incoming documents are compared with previously created clusters.
- 4. Find the most similar cluster and check whether the document satisfies the similarity condition. If not, the document is declared to discuss a new event and forms the seed of a new cluster.
- 5. This process continues until all documents have been classified (a sketch of this loop follows).
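A minimal Python sketch of this single-pass loop, assuming cosine similarity over sparse vectors and a fixed weighting `alpha` between the two classifiers (the threshold and `alpha` values are illustrative assumptions, not values from the paper):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (Counters)."""
    dot = sum(w * v[t] for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def detect_new_events(docs, threshold=0.2, alpha=0.5):
    """Single-pass detection over (chain_vector, noun_vector) pairs.
    Returns one boolean per document: True if it seeds a new cluster."""
    clusters = []                      # seed vectors of each cluster
    flags = []
    for chains, nouns in docs:
        best = max((alpha * cosine(chains, c) + (1 - alpha) * cosine(nouns, n)
                    for c, n in clusters), default=0.0)
        if best < threshold:           # similarity condition not met:
            clusters.append((chains, nouns))   # new event, new cluster seed
            flags.append(True)
        else:
            flags.append(False)
    return flags
```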
Conclusions
- The results show that an increase in system effectiveness is achieved when lexical chain (semantic) representations are used in conjunction with proper noun (syntactic) representations.
- A miss occurs when the system fails to detect a new event.
- False alarms occur when the system indicates a story contains a new event when it does not.
III. First Story Detection In TDT Is Hard [3]
Two Tasks of TDT
- Topic Tracking
- In tracking, the system is given a small number, N_t, of stories that are known to be on the same event-based news topic. The system then monitors the stream of subsequent news stories for ones that are on the same topic.
Two Tasks of TDT
- First Story Detection (FSD)
- FSD also monitors a stream of arriving news stories. However, the task is to mark each story as "first" or "not first" to indicate whether or not it is the first one discussing a news topic.
- The system provides a score for each story, where a high score indicates confidence that the story is first.
Tracking
- The TDT tracking task is fundamentally similar to IR's information filtering task.
- Each begins with a representation of a topic and then monitors a stream of arriving documents, making decisions about documents as they arrive (without a deferral period).
Topic
- Filtering is subject-based
- Tracking is event-based
- No user feedback after tracking begins
TDT-2 Corpus
- Approximately 60,000 news stories from January through June of 1998
- First four months of data for parameter tuning for tracking
- Final two months for evaluation
Tracking System
- A vector model for representing stories
- A vector centroid for representing topics
- Incoming stories are compared to the topic centroid
  - On-topic
  - Off-topic
FSD System
- Same as the tracking system
- Incoming stories are compared to every story that has appeared in the past
- If the new story exceeds a threshold with any one of the past stories, it is considered old; otherwise it is considered new (see the sketch below)
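A minimal sketch of this decision rule, assuming some pairwise similarity function (for example, the cosine sketched earlier) and an illustrative threshold:

```python
def first_story_detection(stories, sim, threshold=0.3):
    """Compare each incoming story to every past story; if the best match
    exceeds the threshold the story is old, otherwise it is new.
    Returns (fsd_score, is_first) per story; a high score means the
    system is confident the story is first."""
    past, results = [], []
    for story in stories:
        best = max((sim(story, old) for old in past), default=0.0)
        results.append((1.0 - best, best < threshold))
        past.append(story)
    return results
```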
TDT Assumptions and Relevance
- Stories are on a single topic
- Multiple topics are not judged
- IR query
- TDT topic
Evaluation Measures (Effectiveness)
Evaluation Measures
- Recall, Precision
- Miss, False alarm
- Richness (definitions are sketched below)
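In terms of a per-topic contingency table with $A$ on-topic stories flagged, $B$ off-topic stories flagged, $C$ on-topic stories not flagged, and $D$ off-topic stories not flagged, the measures can be written as follows (taking richness to be the fraction of on-topic stories in the stream, which is an assumption about the slide's usage):

```latex
\begin{align*}
\mathrm{Recall}    &= \frac{A}{A+C} &
\mathrm{Precision} &= \frac{A}{A+B} \\
P_{\mathrm{miss}}  &= \frac{C}{A+C} = 1-\mathrm{Recall} &
P_{\mathrm{fa}}    &= \frac{B}{B+D} \\
\mathrm{Richness}  &= \frac{A+C}{A+B+C+D}
\end{align*}
```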
Bounds on FSD
- One possible solution to FSD is to apply tracking technology
- Intuitively, the system marks the first story of the corpus with a very high score; if the second story tracks, it is assigned a low FSD score
Relating tracking and FSD
- The probability that we miss the first story for topic i (see the reconstruction below)
Relating tracking and FSD
- The topic-weighted average value
Relating tracking and FSD (assume that topic error rates are independent)
- The lower bound on the probability of an FSD false alarm for topic i
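A plausible reconstruction of the quantities these slides refer to, following the reduction described in [3]; $P_{\mathrm{miss}}(j)$ and $P_{\mathrm{fa}}(j)$ denote the tracking miss and false-alarm rates for topic $j$, assumed independent across topics:

```latex
% The first story of topic i is missed by FSD if it falsely tracks
% at least one of the i-1 previously seen topics:
P^{\mathrm{fsd}}_{\mathrm{miss}}(i) = 1 - \prod_{j<i}\bigl(1 - P_{\mathrm{fa}}(j)\bigr)

% Topic-weighted average over T topics:
\bar{P}^{\mathrm{fsd}}_{\mathrm{miss}} = \frac{1}{T}\sum_{i=1}^{T} P^{\mathrm{fsd}}_{\mathrm{miss}}(i)

% A later story on topic i is an FSD false alarm only when its own
% tracker misses it and no other tracker fires, which bounds the rate:
P^{\mathrm{fsd}}_{\mathrm{fa}}(i) \ge P_{\mathrm{miss}}(i)\prod_{j\ne i}\bigl(1 - P_{\mathrm{fa}}(j)\bigr)
```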
Expected FSD Performance (Figure 2)
Expected FSD Performance (Figure 3)
Difficulty of improving FSD (Figure 4)
Complexity Analysis
- It is possible to reduce the TDT FSD problem to the TDT tracking problem
- If the tracking problem is NP-complete, then FSD is hard as well
- Knowing about such relationships may help avoid redundant research or unnecessary investigative dead-ends
Project Implementation: First Story Detection Based on Tracking (Part III)
- Corpus: A.txt, M.txt
- First Story Detection based on Tracking
- Evaluation Measures: Recall, Precision, Miss, False alarm, Richness (a small scoring helper is sketched below)
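A minimal Python scoring helper under the same contingency-table definitions (richness taken as the fraction of first stories in the stream, an assumption):

```python
def fsd_measures(predicted, gold):
    """predicted[i] / gold[i] are True iff story i is (marked as) first."""
    hits    = sum(p and g         for p, g in zip(predicted, gold))
    fas     = sum(p and not g     for p, g in zip(predicted, gold))
    misses  = sum(g and not p     for p, g in zip(predicted, gold))
    rejects = sum(not p and not g for p, g in zip(predicted, gold))
    safe = lambda a, b: a / b if b else 0.0
    return {
        "recall":      safe(hits, hits + misses),
        "precision":   safe(hits, hits + fas),
        "miss":        safe(misses, hits + misses),
        "false_alarm": safe(fas, fas + rejects),
        "richness":    safe(hits + misses, len(gold)),
    }
```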
IV. Topic Detection System in Broadcast News [4]
INTRODUCTION
- Concerned with unsupervised grouping of news according to topic
- Creates story groupings through clustering
- A grouping involves the stories on the same topic
- Incremental k-means algorithm
- Probabilistic similarity metric
- Selection and thresholding metrics
- Experimental results
Incremental k-means algorithm
- Find the closest cluster and decide whether to merge
- Iterate through the stories and make changes during each iteration
- Corrects poor initial clusters
- Computational requirement is less imposing (a sketch of the loop follows)
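A minimal sketch of the incremental loop, assuming a generic similarity function and threshold in place of the paper's probabilistic metric (both illustrative):

```python
from collections import Counter

def centroid(vectors):
    """Mean of sparse term-weight vectors (Counters)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: w / len(vectors) for t, w in total.items()})

def incremental_kmeans(stories, sim, threshold, n_iter=5):
    """Each story joins the most similar existing cluster if it is similar
    enough, else it seeds a new one; repeated passes over the stories
    allow poor early assignments to be corrected."""
    assign = {}                  # story index -> cluster id
    clusters = []                # cluster id -> list of member indices
    for _ in range(n_iter):
        for i, story in enumerate(stories):
            if i in assign:                      # detach before reassigning
                clusters[assign[i]].remove(i)
            best_id, best_score = None, threshold
            for cid, members in enumerate(clusters):
                if not members:
                    continue
                score = sim(story, centroid([stories[j] for j in members]))
                if score >= best_score:
                    best_id, best_score = cid, score
            if best_id is None:                  # nothing close enough
                clusters.append([i])
                best_id = len(clusters) - 1
            else:
                clusters[best_id].append(i)
            assign[i] = best_id
    return [m for m in clusters if m]
```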
Probabilistic Similarity Metric
- Utilizes the BBN topic spotting metric
- Calculates $P(C\mid S)$ for topic detection
- Derived from Bayes' rule
- Assuming that the story words are conditionally independent, we get
  $P(C\mid S) = \frac{p(C)\,\prod_n p(s_n\mid C)}{p(S)}$
- where $p(s_n\mid C)$ is the probability that a word in a story on the topic represented by cluster C would be $s_n$,
- and $p(C)$ is the a priori probability that any new story will be relevant to cluster C.
Two-state model for a topic
- The BBN topic spotting metric uses a two-state model for $p(s_n\mid C)$
- One state is a distribution over the words in all of the stories in the group
- The other is a distribution from the whole corpus (see the sketch below)
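A plausible form of the two-state mixture; the mixing weight $\alpha$ and the exact conditioning are assumptions, since the slide only names the two distributions:

```latex
p(s_n \mid C) \;=\; \alpha\, p(s_n \mid \text{cluster } C) \;+\; (1-\alpha)\, p(s_n \mid \text{corpus})
```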
Clustering Metric
- 1. Selection Metric
  - Takes a story and outputs cluster scores
  - The BBN topic spotting metric finds the most topical cluster for a story
  - The selection metric D(S,C) could be chosen accordingly (a plausible form is sketched below)
  - where the $s_m$ are the story words, $p(s_m\mid C)$ is computed according to the above model, and D(S,C) is a justifiable metric for doing cluster selection
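One justifiable choice consistent with the slide's description is a per-word log-likelihood-ratio score against the corpus background; this specific form is an assumption, not taken verbatim from [4]:

```latex
D(S, C) \;=\; \sum_{m} \log \frac{p(s_m \mid C)}{p(s_m \mid \text{corpus})}
```

The cluster selected for story $S$ would then be $\arg\max_C D(S,C)$.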
Experiment Evidence
- A data set of clusters was extracted from each of TDT-1, 2, and 3
- Each cluster contains stories on one topic
- The misclassification rates for each data set are given in the table above
- The probabilistic metric is a candidate for the selection problem
- 2. Thresholding Metric
  - Determines whether or not a story belongs in a cluster
  - Combines scores and features from the system
  - Score normalization (plausible forms are sketched below):
    - Cosine distance metrics are naturally normalized
    - Length-normed Tspot: simply divides the log probability by the story length
    - Mean/sd-normed Tspot: depends very little on the length of the story
    - The normalized score is also reasonable
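Plausible forms of the two Tspot normalizations named above, writing $T(S,C)$ for the raw log-probability score; the exact statistics used for $\mu$ and $\sigma$ are assumptions:

```latex
T_{\text{len}}(S,C) = \frac{T(S,C)}{|S|}
\qquad
T_{\text{norm}}(S,C) = \frac{T(S,C) - \mu}{\sigma}
```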
Corpus and Evaluation
- The Linguistic Data Consortium (LDC) released a corpus referred to as the TDT-2 corpus
- It consists of 60,000 stories, subdivided into three two-month sets, from both newswire and audio sources
- An annotator determines which of the predefined topics are relevant to each story
- Judgments are YES, BRIEF, or NO
- Official evaluation metric: a weighted cost function
Weighted Cost Function
- Topic-weighted score
  - Counts each topic's contribution to the total cost equally
- Story-weighted score
  - Counts each story's contribution to the total cost equally
- C_D is the final detection cost,
- P_M and C_M are the probability and cost of a miss,
- P_FA and C_FA are the probability and cost of a false alarm,
- P_T is the a priori probability of a topic
- Note: the official evaluation is based on the topic-weighted score (the cost function is written out below)
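Written out with the terms defined above, this is the standard TDT detection cost:

```latex
C_D \;=\; C_M \cdot P_M \cdot P_T \;+\; C_{FA} \cdot P_{FA} \cdot (1 - P_T)
```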
Effect of Transcription Method
- ASR (audio source) transcripts tend to have a high error rate of about 23%, but are relatively consistent
- CCAP (closed-captioned data) transcripts have a smaller error rate, but are inconsistent
- NWT (newswire stories) transcripts have the lowest error rate
Different Normalization Schemes
- Combination 1 depends on both cosine distance and length-normalized Tspot
- Combination 2 depends on cosine distance and mean-normalized Tspot
Differences Between Data Sets
- The Jan-Feb data has a few very broad topics and a few focused ones (this affects system performance)
- The Mar-Apr data has roughly 1/8 the number of labeled stories of the Jan-Feb data set
- The May-Jun set contains roughly 3 times the number of labeled stories of Mar-Apr
- Table 3: results showing the correlation of C_D with average topic size (using CCAP+NWT data)
Subset Experiments
- The effect of multi-topic stories that contain non-annotated topics
- The data set used is Mar-Apr CCAP+NWT
- Create a data subset that contains only the stories that were annotated YES for one topic
Conclusion
- Cluster news stories according to topic
- Use the incremental k-means clustering algorithm to group the stories
- The clustering algorithm requires two types of clustering metrics: selection and thresholding
- The system uses the BBN metric for the selection metric
- The system uses a hybrid of the BBN metric with a conventional cosine distance metric for thresholding
References
- [1] http://www.nist.gov/speech/tests/tdt/
- [2] Nicola Stokes and Joe Carthy, "Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection," Department of Computer Science, University College Dublin.
- [3] James Allan, Victor Lavrenko, and Hubert Jin, "First Story Detection In TDT Is Hard," http://ciir.cs.umass.edu/pubfiles/ir-206.pdf
- [4] Frederick Walls, Hubert Jin, Sreenivasa Sista, and Richard Schwartz, "Topic Detection in Broadcast News," http://www.nist.gov/speech/publications/darpa99/pdf/tdt320.pdf
Thank you!