Title: Topic Detection and Tracking (TDT)
Topic Detection and Tracking (TDT)
Weizheng Gao Xuhao Lai Litao Ou
Outline
- Introduction
- Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection
- First Story Detection In TDT Is Hard
- Topic Detection System in Broadcast News
What is TDT?
I. INTRODUCTION [1]
- TDT refers to automatic techniques for finding topically related material in streams of data.
- For example, see the figure: blocks with the same color are stories about the same event in several media.
Research Applications Defined in TDT
- Story Segmentation
- Topic Tracking
- Topic Detection
- First Story Detection
- Link Detection
1. Story Segmentation
- Detect changes between topically cohesive sections
2. Topic Tracking
- Keep track of stories similar to a set of example stories
3. Topic Detection
- Build clusters of stories that discuss the same topic
4. First Story Detection
- Detect whether a story is the first story of a topic
5. Link Detection
- Detect whether or not two stories are topically linked
II. Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection [2]
- A Document Representation Strategy Using Lexical Chains
- Detection Using Two Classifiers
- Conclusions
A Document Representation Strategy Using Lexical Chains
- The cohesive structure of a document can be explored and represented by lexical chains.
- For example, a typical chain for the topic "airplanes" might consist of the following words: plane, airplane, pilot, cockpit, air hostess, wing, engine.
- Chain words address two linguistic problems associated with traditional syntactic representations: synonymy and polysemy.
A Document Representation Strategy Using Lexical Chains (cont.)
- Terms are added to the most recently updated chain; this prompts the correct disambiguation of a word based on the context in which it is used.
- Proper nouns are the second element of the combined document representation.
Detection Using Two Classifiers
- 1. Convert the current document into a weighted chain-word vector and a weighted proper-noun vector.
- 2. The first document becomes the first cluster.
- 3. Subsequent incoming documents are compared with previously created clusters.
- 4. Find the most similar cluster and check whether the document satisfies the similarity condition. If not, the document is declared to discuss a new event and forms the seed of a new cluster.
- 5. This process continues until all documents have been classified (a sketch of this loop follows).
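A minimal Python sketch of this single-pass loop, assuming cosine similarity over sparse vectors and a fixed weighting `alpha` between the two classifiers (the threshold and `alpha` values are illustrative assumptions, not values from the paper):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (Counters)."""
    dot = sum(w * v[t] for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def detect_new_events(docs, threshold=0.2, alpha=0.5):
    """Single-pass detection over (chain_vector, noun_vector) pairs.
    Returns one boolean per document: True if it seeds a new cluster."""
    clusters = []                      # seed vectors of each cluster
    flags = []
    for chains, nouns in docs:
        best = max((alpha * cosine(chains, c) + (1 - alpha) * cosine(nouns, n)
                    for c, n in clusters), default=0.0)
        if best < threshold:           # similarity condition not met:
            clusters.append((chains, nouns))   # new event, new cluster seed
            flags.append(True)
        else:
            flags.append(False)
    return flags
```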
Conclusions
- The results show that an increase in system effectiveness is achieved when lexical chain (semantic) representations are used in conjunction with proper noun (syntactic) representations.
- A miss occurs when the system fails to detect a new event.
- False alarms occur when the system indicates a story contains a new event when it does not.
III. First Story Detection In TDT Is Hard [3]
Two Tasks of TDT
- Topic Tracking
- In tracking, the system is given a small number, N_t, of stories that are known to be on the same event-based news topic. The system then monitors the stream of subsequent news stories for ones that are on the same topic.
Two Tasks of TDT
- First Story Detection (FSD)
- FSD also monitors a stream of arriving news stories. However, the task is to mark each story as "first" or "not first" to indicate whether or not it is the first one discussing a news topic.
- The system provides a score for each story, where a high score indicates confidence that the story is first.
Tracking
- The TDT tracking task is fundamentally similar to IR's information filtering task.
- Each begins with a representation of a topic and then monitors a stream of arriving documents, making decisions about documents as they arrive (without a deferral period).
Topic
- Filtering is subject-based
- Tracking is event-based
- No user feedback after tracking begins
TDT-2 Corpus
- Approximately 60,000 news stories from January through June of 1998
- First four months of data for parameter tuning for tracking
- Final two months for evaluation
Tracking System
- A vector model for representing stories
- A vector centroid for representing topics
- Incoming stories are compared to the topic centroid
  - On-topic
  - Off-topic
FSD System
- Same as the tracking system
- Incoming stories are compared to every story that has appeared in the past
- If the new story exceeds a threshold with any one of the past stories, it is considered old; otherwise it is considered new (see the sketch below)
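A minimal sketch of this decision rule, assuming some pairwise similarity function (for example, the cosine sketched earlier) and an illustrative threshold:

```python
def first_story_detection(stories, sim, threshold=0.3):
    """Compare each incoming story to every past story; if the best match
    exceeds the threshold the story is old, otherwise it is new.
    Returns (fsd_score, is_first) per story; a high score means the
    system is confident the story is first."""
    past, results = [], []
    for story in stories:
        best = max((sim(story, old) for old in past), default=0.0)
        results.append((1.0 - best, best < threshold))
        past.append(story)
    return results
```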
TDT Assumptions and Relevance
- Stories are on a single topic
- Multiple topics are not judged
- IR query
- TDT topic
Evaluation Measures (Effectiveness)
Evaluation Measures
- Recall, Precision
- Miss, False alarm
- Richness (definitions are sketched below)
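In terms of a per-topic contingency table with $A$ on-topic stories flagged, $B$ off-topic stories flagged, $C$ on-topic stories not flagged, and $D$ off-topic stories not flagged, the measures can be written as follows (taking richness to be the fraction of on-topic stories in the stream, which is an assumption about the slide's usage):

```latex
\begin{align*}
\mathrm{Recall}    &= \frac{A}{A+C} &
\mathrm{Precision} &= \frac{A}{A+B} \\
P_{\mathrm{miss}}  &= \frac{C}{A+C} = 1-\mathrm{Recall} &
P_{\mathrm{fa}}    &= \frac{B}{B+D} \\
\mathrm{Richness}  &= \frac{A+C}{A+B+C+D}
\end{align*}
```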
Bounds on FSD
- One possible solution to FSD is to apply tracking technology
- Intuitively, the system marks the first story of the corpus with a very high score; if the second story tracks, it is assigned a low FSD score
Relating tracking and FSD
- The probability that we miss the first story for topic i (see the reconstruction below)
Relating tracking and FSD
- The topic-weighted average value
Relating tracking and FSD (assume that topic error rates are independent)
- The lower bound on the probability of an FSD false alarm for topic i
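A plausible reconstruction of the quantities these slides refer to, following the reduction described in [3]; $P_{\mathrm{miss}}(j)$ and $P_{\mathrm{fa}}(j)$ denote the tracking miss and false-alarm rates for topic $j$, assumed independent across topics:

```latex
% The first story of topic i is missed by FSD if it falsely tracks
% at least one of the i-1 previously seen topics:
P^{\mathrm{fsd}}_{\mathrm{miss}}(i) = 1 - \prod_{j<i}\bigl(1 - P_{\mathrm{fa}}(j)\bigr)

% Topic-weighted average over T topics:
\bar{P}^{\mathrm{fsd}}_{\mathrm{miss}} = \frac{1}{T}\sum_{i=1}^{T} P^{\mathrm{fsd}}_{\mathrm{miss}}(i)

% A later story on topic i is an FSD false alarm only when its own
% tracker misses it and no other tracker fires, which bounds the rate:
P^{\mathrm{fsd}}_{\mathrm{fa}}(i) \ge P_{\mathrm{miss}}(i)\prod_{j\ne i}\bigl(1 - P_{\mathrm{fa}}(j)\bigr)
```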
Expected FSD Performance (Figure 2)
Expected FSD Performance (Figure 3)
Difficulty of improving FSD (Figure 4)
Complexity Analysis
- It is possible to reduce the TDT FSD problem to the TDT tracking problem
- If the tracking problem is NP-complete, then FSD is hard as well
- Knowing about such relationships may help avoid redundant research or unnecessary investigative dead-ends
Project Implementation: First Story Detection Based on Tracking (Part III)
- Corpus: A.txt, M.txt
- First Story Detection based on Tracking
- Evaluation Measures: Recall, Precision, Miss, False alarm, Richness (a small scoring helper is sketched below)
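A minimal Python scoring helper under the same contingency-table definitions (richness taken as the fraction of first stories in the stream, an assumption):

```python
def fsd_measures(predicted, gold):
    """predicted[i] / gold[i] are True iff story i is (marked as) first."""
    hits    = sum(p and g         for p, g in zip(predicted, gold))
    fas     = sum(p and not g     for p, g in zip(predicted, gold))
    misses  = sum(g and not p     for p, g in zip(predicted, gold))
    rejects = sum(not p and not g for p, g in zip(predicted, gold))
    safe = lambda a, b: a / b if b else 0.0
    return {
        "recall":      safe(hits, hits + misses),
        "precision":   safe(hits, hits + fas),
        "miss":        safe(misses, hits + misses),
        "false_alarm": safe(fas, fas + rejects),
        "richness":    safe(hits + misses, len(gold)),
    }
```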
IV. Topic Detection System in Broadcast News [4]
INTRODUCTION
- Concerned with unsupervised grouping of news according to topic
- Creates story groupings through clustering
- A grouping involves the stories on the same topic
- Incremental k-means algorithm
- Probabilistic similarity metric
- Selection and thresholding metrics
- Experimental results
Incremental k-means algorithm
- Find the closest cluster and decide whether to merge
- Iterate through the stories and make changes during each iteration
- Corrects poor initial clusters
- Computational requirement is less imposing (a sketch of the loop follows)
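A minimal sketch of the incremental loop, assuming a generic similarity function and threshold in place of the paper's probabilistic metric (both illustrative):

```python
from collections import Counter

def centroid(vectors):
    """Mean of sparse term-weight vectors (Counters)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: w / len(vectors) for t, w in total.items()})

def incremental_kmeans(stories, sim, threshold, n_iter=5):
    """Each story joins the most similar existing cluster if it is similar
    enough, else it seeds a new one; repeated passes over the stories
    allow poor early assignments to be corrected."""
    assign = {}                  # story index -> cluster id
    clusters = []                # cluster id -> list of member indices
    for _ in range(n_iter):
        for i, story in enumerate(stories):
            if i in assign:                      # detach before reassigning
                clusters[assign[i]].remove(i)
            best_id, best_score = None, threshold
            for cid, members in enumerate(clusters):
                if not members:
                    continue
                score = sim(story, centroid([stories[j] for j in members]))
                if score >= best_score:
                    best_id, best_score = cid, score
            if best_id is None:                  # nothing close enough
                clusters.append([i])
                best_id = len(clusters) - 1
            else:
                clusters[best_id].append(i)
            assign[i] = best_id
    return [m for m in clusters if m]
```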
Probabilistic Similarity Metric
- Utilizes the BBN topic spotting metric
- Calculates $P(C\mid S)$ for topic detection
- Derived from Bayes' rule
- Assuming that the story words are conditionally independent, we get
  $P(C\mid S) = \frac{p(C)\,\prod_n p(s_n\mid C)}{p(S)}$
- where $p(s_n\mid C)$ is the probability that a word in a story on the topic represented by cluster C would be $s_n$,
- and $p(C)$ is the a priori probability that any new story will be relevant to cluster C.
Two-state model for a topic
- The BBN topic spotting metric uses a two-state model for $p(s_n\mid C)$
- One state is a distribution over the words in all of the stories in the group
- The other is a distribution from the whole corpus (see the sketch below)
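A plausible form of the two-state mixture; the mixing weight $\alpha$ and the exact conditioning are assumptions, since the slide only names the two distributions:

```latex
p(s_n \mid C) \;=\; \alpha\, p(s_n \mid \text{cluster } C) \;+\; (1-\alpha)\, p(s_n \mid \text{corpus})
```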
Clustering Metric
- 1. Selection Metric
  - Takes a story and outputs cluster scores
  - The BBN topic spotting metric finds the most topical cluster for a story
  - The selection metric D(S,C) could be chosen accordingly (a plausible form is sketched below)
  - where the $s_m$ are the story words, $p(s_m\mid C)$ is computed according to the above model, and D(S,C) is a justifiable metric for doing cluster selection
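One justifiable choice consistent with the slide's description is a per-word log-likelihood-ratio score against the corpus background; this specific form is an assumption, not taken verbatim from [4]:

```latex
D(S, C) \;=\; \sum_{m} \log \frac{p(s_m \mid C)}{p(s_m \mid \text{corpus})}
```

The cluster selected for story $S$ would then be $\arg\max_C D(S,C)$.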
Experiment Evidence
- A data set of clusters was extracted from each of TDT-1, 2, and 3
- Each cluster contains stories on one topic
- The misclassification rates for each data set are given in the table above
- The probabilistic metric is a candidate for the selection problem
- 2. Thresholding Metric
  - Determines whether or not a story belongs in a cluster
  - Combines scores and features from the system
  - Score normalization (plausible forms are sketched below):
    - Cosine distance metrics are naturally normalized
    - Length-normed Tspot: simply divides the log probability by the story length
    - Mean/sd-normed Tspot: depends very little on the length of the story
    - The normalized score is also reasonable
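Plausible forms of the two Tspot normalizations named above, writing $T(S,C)$ for the raw log-probability score; the exact statistics used for $\mu$ and $\sigma$ are assumptions:

```latex
T_{\text{len}}(S,C) = \frac{T(S,C)}{|S|}
\qquad
T_{\text{norm}}(S,C) = \frac{T(S,C) - \mu}{\sigma}
```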
Corpus and Evaluation
- The Linguistic Data Consortium (LDC) released a corpus referred to as the TDT-2 corpus
- It consists of 60,000 stories, subdivided into three two-month sets, from both newswire and audio sources
- An annotator determines which of the predefined topics are relevant to each story
- Judgments are YES, BRIEF, or NO
- Official evaluation metric: a weighted cost function
Weighted Cost Function
- Topic-weighted score
  - Counts each topic's contribution to the total cost equally
- Story-weighted score
  - Counts each story's contribution to the total cost equally
- C_D is the final detection cost,
- P_M and C_M are the probability and cost of a miss,
- P_FA and C_FA are the probability and cost of a false alarm,
- P_T is the a priori probability of a topic
- Note: the official evaluation is based on the topic-weighted score (the cost function is written out below)
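Written out with the terms defined above, this is the standard TDT detection cost:

```latex
C_D \;=\; C_M \cdot P_M \cdot P_T \;+\; C_{FA} \cdot P_{FA} \cdot (1 - P_T)
```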
Effect of Transcription Method
- ASR (audio source) transcripts tend to have a high error rate of about 23%, but are relatively consistent
- CCAP (closed-captioned data) transcripts have a smaller error rate, but are inconsistent
- NWT (newswire stories) transcripts have the lowest error rate
Different Normalization Schemes
- Combination 1 depends on both cosine distance and length-normalized Tspot
- Combination 2 depends on cosine distance and mean-normalized Tspot
Differences Between Data Sets
- The Jan-Feb data has a few very broad topics and a few focused ones (this affects system performance)
- The Mar-Apr data has roughly 1/8 the number of labeled stories of the Jan-Feb data set
- The May-Jun set contains roughly 3 times the number of labeled stories of Mar-Apr
- Table 3: results showing the correlation of C_D with average topic size (using CCAP+NWT data)
Subset Experiments
- The effect of multi-topic stories that contain non-annotated topics
- The data set used is Mar-Apr CCAP+NWT
- Create a data subset that contains only the stories that were annotated YES for one topic
Conclusion
- Cluster news stories according to topic
- Use the incremental k-means clustering algorithm to group the stories
- The clustering algorithm requires two types of clustering metrics: selection and thresholding
- The system uses the BBN metric for the selection metric
- The system uses a hybrid of the BBN metric with a conventional cosine distance metric for thresholding
References
- [1] http://www.nist.gov/speech/tests/tdt/
- [2] Nicola Stokes and Joe Carthy, "Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection," Department of Computer Science, University College Dublin.
- [3] James Allan, Victor Lavrenko, and Hubert Jin, "First Story Detection In TDT Is Hard," http://ciir.cs.umass.edu/pubfiles/ir-206.pdf
- [4] Frederick Walls, Hubert Jin, Sreenivasa Sista, and Richard Schwartz, "Topic Detection in Broadcast News," http://www.nist.gov/speech/publications/darpa99/pdf/tdt320.pdf
Thank you!