Automatic FMD News Monitoring and Classification - PowerPoint PPT Presentation

About This Presentation

Title:

Automatic FMD News Monitoring and Classification

Slides: 28

Provided by: syndr

Transcript and Presenter's Notes

Title: Automatic FMD News Monitoring and Classification

1
Automatic FMD News Monitoring and Classification

Hsinchun Chen, Yulei Zhang, Cathy Larson
The BioPortal Team
Artificial Intelligence Lab
University of Arizona
Acknowledgements: NSF ITR Program, DHS/UADA, UC Davis FMD Lab

2
Introduction

Global FMD impact: UK, 30B (2002).
The FMD Lab manually collects FMD news on the Web
to monitor the status of FMD around the world.

3
An Example of Outbreak News from Cattlenetwork

4
An Example of Outbreak Report from OIE
5
Data Sources for Syndromic Surveillance (contd)

Different Types of Data Sources for Syndromic
Surveillance (Yan et al., 2008)

In this study, we focus on public sources for FMD
surveillance: news reports and bulletin
notifications.
6
News-based Surveillance Systems (contd)

DIB/ARGUS monitors news sources and identifies
social disruption indicators using a Bayesian
model.
ProMED-mail is a manual system. Reports are often
contributed by ProMED-mail subscribers and
reviewed by a team of experts.
MiTAP monitors many diseases and categorizes
information at the disease level. It does not
focus on FMD and it does not classify news into
categories.
In this study, we aim to build an automatic
news-based surveillance and classification system
for FMD.

7
Research Questions

This study aims to build an automatic FMD news
monitoring and classification system. We examine
different machine learning techniques and
different feature sets for on-line FMD news
classification and pose the following research
questions.
Q1. How can we monitor FMD news websites and
bulletin notifications?
Q2. Can the combination of Bag of Words, Noun
Phrases, and Named Entities features outperform
Bag of Words features alone in FMD news
classification?
Q3. Can feature selection improve the performance
of FMD news classification?
Q4. Which machine learning method performs best
for FMD news classification?

8
Framework of On-line News Monitoring and
Classification

Data Collection
  Web Spidering: Use spidering algorithms to gather web pages from
  on-line news sources.
  News Filtering: Filter out unrelated news.
Document Representation and Feature Selection
  Document Representation: Transform documents from full text to
  document vectors.
  Word Filtering: Filter out noisy words from the document vectors.
  Stemming and Stopwording: Stem words and remove stop words.
  Feature Evaluation and Generation: Search the feature space and
  evaluate different features.
  Feature Selection: Select the features with high evaluation scores.
Classification and Evaluation

9
System Design
10
Data Collection
Important FMD News On-line Sources Identified by
the FMD Lab (UC Davis)
We have developed spider programs to monitor
these important data sources.
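
The slides do not describe how the spider programs work. As a minimal sketch of one building block, assuming a spider that follows links within a single news site, the link-extraction step could look like this in Python. The class and function names are illustrative, not taken from the authors' system, and fetching pages over the network is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_same_site_links(html, base_url):
    """Return de-duplicated links that stay on the same host as base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for link in parser.links:
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

A real spider would add page fetching, politeness delays, and a record of already-visited URLs around this step.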
11
Manual Collection of FMD News (contd)

Top 10 Websites in the Manual Collection

12
News Spidering

For the 31 important FMD news on-line sources
identified by the FMD Lab (UC Davis), we monitor
29 of them (Yahoo and Google not included).
We have gathered more than 180,000 news files
from the 29 on-line sources.
For the websites we identified from the FMD news
manual collection, we monitor the top 101
websites (each with more than 5 news items in the
collection).
We have gathered more than 650,000 news files
from the 101 websites.
The 29 FMD News on-line sources are among the 101
websites we identified.

13
News Filtering

After collecting news from FMD-related news
websites with the spidering programs, we need to
filter out the news unrelated to FMD.
We use a keyword-based filtering method. All news
items that do not contain FMD keywords are
filtered out as unrelated news.
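
A keyword filter of this kind can be sketched in a few lines of Python. The keyword list below is a hypothetical placeholder, since the slides do not give the actual FMD keywords, and the function names are illustrative.

```python
import re

# Hypothetical keyword list; the slides do not state the real one.
FMD_KEYWORDS = ("foot-and-mouth", "foot and mouth", "fmd")

# One case-insensitive pattern that matches any keyword as a whole word.
_PATTERN = re.compile(
    "|".join(r"\b" + re.escape(k) + r"\b" for k in FMD_KEYWORDS),
    re.IGNORECASE,
)


def is_fmd_related(text):
    """True if the text contains at least one FMD keyword."""
    return _PATTERN.search(text) is not None


def filter_news(items):
    """Keep only the news items that mention an FMD keyword."""
    return [text for text in items if is_fmd_related(text)]
```

Matching on word boundaries avoids hits inside longer tokens, at the cost of missing spelling variants that are not in the list.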

14
Document Representation and Feature Selection

In this study, we use three document
representation methods: Bag of Words, Noun
Phrases and Named Entities.
First, we build two baseline feature subsets
(BFS). One is composed of Bag of Words features
(FeatureBFS-BW), and the other is composed of the
combination of Bag of Words, Noun Phrases and
Named Entities features (FeatureBFS-Comb). Only
features that appear more than 5 times in the
whole collection are included.
Then we build two selected feature subsets (SFS),
FeatureSFS-BW and FeatureSFS-Comb, by conducting
feature selection on FeatureBFS-BW and
FeatureBFS-Comb using Correlation-based Feature
Selection with Best First Search.
We plan to compare the performance of these four
feature subsets.
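
As a rough illustration of the Bag of Words baseline only, the following Python sketch builds a vocabulary from features that occur more than 5 times in the collection and turns a document into a count vector. The tokenizer is a simplifying assumption; stemming, stop-word removal, noun phrase extraction and named entity extraction are not shown.

```python
import re
from collections import Counter

MIN_COUNT = 5  # keep features that appear more than 5 times, as on the slide


def tokenize(text):
    """Assumed tokenizer: lowercase alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents, min_count=MIN_COUNT):
    """Return the sorted list of tokens whose collection count exceeds min_count."""
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    return sorted(tok for tok, count in totals.items() if count > min_count)


def to_vector(document, vocabulary):
    """Represent one document as token counts over the vocabulary."""
    counts = Counter(tokenize(document))
    return [counts[tok] for tok in vocabulary]
```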

15
CFS (Correlation-based Feature Selection)

The advantage of CFS (Hall, 1998; Hall, 2000) is
that it evaluates groups of attributes together
rather than individual attributes (Hall and
Holmes, 2003).
CFS uses a subset evaluation heuristic which
assigns high scores to subsets containing
attributes that are highly correlated with the
class and have low intercorrelation with each
other.
Hall and Holmes (2003) compared six feature
selection methods and found that CFS performed
well consistently on both Naïve Bayes and C4.5
Decision Trees. They suggested that CFS is a good
overall performer. They also found that CFS chose
fewer features and thus performed faster than the
others.
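
For reference, the subset evaluation heuristic in Hall's CFS is usually written as follows, where S is a subset of k features, the first average is the mean feature-class correlation, and the second is the mean feature-feature intercorrelation. The slides do not reproduce the formula, so the notation here follows Hall's published formulation rather than the presentation.

```latex
\mathrm{Merit}_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}}
```

The numerator rewards subsets whose features predict the class well, while the denominator penalizes redundancy among the features.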

16
Three Categories of FMD News

The differences among some of the seven
categories of FMD news are subtle.
Through discussion, the FMD Lab has confirmed
that it is reasonable to combine these seven
categories into the following 3 categories.
Category 1: FMD outbreak related
  FMD outbreak in some location
  FMD suspected in some location
  FMD outbreak confirmed
  FMD follow-up report
Category 2: FMD control program related
  FMD control program in some location
Category 3: FMD social and economic consequences
and general information
  FMD social and economic consequences
  FMD general information
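
The seven-to-three mapping listed on this slide can be written down as a simple lookup table. This Python sketch uses the category names from the slide as keys; the dictionary and function names are illustrative.

```python
# Mapping from the seven original categories to the three merged ones,
# following the grouping listed on the slide.
SEVEN_TO_THREE = {
    "FMD outbreak in some location": 1,
    "FMD suspected in some location": 1,
    "FMD outbreak confirmed": 1,
    "FMD follow-up report": 1,
    "FMD control program in some location": 2,
    "FMD social and economic consequences": 3,
    "FMD general information": 3,
}


def map_category(label):
    """Return the merged category (1, 2 or 3) for an original label."""
    return SEVEN_TO_THREE[label]
```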

17
Classification and Evaluation

Classification
In this study, we choose four widely used machine
learning algorithms: K-Nearest Neighbour, Learned
Bayes Net, Naïve Bayesian and Support Vector
Machine.
Support Vector Machine was introduced to text
classification by Joachims (1998). Many studies
have shown that it achieves top-notch performance
in text classification (Fabrizio, 2002).
However, Support Vector Machine does not always
perform best. In this study, we plan to compare
SVM with the other three algorithms as baselines.
Evaluation
For each news article, a classifier's prediction
is compared with the real category assigned by
domain experts for evaluation.
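
To make one of the baseline classifiers concrete, here is a minimal multinomial Naïve Bayes sketch in plain Python. It is a stand-in for illustration, assuming tokenized documents and add-one smoothing; the slides do not say which implementation or toolkit the authors used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        """docs is a list of token lists; labels is a parallel list of classes."""
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        """Return the class with the highest log posterior for one document."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for tok in tokens:
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Each prediction from such a classifier would then be compared with the expert-assigned category, as described under Evaluation.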

18
Evaluation Metrics

Standard classification performance metrics used
in information retrieval and machine learning
studies:
Overall correctness
Correctness for class i

Class 1: FMD outbreak related
Class 2: FMD control program related
Class 3: FMD social and economic consequences and general information
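
The formulas for these metrics did not survive in the transcript. The Python sketch below uses the standard definitions and assumes that overall correctness means accuracy and that per-class correctness is reported as precision, recall and F-measure; that reading is an assumption, not something the slide confirms.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted class equals the expert label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_scores(y_true, y_pred, cls):
    """Return (precision, recall, F-measure) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```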
19
Testbed for Classification

There are 2832 news items about FMD collected by
the FMD Lab (UC Davis). The news items date from
10/06/2004 to 01/20/2007.
1158 of these FMD news items have not been
classified.
1674 FMD news items have been classified into the
seven categories by domain experts.
We map the seven categories of the 1674 news
items into three categories. Then we use these
1674 news items to train and evaluate the
classifiers.
We use 10-fold cross validation as the evaluation
approach.
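
As a sketch of the evaluation protocol, 10-fold cross validation over the 1674 labelled items could be set up as follows in Python. The shuffling seed and the unstratified split are assumptions; the slides do not say how the folds were drawn.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

Each item lands in exactly one test fold, so every labelled news item is used for testing once and for training in the other nine rounds.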

20
Hypotheses

H1: The expanded representation of Bag of Words,
Noun Phrases and Named Entities features
outperforms the Bag of Words features in FMD news
classification.
H1a: FeatureBFS-Comb > FeatureBFS-BW
H1b: FeatureSFS-Comb > FeatureSFS-BW
H2: The selected feature subsets outperform the
baseline feature subsets in FMD news
classification.
H2a: FeatureSFS-BW > FeatureBFS-BW
H2b: FeatureSFS-Comb > FeatureBFS-Comb

21
Hypotheses (contd)

H3: Support Vector Machine (SVM) outperforms
K-Nearest Neighbour (KNN), Learned Bayesian Net
(LBN) and Naïve Bayesian (NB) for selected
feature subsets in FMD news classification.
H3a: SVM > KNN on FeatureSFS-BW
H3b: SVM > KNN on FeatureSFS-Comb
H3c: SVM > LBN on FeatureSFS-BW
H3d: SVM > LBN on FeatureSFS-Comb
H3e: SVM > NB on FeatureSFS-BW
H3f: SVM > NB on FeatureSFS-Comb

22
Results - Overall Correctness (Accuracy)
23
Hypothesis Testing H1: Representations
Results of pairwise t-tests, with significance
marked at alpha = 0.10, alpha = 0.05 and
alpha = 0.01. Underlined values mean that the
results do not confirm the hypotheses.
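
The slides do not spell out how the pairwise t-test was computed. A minimal sketch, assuming a paired test over per-fold scores of two configurations evaluated on the same folds, is shown below in Python. The function name is illustrative.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples, with len(scores_a) - 1 degrees of freedom.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

The resulting statistic would be compared against the critical value of the t distribution for the chosen alpha; that lookup is not included here.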
24
Hypothesis Testing H2: Feature Selection
Results of pairwise t-tests, with significance
marked at alpha = 0.10, alpha = 0.05 and
alpha = 0.01. Underlined values mean that the
results do not confirm the hypotheses.
25
Hypothesis Testing H3: Classifiers
Results of pairwise t-tests, with significance
marked at alpha = 0.10, alpha = 0.05 and
alpha = 0.01. Underlined values mean that the
results do not confirm the hypotheses.
26
Average Precision, Recall and F-measure

SVM on FeatureSFS-Comb achieves the highest
average precision, recall and F-measure.

27
Conclusions

In this study, we have described a general
framework for building a domain-specific news
monitoring and classification system. Our
experimental study is based on FMD news. However,
this framework can also be adopted in other
domains.
We have compared the classification performance
of different machine learning algorithms using
different feature sets. We find that SVM achieves
the highest performance on FMD news when using
the combined features after feature selection.
We believe the automatic FMD news monitoring and
classification system will be helpful for FMD
surveillance around the world.