Title: Automatic FMD News Monitoring and Classification
1Automatic FMD News Monitoring and Classification
- Hsinchun Chen Yulei Zhang Cathy Larson
- The BioPortal Team
- Artificial Intelligence Lab
- University of Arizona
- Acknowledgements: NSF ITR Program, DHS/UADA, UC
Davis FMD Lab
2Introduction
- Global FMD impact: UK, 30B, 2002.
- The FMD Lab manually collects FMD news on the Web
to monitor the status of FMD around the world.
3An Example of Outbreak News from Cattlenetwork
4An Example of Outbreak Report from OIE
5Data Sources for Syndromic Surveillance (contd)
- Different Types of Data Sources for Syndromic
Surveillance (Yan et al., 2008)
In this study, we focus on public sources for FMD
surveillance: news reports and bulletin
notifications.
6News-based Surveillance Systems (contd)
- DIB/ARGUS monitors news sources and identifies
social disruption indicators using a Bayesian
model.
- ProMED-mail is a manual system. Reports are often
contributed by ProMED-mail subscribers and
reviewed by a team of experts.
- MiTAP monitors many diseases and categorizes
information at the disease level. It does not
focus on FMD, and it does not classify news into
categories.
- In this study, we aim to build an automatic
news-based surveillance and classification system
for FMD.
7Research Questions
- This study aims to build an automatic FMD news
monitoring and classification system. We examine
different machine learning techniques and feature
sets for on-line FMD news classification and pose
the following research questions.
- Q1. How can we monitor FMD news websites and
bulletin notifications?
- Q2. Can the combination of Bag of Words, Noun
Phrases, and Named Entities features outperform
the Bag of Words features alone in FMD news
classification?
- Q3. Can feature selection improve the performance
of FMD news classification?
- Q4. Which machine learning method performs best
for FMD news classification?
8Framework of On-line News Monitoring and
Classification
- Data Collection
  - Web Spidering
    - Use spidering algorithms to gather web pages
      from on-line news sources.
  - News Filtering
    - Filter out unrelated news.
- Document Representation and Feature Selection
  - Document Representation
    - Transform documents from full text to
      document vectors.
  - Word Filtering
    - Filter out noisy words from the document
      vectors.
  - Stemming and Stopwording
    - Stem words and remove stop words.
  - Feature Evaluation and Generation
    - Search the feature space and evaluate
      different features.
  - Feature Selection
    - Select the features with high evaluation
      scores.
- Classification and Evaluation
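The document representation steps above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the stop-word list is a tiny placeholder, `naive_stem` is a crude suffix stripper standing in for a real stemmer, and all function names are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "is", "are", "was", "were"}

# Assumption: crude suffix stripping stands in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, remove stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def to_vector(tokens):
    """Turn a token list into a sparse term-frequency vector (term -> count)."""
    vector = {}
    for token in tokens:
        vector[token] = vector.get(token, 0) + 1
    return vector
```

For example, `to_vector(preprocess(article_text))` would yield one sparse document vector per news article, ready for feature evaluation.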
9System Design
10Data Collection
Important FMD News On-line Sources Identified by
the FMD Lab (UC Davis)
We have developed spider programs to monitor
these important data sources.
11Manual Collection of FMD News (contd)
- Top 10 Websites in the Manual Collection
12News Spidering
- Of the 31 important on-line FMD news sources
identified by the FMD Lab (UC Davis), we monitor
29 (Yahoo and Google are not included).
- We have gathered more than 180,000 news files
from the 29 on-line sources.
- From the websites identified in the manual FMD
news collection, we monitor the top 101 websites
(each has more than 5 news items in the
collection).
- We have gathered more than 650,000 news files
from the 101 websites.
- The 29 on-line FMD news sources are among the 101
websites we identified.
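A spider of the kind described here might look roughly like the sketch below. It is an assumption-laden illustration rather than the spider programs actually used: it does a breadth-first crawl limited to one host, and page retrieval is delegated to a caller-supplied `fetch` function (a hypothetical parameter) so the crawl logic stays independent of any HTTP library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the seed's host.

    `fetch` maps a URL to its HTML text, or None if the page cannot be
    retrieved. Returns a dict of URL -> HTML for every page gathered.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop fragments before deduplicating.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Running one such crawl per monitored source, on a schedule, would approximate the monitoring described above; politeness delays and robots.txt handling are omitted from this sketch.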
13News Filtering
- After collecting news from FMD-related news
websites with the spidering programs, we need to
filter out news unrelated to FMD.
- We use a keyword-based filtering method. All news
items that do not contain FMD keywords are
filtered out as unrelated news.
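Keyword-based filtering can be sketched as below. The keyword list is an assumption for illustration (the keywords actually used are not listed here), and the function names are hypothetical.

```python
import re

# Assumption: an illustrative keyword list, not the one used in the study.
FMD_KEYWORDS = ("foot and mouth disease", "foot-and-mouth", "fmd")


def build_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def filter_news(items, keywords=FMD_KEYWORDS):
    """Keep only the news items whose text contains at least one keyword."""
    pattern = build_pattern(keywords)
    return [text for text in items if pattern.search(text)]
```

Whole-word matching is a design choice in this sketch: it avoids keeping an item merely because "fmd" appears inside a longer token.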
14Document Representation and Feature Selection
- In this study, we use three document
representation methods: Bag of Words, Noun
Phrases, and Named Entities.
- First, we build two baseline feature subsets
(BFS). One is composed of Bag of Words features
(FeatureBFS-BW), and the other is composed of the
combination of Bag of Words, Noun Phrases, and
Named Entities features (FeatureBFS-Comb). Only
features that appear more than 5 times in the
whole collection are taken into account.
- Then we build another two feature subsets,
FeatureSFS-BW and FeatureSFS-Comb, by conducting
feature selection on FeatureBFS-BW and
FeatureBFS-Comb using Correlation-based Feature
Selection and Best First Search.
- We plan to compare the performance of these four
feature subsets.
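The frequency threshold for the baseline feature subsets could be applied as in this sketch. It assumes "more than 5 times" refers to total occurrences across the collection (document frequency is another possible reading), and the function names are hypothetical. The same code applies whether the tokens are words, noun phrases, or named entities.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, min_count=5):
    """Return the sorted features whose total count in the collection exceeds min_count."""
    totals = Counter()
    for tokens in tokenized_docs:
        totals.update(tokens)
    return sorted(term for term, count in totals.items() if count > min_count)


def vectorize(tokens, vocabulary):
    """Map one document to a count vector aligned with the vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```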
15CFS (Correlation-based Feature Selection)
- The advantage of CFS (Hall, 1998; Hall, 2000) is
that it evaluates groups of attributes together
rather than individual attributes (Hall and
Holmes, 2003).
- CFS uses a subset evaluation heuristic that
assigns high scores to subsets containing
attributes that are highly correlated with the
class and have low intercorrelation with each
other.
- Hall and Holmes (2003) compared six feature
selection methods and found that CFS performed
consistently well with both Naïve Bayes and C4.5
Decision Trees. They suggested that CFS is a good
overall performer. They also found that CFS chose
fewer features and thus ran faster than the
others.
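The subset heuristic can be illustrated with the CFS merit score, k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The sketch below makes two simplifying assumptions that differ from the setup described above: it uses absolute Pearson correlation as the correlation measure, and it uses plain greedy forward selection in place of Best First Search. Function names are hypothetical.

```python
import math
from itertools import combinations


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov / math.sqrt(var_x * var_y))


def cfs_merit(columns, labels, subset):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff) for a feature subset."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs_pearson(columns[f], labels) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs_pearson(columns[a], columns[b]) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, labels):
    """Forward selection: add the feature that most improves merit until none does."""
    selected, best = [], 0.0
    remaining = set(columns)
    while remaining:
        # sorted() makes tie-breaking deterministic.
        scored = [(cfs_merit(columns, labels, selected + [f]), f) for f in sorted(remaining)]
        merit, feature = max(scored)
        if merit <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = merit
    return selected, best
```

Here `columns` maps each feature name to its list of values across documents. A feature that duplicates an already selected one adds nothing to the merit, which is why CFS tends to return small subsets.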
16Three Categories of FMD News
- The differences among some of the seven
categories of FMD news are subtle.
- Through discussion, the FMD Lab has confirmed
that it is reasonable to combine these seven
categories into the following 3 categories.
- Category 1: FMD outbreak related
  - FMD outbreak in some location
  - FMD suspected in some location
  - FMD outbreak confirmed
  - FMD follow-up report
- Category 2: FMD control program related
  - FMD control program in some location
- Category 3: FMD social and economic consequences
and general information
  - FMD social and economic consequences
  - FMD general information
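The seven-to-three mapping listed above can be written down as a small lookup table. The dictionary and function names are hypothetical; the category labels are taken from the list.

```python
# The seven expert-assigned categories mapped to the three combined categories.
SEVEN_TO_THREE = {
    "FMD outbreak in some location": 1,
    "FMD suspected in some location": 1,
    "FMD outbreak confirmed": 1,
    "FMD follow-up report": 1,
    "FMD control program in some location": 2,
    "FMD social and economic consequences": 3,
    "FMD general information": 3,
}

CATEGORY_NAMES = {
    1: "FMD outbreak related",
    2: "FMD control program related",
    3: "FMD social and economic consequences and general information",
}


def map_category(label):
    """Map one of the seven labels to category 1, 2, or 3 (case-insensitive)."""
    normalized = {name.lower(): cat for name, cat in SEVEN_TO_THREE.items()}
    return normalized[label.strip().lower()]
```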
17Classification and Evaluation
- Classification
  - In this study, we choose four widely used
    machine learning algorithms: K-Nearest
    Neighbour, Learned Bayes Net, Naïve Bayesian,
    and Support Vector Machine.
  - Support Vector Machine was introduced to text
    classification by Joachims (1998). Many studies
    have shown that it achieves top-notch
    performance in text classification (Fabrizio,
    2002).
  - However, Support Vector Machine does not always
    perform best. In this study, we plan to compare
    SVM with the other three baseline algorithms.
- Evaluation
  - For each news article, a classifier's
    prediction is compared with the true category
    assigned by domain experts.
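As an illustration of one of the baseline classifiers, here is a minimal K-Nearest Neighbour sketch over sparse document vectors. Cosine similarity and majority voting are assumptions made for this example; the similarity measure and value of k used in the study are not stated. Function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(train, query, k=3):
    """Predict by majority vote among the k training vectors most similar to query.

    `train` is a list of (vector, label) pairs.
    """
    ranked = sorted(train, key=lambda pair: cosine(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The predicted label for each article would then be compared with the expert-assigned category.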
18Evaluation Metrics
- Standard classification performance metrics used
in information retrieval and machine learning
studies.
- Overall correctness
- Correctness for class i
  - Class 1: FMD outbreak related
  - Class 2: FMD control program related
  - Class 3: FMD social and economic consequences
    and general information
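The metric formulas themselves are not preserved in this transcript. The sketch below assumes the standard definitions: overall correctness as accuracy, and per-class precision, recall, and F-measure (the measures reported later in the results). Function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Overall correctness: the fraction of items whose predicted class is right."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F-measure for one class; 0.0 where undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```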
19Testbed for Classification
- The FMD Lab (UC Davis) collected 2832 news items
about FMD, dated from 10/06/2004 to 01/20/2007.
- 1158 of these news items have not been
classified.
- 1674 news items have been classified into the
seven categories by domain experts.
- We map the seven categories of the 1674 news
items into three categories. Then we use these
1674 news items to train the classifier.
- We use 10-fold cross validation as the evaluation
approach.
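A 10-fold cross validation loop can be sketched as follows. The shuffling seed, the unstratified split, and the `train_fn`/`predict_fn` callables are assumptions for illustration; the tooling actually used is not described here.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(items, labels, train_fn, predict_fn, k=10, seed=0):
    """Return the accuracy on each held-out fold (requires len(items) >= k).

    `train_fn(items, labels)` returns a model and `predict_fn(model, item)`
    returns a label; both are supplied by the caller.
    """
    folds = k_fold_indices(len(items), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(items)) if i not in held]
        model = train_fn([items[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict_fn(model, items[i]) == labels[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores
```

With 1674 items and k = 10, each fold holds 167 or 168 items, so every item is used for testing exactly once.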
20Hypotheses
- H1: The expanded representation of Bag of Words,
Noun Phrases, and Named Entities features
outperforms the Bag of Words features in FMD news
classification.
  - H1a: FeatureBFS-Comb > FeatureBFS-BW
  - H1b: FeatureSFS-Comb > FeatureSFS-BW
- H2: The selected feature subsets outperform the
baseline feature subsets in FMD news
classification.
  - H2a: FeatureSFS-BW > FeatureBFS-BW
  - H2b: FeatureSFS-Comb > FeatureBFS-Comb
21Hypotheses (contd)
- H3: Support Vector Machine (SVM) outperforms
K-Nearest Neighbour (KNN), Learned Bayesian Net
(LBN), and Naïve Bayesian (NB) for selected
feature subsets in FMD news classification.
  - H3a: SVM > KNN on FeatureSFS-BW
  - H3b: SVM > KNN on FeatureSFS-Comb
  - H3c: SVM > LBN on FeatureSFS-BW
  - H3d: SVM > LBN on FeatureSFS-Comb
  - H3e: SVM > NB on FeatureSFS-BW
  - H3f: SVM > NB on FeatureSFS-Comb
22Results - Overall Correctness (Accuracy)
23Hypothesis Testing H1 Representations
Results of pairwise t-tests, with significance
reported at alpha = 0.10, 0.05, and 0.01.
Underlined values mean that the results do not
confirm the hypotheses.
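A pairwise t-test of this kind compares two configurations on matched samples. The sketch below computes only the paired t statistic and its degrees of freedom; it assumes the paired samples are per-fold scores from the same cross validation folds, which is not confirmed here. The function name is hypothetical.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if variance == 0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(variance / n), n - 1
```

The resulting t value would then be compared against the critical value of the t distribution for the chosen alpha and degrees of freedom.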
24Hypothesis Testing H2 Feature Selection
Results of pairwise t-tests, with significance
reported at alpha = 0.10, 0.05, and 0.01.
Underlined values mean that the results do not
confirm the hypotheses.
25Hypothesis Testing H3 Classifiers
Results of pairwise t-tests, with significance
reported at alpha = 0.10, 0.05, and 0.01.
Underlined values mean that the results do not
confirm the hypotheses.
26Average Precision, Recall and F-measure
- SVM on FeatureSFS-Comb achieves the highest
average precision, recall and F-measure.
27Conclusions
- In this study, we have described a general
framework for building a domain-specific news
monitoring and classification system. Our
experimental study is based on FMD news; however,
the framework can also be adopted in other
domains.
- We have compared the classification performance
of different machine learning algorithms using
different feature sets. We find that SVM achieves
the highest performance on FMD news when using
the combined features after feature selection.
- We believe the automatic FMD news monitoring and
classification system will be helpful for FMD
surveillance around the world.